An Evaluation of Standard Retrieval Algorithms and a Binary Neural

Total Page:16

File Type:pdf, Size:1020Kb

An Evaluation of Standard Retrieval Algorithms and a Binary Neural An Evaluation of Standard Retrieval Algorithms and a Binary Neural Approach Jim Austin Victoria J Ho dge Dept of Computer Science Dept of Computer Science University of York UK University of York UK This research was supp orted by an EPSRC studentship Contact Address Victoria J Ho dge Dept of Computer Science University of York Heslington York UK YO DD Tel Fax vickycsyorkacuk Running Title A Neural Retrieval Approach A Neural Retrieval Approach An Evaluation of Standard Retrieval Algorithms and a Binary Neural Approach Abstract In this pap er we evaluate a selection of data retrieval algorithms for storage eciency retrieval sp eed and partial matching capabilities using a large information retrieval dataset We evaluate standard data structures for example inverted le lists and hash tables but also a novel binary neu ral network that incorp orates singleep o ch training sup erimp osed co ding and asso ciative matching in a binary matrix data structure We identify the strengths and weaknesses of the approaches From our evaluation the novel neural network approach is sup erior with resp ect to training sp eed and partial match retrieval time From the results we make recommenda tions for the appropriate usage of the novel neural approach Keywords Information Retrieval Algorithm Binary Neural Network Cor relation Matrix Memory WordDocument Asso ciation Partial Match Storage Eciency Sp eed of Training Sp eed of Retrieval A Neural Retrieval Approach Many computational implementations require algorithms that are storage ecient may b e rapidly trained with data and allow fast retrieval of selected data Information Retrieval IR requires the storage of massive sets of word to do cument asso ciations The underlying principle of most IR systems is to retrieve stored data on the basis of queries supplied by the user The data structure must allow the do cuments matching query terms to b e retrieved using some form of indexing This inevitably requires an algorithm that is ecient for storage allows rapid training of the asso ciations fast retrieval of do cuments matching the query terms and additionally p ermits partial matching M of N query terms where M N and N is the total number of query terms and M is the number of those terms that must match Many metho dologies have b een p osited for storing these asso ciations see for example including inverted le lists hash tables do cument vectors sup erimp osed co ding techniques and Latent Semantic Indexing LSI In this pap er we compare Perfect tech niques ie those that preserve all word to do cument asso ciations as opp osed to Imp erfect metho dologies such as LSI A representation strategy used in many systems are do cument vectors There are various adaptations of the underlying strategy but fundamentally the do c uments are represented by vectors with elements representing an attribute for each word in the corpus The do cument vectors form a matrix representation of the corpus with one row p er do cument vector and one column p er word at tribute For binary do cument vectors used in for example Koller Sahami the weights are Bo olean so if a word is present in a do cument the appropriate bit in the do cument vector is set to The matrix may also b e integerbased as used in Goldszmidt Sahami where w represents the number of times j k w or d is present in document By activating the appropriate columns words j k the do cuments containing those words may b e retrieved from the matrix LSI decomp oses a worddocument matrix to pro duce a metalevel representation of A Neural Retrieval Approach the corpus with the aim of correlating terms and extracting do cument topics LSI reduces the storage using Singular Valued Decomp osition SVD factor analysis Although this serves to reduce storage it also discards information and compresses out the worddocument asso ciations we need for our evaluation There are alternative hashing strategies as compared to the standard hash structure evaluated in this pap er They are aimed at partial matching but tend to retrieve false p ositives do cuments app ear to match that should not and are thus Imp erfect There may b e insucient dimensions for uniqueness so many words may hash to the same bits and thus false matches will b e re trieved The extra matches then have to b e rechecked for correctness and the false matches eliminated thus slowing retrieval They are describ ed in and include address generation hashing see also and hashing with descriptors see also There are also sup erimp osed co ding SIC techniques describ ed in for hash table partial matching applications but again these are Imp er fect and tend to overretrieve for example onelevel sup erimp osed co ding see also and twolevel sup erimp osed co ding see also We wish to avoid the information loss inherent in LSI and also the false p ositives of SIC with their intrinsic requirement for a p ostmatch recheck to eliminate false matches Therefore we concentrate on Perfect lexical techniques We analyse an inverted le list a slightly more sophisticated version of which is used in the Go ogle search engine against hash tables using various hash functions against binary asso ciative memory AURA We implement a novel binary matrix version of do cument vectors using AURA where the word do cument asso ciations are added incrementally and sup erimp osed Therefore training requires only a singlepass through the worddocument asso ciation list In all cases we evaluate the algorithms in their standard form without any so phisticated improvements to provide a valid comparison of the approaches We A Neural Retrieval Approach evaluate the algorithms for storage use training sp eed retrieval sp eed and partial matching capabilities Knuth p osits that hash tables are sup erior to inverted le lists with resp ect to sp eed but the inverted le list uses slightly less memory Knuth also details sup erimp osed co ding techniques but fo cuses on approaches with multiple bits set as describ ed ab ove that inevitably generate false p ositives We implement orthogonal singlebit set vectors for individual words and do cuments that pro duce no false p ositives AURA may b e used in Imp erfect mo de where multiple bits are set to represent words and do cuments this reduces the vector length required to store all asso ciations but generates false matches as describ ed previously and in Knuth In this pap er we fo cus on using AURA in Perfect mo de For our evaluation we use the Reuters Newswire text corpus as the dataset is a standard Information Retrieval b enchmark It is also large consisting of do cuments with on average approximately words p er do cument This allows the extraction of a large set of wordtodocument asso ciations for a thor ough evaluation of the algorithms We discarded any do cuments that contained little text and do cuments that were rep etitions This left do cuments that were further tidied to leave lowercase alphab etical and six standard punctua tion characters only All alphab etical characters were changed to lower case to ensure matching ie The b ecomes the to ensure all instances match We felt numbers and the other punctuation characters added little value to the dataset and are unlikely to b e stored in an IR system We left punctuation characters as control verication values We extracted all remaining words including the six punctuation symbols from the do cuments and derived wordtodocument asso ciations We created a le with the do cument 1 training time and thus sp eed is implementation dep endent To minimise variance we preserve as much similarity b etween the data structures as p ossible particularly during training see section subsections and for details A Neural Retrieval Approach ID integers from to and a list of each word that o ccurs in that do cu ment The fraction of do cuments that contain each individual word are shown in the graph see gure The minimal fraction is do cument and the maximal fraction is do cuments We analyse the three data structures for multiple single query term retrievals and also for partial match retrievals where do cuments matching N of M query terms are retrieved We assume that all words searched are present we do not consider error cases in this pap er eg words not present or sp elling errors Data structures Inverted File List IFL For the inverted le list compared in this pap er we use an array of words sorted alphab etically and linked to an array of lists This data structure min imises storage and provides exibility The array of words provides an index into the array of lists see gure and app endix A for the C implementation A word is passed to an indexing function binary search through the alphab et ically sorted array of words that returns the p osition of the word in the array The do cument list stored at array p osition X represents the do cuments asso ci ated with the word at p osition X in the alphab etic word array The lists are only as long as the number of do cuments asso ciated with the words to minimise storage but yet can easily b e extended to incorp orate new asso ciations The do cument ID is app ended to the head of the list for sp eed app ending at the tail requires a traversal of the entire list and thus slows training The approach requires minimal storage for the word array the array need only b e as long as the number of words stored as compared to the hash table b elow for example that requires the word storage array to b e only full but additions of new words would require the array to b e reorganised
Recommended publications
  • Full Functional Verification of Linked Data Structures
    Full Functional Verification of Linked Data Structures Karen Zee Viktor Kuncak Martin C. Rinard MIT CSAIL, Cambridge, MA, USA EPFL, I&C, Lausanne, Switzerland MIT CSAIL, Cambridge, MA, USA ∗ [email protected] viktor.kuncak@epfl.ch [email protected] Abstract 1. Introduction We present the first verification of full functional correctness for Linked data structures such as lists, trees, graphs, and hash tables a range of linked data structure implementations, including muta- are pervasive in modern software systems. But because of phenom- ble lists, trees, graphs, and hash tables. Specifically, we present the ena such as aliasing and indirection, it has been a challenge to de- use of the Jahob verification system to verify formal specifications, velop automated reasoning systems that are capable of proving im- written in classical higher-order logic, that completely capture the portant correctness properties of such data structures. desired behavior of the Java data structure implementations (with the exception of properties involving execution time and/or mem- 1.1 Background ory consumption). Given that the desired correctness properties in- In principle, standard specification and verification approaches clude intractable constructs such as quantifiers, transitive closure, should work for linked data structure implementations. But in prac- and lambda abstraction, it is a challenge to successfully prove the tice, many of the desired correctness properties involve logical generated verification conditions. constructs such transitive closure and
    [Show full text]
  • CSI33 Data Structures
    Outline CSI33 Data Structures Department of Mathematics and Computer Science Bronx Community College December 2, 2015 CSI33 Data Structures Outline Outline 1 Chapter 13: Heaps, Balances Trees and Hash Tables Hash Tables CSI33 Data Structures Chapter 13: Heaps, Balances Trees and Hash TablesHash Tables Outline 1 Chapter 13: Heaps, Balances Trees and Hash Tables Hash Tables CSI33 Data Structures Chapter 13: Heaps, Balances Trees and Hash TablesHash Tables Python Dictionaries Various names are given to the abstract data type we know as a dictionary in Python: Hash (The languages Perl and Ruby use this terminology; implementation is a hash table). Map (Microsoft Foundation Classes C/C++ Library; because it maps keys to values). Dictionary (Python, Smalltalk; lets you "look up" a value for a key). Association List (LISP|everything in LISP is a list, but this type is implemented by hash table). Associative Array (This is the technical name for such a structure because it looks like an array whose 'indexes' in square brackets are key values). CSI33 Data Structures Chapter 13: Heaps, Balances Trees and Hash TablesHash Tables Python Dictionaries In Python, a dictionary associates a value (item of data) with a unique key (to identify and access the data). The implementation strategy is the same in any language that uses associative arrays. Representing the relation between key and value is a hash table. CSI33 Data Structures Chapter 13: Heaps, Balances Trees and Hash TablesHash Tables Python Dictionaries The Python implementation uses a hash table since it has the most efficient performance for insertion, deletion, and lookup operations. The running times of all these operations are better than any we have seen so far.
    [Show full text]
  • CIS 120 Midterm I October 13, 2017 Name (Printed): Pennkey (Penn
    CIS 120 Midterm I October 13, 2017 Name (printed): PennKey (penn login id): My signature below certifies that I have complied with the University of Pennsylvania’s Code of Academic Integrity in completing this examination. Signature: Date: • Do not begin the exam until you are told it is time to do so. • Make sure that your username (a.k.a. PennKey, e.g. stevez) is written clearly at the bottom of every page. • There are 100 total points. The exam lasts 50 minutes. Do not spend too much time on any one question. • Be sure to recheck all of your answers. • The last page of the exam can be used as scratch space. By default, we will ignore anything you write on this page. If you write something that you want us to grade, make sure you mark it clearly as an answer to a problem and write a clear note on the page with that problem telling us to look at the scratch page. • Good luck! 1 1. Binary Search Trees (16 points total) This problem concerns buggy implementations of the insert and delete functions for bi- nary search trees, the correct versions of which are shown in Appendix A. First: At most one of the lines of code contains a compile-time (i.e. typechecking) error. If there is a compile-time error, explain what the error is and one way to fix it. If there is no compile-time error, say “No Error”. Second: even after the compile-time error (if any) is fixed, the code is still buggy—for some inputs the function works correctly and produces the correct BST, and for other inputs, the function produces an incorrect tree.
    [Show full text]
  • Verifying Linked Data Structure Implementations
    Verifying Linked Data Structure Implementations Karen Zee Viktor Kuncak Martin Rinard MIT CSAIL EPFL I&C MIT CSAIL Cambridge, MA Lausanne, Switzerland Cambridge, MA [email protected] viktor.kuncak@epfl.ch [email protected] Abstract equivalent conjunction of properties. By processing each conjunct separately, Jahob can use different provers to es- The Jahob program verification system leverages state of tablish different parts of the proof obligations. This is the art automated theorem provers, shape analysis, and de- possible thanks to formula approximation techniques[10] cision procedures to check that programs conform to their that create equivalent or semantically stronger formulas ac- specifications. By combining a rich specification language cepted by the specialized decision procedures. with a diverse collection of verification technologies, Jahob This paper presents our experience using the Jahob sys- makes it possible to verify complex properties of programs tem to obtain full correctness proofs for a collection of that manipulate linked data structures. We present our re- linked data structure implementations. Unlike systems that sults using Jahob to achieve full functional verification of a are designed to verify partial correctness properties, Jahob collection of linked data structures. verifies the full functional correctness of the data structure implementation. Not only do our results show that such full functional verification is feasible, the verified data struc- 1 Introduction tures and algorithms provide concrete examples that can help developers better understand how to achieve similar results in future efforts. Linked data structures such as lists, trees, graphs, and hash tables are pervasive in modern software systems. But because of phenomena such as aliasing and indirection, it 2 Example has been a challenge to develop automated reasoning sys- tems that are capable of proving important correctness prop- In this section we use a verified association list to demon- erties of such data structures.
    [Show full text]
  • CSCI 2041: Data Types in Ocaml
    CSCI 2041: Data Types in OCaml Chris Kauffman Last Updated: Thu Oct 18 22:43:05 CDT 2018 1 Logistics Assignment 3 multimanager Reading I Manage multiple lists I OCaml System Manual: I Records to track lists/undo Ch 1.4, 1.5, 1.7 I option to deal with editing I Practical OCaml: Ch 5 I Higher-order funcs for easy bulk operations Goals I Post tomorrow I Tuples I Due in 2 weeks I Records I Algebraic / Variant Types Next week First-class / Higher Order Functions 2 Overview of Aggregate Data Structures / Types in OCaml I Despite being an older functional language, OCaml has a wealth of aggregate data types I The table below describe some of these with some characteristics I We have discussed Lists and Arrays at some length I We will now discuss the others Elements Typical Access Mutable Example Lists Homoegenous Index/PatMatch No [1;2;3] Array Homoegenous Index Yes [|1;2;3|] Tuples Heterogeneous PatMatch No (1,"two",3.0) Records Heterogeneous Field/PatMatch No/Yes {name="Sam"; age=21} Variant Not Applicable PatMatch No type letter = A | B | C; Note: data types can be nested and combined in any way I Array of Lists, List of Tuples I Record with list and tuple fields I Tuple of list and Record I Variant with List and Record or Array and Tuple 3 Tuples # let int_pair = (1,2);; val int_pair : int * int = (1, 2) I Potentially mixed data # let mixed_triple = (1,"two",3.0);; I Commas separate elements val mixed_triple : int * string * float = (1, "two", 3.) I Tuples: pairs, triples, quadruples, quintuples, etc.
    [Show full text]
  • Alexandria Manual
    alexandria Manual draft version Alexandria software and associated documentation are in the public domain: Authors dedicate this work to public domain, for the benefit of the public at large and to the detriment of the authors' heirs and successors. Authors intends this dedication to be an overt act of relinquishment in perpetuity of all present and future rights under copyright law, whether vested or contingent, in the work. Authors understands that such relinquishment of all rights includes the relinquishment of all rights to enforce (by lawsuit or otherwise) those copyrights in the work. Authors recognize that, once placed in the public domain, the work may be freely reproduced, distributed, transmitted, used, modified, built upon, or oth- erwise exploited by anyone for any purpose, commercial or non-commercial, and in any way, including by methods that have not yet been invented or conceived. In those legislations where public domain dedications are not recognized or possible, Alexan- dria is distributed under the following terms and conditions: Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTIC- ULAR PURPOSE AND NONINFRINGEMENT.
    [Show full text]
  • TAL Recognition in O(M(N2)) Time
    TAL Recognition in O(M(n2)) Time Sanguthevar Rajasekaran Dept. of CISE, Univ. of Florida raj~cis.ufl.edu Shibu Yooseph Dept. of CIS, Univ. of Pennsylvania [email protected] Abstract time needed to multiply two k x k boolean matri- ces. The best known value for M(k) is O(n 2"3vs) We propose an O(M(n2)) time algorithm (Coppersmith, Winograd, 1990). Though our algo- for the recognition of Tree Adjoining Lan- rithm is similar in flavor to those of Graham, Har- guages (TALs), where n is the size of the rison, & Ruzzo (1976), and Valiant (1975) (which input string and M(k) is the time needed were Mgorithms proposed for recognition of Con- to multiply two k x k boolean matrices. text Pree Languages (CFLs)), there are crucial dif- Tree Adjoining Grammars (TAGs) are for- ferences. As such, the techniques of (Graham, et al., malisms suitable for natural language pro- 1976) and (Valiant, 1975) do not seem to extend to cessing and have received enormous atten- TALs (Satta, 1993). tion in the past among not only natural language processing researchers but also al- gorithms designers. The first polynomial 2 Tree Adjoining Grammars time algorithm for TAL parsing was pro- A Tree Adjoining Grammar (TAG) consists of a posed in 1986 and had a run time of O(n6). quintuple (N, ~ U {~}, I, A, S), where Quite recently, an O(n 3 M(n)) algorithm N is a finite set of has been proposed. The algorithm pre- nonterminal symbols, sented in this paper improves the run time is a finite set of terminal symbols disjoint from of the recent result using an entirely differ- N, ent approach.
    [Show full text]
  • Appendix a Answers to Some of the Exercises
    Appendix A Answers to Some of the Exercises Answers for Chapter 2 2. Let f (x) be the number of a 's minus the number of b 's in string x. A string is in the language defined by the grammar just when • f(x) = 1 and • for every proper prefix u of x, f( u) < 1. 3. Let the string be given as array X [O .. N -1] oflength N. We use the notation X ! i, (pronounced: X drop i) for 0:::; i :::; N, to denote the tail of length N - i of array X, that is, the sequence from index ion. (Or: X from which the first i elements have been dropped.) We write L for the language denoted by the grammar. Lk is the language consisting of the catenation of k strings from L. We are asked to write a program for computing the boolean value X E L. This may be written as X ! 0 E L1. The invariant that we choose is obtained by generalizing constants 0 and 1 to variables i and k: P : (X E L = X ! i ELk) /\ 0:::; i :::; N /\ 0:::; k We use the following properties: (0) if Xli] = a, i < N, k > 0 then X ! i E Lk = X ! i + 1 E Lk-1 (1) if Xli] = b, i < N, k ~ 0 then X ! i E Lk = X ! i + 1 E Lk+1 (2) X ! N E Lk = (k = 0) because X ! N = 10 and 10 E Lk for k = 0 only (3) X ! i E LO = (i = N) because (X ! i = 10) = (i = N) and LO = {€} The program is {N ~ O} k := Ij i:= OJ {P} *[ k =1= 0/\ i =1= N -+ [ Xli] = a -+ k := k - 1 404 Appendix A.
    [Show full text]
  • Specification Patterns and Proofs for Recursion Through the Store
    Specification patterns and proofs for recursion through the store Nathaniel Charlton and Bernhard Reus University of Sussex, Brighton Abstract. Higher-order store means that code can be stored on the mutable heap that programs manipulate, and is the basis of flexible soft- ware that can be changed or re-configured at runtime. Specifying such programs is challenging because of recursion through the store, where new (mutual) recursions between code are set up on the fly. This paper presents a series of formal specification patterns that capture increas- ingly complex uses of recursion through the store. To express the neces- sary specifications we extend the separation logic for higher-order store given by Schwinghammer et al. (CSL, 2009), adding parameter passing, and certain recursively defined families of assertions. Finally, we apply our specification patterns and rules to an example program that exploits many of the possibilities offered by higher-order store; this is the first larger case study conducted with logical techniques based on work by Schwinghammer et al. (CSL, 2009), and shows that they are practical. 1 Introduction and motivation Popular \classic" languages like ML, Java and C provide facilities for manipulat- ing code stored on the heap at runtime. With ML one can store newly generated function values in heap cells; with Java one can load new classes at runtime and create objects of those classes on the heap. Even for C, where the code of the program is usually assumed to be immutable, programs can dynamically load and unload libraries at runtime, and use function pointers to invoke their code.
    [Show full text]
  • Modification and Objects Like Myself
    JOURNAL OF OBJECT TECHNOLOGY Online at http://www.jot.fm. Published by ETH Zurich, Chair of Software Engineering ©JOT, 2004 Vol. 3, No. 8, September-October 2004 The Theory of Classification Part 14: Modification and Objects like Myself Anthony J H Simons, Department of Computer Science, University of Sheffield, U.K. 1 INTRODUCTION This is the fourteenth article in a regular series on object-oriented theory for non- specialists. Previous articles have built up models of objects [1], types [2] and classes [3] in the λ-calculus. Inheritance has been shown to extend both types [4] and implementations [5, 6], in the contrasting styles found in the two main families of object- oriented languages. One group is based on (first-order) types and subtyping, and includes Java and C++, whereas the other is based on (second-order) classes and subclassing, and includes Smalltalk and Eiffel. The most recent article demonstrated how generic types (templates) and generic classes can be added to the Theory of Classification [7], using various Stack and List container-types as examples. The last article concentrated just on the typeful aspects of generic classes, avoiding the tricky issue of their implementations, even though previous articles have demonstrated how to model both types and implementations separately [4, 5] and in combination [6]. This is because many of the operations of a Stack or a List return modified objects “like themselves”; and it turns out that this is one of the more difficult things to model in the Theory of Classification. The attentive reader will have noticed that we have so far avoided the whole issue of object modification.
    [Show full text]
  • Automatically Verified Implementation of Data Structures Based on AVL Trees Martin Clochard
    Automatically verified implementation of data structures based on AVL trees Martin Clochard To cite this version: Martin Clochard. Automatically verified implementation of data structures based on AVL trees. 6th Working Conference on Verified Software: Theories, Tools and Experiments (VSTTE), Jul 2014, Vienna, Austria. hal-01067217 HAL Id: hal-01067217 https://hal.inria.fr/hal-01067217 Submitted on 23 Sep 2014 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Automatically verified implementation of data structures based on AVL trees Martin Clochard1,2 1 Ecole Normale Sup´erieure, Paris, F-75005 2 Univ. Paris-Sud & INRIA Saclay & LRI-CNRS, Orsay, F-91405 Abstract. We propose verified implementations of several data struc- tures, including random-access lists and ordered maps. They are derived from a common parametric implementation of self-balancing binary trees in the style of Adelson-Velskii and Landis trees. The development of the specifications, implementations and proofs is carried out using the Why3 environment. The originality of this approach is the genericity of the specifications and code combined with a high level of proof automation. 1 Introduction Formal specification and verification of the functional behavior of complex data structures like collections of elements is known to be challenging [1,2].
    [Show full text]
  • An English-Persian Dictionary of Algorithms and Data Structures
    An EnglishPersian Dictionary of Algorithms and Data Structures a a zarrabibasuacir ghodsisharifedu algorithm B A all pairs shortest path absolute p erformance guarantee b alphab et abstract data typ e alternating path accepting state alternating Turing machine d Ackermans function alternation d Ackermanns function amortized cost acyclic digraph amortized worst case acyclic directed graph ancestor d acyclic graph and adaptive heap sort ANSI b adaptivekdtree antichain adaptive sort antisymmetric addresscalculation sort approximation algorithm adjacencylist representation arc adjacencymatrix representation array array index adjacency list array merging adjacency matrix articulation p oint d adjacent articulation vertex d b admissible vertex assignmentproblem adversary asso ciation list algorithm asso ciative binary function asso ciative array binary GCD algorithm asymptotic b ound binary insertion sort asymptotic lower b ound binary priority queue asymptotic space complexity binary relation binary search asymptotic time complexity binary trie asymptotic upp er b ound c bingo sort asymptotically equal to bipartite graph asymptotically tightbound bipartite matching
    [Show full text]