Index Strategies for Efficient and Effective Entity Search
Total Page:16
File Type:pdf, Size:1020Kb
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 5-21-2015 INDEX STRATEGIES FOR EFFICIENT AND EFFECTIVE ENTITY SEARCH Huy T. Vu San Jose State University Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects Part of the Databases and Information Systems Commons, and the Systems Architecture Commons Recommended Citation Vu, Huy T., "INDEX STRATEGIES FOR EFFICIENT AND EFFECTIVE ENTITY SEARCH" (2015). Master's Projects. 396. DOI: https://doi.org/10.31979/etd.t8kz-y8gn https://scholarworks.sjsu.edu/etd_projects/396 This Master's Project is brought to you for free and open access by the Master's Theses and Graduate Research at SJSU ScholarWorks. It has been accepted for inclusion in Master's Projects by an authorized administrator of SJSU ScholarWorks. For more information, please contact [email protected]. INDEX STRATEGIES FOR EFFICIENT AND EFFECTIVE ENTITY SEARCH A Thesis Presented to The Faculty of the Department of Computer Science San José State University In Partial Fulfillment of the Requirement of the Degree Master of Science by Huy T. Vu May 2015 The Designated Thesis Committee Approves the Thesis Titled © 2015 Huy T. Vu ALL RIGHTS RESERVED INDEX STRATEGIES FOR EFFICIENT AND EFFECTIVE ENTITY SEARCH by Huy T. Vu APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE SAN JOSÉ STATE UNIVERSITY May 2015 ______________________________________________ Dr. Thanh Tran Department of Computer Science ______________________________________________ Dr. Tsau-Young Lin Department of Computer Science ______________________________________________ Dr. Chris Tseng Department of Computer Science ABSTRACT The volume of structured data has rapidly grown in recent years, when data-entity emerged as an abstraction that captures almost every data pieces. As a result, searching for a desired piece of information on the web could be a challenge in term of time and relevancy because the number of matching entities could be very large for a given query. This project concerns with the efficiency and effectiveness of such entity queries. The work contains two major parts: implement inverted indexing strategies so that queries can be searched in minimal time, and rank results based on features that are independent of the query itself. ACKNOWLEDGEMENTS I would like to thank my family for their great support throughout my study at San Jose State University I would like to express my gratitude to Dr. Thanh Tran, my thesis advisor, for his supports and motivations in the completion of this project I would like to thank my project committee member, Dr. Tsau-Young Lin and Dr. Chris Tseng, for their contribution in the completion of this project Table of Contents 1. Introduction ................................................................................................................ 5 2. Related Works ........................................................................................................... 7 2.1. Inverted Indexing ................................................................................................. 7 2.1.1. Vertical Indexing ........................................................................................ 8 2.1.2. Horizontal Indexing .................................................................................... 9 2.1.3. Reduced Indexing .................................................................................... 10 2.2. Learning To Rank .............................................................................................. 11 2.2.1. TF-IDF ..................................................................................................... 11 2.2.2. Okapi BM25 ............................................................................................. 13 2.2.3. BM25F & Variants .................................................................................... 14 2.3. Query-Independent Features ............................................................................ 14 3. Project Design.......................................................................................................... 15 3.1. Definition ........................................................................................................... 15 3.1.1. Problem Formulation................................................................................ 15 3.1.2. Terminology ............................................................................................. 15 3.2. Technology ........................................................................................................ 18 3.2.1. Apache Lucene ........................................................................................ 18 3.2.2. Solr & Java API ........................................................................................ 19 3.3. Query Efficiencies.............................................................................................. 19 3.4. Query Effectiveness .......................................................................................... 20 3.4.1. Simple Relevancy Score ......................................................................... 20 3.4.2. Independent Features .............................................................................. 21 4. Implementation ........................................................................................................ 22 4.1. Solr Server Setup .............................................................................................. 22 4.1.1. Schema .................................................................................................... 22 4.1.2. Configuration ........................................................................................... 23 4.2. Index Time Implementation ............................................................................... 23 4.2.1. Indexer ..................................................................................................... 24 4.2.2. Analyzer ................................................................................................... 24 4.2.3. Payload Filter ........................................................................................... 25 4.3. Search Time Implementation ............................................................................. 26 4.3.1. Search Handler ........................................................................................ 27 4.3.2. Scorer ...................................................................................................... 27 4.3.3. Timer........................................................................................................ 28 4.3.4. Result Formatter ...................................................................................... 28 5. Performance ............................................................................................................ 29 5.1. Query Types ...................................................................................................... 29 5.1.1. Default Term Query ................................................................................. 29 5.1.2. Term Query Over Single Field ................................................................. 29 5.1.3. Query Over Multiple Fields ...................................................................... 30 5.2. Medicare Healthful Contacts Dataset ................................................................ 30 1 5.2.1. Examination of Ten Queries .................................................................... 31 5.2.2. Experiment in Depth ................................................................................ 33 5.3. International Aiding Dataset .............................................................................. 35 5.3.1. Examination of Ten Queries .................................................................... 36 5.3.2. Experiment in Depth ................................................................................ 38 5.4. Discussion ......................................................................................................... 39 5.5. Reduced Indexing Factor .................................................................................. 40 6. Conclusion ............................................................................................................... 44 References .............................................................................................................. 45 2 List of Figures Figure 1: Growth of data, projection into year 2020 ........................................................ 5 Figure 2: Illustration of inverted index.............................................................................. 7 Figure 3: A sample entity object represented as document in Lucene ............................ 8 Figure 4: Lucene architecture and flow diagram ........................................................... 18 Figure 5: Solr architecture ............................................................................................. 19 Figure 6: A typical design of Solr schema ..................................................................... 22 Figure 7: Configuration of one request handler ............................................................. 23 Figure 8: Separate storage location for indexer ............................................................ 24 Figure 9: Example of payloads in inverted index ..........................................................