INFORMATION RETRIEVAL USING LUCENE and WORDNET a Thesis
Total Page:16
File Type:pdf, Size:1020Kb
INFORMATION RETRIEVAL USING LUCENE AND WORDNET A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Jhon Whissel December, 2009 INFORMATION RETRIEVAL USING LUCENE AND WORDNET Jhon Whissel Thesis Approved: Accepted: Advisor Dean of the College Dr. Chien-Chung Chan Dr. Chand Midha Committee Member Dean of the Graduate School Dr. Zhong-Hui Duan Dr. George R. Newkome Committee Member Date Dr. Kathy Liszka Department Chair Dr. Chien-Chung Chan ii ABSTRACT This thesis outlines the use of Apache Lucene, its subproject Nutch, and WordNet as tools for information retrieval, a science of searching through data in order to obtain knowledge that has become increasingly relevant in the Information Age. Lucene is a software library released by the Apache Software Foundation which provides information retrieval capabilities to programmers. Nutch, based on Lucene, adds Internet search functionality. WordNet is a lexical database which groups similar words into sets of synonyms, or synsets, and tracks their semantic relationships. This creates a combination of a dictionary and a thesaurus which may be browsed as such or used in software applications. Lucene and Nutch are released under the Apache Software License and are free and open source. WordNet is released under a similar license. The availability of free and open source tools such as Lucene, Nutch, and WordNet grants software developers a base from which they may create applications that can be tailored to their specifications, while simultaneously eliminating the need to create the entire code base for their projects from scratch. This practice can result in reduced programming time and lower software development costs. The remainder of this document outlines the use of these tools and then presents the methodology for the integration of their combined capabilities into a search engine capable of online information retrieval. This discussion culminates with a demonstration of how WordNet may be employed to remove search query ambiguity. iii DEDICATION This thesis is dedicated to My father, Fred Whissel For inspiring creativity My mother, Barbara Whissel For inspiring perseverance My aunt, Carol Decker For her unwavering support And to all of my family For allowing me to stand on their shoulders So that I might reach greater heights. iv ACKNOWLEDGEMENTS I would like to extend my most sincere gratitude to everyone who has played a role in my life and education throughout the years. The guidance, support, and inspiration of these numerous individuals has helped shaped me into who I am today. Many thanks to Dr. Pelz, Peggy Speck, Chuck Van Tilburg, and the rest of the faculty and staff of the Department of Computer Science for both sharing their knowledge as well as lending their aid whenever it was called upon. Though I was unable to attend classes with all of you, there was not a single faculty member who did not contribute in some way to my education over the last several semesters. Thanks to the Department as well for awarding me with an assistantship, providing me with the opportunity to teach at the University of Akron and an experience that I will never forget. Special thanks in particular to Dr. Chan for agreeing to become my mentor and aiding me with my research. The crucial guidance and direction that I received has bestowed upon me the skills needed to successfully complete this thesis. Thanks to my Aunt Carol and the rest of my family for helping me through the challenges I have faced, and for demonstrating selfless compassion and understanding. Finally, I would like to offer my deepest thanks to my parents for raising me, instilling me with their wisdom, setting a good example for me, molding my personality, and without whose sacrifices the life I know today would not be possible. v TABLE OF CONTENTS Page LIST OF FIGURES ...........................................................................................................xi CHAPTER I. INTRODUCTION ..........................................................................................................1 1.1. Society in Transition .......................................................................................1 1.2. The Information Revolution ...........................................................................2 1.3. Information as a Commodity ..........................................................................3 1.4. Impact of the Internet ......................................................................................3 1.5. Information Abstractions ................................................................................4 1.6. Challenges of Information Management ........................................................5 1.7. Information Retrieval ......................................................................................6 1.8. Software Development in the Information Age ..............................................6 1.9. Free and Open Source Software Movements ..................................................7 1.10. Open Source Today .......................................................................................8 1.11. Drawbacks of Traditional Search Engines ....................................................9 1.12. Open Source Information Retrieval ............................................................10 1.13. Benefits of Open Source Development .......................................................10 1.14. WordNet ......................................................................................................11 vi 1.15. Combination of Open Source Tools ............................................................11 1.16. Working with Lucene and WordNet ............................................................12 II. APACHE LUCENE AND WORDNET ......................................................................13 2.1. Introduction to Lucene ..................................................................................13 2.2. Features of Lucene ........................................................................................14 2.3. Lucene Drawbacks and Nutch ......................................................................14 2.4. The Lucene Query Parser ..............................................................................15 2.5. Terms and Operators .....................................................................................16 2.6. Term Modifiers .............................................................................................16 2.7. Lucene Indexes .............................................................................................17 2.8. Index Segments .............................................................................................18 2.9. The WordNet Database .................................................................................18 2.10. Word Relationships .....................................................................................19 2.11. Word Types ..................................................................................................20 2.12. Word Types Excluded from WordNet .........................................................20 2.13. Relationships Between Word Types ............................................................21 2.14. Using WordNet ...........................................................................................21 2.15. Type Relationships in WordNet ..................................................................22 2.16. Noun Hypernyms and Hyponyms ...............................................................22 2.17. Noun Holonyms and Meronyms .................................................................23 2.18. Verb Hypernyms and Hyponyms ................................................................24 2.19. Verb Entailments .........................................................................................24 2.20. Adjective Relationships ..............................................................................25 vii 2.21. Adverb Relationships ..................................................................................25 2.22. Lexical Relationships in WordNet ..............................................................26 2.23. Practical Applications of WordNet .............................................................27 III. IMPLEMENTING A WEB SEARCH ENGINE .......................................................28 3.1. Using Nutch ..................................................................................................28 3.2. Installing the Java Environment ....................................................................29 3.3. Selecting a Web Interface .............................................................................30 3.4. Installing a Shell Environment ......................................................................31 3.5. Web Crawling with Nutch .............................................................................31 3.6. Large Scale vs. Small Scale Crawls ..............................................................33 3.7. Advantages of Large Scale Crawls ...............................................................33 3.8. Preparing Nutch for Crawling .......................................................................34 3.9. Configuring Nutch ........................................................................................35 3.10. Creating