Query Based Stemming (PDF)

Query Based Stemming by Elizabeth Tudhope University of Waterloo Department of Computer Science Technical Report CS-96-31 Waterloo, Ontario, Canada, 1996 © Elizabeth Tudhope 1996 Abstract In information retrieval the relevancy of a document to a particular query is based on a comparison of the terms appearing in the query with the terms appearing in the document. Morphological variants of words (i.e. locate, locates, located, locating) often carry the same or similar meaning. Such terms should be considered equivalent for information retrieval purposes. Stemming is a simple application of natural language processing that is commonly applied at index time to reduce morphological variants to a common root form. This thesis first examines several approaches to stemming. Some of the problems associated with the use of stemming as a query expansion technique are then discussed. The construction of equivalence classes of words for stemming at query time is presented as a possible alternative method to address some of these problems. ii Acknowledgments I would like to thank my supervisor, Frank Tompa for his sound advice, suggestions, and attention to detail. I would also like to thank my readers Forbes Burkowski and Rick Kazman for the time they took to carefully read my thesis and offer encouragement and suggestions. A very special thanks goes to Charlie Clarke for always having an open door. His unending patience, encouragement, and advice is greatly appreciated. I am grateful for the hardware, software, technical and academic support provided by University of Waterloo MultiText project. I gratefully acknowledge funding provided by Natural Sciences and Engineering Research Council, Information Technology Research Center and the University of Waterloo Department of Computer Science. iii Contents 1. MOTIVATION ...................................................................................................................................... 1 1.1. PROBLEM SETTING ............................................................................................................................ 1 1.2. RETRIEVAL EVALUATION ................................................................................................................... 9 1.3. PROBLEM STATEMENT ......................................................................................................................16 1.4. THESIS OUTLINE...............................................................................................................................17 2. BACKGROUND ...................................................................................................................................19 2.1. INTRODUCTION TO STEMMING...........................................................................................................19 2.2. COMMON APPROACHES.....................................................................................................................22 2.3. EVALUATING STEMMING ALGORITHMS..............................................................................................27 3. QUERY BASED STEMMING: INDEXING PHASE..........................................................................33 3.1. INTRODUCTION TO QUERY BASED STEMMING ....................................................................................33 3.2. PREVIOUS EXPERIMENTS WITH SELECTIVE STEMMING .......................................................................36 3.3. OVERVIEW OF INTER-STEM SYSTEM ARCHITECTURE ..........................................................................38 3.4. STEM FORMATION MODULE ..............................................................................................................41 3.4.1. Tokenization/Word Identification.............................................................................................41 3.4.2. Normalization..........................................................................................................................43 3.4.3. Transformation ........................................................................................................................43 iv 3.4.4. Stemming ................................................................................................................................46 3.5. CLASS CONSTRUCTION MODULE .......................................................................................................46 3.5.1. Class Formation.......................................................................................................................46 3.5.2. Class Pruning...........................................................................................................................47 3.5.3. Structure Class.........................................................................................................................47 3.6. SAMPLE EQUIVALENCE CLASS GENERATION ......................................................................................48 3.6.1. Tokenization............................................................................................................................49 3.6.2. Normalization..........................................................................................................................50 3.6.3. Transformation ........................................................................................................................51 3.6.4. Stemming ................................................................................................................................52 3.6.5. Class Formation.......................................................................................................................54 3.6.6. Class Pruning...........................................................................................................................56 3.6.7. Structure Class.........................................................................................................................56 4. QUERY EXPANSION USING THE EQUIVALENCE CLASSES.....................................................57 4.1. CLASS ORDERING .............................................................................................................................57 4.1.1. Lexical Distance.......................................................................................................................58 4.1.2. gram ........................................................................................................................................61 4.1.3. Term Co-occurrence.................................................................................................................63 4.1.4. Term Importance .....................................................................................................................64 4.1.5. Concept Clusters ......................................................................................................................65 4.2. EXPANSION METHODS ......................................................................................................................67 4.2.1. Command Line ........................................................................................................................67 4.2.2. Batch .......................................................................................................................................68 4.2.3. Interactive................................................................................................................................70 5. EVALUATION OF QUERY BASED STEMMING............................................................................73 5.1. TREC................................................................................................................................................74 5.1.1. Overview..................................................................................................................................74 5.1.2. The Task..................................................................................................................................75 5.1.3. The TREC Test Collection .......................................................................................................76 5.1.4. Query Topics............................................................................................................................79 5.1.5. Relevance.................................................................................................................................80 5.1.6. Evaluation................................................................................................................................82 5.2. MULTITEXT RETRIEVAL ENGINE .......................................................................................................82 v 5.3. EQUIVALENCE CLASS CONSTRUCTION FOR THE TREC COLLECTION ...................................................85 5.4. QUERY BASED STEMMING EXPERIMENTS ...........................................................................................89 5.5. RESULTS ..........................................................................................................................................93 6. CONCLUSIONS .................................................................................................................................103 6.1. SUMMARY ......................................................................................................................................103

Load more