Riemann Space Model and Similarity-Based Web Retrieval
RIEMANN SPACE MODEL AND SIMILARITY-BASED WEB RETRIEVAL

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE

UNIVERSITY OF REGINA

BY Zhiwei Wang

Regina, Saskatchewan
February 24, 2001

© Copyright 2001: Zhiwei Wang

National Library of Canada, Acquisitions and Bibliographic Services, 395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

Similarity-based matching is widely used in the vector space model. However, the widespread adoption of similarity-based matching is hampered by disagreements over how similarity measures should be constructed and how large databases should be indexed so that similarity matching is possible at all. This thesis intends to overcome these hindrances and to establish a theoretical basis and implementation guidelines for applying similarity-based matching in Web retrieval.
The thesis analyzes the vector space model and shows that Web space would be modeled more exactly as a curved space than as a Euclidean space. Based on this, the thesis claims that it is inappropriate to attempt to apply a single similarity/dissimilarity measure globally on Web space. The thesis proposes a Riemann space model that explains previously unexplained phenomena. In the Riemann space model, dissimilarity functions are integrated into a single form of geodesic distance, which can be computed locally with a uniform formula. To some extent, this answers the long-standing open problem of identifying the conditions under which a particular similarity/dissimilarity measure should be used.

Following the theory of the Riemann space model, we propose a multi-stage approach that combines exact matching and partial matching in the design of new Web retrieval systems. In this approach, a retrieval system first forms a neighborhood of a query. This can be done using exact matching. Then, within the chosen neighborhood, more complicated similarity-based matching is performed. The documents are ranked according to their geodesic distances to the query. This is equivalent to using a ranking function specially designed for the given neighborhood. Since the similarity-based matching is performed only in a neighborhood, the computational cost involved in the search process would be reduced. The Riemann space model provides a sound theoretical basis for this multi-stage approach.

As a demonstration of application, we designed and implemented a personal Web retrieval (PWR) system. Unlike current search engines, subject trees, and metasearch engines, this system is a client-side program. It works like a personal secretary. It reads Web documents, ranks them according to their geodesic distances to the query, and also considers the user's general search interests. It can be viewed as a prototype of intelligent Web retrieval systems.
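The multi-stage approach described in the abstract can be illustrated with a minimal Python sketch. It is not the PWR implementation. It assumes that the neighborhood is formed by requiring every query term to occur in a document, and it uses the angle between term-count vectors (the great-circle distance on the unit sphere) as a stand-in for the geodesic distance, since the thesis's own local formula is not given here. The names `tokenize`, `angular_distance`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase whitespace-separated terms (a simplifying assumption)."""
    return text.lower().split()


def angular_distance(a, b):
    """Angle between two term-count vectors, i.e. the great-circle distance
    between their projections onto the unit sphere. Used here only as a
    stand-in for the geodesic distance discussed in the abstract."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return math.pi / 2  # treat an empty vector as orthogonal to everything
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def retrieve(query, documents):
    """Two-stage retrieval over a dict mapping document ids to text."""
    query_vector = Counter(tokenize(query))

    # Stage 1 (exact matching): keep only documents containing every query term.
    neighborhood = {}
    for doc_id, text in documents.items():
        doc_vector = Counter(tokenize(text))
        if all(term in doc_vector for term in query_vector):
            neighborhood[doc_id] = doc_vector

    # Stage 2 (similarity-based matching): rank the neighborhood by distance.
    return sorted(
        neighborhood,
        key=lambda doc_id: angular_distance(query_vector, neighborhood[doc_id]),
    )


if __name__ == "__main__":
    docs = {
        "a": "web retrieval",
        "b": "web retrieval systems and search engines",
        "c": "gardening tips for spring",
        "d": "web design basics",
    }
    print(retrieve("web retrieval", docs))
```

Because the distance computation runs only over the documents that survive stage 1, its cost scales with the size of the neighborhood and not with the size of the whole collection, which is the saving the abstract refers to.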
Acknowledgments

First, I would like to express my gratitude to my supervisor, Dr. R. B. Maguire. Without his valuable guidance and kind encouragement, this thesis would have been impossible. I am also grateful to my previous supervisor, Dr. S. K. M. Wong, for his instructive training and kind help over a long period of time. I thank the committee members, Dr. B. Gilligan, Dr. S. K. M. Wong, and Dr. Y. Y. Yao, and the external examiner, Dr. R. G. Goebel. Their instructive suggestions and comments greatly improved the thesis.

I gratefully acknowledge the financial support provided by the Government of Saskatchewan, the Faculty of Graduate Studies and Research, and the Department of Computer Science. In particular, I appreciate the full-time career opportunity I have had since 1994 in the Department of Computer Science, University of Regina. I thank Dr. K. E. Denford, former Dean of the Faculty of Science, and Dr. L. V. Saxton, former Head of Computer Science.

I am also indebted to my wife Xiaozhu, my daughter Min, and my son Zheng for their devoted love and support.

Contents

Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures

Chapter 1 Introduction
  1.1 World Wide Web
  1.2 Retrieval Techniques
  1.3 An Overview of the Riemann Space Model
    1.3.1 A Brief History of Non-Euclidean Geometry
    1.3.2 Riemann Space Model
  1.4 Summary of Contributions
  1.5 Thesis Outline

Chapter 2 A Geometrical Analysis of the Vector Space Model
  2.1 Generalized Linear Similarity Measures
    2.1.1 Definition
    2.1.2 Examples
  2.2 An Analysis of Generalized Similarity Measures
    2.2.1 Notations and Definitions
    2.2.2 The Pseudo-Cosine Measure
    2.2.3 The Cosine Measure
  2.3 Existence Theorems
    2.3.1 Preliminaries
    2.3.2 The First Existence Theorem
    2.3.3 The Second Existence Theorem
  2.4 User Preference, Query, and Underlying Surface

Chapter 3 The Theory of Riemann Space Model
  3.1 Motivation
  3.2 Formal Properties of Similarity and Dissimilarity
    3.2.1 Some Inappropriateness in the Conventional Similarity Measures
    3.2.2 Axiomatic Properties of Similarity and Dissimilarity Functions
  3.3 The Notion of Curved Web Space
    3.3.1 Integration of Dissimilarity Functions
    3.3.2 Local Linearization of Global Dissimilarity
  3.4 Mathematical Notions in Riemann Space Model
    3.4.1 Topological Space
    3.4.2 Differentiable Manifold
    3.4.3 Tangent Space
    3.4.4 Riemannian Metrics and Geodesics
  3.5 Curvilinear Coordinate System
  3.6 Keyword Analysis: the Topography in Web Space

Chapter 4 Application of the Riemann Space Model
  4.1 An Overview of Web Retrieval Systems
  4.2 A Personal Web Retrieval System
  4.3 System Outline and Architecture
  4.4 Automatic Query Formulating
    4.4.1 Query By Example
    4.4.2 User Modeling
  4.5 Dynamic Keyword Analysis
    4.5.1 Local Coordinate System
    4.5.2 Statistical versus Semantical Analysis
  4.6 Best First Search

Chapter 5 Sample Implementation
  5.1 Introduction
  5.2 Algorithm Outline
  5.3 System Features
    5.3.1 The Main Window
    5.3.2 Controlled Vocabulary
    5.3.3 Keyword Extraction
    5.3.4 Document Collection
    5.3.5 The URL-Keyword Window
    5.3.6 The Keyword-Keyword Window
    5.3.7 Ranking

Chapter 6 Conclusion and Discussion
  6.1 Main Contributions
    6.1.1 The Riemann Space Model
    6.1.2 Local Linearization
    6.1.3 User Modeling and Subspace
    6.1.4 Multi-Stage Personal Web Retrieval
  6.2 Future Research and Open Problems

Appendix A Sample Source Code
Bibliography

List of Tables

  1.1 Growth of the Internet
  4.1 A List of Some Subject Trees
  4.2 A List of Some Search Engines
  4.3 A List of Some Metasearch Engines
  Comparison of Different Sorting Algorithms

List of Figures

  Classification of Retrieval Techniques
  A Contour Curve
  A Contour Curve in Pseudo-Cosine Measure Model. The underlying hypersurface U is a plane. The contour curve usually is a straight line.
  A Contour Curve in Cosine Measure Model. The underlying hypersurface U is a sphere. The contour curve usually is a circle.
  The Four Modules of the PWR System
  Query Formulating Module
  Internet Searching Module
  The Algorithm of the PWR System
  Main Window
  Main Window with Keywords Extracted