Intelligent Web Exploration by Pavel Kalinov
Author: Kalinov, Pavel
Published: 2012
Thesis Type: Thesis (PhD Doctorate)
School: School of Information and Communication Technology
DOI: https://doi.org/10.25904/1912/635
Copyright Statement: The author owns the copyright in this thesis, unless stated otherwise.
Downloaded from: http://hdl.handle.net/10072/365635
Griffith Research Online: https://research-repository.griffith.edu.au

Intelligent Web Exploration
by Pavel Kalinov
MSc Information Technology

Institute for Integrated and Intelligent Systems (IIIS)
School of Information and Communication Technology
Science, Environment, Engineering and Technology
Griffith University

Submitted in fulfilment of the requirements of the degree of Doctor of Philosophy
December 2011

Abstract

The hyperlinked part of the internet known as "the Web" arose without much planning for a future of millions of publishers and countless pieces of online content. It has no in-built mechanism to find anything, so tools external to it were introduced: initially web directories and then search engines. Search engines are based on machine learning and have been extremely successful. However, they have some inherent limitations and cannot, by design, address some needs: they serve only the "information locating" need and not "information discovery". Search engine users have learned to accept these limitations and in many cases do not realise how their search has been limited by shortcomings of the model.

Before the advent of the search engine, web directories were the only information-finding tool on the web. They were manually built and could not compete economically with the efficiency of search engines. This led to their virtual extinction, with the effect that the "information discovery" need of users is no longer served by any major information provider. Furthermore, none of the dominant information-finding models accounts for the person of the user in any meaningful way controllable by (or even visible to) the user.

This work proposes a method to combine a search engine, a web directory and a personal information management agent into an intelligent Web Exploration Engine in a way which bridges the gaps between these seemingly unrelated tools. Our hybrid, for which we have developed a proof-of-concept prototype [Kalinov et al., 2010b], allows users both to locate specific data and to discover new information. Information discovery is served by a web directory which is built with the assistance of a dynamic hierarchical classifier we developed [Kalinov et al., 2010a]. The category structure achieved by it is also the basis of a large number of nested search engines, allowing information locating both in general (similar to a "standard" search engine) and in a variety of contexts selectable by the user.

Personalisation in our model is distributed: user modelling happens where the user is, and is handled by a personal agent. The agent sends relevant information to the exploration engine at search time, allowing truly personalised search which accounts for the user's cognitive and current search context, while at the same time preserving privacy.

This work is currently in the initial stages of commercialisation, and preliminary patent research has been done with a view to patenting the concept of connecting distributed and independent personalisation agents to any number of search providers.

List of Publications

1. "Building a Dynamic Classifier for Large Text Data Collections", Pavel Kalinov, Bela Stantic and Abdul Sattar. In "Proceedings of the 21st Australasian Database Conference - ADC2010", Australian Computer Society, CRPIT Series, ISBN 978-1-920682-85-9.
2. "Let's Trust Users - It is Their Search", Pavel Kalinov, Bela Stantic and Abdul Sattar. In "Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - WI-IAT 2010", IEEE Computer Society, ISBN 978-0-7695-4191-4.
3. "Towards Real Intelligent Web Exploration", Pavel Kalinov, Bela Stantic and Abdul Sattar. 14th Asia-Pacific Web Conference - APWEB 2012.
4. "Proposed Model for Really Intelligent Web Exploration", Pavel Kalinov, Bela Stantic and Abdul Sattar. "Web Intelligence and Agent Systems" journal. (to be submitted)

Acknowledgements

Many thanks for the support, advice and ideas to:

The School of ICT at Griffith University, and more specifically my supervisors Prof. Abdul Sattar and Dr. Bela Stantic for their guidance and technical help through this whole project, Dr. Sankalp Khanna (for listening to some of my more advanced ideas with a straight face), Dr. Michael Blumenstein (for pointing me to useful information and people), Dr. John Zakos (for first telling me I don't know anything yet, and later telling me I may have got it right this time; and for a coffee at Macy's) and Prof. Vladimir Estivill-Castro (for some handy pointers).

Dr. Charles Gretton for having the patience to help me edit the first draft of my confirmation paper, and Prof. Gabriella Pasi for her very informative and structured keynote talk at the Web Intelligence 2010 conference in Toronto which helped me greatly with my review of personalisation techniques.

Last but not least, my wife Elena Vasileva and my parents, for putting up with me for so long.

Statement of Originality

This work has not previously been submitted for a degree or diploma to any university. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made in the thesis itself.

Signed:
December 2011

Contents

Abstract
List of Publications
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Introduction
  1.2 Contribution, Scope and Limitations
  1.3 This Document
2 Background and Motivation
  2.1 Summary
  2.2 Brief History
  2.3 Structure of the Web
    2.3.1 Physical Structure
    2.3.2 Logical Structure
    2.3.3 Document Interconnection (Hyperlinking): The Web
  2.4 Information Finding: Aspects and Approaches
    2.4.1 Information Locating
    2.4.2 Information Discovery
  2.5 Personalisation
    2.5.1 Aspects of Personalisation
    2.5.2 Personalisation Solutions
    2.5.3 Inherent Problems of Personalised Solutions
    2.5.4 Avoidable Problems of Personalised Solutions
    2.5.5 Personal Web Assistants
  2.6 Issues Arising From Current Solutions
    2.6.1 Consequences of Some Approaches
    2.6.2 Research Issues
3 Exploration Engine
  3.1 Summary
  3.2 Research Question
    3.2.1 Research Sub-questions
    3.2.2 Assumptions and Limitations
    3.2.3 Practical Tasks
  3.3 General Outline of the Solution
    3.3.1 Hybrid Web Directory
    3.3.2 User-Controlled Ontology
    3.3.3 Personal to Global Ontology
    3.3.4 Prototype and Practical Considerations
  3.4 Advantages of the Proposed Solution
    3.4.1 Usable User Profile
    3.4.2 Improved General Usability
    3.4.3 Expressive Queries
    3.4.4 Exhaustive Exploration of a Topic
    3.4.5 User-Specified Search Context
    3.4.6 Busting the Filter Bubble
    3.4.7 Information Scent
    3.4.8 Recommendation Engine and Information Discovery
  3.5 Disadvantages of the Proposed Solution
    3.5.1 Expensive Backend, Complexity
    3.5.2 Complicated Frontend, User Investment
  3.6 Specifics of the Exploration Engine
    3.6.1 Browsing the Directory
    3.6.2 Exploring the Directory
    3.6.3 Searching in the Directory
    3.6.4 Query and Query Expansion
    3.6.5 Enhancements
4 Related Work
  4.1 Summary
  4.2 Building the User Model
    4.2.1 Knowledge Representation and Ontology Matching
  4.3 Building the Web Directory
    4.3.1 Data Pre-processing
    4.3.2 Data Representation
    4.3.3 Indexing
    4.3.4 Dimensionality Reduction
    4.3.5 Unsupervised Clustering, SOM
    4.3.6 Classification
5 Implementation and Practical Issues
  5.1 Summary
  5.2 General Approach
  5.3 General Setup
  5.4 Selected Approaches to Tasks
    5.4.1 Building the User Model
    5.4.2 Knowledge Representation and Ontology Matching
    5.4.3 Data Acquisition and Processing
    5.4.4 Self-Organising Maps and Broken Dreams
    5.4.5 Classification
  5.5 Dataset and Experimental Study
    5.5.1 Test Setup
    5.5.2 Available Data
    5.5.3 Algorithm Testing and Experimental Results
  5.6 The Floating Query
  5.7 User Testing, or Lack Thereof
6 Conclusion
  6.1 State of the Art
  6.2 Proposed Solution
  6.3 Summary of Contributions
  6.4 Future Work
Bibliography
Index of Terms

List of Figures

2.1 Context based information supply [Pasi, 2010]. Note the implied grouping of user profiling, context knowledge and web data on the remote side.
3.1 Context based information supply: our proposal. User profiling is on the user side, information-related activities are on the server side. A (partial) user profile is sent on to the server after being filtered through the current activity context.
3.2 User browsing the directory. Exploration has not been personalised yet: in the upper box (highlighted by a bold rectangle) no relevance feedback has been added by the user. The keywords from the list in the bottom right box (highlighted) help the user formulate a search query and search either within the directory or at a number of external search providers. These keywords automatically emerge from the classifier data and are not supplied by the editor.