Intelligent Web Exploration

Author Kalinov, Pavel

Published 2012

Thesis Type Thesis (PhD Doctorate)

School School of Information and Communication Technology

DOI https://doi.org/10.25904/1912/635

Copyright Statement The author owns the copyright in this thesis, unless stated otherwise.

Downloaded from http://hdl.handle.net/10072/365635

Griffith Research Online https://research-repository.griffith.edu.au

Intelligent Web Exploration

by

Pavel Kalinov MSc Information Technology

Institute for Integrated and Intelligent Systems (IIIS) School of Information and Communication Technology Science, Environment, Engineering and Technology Griffith University

Submitted in fulfilment of the requirements of the degree of Doctor of Philosophy

December 2011

Abstract

The hyperlinked part of the internet known as "the Web" arose without much planning for a future of millions of publishers and countless pieces of online content. It has no in-built mechanism to find anything, so tools external to it were introduced: initially web directories and then search engines. Search engines are based on machine learning and have been extremely successful. However, they have some inherent limitations and cannot, by design, address some needs: they serve the "information locating" need only and not "information discovery". Users have learned to accept them and in many cases do not realise how their search has been limited by shortcomings of the model.

Before the advent of the search engine, web directories were the only information-finding tool on the web. They were manually built and could not compete economically with the efficiency of search engines. This led to their virtual extinction, with the effect that the "information discovery" need of users is no longer served by any major information provider. Furthermore, none of the dominant information-finding models account for the person of the user in any meaningful way controllable by (or even visible to) the user.

This work proposes a method to combine a search engine, a web directory and a personal information management agent into an intelligent Web Exploration Engine in a way which bridges the gaps between these seemingly unrelated tools. Our hybrid, for which we have developed a proof-of-concept prototype [Kalinov et al., 2010b], allows users both to locate specific data and to discover new information. Information discovery is served by a web directory which is built with the assistance of a dynamic hierarchical classifier we developed [Kalinov et al., 2010a]. The category structure it produces is also the basis of a large number of nested search engines, allowing information locating both in general (similar to a "standard" search engine) and in a variety of contexts selectable by the user.

Personalisation in our model is distributed: user modelling happens where the user is, and is handled by a personal agent. The agent sends relevant information to the exploration engine at search time, allowing truly personalised search which accounts for the user's cognitive and current search context, while at the same time preserving privacy. This work is currently in the initial stages of commercialisation, and preliminary patent research has been done with a view to patenting the concept of connecting distributed and independent personalisation agents to any number of search providers.

List of Publications

1. “Building a Dynamic Classifier for Large Text Data Collections”, Pavel Kalinov, Bela Stantic and Abdul Sattar. In “Proceedings of the 21st Australasian Database Conference - ADC2010”, Australian Computer Society, CRPIT Series, ISBN 978-1-920682-85-9

2. "Let's Trust Users - It is Their Search", Pavel Kalinov, Bela Stantic and Abdul Sattar. In "Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - WI-IAT 2010", IEEE Computer Society, ISBN 978-0-7695-4191-4

3. “Towards Real Intelligent Web Exploration”, Pavel Kalinov, Bela Stantic and Abdul Sattar. 14th Asia-Pacific Web Conference - APWEB 2012.

4. “Proposed Model for Really Intelligent Web Exploration”, Pavel Kalinov, Bela Stantic and Abdul Sattar. “Web Intelligence and Agent Systems” journal. (to be submitted)


Acknowledgements

Many thanks for the support, advice and ideas to:

The School of ICT at Griffith University, and more specifically my supervisors Prof. Abdul Sattar and Dr. Bela Stantic for their guidance and technical help through this whole project, Dr. Sankalp Khanna (for listening to some of my more advanced ideas with a straight face), Dr. Michael Blumenstein (for pointing me to useful information and people), Dr. John Zakos (for first telling me I don’t know anything yet, and later telling me I may have got it right this time; and for a coffee at Macy’s) and Prof. Vladimir Estivill-Castro (for some handy pointers).

Dr. Charles Gretton for having the patience to help me edit the first draft of my confirmation paper, and Prof. Gabriella Pasi for her very informative and structured keynote talk at the Web Intelligence 2010 conference in Toronto which helped me greatly with my review of personalisation techniques.

Last but not least, my wife Elena Vasileva and my parents, for putting up with me for so long.


Statement of Originality

This work has not previously been submitted for a degree or diploma to any university. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made in the thesis itself.

Signed:

December 2011


Contents

Abstract i

List of Publications iii

Contents ix

List of Figures xiii

List of Tables xv

1 Introduction 1
  1.1 Introduction 1
  1.2 Contribution, Scope and Limitations 5
  1.3 This Document 6

2 Background and Motivation 7
  2.1 Summary 7
  2.2 Brief History 7
  2.3 Structure of the Web 8
    2.3.1 Physical Structure 8
    2.3.2 Logical Structure 9
    2.3.3 Document Interconnection (Hyperlinking): The Web 12
  2.4 Information Finding: Aspects and Approaches 13
    2.4.1 Information Locating 14
    2.4.2 Information Discovery 25
  2.5 Personalisation 32
    2.5.1 Aspects of Personalisation 32
    2.5.2 Personalisation Solutions 33
    2.5.3 Inherent Problems of Personalised Solutions 43
    2.5.4 Avoidable Problems of Personalised Solutions 44
    2.5.5 Personal Web Assistants 44
  2.6 Issues Arising From Current Solutions 45
    2.6.1 Consequences of Some Approaches 45
    2.6.2 Research Issues 51

3 Exploration Engine 53
  3.1 Summary 53
  3.2 Research Question 53
    3.2.1 Research Sub-questions 54
    3.2.2 Assumptions and Limitations 55
    3.2.3 Practical Tasks 56
  3.3 General Outline of the Solution 57
    3.3.1 Hybrid Web Directory 57
    3.3.2 User-Controlled Ontology 59
    3.3.3 Personal to Global Ontology 59
    3.3.4 Prototype and Practical Considerations 59
  3.4 Advantages of the Proposed Solution 60
    3.4.1 Usable User Profile 60
    3.4.2 Improved General Usability 62
    3.4.3 Expressive Queries 63
    3.4.4 Exhaustive Exploration of a Topic 63
    3.4.5 User-Specified Search Context 63
    3.4.6 Busting the Filter Bubble 64
    3.4.7 Information Scent 64
    3.4.8 Recommendation Engine and Information Discovery 64
  3.5 Disadvantages of the Proposed Solution 65
    3.5.1 Expensive Backend, Complexity 65
    3.5.2 Complicated Frontend, User Investment 66
  3.6 Specifics of the Exploration Engine 66
    3.6.1 Browsing the Directory 66
    3.6.2 Exploring the Directory 68
    3.6.3 Searching in the Directory 69
    3.6.4 Query and Query Expansion 70
    3.6.5 Enhancements 73

4 Related Work 75
  4.1 Summary 75
  4.2 Building the User Model 75
    4.2.1 Knowledge Representation and Ontology Matching 78
  4.3 Building the Web Directory 79
    4.3.1 Data Pre-processing 79
    4.3.2 Data Representation 80
    4.3.3 Indexing 81
    4.3.4 Dimensionality Reduction 82
    4.3.5 Unsupervised Clustering, SOM 86
    4.3.6 Classification 90

5 Implementation and Practical Issues 93
  5.1 Summary 93
  5.2 General Approach 94
  5.3 General Setup 95
  5.4 Selected Approaches to Tasks 95
    5.4.1 Building the User Model 96
    5.4.2 Knowledge Representation and Ontology Matching 97
    5.4.3 Data Acquisition and Processing 100
    5.4.4 Self-Organising Maps and Broken Dreams 107
    5.4.5 Classification 113
  5.5 Dataset and Experimental Study 117
    5.5.1 Test Setup 117
    5.5.2 Available Data 118
    5.5.3 Algorithm Testing and Experimental Results 122
  5.6 The Floating Query 137
  5.7 User Testing, or Lack Thereof 138

6 Conclusion 141
  6.1 State of the Art 141
  6.2 Proposed Solution 143
  6.3 Summary of Contributions 144
  6.4 Future Work 147

Bibliography xvii

Index of Terms xxix

List of Figures

2.1 Context based information supply [Pasi, 2010]. Note the implied grouping of user profiling, context knowledge and web data on the remote side. 41

3.1 Context based information supply: our proposal. User profiling is on the user side, information-related activities are on the server side. A (partial) user profile is sent on to the server after being filtered through the current activity context. 57

3.2 User browsing the directory. Exploration has not been personalised yet: in the upper box (highlighted by bold rectangle) no relevance feedback has been added by the user. The keywords from the list in the bottom right box (highlighted) assist the user to formulate a search query and search either within the directory or at a number of external search providers. These keywords automatically emerge from the classifier data and are not supplied by the editor. (example with actual data) 67

3.3 User exploring the directory. The system recommends a sub-category and shows some examples from it (individual site listed in upper highlighted box; in front of its name are the relevance feedback tools: buttons to mark it as relevant, bookmark it, mark it as not relevant or report it to the editors as error/spam). In the middle box: the system lists other sub-categories at the same level, ordered by relevance to the query. Lower box: ordering options. 69

3.4 Advanced search in the directory. The user has already supplied some positive and negative feedback (in the upper highlighted box). A positive example appears in the results list and is displayed with a bold font (personalised presentation) in the lower highlighted box. 71

3.5 Normalised weights of query terms. The initial two-word query was expanded with the term vectors of documents marked as relevant or not relevant (since there is more than one document, we have an IDF component, hence their values are not integer). As the user moves from Business into the more specific Transportation and Logistics category, the query gets reordered by local importance of terms. (example with actual data) 72

4.1 SOM update procedure (source: SOM Toolbox). 89

5.1 The forming of a complex research query: term vectors from documents in a category form the category vector, which is weighted with respect to neighbouring categories and then used to modify the user's query. The resulting query vector is the original user's query modified with context-specific weights which represent the user's cognitive and current contexts. 99

5.2 Consequences of bad initialisation. Vertical lines represent new iterations. 134

List of Tables

2.1 Geotargeted search results: number of country-specific results if query originates from Australia, Canada, India, South Africa, the UK or the USA (top row) and is submitted through the respective Google domain (tested in March 2011 with query "earthmoving equipment"). "World" results are not country specific. None of the results (from 5 out of 6 locations) were sites of actual earthmoving equipment manufacturers. 36

5.1 The Open Directory data. 119

5.2 Average classification accuracy 135

5.3 Stochastic adjustment to distribution. The first column shows the real distribution of instances, the second column shows what distribution the MNB-SPDA algorithm converged to, and the third is the boost it gave to problematic categories (or handicap where it made negative corrections to categories in which it was too successful). 136

5.4 Class-specific success rates. Note how MNB-SPDA has sacrificed accuracy in the Regional class in order to improve all others. 137

5.5 Algorithm training times, normalised. MNB-SPDA trains less than the ordinary train-on-error variant because it makes fewer errors to train on. DCC has no training phase. 137

5.6 Algorithm classification times, normalised. 138

5.7 Evolution of the original search query as the user moves through the English-language category into Equipment and Aerospace subcategories. 138

1 Introduction

1.1 Introduction

The internet, and more specifically the hyper-linked part of it we call "The Web", is probably the twentieth-century invention with the largest impact on our lives. It radically changed the way we seek and exchange information, thereby changing society as a whole. However, it emerged and developed quite unplanned, and many problems still exist with it, most notably the issue of navigating through this virtual sea of information.

One of the fundamental characteristics of the web is the unstructured and decentralised nature of its publishing model, which facilitated its explosive growth and made it what it is today. Enormous databases are constantly added to it by creating web interfaces to legacy systems; offline data is being transferred online (books are digitised, newspaper archives going back more than a hundred years go online etc.); most importantly, enormous amounts of content are constantly being created specifically for the web.

Accumulating too much knowledge though was never considered a source of joy in itself: “For in much wisdom is much grief, and he who increases knowledge increases sorrow.” (Ecclesiastes 1:18).

If knowledge is to be useful, it has to be structured in some usable way, and not the least part of that structure is making the knowledge easily findable. If knowledge is not findable, it is virtually non-existent for the person who cannot find it. The web's unstructured and decentralised nature, however, makes finding relevant information a task of ever-increasing difficulty, as no in-built information-seeking solution exists. Users have to

employ a variety of methods to find the information they need, all of them external to the web itself. Some of these methods are human-powered and some are based on the application of machine learning algorithms. The latter are typically represented by the search engine, which powers the keyword search paradigm. For a number of historic and business reasons, this has emerged as the dominant information-finding solution. Another paradigm, that of browsing (or exploration), is usually associated with human-powered solutions, typically web directories maintained by editors. Machine learning was never successfully applied to it, which resulted in the decline of services offering such solutions and a situation where the alternative keyword search paradigm is virtually the only one in active use.

What is often overlooked is that the two paradigms provide solutions to different information-finding needs. Keyword search allows locating specific documents users want to find, while browsing allows discovery of documents that users did not previously know existed. Helping users to locate a document allows them to deepen their knowledge in a field which is already familiar to them. Helping them to discover new documents that they did not know existed allows them to broaden their knowledge in a field which has so far been unfamiliar. These are two complementary needs that should have complementary solutions. In the "offline world", when we want to start learning about something (such as starting a new subject at school), we have a teacher or other person pointing us to introductory texts which explain basic concepts and allow us to start asking more specific questions. A search engine answers those specific questions, but does not let us find the background information which would enable us to formulate them. The decline of the browsing paradigm has created a situation where exploration of the web is severely restricted by the initial ignorance of people about an unfamiliar topic. Users are confined to visiting small pockets of sites which (according to a search engine) match a limited number of "keywords" from the particular user's vocabulary and general "cognitive background" (or context), without the ability to expand that vocabulary and background and start asking different questions. (To be precise, a "search query consisting of a list of keywords" is not actually equivalent to a "meaningful question" in the first place, but that is a separate issue of the keyword search paradigm.)

Nevertheless, a significant part of the scientific community is spending enormous effort on improving the solutions implementing the keyword search paradigm. Far less effort has been focused on satisfying the information discovery need. Users have been forced to employ search engines in ways they were not designed to be used in order to satisfy this need. The gradual decline of web directories and the concentration of the majority of research on search engines (as well as some inherent issues with them) have been highly detrimental to user experience on the web, a development which has sadly gone largely unnoticed because it happened over the course of many years, because most people tend to take the web's operation for granted, and because they lack alternatives.

Furthermore, neither of the two dominant paradigms takes into account the person of

the user. They offer generic solutions centred on the available data and not on the users' needs, their cultural and educational background or the purposes to which they intend to apply the knowledge they seek. The users' views, opinions and assessments of the documents offered to them are not taken into account either. Current models provide the same search results for "Physics" to a university professor of quantum physics and to a third-grade student searching for basic explanations of fundamental concepts. A human would tailor his answer to a question depending on who is asking and what his level of comprehension of the matter is; a web directory or a search engine does not.

Generally speaking, when people browse or search they seek information relevant to a general topic or a specific question. But what does it mean for information to be "relevant to a general topic or a specific question"? This is a typical ill-posed problem lacking an unambiguous and measurable definition of its object; there are formal measures that can be applied, such as "normalised number of occurrences of the query term in a document", but they are usually meaningless to a human and are sometimes counter-productive. However, even though we lack a definition for relevance, search results still need to be returned to users in order of relevance (the most relevant first), or they will not be useful. Search engines therefore define their own relevance criteria, which humans may not agree with (we have to remember also that different people may have different criteria). In all dominant search models, though, search providers (search engines or web directories) do not allow their users to employ their own criteria to achieve personally relevant result ranking. The fact that search engines use "system values" and measurements to rank documents by relevance to a query makes them predictable and also exploitable. Web site owners can, and do, take advantage of some aspects of ranking algorithms and position their sites higher in search results by manipulating some aspects of their sites. Much of the content currently being published online has been influenced by the desire to manipulate the dominant search engine model; content is being "optimised" for the search engines and not for humans. Humans consider some of this content "digital garbage", but have no way to filter it out of their search results. Genuine, high-quality resources that are not "search engine optimised" rank below such digital garbage and are thus not discoverable by people relying on search engines to find them: much useful information is being lost through people not being able to find it.

Another issue with modern search engines is that, for a number of independent and unrelated reasons, they have no way to build a credible model of the user and his [1] long-term interests, or at least to connect some of his previous searches to his current search. Search engines treat search queries as one-off events, not related to a long-term interest of the person or to previous searches. The user receives the same results again and again, even though he may not find them useful, or is already familiar with them; even

[1] Throughout this text, "the user" is referred to as "he" and not "she" or (the ridiculously politically correct) "it" because male users are more numerous than female users in almost all countries where gender-specific data is available [International Telecommunication Union, 2008], so statistically the user is more likely to be a "he" than anything else.

worse - search engines sometimes try to address this issue and treat queries as related, but a) they only do it for the current search session, as they cannot connect it to previous sessions, and b) they do it implicitly, hidden from the user, who is not consulted or even told about implicit assumptions being made on his behalf. Search engines are suited to the task of finding some (small) piece of information fast, but do not support long-term research on a topic. Users have no "search provider" - search engine, web directory or other entity - able to "profile" their interests and background and assist them in this task.

On the other hand, despite much research into the issue, users have no tools which would allow them to profile themselves and then apply these profiles to their interaction with search engines to achieve personalised search. There are no personal information management tools either, to let users manage the flood of information that passes in front of their eyes every day. Existing "bookmarking" features of web browsers are inadequate in many respects, as are existing and proposed "desktop search engines" and browser toolbars (some of which also raise privacy concerns, as they are developed and distributed by search engine companies and have been found to leak users' personal data back to the company [Google, 2011b]). Personal agents that would assist user interaction with search engines and help "tune" web search to the person's interests and preferences have long been proposed and developed, but have not come into wide use (or any use at all); the reason for this is probably not implementation issues or shortcomings of individual agents, but a deeper issue inherent in the general approach used in their development.

Many projects exist which aim to improve various aspects of the web's operation and users' interactions with it. However, most of them are developed or sponsored by major players in the search engine space; if they are not, they are often influenced by earlier developments that are. This leads to "business logic" creeping into developments, where some possible improvements are neglected because there is no business sense in developing them, or because they are against a particular company's mode of operation. For example, all search providers prefer to accumulate data about their users: it assists them in providing personalised services, but more importantly it is also an asset which can be sold to advertisers (directly, or through expensive targeted advertising where targeting has been achieved using this data). This approach has created massive problems both for search providers and for users. Search providers face technical difficulties in obtaining, storing and processing the information, and legal issues regarding its extent and the duration of data retention. Users have a problem because data that they would want saved and processed does not get saved or processed, because search providers cannot do it or are prevented from doing it by legislation (and users are wary of handing their personal data to private companies, which are in most cases also foreign). Still, no solutions exist or are being developed where saving and processing of data in a meaningful and useful way is not dependent on search providers; an independent approach may be useful for end-users, but since it is not as useful for large companies, it is not being done.


In this work we address some of the main problems of the modern web, which at first glance seem unrelated to each other. However, in the course of our research, it became apparent that a simple change of approach can address several significant issues at the same time, bridging gaps in various models which have traditionally been treated separately. This new approach exploits just one advantage missing in most current research: since it is not done for the benefit of a large search provider (or from the point of view of a large search provider), it can afford to approach issues from the end-users’ point of view and solve them by giving more control to users.

As an alternative to the dominant information-finding models on the web, we propose a complex mechanism consisting of two independent parts: one part based on the infrastructure of a large search provider (which is indispensable for the task of indexing a significant part of the web) and one part distributed over end-users' computers, where it creates personal user profiles in the most logical place they can be created - where the user is. Cooperation between these two sides can allow both information locating and information discovery in a truly personalised way.

We can therefore define the main research question of this work as: “How to allow users to explore the web taking into account their information background and their preferences, and also allow them to manage their own data at the same time?”.

1.2 Contribution, Scope and Limitations

This study examines the current state of web search engines, the web itself, search personalisation solutions and some related issues and problems. A conceptual gap was identified in the lack of a meaningful way for search engines to build and maintain a usable profile for each user and to acquire and utilise information about both the general and the current context of every web search. The thesis proposes a distributed approach to profiling end-users which is both technically and legally preferable to the current state of the art, and which preserves user privacy. A revival of web directories is also proposed as a means to address another gap in the current search model: the lack of solutions satisfying users' "information discovery" need.

We also identified a major ethical problem with the current state-of-the-art solutions, namely that modern personalisation solutions can create a positive feedback loop with the user, where not only does the user influence the solutions but they in turn influence and shape the user. In our opinion, allowing machines and algorithms to educate people and form their world view is wrong. The Web Exploration Engine we propose allows users to break this feedback cycle and find the information they want, no matter what the algorithms think about its suitability. While this was an unintended consequence of the work, it remains a major contribution made for the benefit of web searchers worldwide.

As part of the research leading to this document, a proof-of-concept prototype was

built, which necessitated the practical application and testing of a number of methods and algorithms. One of them - the Multinomial Naïve Bayesian classifier - was significantly changed to fit a dynamic environment, which is another contribution of the work. Bearing in mind that this is mostly the work of one person, not supported by a major company, some practical limitations and constraints to its scope had to be recognised. For example, our prototype was only intended to show that a system such as we propose can be created; it did not need to be of "real world" quality. Also, where we needed to model an end-user's knowledge and preferences, we had to simplify this model and equate it with the collection of web documents he has accessed; this was the only data we could realistically collect without major additional developments and expenses. The collection of web-related data (web pages and their classification) was considered external to the scope of this work, and we had to use an open-data source for our experiments, the Open Directory Project. More importantly, we could not test the proof-of-concept prototype for usability. Our proposed model consists of a technical solution creating some order in a vast amount of content (web documents). While the prototype was shown to work as intended, we lacked the amount of content which would make the system comparable to a commercial search engine; for such a comparison, we would need both the database of such a search engine and a significant investment in editorial work, involving many months' labour by a number of editors. We acknowledge this as a shortcoming of the thesis and hope to rectify it in future work.

1.3 This Document

The rest of this document is organised as follows:

2. The "Background and Motivation" chapter provides a brief history of the development of the modern web and describes some of the unresolved issues it has with finding information, some of which we aim to address.

3. In "Exploration Engine" we outline our proposal addressing these issues. We also provide some usage scenarios describing the proposed system from the point of view of its various users and illustrating the benefits to everyday information search.

4. In "Related Work" we describe the state of the art in some areas related to various aspects of the implementation of our proposed Exploration Engine.

5. In the "Implementation and Practical Issues" chapter we describe the approach we have taken to address each specific problem in order to implement a prototype of our solution, and how it fits into the whole mechanism.

6. The "Conclusion" chapter briefly summarises our proposal and its benefits, and provides an outline of some future work which can further extend and improve the system.

2 Background and Motivation

2.1 Summary

In this chapter, we present a brief history of the development of the internet and an overview of the structure of the web. We discuss the information finding models that exist, which user needs they satisfy, how their practical implementations work and what issues arise from them. We also look into various models for personalised information retrieval, their technical aspects and limitations. We identify the huge negative impact of a number of problems in the state-of-the-art solutions, which serves as the basis for the research question of this thesis and allows us to break it up into several sub-goals addressing different issues.

2.2 Brief History

The internet as we know it today resulted from the amalgamation of a number of earlier networks. Initially, it consisted of a small number of interlinked computers owned by large organisations. Interlinking was not “free for all” - in order to get connected, a computer had to go through a slow and elaborate procedure of syncing protocols and data formats, which involved not only the network maintainers but the computer’s owners as well. As a result, all nodes of the network were checked and approved initially and implicitly trusted thereafter. The owners and maintainers of each server were known, as well as what kind of data it had, how it was stored, how it could be accessed etc.


This resulted in an infrastructure where knowledge about the data within the network was assumed to be unrelated to the network itself - everyone was supposed to know how to find the information needed by means of some external knowledge, i.e. - manuals, documents or other information residing outside of the network structure. In other words, the network did not have anything to do with the content (or knowledge) kept on it. For a long time this was tractable, as the network was relatively small.

The "web explosion" was precipitated by the invention of the HyperText Transfer Protocol (HTTP), based on the proposal of Sir Tim Berners-Lee [Berners-Lee, 1989], which allowed authors to embed into their documents links to other documents residing on any server. This made information findable in a much easier and more intuitive way than before, and encouraged publishing more and more information online. The growth was exponential.

However, the implications of the emergence of the net as it is now - a vast open network of servers with practically unknown owners and no controlling body to enforce not only data integrity and access rights, but even basic policies and protocols - were not taken into account in the development of its infrastructure. The net did not adapt to this change, which resulted in some fundamental design flaws leaving huge gaps in various aspects of its operation. One of these obvious flaws, for example, is the inadequate SMTP protocol, which does not verify an email sender's identity - a feature inherited from the early days when everyone on the network was trusted, or at least known and accountable. This seemingly small oversight has brought about the current situation with spam email, where spam generates more than 90% of all email traffic [MessageLabs - Symantec Corporation, 2010] and may eventually make the whole email system unusable. A not so obvious flaw is the lack of a centralised information-finding infrastructure covering the vast interlinked structure which developed thanks to HTTP - the Web.

2.3 Structure of the Web

In order to understand some aspects of the problem this work tries to solve, we have to look at the underlying structure of the web and analyse how it affects the task of information finding.

2.3.1 Physical Structure

Documents “on the web” are actually located on different computers (servers) called hosts, situated all over the globe and connected to each other via the internet. These servers use a variety of platforms to “serve” their data: different operating systems, web server software, databases etc. All of this is usually transparent to the end-user (meaning that there is no difference to the user arising from any of these factors) and is largely irrelevant to the problem of finding information.


2.3.2 Logical Structure

An inherited problem of the web is the lack of a unified information structure which would allow a network user to find the needed information [1] in an easy and unambiguous way, conduct an exhaustive search (in case the same or similar information exists in a number of different locations) or - something equally important - be able to verify that such information does not exist anywhere on the network.

The addressing system, consisting of Uniform Resource Locators (URL) and the Domain Name System (DNS), which was supposed to add some order to web-published content, currently generates more problems than it solves and does not contribute to information discovery to any visible extent. Some of the important elements of the addressing system are described below.

A resource is the smallest, unbreakable piece of information which can be published on the web. It is a single file which can be accessed directly and separately - i.e., it does not need to be embedded in any other element in order to be accessed. Each resource is addressed via an identifier known as its Uniform Resource Locator. The URL consists of several parts: a protocol prefix, telling the accessing application what method to use in order to reach the resource; a domain name (part of the DNS system) or an IP address identifying the server hosting it; and a resource identifier within the domain (usually a directory name and file name). It is the domain owner's responsibility to ensure uniqueness of the resource/address match for the last part. One aspect of this - that there is only one resource corresponding to one URL - cannot be circumvented easily and is almost always true. The reverse, though, that one resource should only have one URL, is not usually the case. For a large number of reasons (some of them explained below), many resources can be addressed in multiple ways, which is a major source of trouble for search engines [Bar-Yossef et al., 2007].

A web page is the smallest web document from a logical point of view (i.e. - something that can make sense on its own). While a resource can be an image, a script or a style sheet which is meaningless separately, a web page can combine several resources and create a logical and meaningful document. It is encoded in the Hyper-Text Markup Language (HTML). One of the significant oversights of the web model is the structure of HTML: it combines all document elements into one file, so that content, formatting, graphic design, in-site navigation, links to external documents and meta-data are all part of the HTML file representing the web page. While the web was initially meant to be used by humans only, this made sense; however, such documents present major challenges for machine "consumption", since search engines have to separate each element from the rest - a non-trivial task now being addressed to some extent by the proposed specifications for HTML version 5 [2] (HTML5), which have been "in the works" for 7 years and are expected

[1] With all due respect to Claude Shannon, "information" here and elsewhere throughout this text is used as a synonym for data or content, whatever its actual informational payload is.
[2] http://www.w3.org/TR/html5/

to take even longer to be finished. A web site has been variously defined in terms of network infrastructure ("everything on one IP address"), ownership ("all online property of one company") or access method ("files that are made available through a web browser"), but these definitions are either incorrect or circular (e.g. - a site is something accessible by browsers, and browsers are things accessing sites). The most correct definition of a web site (author unknown) seems to be: "The entire collection of web pages and other information (such as images, sound and video files, etc.) that are made available through what appears to users as a single web server." In other words, it is all of the information that a user can access over the internet which has been collected under one logical structure, irrespective of features such as access rights, protocols, formats, location of resources or type of the structure itself (i.e. it does not have to be a tree or graph or list or anything else in particular, or anything at all). Usually, the HyperText Transfer Protocol (HTTP) is used to create unidirectional connections (links) between one resource and another, allowing site creators to build said site structure and logic. In some cases though, site structure and content are instead part of a downloadable application (e.g. a flash or java applet), which is another challenge for finding information, as its components are not individually addressable and links between them are not apparent or easily discoverable (leading to the existence of a large part of the "hidden web", which is visible to humans but not to search engines).

A domain is a collection of web sites which ideally, but not always, share a common trait (e.g. - a common topic, language, owner etc.). Domains have a hierarchical structure, where the top-level domains (TLDs) such as .com or .au are administered by accredited registrars which assign second-level domains to their respective holders. Each second-level domain holder is then free to create and administer any lower-level domains under this, and is also responsible for resolving the addresses of any resources in its domain. This system was intended as a tool assisting information discovery, as domain holders were supposed to provide access to their data in a reasonable form. In reality, however, the system suffers from several major flaws (discussed below):

• No relation to actual mappings of any kind.

• Unclear logic.

• Abuse by over-usage of domains.

• Under-usage of subdomains.

• Perceived content duplication.

No relation to actual mappings of any kind: domains do not carry any useful information relating to the actual underlying topology - a national domain does not mean that the site (or its host) is situated in the respective country or is related to it in any

way; subdomains in the same domain can be very dispersed geographically (or logically). In short - groupings are practically arbitrary. Furthermore, the DNS system is often manipulated to point the same domain name to different servers depending on the location of the user who is asking (Google, for example, widely uses this technique); what is even more confusing is that these servers then serve different content at the same URL: i.e. one user will see one content at google.com and another user will see something slightly (or drastically) different at the same address [metamend, 2010], with the unfortunate extreme example of Google China before its demise.

Unclear logic: the initial thinking was that domains should be separated into zones depending on what they were going to be used for - .com for commercial use, .edu for educational, .org for non-profit, .mil for military and .net for entities supporting the network itself. This was soon found to be very limited and impractical, and a switch to national domains was recommended, such as .au for Australia, .bg for Bulgaria etc. The initial system however, which had already grown considerably, was not abandoned and remained in place. The "used for" logic was abandoned though, adding further anarchy - now anyone can get .org or .net domains, so there is no difference between them and .com (but nobody can get .mil or .edu without proper justification). An experiment which further added to the complexity was the addition of new top-level domains such as .info, .biz, .name etc., which will be expanded in the near future by the unlimited addition of new generic top-level domains (gTLD) [ICANN, 2010]: apparently, ICANN [3] has given up on the idea of domain names having a centralised logic and will make the "free for all" situation official. As a result, an entity wishing to go online has a number of choices in getting a domain name, and in many cases registers all possible domain extensions, just in case. On the other hand, since hundreds of millions of names have been taken already, it is more often than not the case that a company cannot acquire its own name for its domain and gets a different one. The consequence is that users cannot just type the name of the company into the URL bar of a browser and depend on "getting it right" - the company name and the domain name are now usually different, negating the initial idea of creating human-usable addresses for resources. Moreover, the system does not assist (and was never intended to assist) finding generic information not related to companies - i.e., if you search for astronomical information, star.com would probably not be the place where you can find information about stars.

Abuse by over-usage of domains: many companies have many products they want to publish information about. Since putting something "on its own domain" is considered better from many points of view (mostly related to the search models explained below), one company ends up owning dozens of domains, where it publishes content not linked to its other content, using a different site structure for each one. So, we have the unrelated product1.com, product2.com etc. sites instead of subdomains product1.company.com,

[3] http://www.icann.org/

product2.company.com etc. as part of one site, with a common structure and a unified information-finding mechanism.

Under-usage of subdomains: on the other hand, most of the domain names regis- tered are never branched down into subdomains. Initially it was thought that a domain, as the name suggests, would include all of a company’s online properties, so that there would be www.company.com - the company’s presence on the web, intranet.company.com - its internal network, mail.company.com for its email service, product.company.com for a product etc. This never happened and now everything is labelled as “www”, which is a misuse of the system and a waste of naming resources, and does not contribute to introducing any order in online content.

Perceived content duplication: some search engines give preference to a site for a keyword which is contained in its name or URL. Many search engine optimisers register multiple domain names (each name including keywords they want to optimise for) with the purpose of tricking the search engines and getting better ranking in search results on this basis. In order to achieve this, many servers are configured to serve the same content under different domain names (i.e. one server answers multiple names) [Edelman, 2003; Fetterly et al., 2003]. From the point of view of a search engine, these are seen as different web sites - all having the same or very similar content. This perceived duplication has to be dealt with using quite sophisticated methods [Bar-Yossef et al., 2007; Manku et al., 2007], which are nevertheless error-prone.

Logically speaking, if the domain name system in general is to be functional from an information-finding point of view (as initially intended), a system similar to URL/DNS addressing should exist to assist locating content just as URLs assist locating resources. In such a system, every domain owner should be responsible for indexing all content within the domain and providing a search interface to it. In fact, it has already been proposed [Liang et al., 2004] that domain owners should supply meta-data about their content in a hierarchical system analogous to that of the DNS, with user queries being routed similarly to DNS queries - “bottom up” until an authoritative node is found, then back down the hierarchy until a resource is returned. This suggestion has not been followed up by further research or implementation, possibly because it leaves some major questions unanswered (for example - what is the authoritative node for the definition and information about a generic word such as “democracy” or “physics”?).
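Before moving on, the addressing elements described earlier in this section can be made concrete with a small sketch. The following Python fragment uses the standard urllib.parse module to split a URL into the protocol prefix, the domain name and the resource identifier within the domain; the URL itself is a hypothetical example, not a reference to any real site:

from urllib.parse import urlsplit

# Decompose a (hypothetical) URL into the parts described above.
parts = urlsplit("http://product1.company.com/docs/manual.html?page=2")
print(parts.scheme)   # protocol prefix: 'http'
print(parts.netloc)   # domain name, resolved through DNS: 'product1.company.com'
print(parts.path)     # resource identifier within the domain: '/docs/manual.html'
print(parts.query)    # optional query string: 'page=2'

Note that nothing in this decomposition says anything about what the resource is about, which is precisely the limitation of the addressing system discussed above.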

2.3.3 Document Interconnection (Hyperlinking): The Web

As early as 1945, when modern computers were only a hypothesis, the issue of browsing knowledge bases was considered and a model for conceptual browsing was proposed [Bush, 1945] which basically describes what amounts to modern hyperlinking of documents (it also suggests the understanding that people do not find convenient a top-down hierarchical approach to finding documents, but need associative links related to concepts).


The main characteristic of the modern web is the interconnection between documents with such associative links. These are provided by the HyperText Transfer Protocol (HTTP) in the form of a web link - a connection pointing from a part of one document (text or embedded image) to another document, which may be a part of the same web site or an external link leading to a different site.

These links form a graph which can be visualised as a web of connections (hence “web” as the name of the whole online phenomenon). They allow a way of navigation between documents known as browsing, which was initially the only way of information discovery. When modern search engines arose, they started using these links as a basis for their estimations of the “importance” of individual documents which was assumed to be related to the positioning of these documents on the graph (the most influential pioneer was Google’s “PageRank” algorithm [Page et al., 1998] introduced in 1998), although the graph was initially supposed to be used by humans only.

2.4 Information Finding: Aspects and Approaches

In many cases, users know beforehand precisely what they want and where to find it. These may be memorised addresses that users enter directly into their web browser (the so-called type-in traffic to web sites), or sites that were visited previously and bookmarked. When this is not the case, information has to be found by some other means.

Navigating the web is widely recognised as a non-trivial challenge. The lack of any in-built navigation or information-finding infrastructure means that users have to resort to a variety of methods (some of them assisted by search providers), corresponding to a variety of information finding needs.

Users’ information needs in general can be subdivided into three large categories: navigational, informational or transactional [Broder, 2002]. The navigational need is when users know what resources they need, but don’t know where they are so they need to locate them (thus the task is Information Locating); the informational need means users need information on a topic but don’t know what exactly it is, so they need to find out what relevant resources exist (Information Discovery). Transactional searches are related to cases where users want to perform a specific action - send an email, buy something online etc., i.e. they are not looking for information as such but are trying to perform a task; while this constitutes a significant part of web traffic, it is outside the scope of this work.

If we exclude offline methods (such as a booklet of useful web addresses - which actually exists!) or methods such as going to a discussion group and asking (which is a kind of request for someone to create a personalised recommendation list), there exist mainly two information finding options: searching or browsing. Searching can be loosely defined as an act of actively looking for particular information, or locating - usually done by the user typing in keywords and submitting a search request (or query) to a search engine. Browsing


(or discovery), on the other hand, is a passive exercise: the user clicks on links and follows them, with the limitation that these links have already been created by someone else; the user cannot actively input a search term and can only follow a pre-defined list of options. Both of these models are popular and widely used, and - it is important to note - by the same people. In other words, users cannot be divided into a “searching group” and a “browsing group”, the division is rather task-defined. For some tasks people search and for others they browse, expecting different outcomes from each strategy - the two models are complementary.

2.4.1 Information Locating

The information locating need of users is typically addressed by the search engine, implementing the keyword search paradigm. It is important to understand how search engines work in order to see both their strengths and their limitations.

2.4.1.1 General Technology of Web Search

A typical search engine [Brin and Page, 1998] consists of three parts - a machine for downloading content from the web (usually called a spider or crawler), indexing/evaluation logic, and a user interface. End-users see only the last part, but for it to work the search engine must have first collected the content and processed it, so it can then search in it according to various algorithms and present results matching the user's query. Contrary to what some users think (and some search engine help pages imply), a search engine, when queried, does not "search the web" - it only searches its own local database, which is just a limited reflection of the web. The method of creation of this reflection is thus very important. The basic idea of a web crawler is simple: it downloads a web page, stores it for indexing, looks in it for links to other pages, and if it finds any new ones, downloads those pages too, then goes on in this fashion until it has downloaded all of the web. Of course, it is not that simple and, as a matter of policy, a search engine does not actually want to download all of the web, but for practical purposes this may be taken as a good first approximation of what a crawler does. With the constant addition of new documents to the web, an important part of a web spider's job is finding these additions. There is another aspect to the web though - documents that are already published tend to change or disappear with a much higher frequency than one might intuitively expect. It has been estimated that 20% of documents change daily, and 50% of them get updated or replaced over a period of 50 days [Cho and Garcia-Molina, 2000]. This may seem excessive until we remember that a web document "update" may not mean that the actual document was updated. Most web sites are generated by Content Management Systems (CMS) based on a database of documents from which they generate web versions "on the fly". A web-accessible document consists

of the underlying document "wrapped up" by the CMS in some interface elements and site logic; in many cases, these are dynamic and change often, in an extreme case changing with every page reload (pages with a visitor counter, or displaying the current date/time, or embedded dynamic advertising). Search engines deal with this issue by implementing incremental (as opposed to batch) crawlers, which run constantly and update their copies of documents at varying intervals depending on the frequency at which a document is expected to change.

Removal of documents is a much more serious problem. A large-scale study found that pages disappear at a rate of 0.25-0.5% per week [Fetterly et al., 2003]. It would seem a trivial task to find out whether a link is dead or not, but it turns out this is not the case [Bar-Yossef et al., 2004]. Some sites generate soft 404 errors (they generate a web page with a textual explanation of the error instead of issuing the 404 "Not found" error code), or their CMS is set up to supply some content even at URLs that should be "dead". This behaviour confuses search engine crawlers and they have to take special measures against it [Yahoo!, 2007]. If a whole domain is removed, it is often occupied by cyber squatters who serve their own content on it [Edelman, 2003], which is often automatically generated and would not be considered real content by human users. Since this is difficult to detect by the crawler (and made more difficult by the fact that many cyber squatters deliberately generate content similar to that of the old site precisely for this purpose), it gets indexed as a change to the previous document and is a significant source of digital garbage.

Some limited tools have been developed to allow humans (site owners) to assist crawlers, influence the crawling process and pass some information to search engines:

• Meta tags are included in HTML documents but are for search engine use only (site visitors do not see them). These can be meta-information about the document or directives for crawlers.

• robots.txt is a standard text file in the home directory of a web site which contains more directives. Crawlers that comply with the Robots Exclusion Standard [4] access this document on a site first, to see what they are allowed to index.

• Sitemaps [5] were introduced by Google [6] to allow webmasters to inform the crawler about the structure of their sites. A sitemap is an XML file with a declaration of the site hierarchy, as well as the update frequency of particular resources.
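To illustrate how a crawler typically consults the second of these signals before fetching a page, here is a minimal sketch in Python using the standard urllib.robotparser module; the rules, URLs and crawler name are hypothetical examples rather than data from any real site:

# Minimal sketch: a crawler checking Robots Exclusion Standard rules before
# fetching. The rules, URLs and user-agent name below are hypothetical.
import urllib.robotparser

robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in ("http://www.example.com/products.html",
            "http://www.example.com/private/report.html"):
    allowed = parser.can_fetch("ExampleBot", url)
    print(url, "->", "fetch" if allowed else "skip")

A real crawler would first download robots.txt from the site itself and apply the same check to every URL it discovers; a sitemap file, if present, would be read in a similar way to seed the list of URLs to visit.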

However, these tools are very crude and limited. No mechanism exists to allow site operators to transmit to a search engine only the important parts of documents (according to the opinion of their owner), or some sort of document ranking. A sitemap, for example,

4 http://www.robotstxt.org/orig.html
5 http://www.sitemaps.org/protocol.php
6 http://www.google.com/

expressly does not convey information about page importance/ranking or content, just its position in the graph of the web site.

An alternative approach to information gathering by a web spider is to build user-driven search engines such as digg.com7: users bookmark sites they visit; this information is submitted to the search engine and is then used for its index. This is called social bookmarking and basically eliminates the need for a crawler. It also supposedly "harnesses the collective intelligence of the Web" [Gordon-Murnane, 2006] by applying human classification to data.

Resource pruning occurs on a massive scale where content is deemed to be spam or a scam of any kind, or a duplicate [Charikar, 2002] (which may also happen even before the actual crawling of the data [Bar-Yossef et al., 2007]). Of course, much data is not accessible to web spiders in the first place: where a user needs to be logged in to access a resource, where users have to actively submit a form to request a resource (something traditional crawlers cannot do), or where site navigation is embedded in an interactive application (such as a Java applet or Flash menu) - the so-called hidden web [Barbosa and Freire, 2007]. As a result, a significant portion of the web is not crawled or indexed at all.

The most important part of a search engine is the way it processes information after it is collected and before it is presented to end-users. Indexing can be done in a number of ways: the simplest indexing method is just a map of word occurrences (i.e. which word is contained in which document). Full-text indexing also takes into account the order in which words appear in the document. Latent semantic indexing [Deerwester et al., 1990] tries to reconstruct the meaning of the words contained in documents by correlating every word in every document with other words contained in documents which also contain this word (co-occurrence statistics). This allows associative search, where returned results may not actually match the query, but contain words with high co-occurrence values with (i.e. words more similar to) the terms in the search query.
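A minimal sketch of the simplest of these structures, an inverted index mapping each word to the set of documents that contain it (illustrative Python; the toy documents are our own assumptions):

from collections import defaultdict

docs = {                             # toy collection: document id -> text
    1: "daily newspaper published in brisbane",
    2: "sporting goods shop in brisbane",
    3: "newspaper archive search",
}

index = defaultdict(set)             # word -> set of document ids containing it
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# A conjunctive keyword query then reduces to a set intersection.
def search(*terms):
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("newspaper", "brisbane"))   # -> {1}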

The relative importance of a word for a document is typically measured by the TF-IDF measure [Robertson, 2004], a weighting scheme based on Term Frequency multiplied by Inverse Document Frequency: the number of times the word appears in the document, multiplied by the logarithm of the ratio of the total number of documents to the number of documents containing the word - which discounts popular words contained in many documents.
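Continuing the toy docs and index defined in the previous sketch, this weight could be computed as follows (a simplified illustration only; production engines use smoothed and length-normalised variants such as BM25):

import math

def tf_idf(term, doc_id, docs, index):
    """TF-IDF weight of `term` in document `doc_id`, using the toy docs/index above."""
    tf = docs[doc_id].split().count(term)     # term frequency within the document
    df = len(index.get(term, set()))          # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(docs) / df)            # rarer terms receive a higher weight
    return tf * idf

print(tf_idf("brisbane", 1, docs, index))     # common term -> lower weight
print(tf_idf("daily", 1, docs, index))        # rare term -> higher weight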

The relative importance of each document as a whole is usually judged by external methods, where the document itself is not analysed and the parameters taken into account lie outside of it. Such are the graph analysis methods, the most notable of which is Google's PageRank algorithm [Page et al., 1998]. It calculates the relative importance of a document from the number of other documents pointing to it, weighted by their own importance. Weights can also be negative if the document is in a bad neighbourhood (sites deemed to

7http://digg.com/

be spammy, which is calculated by an undisclosed method) - this adds a reputation factor which may override an otherwise high PageRank. This method is extremely computationally intensive, as it tries to build a relationship matrix involving billions of documents, and is highly recursive. Thus, it is only periodically updated by Google, using distributed computing over thousands of servers. The precursor of PageRank, HITS [Kleinberg, 1999], accounted not only for incoming but also for outgoing links from pages; through analysis of these, it aimed to identify "authority hubs" on a topic and then increase the ranking of pages for that topic if these hubs link to them.
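The core idea can be illustrated with a few lines of power iteration over a small link graph (a sketch only, with an assumed toy graph and the commonly cited damping factor of 0.85; Google's production algorithm, including its reputation adjustments, is far more elaborate and undisclosed):

def pagerank(links, damping=0.85, iterations=50):
    """`links` maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}       # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages               # a dangling page spreads its rank evenly
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_web))   # C accumulates the highest rank: most pages point to it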

Additionally, search result ranking is also influenced by implied user feedback: click-through data is analysed and preference is given to results that were more often accepted (clicked) by users. However, many users just click on the top results, so a positive feedback loop is created; search engines try to mitigate this by discounting clicks on top results and giving more weight to clicks on lower-ranked results.
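One naive way to express this position-bias correction (purely an illustration of the principle, with assumed prior click-through rates; it is not any search engine's actual, undisclosed formula) is to weight each click by the inverse of how likely a result at that position is to be clicked regardless of relevance:

# Assumed prior click-through rates per result position (illustrative numbers only).
position_prior = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def debiased_click_score(clicks_by_position):
    """Weight observed clicks by the inverse of the position prior, so that a click
    on a low-ranked result counts for more than a click on the top result."""
    score = 0.0
    for position, clicks in clicks_by_position.items():
        prior = position_prior.get(position, 0.03)   # default prior for deeper positions
        score += clicks / prior
    return score

print(debiased_click_score({5: 3}))   # 3 clicks at position 5 -> 60.0
print(debiased_click_score({1: 5}))   # 5 clicks at position 1 -> ~16.7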

Other search engines are trying different ranking algorithms, some based on neural networks [Burges et al., 2006; Richardson et al., 2006]. URLs are also analysed, although technically the URL of a resource should have no significance; however, it is usually believed that a shorter and simpler URL is better, and some algorithms give preference to resources on short URLs [Richardson et al., 2006], or assign higher document relevance for a word contained in the URL. This practice has given rise to many of the methods for generating digital garbage for the purpose of search engine optimisation.

Documents are rated as relevant not only to the words that they contain, but also to words that other people have put on their own sites when linking to them (e.g. if I put a link on my site to a document somewhere else, and the link reads "check out this story", the document gets indexed with the tag "story", although it does not contain the term). This has the side effect that people link to sites in certain ways with the purpose of exploiting this feature - a practice known as Googlebombing (or linkbombing, as it affects not only Google). The most famous case was the Googlebomb that made the search engine return the biography of George W. Bush when searched for "miserable failure". It was subsequently "defused", with Google claiming that it improved the algorithm8 so that such cases would not happen again, but the feature continues to be exploited by search engine optimisers to boost site rankings for desired keywords, and generates further amounts of digital garbage.

User interaction with search engines is simple: the user enters a keyword or a list of keywords (a search query) into a search box, presses the “Search” button and receives a number of links to documents (presumably relevant to the query) as search results. The degree of relevance is calculated by the search engine with an algorithm that it does not disclose, due to many search engine optimisers trying to manipulate it and get better rankings and more traffic to their sites. Most of the values used for this ranking have

8http://googlewebmastercentral.blogspot.com/2007/01/quick-word-about-googlebombs.html

been pre-processed, for example the PageRank (relative importance) of each document and other static measurements [Richardson et al., 2006]. At retrieval time, they are only combined to fit the particular query. The search engine may also perform a number of low-cost additional optimisations at that point, such as filtering based on the user's geographic location or preferred language, or suggesting an alternative query (usually correcting small typing mistakes). No major rephrasing of the query is suggested though - i.e., the user may be advised to spell a word differently or use a synonym, but not to replace the query with a completely different one.

An important point to note is that searching in its pure form is usually stateless - i.e., every search is treated as a new one, unrelated to any previous search; there is no (visible) session with information about previous actions of the user. The user cannot search within the results of an earlier query, but has to change the search terms and essentially start a new query (contrary to what the "Search within results" feature of Google implies, it actually starts a new search with the new query added to the initial query; however, search results for A + B are not a subset of search results for A). This is a major drawback, preventing users from performing extended searches over a longer timeframe (or, to put it another way, explorations of a particular topic). Modern search engines try to overcome this drawback by creating a user session "behind the scenes" (invisible to the user), where they do utilise some information from previous user actions such as "search trails" (queries during one search session [Cao et al., 2009]). This, however, is limited for technical reasons, and has all the inherent limitations of implicit feedback discussed below.

Some search engines try to assist users by clustering their search results - such as search.yippy.com9 (the search engine formerly known as clusty.com). Together with search results, users get a number of additional queries that may be used to narrow down a future search. These are based on clustering done at retrieval time, which is performed on a small subset of the data only, and clusters are formed and labelled automatically (as opposed to the manually created and named categories of a web directory). An industry white paper [Vivisimo Inc., 2006] explains: "Post-retrieval clustering of search engine results offers complexity, labor, cost, coordination, maintenance and flexibility advantages over the pre-retrieval tagging of all indexed pages. The issue of quality is a mixed one - tagging offers the guarantee of human-like labelling which may be too limited, whereas clustering offers crisp and targeted cluster labels which may not always resemble human-generated language. Whether the balance is positive or negative has to be judged in specific cases."

Another assisting feature is Google's "show similar" function, which suggests to the user to view similar sites. The method for finding such similarity is not disclosed, but seems to rely on co-occurrence of inbound links - i.e., if some site links to A and to B, then A must be similar to B. The assumption is usually true, but sometimes generates strange results - for example, when tested on "find sites related to Google", it was found to return (among the legitimate links to other search engines) links to the PHP programming

9http://search.yippy.com/

language, the Internet Movie Database and a London hotel.

2.4.1.2 Advantages of Search Engines

Searching as a way to find information is indispensable in some cases and provides some undisputed advantages:

1. Quantity of information.

2. Quality of information.

3. High coverage.

4. Exact match.

5. Document context.

6. Simple interface.

7. Technology portability.

1. Quantity of information. The best perceived benefit of searching using a search engine is the number of results received. Typically, there will be millions of pages that the search engine claims are relevant to the user's search. Since search engines use an automated information gathering mechanism, they can afford to include amounts of data many orders of magnitude larger than those that "hand-made" web directories have.

2. Quality of information. A search engine will typically point the user to a specific document, i.e. if the user is after a specific piece of information, searching is the only option. With a well-defined combination of keywords, the user can achieve a very high quality of received results. Moreover, a search engine will usually give up-to-date results, because its information archive is regularly updated.

3. High coverage. A search engine typically covers a very large proportion of the web. Thus, if the user doesn't know where the information is likely to come from - a Pakistani newspaper or a US university - using a search engine is the best option.

4. Exact match. The results presented to users contain precisely what they searched for (with the exception of associative machines, but they resort to associations only where no exact match can be achieved). From the users' point of view, this is a plus - they get what they asked for. From the search engine's point of view, it is also a plus - users get what they asked for. In a way, the search engine "washes its hands" by being precise: if users did not receive what they wanted, the blame is put on them for not describing it properly.

5. Document context. When users search for a particular keyword or combination of keywords, the result set typically contains a snippet of the document - some text around

the keyword, so that users can get a glimpse of a relevant part of the document where the keyword is put into context, and decide whether the document is relevant to the search.

6. Simple interface. A search engine is the easiest-to-use method for searching, because it has a one-input interface - there is only one thing the user needs to do: enter a keyword (or a list of keywords). In the absence of practically usable zero-input machines [Lieberman et al., 2001], this seems to be the simplest existing option.

7. Technology portability. From the point of view of the company creating and operating the search engine, perhaps the best part is the portability of its technology. Most of the machine learning methods used in its creation are not dependent on language or culture; once the technology has been developed, the search engine can easily cover more than one search market. This is the main reason why Google is number one in almost all countries, and why searching is by far the most popular information finding method.

2.4.1.3 Disadvantages and Problems of Search Engines

The standard search engine model also has many disadvantages, which paradoxically stem mainly from its advantages: the same features and aspects that give the search engine an advantage are also weaknesses in another sense. Thus, many weaknesses are "in-built" and cannot be eliminated; they can only be mitigated by offering alternative solutions, or by patches to the basic concept that would take the search engine out of its clean model and turn it into some hybrid. Some other weaknesses stem from the limited expressiveness of the chosen query method ("keyword search") and from an implementation decision to limit the number of search results returned to the user:

1. Quantity of information.

2. Quality of information.

3. High coverage.

4. Exact match.

5. Document context.

6. Simple interface.

7. Technology portability.

8. Broken navigation logic.

9. Keyword paradigm.

10. Limited retrieval.


1. Quantity of information. The main problem of a search engine is the same as its main strength - the enormous amount of information it returns to the user. While getting many results is a good thing at first glance, the user soon realises that receiving 3 million answers to a query is of no practical use - they cannot be visited in a lifetime. Typically the user sees only the first 20-30 answers before giving up [Group, 2004]. In the absence of ways to allow users to make more specific requests, search engines can only strive to improve their result rankings so that the first results that are seen can be more relevant. However, a great part of “relevance” depends on the particular user’s person and the particular search context, which are entirely missing from the (clean) search engine model.

2. Quality of information. The enormous quantity of information gathered by the search engine leads to it also accumulating an enormous quantity of digital garbage, which is becoming an ever bigger proportion of all web content. Consequently, users get a high proportion of results that are of no use to them (though formally relevant, as they do contain the search query), which is extremely detrimental to the search engine's usability and usefulness and discredits the model as a whole. Manually-built web directories can deliver better quality in this respect.

3. High coverage. While receiving many answers from a large variety of sources is often useful, sometimes the user may actually prefer a narrow base to search in - for example, scientific journals only, or sites related to a particular country only. That is the reason why vertical (specialised) search engines exist, in some cases taking market share from the larger search engines, and why Google is not number one in every market (see the success of Yandex10 vs. Google in Russia [LiveInternet, 2010]).

4. Exact match. One of the most basic inherent problems of search engines is the “results matching the search query” paradigm they are based on. There are two underlying assumptions here:

• Users know what they are looking for.

• Users know how to find what they are looking for.

While many people do know what they are looking for and how to find it, this is by no means the majority of cases [Yoshida et al., 2007]. Firstly, many people are not looking for a particular piece of information in the first place (e.g. a specific article), but for "something interesting" to read, or for something they have only a vague idea about and want to learn in more detail. Since they do not know what exactly it is, they cannot define it by way of a list of search terms. The second assumption is also a strong one and does not always hold - even if users know what they want, they may not be able to formulate it in terms of keywords. This is a Catch-22 situation: in order to find a

10http://www.yandex.ru/

document, the user has to know some of the words that it contains. In order to know these words, the user must have already read the document (or similar ones), i.e. must have already found it by some other means. This reveals the important underlying assumption that the user has some prior knowledge, which cannot have come from the search engine. The dilemma was best formulated by the science fiction author Robert Sheckley in his 1953 story "Ask a Foolish Question", where the "Answerer" (a machine that can answer any question) turns out to be useless as nobody can ask it a valid question: "in order to ask a question you must already know most of the answer". The search engine is a modern "Answerer" which will become increasingly useless the more it learns and specialises and the less its users know; it is by definition one step behind: it can only add to already existing knowledge but cannot create the initial knowledge in the user, who has to turn to other sources of information for this (sources which are lately becoming extinct due to competition from search engines). Additionally, even when users do find satisfying answers, they can never know whether those answers are exhaustive or some other part of the answer remains undiscovered because they did not ask a further question. As Socrates's student Meno asked more than 2400 years ago: "How can you tell when you have arrived at the truth when you don't know what the truth is?"

5. Document context. While showing a snippet of the text may help the user a bit, this only masks the fact that the search engine cannot offer a concise and authoritative summary of the whole document. The user is supposed to judge the document based on some text (which is usually not even a couple of whole sentences long), but the main context of the document is missing - is this a scientific article from a high-quality journal, an article on somebody's blog, just a comment on something else and not a separate document at all, a part of an already outdated collection of articles... - the user cannot know.

6. Simple interface. The simple interface of a search engine may be a good thing in terms of usability, but it is also a limitation in a way. A search engine cannot easily grow into something larger, while a web directory can (and most have) become a web portal containing a large number of specialised channels. These channels can carry additional information for the user, or be structured differently to allow better search in their specific context, but they are not available as an option to search engines.

7. Technology portability. The fact that a working search engine doing fine in one language/market can easily be transplanted to another may be good for its owners, but it effectively kills the local search markets in those new areas where the global search engine expands. National markets are usually not big enough to support more than one or two search engines, so the local ones usually disappear, with the resulting loss of the local knowledge they had. Users are then left with generic, "world" results rather than information culturally relevant to them.

8. Broken navigation logic. As a result of a query, the search engine sends the user to a document. This document resides on a web site and is usually deep in its

navigational hierarchy. By going directly to it, the user skips all of this navigation and sees the document as a free-standing entity, which normally it is not (at least not according to what its publishers meant it to be). The web site may have contained an introduction to the matters discussed in the document; it may have contained links to other documents; it may have contained a warning that this document is outdated and a different one should be consulted instead. All this document context is lost to the user.

In effect, by bypassing normal site navigation, the search engine model is trying to destroy precisely the part of the web which makes it a web - the ordered connections between documents, and is trying to push it into a model of chaotically stacked unlinked documents where the only possible navigation is by keyword search. This not only undermines the efforts of web site creators to build logically connected sites, but works against the search engines themselves as they rely on this same structure for their relevance rankings.

9. Keyword paradigm. The problem that most concerns end-users is best explained in this quote: “A fundamental deficiency of current information retrieval methods is that the words searchers use often are not the same as those by which the information they seek has been indexed. There are actually two sides to the issue; we will call them broadly synonymy and polysemy. We use synonymy in a very general sense to describe the fact that there are many ways to refer to the same object. Users in different contexts, or with different needs, knowledge, or linguistic habits will describe the same information using different terms. [...] The prevalence of synonyms tends to decrease the recall performance of retrieval systems. By polysemy we refer to the general fact that most words have more than one distinct meaning. In different contexts or when used by different people the same term (e.g. “chip”) takes on varying referential significance. Thus the use of a term in a search query does not necessarily mean that a document containing or labeled by the same term is of interest. Polysemy is one factor underlying poor precision.” [Deerwester et al., 1990]

Furthermore, the keywords submitted to the search engine are a flat list of terms, with no importance attached to any of them. It is not clear which is the main search term, which is a supporting term (for disambiguation or making the query more specific) etc. Users have no way of indicating this to the search engine. (We have to note here that a scheme for dynamic query weighting has been proposed [Broder et al., 2003] where a “first approximation” search returns a number of documents, then query term importance is calculated from them and a second, more complete search is then performed using a proposed “weak AND” operator reflecting term weights. However, due to trade secrets, it is not clear if this has been implemented in practice by a major search provider.)

A further problem is the fact that the search engine returns documents containing the search terms. However, it may be that the user expects documents described by the search terms. For example, if we search for a “daily newspaper”, how many daily newspapers have the actual terms “daily” and “newspaper” prominent on their homepage?


10. Limited retrieval. The high data coverage achieved by search engine spiders masks an issue of the retrieval mechanism which limits the search results available to users; it is not immediately apparent, but search engines can only ever perform a limited search and (in the general case) never a comprehensive one covering all available data.

How can that be, when everyone assumes that the proper query will always point you to the relevant results? A standard search query is in practice very limited in length; the search engine assumes that the user wants a document to contain all the words in the query, which is realistic only if there are no more than five or six of them. Longer queries return no match, so the search engine invisibly substitutes a shorter query for the original one, removing or replacing some of the original terms.

An important implication here has been noted [Witten et al., 2006], but is usually overlooked by proponents of the search engine model. Every document can be described by a (not too large) finite number of search queries. Search results are ranked by a number of factors, most of them external to the document itself and unrelated to the user and his search context. On the other hand, search engines limit the number of results they show to users (Google has a limit of 1000 shown documents). With hundreds of billions of indexed documents on the web, it is then very probable that a document will rank behind this decision boundary for every conceivable query that can be associated with it - i.e., this document becomes unreachable through keyword search; for users who rely only on search engines to find information, this document is practically non-existent, even if they search for it using the correct queries. Since the decision boundary is fixed, but the web continues to grow, the proportion of these "lost" documents will only increase over time, making search an ever more hopeless task.

Additionally, there is the issue of conflicting interests in web search. Not surprisingly, information providers at different levels of the "information food chain" have differing views of the economics of the web. The debate goes as follows. On the one hand, a primary news source (i.e. the entity that first published a piece of information) has made a considerable investment in obtaining and publishing it, so it is entitled to receive all the "eyeballs" of the public and the revenues that go with them. From this point of view, web portals and search engines are parasites that have nothing of their own to offer to users, so they live off what they manage to leech from the public attention that the primary sources would normally receive [Nielsen, 2006]. On the other hand, search engines claim that they support those primary sources by driving traffic to them - by assisting the user in finding the original site, they claim, they provide the public which the site is seeking in the first place. There is a hidden assumption in this argument which not many people realise: that without search engines, people would not be visiting web sites. While this may be true for some obscure web sites, many sites receive some or most of their traffic from other sources, and to presume that all web traffic would simply disappear if there were no search engines is just not serious. This debate is far from theoretical, as highlighted by the case of

the lawsuit of AFP11 against Google, which ended in Google paying for licensing12, and the similar deals Google had to sign with Associated Press13 and other content publishers. The implication of these business relations is that not all content is in fact equal - some is searchable and some is not, or is searchable under different conditions from the rest. Even if they do not resort to lawsuits, many other entities have their business reasons to try to block search engines from indexing their content. In a significant number of other cases, content is not actively blocked from crawling, but web sites are simply not created with search engine spiders in mind, resulting in sites which can be accessed by humans but are difficult for spiders to get into - the so-called hidden web [Barbosa and Freire, 2007]. This is in a way a level of access restriction: although not intentional, it has the effect that search engines cannot access the same content that people do, so they cannot include it in their search results.

2.4.2 Information Discovery

According to some research [Yoshida et al., 2007], users looking for general background information on some topic account for 46% of queries to search engines, while those searching for a particular and more detailed piece of information account for less than that - 36.8% (the rest being taken up by searches aimed at comparison shopping and other reasons). Google's vice president of consumer products claims an even higher figure: according to her, 80% of search queries represent users "exploring topics that are unknown to them" [Gord Hotchkiss, 2007]. This type of information search is information discovery: people trying to discover new information and explore a broad topic of knowledge.

The information discovery model on the web is typically represented by static lists of links, which can be topic-specific or general (web directories) and are explored by browsing. Next to searching, browsing is the other important way in which people find information online. It is similar to a walk with a starting point from which the user follows links to other documents. The starting point (a web directory or portal) has a hierarchical tree-like structure where the exit points are links to external web sites (which may in turn have their own tree-like structures leading to the particular documents of interest). In browsing, users are passive - they cannot input any requirements and are restricted by what the site has to offer; users can only select among the existing limited options.

2.4.2.1 General Technology of Web Directories

Web sites that are built for being browsed (addressing the information discovery need) comprise a hierarchical structure of categories, each category collecting links to documents relevant to the category topic. The structure is sometimes called a tree, but that is not

11 http://www.afp.com/
12 http://www.pcworld.com/article/130467/google_and_french_wire_service_settle_lawsuit.html
13 http://www.ap.org/

usually the actual case: for all big directories, due to the ambiguous nature of some categories, it turns out to be necessary to include cross-links between categories, so the structure becomes a cyclic graph. Unlike the search engine space, where a handful of search engines have practically occupied the whole market, there are a great number of web directories. Typical examples are Yahoo! Directory14 and the Open Directory Project15, and a not so typical example is Startpagina16. Since the model is very flexible and allows all sorts of hybrids and specialisations, web directories do not usually compete with each other - whereas users typically use just one search engine of choice and have no reason to use another, they can at the same time use a number of directories, each one for a different purpose.

There are other significant differences between web directories and search engines. The most important is that information gathering is a very costly process for a web directory. The usual model assumes that site owners submit their sites to the directory, and an editor then approves or rejects the entry. If approved, it is edited and put in a category or categories selected by the editor. For some reason, machine learning has never been applied to a large web directory, so the predominant model is usually associated with manually collected and processed data.

User interaction with web directories is fairly simple - the user can click on links to traverse the graph, or can search just as in a search engine. Search results usually come in the form of recommended categories (something that search engines lack), as well as individual documents.
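Such a category structure is straightforward to represent as a graph with both child links and cross-links; the sketch below (illustrative Python, with an assumed handful of categories and sites) shows why any traversal has to guard against the cycles that cross-links introduce:

# Each category has child categories, cross-links to related categories, and listed sites.
directory = {
    "Top":      {"children": ["Business", "Sport"], "cross": [],        "sites": []},
    "Business": {"children": ["Shops"],             "cross": [],        "sites": []},
    "Shops":    {"children": [],                    "cross": ["Sport"], "sites": ["sportinggoods.example"]},
    "Sport":    {"children": [],                    "cross": ["Shops"], "sites": ["sportsnews.example"]},
}

def collect_sites(category, visited=None):
    """Gather all sites reachable from `category`; the `visited` set prevents
    looping forever through the cycles created by cross-links."""
    if visited is None:
        visited = set()
    if category in visited:
        return set()
    visited.add(category)
    node = directory[category]
    sites = set(node["sites"])
    for nxt in node["children"] + node["cross"]:
        sites |= collect_sites(nxt, visited)
    return sites

print(collect_sites("Top"))   # both sites, despite the Shops <-> Sport cycle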

2.4.2.2 Advantages of Web Directories

In this information-finding model, the presumption is that the portal or web directory has less content in terms of quantity, but of higher quality. The browsing process is different in its basic principles from searching and has its unique advantages which make it preferable in some cases or for some tasks:

1. Recommendations to the user.

2. Quality of information.

3. Quantity of information.

4. Document context.

5. Complex interface.

6. Preserved navigation logic.

14 http://dir.yahoo.com/
15 http://dmoz.org/
16 http://www.startpagina.nl/


1. Recommendations to the user. The main advantage of a web directory or portal is that it sends users to sites that they never thought existed, and would never have searched for. It is enough for the user to select (rather vaguely) an area of interest, for example "music", and the portal provides information about what sites exist that may be useful and are recommended. The recommendation is not necessarily explicit - it may be implicit in the way links are ordered, or in the very fact that they are listed at all. In any case, the links have been put there by a human, which is supposedly much better than a ranking algorithm.

2. Quality of information. The information in a typical web directory or portal has been published by an editor. This means that a human has checked the site, written a description for it and approved it before publication. This eliminates many of the shortcomings of an automatic search engine - scams and spam have no chance of being published, duplicated content is taken care of, etc.

3. Quantity of information. Web directories contain considerably less content than search engines, which in many cases is an advantage. They act as filters, where the editors (through editorial policies and guidelines) have pre-selected what to include and what to omit, which gives the user another level of filtering not available with a search engine. This is the main reason why "verticals" (topically specialised directories) exist, although all of their information (and much more) can be found by a search engine as well - their advantage is not in having this information, but in not having the rest.

4. Document context. Before pointing the user to an external link, the portal usually has some summary of why it should be visited, which is not just a snippet mechanically cut from the text of the resource, but a piece of text written by a human, summarising the resource in a readable and understandable way and meant to be perceived in the context of the portal. In other words, the user is much better informed before clicking on the link. This has been expanded in some cases (for example About17), where the editors are experts in the respective fields and do not just link to sites, but recommend the best ones, comment on them and give a background of the field of knowledge - i.e. the directory is becoming an encyclopaedia. Wikipedia18 is an example of an information site where the balance between encyclopaedia and directory leans more towards the "encyclopaedia" side, but it is also a major web directory through the outgoing links on its pages.

Additionally, users are usually sent not to specific documents, but to a higher level in the site’s hierarchy: a page preceding those documents in the site’s navigation, which means users can also see some of the site’s own context before proceeding. In this way users access a document in the way it was meant to be accessed by its publisher. Another level of context is that the directory offers a list of similar or alternative sites, so that users know not only what is available at the particular site, but what is also available elsewhere.

17 http://www.about.com/
18 http://www.wikipedia.org/


5. Complex interface. While it is also a shortcoming in another respect, the complex interface of a portal or web directory is a major advantage. It allows site owners to improve its usability by employing different designs and structures for different parts of the directory and to include additional information depending on context.

6. Preserved navigation logic. From the users' point of view, a directory or portal is very logical. They browse "top down" in a tree-like structure, starting from a general interest and going down to more specific topics. Clicking on an external link sends them to a site with a similar structure where they continue to go down the tree. Although it is a different tree, the transition is almost seamless, unlike that from a search engine.

2.4.2.3 Disadvantages and Problems of Web Directories

Most of the shortcomings of the browsing paradigm for information discovery are not in fact shortcomings of the model itself, but of the technology used to generate and order content which has negative impact on the quality of service. However, these are visible to the user when browsing a web directory, and are perceived as problems of the model itself.

1. Quality of information.

2. Quantity of information.

3. Complex interface.

4. Technology is not portable.

5. Information gathering mechanism is limited.

6. Categorisation and cognitive bias.

7. Implementation decisions.

1. Quality of information. The main problem of web directories is the limited range of information types they make available. The user says, for example, "I was searching for holidays in Turkey and the search engine gave me 5 000 offers while the directory gave me 50 travel agencies". In the latter case, although the user may not be able to formulate this as a problem, a whole class of results is missing - specific offers as opposed to generic entry pages. This omission is due to web directories not implementing automated crawlers, which forces them to avoid regularly changing or short-lived documents. Directories usually link to sites' home pages only (or at most some sub-sites), which are relatively stable and do not need their descriptions to be updated too often. In practice this means that if a very large site exists, the directory will extend just one link to it - to its home page, or at most half a dozen links to its major sections. For a travel agency, there may be a link to its home page, to "cruises", "skiing" and "airplane tickets" - only the respective entry pages, no

links to the specific offers. In other words, the user is pointed to a gateway to information, and not the information itself. By default, whole types of documents cannot be found by browsing directories - this is where browsing loses out to searching.

2. Quantity of information. Another major problem of the web directory model is the cost of information. Since it is mostly manually gathered and edited, its price of acquisition is orders of magnitude higher than that of search engines, hence directories contain orders of magnitude less content. In some cases this is a problem of perception only - the user sees it as a shortcoming when comparing a search engine against a portal (“the same query got me 3 million results here and only 300 there”), but if the user is not going to see more than 30 of these anyway, the practical side of the comparison is rather meaningless. However, in many specific cases a web directory will not be able to return any results at all and this is already a serious disadvantage as compared to a search engine.

3. Complex interface. Even if users know precisely what they are looking for and how to find it, they have to traverse the whole directory tree from the top down to the node they need - which is many more clicks than the couple of clicks needed for searching. Since users normally tolerate only 3 to 4 layers of navigation [Group, 2004], this turns out to be a major problem for larger directory structures (the Open Directory, for example, has over 700 000 categories). This problem cannot be mitigated: too many nodes mean either a deep structure (with too many clicks to the leaves) or a broad structure (with too many branches at every level, making it difficult for users to select the correct one) - in either case usability suffers. Reducing the number of nodes would mean having too many documents in each, which after some point would make the structure unusable (imagine sending the user to a node listing 100 000 documents). If we somehow manage to solve the "Quantity of information" problem, this is the next issue which has to be tackled.

4. Technology is not portable. Technologically speaking, a web directory or portal is not complicated - the underlying infrastructure is very simple, as opposed to that of search engines. The complicated part is the know-how of information gathering, which is extremely dependent on the market and profile of the portal (the so-called domain expertise). This is not transferable to other markets - if you know how to make a great portal for Canada, you still don't know how to make a good portal for Hungary. Portal owners who want to penetrate new markets have to start from scratch - they can only transfer the software to the new market, but that is the cheapest part and does not help much (an example is the fate of Yahoo! in Scandinavia or China). This problem, although seemingly not too important, is probably the main reason for the global demise of web directories: on a small scale (a single, specialised "vertical" directory) it does not matter, but because of it directories have to remain "small scale"; they cannot compete with the much larger search engines, so they lose revenue, which in turn means they have even less to invest in the development of their product, and they eventually disappear.


5. Information gathering mechanism is limited. The major drawbacks of the information-gathering model explained above are apparent. Firstly, if site owners do not submit their sites, the web directory will not know about them and will not list them. With the multitude of existing directories, owners cannot submit their sites to all of them (and there is no reason for them to even try, as incoming traffic generated by web directories is negligible compared to that from search engines), so in most cases web directories simply lack the necessary information. Secondly, a site is submitted just as that - a site. There is one entry for it, which is usually the homepage. Separate sections or documents within the site are not usually submitted or indexed. Also, there are many more sites than editors, leading to the legendary delays of inclusion in the leading directories, such as Yahoo! Directory or the Open Directory Project (DMOZ). Last but not least, once a site is submitted and accepted, there is no way for it to be re-indexed with a new description, so directories tend to have outdated information. In some cases, sites disappear but the directory has no mechanism to find out, so it keeps the links. This has spawned a whole class of site operators known as domain hunters, who buy expired domains and put generic advertising on them, relying on traffic from such outdated links (search engine links to the site will usually die at the next visit of the search engine's crawler, but links from directories may remain active for years). The remaining links to this digital garbage are a form of web decay which has a significant impact on directories such as Yahoo!: "...pages [...] from the Yahoo! taxonomy have almost no dead links, but have relatively high decay, roughly the median value observable on the Web. This seems to indicate that Yahoo! has a filter that drops dead links immediately, but on the other hand the editors that maintain Yahoo! do not have the resources to check very often whether a page once listed continues to be as good as it was." [Bar-Yossef et al., 2004]

6. Categorisation and cognitive bias. Another serious issue of web directories is related to content categorisation. Some sites may belong to multiple categories - a shop selling sporting goods in Brisbane is fully or partially relevant to Sport, Brisbane and Commerce (and e-commerce if it also sells online). Its site can be indexed in all four categories, making some of them noisy as it is not a full match to them, or it can be listed only in one category, thus depriving the others of content which (at least partially) belongs there. Both approaches make it more difficult to automate the classification process using web directory classification data, as they introduce significant classification noise in it.

Moreover, the category structure itself (which is created when the directory is initially conceived and is very hard to change later) implies an ordering of things, and a way of thinking about things, which the user has to share in order to be able to browse logically (i.e. the user has to have the same way of thinking as the creators of the structure). An example: is the above shop to be found by browsing through levels as:

Business → Commerce → Shops → Australia → Brisbane → Sporting Goods, or
Business → Commerce → Shops → Sporting Goods → Australia → Brisbane, or


Sport → Sporting Goods → Shops etc., or even
World → Australia → ...etc... → Milton → Shops?

Apparently, each ordering is true from a particular point of view and there will be people following each particular line of thinking, based on their perception of the world and the way it is ordered (their cognitive bias). However, the directory cannot employ all of these at the same time, so it ends up working for some users and confusing others. As partial mitigation, most directories add cross-links between categories, which convert their directory tree into a cyclic graph. Another aspect of subjective cognitive bias is that the user may in fact have a wrong opinion about some item: for example, somebody might think that a dolphin is a type of fish and not a mammal, so they would look for dolphins under the "Fish" category and never find them. Apparently, a "static" web directory has no way to mitigate this: if it puts a link from "Fish" to "Dolphins" it will imply they are related and thus become wrong itself. The problem cannot be blamed on ignorance only: it was shown experimentally [Kang et al., 2007] that users tend to classify the same documents differently even after they are given specific instructions on how to classify them - and these are people of similar cultural background classifying controlled documents from a very constrained domain familiar to them. With the extreme diversity of the web and its users, we can hardly expect even one user to fully agree with another on the classification of any given set of documents.

7. Implementation decisions. Additionally, there are some issues with web directories which are generated not by the model itself but by some peculiarities of traditional implementation. The submission and editing process, for example, means that the editor (or the site owner) creates a description of the site which is usually standard (indeed, it is subject to strict guidelines), so web sites tend to end up with similar descriptions, making it more difficult to search through them later as they cannot be easily distinguished. There is nothing to prevent web directories from adding a web spider and indexing the actual full content of every site, but this is not usually done (with the exception of Google's implementation of the ODP, the Google Directory19 - now defunct20). Furthermore, by some tradition all web directories seem to follow, although it is not necessitated by any logical or technological limitation, a document is listed in only one node of the directory. If the online shop from the above example is listed in the "Sporting Goods in Brisbane, Australia" subcategory, it is not listed in the broader "Sporting Goods in Australia", "Sporting Goods" and "Online Commerce" categories, nor is it listed in deeper, more specific categories (for example in "Cross-trainers", even though it may sell cross-trainers). All categories thus remain too sparse, and documents are not listed in categories where they belong, only because they also belong to a more specific subcategory.

19 http://directory.google.com/
20 The site now only displays the message: "Google Directory is no longer available. We believe that Web Search is the fastest way to find the information you need on the web." We beg to differ.


As a result, Yahoo! Directory and Open Directory Project have on average only 5 to 6 documents per node, which is extremely detrimental to the user browsing experience.

2.4.2.4 Other Forms of Information Discovery

It is often suggested that knowledge from social networks can greatly assist the process of information discovery (as well as information locating). In our opinion, however (and especially in the case of information discovery), this is a form of recommendation, since links are usually recommended to the user unsolicited (e.g. "see what your friend just saw"), which cannot be used for actively finding information in a process, and within a context, defined by the user. While this is also a form of discovery, we consider it outside the scope of this work, as we are interested in helping the user actively find something on the web, not in randomly re-discovering things somebody else has found before (more discussion follows below, where we discuss search result re-ranking and recommendation engines).

2.5 Personalisation

As already discussed, neither the classic search engine model nor the web directory model takes the person of the user into account. To do that, they need to implement some personalised approaches and account for the particular user's needs when finding information for them. The issue of personalised information retrieval can be summarised as: "How to effectively locate information relevant to specific users' needs?" [Pasi, 2010]. Obviously, we have to first find out what the users' needs are, then define "relevant" in order to apply this to the existing information, and lastly locate what is relevant. Unfortunately, this is a typical ill-posed problem21, in that a) we do not know if a solution exists (and cannot even define what the solution is; furthermore, it being personal means that different people will define it differently, by definition), b) the solution, if it exists, may not be unique, and c) the solution does not depend continuously on the data, in some reasonable topology. Nevertheless, much work is being done in the field of personalisation, since progress towards even a partial solution to an ill-posed problem has valuable commercial impact on search engines and web directories and is an important part of their product development.

2.5.1 Aspects of Personalisation

There are two aspects of the personalisation issue: to define the user model, and to exploit the user model in order to enhance information search [Pasi, 2010].

• User context, described by the user model, defines what we know about the user and his current context (of the current activity related to a particular search). We

21 http://en.wikipedia.org/wiki/Ill-posed_problem


need to create a user profile to model the user, and we also need to have a method of learning the current search context.

• Personalised information search provides information relevant to the particular user, which has several aspects. Personalising search results is a way of exploiting our knowledge about the user by applying some preferences over search results, by way of filtering and/or re-ordering them. Document re-visitation is a part of this, which deals with cases where users have already found and visited a resource, but that was some time ago and they now wish to find it again. Recommendation expands the user’s search (whether explicitly expressed or just implied by visiting a particular page) by suggestion of related items.

The basic problems that need to be solved to allow personalisation are:

• Interpreting the content of information items: learning about the information that is available to us and we can present to the user.

• Interpreting the user’s needs (general and for the particular search case): learning about the user in general (his “cognitive context” and preferences), and his current use case (his “search context”).

• Estimating relevance of information items to the user’s needs: applying our knowl- edge of both data and user to achieve a personalised solution (additionally, we need to decide what we mean by “personalised solution”).

2.5.2 Personalisation Solutions

Information retrieval systems aim to locate information relevant to users' needs (typically expressed by a search query) within huge document repositories. Personalisation on the side of the information retrieval system (typically a search engine) is considered "pull" technology; alternatively, personalisation can happen on the user side (e.g. information filtering), which is considered "push" technology.

In the basic system-centred approach, the same query issued by distinct users produces the same results. Search context (the user's goal for the search) and cognitive context (the user's expertise and experience) are ignored. Systems rely strongly on queries; however, while an information item may be relevant to a query, it could still be of poor "quality" for the user for a number of reasons. Quality is not simply a matter of topical relevance; it also depends on other factors: the user's knowledge of the domain, the source of the document, etc. [Pasi, 2010]

As already discussed though, there are some inherent limitations of system-centred information retrieval systems, the major ones being:


• Ambiguity of search queries, and lack of expressiveness of queries.

• The search goal is not considered.

• Unawareness of context: user context (cognitive, geographic, social), document con- text (temporal, geographic, type), search context etc.

A context-aware solution should try to model the relevant context and apply some preferences to its retrieval methods; for example, it may find out which criteria are the most important for a specific search (their importance may vary across different queries) and weigh them accordingly while ordering search results. Consequently, the context-centred approach (as opposed to the system-centred approach) consists of two distinct but complementary problems: modelling the context, and using it to enhance search quality by producing a search outcome fitting the considered search context. However, even defining "context" (let alone actually finding it) is not a trivial task. "...there is no term that is more often used, less often defined and when defined defined so variously as context. ...the term context has become almost a ritualistic invocation." [Dervin, 1997]

2.5.2.1 Building the User Profile

Personalised information retrieval should produce a search outcome fitting the user context (the user's persistent characteristics and preferences), which is represented by the user model. The first major issue in constructing this model is how to collect feedback and other data expressing user background and preferences, which form the basis of the user model.

In the case of explicit feedback, users perform explicit informative actions such as filling in a survey or indicating relevant information items. This is relatively easy to implement, but suffers from some drawbacks: it is intrusive, it requires the user to invest some effort, the submitted data may be deliberately or accidentally inaccurate, and the information becomes outdated after some time. A major problem of explicit feedback is that users do not take the time to provide it, because they see no immediate effect of their effort and think it is not worth the trouble (it lacks the "information scent" component of "information foraging" [Pirolli and Card, 1995]).

Implicit feedback occurs when some software collects information about the user's activity without interrupting it or requiring any involvement on the part of the user. It is thus not a burden for the user, which is usually considered a major advantage, so a large number of methods have been developed. The drawbacks of this approach are that it is indirect (it is based on a correlation between some factors which may not in fact be as strong as assumed) and noisy (some user actions may not be the result of an assumed preference). Some of the ways to collect implicit feedback are: using search history (query logs), visited results (click patterns), browsing history (server access log analysis, browser

34 2.5. PERSONALISATION add-ons such as Google Toolbar sending data to search engines), behavioural metrics (user behaviour after the user clicks on a link in search results), data collected by use of sensors (mainly in laboratory settings) etc. Specific methods for building the user model are discussed in the “Related Work” chapter.

2.5.2.2 Personalised Information Retrieval Approaches

After a personalised information retrieval system has built a user profile, it has to exploit it in some way to provide a personalised user experience. This can come in one or more of several forms:

1. Personalised search entails the search engine providing personalised search results accounting for user background and preferences.

2. Result re-ranking is based on a set of generic search results, which are then re-ordered to reflect the known user background or preferences, placing “preferred” results on top.

3. Query modification (typically query expansion) is applied to achieve disambiguation or to make the query more specific in the (perceived) search context.

4. Personalised presentation of search results shows the user generic search results but re-arranges or re-designs them in a way intended to make them more understandable.

1. Personalised search occurs on the search engine side because this type of personalisation can only happen there: the search engine is where all the data about the web is, so it is the only point where data can be accessed in a truly personalised way. Search engines have two sets of problems though: the cost of this personalisation, and the lack of a realistic user profile. Ranking algorithms such as PageRank are highly iterative and computationally intensive, so it is out of the question to calculate a ranking score for every indexed document for every user (and all his specific search cases). An easy (and not very good) solution is to run a generic query, retrieve a number of documents from the database and then run personalised ranking over this much smaller subset. Google for example acquired Kaltix22, a company which had developed “Adaptive PageRank” [Kamvar et al., 2003] (an approach allowing much faster PageRank computation), and integrated that [Gord Hotchkiss, 2007] into their personalised search. Since PageRank “is very sensitive to the seed pages”, this allows Google to select some seed pages (documents with a non-zero PageRank value at the first iteration of the algorithm, which they transfer to other documents they link to at subsequent iterations) such that they are

22http://www.google.com/press/pressrel/kaltix.html

relevant to the particular user (documents he has visited before) and calculate a personal PageRank score for documents on the fly. However, even with effective algorithms this is still too computationally expensive. Furthermore, these starting points for every user are too few; even though Google was using a log of 6 billion user actions [Chang et al., 2006] in 2006, and conceivably - much more now, these actions are not easily translated into credible persistent user profiles. Even this not very effective user profiling is becoming increasingly illegal in many jurisdictions: the European Union is introducing strict data retention laws limiting the time companies can keep user behaviour data; Facebook23’s “Like” button has already been declared against privacy laws in at least one jurisdiction, where the authorities have ordered web sites to remove the script24. Consequently, Google says it only personalises 20% of searches25 (not to be confused with geotargeting and other non-personal “personalisation” which is indeed done for every search - more about it below).

A rank modification method which is more realistic in terms of computation cost, although not as granular, has been proposed: to pre-compute a number of topic-specific rankings for all documents in the collection, then at query time find out what the search context is and apply the respective set of ranking values [Haveliwala, 2002] or a combination of them based on a “personal PageRank vector” [Glen Jeh and Jennifer Widom, 2002].
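The following is a minimal sketch of the intuition behind such biased rankings: an ordinary PageRank power iteration, except that the teleportation step jumps only to a set of “seed” pages chosen for the topic or the user. The tiny graph, damping factor and seed choices are illustrative assumptions and do not reproduce the actual algorithms of the works cited above.

    # Sketch of biased ("personalised") PageRank: the random surfer teleports
    # only to seed pages instead of to any page uniformly.
    # The graph, damping factor and iteration count are arbitrary assumptions.

    def biased_pagerank(links, seeds, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        teleport = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
        for _ in range(iterations):
            new_rank = {}
            for p in pages:
                incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
                new_rank[p] = (1 - damping) * teleport[p] + damping * incoming
            rank = new_rank
        return rank

    links = {  # page -> pages it links to
        "home": ["docs", "blog"],
        "docs": ["home"],
        "blog": ["docs", "news"],
        "news": ["home"],
    }

    print(biased_pagerank(links, seeds=set(links)))   # uniform seeds: generic ranking
    print(biased_pagerank(links, seeds={"docs"}))     # user-specific seeds: biased ranking

Pre-computing one such biased vector per topic and combining them at query time is, in essence, what the topic-sensitive approach amounts to.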

Another type of personalisation is “geotargeting”; it can be pre-computed in an easier way since it is not too granular: it is only done once per country or other geographic zone, and is then applied to all users residing in this country or zone (see Table 2.1 for comparison of search results with the same query but different geographic origin).

                     Australia   Canada   India   South Africa   UK   USA
    World                    3       22       6              8   25    19
    Australia               94        3       0              4    8     0
    Canada                   0       40       1              1    1     1
    India                    0        4      81              2   10     2
    South Africa             0        0       0             64    3     0
    UK                       0        4       1              1   21     1
    USA                      0       12       2              8   12    68
    other countries          0        5       2              5    9     2
    spam                     3        8       6              5    9     6
    errors                   0        2       1              2    2     1

Table 2.1: Geotargeted Google search results: number of country-specific results if query originates from Australia, Canada, India, South Africa, the UK or the USA (top row) and is submitted through the respective Google domain (tested in March 2011 with query “earthmoving equipment”). “World” results are not country specific. None of the results (from 5 out of 6 locations) were sites of actual earthmoving equipment manufacturers.

23http://www.facebook.com/ 24https://www.datenschutzzentrum.de/presse/20110819-facebook-en.htm 25http://www.theregister.co.uk/2010/03/03/google personalized search explained/


However, this type of not-personal personalisation relies on the assumption that all users in a country have the same preferences (and for all searches), which is clearly not well-founded. What is more, in many cases it is counter-productive: if the user wants to find general information on, say, earthmoving equipment, the system is actually preventing him from finding this information by presenting localised results only (creating a “filter bubble” based on location - more discussion on the phenomenon below). It may be true that users in many (but not all!) countries prefer local results [Nielsen, 2011], but this preference should clearly not be followed if the information itself is not localised (these machines are the same worldwide) and a global answer is needed. In the above example, a typical user from Australia, using Google in its default form and searching for earthmoving equipment, would see 94 Australian web sites and only 3 “world” (international) ones, the remaining 3 being spam. These 94 local sites belong to equipment dealers and not manufacturers, as no earthmoving equipment is manufactured in Australia. The searcher would then not learn that a company like Komatsu26 exists even though it is the top worldwide manufacturer of such equipment (it is not one of the 3 “world” results; as an aside, it is ranked 15th in the USA results, but at least it is listed there. Caterpillar27, however, is not).

One consequence of this type of personalisation is its inconsistency (different people receive different answers to the same question), which hampers exchange of knowledge between people - users cannot rely on their counterparts in other countries knowing the same things they do, even if they ask the same questions (for example - I cannot tell my mother, who lives in another country, to learn what I learned on a topic by performing a particular search on Google and following the results - she would get completely different results and learn something else entirely). An aspect far more detrimental to user experience is that users (wrongly) assume they are getting comprehensive answers from the search engine, while in fact this personalisation happens “behind the scenes” and relevant information is being hidden from them “for their own good” (as decided by the search engine). Users are not informed of this filtering and have no option to remove it. The worst part is that people are not aware of it: their knowledge is being limited but, since they are not aware of this limitation, they have no motivation to search further or by some other means. This is a perfect example of personalisation gone horribly wrong.

2. Search result re-ranking personalises search by taking a set of search results and re-ordering them according to some personal criteria. This can happen both on the user side [Lieberman, 1995; Sieg et al., 2007] and on the search engine side; in fact, as already discussed, search engine personalisation is based precisely on that: the search engine does not really have a personal score for the particular user for every document in its index; it does a generic search, retrieves many more documents than the number displayed to the user, then applies personalisation by re-ranking this subset only and shows the top results.
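A minimal sketch of this re-ranking step, with an invented user profile and invented result snippets: a larger generic result set is scored against the profile’s term weights and only the re-ordered top of that subset is shown.

    # Sketch: re-rank a generic result subset against a user profile term vector.
    # Profile weights, snippets and the cut-off are illustrative assumptions.

    def profile_score(snippet, profile):
        return sum(profile.get(word, 0.0) for word in snippet.lower().split())

    def rerank(generic_results, profile, show=2):
        ordered = sorted(generic_results,
                         key=lambda r: profile_score(r["snippet"], profile),
                         reverse=True)
        return ordered[:show]

    profile = {"programming": 2.0, "api": 1.5, "tutorial": 1.0}  # a "programmer" profile

    generic_results = [  # e.g. the top results for "java"; only the top 2 will be shown
        {"url": "java-island.example", "snippet": "Java island travel guide"},
        {"url": "java-docs.example", "snippet": "Java programming API tutorial"},
        {"url": "java-coffee.example", "snippet": "Java coffee beans shop"},
    ]

    for result in rerank(generic_results, profile):
        print(result["url"])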

26http://www.komatsu.com/ 27http://www.cat.com/


A relatively new method of personalising search results relies on using information from social networks. The assumption is that users will prefer search results that have been previously preferred (or at least visited) by their social “peers” - e.g., friends in a social network, so that such sites get ranked higher. This is one of the self-proclaimed goals of the Google+28 service.

3. Query modification is another method of personalising search. It usually occurs on the user side and aims to address synonymy and polysemy [Deerwester et al., 1990]: the fact that the same concept can be expressed in different ways by different people, and that the same term may mean different things to people with different cognitive backgrounds, or in a different context. This is usually one aspect of “web assistants” or personal agents (the other one being information filtering): they build a cognitive user model based on documents the user has accessed, intercept outgoing search queries and modify them to assist term disambiguation [Chirita et al., 2007]. For example, if the user is a programmer and often reads programming documentation, a query for “java” will be expanded to “java programming language” to tell the search engine that the user is not asking for the island of Java or the brand of coffee.
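A deliberately naive sketch of this kind of expansion is given below; the category weights, expansion terms and the list of “ambiguous” query terms are invented for illustration and do not reproduce the method of [Chirita et al., 2007].

    # Sketch: expand an ambiguous query with terms from the user's dominant
    # profile category. All categories, weights and term lists are assumptions.

    profile = {  # category -> (interest weight, expansion terms)
        "programming": (0.8, ["programming", "language"]),
        "travel": (0.1, ["island", "indonesia"]),
    }

    AMBIGUOUS = {"java", "python", "ruby"}  # terms assumed to need disambiguation

    def expand(query):
        terms = query.lower().split()
        if not AMBIGUOUS.intersection(terms):
            return query                      # nothing to disambiguate
        top_category = max(profile, key=lambda c: profile[c][0])
        extra = [t for t in profile[top_category][1] if t not in terms]
        return " ".join(terms + extra)

    print(expand("java"))           # -> "java programming language"
    print(expand("cheap flights"))  # -> unchanged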

This leaves open the question of current search context: if the programmer is planning a holiday on Java, he may get stuck reading programming references whether he wants them or not. A Google patent [Google, 2004] tries to solve this by deriving context from other documents the user has opened at the time the query is issued (e.g. if I have a travel brochure open in another application window, my web search will be disambiguated in a travel context); however, this is based on the assumption that activities are related and relies on the system knowing what the user is currently doing in other “user interface areas” (e.g. another application such as a word processor, or another browser tab) which is currently not technically possible29 and would probably be illegal if it was.

An interesting proposal for using query modification is to convert searching to browsing by populating a search query with keywords (automatically extracted from the user profile), which is then presented to the user as a recommended link for browsing [Yoshida et al., 2007].

4. Personalised presentation of search results does not try to influence search results themselves. Instead, this approach modifies the visual presentation of search results in order to make them more understandable or easier to use by the user. This can happen either on the search engine side (such as Google showing a star next to a search result that the user has previously “starred” while searching, or clustering results into concepts [Vivisimo Inc., 2006]), or on the user side - such as a personal agent highlighting results it has judged relevant, or grouping them according to some personal classification [Ma et al., 2007].

28http://plus.google.com/ 29It could be possible in a “cloud” operating system where a remote server manages all the user’s activity - such as Chrome OS which is being developed, perhaps not coincidentally, by Google.

Site re-visitation is a special case of personalised information retrieval which is not related to the aspects of personalisation discussed above. It concerns the issue of users re-visiting sites that they have seen before. This form of information discovery is different from the standard problem in that users search for specific documents they have already seen, but usually remember only vaguely (i.e. - not precisely enough to be able to describe them in terms of full title or correct keywords to a search engine). Solutions to this are usually in the form of a list of visited sites, ordered alphabetically or chronologically, which can also be searchable.

2.5.2.3 Personalised Information Retrieval Solutions

Usually, approaches to personalisation are integrated: a system tries to model the user, find out the current search context and apply this knowledge to available data in order to achieve some form of personalisation of the information retrieval process.

Personalised information retrieval solutions on the user side are typically represented by a “personal agent” - an intelligent software agent capable of learning from data and adapting to user requirements or preferences, which resides on the user’s machine and is controlled by the user. Some of these agents were proposed [Lieberman, 1995; Chen and Sycara, 1998] and implemented as proof-of-concept even before the rise of modern search engines.

A good example of the architecture of such an agent is WebMate [Chen and Sycara, 1998]. It collects implicit feedback by monitoring the user’s web traffic through a local web proxy and also allows the user to supply explicit feedback (however, only positive examples - “I like this”, but no negative examples), then tries to find the different categories of texts the user may be interested in (fishing, football etc.). These are expressed as separate TF-IDF-weighted term vectors built from word occurrences in documents the user likes. This is then used to filter incoming search results and recommend to the user only those that will be of interest (presumably - those that are similar to articles with positive feedback). The agent also performs query expansion to assist the user in searching; queries are expanded with terms from “trigger pairs”, or words with high co-occurrence in documents of the targeted type.
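A compressed sketch of this style of profile is given below: liked documents are grouped into categories, each category is represented by a TF-IDF weighted term vector, and an incoming result is recommended only if its cosine similarity to some category vector exceeds a threshold. The toy documents, tokenisation and threshold are simplifications and are not WebMate’s actual parameters.

    # Sketch of a WebMate-like profile: one TF-IDF term vector per interest
    # category, used to filter incoming results by cosine similarity.
    # Example documents, tokenisation and the 0.1 threshold are assumptions.
    import math
    from collections import Counter

    def tfidf_vectors(categories):
        """categories: {name: [liked document text, ...]} -> {name: {term: weight}}"""
        docs = [d for texts in categories.values() for d in texts]
        df = Counter(term for doc in docs for term in set(doc.lower().split()))
        n = len(docs)
        vectors = {}
        for name, texts in categories.items():
            tf = Counter(term for doc in texts for term in doc.lower().split())
            vectors[name] = {term: count * math.log(n / df[term])
                             for term, count in tf.items()}
        return vectors

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    liked = {
        "fishing":  ["trout fishing rods and river bait", "fly fishing bait and casting tips"],
        "football": ["premier league football results", "football transfer news and results"],
    }
    profiles = tfidf_vectors(liked)

    incoming = "best trout bait for river fishing"
    incoming_vector = dict(Counter(incoming.lower().split()))
    if max(cosine(incoming_vector, p) for p in profiles.values()) > 0.1:
        print("recommend to the user:", incoming)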

Some modern agents implement a “form of client-side personalisation based on an interest-to-taxonomy mapping framework and result categorization” [Ma et al., 2007]: user interests are mapped to categories in a web directory, such as the Open Directory Project or Yahoo! Directory. Search results (coming from a general search engine such as Google) are passed through text classifiers which categorise them according to the user’s interests and display them accordingly [Ma et al., 2007] (personalised presentation).

Another example of filtered web search [Matthijs and Radlinski, 2011] (acting as a personal agent, although not explicitly named as such) personalises search results by building a user profile (based again on recorded browsing history and click patterns) and then re-ordering the (standard) search results received from the search engine. The system builds one profile per user (there is no granularity - one user is supposed to generate only one search context), and TF-IDF term weighting is done using the Google N-Gram corpus30 to obtain the IDF values for words. It is interesting to note that the experimental study used “interleaved results” to compare click patterns for the various tested algorithms (users saw a synthetic search result page where original results from the search engine were interleaved with suggested results from the agent to minimise the effect of random user clicks). As an aside, the study found that 24% of keyword searches were “repeated searches”, i.e. users kept searching for the same thing over the course of several months, which is a clear indication that people were trying to use the search engine as a “re-visitation” strategy: trying to find again things they had already accessed before.
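Interleaving can be illustrated with a team-draft style merge of two rankings, where each displayed result is remembered as belonging to the ranker that contributed it and clicks are credited accordingly; this generic sketch is not claimed to be the exact variant used in that study.

    # Sketch of team-draft style interleaving: merge two rankings, remember which
    # ranker contributed each shown result, and credit clicks to that ranker.
    # The example rankings and clicks are invented.
    import random

    def team_draft(ranking_a, ranking_b, length=6):
        shown, team = [], {}
        a, b = list(ranking_a), list(ranking_b)
        while len(shown) < length and (a or b):
            for name, remaining in random.sample([("A", a), ("B", b)], 2):
                while remaining and remaining[0] in shown:
                    remaining.pop(0)                  # skip results already placed
                if remaining and len(shown) < length:
                    doc = remaining.pop(0)
                    shown.append(doc)
                    team[doc] = name
        return shown, team

    baseline     = ["d1", "d2", "d3", "d4"]
    personalised = ["d3", "d5", "d1", "d6"]
    shown, team = team_draft(baseline, personalised)

    clicks = ["d3", "d5"]                             # hypothetical user clicks
    credit = {"A": 0, "B": 0}
    for doc in clicks:
        credit[team[doc]] += 1
    print(shown, credit)                              # more credit -> preferred ranker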

From a broader point of view, all the above examples are only interfaces to a search engine which facilitate the user’s interaction with it but do not in any way change the underlying model of information finding. Furthermore, since the search engine is not expecting a request by a software agent but by a human, it does not allow any nuances in its input; for example, if a personal agent disambiguates a query term by adding a second term to it (query for “java” becomes query for “java coffee”), the search engine does not differentiate between “java” as the main search term and “coffee” as a secondary term: if it has no results for “java”, it will happily supply results for “coffee” only - which is (expressly) not what the user wanted.

Personalised information retrieval solutions on the server side are again represented by personal agents (running on the server but still “personal” in the sense that they use an individual user’s data), a category which also includes personal recommender engines, as well as group personalisation and recommendation (see Figure 2.1 for a typical architecture diagram).

A typical proposed agent solution [Liu et al., 2004] profiles users based on their search history. Again, categories in the user profile are represented as weighted term vectors, in which a high weight of a term indicates that the term is of high significance in a category for the user, and a low weight of the same term in another category indicates that the term is not important in that category. Search queries, as well as user interests, are mapped to a set of pre-built categories (using the Open Directory structure) and represent the user’s search intention to serve as context to disambiguate the terms in the query. The way the system handles queries is: the agent receives the query, submits a general query (with no limitations) to Google Directory, then up to three category-specific queries (the categories are selected based on previous searches of the same user with similar queries; the user

30http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html


Figure 2.1: Context based information supply [Pasi, 2010]. Note the implied grouping of user profiling, context knowledge and web data on the remote side. (The figure shows the user profile and context, the activity context and the available information all feeding an Information Supply Engine (ISE), which returns matching information to the user; user actions then feed back into the ISE.)

can also manually select which category the query is in). Keywords extracted from the general Open Directory data are used to modify the query and help disambiguate it for the search engine. Unlike other systems, search results are not filtered; they are merged by a voting-based merging scheme and presented to the user as one list (a minimal sketch of such a merge follows the list of limitations below). As an aside, it has to be noted that the method would no longer work as described, as Google has discontinued the Google Directory (which was mostly a copy of the Open Directory with some enhancements, one of which used to be this category-specific search). The system works on the search engine side because it relies on the search engine to supply the user’s search history data (queries and clicks), with all the errors and limitations already discussed. On the positive side, the system recognises that the user can have multiple interests and builds a tree-like hierarchical structure of user interests and not just one general user profile. Some other important limitations to note are that:

• User interests can be expressed only as completely coinciding with categories in the top three levels of the Open Directory. Users cannot create different categories.

• Users have no access to the profiling process and data - they cannot influence it in any explicit way.

• The result from all user profiling and query analysis is only a limiting parameter sent to the search engine - “search within this category only”. Term weighting is not transmitted to the search engine.
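As referenced above, a voting-based merge of a general result list with several category-specific lists can be illustrated with a simple Borda-style count; this is a sketch of the general idea only, not the specific scheme of [Liu et al., 2004].

    # Illustrative Borda-style merge of one general and two category-specific
    # result lists into a single ranking. Lists, weights and URLs are invented.
    from collections import defaultdict

    def merge(result_lists, list_weights):
        votes = defaultdict(float)
        for results, weight in zip(result_lists, list_weights):
            for position, url in enumerate(results):
                votes[url] += weight * (len(results) - position)  # higher rank, more votes
        return sorted(votes, key=votes.get, reverse=True)

    general          = ["a", "b", "c", "d"]
    java_programming = ["c", "a", "e"]
    java_travel      = ["f", "c"]

    print(merge([general, java_programming, java_travel],
                list_weights=[1.0, 1.5, 0.5]))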

Crucially, for profiling the system uses only terms manually entered by the user as part of his search queries, which severely limits its expressiveness.

Recommender engines are a class of personalised information retrieval systems addressing the lack of a recommendation function in search engines. They provide information not associated with active keyword search on the part of the user; instead, they are refinements over browsing systems (such as online catalogues - e.g. Amazon31 or Netflix32) where the user browses through a large category structure. Content is suggested to the user, with or without the user having expressed any explicit need or preference. This suggestion can be made on the basis of:

• Explicitly expressed needs (Yahoo!’s “Search Assist” and Google’s “Auto Suggest” features recommend expanding the query with related or disambiguating terms).

• Implied needs (e.g. - if the user is viewing this document, he may also be interested in viewing that document). This is based only on similarity between documents, and not on actual analysis of user needs or preferences.

• Previous user actions (e.g. - this is a returning user, on his previous visits he accessed documents A and B, now he is accessing C - so we can also recommend D). This approach requires a persistent user profile, as well as document analysis to find similar documents.

• Previous actions of other users: friends of user A liked page B, so maybe he will like it as well. This relies on the assumption that friends have similar tastes and information-finding needs (and that they are indeed friends just because they say so on Facebook).

• “Collaborative filtering” (e.g. - this user seems to like A, other users who like A also like B, so B must be suitable for this user too). This is a two-step approach: first the system has to analyse the user and group him together with other users with similar interests, then recommend documents as per this group’s interests [Das et al., 2007]. The most notable successful example of this approach is the Google News33 service.
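The two-step collaborative filtering described in the last item can be sketched as a user-to-user recommender: group the user with the most similar other users by overlap of “liked” items, then recommend what those neighbours liked. This coarse illustration (Jaccard similarity over toy data) ignores the scale and quality issues that a system such as the one in [Das et al., 2007] has to handle.

    # Sketch of user-to-user collaborative filtering: find the most similar
    # users by overlap of liked items, then recommend their other likes.
    # The "likes" data is an invented toy example.

    likes = {
        "alice": {"A", "B", "C"},
        "bob":   {"A", "B", "D"},
        "carol": {"C", "E"},
    }

    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 0.0

    def recommend(user, neighbours=1):
        others = sorted(((jaccard(likes[user], likes[u]), u)
                         for u in likes if u != user), reverse=True)
        nearest = [u for _, u in others[:neighbours]]
        candidates = set().union(*(likes[u] for u in nearest)) - likes[user]
        return sorted(candidates)

    print(recommend("alice"))   # bob is the closest neighbour, so "D" is suggested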

An advanced concept was also proposed to create a multi-agent system [Birukov et al., 2005] that would build a distributed recommender mechanism based on implicit feedback from users, collected by their respective (distributed) personal agents. However, given the lack of realistic (and widely used) personal agents, this concept does not seem likely to be implemented soon. In general, these systems share several limitations: the end-user cannot explicitly influence them or make any specific request, they do not scale well, and they only work well in restricted, strictly controlled domains (e.g. goods in one marketplace such as Amazon, film descriptions as in Netflix or news from selected sources as in Google News or Outbrain34).

31http://www.amazon.com/ 32http://www.netflix.com/ 33http://news.google.com/ 34http://www.outbrain.com/


Site re-visitation is a special case of information finding, where the user wishes to find some information again. Advanced users apply a number of tactics in this situation [Aula et al., 2005], none of which work efficiently for realistic numbers of documents and none of which is assisted by a major search provider. The best methods used are bookmarking (with an average of 220 URLs bookmarked per user), or keeping many browser tabs open at the same time. It is clear that these do not work for a user who sees thousands of pages per month and tries to revisit a page seen several months back. Using an online bookmark service means just publishing the link to a web site: this is only an extension to browser bookmarks and does not contribute anything to solving the basic problem.
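A minimal sketch of the kind of re-visitation aid these tactics lack: every visited page is recorded in a small inverted index, so that a vague keyword recollection months later can still retrieve it. The example is in-memory only; a practical tool would need persistence, ranking and far better text handling.

    # Sketch: record visited pages in an inverted index so they can be found
    # again later from a vague keyword recollection (in-memory, for illustration).
    from collections import defaultdict

    class HistoryIndex:
        def __init__(self):
            self.index = defaultdict(set)   # term -> set of URLs
            self.titles = {}

        def record_visit(self, url, title, text):
            self.titles[url] = title
            for term in set((title + " " + text).lower().split()):
                self.index[term].add(url)

        def search(self, query):
            term_sets = [self.index[t] for t in query.lower().split()]
            hits = set.intersection(*term_sets) if term_sets else set()
            return [(url, self.titles[url]) for url in hits]

    history = HistoryIndex()
    history.record_visit("https://example.org/ppr", "Topic-sensitive PageRank",
                         "precomputing topic specific rank vectors")
    history.record_visit("https://example.org/cf", "Collaborative filtering notes",
                         "news recommendation with user clusters")

    print(history.search("topic rank"))   # months later, only vague terms remembered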

2.5.3 Inherent Problems of Personalised Solutions

Apart from the technical challenges, all personalised solutions and the user modelling they are based on have significant privacy implications, which often lead to tension between personalisation and privacy in web-based systems. Some surveys on users’ attitudes with regard to the privacy issue indicate that privacy concerns may severely impede the adoption of personalised web-based systems. As a consequence, personalised systems may become less used, personalisation features may be switched off if this is an option, and escape strategies may be adopted such as submitting falsified data, maintaining multiple accounts/identities, deleting cookies, etc. [Kobsa, 2007]

As we have also noted, many countries already have privacy protection laws (or are currently beginning to implement them) which affect the methods on which personalisation is based. Consequently, a method which may be (or become in future) technically feasible may still not be practical as it may be illegal in some jurisdiction.

Specifically for recommendation engines, we also have to note the inherent problems of “social search”, exemplified by the failure of digg.com and the continuing absence of a search engine powered by Facebook. Social search (where applicable to search engines, such as Google+) or social recommendation (as in Facebook outgoing links) rely on knowing a) who your friends are, and b) what they like. Neither of these data sets is easy to collect, and people seem more and more reluctant to share them with service providers. More importantly, the model then relies on a correlation between the tastes of self-proclaimed friends; however, “friending” someone on a social network may mean just that you want to follow his progress, or you share a hobby - it does not mean you have a common background in everything. Even more importantly, there is still the issue of “cold start” common to all recommendation engines: initially, the engine has no information to go on, so it cannot recommend anything. Including the “social graph” of a person only removes the problem a step further - where the system had a “cold start” regarding one person, it now has a “cold start” regarding a circle of “friends”. Furthermore, once a recommendation has been made, people may follow it just because it was recommended - the system enters a positive feedback loop similar to the one search engines have when ranking search results based on clickthrough data.

2.5.4 Avoidable Problems of Personalised Solutions

In many cases, it is much more difficult to see that something is not there than to see in detail something which is. With personalised solutions, what is missing is the explicit side: feedback is collected only implicitly, and the user is treated as a passive figure with nothing to (explicitly) contribute. Virtually all existing solutions work behind the user’s back, trying to guess what he wants from various actions of his. However, they do not supply a meaningful way for the user to explicitly assist the building of his personal profile, or to edit/correct what the solution has created by implicit feedback. As we have seen, the user is not even told that profiling and personalisation have occurred, let alone given a way to influence any aspect of this.

In our opinion, this omission is the result of development policies only and not inherently necessary, so we addressed the issue and developed our solution to accept explicit feedback and to allow the user to assist in his own profiling.

2.5.5 Personal Web Assistants

A fact of life is that the large web sites that thrive (with the typical example of Google) are those providing not information, but meta-information - that is, information about information, which turns out to be more economically profitable to produce. This has been compared to a virtual food chain where information providers (for example online newspapers) are the plants that grow the virtual “biomass”, web portals and search engines are the herbivores that consume it and then agents getting their information from the search engines would be the carnivores that live off them [Etzioni, 1996]. This three-tier vision has not yet happened though - although many agent platforms were tried ([Etzioni, 1996; Chen and Sycara, 1998; Lieberman et al., 2001]), none have yet emerged “in the wild” and become popular.

This project started as an attempt to build such a “carnivore” and much research time was spent on personal agents. However, it gradually emerged that an agent, however intelligent, cannot solve the “personalisation of search” problem and there seems to be a fundamental reason for this inability.

There are many research projects on personal web assistants, but none of them has gone on to provide a real-world application, and none of them has progressed beyond the initial prototype stages. The most obvious reason for this may be said to be the commercial success of search engines. Search engines are seen as the ultimate answer to the information-finding problem and discourage any alternative developments. Moreover, this commercial success allows Google and the other big search engines to channel all subsequent development in their direction, by investing enormous resources and employing a significant part of the scientific community.

However, there seems to exist an even more serious reason for the lack of success of personal agents assisting personalised information discovery. Most probably, this whole line of development has not succeeded because it lacks a part of the picture inherently. Search engines also lack a part of the picture - they do not know much about the user or the reason why the user searches for something. This is not a crippling problem for a search engine though - it means lower result quality, but there are still some useful results that can be presented to the searcher; additionally, the user can “tweak the search” and convey some additional meaning. Intelligent personal agents, on the other hand, lack a part of the picture which they cannot compensate for; they can build a rather intimate knowledge of the user, but they know nothing about the web. They can only see what the user sees, which is a very small subset of the web. Then, whatever learning algorithms they may use, these algorithms are only applied to this small part of the web. The agent is used as either a passive filter or a keyword recommender - in other words, just a helper for the user’s dialogue with a search engine. All the knowledge about the web remains with the search engine and cannot be used by the agent. Moreover, the agent has to imitate a human when communicating with the search engine, which severely limits the ways it can express its knowledge about the user in general, and the current search session in particular. This follows from the lack of infrastructure at the search engine which could accommodate an agent, since search engines typically have interfaces for end-users only, allowing simple keyword search. Our project will address this omission.

2.6 Issues Arising From Current Solutions

Some of the aspects and implementation approaches of the popular information-finding solutions have had a large impact not only on their competing services (e.g., search engines crowding out web directories), but also on users and on the web itself.

2.6.1 Consequences of Some Approaches

The comparison between the two information-finding models as they developed in the real world (searching based on machine learning versus browsing based on manually built data sets) has led to the practical disappearance of browsing (or information discovery) as a way to find information online, much in the same way that balloons have disappeared as a means of transportation. This is not necessarily a good thing, as the alternative technology might still be more useful for some applications - for example cheap freight in the case of balloons, or learning entirely new knowledge in the case of browsing. Furthermore, some of the initial assumptions, as well as some of the popular approaches to content indexing and ranking employed by search engines, in the absence of a viable alternative to the keyword search paradigm, have had a strong negative impact on the development of the web as a whole.

What are some of the implicit assumptions that modern search engines are based on? Their basic model assumes that:

• Every search is separate and independent.

• Every search is generated by a short-term interest.

• Every user means the same thing with the same word.

• The user requires a fast answer.

• The user cannot be relied upon to make his query more specific by supplying its context.

• The search engine interface should be simple, even if this means sacrificing features in order to simplify it.

These assumptions create a basic search engine where:

• Search is stateless - every search is treated as a new one, unrelated to previous searches of the same user. Modern search engines, of course, try to mitigate this by following “search trails” but this is just that - a mitigation technique trying not very successfully to cover an inherent gap.

• Research, as opposed to search, is not supported - users cannot develop a research topic over some time, add their own snippets of knowledge to it, save their results and resume the search later, customise search results etc.

• Search engines are not personalised - the user’s search context and “background knowledge” (cognitive context) are not taken into account; rather, a “common denominator” is used, levelling all users. This effectively ignores the synonymy and polysemy issues [Deerwester et al., 1990], namely - that different people refer to the same concepts in different ways, and that the same word may mean different things to different people (or to the same person in different contexts). Again, we have seen that modern search engines are working on mitigating the effects of this system-centred approach, but they have elected to rely on implicit feedback for this, leading to important negative side-effects.

• Optimising search speed is a top priority, so most elements of the search result have been pre-computed, using a “common denominator” for all users and contexts. The user cannot define a different priority, e.g. select a slower but more accurate service.

• Search queries are limited to a small number of words. Users lack a more expressive way to define what they are looking for.


• There are no features that allow users to specify in what context they are searching, to control search engine settings (change the relative importance of different ranking factors, or even know what those factors are) or to collaborate with other users. This follows the general tendency to simplify features and interfaces in order to make products more attractive to the “average user” (who presumably prefers simpler interfaces and fewer features), leading to the mass development of products which might indeed be useful for most people but have limited use for an advanced user who may need those omitted features and be prepared to use them. While the approach may make good business sense for the developing company, it is not necessarily in the best interests of the user. After all, everyone starts from being “average” but after a while might grow into “advanced” and would expect advanced features to be available. Moreover, there is already a significant class of such people in existence - professional researchers for example, and a whole new emerging science trying to assist them - e-Research.

Apart from the inherent issues of the model, some problems are generated by implementation decisions and could have been avoided/mitigated by a different implementation approach. As already discussed, the fixed ranking algorithm (combined with limiting the number of search results shown to the user) does not allow full recall of indexed documents: even though the web crawler may have found and indexed all web documents, some of them are not accessible through keyword search. This could have been avoided if users were allowed to issue more detailed queries or tweak some parameters of the ranking algorithm to achieve different result rankings, or if search could be fragmented into smaller document collections (such as a subcategory of a web directory) where the limited number of shown results would not be an issue as there would be fewer results to show.

Many other problems that search engines strive to solve have also been generated largely by themselves, mainly by the application of PageRank or similar algorithms which rank sites based on their position in the web graph and not on their actual content; the proliferation of some types of search engine spam [Gyöngyi and Garcia-Molina, 2005] was made profitable by PageRank and is directly attributable to it. Other spam stems from the system-centred approach to content indexing and from the “bag of words” (or term vector) representation of documents.

One aspect of search engine spam is the “link spamming” [Gyöngyi et al., 2006] (forum spam, comment spam etc.) industry where people or “bots” (robots) post massive numbers of links on sites they do not own (entries in discussion forums, news article comments, posts on social networks and other sites where the public is allowed to post text), with the sole purpose of linking from them to other sites and boosting these sites’ PageRank by transferring to them some rank from the exploited site. This whole nuisance would not exist if not for the graph analysis methods of search engines, as it would be pointless without them; it is entirely attributable to the PageRank approach to search result ranking.


Another aspect of search engine spam is the creation of content specifically for the search engine and not for human readers. The system-centred approach to content indexing stimulates the generation of “digital garbage” by creating economic incentives for people to create cheap sites with the express intent to link from them to other sites (again to boost the other site’s PageRank) - the so-called link farms, or to create “splogs” (spam blogs) and massive numbers of low quality but “high keyword density” articles and then load them with advertising (ironically, targeted at scamming AdSense which is another service owned by Google) - known as MFA (Made for AdSense) sites. “Fast-food content” [Arrington, 2009] - the industrial manufacturing of low-quality search-engine optimised content - has now become a viable (and even lucrative) business model, even for companies as significant as AOL35 [Business Insider, 2011]. The disregard for human users is such that a large part of this content is machine-generated and is totally meaningless36; without semantic or higher-level analysis though, a term-vector indexing search engine just notes the presence of some keywords in the text and indexes them accordingly (note the repetition of some words, to achieve “keyword density”). While search engines are starting to address the issue37, they still point too many visitors to such sites and make them profitable to create; the user personally is not taken into account in any way by the ranking and searching algorithms and is not allowed to weed out such content, even though the fact that it is junk is obvious to him.

Another issue is content duplication; it proliferates because search engines “like” short and simple URLs, or keywords within the URL. Their abuse of the Domain Name System, which was not supposed to be used for content indexing, has created an economy which also abuses the DNS: people do not just publish their content (once) for humans to read, but publish it multiple times on different URLs [Edelman, 2003; Fetterly et al., 2003] in order to get better search engine ranking for different keywords, thus clogging the system with repetitive content. As we have seen, de-duplicating such texts is not a trivial task [Bar-Yossef et al., 2007; Manku et al., 2007] and the problem is far from solved.

This whole phenomenon of the web’s mutation was best summed up by a leading Yahoo! search engineer: “It’s not unlike the Heisenberg Uncertainty Principle. PageRank stopped working really well when people began to understand how PageRank worked. The act of Google trying to “understand” the web caused the web itself to change.” [Zawodny, 2003]

35http://www.aol.com/ 36Example of a search-engine optimised text achieving top results in Google searches related to some types of leather shoes: “Calcei Shoes are the Ought to Have footwear between the vogue, they are well- known for much more than 10 a long time in the planet. Nevertheless in 2010, the popularity of Calcei Footwear continued and expended a lot a lot more than just before consequently there are assortment of variants came into the fashion discipline. And Gucci is a huge person who just took component in these kinds of scorching pattern who released the Higher Best Belts Sneaker for the early Spring of 2010. These sneakers consist of the trend of Calcei footwear and the comfort of sneakers, crafted utilizing prime high quality leather, nylon, decorated with the metallic buckles the script Gucci logo in the footwear make them a lot noble.” 37http://googleblog.blogspot.com/2011/01/google-search-and-search-engine-spam.html


As a consequence, most of the search engines’ efforts are now targeted at adversarial information retrieval [Fetterly, 2007; Gyöngyi and Garcia-Molina, 2005], or in other words - trying to beat the people trying to beat them...

PageRank has also had a “worrisome impact” [Cho and Roy, 2004] on the discovery of newly created web pages of high quality, which affects the “ecology” of the web as a whole - users and search engines alike. It tends to emphasise older, “established” sites and documents since more links point to them. This makes new resources difficult to find (they are ranked much lower) so fewer people find them and link to them: another example of the search engine introducing a positive feedback mechanism and entering a vicious cycle. Search engines try to mitigate this by compensating with a “freshness” factor to boost newer resources temporarily (while they are new). It is obvious though that in some cases this is not appropriate, and besides - there is no way to objectively find the parameters deciding how big the “compensation” in rank should be; this is yet another source of problems for result ranking, and another example of existing parameters that reflect heavily on ranking which are hidden from the user and not able to be set by him according to his current search context and preferences.

Nevertheless, search engines are developing at a fast pace despite their problems; the same cannot be said about web directories. The major ones are in decline - the Open Directory Project seems practically abandoned, and Yahoo!38 has removed its directory from its own homepage. Web directories have their own set of issues and problems leading to their decline. The most significant of these is more a problem of economic and not technical or scientific origin: historically, they were manually built (Yahoo! employed thousands of editors to support the Yahoo! Directory) and proved not as cost-effective as search engines. Apart from that though, they have their inherent issues as well:

• Navigation is too complicated for the user.

• Manual maintenance is not only expensive, but also prone to subjective errors and editorial bias.

• Information is limited not only in volume, but also in type of resources indexed.

• The user can only follow links passively or search as in a search engine, but (just as with a search engine) cannot specify his search context.

Due to deficiencies in existing web directories, users find uses for search engines that are different from their intended purpose - for example, they use them instead of a web directory to suggest general sites by issuing not just informational but also navigational queries [Lee et al., 2005].

38http://www.yahoo.com/


People also tend to move away from information on the actual web sites and “skim” the search engine text snippets for needed answers:

“A major change over the years has been a declining emphasis on using search to identify good sites as such. Rather than hunt for sites to explore and use in depth, users now hunt for specific answers. The Web as a whole has thus become one agglomerated resource for people who use search engines to dredge up specific pages related to specific needs, without caring which sites supply the pages.

Search engines have essentially become answer engines. Their job is no longer resource discovery, but rather to answer users’ questions. Ask Jeeves was on to something with its original Q&A interface and now has an interesting approach to showing answers directly on the SERP (search engine results page).

This changing behaviour is explained by information foraging theory: the easier it is to track down new resources, the less time users will spend at each resource. Thus, the increasing improvement in search quality over time is driving the trend toward the answer engine. Always-on connections have a similar effect, because they encourage information snacking and shorter sessions. Finally, web browsers’ despicably weak support for bookmarks/favorites has contributed to the decline in users’ interest in building a list of favorite sites.” [Nielsen, 2004]

According to Nielsen//NetRatings, the average user sees 1511 web pages per month [Nielsen//NetRatings, 2007], with the largest part of these (excluding web-based email, which actually accounts for 57% of all pageviews) being web portals and search engines. This means that users spend more time looking for content than they spend with the most popular category of content itself (entertainment). Search engines and web portals account for 9% of all pageviews, and other content (excluding email) accounts for 34%, which means that users spend more than one fifth of their non-email time on the web (9 out of 43 percentage points) trying to find something - a strong indication that search engines and web portals are not doing a good job of pointing users to actual content. It is a separate question that they have a conflict of interest here, as they derive their profits from this situation and reversing it would be against their business models.

Something important that seems totally lacking in all major search engine and web directory developments is a meaningful way for users to actively send their preferences and their background to the search engine or web directory. Both models (searching and browsing) fail to take into account who the searchers are and what their background is - what they already know. The mechanisms that exist allow users to filter out some of the incoming results, but do not allow them to actively request specific alternative types of results instead. More importantly, they do not allow them to provide any information about their background and the context of their search. All attempts for improvement in this respect rely on passive, or implicit input - for example click pattern analysis ([Lee et al., 2005; Sun et al., 2005; Qiu and Cho, 2006; Das et al., 2007]) which tries to “second-guess” what users mean to do, but does not let them express it explicitly.

2.6.2 Research Issues

As a general summary, we can identify two sets of issues with the current state of the web:

1. From a system point of view, the search engine model cannot serve all information-finding needs, cannot cover the whole of the web and encourages the creation of vast amounts of digital garbage, thus making the rest of the issues even more serious. Web directories, which are supposed to complement search engines by allowing information discovery, are practically extinct; even before their demise, they had serious issues with the nature and amount of indexed content, as well as usability.

2. From a user point of view, neither the search engine model nor the web directory model take into account who the user is, what his information background is, why he is searching for a particular piece of information and what the search history for that particular case is (i.e. - what pieces of information have already been found and how the user assessed them). All solutions are system-centric and not user-centric: neither search engines nor web directories allow the user to personalise any (serious, not cosmetic) aspect of their operation. Additionally, any information that could be used for personalisation remains with the search provider and not with the user, giving rise to numerous technical, legal and privacy issues.

There is also a general, more philosophical point which has to be made and which only recently began to be raised: the so-called “filter bubble” [Pariser, 2011]. Apart from losing potentially valuable search results (see our example of trying to find earthmoving equipment), hidden personalisation using implicit feedback actually reinforces any traits that the system thinks the user may have; even worse, some of these traits may not even be related to the person but to a larger group, since in the absence of expressive feedback search engines use a large number of other “signals” such as IP address, software configuration etc. and then generalise on this basis lumping people together by some coincidence of these “signals”. If for some reason the system then decides the person does not like soccer, it will not show him soccer sites. However, it may be the case that the user is just not familiar with soccer, which is why he has not visited any related sites before - given a chance, he might have learned more and become interested; or the user is just an atypical member of a group - he lives in the US, but likes soccer and not American football. The personalisation solution though is pushing him towards a virtual world where no soccer exists (if the user does not know something exists and is not shown information related to it, it is not a part of the “private world” he lives in). In other words, this reinforcement of personal traits acts as yet another “positive feedback” - the system finds (or invents) a trait in the user’s behaviour and then reinforces it by biased filtering of his information access.
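The reinforcement described above can be shown with a toy simulation: if exposure drives clicks and clicks feed back into the ranking, an initially tiny (and possibly accidental) imbalance between two topics grows until one of them practically disappears from the user’s results. All numbers are arbitrary.

    # Toy simulation of personalisation acting as positive feedback: exposure
    # drives clicks, clicks drive exposure, and a small initial imbalance
    # becomes extreme. All numbers are arbitrary.

    weights = {"soccer": 0.49, "american_football": 0.51}  # near-equal initial interest

    for step in range(20):
        favoured = max(weights, key=weights.get)
        for topic in weights:
            exposure = 0.9 if topic == favoured else 0.1   # favoured topic fills most slots
            clicks = exposure                              # clicks roughly follow exposure
            weights[topic] += 0.2 * clicks                 # ranking reinforces "observed" interest

    total = sum(weights.values())
    print({topic: round(w / total, 2) for topic, w in weights.items()})
    # the 49/51 starting point ends up far more lopsided than any real preference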


Anyone who has studied Control Theory though should know that a system using positive feedback becomes critically unstable in a very short time, since this feedback quickly magnifies whatever small problems or imbalances may exist within (exemplified also in our experiments). In the extreme case of information retrieval with personalisation which “just happens” - without the user’s knowledge or the ability of the user to cancel it or to request some information even though the system may think it is not suitable or preferred by him - towards which most current development is pushing the search engine space, this will eventually lead to a situation where users are boxed into individually-tailored small pockets of “suitable” information. They will be unable to leave these limited information spaces, since the search engines will confine their answers to information within these pockets only, whatever users search for. Eventually users will reach a point where they will not even want to leave these pockets - since they will have learned to only rely on information contained in them and will only consider such information true; at some point, they may not even realise that anything outside of these “pockets” exists. While initially the user shapes the personalised solution, eventually the personalised solution will shape the person of the user... The ethical consequences of this are far outside the scope of this work, but in short: this is not a good thing (to put it mildly). One of the goals of this work is to give users the means to cancel such effects from unwanted personalisation and be able to discover new knowledge irrespective of what the system thinks is or is not suitable. We will address this and some of the other issues discussed above with the Web Exploration Engine we propose in the next chapter.

3 Exploration Engine

3.1 Summary

In this chapter, we identify the research question of the thesis, we state our assumptions and limitations and then propose a solution to some of the issues of the web we discussed in the previous chapter. Our proposal is the Web Exploration Engine: a tool based on a semi-automated web directory coupled with a large-scale web spider which allows browsing (information discovery) as well as searching (information locating) within a significant part of the web, including search in a particular category (context). We also propose a distributed system for end-user modelling, which allows better capturing of the user’s cognitive model; this is then taken into account by the Web Exploration Engine at search time by integrating into it some infrastructure to accept complex queries. Additionally, we enable users to save and expand their searches over a period of time (thus making them researches), as well as share them with others to encourage collaboration. These developments address several gaps in existing information-finding models and improve some aspects of the currently dominant paradigms which have had an extremely negative impact on user experience and the development of the web as a whole.

3.2 Research Question

Before proceeding with a proposed solution, we need to first define the problems we wish to address, as well as some limitations and assumptions we have to make.


We can summarise the problem background with the following statement:

The information-finding needs of users on the web consist of two parts: to locate particular documents, and to explore the web based on some topic. The former is being addressed by search engines. The model they have decided to employ though has changed the web for the worse, creating numerous problems for end-users and for the search engines themselves. The latter is not currently being addressed adequately, which forces users to turn to search engines for tasks that they were not created to deal with, leading to much frustration and ineffectiveness and further compounding the search engine model’s problems. Furthermore, no viable solutions have yet been found to allow users to efficiently specify in what context they are performing a particular search, what they know already about a topic, how they feel about specific sources of information etc.; in other words, users cannot focus their search based on their own use case, but have to accept generic results which in many cases are not appropriate.

To address these omissions, we think we need to focus on the following research question: “How to allow users to explore the web taking into account their information background and their preferences, and also allow them to manage their own data at the same time?”.

3.2.1 Research Sub-questions

Some general advantages, as well as some problems of the currently dominant information- finding models do not depend on their implementation but are rather inherent in the models themselves. Other issues have arisen from implementation decisions that can be improved, as they are not based on inherent limitations.

Obviously, this work cannot address all shortcomings of all models; however, it will propose a simple mechanism addressing several gaps in existing models. Starting with our initial problem of enabling intelligent exploration of the web while taking into account the user’s information background and preferences, and having in mind the issues and implications discussed in the previous chapter, we can now break up the question into several sub-questions we seek to answer:

1. How can we enable the user to explicitly supply, and the search provider (search engine or web directory) to model, the user’s general cognitive context in the light of the user’s current search activity?

2. Can we allow the user to explicitly supply, and the search provider (search engine or web directory) to utilise, the user’s current search context? More specifically, can we enable the user to focus the current search into a narrow topic, with arbitrary granularity?


3. Further to the above, how can we enable the user to browse into a cluster of topically similar documents, as an alternative to keyword search, while addressing the shortcomings of the current model for building such structures (web directories)?

4. Will combining the previous two items constitute a useful recommendation feature suggesting resources which the user did not search for, but are topically similar to his search?

5. Can we allow the user to reach any web page, irrespective of how low its individual document ranking is according to a search engine?

6. Can the user assist the search process with explicit, document-level positive or negative feedback?

7. Can we enable the user to develop a search, over a period of time, into a personalised deep research, optionally enhanced with user-submitted content?

8. Can we create a useful tool to record the user’s browsing history into a structured, searchable database - as an alternative to current “bookmark” browser features?

9. How can we address the issues of privacy and legal compliance with personalisation on the side of the search provider?

10. Can we do something to mitigate the fact that some aspects of search engine ranking algorithms encourage the creation of massive amounts of digital garbage?

3.2.2 Assumptions and Limitations

When we create our model of the user, one basic assumption we have to make (a simplification, but in this case a necessary one) is to equate "user's interests and prior knowledge" with "web documents the user has accessed". We shall have to assume that, since the user has actively requested these documents in the first place, they are somehow related to his interests; and that, having seen them, their content has been added to the user's personal knowledge base. Both statements are obviously debatable, but they are a good "first approximation" of the user's interests and cognitive background; creating a better "second approximation" is quite outside the scope of this work.

We also assume the existence of a search engine with all its mechanisms already in place (a spider, indexer etc.) which will be responsible for the information gathering part of our solution. We will not aim to improve any of these mechanisms or substitute them with something else, but will rather build a new layer of knowledge over them. Our solution will be installed on top of the existing search engine infrastructure, making its data findable in an alternative way.


Instead of working with a real search engine for this project though (which would have been difficult for a number of technical, legal and commercial reasons), we emulate one by creating a small search engine for our experimental study.

Our policy towards gathering users' personal information (in the form of preferences, knowledge and browsing history) has the protection of user privacy as its foremost concern.

Data is assumed to be dynamic, i.e. constantly changing: data instances may be deleted, added or changed at any time, and this should not interrupt the work of the system. This assumption is missing from many current related developments, which makes them unsuitable for real-world practical use. It is a fact that the web changes, the user's knowledge of it changes and the user's perception of things changes with time, and any algorithm that attempts to freeze these variables at a certain point in time cannot be of practical use in a real-world application.

The proof-of-concept application uses the same infrastructure and code on both the user side and the search engine side, but this is for our convenience only. In a “real world” application, they do not need to be the same.

3.2.3 Practical Tasks

Following from the research question and the above assumptions and limitations, we can subdivide our main goal into the following derivative tasks:

1. Develop a method to allow users to discover (browse) information collected by a search engine spider without the limitations of the keyword search paradigm.

2. Develop a method to capture a pragmatic cognitive model for the user (preferably with minimal user interaction in order to enhance usability).

3. Develop a method to allow the user to express a specific search context for every information-finding task he performs.

4. Develop a framework to translate the user’s cognitive model and the search context into a form that can be processed by a search engine.

5. Implement a prototype system as a proof of concept, which can provide information to the user in the form of relevant web links based on all the above, and the database of a search engine.

The general approach towards solving these tasks is to see what the “state of the art” is in each field, then adopt an existing method if it fits our specific task or adapt it as needed.


3.3 General Outline of the Solution

In the course of researching this work and analysing a number of proposed solutions to various aspects of the discussed problems, it emerged that no simple solution exists, since many of the problems are both user-side and server-side. Working on one side alone would partially mitigate some issues but not completely resolve them. Consequently, we propose a complex distributed system differing from the usual one-sided approaches (compare the diagram for current systems in Figure 2.1 and our proposal in Figure 3.1).

Figure 3.1: Context based information supply: our proposal. User profiling is on the user side, information-related activities are on the server side. A (partial) user profile is sent on to the server after being filtered through the current activity context.

3.3.1 Hybrid Web Directory

Addressing the tasks to "1. Develop a method to allow users to discover (browse) information collected by a search engine spider without the limitations of the keyword search paradigm" and "3. Develop a method to allow the user to express a specific search context for every information-finding task he performs", we propose a hybrid concept which relies on a

combination of statistical learning techniques and human classification. In this combined search engine / web directory model (which we call the Web Exploration Engine) users are able to:

• Browse a web directory, satisfying their information discovery need.

• Search the whole database (same as a search engine), satisfying the information locating need.

• Search within only a branch of the web directory tree, enabling search in a narrower (predefined) context. If the user selects a particular branch of the hierarchy to search in, he is explicitly expressing the current search context.

We propose to achieve this by building a semi-automated web directory which includes in itself all of the documents indexed by a (fully automated) search engine’s web spider. Such an Exploration Engine will have the same amount of information as a search engine, but it will be classified into a hierarchical structure as a standard web directory. This will necessarily be an extremely large structure (orders of magnitude larger than all existing web directories), which is why machine learning methods are needed to build it. Since the hierarchy can be arbitrarily deep, the system can have relatively few documents per leaf node, so every document will be reachable by exploration/browsing (the user will be able to reach a node where all documents belonging to it will be on just one page, small enough for a person to see all listed documents): we thus address the issue of limited search results that search engines have.

The resulting structure will not be prohibitively difficult to navigate: if we assume an average of 100 documents per leaf node and 10 branches per node, users will be able to browse something in the order of 100 billion documents with only ten navigational clicks. To address the cognitive bias discussed in the previous chapter though (the fact that users may approach the classification of knowledge differently from the directory’s creators and not be able to find what they are looking for), we will also develop an assistant mechanism to guide users through the directory structure.
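To make the arithmetic behind this claim concrete, a minimal sketch follows; the branching factor, depth and documents-per-leaf figures are the illustrative assumptions stated above, not measured properties of any real directory.

```python
# Back-of-the-envelope estimate of how many documents a user can reach by
# browsing a directory tree. Assumed figures (from the text above): 10 branches
# per node, 100 documents per leaf node, and the final click opens the leaf page.

def reachable_documents(branches_per_node: int, clicks: int, docs_per_leaf: int) -> int:
    """Documents reachable when the last click opens a leaf-node listing."""
    branching_clicks = clicks - 1          # clicks spent choosing sub-categories
    leaf_nodes = branches_per_node ** branching_clicks
    return leaf_nodes * docs_per_leaf

print(reachable_documents(10, 10, 100))    # 100000000000, i.e. about 100 billion
```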

If the user searches its whole database, the Exploration Engine will act as a standard search engine. However, if the user elects to search within only a category, the Exploration Engine will act as a topic-specific ("vertical") search engine, where this topic can be taken to mean search context: a very broad context at the upper levels of the hierarchy, becoming increasingly narrower the deeper the user gets, with practically arbitrary granularity selected by the user. In effect, our system will act as a large number of specialised nested search engines. Additionally, we will enable the user to create and/or expand a search query by supplying relevance feedback on search results, enabling "on-the-fly" user-specified search context supplementing the pre-built contexts. Doing this in a "session" manner (in small increments) will provide the user with the "information scent" envisaged by

the Information Foraging theory [Pirolli and Card, 1995], which is largely missing [Nielsen, 2004] from current search models.

3.3.2 User-Controlled Ontology

The task to "2. Develop a method to capture a pragmatic cognitive model for the user (preferably with minimal user interaction in order to enhance usability)", as discussed above, has already been addressed by many other projects. The usual agent-based approach, which we will also follow, is to create a user-side personal agent monitoring the user's web activity, which is then mapped to an ontology of some description (a simple tree-like hierarchy or a complex graph). In our case, the mapping will be to a completely user-built hierarchy: we will allow the user to create the category structure himself, in order to avoid the cognitive bias which would come with a pre-existing ontology. We will not attempt to profile the user from the side of the search engine, as on reflection this seems a bad strategy: the search engine cannot collect comprehensive data about the user, and would face insurmountable technical, privacy and legal issues if it did.

3.3.3 Personal to Global Ontology

Once we have created a user-side model of the user’s cognitive context, we are faced with the task to “4. Develop a framework to translate the user’s cognitive model and the search context into a form that can be processed by a search engine.” As already discussed, such user-side models are usually used to achieve user-side personalisation in the form of data filtering, personalised presentation of search results or search query expansion: various forms of assisting the user’s interaction with a search engine. The common missing element in these models is the search engine itself: it is taken as a given and no attempts are made to modify it to accommodate the personalised search solution. In our case, we will modify the search engine and allow it to accept complex search queries in the form of weighted term vectors, where the term weights will reflect not only the importance of a word for the specific search query, but also its importance in the cognitive context of the particular user. In this way, the query term vector will represent both the current search and the overall context of the user; if the user initiates a search from a specific branch of his personal category tree, this will be reflected in the query and represent the current search context. Having a server-side (central) ontology, we can also match the query to a particular branch of this ontology and limit the search within its context. This part of our proposed model represents the main contribution of this work.
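As an illustration of what such a complex query could look like, here is a minimal sketch; the dictionary representation, the names and the simple additive weighting are assumptions made for this example and do not describe the exact format used by our prototype.

```python
# Hypothetical sketch: the user-side agent merges the literal search terms with
# the weighted vocabulary of the personal category the search was started from,
# producing a single word -> weight vector the server can score against its index.

from collections import defaultdict

def build_complex_query(keywords, context_term_weights, context_scale=0.3):
    """Combine explicit query terms with the user's cognitive/search context.
    The additive combination and the 0.3 scaling factor are illustrative only."""
    query = defaultdict(float)
    for word in keywords:
        query[word] += 1.0                      # explicit search terms dominate
    for word, weight in context_term_weights.items():
        query[word] += context_scale * weight   # terms from the user's context
    return dict(query)

# Example: searching for "jaguar speed" from a personal "Animals / Big cats" category
context = {"cat": 0.9, "predator": 0.7, "habitat": 0.4}
print(build_complex_query(["jaguar", "speed"], context))
```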

3.3.4 Prototype and Practical Considerations

Being a complex multi-part mechanism, our proposed solution is difficult to describe and explain. In order to demonstrate it and develop some “use cases” for illustration purposes,

as well as to test the technologies needed to create such a system, we have also added a practical part to this work: "5. Implement a prototype system as a proof of concept, which can provide information to the user in the form of relevant web links based on all the above, and the database of a search engine." We have programmed a prototype and tested it with "real world" data. Additionally, while developing the prototype, we discovered the relative ease with which some other improvements can be made which are not essential to the search model but are still "nice to have" features improving general usability:

• While training a statistical classifier to distinguish documents of different categories, we learn the most important (distinctive) terms for that category; we can then present them to the user as a list of clickable suggested keywords that can be easily added to the current query.

• We added the option for the user to save a query for later re-use.

• When the user loads a previous query and modifies it, he can either “Save” it - overwriting the initial version, or “fork” it, creating a new saved query.

• Users can collaborate with others by sharing their research sessions: every saved query can be made public and re-used by others.

An important point we also need to clarify is that such a system does not necessarily need to work with only one information provider (web directory or search engine). Once the overall framework is in place and end users have built their profiles on their own computers, they can use these profiles with any number of information providers who adopt the approach and create an interface (and underlying infrastructure) to accept search requests from such profiles.

3.4 Advantages of the Proposed Solution

The benefits of a system as described above, which adopts a distributed as opposed to an integrated approach to search personalisation in conjunction with an ordered server-side data structure, are multi-faceted and we will list them below.

3.4.1 Usable User Profile

The first and main benefit of our solution is the ability to create a usable user profile without the many shortcomings of other solutions, both those already in use and those proposed as future developments. Let us review this statement in more detail. We emphasise that the user profile created by our system will be "usable" because, unlike the profiles created by other user-side personal assistant agents, it will have an

expressive way to communicate with the search engine, through which it will be able to convey the part of the user profile necessary to personalise each search activity. Server-side personalisation solutions, on the other hand, are hampered by many problems which we do not solve but circumvent:

• Collecting meaningful amounts of user behaviour data is a technical challenge for search engines, as they normally have no access to most of the user’s activity1; in our solution, profiling happens where most of the user activity is - on his main computer. In a future development, our system will also have “roaming profiles” such as those already being integrated by major web browsers (Firefox2 and Opera3); they will allow the user to use (and develop) his profile from various devices, and will be saved in a location of the user’s choice (i.e., the data will not be saved and controlled by one central entity).

• Collecting meaningful amounts of user behaviour data would also present storage and processing problems for a search engine if it managed to do it for many millions of users; in our solution we shift both data storage and processing to the end-users' machines.

• Legal issues limiting the retention of identifiable users’ data, as well as privacy concerns do not apply in our case: in our solution every user stores his own data so that we do not store it ourselves or have direct access to it.

• Tying the user profile to a server-side ontology is necessarily limited, as this ontology will have to be one for all users. In our solution, we can afford to allow every user to build a personal ontology and train a personal classifier to place documents in it as per his own criteria. At search time, this personal classification will be dynamically matched to a central ontology, on a case-by-case basis: the user will be guided to a node in the central ontology which best fits his query, no matter how he created the query and what he thinks about its classification. This is where our system will correct the impact of the user's cognitive bias: even if the user has classified "Dolphin" under "Fish" in his personal data structure, the system will dynamically match that to the "Mammals / Marine" branch of the server-side ontology and will present the user with correct results.

Furthermore, the user model created by our solution has an arbitrary granularity which can be controlled by the user: i.e., he can decide how many sub-categories (contexts) will be created and how they relate to each other. This gives more freedom to the user than other systems [Liu et al., 2004] which limit him to a number of fixed, pre-set categories.

1Presumably, Google's Chrome operating system is trying to address this and give the search engine access to all user activity data; however, this may be illegal and is limited by the small market share of the operating system and the devices running it.
2http://www.mozilla.com/firefox/
3http://www.opera.com/


3.4.2 Improved General Usability

Further to creating usable profiles, the user models generated by our solution will also be re-usable: since they do not depend on a particular service provider, the user will be able to use his profile with any number of search providers where the server-side part of the mechanism is implemented. In other words, if a number of search engines and web directories implement a “Web Exploration” mechanism with a standard query interface, users will be able to benefit from that without having to create a new profile for every one of them.

The user-side part of the mechanism, apart from monitoring the user’s activity in order to create a user model, can also easily grow into a personal data management assistant; since it monitors all web traffic anyway, it can assist document re-visitation by becoming a feature-rich personal bookmarking solution where the user can search within web pages he has visited. Unlike browser bookmark applications, where the user has to specifically decide to bookmark a page, this personal agent will record everything: users will be able to revisit even pages they did not think important enough at the time to bookmark. Furthermore, browser bookmark applications record only a site’s name and URL, while our agent will also index the full content of the document, making search really useful.

As we already said, a major problem of explicit feedback in general is that users do not take time to provide it, because they see no immediate effect of their effort and think it is not worth the trouble. We provide an incentive for them: by providing feedback (classifying resources into personal categories), users work for themselves: they do not improve an “invisible” profile of themselves at the search engine (with its privacy issues), but create an ordered document library for themselves, on their own computers where they have full control and no privacy concerns.

From a server-side point of view, we have already stated that the complex navigational structure of a web directory is an issue which cannot be mitigated: the structure will remain complex whatever we do. However, our system reduces the usability problem, as we no longer need the user to fully browse the structure "top down" and actively select a branch to follow at every level. Using the "research query" the user has developed by providing relevance feedback, our system proposes a most probable path from the top to a bottom node. The user can follow it, or select a different branch where he does not agree with the system. If the system uses a copy of its backend dynamic classifier to classify the research query, and if it is 95% accurate (which would be achievable with a document corpus controlled better than the Open Directory), the user will only have to make a correction in 5% of cases: we have made his task 20 times easier. Furthermore, since the system lists not only the "most probable" branch at every level but the "next probable" as well, we have a recommender system in place which suggests categories relevant to the user's current search context and displays search results in their own document context.


3.4.3 Expressive Queries

Unlike most other personalised search solutions, our system does not take the search engine for granted. Where other systems only assist the user with formulating a query (which is then treated as a normal query by the search engine), or filter/re-order incoming results, we propose to modify the search engine as well to take advantage of the profiling data the user side has accumulated. In our system, the search engine not only accepts limiting parameters (search within a particular category only), but also accepts a complex query with weighted terms. This query can express the user’s overall cognitive context as well as the specific search context for the current search; even if the user is searching for only one or two keywords, the system will “translate” that into a query of hundreds of words conveying the meaning of: “this user, who has vocabulary A, is searching for terms B in the context of category C”.

3.4.4 Exhaustive Exploration of a Topic

Unlike keyword search, exploration of the server-side web directory will provide exhaustive information on a topic. Unlike traditional web directories with limited resources (we have seen that the majority of web documents are not indexed by web directories at all), our directory will include a quantity of information similar to that of a major search engine. Even if the user then gets only a limited number of search results for a specific search, exploring the deeper levels of the hierarchy will allow him to reach all documents with a relatively small number of clicks (a search query is not even needed - it is optional). We thus give the user an opportunity to get comprehensive information, including documents he did not know existed and so would never have searched for. This also resolves the issue of limited recall in search engines (where some documents are not reachable by keyword search, even though they have been indexed by the search engine and the user is searching for terms contained in them): in our system, the user can a) search within a limited context (sub-category) which can be arbitrarily specific (small), or b) just browse to a comprehensive list of documents on a sub-sub-topic. In the latter case, the user is able to ascertain that the results are comprehensive (no further information on his search can be found) or non-existent (he is not finding anything not because of a wrong choice of query on his part but because information like this does not exist online): a major improvement over the current search model.

3.4.5 User-Specified Search Context

One scenario of using our system is when the user initiates a search from within his personal categorisation structure: the user browses to a category and then submits a search query from there. The system interprets this as: “find documents similar to these which contain these keywords”. The search query can now carry not only the keywords

but the context within which they are used, allowing the user to exploit his pre-built categorisation (contexts) and explicitly send its most relevant part to the exploration engine.

3.4.6 Busting the Filter Bubble

As discussed above, personalised search solutions using implicit feedback tend to create a “filter bubble” [Pariser, 2011], where users are unwittingly limited in the information they receive because the search engine thinks they do not want some of the available results. This problem is similar to the “cold start” issue in recommender systems, which initially lack data necessary to perform any meaningful recommendations. They usually address it by implementing some form of “collaborative filtering” based on data gathered from other users (which implies similarity between users, an approach we saw failing in geotargeting). In our solution, instead of assuming the user is similar to other users, we use a different approach: our system guides him through a hierarchical structure of documents where no documents are hidden from him; documents may be re-ordered due to explicit or implied preference, but all of them will be reachable, even those deemed not too suitable.

3.4.7 Information Scent

As already mentioned, Information Foraging theory [Pirolli and Card, 1995] suggests that people expect to follow a trail, an "information scent", leading them to the information they seek. If such a trail does not exist, they do not appreciate the information found, or tend to abandon the source altogether and look for a different one [Nielsen, 2006]. The exploration of our hierarchical structure will present such a trail through its increasingly specific results and the user's ability to further fine-tune the search process by providing document-level feedback. The incremental improvement of results with every small step is directly visible to the user and will compensate for the otherwise negative effect of requiring the user to invest more time in making the query more specific.

3.4.8 Recommendation Engine and Information Discovery

In many cases a traditional web directory is not able to return any results for a specific search, which is one of its serious disadvantages. In our system though, the user always gets some results; the search request is not treated as a typical search query where results have to match the query (returned documents should contain the query terms) - it is treated as a document and the system classifies it into a leaf node of the category hierarchy. The user will receive, as an answer to his request, a path through the hierarchy and a list of documents best describing the nodes along this path, even if these nodes and documents do not in fact contain any of the search terms. These will be “the most probable” best matches, and in text classification there is always a “most probable” result - even with

an empty query (in which case probability will depend on prior probability only, and the system will return the most popular categories as results). The effect of this feature is that the user will receive some recommended content even if his particular search is fruitless; this allows information discovery in cases where people did not search for something not because they would not want it, but because they did not know it existed.

From the point of view of the user, this will probably be the most important feature of our system. As already discussed, people's information-finding needs consist of information locating and information discovery. These needs are complementary, and the fact that one of them is perfectly covered by search engines does not mean the alternative is not needed. An objection usually made during presentations of this project was: "all right, but we do not need a new system to find information - we can already find anything on Google with the right search". Firstly, we have seen this is not factually true: there is an ever-growing number of documents behind the "decision boundary" for every query associated with them, which are not findable even with the correct queries describing them. More importantly though, the above argument assumes the user knows "the correct query", which in turn assumes that he knows what exactly he wants. However, if we accept that "in order to ask a question you must already know most of the answer", no current system addresses the issue of where the user gets most of the answer from before asking the correct query. We propose to build such a system, which would initially supply the user with the information needed to formulate the question, and then its answer.

3.5 Disadvantages of the Proposed Solution

It would not be fair to claim that the proposed system has only advantages over existing models and no disadvantages. As with most things in life, everything has a price and our system has to pay this price by incurring several major disadvantages in both its backend and the frontend.

3.5.1 Expensive Backend, Complexity

The cost of building our system will be considerably higher than the cost of creating a traditional search engine (it requires a traditional search engine, and then builds additionally on top of it) or a traditional web directory (again, we need a web directory as part of our system - moreover, one much larger than a traditional web directory).

The complexity of the system, which also includes distributed data sources (the users’ profiles, residing on their own computers), is also a significant added expense - both in the financial sense of the term and in the sense of added development challenges.


3.5.2 Complicated Frontend, User Investment

One major drawback of our system is the complexity of the frontend we have to present to end-users: instead of dealing with a “one input interface” (user just inputs one thing - the search engine everyone is accustomed to), we now have an interface consisting of:

• Simple keyword search (same as a traditional search engine).

• Directory browsing (same as a traditional web directory).

• Restricted keyword search (search within a category only).

• Assisted browsing of the web directory.

• Recommendation feature.

• Relevance feedback for suggested search results.

• User-side ontology creation and maintenance.

• Hybrid search starting from user-side ontology and going into server-side directory.

Clearly, learning how to use this system, and especially creating a meaningful user-side ontology, will take time. Users will have to invest heavily in the system before they can see its full benefits. For the casual user, it will make search considerably slower and in many cases will not be suitable at all (e.g. somebody looking for local pizza delivery will be better off with a traditional search engine). Our proposed system will be suitable mainly for "knowledge workers" who can afford to invest some of their time in order to receive better (re-)search results.

3.6 Specifics of the Exploration Engine

A more detailed description of our envisaged "Web Exploration Engine" follows. It describes a scenario where the user starts out with a vague idea of what he wants, so he just browses into the pre-made categories. The user can then (optionally) submit a query to tell the exploration engine what he wants. After receiving search results, the user can supply feedback on them and thus make his (re-)search more specific as it integrates the feedback into successive iterations of shown results.

3.6.1 Browsing the Directory

The most basic part of the Exploration Engine is a simple mechanism for browsing the directory, which practically replicates the Open Directory (see Figure 3.2): a static hierarchical structure. In this scenario, the user just clicks on links to categories and explores

them but does not use the personalisation tools provided (giving document-level relevance feedback or adding personal bookmarks to the list of documents provided by the directory).

As compared to the Open Directory, we have expanded this model with two additional improvements. First, the user has an additional sorting option: to see more typical documents first, where ordering is by the classifier scores for each instance. Sorted to the top are thus documents that (according to the classifier) match the category best, or in other words are more typical of and representative for it. The second improvement we had to make is related to the sparsity of the data. ODP contains an average of only 5.54 documents per category; browsing such categories is impractical for the end user, and training a classifier on only 6 instances is practically impossible. To address this, we "folded" categories to include all instances from descendant nodes as well (the same approach was used in other projects based on the Yahoo! Directory [Xue et al., 2008; Sieg et al., 2003], where category fragmentation is even worse).
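A minimal sketch of the "folding" step is given below; the tree representation is an assumption made for illustration, not the data structure of our actual implementation.

```python
# Illustrative sketch of "folding" a directory tree: every category's training
# set becomes the union of its own documents and those of all its descendants,
# so that sparse categories (ODP averages ~5.5 documents each) still yield
# enough instances to browse and to train a classifier on.

class Category:
    def __init__(self, name, documents=None, children=None):
        self.name = name
        self.documents = list(documents or [])
        self.children = list(children or [])
        self.folded_documents = []

def fold(category):
    """Recursively collect a category's own documents plus those of all descendants."""
    folded = list(category.documents)
    for child in category.children:
        folded.extend(fold(child))
    category.folded_documents = folded
    return folded
```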

Figure 3.2: User browsing the directory. Exploration has not been personalised yet: in the upper box (highlighted by bold rectangle) no relevance feedback has been added by the user. The keywords from the list in the bottom right box (highlighted) assist the user to formulate a search query and search either within the directory or at a number of external search providers. These keywords automatically emerge from the classifier data and are not supplied by the editor. (example with actual data)

Apart from manually classified documents, in every category we also display additional documents downloaded by our spider which have been added automatically to the category by the classifier. In a real-world implementation, this will strengthen the categories by adding several orders of magnitude more content to them. For example, the Business category of the Open Directory contains a total of 222,963 documents, with Financial Services within it having 12,662 entries; the online edition of the Financial Times4 alone has over 587,000 pages according to Google. If we added the content of all finance-related sites to our Financial Services category, there would be hundreds of millions of documents in it instead of just 12,662.

3.6.2 Exploring the Directory

The main feature of the system is its Exploration mode (see Figure 3.3). In this scenario, the user can start exploring the web by submitting an optional query in the form of a short text or a whole document. It is optional because the user can, alternatively, just start browsing the directory (i.e. - have an empty query); in both cases, this query can later be expanded. Results returned to the user come in the form of a path through the directory tree, together with some document listings. The user’s query is treated as a document and is passed through the same pre-processing filter that the backend classifier uses, then goes through the many levels of classifiers. At each level, the most probable category is returned as the answer but other categories are listed as well, in order of probability (i.e. - how well they fit the query). This gives the user a way to correct the system, by following not the suggested path through the directory but an alternative path; even if the classifier is correct though, the user can still “wander off” by clicking on an alternative link in order to explore areas not actually fitting the query, but still somewhat relevant to it. Together with each suggested category, the system also offers a number of documents from it. They can be ordered in a number of ways between which the user can choose:

• The default is Editor’s choice: a human has decided which documents are most important for a category.

• The Typical ordering lists documents by their fitness to the category (based on the classifier score); seeing the most typical documents of the category assists the user in deciding whether it is relevant to his query or not.

• Relevant documents first: this lists documents in order of relevance to the user’s query. It uses the Search mechanism (see below), performing a local search with the whole complex research query.

Next to each listed result are “action buttons” allowing the user to mark it as relevant, bookmark it, mark it as not relevant or report it to the editors as error/spam. Marking the document as relevant or not relevant to the user’s research adds to the research query: the system adds all keywords contained in the document to either the positive or the negative feedback component of the complex query vector.

4http://www.ft.com/


Figure 3.3: User exploring the directory. The system recommends a sub-category and shows some examples from it (an individual site is listed in the upper highlighted box; in front of its name are the relevance feedback tools: buttons to mark it as relevant, bookmark it, mark it as not relevant or report it to the editors as error/spam). In the middle box, the system lists other sub-categories at the same level, ordered by relevance to the query. Lower box: ordering options.

In a "real world" implementation of the system (coupled with a large search engine), users would be provided with a deep path through a large directory tree, seeing hundreds of nodes to choose from (both the "best fit" to their query and suggested "next best" alternatives). In effect it is a cascading classifier which recommends directory categories but also allows users control, in the sense that its recommendations can be overridden at any point, enabling true information discovery.

3.6.3 Searching in the Directory

Keyword search (Figure 3.4) is also available and uses the same mechanism which orders exploration results by relevance. It is based on an inverted index such as search engines use. This index is the inverted version of the document description table: instead of a list of all words contained in a document, it indexes all documents that contain a word.

In our model though, this is not a flat occurrence list with yes or no values (contains or does not contain) or number of occurrences. Instead, we have a normalised TF-IDF

weight for each word for each document, where we record the ratio between the word's TF-IDF value for that document and the average TF-IDF of all words in the document. Based on this we can compare the relative importance of words in documents and can say, for example: word w_i is twice as important for document D_j as for document D_k. We use these values to calculate the relative scores of each candidate document and order them by relevance to the search query. Every query term is also weighted by TF-IDF before we multiply it by this score.

In general, TF-IDF is used to discount common, or "stop", words and to emphasise distinctive words. However, a word is a stop-word or a distinctive word only in some context. The word software, for example, can provide a good decision boundary when splitting IT-related from non-IT-related texts, but once we have texts about software only, it stops being distinctive and should instead be treated as a stop-word. This is where our system introduces an innovation compared to existing systems. At the top level, the user receives search results from the whole document collection, ordered by global relevance. When a lower-level category is selected, we perform local search in it: over a partial document collection and ordered by local relevance. In effect, the user is provided with a series of increasingly specialised search engines allowing focused search in narrow pre-defined contexts. In our test prototype, with just two clicks the user can fragment the web into 6,368 pieces and search in just one of them. In a real-world scenario, this could easily scale to millions of such fragments, allowing the user to select from millions of different contexts with only several clicks.
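The two ideas just described, normalising a word's TF-IDF weight against the document's average and re-scoring inside a category-local collection, can be sketched as follows; the exact IDF variant and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, n_docs):
    """Plain TF-IDF weights for one document against a given collection
    (doc_freq maps word -> number of documents containing it)."""
    tf = Counter(doc_tokens)
    return {w: tf[w] * math.log(n_docs / (1 + doc_freq.get(w, 0))) for w in tf}

def normalised_weights(doc_tokens, doc_freq, n_docs):
    """Each word's TF-IDF divided by the document's average TF-IDF, so we can
    say e.g. 'this word is twice as important for this document as average'."""
    weights = tfidf_vector(doc_tokens, doc_freq, n_docs)
    avg = sum(weights.values()) / len(weights) if weights else 1.0
    return {w: v / avg for w, v in weights.items()}

def score(query_weights, doc_norm_weights):
    """Relevance of one document to a weighted query."""
    return sum(qw * doc_norm_weights.get(w, 0.0) for w, qw in query_weights.items())

# "Local" search simply recomputes doc_freq and n_docs over the documents of the
# selected category only, so a globally distinctive word such as "software" can
# behave like a stop-word inside an IT-only sub-collection.
```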

Searching in our engine can be initiated in two ways: end-user submitting a query to the server (same as in a traditional search engine), and end-user submitting a query to his personal assistant agent. In the latter case, the query will be subjected to some processing before being submitted by the agent to the server: namely, the search terms will be weighted depending on the context (category) from which the user submitted the search, and in some cases disambiguation terms will also be added. This provides the current search context of the user, in a simple form which does not necessitate server-side processing: the term vector is pre-weighted by the user-side agent and is then treated as a normal query vector by the server, same as those submitted directly to it.

3.6.4 Query and Query Expansion

Figure 3.4: Advanced search in the directory. The user has already supplied some positive and negative feedback (in the upper highlighted box). A positive example appears in the results list and is displayed in a bold font (personalised presentation) in the lower highlighted box.

A major feature of the system is that the user's query can be expanded by relevance feedback. In the result list the user can mark some results as either relevant or not relevant and add them to the query. When the user clicks on a button next to a displayed result, all the (weighted) keywords contained in that document are added to the query. The user also has a list of the added documents, which can be edited. This is a one-click query expansion where the user is not required to enter any keywords - the system supplies them instead. Since some of the added documents are positive (relevant) and others are negative (not relevant) examples, the query is necessarily a complex object with positive and negative components which have to be weighted. We do not use RSJ weighting [Robertson and Jones, 1976] because it assumes binary index descriptions of documents and we would prefer a more accurate document representation vector. Instead, we use a Rocchio relevance feedback weighting scheme [Rocchio, 1971]:

Q = αQ_u + βQ_p − γQ_n    (3.1)

where the query Q fed to the classification algorithm consists of the original user query Q_u plus the query vector Q_p from positive feedback (documents the user marked as relevant) minus the negative query Q_n (documents marked as non-relevant). Those vectors are calculated as Q_p = (Σ D_p) / c_p and Q_n = (Σ D_n) / c_n, where D_p and D_n are the respective document vectors and c_p and c_n are their cardinalities. α, β and γ are weights signifying how important each component is. The defaults are α = 1 and β = γ = 0.5, i.e. the original query is as important as the whole relevance feedback, and the positive and negative feedback components are equally important. Users can adjust these values on a case-by-case basis though. Unlike the ARCH architecture [Sieg et al., 2003], our query expansion works on the document level and not the category level. Where ARCH adds a feature vector derived from the "concept" (category), we only add the feature vector for a particular document


Figure 3.5: Normalised weights of query terms. The initial two-word query was expanded with the term vectors of documents marked as relevant or not relevant (since there is more than one document, we have an IDF component hence their values are not integer). As the user moves from Business into the more specific Transportation and Logistics category, the query gets re-ordered by local importance of terms. (example with actual data)

(see Figure 3.5). This gives control to the user and is not only more precise (it provides several orders of magnitude more granularity), but is also realistic. Documents contain a couple of hundred words each, while a category has hundreds of thousands (we derived 127,830 unique terms from our fairly small sample of the Business category of the Open Directory; a real-world scenario would involve far more): clearly, an expansion adding hundreds of thousands of terms would dilute the user's initial query so much as to make it irrelevant, not to mention the computation costs.
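A compact sketch of the Rocchio combination in Equation 3.1, operating on the sparse word-to-weight dictionaries assumed in the earlier sketches (again an illustration rather than the prototype's exact code), follows:

```python
# Equation 3.1: Q = alpha*Qu + beta*Qp - gamma*Qn, where Qp and Qn are the
# averages of the positively and negatively marked document vectors.

def average(vectors):
    """Component-wise average of a list of sparse term vectors."""
    acc = {}
    for vec in vectors:
        for word, weight in vec.items():
            acc[word] = acc.get(word, 0.0) + weight
    return {w: v / len(vectors) for w, v in acc.items()} if vectors else {}

def rocchio(q_user, positive_docs, negative_docs, alpha=1.0, beta=0.5, gamma=0.5):
    """Expanded research query: original query plus weighted relevance feedback."""
    q_pos, q_neg = average(positive_docs), average(negative_docs)
    words = set(q_user) | set(q_pos) | set(q_neg)
    return {w: alpha * q_user.get(w, 0.0)
               + beta * q_pos.get(w, 0.0)
               - gamma * q_neg.get(w, 0.0)
            for w in words}
```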

Also, unlike incremental relevance feedback [Allan, 1996], we do not limit the size of any of the query components; that limitation was intended to accommodate streaming filters, not a large-scale classifier like ours. The price we have to pay is slower processing, but - as discussed - we prefer to make no assumptions on behalf of the user, so we do not assume the user wants a "fairly good" answer fast5; our system will, by default, not make this compromise. The user, though, has the option to select a different, faster algorithm which will tilt the compromise towards faster but less accurate search.

Our research query is very different from a standard search engine query, in that a) it can be arbitrarily long (there is no theoretical upper bound, though there may be a soft limit imposed for optimisation purposes; in practice, it would probably still be smaller

5A pioneer of modern search engines asked [Kleinberg, 1999]: “Although there would be considerable utility in a search tool with a longer response time, provided that the results were of significantly greater value to a user, it has typically been very hard to say what such a search tool should be computing with this extra time.” We hope to have found the answer.

than a concept-based expanded query), and b) it is not a flat list of words but an array of the type word|weight, where weight values can also be negative as a result of negative feedback. This query is treated as a constructed document which summarises what the user is searching for; we call this a research query to distinguish it from standard search queries. This dummy document is passed through the hierarchical classifier structure, where all its terms are taken into account when calculating the result; web documents returned as results, though, do not need to contain all or even most of the terms - they are simply those that best relate to the query.
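To show how such a research query could be routed through the category hierarchy, a minimal sketch follows; the classifier interface and the node structure (which could be the Category nodes of the folding sketch above) are assumptions, and any probabilistic text classifier could fill the classify() slot.

```python
# Illustrative cascade: the research query (a sparse word -> weight vector,
# possibly with negative weights) is treated as a document and pushed down the
# category tree. At every level we keep the full ranked list of children, so the
# user can follow the suggested branch or override the system at any node.

def rank_children(node, query, classify):
    """classify(category, query) -> score, higher meaning a better fit."""
    scored = [(classify(child, query), child) for child in node.children]
    return [child for _, child in sorted(scored, key=lambda pair: -pair[0])]

def suggest_path(root, query, classify):
    """Return, level by level, the ranked alternatives; the first entry of each
    level is the system's suggested branch, which is followed by default."""
    path, node = [], root
    while node.children:
        ranked = rank_children(node, query, classify)
        path.append(ranked)
        node = ranked[0]
    return path
```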

3.6.5 Enhancements

Some other improvements we have introduced can be considered usability enhancements rather than real innovation, but they help our exploration engine become user-centric and not server-centric like the currently dominant model. Such is the saved researches feature, which allows users to save and later re-use research queries they have developed. Each query may include search terms, relevance feedback, a start category (the user can skip the top levels of the directory and start at a specific node) and related bookmarks. In terms of collaboration, we have implemented a feature which allows users to make their researches public. Other users can load such research sessions, modify them and then save them as their personal researches.

In future, we will also need user collaboration to help develop the web directory itself. Our experiments with Open Directory Project data show that the amount of freely available labelled data is not enough to train a realistic hierarchical classifier for the volumes of documents we want to categorise. Furthermore, even if we did train our classifier on such data, it would need constant re-tuning and corrections in the future due to the constantly changing nature of the web.

One way to tackle this is to employ editors to classify resources, which is not economically attractive. An alternative approach is to use volunteer editors, but the current state of the Open Directory Project shows that this does not work too well. We propose a collaborative filter instead, where every user will be a small-time editor. By providing feedback on whether a document is a good fit for a directory category or not, millions of users can easily help train our classifiers to a more satisfactory level. This feature has not been implemented in our prototype yet; we have only provided a feedback mechanism for users to report errors or spam, and these are manually verified by us before a change in classification occurs.
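As a small illustration of the saved researches feature described above, a hypothetical record for one saved (and optionally shared) research follows; the field names are assumptions and not the prototype's actual schema.

```python
from copy import deepcopy
from dataclasses import dataclass, field

@dataclass
class SavedResearch:
    """Hypothetical saved research session: the developed query, the relevance
    feedback behind it, an optional start category and related bookmarks."""
    title: str
    query_terms: dict                                   # word -> weight
    positive_docs: list = field(default_factory=list)   # URLs marked relevant
    negative_docs: list = field(default_factory=list)   # URLs marked not relevant
    start_category: str = ""                            # skip the top directory levels
    bookmarks: list = field(default_factory=list)
    public: bool = False                                # shared researches can be re-used

    def fork(self, new_title):
        """Independent copy, as when another user adapts a shared research."""
        copy = deepcopy(self)
        copy.title, copy.public = new_title, False
        return copy
```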


4 Related Work

4.1 Summary

In this chapter, we review related prior work in several fields pertinent to various aspects of the system we want to build. These are not related to each other, hence the placement of this chapter here (as opposed to the beginning of the thesis, as is traditional), where the logic of building our solution dictates the exploration of these techniques. We review methods of user modelling, matching user models to ontologies and using machine learning to build web directories. We also explore clustering techniques and a special case of neural networks - the Self-Organising Map, as these seem to be a logical alternative to creating a web directory. Traditional methods of text classification are reviewed as well, including the Multinomial Naïve Bayesian classifier.

4.2 Building the User Model

User models can be based on data collected by either implicit or explicit feedback. Most research projects and industry developments have concentrated on collecting implicit feedback ([Liu et al., 2002; Lee et al., 2005; Birukov et al., 2005; Yang and Jeh, 2006; Qiu and Cho, 2006; Das et al., 2007]), mainly with the motivation that the system should not be intrusive [Yang and Jeh, 2006].

Some of these methods have long been established and are widely used; for example, Google acquired Urchin Software Corp.1 in 2005 (and later renamed the service to Google Analytics2) in order to get behavioural metrics: the software (through JavaScript code installed on a large number of web sites) collects comprehensive data about what the user did at the visited site. A similar set of user behaviour data is collected by the Facebook "Like" button, which is displayed on many sites and is not in fact a button but an active script which collects (personally identifiable) users' browsing data, whether the user clicks on the "button" or not (e.g., if you visit any site displaying such a button, Facebook learns that you visited it and gets several other parameters from your browser - without any action on your part, no matter whether you are a Facebook user or not).

Some other methods, such as anything based on sensor data, are still in the proof-of-concept stage; a laboratory example of using sensor data is gaze-based keyword extraction [Buscher et al., 2008], where specialised hardware follows the user's gaze while the user is reading a document, then the information is analysed and the system finds which words attracted the most attention. This is then exploited to suggest new search terms for query expansion; another possible application is to assist in training a personal document classifier, which trains only on terms which the user finds important.

It has to be taken into account though that implicit feedback is noisy and prone to errors. Some sources of errors in implicitly collected data are:

• Ambiguous identification of the user (one person may use several devices to access the web; several persons can share the same device/browser), leading to leaking of information between unrelated user profiles, or missing information.

• Search history is a limited way of expressing interest: if the user has a regular source of some information and visits it through a bookmark, he may never search for it so it will not be reflected in search logs at all.

• One person may use a number of different information-finding solutions (search engines, web directories, recommendations through social networks), so his search history is spread over many sources.

• User behaviour after search results are shown cannot always be interpreted correctly. For example: short stay at a site may mean it was bookmarked for later reading, or the user was looking for a quick fact and got it at a glance; long stay may mean the user just walked away from the computer, or opened the page in one browser tab and then switched to another tab to read something else before returning to this one; not clicking on a result does not mean it is not relevant: the user may already be familiar with that resource, or may have looked for information which was in the description snippet, so did not need to click on the link.

1http://www.google.com/intl/en/press/pressrel/urchin.html 2http://www.google.com/analytics/


• Click patterns are subject to a positive feedback loop with the search engine: people click top-ranked results more, so they get enhanced as “more relevant” through this implicit feedback although it just means that the users trusted the search engine. Furthermore, the click may not have come from a person at all - it may have been an automated tool (a “bot”).

• Browsing history and search history have serious privacy implications, so many people try to actively mask their browsing and search trails using various tools (browser add-ons such as "Google Sharing"3, TOR4 etc.).

Apart from web-related activities, a system could also use the "personal information repository" of the user (the personal collection of text documents, emails, cached web pages, etc.) [Chirita et al., 2007] as a basis for his "cognitive context". The user's browser/operating system "fingerprint" (installed software and software settings) is also routinely utilised, as well as the device's IP address, which allows geographic targeting ("geotargeting"). Regardless of how information for the user profile is collected, another issue is its representation. The profile can be expressed using:

• Unstructured models: bag of words or vector-based models.

• Structured models, such as a conceptual (taxonomy or ontology based) hierarchy or graph.

The bag of words user model was one of the first used to express user interests and cognitive context [Lieberman, 1995]. It is based on the user’s browsing history (typically collected by a user-side agent running on top of the user’s web browser), where the user’s context is presumed to be related to the terms contained in all accessed documents. The model consists of a collection of documents viewed by the user, compressed into a “bag of words”: a plain list of word occurrences. It could also be expressed as a term vector, which is derived from the bag of words by some transformation or weighting scheme (e.g. - TF-IDF). Conceptual user models are usually based on external resources (existing taxonomies or ontologies). Apart from collecting information that represents the user’s interests, they need a mapping mechanism to connect this information to the external resource. One such resource which is often used [Liu et al., 2004; Daoud et al., 2010; Ma et al., 2007; Sieg et al., 2007] is the Open Directory Project with its extensive hierarchy, representing a form of simple ontology. Relationships in it are of the kind “A is a subset of B” and the concept hierarchy is a tree; there are additional links between categories though (“see also”, “the

3http://www.googlesharing.net/ 4http://www.torproject.org/

same category in other languages") which can be used to build a graph [Daoud et al., 2010]. Another resource is YAGO (Yet Another Great Ontology) [Suchanek et al., 2008], where a large ontology is extracted from WordNet5 and Wikipedia texts; facts are represented as knowledge triples expressed in first-order logic: relation(entity1, entity2); relations are mapped and disambiguated into WordNet concepts, creating a Directed Acyclic Graph. Relationships between entities in a conceptual taxonomy can also be weighted [Gauch, 2010], giving the model higher precision.

A further development is an agent like Miology [Speretta and Gauch, 2009], which applies semi-automatic ontology construction: it starts with a number of pre-made ontologies, then attempts to match and use the best-fitting one for every use case. It also evolves ontologies as it monitors the user's web browsing history and adds new words to their concepts (derived from documents viewed by the user), and allows user personalisation of the concept structure.

4.2.1 Knowledge Representation and Ontology Matching

The usual approach to building a model of user knowledge and preferences and then applying them to web search has traditionally been to create an integrated mechanism either at the user side or at the search engine side. This mechanism models the user and then uses this (internal to itself) model to apply some sort of personalisation. The first part of its task is to define an “area of knowledge” matching some user interest. The problem as a whole is not new - knowledge representation has been a major part of AI for decades [Minsky, 1974]. Many frameworks exist to define and build an ontology. Most proposed solutions though use pre-built ontologies: the Open Directory [Liu et al., 2004; Daoud et al., 2010; Ma et al., 2007; Sieg et al., 2007], Yago [Suchanek et al., 2008], Schema.org6 (used by search engines Bing, Google, Yahoo! and Yandex) etc.; an interesting idea is also to use a number of ontologies and select the relevant one on a case-by-case basis [Speretta and Gauch, 2009].

The Open Directory is a simple hierarchical classification tree built by many volunteer editors. It is informal and crude, but extensive and freely available, together with examples of documents belonging to each category, which assists the machine learning tasks associated with using it. Schema.org, on the other hand, is a formal ontology being built and used by the major players in the search engine space. It is targeted mainly at search engines, the idea being that web site owners will voluntarily tag their content with the relevant ontology definitions to “tell” the search engines what their documents are about. The scheme is still under development, however, and is not yet widely adopted.

In both these contexts (ontology being used by a personalisation solution or a search engine), ontology matching is not an issue: the ontology is either being used internally by an application, or is being referenced by a strict protocol, e.g. RDF (Resource Description Framework).

5 http://wordnet.princeton.edu/
6 http://schema.org/

4.3 Building the Web Directory

Although no major web directory is currently based on machine learning, the tasks needed to create one have been treated, for one reason or another, by the research community and various industries for decades. A brief review of some of the approaches we can take follows.

4.3.1 Data Pre-processing

Before starting a learning process, any algorithm needs to first obtain the data in a form it can work with. Pre-processing of documents obtained by the web crawler into such a form is probably the most important aspect of any preliminary work.

There are a number of different levels of pre-processing that can occur. First, when a document is downloaded from the web, we have to take into account that, as already discussed, a considerable part of that document is formatting, meta-data and common site elements. Formatting and meta-data can easily be removed, but the “common site elements” are a huge problem. By “common site elements” we mean the parts of a document that represent a web site’s header and footer, site navigation, advertising, “see also”-type links to other parts of the site etc. - generally speaking, things that are part of an HTML document but not part of its “meaning” or “what it is about” as such. Search engines employ various filters to remove common site elements, based on correlational analysis of documents from the same site as well as many heuristics. These filters are highly sensitive commercial secrets, as they are to a large extent “adversarial”: they strive to remove content which in many cases has been put there deliberately by site owners with the intent to trick them.

Document text is then parsed by a tokenizer, which splits it into words and normalises them according to some rules. For example, it can force lower case only, so that Word becomes the same as word, or flatten special characters so that niña becomes the same as nina (which apparently has some drawbacks). The tokenizer can also apply rules for ignoring certain “tokens”: words containing numbers, or words that are too long or too short, could be removed. Rules can be applied to phrases too. Such rules are usually based on heuristics and are strongly language-dependent.
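As an illustration only (these are not the rules of any particular engine), such a tokenizer could be sketched as follows; the length limits are arbitrary example values and accent flattening is not applied:

import re

def tokenize(text, min_len=2, max_len=30):
    """Split text into lower-cased word tokens and apply simple filtering rules."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letters only; digits are not kept as tokens
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(tokenize("The Word appeared 3 times in the niña's text"))
# ['the', 'word', 'appeared', 'times', 'in', 'the', 'niña', 'text']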

Data transformation can also occur: data can be saved and used raw, or can be transformed before being used, for example by scaling, normalisation or smoothing [Kantardzic, 2002].

Scaling, as the name suggests, changes the scale of the measurements. The most popular method uses TF-IDF (term frequency - inverse document frequency) scaling: each element in the word/document matrix is proportional to the number of times the word appears in the document (its term frequency), multiplied by the (logarithmically scaled) inverse document frequency:

\[ \mathrm{TF\text{-}IDF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \cdot \ln\frac{N}{df_i} \qquad (4.1) \]

where $n_{i,j}$ is the number of occurrences of the term $t_i$ in document $d_j$, the denominator is the sum of the occurrences of all terms in document $d_j$, $N$ is the number of documents in the collection and $df_i$ is the number of documents that contain the term $t_i$. This “up-weights” rare terms to reflect their high relative importance, and discounts common words as not distinctive. Normalisation puts all measurements within a limit - otherwise, distance measures for features with larger average measurements would outweigh those with smaller values. Smoothing, on the other hand, compresses data by recording it in a (smaller) number of possible values (e.g. discretisation). This can be preferable to normal TF-IDF weighting in some cases, because with the constant addition and removal of documents in the collection, TF-IDF values for words would vary over time, which would introduce a temporal shift in the work of processing algorithms.
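A direct, unoptimised sketch of this weighting, assuming the documents have already been tokenised, could read:

import math
from collections import Counter

def tf_idf(docs):
    """Compute the TF-IDF weights of equation (4.1) for a list of tokenised documents."""
    N = len(docs)
    df = Counter()                    # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # denominator: all term occurrences in the document
        weights.append({t: (n / total) * math.log(N / df[t]) for t, n in counts.items()})
    return weights

docs = [["web", "search", "engine"], ["web", "directory"], ["search", "personalisation"]]
print(tf_idf(docs)[0])  # e.g. {'web': 0.135, 'search': 0.135, 'engine': 0.366} (rounded)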

4.3.2 Data Representation

In modern information retrieval, many of the models used are derived from the so-called “bag-of-words” model [Baeza-Yates and Ribeiro-Neto, 1999], which represents a document as a collection of words, ignoring their internal relationships (e.g. their order). The model rests on two underlying assumptions:

• Probabilities of occurrence of the different words are independent of each other, and

• The order in which words appear in the document is not important.

Words in a document are considered random variables, which in this model are exchangeable. This is a drastic simplification, considered necessary in information retrieval because of computational constraints. In large documents in particular the independence assumption does not hold: by Zipf’s law, a large number of words appear in a text only a few times while a few words occur orders of magnitude more often [Zipf, 1949], the explanation being that authors employ already used words with a probability proportional to the current word frequency. However, it has been found in practice that viable information retrieval systems can be built while ignoring this and using the independence assumption. Using n-grams or a Markov chain instead of the “bag-of-words” model would preserve some of the document structure and increase classification accuracy in some cases.

Text is represented as a number of patterns where, for example, “the quick brown fox” with n = 4 would be recorded as:

• the

• the quick

• ...

• the fox

• ...

• the quick brown

• the quick brown fox

Unfortunately, this approach buys its better classification success rates at the cost of an exponential increase in storage requirements and processing time (every document classified has to be matched against every existing n-gram in the training set), making it extremely expensive to use for large-scale data.
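For illustration, contiguous n-gram extraction can be sketched as below; note that the listing above also includes non-contiguous patterns (such as “the fox”), which enlarge the pattern space even further.

def ngrams_up_to(tokens, n):
    """Return all contiguous word n-grams of length 1..n."""
    grams = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[i:i + size]))
    return grams

print(ngrams_up_to(["the", "quick", "brown", "fox"], 4))
# ['the', 'quick', 'brown', 'fox', 'the quick', 'quick brown', 'brown fox',
#  'the quick brown', 'quick brown fox', 'the quick brown fox']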

4.3.3 Indexing

The simplest form of indexing documents is to create an occurrence matrix, recording which words are seen in a document (and possibly how many times). In the case of search engines, this is in fact an “inverted” index (also called a “posting list” [Google, 2011a]), in the sense that the key is the word (or term), against which are indexed all documents that contain it. From a theoretical point of view this makes no difference - it only matters for the actual implementation. Since documents usually contain only a small part of the overall vocabulary, this matrix is typically extremely sparse and has to be compressed using various techniques. Inverting the index is computationally very expensive and is usually performed by distributed computing based on MapReduce [Dean and Ghemawat, 2010] or a similar algorithm.

Search engines also keep a full-text index, which is practically a copy of the actual document, plus additional indices. This is used in full-text searches, where the exchangeability assumption of the bag-of-words model is not acceptable - i.e., the order of words in the text does matter (for example, the user searches for a particular phrase). Some ranking algorithms also make use of the distance between terms in the text, giving preference to documents where the search terms are closer to each other (though not necessarily as a precise phrase).

When retrieving documents matching a query, the terms from the index are usually weighted as well. Normally the TF-IDF measure is used, but in some practical cases this has unexpected consequences.

An initial problem of Google, for example, was that, given a one-word query, it tended to return “documents” containing that one word only (after all, this is a 100% match). This necessitated an additional length-normalisation of documents, where weights are corrected to penalise shorter (or extremely long) documents. This is clearly a rather arbitrary measure, designed to favour documents of a length deemed desirable, a length that is chosen subjectively. To circumvent this inherent problem, Google had to resort to methods external to the documents and their contents, such as PageRank or the indexing of documents against keywords contained in links pointing to them.
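In its simplest form (ignoring compression, term positions and distributed construction), such an inverted index can be sketched as:

from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its posting list: the documents containing it and the in-document counts."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {1: ["web", "search", "engine"],
        2: ["web", "directory", "and", "web", "search"]}
index = build_inverted_index(docs)
print(index["web"])  # {1: 1, 2: 2} -- documents containing "web", with term counts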

An alternative indexing scheme is Latent Semantic Indexing [Deerwester et al., 1990], which indexes documents not against words but against concepts. While still treating documents as bags of words, this model is more precise as it breaks these “bags” into smaller chunks, i.e. it models the document in more detail. Since the resulting features are orders of magnitude fewer than the initial features (words), this is also considered a form of dimensionality reduction (see below).

4.3.4 Dimensionality Reduction

A major issue with text classification is the high dimensionality of textual data, which is even more pronounced on the web, where data sources are not “controlled” and have a large variety of vocabularies (many languages, dialects, regional variations of languages, different spellings, topic-specific terminology etc.) and many “impurities” (typing errors, slang words, personal and other names, intentional noise etc.). The typical representation of a document as a vector where each dimension is a word means that the number of dimensions is equal to the number of words in this extremely large combined vocabulary. Clearly, reducing this number would increase the tractability of all related tasks.

The usual approach is to apply dimensionality reduction, for which a number of solutions exist. Dimensionality reduction is a form of lossy compression which trades precision for lower computation cost. In theory, it has the effect of smoothing the data and eliminating at least some noise. However, it also loses data that might otherwise assist the learning algorithm, so it results in decreased accuracy [Lang, 1995]; in most cases it compensates for this by a great reduction in complexity, and hence computation cost. Dimensionality reduction can happen by feature selection (selectively using only some features, or words in the vocabulary, where the selection depends on some heuristic) or by feature construction.

Feature selection can be based on some analysis of the data, or on domain knowledge - i.e., domain experts applying knowledge which is external to the data itself. Analysing the data can allow us to remove features which are not useful, such as:


• Rare words, which are not statistically significant enough to assist a learning process such as an automated classifier.

• “Stop words”, or words too common to be distinctive enough to be of any use to an automated classifier.

Dimensionality reduction by feature construction (or feature composition), on the other hand, consists of artificially constructing a smaller number of new features from the existing ones.

Feature composition can utilise a human expert to derive one new feature from several original features through some transformation, depending on the domain and the expert’s knowledge of it. For example, “start date” and “end date” might be transformed into “duration” if the actual start and end of an event are not important but its duration is.

Most methods, though, are based on algorithmic analysis of the underlying data. One of them is Minimum Description Length (MDL), used in NewsWeeder [Lang, 1995], the first automated classifier for online data (not “web” data, as the web did not exist yet: it classified newsgroup postings). MDL is a simple probabilistic model which encodes documents based on Bayesian probability analysis and weighting of terms in order to decrease entropy. Other methods that can be used are unsupervised clustering such as k-means performed over the dictionary of the document collection (the resulting clusters are the new features); Latent Semantic Indexing (LSI) [Deerwester et al., 1990], transforming the occurrence matrix into a relation between the terms and some concepts and a relation between those concepts and the documents; Semantic Hashing [Salakhutdinov and Hinton, 2007]; or even a random projection of the high-dimensional word vector onto a much lower-dimensional space [Kaski, 1998].

There are many advanced techniques used in different situations. One of them is Principal Components Analysis (PCA) [Pearson, 1901], which is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system so that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. When used for subsequent clustering or classification, the main limitation of PCA is that it does not consider class separability since it does not take into account the class label of the feature vector. PCA simply performs a coordinate rotation that aligns the transformed axes with the directions of maximum variance; there is no guarantee that the directions of maximum variance will contain good features for discrimination.

Singular Value Decomposition (SVD) is another method for dimensionality reduction. It decomposes a matrix $A$ as $A = Q_{n \times r} \, S_{r \times r} \, L^{T}_{r \times m}$, where:

• r < min(n, m)


• $Q$ is an $n \times r$ matrix whose columns are orthogonal singular vectors of $(AA^{T})_{n \times n}$

• S is a diagonal r × r matrix with singular values sorted in descending order on the main diagonal.

• $L$ is an $m \times r$ matrix whose columns are orthogonal singular vectors of $(A^{T}A)_{m \times m}$.

The values of the diagonal matrix $S$ are then used as a vector with $r$ dimensions which, as stated above, is much smaller than the original number of dimensions. PCA is usually applied to numerical data, but can also be used for analysing text after its transformation to a term-document matrix (in a web context, it has been used to cluster search queries [Hosseini and Abolhassani, 2007]).
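As a rough numerical sketch (using NumPy on a toy term-document matrix), a truncated SVD reduces each document to r values:

import numpy as np

def reduce_dimensions(term_doc_matrix, r):
    """Project the documents onto the top-r singular directions (truncated SVD, as used by LSI)."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return np.diag(s[:r]) @ Vt[:r, :]  # each column is an r-dimensional document representation

A = np.array([[2., 0., 1.],   # rows: terms, columns: documents
              [0., 1., 0.],
              [1., 0., 2.]])
print(reduce_dimensions(A, 2))  # each document is now described by 2 values instead of 3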

Probabilistic Latent Semantic Analysis (PLSA), also known as Probabilistic Latent Semantic Indexing (PLSI) [Hofmann, 1999], evolved from LSA. While LSA is primarily based on linear algebra, PLSA is based on a mixture decomposition derived from a latent class model. Considering observations in the form of co-occurrences (w, d) of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions:

\[ P(w, d) = \sum_{c} P(c)\, P(d \mid c)\, P(w \mid c) = P(d) \sum_{c} P(c \mid d)\, P(w \mid c) \qquad (4.2) \]

The first part is the symmetric formulation, where $w$ and $d$ are both generated from the latent class $c$ in similar ways (using the conditional probabilities $P(d \mid c)$ and $P(w \mid c)$); the second part is the asymmetric formulation where, for each document $d$, a latent class is chosen conditionally on the document according to $P(c \mid d)$, and a word is then generated from that class according to $P(w \mid c)$. Although words and documents are used in this example, the co-occurrence of any pair of discrete variables may be modelled in exactly the same way.

PLSA has been reported to have severe overfitting problems [Blei et al., 2003b] and a hierarchical Bayesian model was proposed instead [Blei et al., 2003a], from which Latent Dirichlet Allocation (LDA) can be derived. The difference from LSA and PLSA is that LDA assumes a Dirichlet prior distribution of probabilities. Learning the various distributions (the set of topics, their associated word probabilities, the topic of each word, and the particular topic mixture of each document) is a problem of Bayesian inference, which can be carried out using variational methods (or with Markov Chain Monte Carlo methods, which tend to be quite slow in practice).

Another method for dimensionality reduction is the so-called autoencoder. Autoencoders are a type of multi-layer neural network, which has been shown to outperform Latent Semantic Analysis for text classification and retrieval tasks [Hinton and Salakhutdinov, 2006] in some cases. An autoencoder with a linear hidden layer is practically equivalent to PCA: upon convergence, the weight vectors of the K neurons in the hidden layer form a basis for the space spanned by the first K principal components. Unlike PCA, though, this technique does not necessarily produce orthogonal vectors. Basically, when used on a document collection, autoencoders perform a form of semantic hashing by mapping documents to a fixed number of outputs (the concepts contained in the documents) [Salakhutdinov and Hinton, 2007]. The outputs are binary - either one or zero - and more than one can be positive (unlike other, winner-takes-all neural nets). Thus, each document is coded with a binary sequence (something like 10011101...), which can be compared to every other document’s hash extremely efficiently by binary computation based on Hamming distance, measured as the number of bits that differ; this is useful when anticipating many queries of the type “find documents similar to this one”.
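A sketch of this kind of comparison, with made-up 8-bit hash codes, could be:

def hamming_distance(code_a, code_b):
    """Number of differing bits between two binary hash codes (given as integers)."""
    return bin(code_a ^ code_b).count("1")

doc_hashes = {"doc1": 0b10011101, "doc2": 0b10011001, "doc3": 0b01100010}
query_hash = 0b10011111
ranked = sorted(doc_hashes, key=lambda d: hamming_distance(doc_hashes[d], query_hash))
print(ranked)  # ['doc1', 'doc2', 'doc3'] -- ordered by similarity of hash to the query document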

The above methods presume a static document collection, whose statistics they can calculate at the beginning and use thereafter; this does not match our scenario of an ever-changing document collection (the web) and classification structure. An online (incremental) method has been proposed [Gorrell, 2006] which processes only one instance at a time and whose results converge to those of LSA, i.e. it accounts for streaming addition of documents to the collection. However, it still does not offer a solution to documents being removed from the corpus, or to a gradually evolving classification.

Some related methods are those dealing with concept clustering, for example Tree-Traversing Ant (TTA) [Wong et al., 2006]. For clustering, it solves the problem of calculating similarity between terms by using Normalised Google Distance [Cilibrasi and Vitanyi, 2007] and n° of Wikipedia (n°W). Other algorithms use WordNet to get concept definitions. Similarity between terms is usually based on their co-occurrence in documents of the collection, defined as:

\[ tc(t_1, t_2) = \frac{|D_{t_1} \cap D_{t_2}|}{|D_{t_1} \cup D_{t_2}|} \qquad (4.3) \]

where $D_{t_1}$ and $D_{t_2}$ are the sets of documents in which the terms $t_1$ and $t_2$ occur, respectively. To combat term ambiguity, it was proposed to also use other information from Wikipedia, such as incoming and outgoing links of a web page [Nakayama et al., 2007].
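In code, this measure amounts to a Jaccard coefficient over the sets of documents in which the two terms occur, e.g.:

def term_cooccurrence(docs_with_t1, docs_with_t2):
    """Co-occurrence similarity of equation (4.3), given the document sets of two terms."""
    union = docs_with_t1 | docs_with_t2
    return len(docs_with_t1 & docs_with_t2) / len(union) if union else 0.0

print(term_cooccurrence({1, 2, 3, 5}, {2, 3, 7}))  # 2 shared documents out of 5 -> 0.4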

However, it has to be noted that these latter methods are feature reduction techniques based on the features of a large external data source (Google, Wikipedia, WordNet) instead of the data actually analysed, and build a sort of generic concept space which is then applied to the data.

Self-Organising Maps also achieve dimensionality reduction; however, since they do much more, we discuss them separately next.


4.3.5 Unsupervised Clustering, SOM

Clustering, in general, is the classification of items from a data set into different groups, partitioning them into subsets (clusters) so that the items in each cluster are similar to each other, with similarity defined in terms of proximity according to some defined distance measure. Unlike classification into pre-defined categories, these clusters emerge from the data.

Since clustering creates some order in a collection of documents and involves a minimum of human intervention (a huge plus compared to classification, whether manual or assisted), we decided to explore clustering first and see whether some form of clustering algorithm would be suitable for our task. To find out what is suitable, we must first establish what we want the algorithm to achieve in our context, what data is available to it, and at what cost it will perform its task:

• The algorithm should result in an easy-to-browse collection of documents.

• We prefer an algorithm which is not too computationally expensive and is able to handle very large volumes of data.

• High precision of results is not important.

• Data often changes between iterations, which should not lead to starting processing from scratch: the algorithm should be incremental.

Most clustering algorithms suffer from some form of inductive bias [Alpaydin, 2004], which comes from having to pre-select either the number of clusters or the error tolerance (vigilance) of the clusters, defined by the maximum allowable distance between the cluster centre (centroid) and the instances classified as belonging to the cluster. Given this inductive bias, best results are achieved if the bias has been chosen correctly, by selecting the optimal starting values and assumptions. This usually happens either by conducting a large number of experiments or by having a “domain expert” (knowledgeable human) come up with the initial values and assumptions. In the case of conducting a large number of experiments and selecting the method that provided the best results, such an expert is also needed to evaluate the results and decide which set is best.

There are many ways to perform clustering, usually classified in two categories: hierarchical algorithms and partitional algorithms [Alpaydin, 2004]. Hierarchical algorithms can be agglomerative or divisive: they work either by starting with many clusters (equal to the number of instances) and merging them until there is only one, or the opposite - starting with one cluster and then breaking it up until there is one instance per cluster. This produces a dendrogram which does not suffer from bias, as there is no vigilance radius defined and the number of resulting clusters can be anything between one and the number of instances (clusters can be reconstructed later from the resulting dendrogram structure).

However, to make the dendrogram useful, some of its branches have to be merged into clusters, and selecting the number of such merges would introduce the said inductive bias.

Partitional clustering algorithms, on the other hand, obtain a single partition of the data instead of a cluster hierarchy - i.e., they result in a “flat” structure; some of them can achieve a (browsable) hierarchy by a secondary clustering within the produced clusters. An early example of such algorithms is Scatter/Gather [Cutting et al., 1992], a hierarchical algorithm extended to allow interactive browsing of a document collection, which was developed and used for creating a clustering search engine. A more popular algorithm is k-means clustering [MacQueen, 1967], which performs instance clustering by trying to minimise the total intra-cluster variance (the distance between each data point and the mean point of the cluster - the cluster centroid), or the squared error function:

\[ V = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2 \qquad (4.4) \]

where there are $k$ clusters $S_i$, $i = 1, 2, \ldots, k$ and $\mu_i$ is the centroid of all the points $x_j \in S_i$. $k$ is a value that has to be pre-selected, introducing the above-mentioned bias. It was also shown [Ding and He, 2004] that the relationship between this clustering method and Principal Component Analysis is that “principal components are the continuous solutions to the discrete cluster membership indicators for k-means clustering”.

Another interesting paradigm is based on the Self-Organising Maps [Kohonen, 1995] method (also called Kohonen neural nets). SOM is usually said to be a form of neural network, but it can also be argued that it is just a dimensionality reduction tool for better visualisation of high-dimensional data by projecting it onto a 2D map [Kantardzic, 2002]. It is often said that SOM is a clustering algorithm, but this is a misconception [Ultsch, 2005]; SOM is in fact reduced to k-means clustering when its neighbourhood function is equal to zero, i.e. k-means is a special case of SOM [Kohonen et al., 1996], but in the usual case SOM is more than just clustering.

The SOM method takes a large number of high-dimensional data instances and arranges them in a number of clusters which are situated on a map such that instances within a cluster are similar to each other, but instances in neighbouring clusters are also similar to each other. Apart from this last characteristic, it is practically equivalent to k-means clustering. The method can work in conjunction with various types of clustering, including k-means, but has an additional training step involving not only the cluster to which an instance was assigned, but also neighbouring clusters. The map is usually two-dimensional, but in general could be multi-dimensional and of any shape, such as a cylinder, a toroid etc. The inputs of the neural network are vectors in instance space, while the outputs (neurons or nodes) are placed on a map.

The distance between nodes on the map is calculated by a pre-defined method. There can be variations in the lattice as well (e.g. the map can use a hexagonal or rectangular grid), which result in a different neighbourhood area for each node.

Essentially, the method involves two actions - classification/clustering and adjustment - repeated as necessary. Clustering could be done in a number of ways, and adjustment depends on which method of clustering is used. The main idea is to distribute the data instances into a number of clusters, then adjust every cluster a little depending on the instances that were assigned to it, then repeat the distribution again until convergence. What makes the method “self-organising” is that in the adjustment phase, when corrections are made to a cluster centroid to “pull” it nearer to a data instance that belongs to it, not only this centroid is adjusted but also the ones near it. “Near it” is defined not in instance space (i.e. with a metric imposed by the data), but in an artificially introduced order: the map coordinates. As a result, similarities in data space end up as proximity on the representation map. The map has a pre-defined size and topology, and the method effectively creates a two-dimensional projection of the data onto it. The size and shape of this map, as well as the method of calculating distances on it, is the inductive bias inherent in this algorithm.

Formally, the SOM method consists of the following steps:

• Initialisation. Before the training stage, the prototype vectors are given initial values. These can be: small random values; “sample initialisation” (the weight vectors are initialised with random samples drawn from the input data set); “linear initialisation”, where the weight vectors are initialised along the linear subspace spanned by the two principal eigenvectors of the input data set. Intuitively, a manual initialisation with “strategically placed” typical instances (which would produce a kind of semi-supervised SOM) should improve results, but this has been confirmed only for small data sets [Goren-Bar et al., 2000]; rather counter-intuitively, the advantage is lost when the data set is large. For our experiments, we used sample initialisation.

• Training. In each training step, one sample vector x from the input data set is chosen randomly and a similarity measure is calculated between it and all the weight vectors of the map. The Best-Matching Unit (BMU), denoted as c, is the unit whose weight vector has the greatest similarity with the input sample x. Similarity is usually defined by means of a distance measure, typically Euclidean distance. After finding the BMU, the prototype vectors of the SOM are updated: the vectors of the BMU and its topological neighbours are moved closer to the input vector in the input space (see Figure 4.1; a minimal code sketch of this step is given after this list). This adaptation procedure stretches the prototypes of the BMU and its topological neighbours towards the sample vector. The update rule for the weight vector of the unit i is:


\[ m_i(t + 1) = m_i(t) + a(t)\, h_{ci}(r(t))\, [x(t) - m_i(t)] \qquad (4.5) \]

where $t$ denotes time, $a(t)$ is the learning rate and $h_{ci}(r(t))$ is the neighbourhood kernel around the winner unit $c$, with neighbourhood radius $r(t)$.

Figure 4.1: SOM update procedure (source: SOM Toolbox).

• Iterations are usually prescribed to be at least 500 times the number of nodes of the map. The number though varies greatly from application to application, depending on the data and initialisation, so the only way to find out the correct value for each particular case is empirically. The learning rate, as well as the neighbourhood function, decrease monotonically with each iteration. Their initial values, as well as their rates of decrease, are also best found empirically.
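A minimal sketch of the training step described above is given below; the Gaussian neighbourhood kernel, the decay schedules and the map size are illustrative choices only, not prescriptions.

import numpy as np

def som_step(prototypes, grid, x, t, a0=0.5, r0=2.0, decay=0.001):
    """One SOM training step: find the BMU for sample x and pull it and its map
    neighbours towards x, as in equation (4.5)."""
    a = a0 / (1 + decay * t)                                 # monotonically decreasing learning rate
    r = max(r0 / (1 + decay * t), 1e-6)                      # shrinking neighbourhood radius
    bmu = np.argmin(np.linalg.norm(prototypes - x, axis=1))  # Best-Matching Unit, found in input space
    d_map = np.linalg.norm(grid - grid[bmu], axis=1)         # distances measured on the map, not in input space
    h = np.exp(-(d_map ** 2) / (2 * r ** 2))                 # neighbourhood kernel around the winner
    prototypes += a * h[:, None] * (x - prototypes)
    return prototypes

grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)  # a 3x3 map
prototypes = np.random.rand(9, 2)                                           # random initialisation for brevity
for t, x in enumerate(np.random.rand(100, 2)):                              # 2-dimensional toy input data
    som_step(prototypes, grid, x, t)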

The clusters generated by the method can be made easier for humans to browse by giving them human names (or “labels”) which emerge from the data. There are several cluster labelling methods, one of which (LabelSOM [Rauber, 1999]) is recommended in conjunction with WEBSOM.

The complexity of the original SOM method as described above is $O(M^2)$, where $M^2$ is the size of the map. However, “due to the many stages in the development of the SOM method and its variations, there is often useless historical ballast in the computations” [Kohonen et al., 1996]; performance can be greatly optimised, e.g. by better initialisation. If initialisation is not random but takes the data into account, computation can be orders of magnitude faster, since the map will then already be approximately organised at the beginning and can start with a narrower neighbourhood function and a smaller learning-rate factor.

The Self-organising hierarchical feature map [Koikkalainen and Oja, 1990], which builds a tree of maps, has a complexity of $O(M \log M)$. A three-stage algorithm was proposed [Su and Chang, 2000] which uses k-means clustering to find the $N^2$ cluster centres and then a heuristic to organise them into the $N \times N$ map array so that similar vectors are adjacent to each other from the beginning. Another variant [Kusumoto and Takefuji, 2006] achieves $O(\log^2 M)$ complexity by growing the map from an initial $2 \times 2$ grid without using a neighbourhood training function at all, allowing faster convergence and faster search, although it is not clear whether this method can be used with textual data.

4.3.6 Classification

There are a large number of machine learning methods used for text classification; some of them have been applied to web documents [Lang, 1995; Chakrabarti et al., 1998; Xue et al., 2008] and some systems have even been employed in real-world web directories [Attardi et al., 1999], but none persisted or is in use by major directories at present.

A simple classification method is the “k Nearest Neighbour” algorithm (kNN). It is a type of instance-based learning, or lazy learning, where all computation is deferred until classification. The algorithm classifies an instance by comparing it to its “neighbours” (the instances most similar to it) and then applying a “majority vote”: it assigns the instance to the class most common amongst its k nearest neighbours. k is a positive integer and is typically small (e.g. 5); if k = 1 the object is simply assigned to the class of its nearest neighbour. The distance metric can be Euclidean distance or Hamming distance (“overlap” of features - in text classification, the number of words that differ between two documents). A significant drawback is that a class with significantly more instances tends to come up in the nearest neighbours more often, necessitating normalisation. The naïve version of the algorithm is easy to implement by computing the distances from the instance being classified to all the available training examples. However, this is computationally expensive, and methods to minimise such comparisons, for example “proximity search”, have to be used with large datasets.

Another method for text classification is Support Vector Machines (SVM); in effect, this method builds as many classifiers as there are classes, each one separating its class from all others. Each instance is evaluated by every classifier, and the decision is then made by the one with the highest confidence. A large-scale study [Liu et al., 2005] using SVM over data from the Yahoo! directory found performance “far from satisfactory”, though, due to noisy and sparse data.

A much faster alternative to SVM was proposed [Anagnostopoulos et al., 2006] with a small loss of accuracy. Using only a search engine’s inverted index, the method finds 10 keywords that best describe a number of (pre-classified) documents.

Performing a standard keyword search (using the normal search engine) then finds the documents in the whole indexed corpus that best match the cluster, with 90% of the accuracy of an SVM classifier using 14 000 terms. The query is extremely fast, as it “runs in time proportional to the size of the result set” and is “independent of the corpus size”. However, the method as described can only be used to find documents related to a concept on an “ad-hoc” basis; it only returns the “top k” documents and does not attempt to classify the whole corpus, or to resolve conflicts where a document is returned by more than one query.

Among their many other applications, artificial neural networks can also perform text classification. A neural network consists of a number of “neurons” imitating the way the human (and other animals’) brain works. These neurons are simple processing elements connected to each other in a network (usually, but not necessarily, simple and uni-directional); the complex set of connections between processing elements, and the element parameters, allows the network to exhibit complex global behaviour, including learning from example. Neural networks in general do not have to be adaptive, but in order to use them for classification they have to be able to learn from examples. A typical neural network used for such applications consists of three layers. Inputs form the first layer; they are connected to the second (hidden) layer, which is in turn connected to the output neurons. Connections between individual neurons are defined by parameters called “weights”, used to manipulate the data in calculations. A neural network is typically defined by three types of parameters:

• The pattern of connections between neurons: layered or not, how many layers, fully connected or not, unidirectional (the network is a directed acyclic graph) or not (“recurrent”) etc.

• The learning process for updating the weights of the connections.

• The “activation function” which defines what output a neuron generates for a certain set of inputs and is usually non-linear.

Mathematically, a neuron’s network function is defined as a composition of other functions, which can in turn be defined as compositions of further functions. This can be represented as a network structure, hence the name. When used for classification, some features derived from the data (in the case of text, words; or, if dimensionality reduction has been performed, concepts) are fed to the input neurons, which transform them according to their activation functions and pass them on to the next layer(s) until they reach the output layer, where every possible class is represented by an output neuron. The state of the output neurons reflects the “decision” of the network on the instance; ideally, it will be 1 for the “winner” class and 0 for all others (if that is not the case, the output with the highest value is declared the winner - the so-called “winner takes all” policy).


Achieving the target output is based on a training stage, where the neural network “learns” the connection weights needed to achieve it. In the case of classification into pre-defined classes, this is usually done by supervised learning. The network is given a “training set”, attempts to classify each instance in turn and if its output differs from the desired output for this label it slightly modifies the weights; this is done over thousands of iterations, until the network converges to a stable state. The algorithm for correcting weights is called “backpropagation”, which is an abbreviation for “backward propagation of errors” as it attempts to minimise errors by repeated weight updates [Werbos, 1974]. It is important to note that the magnitude of these updates (the “learning rate”) changes with time: it normally diminishes to zero, at which point the network stops learning. In the case of streaming data though (such as our use case of web data) it needs to remain constant if we want the classifier to keep learning from newly arrived instances.

A popular (and simpler) classification method is the Multinomial Naïve Bayesian (MNB) classifier [McCallum and Nigam, 1998]. It is a simple probabilistic model which classifies data instances into multiple classes. MNB uses pre-classified examples to estimate prior class probabilities and word distributions, thus it is “supervised”.

The MNB classifier uses Bayesian statistics and relies on the independent feature model - i.e., it “naïvely” assumes word occurrences are independent of each other. This assumption is contradicted by Zipf’s law [Zipf, 1949] and by the actual term distribution of web documents (texts on the web were found to follow a double Pareto distribution [Chierichetti et al., 2009]), where feature independence is never the case. However, the model works surprisingly well even though it is very inaccurate in estimating the actual probabilities. This is because the algorithm does not need correct estimations: it only needs to find the class with the highest probability [Domingos and Pazzani, 1997]. When classifying an instance, the winning class (the one with the maximum likelihood) is found as:

\[ \operatorname*{argmax}_{C} \left[ \ln\frac{p(C \mid D)}{p(\neg C \mid D)} \right] = \operatorname*{argmax}_{C} \left[ \ln\frac{p(C)}{p(\neg C)} + \sum_{i} \ln\frac{p(w_i \mid C)}{p(w_i \mid \neg C)} \right] \qquad (4.6) \]

where $C$ ranges over the candidate classes, $D$ is the document being classified, $p(C \mid D)$ is the probability of class $C$ given the document, $p(C)$ is the prior probability of the class and $p(w_i \mid C)$ is the prior probability of word $w_i$ for that class, with $i$ ranging over all words in the document (the probability of the document belonging to a certain class is the combined probability of the class itself and the document evidence: the probabilities of the words in the document belonging to this class).
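For reference, the classic (static) MNB classification step can be sketched as follows; Laplace smoothing is used here for the unseen-word problem, and none of the thesis's own adaptations are included.

import math
from collections import Counter, defaultdict

class SimpleMNB:
    """A plain Multinomial Naive Bayes text classifier with Laplace smoothing."""
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # class priors come from these counts
        self.word_counts = defaultdict(Counter)      # per-class word counts
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        best_class, best_score = None, -math.inf
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / sum(self.class_counts.values()))      # log prior of the class
            total = sum(self.word_counts[c].values()) + len(self.vocab)  # Laplace-smoothed denominator
            for w in tokens:                                             # add the document evidence
                score += math.log((self.word_counts[c][w] + 1) / total)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

clf = SimpleMNB().fit([["goal", "match", "football"], ["election", "vote", "senate"]],
                      ["sport", "politics"])
print(clf.predict(["football", "vote", "match"]))  # -> 'sport'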

The traditional method works well for simple cases, but with unbalanced classes it is not accurate [Frank and Bouckaert, 2006]: the classifier incorrectly tends to favour the larger classes, since they have a high prior probability. The naïve assumption is also a problem where classes are unbalanced not only in numbers (i.e. one class has many more instances than another) but in content as well.

If documents in one class are typically longer than documents in the other classes, then word occurrence rates in it will be higher and this class will be favoured incorrectly as well. Such an unbalanced end result is perceived by users as disproportionately bad: if the classifier makes an error on 5% of instances, but the errors all fall in one particular category (as was the case in our experiments) and this category ends up empty, users will declare the algorithm totally unsuccessful and not just 5% unsuccessful. A proposal to overcome some systemic errors in the MNB model is the Complement Naïve Bayes (CNB) algorithm [Rennie et al., 2003], which tries to overcome skewed training due to skewed classes by training the algorithm to distinguish a class on examples not in that class, as opposed to the normal practice of using the instances in that class.


5 Implementation and Practical Issues

5.1 Summary

In order to test our ideas, we needed to create a working system and see how it behaves with “real world” data. To do that, we needed to adapt some of the existing methods described in the previous chapter to account for the specifics of a constantly evolving system, where both the underlying data and the classification change constantly. Apart from a working proof-of-concept prototype, we also needed, and created, a testing framework allowing us to explore various approaches and algorithms before selecting the best-fitting one for each task. In this chapter, we explain the general setup we used for our experiments (technical backend and data used), as well as the approaches we chose for each element of our solution, the Exploration Engine and the distributed personalisation model. Some of the more important decisions we made were to:

• allow each user to build a personal ontology, which we then match dynamically to a central one (as opposed to trying to fit the user’s model into a pre-built ontology);

• implement a semi-automated web spider and couple it with the web directory, abandoning the popular model of manually built directories;

• not use linguistic pre-processing of data;

• leave detection of outliers to human editors, algorithmically assisted by our engine;

• not use dimensionality reduction;


The chapter reports on our experiments with Self-Organising Maps and our decision to abandon them as a possible solution, due to technical and inherent problems we could not overcome. It then reports on our experiments with classifiers, and on our finally selected method for text classification: the Multinomial Naïve Bayesian classifier. Since it was not well suited to our purposes in its original form, we had to adapt it to the dynamic nature of the web and to the specifics of web documents (most notably, their highly unbalanced classification). We introduced some fundamental changes to the algorithm, from which we derived our version: MNB-SPDA. Unlike the classic MNB, our algorithm is not static but dynamic and incorporates a negative feedback loop, which allows it to compensate for errors or unbalanced data and achieve consistent results across widely differing classes. Another important contribution of the chapter is the Floating Query concept we introduce, where the initial search query is a weighted vector which is re-weighted every time the user changes the current search context (directory category).

5.2 General Approach

Unlike other works which aim to improve existing systems or processes, this work started with a broader view and with no specific aspect of the state of the art in mind as an area of potential improvement. From a reader’s point of view this is reflected, for example, in the structure of this document: where other works start with a literature review of related work, we have left this to a much later part of the document, since initially we did not know what work was related to ours, as we had no idea what our work would turn out to be. From a methodological point of view, this meant we had to use the “reflective practitioner” [Schön, 1983] approach.

“Reflective practice” studies the way we learn from experience. The theory recognises two distinct modes: single- and double-loop learning. “Single-loop learning” occurs when the person (or organisation or system) has a plan or mode of operation which is not changed - on encountering an error, the error is corrected but the system remains the same (similar to the way a thermostat works: when it finds the temperature rising or falling outside the desired range, it takes corrective action according to its design but does not re-assess its own design). “Double-loop learning”, on the other hand, is cyclical; it involves a cycle of implementing current policies, analysing the outcomes and then revising either the policies or the underlying assumptions which lead to them (this is similar to “belief revision” in Belief-Desire-Intention (BDI) intelligent agents).

Taking this approach, we explored many technologies and ideas which seemed to lead to a solution we desired. As seen in the following parts of this chapter, most of them did not. However, they had to be explored, and an attempt had to be made to either make them usable for our purposes or show that they could not be used, and then revise our goals accordingly. This explains our exploration of unsupervised clustering, Self-Organising Maps and neural networks, and our many failed attempts at improving the original Multinomial Naïve Bayesian classifier.


5.3 General Setup

We created a web application using the PHP programming language1 because it allows rapid prototyping and has many built-in features or available modules which we needed (such as functions for sorting and processing arrays with text indices, the cURL2 library for downloading documents from the web, the Tidy library for XML parsing etc.). We experimented with storing some data in a MySQL3 database and some data as text files on the file system, but using the file system in this manner quickly proved suboptimal, so the final version of the prototype uses a MySQL database only.

For testing data, we used the largest available source of categorised web documents: the Open Directory Project, which has an “open data” policy and allows full access to its data, as well as the creation of derivative works.

Some notes on this approach:

• Complex data processing in PHP is relatively slow, because it does not support persistent data structures in memory (as Java does). In PHP, the web server has to access the database and read all the needed data at every page reload.

• For a real-world implementation, it would be advisable to use a more efficient programming language, such as C++.

• For a real world implementation, the database will have to be distributed.

• The Open Directory Project has been underdeveloped for years, so its data tends to be sparse and not well maintained (too much “classification noise” and errors). However, this is currently the best available source. In future, a better one will have to be developed if our system is to be a “real world” system of some value to users.

5.4 Selected Approaches to Tasks

Before we start building our system, we have to look at ways of dealing with the various issues that will arise from its components. Each of them has been treated in one way or another by existing systems - we are not breaking new ground but either implement something already in use elsewhere, or modify it to fit our circumstances. It has to be pointed out that the innovation of this work is in the holistic treatment and integration of the user modelling and information discovery problems, not in any of its components. Furthermore, we want to stress that the fact that we have taken a particular approach towards solving a specific issue is just our take on that issue - other implementations can also be tried and might prove to be better; this work’s ambition is to create a functioning “proof-of-concept” only, while building an “industry best”, or even just a “working real world application”, is far outside its scope and available resources.

1 http://php.net/
2 http://php.net/manual/en/book.curl.php
3 http://www.mysql.com/

For our proof-of-concept application, we had to solve the following tasks: acquire data about each user and data about the web, process both of these (classify them into meaningful structures), and then match them in order to provide meaningful answers to a search request initiated by the user. A brief review of each of these steps follows.

5.4.1 Building the User Model

As we have already stated in “Assumptions and Limitations”, we have to make a simplification and equate “user’s interests and prior knowledge” with “web documents the user has accessed”, and base our model of the user on this.

Our proposed method of acquiring the raw data for this is to create a user-side agent in the form of a web browser plug-in which records the user’s browsing history, as well as some meta-data for the documents accessed. Initially, we used Slogger4, a logging and caching add-on for Firefox. However, it is now defunct, so we implemented a “bookmarklet” instead: a piece of JavaScript which posts information to a local server when the user presses a button in the browser’s bookmark bar.

The data thus collected then has to be processed and evaluated based on some user feedback. As discussed in the previous chapter, most research and industry developments use implicit feedback. While practical, this takes control away from the user and carries the risk of misinterpreting his intentions. A real challenge is not just to find an acceptable balance between system intrusiveness and usefulness, but to allow users to adjust this balance according to their personal preferences. The main issue related to the acquisition of user preferences is the inherent problem that all personal agents and web-based profiling systems face, and most probably fail because of: the average user visits a large number of web pages, but is not willing to actively supply feedback on more than a tiny fraction of them. This means that, unless the system is a multi-million-user setup such as Google News [Das et al., 2007], the learning algorithms do not have much data to go on.

Our approach to this challenge is to extend the personal agent with a hierarchical classifier which works visibly to the user and invites his participation, but does not require it. In practical terms, this is an extension to the browser plug-in which records the user’s browsing history: combined with a hierarchical classifier, this extension classifies every page the user visits “on the fly” and displays the classification to the user, on the bookmarks bar of his browser. So, if the user is reading some piece of football news, the classification snippet will automatically show, for example, News → Sports → Football → International.

4https://addons.mozilla.org/en-US/firefox/addon/slogger/


In most cases, the user can just ignore this, so it will not intrude on his browsing. If the classifier is not confident in its classification, though, the classification path will be highlighted to attract the user’s attention. The user can then click on it and select a different category for the document from a drop-down list, assisting the classifier by providing a manually labelled example (or he can, as before, just ignore it). In the course of his daily usage of the system, the user will eventually provide enough examples so that after some time it will stop asking for clarifications and will become non-intrusive. Additionally, if the user clicks on the proposed classification and does not like any of the offered choices, he will have the opportunity to add a new category to the classification hierarchy, thus developing it as he goes. This method of building the user model encourages users to invest time and effort in supplying feedback, because they will see the results of this feedback immediately, and it will be presented to them as the building of their own web archive, not as the improvement of a remote (and somebody else’s) search engine.

5.4.2 Knowledge Representation and Ontology Matching

As already discussed, the usual approach to building a user model and then applying it to web search has traditionally been to create an integrated mechanism either at the user side or at the search engine side. However, both of these methods have severe limitations and we decided to go a different way: our mechanism will not try to do both things at the same time and on just one side. Instead, we will build two separate mechanisms: one at the user side, modelling the user’s knowledge (his cognitive context), and one at the search engine side, modelling the web.

We then have a new problem: the resulting definitions of “areas of knowledge” (or, more specifically, nodes in an ontology) should be defined in a transmittable way, so that they can be exchanged and used by the independent parts of a mechanism which do not share the same architecture (e.g. internal knowledge representation formats, or processing methods). For example, if the user expresses interest in the area of “Football”, we need a meaningful way to tell the search engine what exactly the user means by “football”. This definition will have to be understandable by reasoning entities with possibly completely different knowledge representation and reasoning mechanisms, similar to communication in a multi-agent system.

As we already said, many pre-built ontologies exist and are often used for tasks comparable to ours. However, we have some problems with using them:

• Being generic, all pre-built ontologies will ignore the user’s cognitive bias and will be difficult or impossible for the user to navigate (remember the example of the user looking for dolphins under “Fish”).

• Since they are necessarily trying to include everything anyone can ever think of, they will be much too large (and confusing) to a single user, who will only need (and understand) a very small part of them.

• Some existing ontologies, such as Schema.org, are meant just to describe a resource in terms of “this is a Thing > Place > Administrative Area > Country” but not to assist grouping them together (e.g. you cannot express concepts such as “countries speaking French” or “tourist places good for scuba diving”).

• In a distributed system such as ours, relying on a specific ontology at a deep, architectural level would mean that all elements of the system have to use it exclusively. In the real world, such uniformity is difficult to achieve (a consortium of the major search engines has yet to agree on Schema.org even for their own limited use). We prefer an open system where the user (or the company operating the search-engine side of the system) can freely decide what ontology to use.

The approach we decided to use is to define areas of knowledge by similarity only: we take an “area of knowledge” to equate to a number of similar documents which are pertinent to the same topic (with the grouping being hierarchical, so as to allow a hierarchy of topics for better navigation). This is not a “full-blown” ontology but a simplified one: relationships between entities are only of the type “A is a subset of B”, and the graph is non-cyclic. However, an ontology of simple document grouping captures perfectly what we aim to achieve, namely information discovery by browsing into related topics, as well as topic-specific search. The way this grouping of documents is achieved, and any reasoning based on it, is left to every separate part of the mechanism to implement in its own way: for example, an implementation of the user-side mechanism may use a Bayesian text classifier (such as we use), unsupervised clustering, a neural network or a Support Vector Machine. As long as it achieves the goal and is able to communicate its results to other systems, it can be a part of our framework.

The second part of the task is to transmit knowledge of the already created groups of documents to the information provider. Since we do not rely on the same ontology being present on both ends of the mechanism (in which case we would just transmit the node’s ID), we need to “describe” the node in the user’s ontology (consisting of a number of documents) to the search engine in a way the search engine can use to assist the user’s query, by matching it to a part of its own ontology. The simplest way to achieve this would be to just transmit all the documents in the node and let the search engine process them and make some sense out of their grouping (essentially, this is how server-side mechanisms work: the search engine knows what documents the user has visited, already has them and can process them directly). In our case, however, apart from the practical issue of uploading so many documents with every query, such an approach would be too intrusive from a privacy point of view: by sending the documents, or even only a list of their URLs, we would be disclosing the user’s browsing history and habits.
browsing history and habits. On the other hand, the search engine does not actually need to have the documents themselves - it only needs some data derived from them to assist it in providing results to the user's query. Instead of sending the raw data and letting the search engine process it to achieve personalisation, we pre-process the documents on the user side and only send the processed results as an integral part of the query, in the form of a weighted term vector. In this way we take some computing and data storage load off the search engine, and at the same time protect the user's privacy.

Figure 5.1: The forming of a complex research query: term vectors from documents in a category form the category vector, which is weighted with respect to neighbouring categories and then used to modify the user's query. The resulting query vector is the original user's query modified with context-specific weights which represent the user's cognitive and current contexts.

An example of what such an enhanced search query (a "research query") looks like and how it is derived is illustrated in Figure 5.1. As seen from the illustration, the initial user query can be much more than a couple of keywords (it can even be a whole document the user pastes into the search box). The initial term vector of the query is weighted using values derived from the category; these weights represent the user's cognitive context because the classifier has derived them by learning how the user distinguishes this category in his personal ontology from neighbouring categories, and they represent the current context because this vector, and not another, is used to modify the query. The resulting vector is sent out to the search engine which, consequently, receives not a flat list of keywords but also a value associated with each of them indicating how important that word is to the query. This solves the problem mentioned above of a personal search assistant disambiguating a query by adding more keywords, where the search engine does not know which of these is important and which is there for disambiguation purposes only.
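One plausible reading of this weighting step, sketched below in PHP (the language of our prototype), is to multiply each query term's occurrence count by the weight the category classifier has learned for that term. The function and variable names, the default weight for unseen terms and the sample values are illustrative assumptions, not the prototype's actual code.

<?php
// Sketch only: forming a "research query" by re-weighting the user's query
// terms with category-specific weights, in the spirit of Figure 5.1.
// $categoryWeights is assumed to map term => weight learned from the
// documents of the currently selected category.
function buildResearchQuery(array $queryTerms, array $categoryWeights, float $unseenWeight = 0.001): array
{
    // Count raw occurrences of each term in the query (which may be a whole
    // pasted document rather than a few keywords).
    $counts = array_count_values($queryTerms);

    $research = [];
    foreach ($counts as $term => $count) {
        // Terms the category knows nothing about get a near-zero weight.
        $weight = $categoryWeights[$term] ?? $unseenWeight;
        $research[$term] = $count * $weight;
    }
    arsort($research);   // most context-relevant terms first
    return $research;    // term => weight, sent to the search engine together with the query
}

// Hypothetical usage with a few weights similar to those shown in Figure 5.1:
$weights = ['political' => 0.788, 'election' => 0.635, 'arabia' => 0.140, 'the' => 0.001];
$query   = ['what', 'political', 'change', 'happened', 'in', 'the', 'kingdom', 'of', 'saudi', 'arabia'];
print_r(buildResearchQuery($query, $weights));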

5.4.3 Data Acquisition and Processing

Before we can assist the user in his search for relevant information, we have to first acquire and process this information ourselves in order to understand it and match it to the user’s need. A brief summary of our approaches to these tasks follows.

5.4.3.1 Data Acquisition

As we already stated in our "Assumptions and Limitations", we consider the problem of initial data acquisition outside the scope of this work: any real-world application of our idea would be based on already existing search engine infrastructure on the server side. Furthermore, many solutions to this task exist, most of them are mature and have been used for many years, and improving them even a little would be a topic for an entire new thesis. However, we still needed some method to acquire web data for our tests, so we had to implement something for our proof-of-concept prototype. We considered using a "state of the art" system such as a combination of Nutch5 for downloading documents and Lucene6 for indexing, or a framework such as YaCy7 or Xapian8. However, some investigation led us to the following conclusions:

• Nutch/Lucene have system requirements exceeding our available testing hardware;

• proper usage requires competences exceeding those of the available researcher;

• modifying them “on the fly” to accommodate experiments necessary for research purposes requires an even higher level of competence;

5 http://nutch.apache.org/
6 http://lucene.apache.org/
7 http://yacy.net/
8 http://xapian.org/


• both of the above competence sets are in areas (systems administration and Java programming) outside the scope of this work;

• the features and capacity of these systems are significant overkill for our purposes;

• on the other hand, some features needed for our model (e.g. - individual document-level spider control) are not present and would be too hard to implement.

As a result, we decided to implement our own simple web spider using the cURL library for PHP (coupled with a MySQL database), which is adequate for small-scale fast prototyping. Our simple spider has the following features:

• It can be given a URL (or a number of URLs) as starting point for its web crawl.

• Newly discovered URLs are not automatically downloaded but need to be approved by a human editor.

• Approval of new URLs can be individual, "mass" (multiple select) or by URL pattern (e.g. - automatically approve URLs from www.server.com containing /news -wildcard- .html); a minimal pattern-matching sketch follows this list.

• The queue of URLs waiting for approval can be prioritised based on time of URL discovery, number of links pointing to the URL from other documents already downloaded, or other heuristics (for example - download simpler URLs first9).

• The editor can ban URLs individually or by pattern.

• The editor can instruct the spider to ignore URLs by pattern (e.g. - do not follow or even record URLs from www.server.com containing /comment -wildcard- .php).

• The spider records the original text of the document, as well as a filtered version (more on this in “Data Processing”).

• The spider does not record which document links to which other documents - this relationship matrix is not needed in our model.

• However, the spider records where (at what URL) it saw the first link to every document. This turns out to be a very useful heuristic to manually delete spammy documents later: if we find a document we consider search engine spam, the document pointing to it usually points to dozens or even hundreds of similar documents which can then be deleted by the editor with minimal effort.

9 Yes, we did criticise this approach in search engines. However, in their case they use the feature to promote "simple" URLs over "ordinary" ones in search results. We only download simple URLs earlier in order to access a site's homepage (which has the simplest URL of all documents on the site) sooner and allow the editor to process it earlier; search result rankings shown to users are not affected by this.
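The pattern-based approval and banning mentioned above can be illustrated with shell-style wildcard matching; a minimal sketch, assuming editor-defined patterns are stored as simple wildcard strings (the pattern shown and the function name are ours, not the prototype's):

<?php
// Sketch: checking a queued URL against an editor-defined wildcard pattern.
// fnmatch() performs shell-style wildcard matching on strings.
function urlMatchesPattern(string $url, string $pattern): bool
{
    return fnmatch($pattern, $url);
}

// Hypothetical usage: auto-approve news pages from one site.
var_dump(urlMatchesPattern('http://www.server.com/news/2011/story.html', '*www.server.com/news*.html')); // bool(true)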


On the end-user side of our proposed mechanism, we used the same spider for the proof-of-concept tests. In a real world implementation, the task of downloading web documents seen by the user is already being done by his web browser, so the user-side analogue of the web spider will be a module of the browser plug-in which provides the rest of the user functionality.

5.4.3.2 Data Pre-processing

Pre-processing of our raw data is a task which is important for the validation of our proof-of-concept prototype, although not otherwise related to this work as a whole. We implemented a simple pre-processor for two reasons:

• pre-processing not being essential to the work as a whole, there was no justification for experimenting with more complex pre-processors, and

• using a more complex pre-processor would introduce many more potential points of failure before we start testing the learning algorithms and would make our assessment of their efficiency more problematic.

We decided to use the Tidy library for PHP to remove HTML formatting and metadata. As already discussed, removing common site elements is a huge problem solved by complex (and proprietary) algorithms in commercial search engines. Our experiments found that a typical HTML document of around 40 kB would contain (after manual filtering) around 2 kB of "meaningful text", with the remaining 38 kB being formatting and common site elements. A human can easily distinguish this from the information payload of a page and perform such manual filtering, but for a machine algorithm it is not a trivial task: it would require correlational analysis of a large number of documents from the same site, finding their common elements and filtering them out. This approach a) is prone to errors, and b) requires knowledge of other documents from the same site. Since the user side of our system cannot have such knowledge, this is not practical in our case. Our tokenizer policy (below) partially solves the problem, but a good solution remains outstanding; we consider it outside the scope of this work and it will need to be addressed in an eventual future development. In our system, we applied the following tokenizer policies (a minimal sketch follows the list):

• Words are normalised to lower case.

• Words containing numbers are ignored.

• Words with a length of 1 or 2 characters are ignored.

• Words of length of 20 or more characters are ignored.


• Words contained in phrases 1 or 2 words long are ignored (this removes most of the in-site navigation and external links from documents).

• Words contained in phrases 20 or more words long are ignored (this removes machine- generated keyword lists trying to “search-engine optimise” sites).
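A minimal sketch of these policies in PHP (the prototype's language) is given below. The phrase segmentation rule - splitting on punctuation and line breaks - is our assumption, as the exact way phrase boundaries were detected is not specified here; the function name is illustrative.

<?php
// Sketch of the tokenizer policies listed above. A "phrase" is assumed to be
// any run of words between punctuation marks or line breaks.
function tokenize(string $text): array
{
    $tokens  = [];
    $phrases = preg_split('/[.,;:!?()\[\]\r\n]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($phrases as $phrase) {
        $words = preg_split('/\s+/u', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
        $n = count($words);
        if ($n <= 2 || $n >= 20) {
            continue;                     // drop 1-2 word phrases (navigation) and 20+ word phrases (keyword stuffing)
        }
        foreach ($words as $word) {
            $word = mb_strtolower($word); // normalise to lower case
            if (preg_match('/\d/', $word)) {
                continue;                 // ignore words containing numbers
            }
            $len = mb_strlen($word);
            if ($len <= 2 || $len >= 20) {
                continue;                 // ignore very short and very long words
            }
            $tokens[] = $word;
        }
    }
    return $tokens;
}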

While not perfect and requiring further improvement, these policies seemed to produce reasonably good word vector representations of HTML documents, comparable to manually filtered ones; we felt improving them further would be outside the scope of this work (probably equal to another thesis), so we left such experiments to future extensions.

A more important aspect of pre-processing is linguistic processing - simplifying texts by various linguistic methods in order to reduce their complexity. A common method is stemming, which reduces words to root words - for example, "reduces" becomes "reduce". This might also reduce "horsing around" to "horse" and "round", leading to the impression that the document containing the phrase is about animals and geometry. While it is apparent that reducing complexity is a "must do" if we want to achieve a system that works in the real world, we must also consider the implications of the method we select to achieve it. Some failures in projects similar to ours have been attributed to the use of a bad pre-processor for document texts [Arnaud, 2004]. This may be a case of just taking a wrong decision when selecting a pre-processor. Or, it might be a deeper problem. Perhaps the problem with such pre-processing is a fundamental one: while a high-level method is developed to organise documents, these documents have already been organised (or pre-processed) by a different low-level method. The high-level method then cannot do what it is supposed to do, since its function has already been performed by the pre-processing algorithm.

There is also another aspect of linguistic processing: it has to know the structure of the language. Whatever pre-processor is used, it will be limited to one language only. On the other hand, a typical user knows more than one language and visits sites in more than one language. The web has documents in hundreds of languages (the Open Directory, which we used for training data, has 81 language sub-trees). Instead of trying to solve hundreds of separate problems then, it would be best to find a method that does not depend on linguistic analysis, but rather organises documents in a language-independent way. Indeed, some researchers specifically state [Berger and Merkl, 2005] that they do not use stemming or any other linguistic method for categorisation. Consequently, we also decided not to use linguistic processing, accepting the higher computation costs and storage space requirements this incurs.

There is also the question of data transformation - e.g., do we record and use raw data, or is it transformed to something else before being used? We applied TF-IDF scaling (with some variations discussed below where we explain our approach to classification), a number of normalisations (also discussed below) and experimented with smoothing. Smoothing could be preferable to TF-IDF in some streaming applications, where changes in the documents viewed by the classifier would introduce some temporal shift in the
work of processing algorithms. However, this shift would be negligible on the scale of the "real world" web that we target, so we decided that smoothing is not needed in our case. While experimenting with it, though, we found that classification based on data which has undergone smoothing is much faster, due to mathematical operations using integers only and not floating point values. Unfortunately, this also reflected negatively on classification accuracy, but we decided that an algorithm using such values is still usable in some cases, for example to produce fast (though not too accurate) results for users who prefer speed over accuracy. A more detailed discussion of this exception can be found below, where we present our DCC classification algorithm.

Another type of pre-processing practised in data mining is outlier detection and removal. It helps build better classifiers and provides a considerable quality improvement by removing instances which would be harmful to the learning system. A question remains though - how is an "outlier" defined? - especially having in mind that we are going to be processing textual data with no easy or unambiguous classification. Our approach is to let the end-user (or editor, in the case of the server-side part of the mechanism) define some documents as unwanted for training the classifier in a category (they still remain listed, but are not used by the learner); for example, we manually removed some documents containing nothing but hundreds of personal names (obituary lists). We developed some tools to assist the process of finding such documents, based on outlier values for word counts or average TF-IDF per document (the obituary lists were extremes in both).

5.4.3.3 Data Representation

In our implementation, we decided to use a term vector representation based on the "bag-of-words" model, where documents are represented by a weighted "bag of words". The weights are derived from local TF-IDF values for each category. Since we have a different vocabulary in each category, and different word occurrence counts, our vectors are category-specific, e.g. the same document is represented by a different vector in every category, depending on its context (more on this below). This applies both to the server-side part of our model and the user-side mechanism. In the latter, the user creates the categorisation tree into which these term vectors are split, reflecting his interests and his view of the world. Both the server-side and the user-side structures are ordered hierarchies; thus, unlike the usual implementation of such models, ours is structured.

5.4.3.4 Indexing

In our implementation, we use a document description table (for each document: which words are contained in it, and how many times) for use by the text classification algorithm, and a weighted "posting list" which is derived from the inverted index. Applying MapReduce to the generation of the inverted index was considered outside the scope of this work and the resources available to it, so it remains an extremely slow operation in
our prototype, especially since we also apply local term weighting at the same time; this does not, however, reflect on the validity of our proposed solution, nor on its usability, as this is done "behind the scenes" and does not affect the user-facing part of the system.

Unlike the standard practice, we also apply a further transformation to the posting list after generating it from the document description table. Instead of a flat occurrence list of words or a TF-IDF-weighted vector, we have a normalised weight for each word for each document:

\[
sw_i = \frac{\text{tf-idf}_{w,i}}{\overline{\text{tf-idf}}_i} \tag{5.1}
\]

- the specific weight sw_i for the word in document i is the ratio between the word's TF-IDF value for that document and the average TF-IDF of all words in the document. We can compare the relative importance of words in documents on this basis and, keeping in mind that TF-IDF values are local (for a specific category), we can say, e.g.: word w_i is N times more important for document D_j than for document D_k in category C. We use these values in two ways (a small sketch follows the list):

• Internally, we optimise by saving storage space by only recording the top N documents containing a particular word (for example, the top 1000), sorted by the word's importance to them. If we return no more than 1000 search results to end-users for their queries (this is what Google does, so we got this parameter from them as a practical heuristic) and these are the top 1000 by keyword relevance, there is no point in storing knowledge about what other documents contain the same word as they will not be shown to users anyway.10 We thus have a storage space requirement which is linearly dependent on the number of words in the vocabulary, and is not dependent on the number of documents. From a technical point of view, these values are stored in a serialised form, so that we have only one table row per word, no matter against how many documents we index it.

• At the end-user interface of our central web directory, we use these values to generate search results when users search by keyword: we order candidate documents by these scores to rank them by relevance to the research query.
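The sketch below illustrates Equation 5.1 and the capped, sorted posting list in PHP. The data shapes (arrays mapping words to category-local TF-IDF values, and a nested posting array) and the function names are our assumptions for illustration only.

<?php
// Sketch: specific weights (Equation 5.1) and a capped, sorted posting list.
// $tfidf maps word => category-local TF-IDF value for one document.
function specificWeights(array $tfidf): array
{
    $avg = array_sum($tfidf) / max(count($tfidf), 1);   // average TF-IDF of all words in the document
    $sw  = [];
    foreach ($tfidf as $word => $value) {
        $sw[$word] = $value / $avg;                     // Equation 5.1
    }
    return $sw;
}

// Keep only the N most relevant documents per word, sorted by specific weight.
function addPosting(array &$postings, string $word, string $docId, float $sw, int $limit = 1000): void
{
    $postings[$word][$docId] = $sw;
    arsort($postings[$word]);
    $postings[$word] = array_slice($postings[$word], 0, $limit, true);
}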

We have to note here, although it is discussed in more detail where we present our classification method, that - just as with every other use of IDF in our system - these values are always local, calculated for a specific category. So, TF-IDF "up-weights" distinctive terms and discounts common terms, but in our case this only happens for terms which are
distinctive or common in the specific context. Of course, there is also a price to pay for this: we have not one general posting list, but one posting list per category. Consequently, our storage space requirement is also linearly dependent on the number of categories in the directory, which is a disadvantage compared to standard search engines or web directories.

10 Of course, some leeway can be left for cases where users specifically exclude some documents from their results etc., but this can be easily mitigated by storing the top 2000 documents if it turns out to be an issue.

A drawback of our current system is that the normalised weight we use for each word in each document is based on the average TF-IDF of all words in the document. A small number of outliers can easily bias this average, resulting in suppressing some words that have sparse presence in the document but are typical for that document and are in fact important. Outlier detection, and deciding how many outliers to remove, though, depends on parameters which would be category-specific, as each category has a different document collection. Since the DCC algorithm which uses these values, and keyword search in general, are not central to our proposed model, and since experimenting with these parameters would require much more data than we had available, we had to leave such experiments for a future development of the system.

5.4.3.5 Dimensionality - Reduction or Not?

Since it was apparent from the start that our biggest technical challenge would be computation efficiency, we invested considerable efforts in finding some solution to the issue.

Of the two simplest solutions based on discarding features, we decided to use just one:

• We decided to ignore rare words with low statistical significance (rather arbitrarily, we set this parameter as words occurring in fewer than 10 documents). In a random sample of 631,749 documents from the Open Directory (in all languages), this removed 92% (ninety-two percent) of the vocabulary, so 10 was deemed a sufficient value for the parameter (with other document collections it may be different and will probably have to be varied, but we had no other data to experiment with). It is interesting to note also that 56% of all words occurred in only one document, e.g. "basilectalization", "helicobacterium" etc. A small vocabulary-pruning sketch follows this list.

• However, it is our opinion that “stop words”, or words too common to be distinctive, is a concept which does not apply in our case. As already discussed, distinctiveness varies with context and since we are aiming to build a large number of contexts, we cannot have “stop words” in general and have to let each classifier find its own stop words within its own particular context.
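A minimal sketch of this document-frequency pruning, under the assumption that document frequencies have already been counted into an array (names and the default threshold are illustrative):

<?php
// Sketch: keep only words that appear in at least $minDf documents.
// $df maps word => number of documents containing that word.
function pruneVocabulary(array $df, int $minDf = 10): array
{
    return array_keys(array_filter($df, fn($count) => $count >= $minDf));
}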

For similar reasons, we also discarded the methods based on (common) external data sources: they could be useful for simulating the top level of the search-engine side of our mechanism, but would not work for lower-level categories. They would not address the
personalised concepts of end-users either, which should be based on their own view of the web; nor the fluctuation of concepts with streaming data.

It is apparent that there is a large variety of existing methods to achieve dimensionality reduction and thus lower the complexity of the classification task. However, after some experiments and after considering all implications, we decided against using dimensionality reduction; our solution works with the raw data (excluding rare words, and some stochastic discarding of features discussed in the description of our MNB-SPDA algorithm). The reasoning for this decision is that dimensionality reduction is not suitable for the continuous classification of dynamic underlying data such as ours (the web). Any initially successful projection of the documents into a lower-dimension representation would deteriorate as we alter the collection, since its success is measured by information entropy over the collection; as the collection changes, that would change too. In other words, if we achieve a set of constructed features which best split the data, this "best split" is only guaranteed to be best for the initial data. Evolving the document collection would mean that:

• we need to update the dictionary constantly;

• we need to perform dimensionality reduction again and again;

• as a result, documents will have to be re-mapped to the new low-dimension vectors constantly;

• all classifiers will need to be retrained to the new features with every change.

While this is not impossible to do, it adds another level of complexity to the system as a whole, and does not seem to save too much in the way of computation, so the precision/cost trade-off is not justified. We also have to keep in mind that at lower levels of the hierarchy we would need separate dimensionality reduction since document collections there are different, which would increase complexity exponentially with the category tree depth. Moreover, dimensionality reduction would project the data into (typically) 64 or 16 dimensions; since we only have an average of 13.43 classes per classifier (based on the Open Directory structure), we might as well project the data into them without this intermediary.

5.4.4 Self-Organising Maps and Broken Dreams

We aim to create a browsable “map of the web” on both the user side and the server side of our mechanism. A map should be easy to navigate and consistent: that is, we have to distribute instances (web documents) into areas of the map in a logical and predictable way so that users can easily understand it and learn it. In other words, we need a consistent classification or clustering method producing such a map or ordering. The current solution
of web directories is to do it all manually, which has turned out to be so expensive as to put them out of business. The opposite approach would be to follow in the steps of search engines, apply machine learning and try to achieve a fully automated (unsupervised) method; alternatively, some sort of hybrid can be devised. Logically, we started with the completely unsupervised option first.

There is nothing in the above problem definition that calls for a “classical” tree-like directory structure of the data. If we want a completely unsupervised (e.g. - cheap in terms of human labour) solution, we might apply some ordering algorithm in order to achieve our browsable collection of documents which may take a form other than a category tree. Many such algorithms exist [Berkhin, 2002] and some have in fact been applied to web documents. It is understandable then that this work initially explored the usage of such algorithms for the creation of a proof-of-concept application.

When reviewing k-means and similar algorithms, we found them not suitable for the task of creating an easy-to-browse hierarchical document collection. Apart from being computationally expensive (many thousands of iterations are needed) and requiring batch processing (meaning they cannot work with online arrival of new instances and re-classification of instances), their parameters - for example k, the number of clusters - cannot be uniformly set throughout the hierarchical structure. This is because the different clusters of documents would need to be branched differently (or not at all) depending on their content: e.g., the "Animals" cluster may branch down into many sub-clusters, but the number of sub-clusters for "Second-hand Tractor Parts" will be different, and "Suppliers of Second-hand Tractor Parts in South-West Queensland" may not need to be branched at all. This decision has to be taken by a human at every node, of which there will be millions. If we are going to employ humans to do this, it would defeat the purpose of an unsupervised solution and we may as well use these humans to categorise some instances into categories and use a semi-supervised classification method (as we finally decided to do).

Most of our initial experiments were conducted on Self-Organising Maps (SOM), mainly because this method creates a visually appealing browsing paradigm, which we thought would be good from a usability point of view. Users are familiar both with web browsing and with geographic maps, and we thought it would be a good idea if we combined them, so that our clustering algorithm created a browsable "map" similar to normal two-dimensional geographic maps (which was also the aim of the WEBSOM project [Lagus et al., 2004] - now defunct).

Implementation

The SOM method has been used for a large number of different applications and has many variations and modifications [Oja et al., 2001]. Most relevant to us is WEBSOM [Lagus et al., 2004], which creates clusters of web documents in order to present the user with an easy way to browse a document collection. The method is an extension to standard SOM that copes with the diversity of the web by creating a two-step SOM.


Since the dimensionality of web text data is extremely high, WEBSOM first clusters the dictionary into categories and then indexes documents with these categories and not terms (very much like an autoencoder [Hinton and Salakhutdinov, 2006]). However, in its original form WEBSOM is not suitable for large "real-world" scenarios. It does not scale well and, apart from that, cannot cope with the whole web for a very basic reason - even if it does manage to cluster every web document, the user will be overwhelmed by a single map containing billions of documents. It will either have to have millions of documents within a cluster, making it practically useless, or will have to have so many clusters that the map itself will need to be searched, defeating the purpose.

For usability reasons, it is apparent that a cluster should have a manageable number of instances, so that they can be visualised and browsed by the user without the need of additional navigation within the cluster. A variation of SOM which addresses this is a development over the hierarchical feature map - the Growing Hierarchical Self-Organising Map (GHSOM). It creates multi-level maps, branching out depending on the data [Rauber et al., 2002]. We tried to build a GHSOM for our proof-of-concept prototype.

From a more general perspective, it has to be noted that SOM and all methods that extend it are not strictly defined algorithms, but rather a framework for creating algorithms. In other words, they provide a general method, which can be extended and varied depending on the application and data. While this makes them more difficult to implement (no "user manual" with clear-cut instructions), it also provides the freedom to select solutions and settings to best fit a particular purpose.

We had to make several decisions for our implementation of the method:

• Size and shape of the map;

• Distance metrics on the map;

• Pre-processing of documents;

• Dimensionality reduction;

• Clustering/adjustment algorithm;

• Learning rate;

• Decay function of learning rate with time;

• Neighbourhood function;

• Decay function of the neighbourhood function with time;

• Convergence criteria;

• Branching criteria.


For all of them we had to take into account what is known about the data, namely: that there are large quantities of it, and it changes constantly.

The size and shape of the map are both a technical issue and a user-interface issue. From a technical point of view, the bigger the map is, the more computationally intensive the method is, because the classifier will have so many more classes to learn. On the other hand, a bigger map (and more classes) gives higher precision. But, if we branch out the map into many levels of similar-sized maps, and these maps are small, this means many more maps and much more storage space taken for the classifiers, because every map has a separate classifier. In other words, we have a trade-off between complexity of the individual classifier and complexity in the form of having too many classifiers.

From an end-user point of view, a map which is too large means too many options the user can click on, which is not such a good thing considering we are trying to make things simpler for the user. On the other hand, a smaller map will have to be broken into more levels in order to represent the same number of clusters, meaning more clicks until the user reaches what he is looking for (potentially missing a branch somewhere since, unlike other maps, SOM can split similar documents into more than one group of clusters located in different parts of the map). The trade-off here is again between complexity of the individual map and complexity of the overall structure. Obviously, a middle ground had to be chosen for both technical and ease-of-use reasons. Since the primary goal is to assist the user, ease-of-use should take precedence, and for this reason we chose as a base for experiments a figure of 100 clusters per map. This is quite large from a personal point of view (of a person that has to select between 100 options), but not too overwhelming.

The shape of the map is also important. The topology could in general be anything, including a toroid or cylinder. These shapes are only recommended though if the data is known to be circular [Kohonen et al., 1996], so in our case we need a flat 2D map. Intuitively, it should be as symmetric as possible - that is, a square. However, the SOM manual, based on empirical data, advises against this and suggests a rectangular, but not square, map for better convergence, so we also tried a 12 × 8 grid.

The next decision is about the placement of clusters on the map, and the way to measure distance between them. The map could be a mosaic of squares, or of hexagons. The SOM guidelines recommend hexagons for better convergence [Kohonen et al., 1996]. From an end-user point of view, this is also better: a cluster represents a number of documents grouped by similarity, and neighbouring clusters should be similar to each other. The more neighbours a cluster has, the more links to similar concepts it can provide to the user.

A regular hexagonal grid, though, has one disadvantage which is completely unrelated to the SOM method and to machine learning in general: it does not lend itself to easy visualisation on web sites. With standard and simple HTML, which only offers tables for layout, rectangles are the only shape that can be drawn. Having in mind that the primary purpose of the effort is to create a structure which is easily browsed, we had to make a simplification
and tried a variation of the hexagonal grid - a brick-like non-regular hexagonal tiling, where the cells have a width of one and a height of two, and the centres of cells in every successive column are offset by one. Every cell thus has six neighbours and the grid is still symmetrical, though on fewer axes. Each such cell can be thought of as a regular hexagon which was "squeezed" from two sides until one of its corners came to the line between two of the others. This structure retains the coordinate system of a rectangular map, though distances between centres are not equal.

On the question of distance measurement, distance between cluster centres (necessary for the computation of the neighbourhood function in SOM) can be calculated in one of several ways. The simplest is the so-called "Manhattan" or city block distance, which measures the distance as if it has to be travelled at right angles. Euclidean distance can be measured in two ways - as the actual distance between the rectangular tiles, or the distance between them as if they were regular hexagons (i.e. - pretend that the re-shaping of the map was for display purposes only, and calculate distances as if it did not occur). Since the simplification was made for visualisation only, we calculated distances as for a normal hexagonal grid, i.e. we separated the actual look from the calculations.
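A minimal sketch of this "calculate as if regular hexagons" idea, assuming flat-top hexagons with unit distance between adjacent centres and odd columns shifted down by half a cell (the geometry constants and function names are our assumptions, not the prototype's):

<?php
// Sketch: map the brick-grid (column, row) coordinates onto the centres of
// regular hexagons and measure Euclidean distance between those centres.
function hexCentre(int $col, int $row): array
{
    $x = $col * 1.5 / sqrt(3);              // horizontal spacing of columns
    $y = $row + (($col % 2) ? 0.5 : 0.0);   // odd columns offset by half a cell
    return [$x, $y];
}

function hexDistance(int $c1, int $r1, int $c2, int $r2): float
{
    [$x1, $y1] = hexCentre($c1, $r1);
    [$x2, $y2] = hexCentre($c2, $r2);
    return sqrt(($x1 - $x2) ** 2 + ($y1 - $y2) ** 2);
}

// All six neighbours of a cell are at distance 1.0, e.g.:
echo hexDistance(0, 0, 1, 0);   // 1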

Optimisations

In the course of our experiments we tried to mitigate the high computational complexity of these algorithms using the Expectation-Maximisation (EM) method [Dempster et al., 1977]. Similar to Self-Organising Maps, it is not a fixed algorithm itself, but rather a method, which means it can be used in conjunction with algorithms such as k-means, or methods such as SOM. EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximisation (M) step, which computes the maximum likelihood estimates of the parameters by maximising the expected likelihood found in the E step. The parameters found in the M step are then used to begin another E step, and the process is repeated. The reason we tried it is that it can be optimised for faster convergence - for example, by skipping some of the data instances and processing only a subset of the data at each iteration. If, while assigning items to clusters in the first phase of an iteration, we keep score of their fitness to the cluster, we can then sort them by fitness and in the second phase only train on those that have the best fit. Unfortunately, this optimisation did not prove to be enough to make either k-means or SOM tractable with web-scale data.
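The subset-selection speed-up described here can be sketched as follows; the data shape (cluster id => item id => fitness score) and the kept fraction are illustrative assumptions:

<?php
// Sketch: after the assignment (E-like) step, keep only the best-fitting
// fraction of each cluster's items for the update (M-like) step.
function selectBestFitting(array $assignments, float $fraction = 0.5): array
{
    $selected = [];
    foreach ($assignments as $cluster => $items) {
        arsort($items);                                   // best fitness first
        $keep = (int) ceil(count($items) * $fraction);
        $selected[$cluster] = array_slice($items, 0, $keep, true);
    }
    return $selected;   // only these items are used to re-estimate the cluster model
}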

Other Issues

Self-Organising Maps are known to have a problem affecting the quality of clustering called the border effect [Kohonen, 2001]. The neighbourhood definition is not symmetric at the borders of the map, where the number of neighbours per unit on the border and corner of the map is not equal to the number of neighbours in the middle of the map. Therefore, the density estimation for the border units is different to the units in the middle of the
map (simply put - clusters in the centre attract more instances than those at the borders). We tried to mitigate this effect and increase the uniformity of data distribution on the map by calculating a penalty weight for each cluster between iterations. A cluster that has more than the average number of items was penalised on the next iteration, while a cluster with fewer than the average number of items was boosted by a negative-value penalty:

\[
\text{penalty}_{i,t} = d \left( n_{i,t} - \operatorname{avg}(n)_t \right) \tag{5.2}
\]

where penalty_{i,t} is the multiplier for cluster i at iteration t, n_{i,t} is the number of items in it, avg(n)_t is the average number of items per cluster for that iteration, and d is a damping factor with 0 < d ≪ 1 to prevent high volatility between iterations. This penalty factor is applied to the classifier to decrease/increase the chance of an item being classified in the respective cluster, resulting in convergence towards a more uniform distribution.
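A minimal sketch of this per-cluster penalty (Equation 5.2); the damping value and the array names are illustrative:

<?php
// Sketch: compute the penalty multiplier for every cluster between iterations.
// $counts maps cluster id => number of items currently assigned to it.
function clusterPenalties(array $counts, float $d = 0.01): array
{
    $avg = array_sum($counts) / max(count($counts), 1);
    $penalties = [];
    foreach ($counts as $cluster => $n) {
        $penalties[$cluster] = $d * ($n - $avg);   // positive = penalise, negative = boost
    }
    return $penalties;   // applied to the classifier's scores on the next iteration
}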

Selecting the learning rate and the function of its decay over time is usually a matter of empirically finding those that work best with the particular data set. It is usually started with a high learning rate, which then drops to zero (i.e. - the algorithm stops learning from the data at some point, when the map is considered to be converged). In our case though, we had to introduce either a constant learning rate, or some sort of cycling value for it to accommodate newly-arriving data instances and allow us to learn from them at a non-decayed rate. We did our experiments with a constant learning rate.

The selection of a neighbourhood function and its decay over time is also usually a matter of empirically finding what works best. The function itself could be a Gaussian, a “bubble” (constant over the neighbourhood and zero elsewhere), a trapezoid or anything else which monotonically decreases from a given centre. We tried the bubble and trapezoid variations, because of their lower computational cost.

Branching the map into new levels is controlled by yet another parameter, which in our case was again defined with usability in mind: we defined a limit of 1000 documents per cluster (i.e. a cluster that becomes larger than that was branched out to a new level). As with many other parameter settings, this one is arbitrary, or based on heuristics which in many cases are arbitrary themselves. In this case, we got the heuristic from Google - it is the limit of search results displayed to users of keyword search, which they apparently consider a satisfactory number from a usability point of view.

Experimental Results: Failure

The results of our extensive experiments with several variations of SOM can be summarised as follows:

• The method really works and creates zones of similar documents on the map.


• The method is computationally expensive in the extreme, when applied directly to high-dimensional data such as text.

• Dimensionality reduction is also computationally expensive, as well as impractical in our case for several reasons, already outlined above: a) data is dynamic, so dimensionality reduction has to be repeated when the data changes significantly, which b) necessitates a re-mapping of instances to the lower-dimension (concept) vectors, which in turn c) means retraining the whole structure from scratch. Furthermore, the hierarchical structure splits the data into clusters of widely differing document statistics (which is its objective, after all), meaning that d) dimensionality reduction should be different for every node, e.g. will have to be done millions of times for a large hierarchical structure.

• Initialisation is very important, with small variations resulting in completely different results, i.e. - the resulting map is unpredictable and inconsistent over time (would change completely with every re-iteration of the dimensionality reduction / retraining cycle) which would create a usability nightmare for end-users.

• The border effect is extremely strong when the map is relatively small (as ours had to be), and mitigation efforts do not improve the situation reliably.

• The empirical adjustment of multiple parameters makes the method prone to all sorts of errors, which cannot later be detected (i.e. - they result in a bad map, which is not easy to catch since there is no way to know if a better map could be achieved with different values).

• Optimal parameter settings depend on the data; data being different in the different nodes of the hierarchical structure, these parameters have to be re-assessed and re-defined for every node. This requires knowledge of the data plus intimate knowledge of the SOM method.

• In other words, to maintain the structure we need editors knowledgeable of every node (category) of data. In this case, we decided we may as well employ these editors for normal text classification.

5.4.5 Classification

Having conceded that our task of building a large-scale web directory cannot be fully automated, we needed some other way to create it. This could be either an implementation of an existing method for text classification, or an adaptation of one if none of the existing methods fits all our purposes.


5.4.5.1 Classification Methods

After some not very successful experiments with a neural network, we chose to work with the Multinomial Naïve Bayesian classifier as the basis for our prototype because it is simple, easy to modify, tolerant to noise in the training data and it can work well incrementally, e.g. - small changes to the document collection or the reclassification of a small number of documents would not require full re-training. It also has an additional bonus in that after we calculate the word occurrence statistics it needs, we can then use these statistics for some of the additional features enhancing the usability of our system.
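For reference, a minimal sketch of multinomial Naive Bayes classification with Laplace smoothing is given below; the data shapes (per-class word counts, class totals, class priors) and the function name are our illustrative assumptions, not the prototype's code.

<?php
// Sketch: classify a document (word => term frequency) with multinomial
// Naive Bayes. $wordCounts[class][word] holds training counts, $classTotals
// the total word count per class, $priors the class priors, $vocabSize the
// vocabulary size used for Laplace smoothing.
function classifyMNB(array $docTerms, array $wordCounts, array $classTotals, array $priors, int $vocabSize): ?string
{
    $best = null;
    $bestScore = -INF;
    foreach ($priors as $class => $prior) {
        $score = log($prior);
        foreach ($docTerms as $word => $tf) {
            // Laplace smoothing keeps unseen words from zeroing the probability.
            $p = (($wordCounts[$class][$word] ?? 0) + 1) / ($classTotals[$class] + $vocabSize);
            $score += $tf * log($p);
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $class;
        }
    }
    return $best;
}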

5.4.5.2 Hierarchical Classification

We aim to create the basis for a large hierarchical classifier which can accommodate a web directory comparable to the Open Directory Project but with orders of magnitude more data. The method should also take into account the addition and deletion of documents, which in a live directory coupled with a web spider would be continuous, as well as the occasional re-labelling of instances by a human.

The Open Directory has a directory tree with over 763,000 nodes, so to duplicate it we need a hierarchical classifier with at least as many classes.

All hierarchical classifiers have a common shortcoming - if they make an error on a document at one of the higher levels, this error is then propagated down the hierarchy. This has been addressed by dynamically training a specialised MNB classifier for each document in the collection with a subset of existing categories as training data, allowing the classifier to maintain an acceptable accuracy down to Level 5 of the classification tree [Xue et al., 2008]. Due to the high complexity and computation costs of the method though, we did not experiment with it but decided on assisted manual (human) intervention: the backend editing system allows the editor to sort documents by classifier confidence, so that editorial efforts can be concentrated on the documents that most need it.

We decided to treat the hierarchical classifier as a hierarchically ordered set of normal classifiers, where the output of one classifier is the input for those at the lower level. Each classifier splits the data instances into a number of classes, which the lower-level classifiers process further using their own training data and partial document collection. If the higher-level classifier moves an instance from one class to another at a later point in time, this instance gets removed from the document collection of the respective lower-level classifier and put into another one. Since each lower-level classifier works with only a part of the data, it calculates all static measures (such as TF-IDF values for words) locally - i.e., taking into account only its own subset of the document collection. This means there can be no global stop words, because they are stop words in some context only. The usual example for such a word is the article “the”, which is so common in the English language that it cannot be used to distinguish between different types of texts. But, it shows that
the text is in English in the first place - so it is very useful to distinguish between English and Bulgarian texts, for example (i.e. - it is a valuable feature at the top level of the classifier). This approach enables the "floating query" (explained in more detail below), which we consider another important contribution of this work: our system re-evaluates the user's query at every point of the hierarchy using these local values.
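The routing of a document down such a hierarchy of per-node classifiers can be sketched as follows; the node structure (a classifier callable plus children keyed by class label) is an illustrative assumption about how the pieces fit together, not the prototype's actual data model.

<?php
// Sketch: each node holds its own classifier (trained only on its local
// document subset) and children keyed by class label; the winning class at
// one level selects which classifier to consult next.
function classifyHierarchically(array $docTerms, array $node): array
{
    $path = [];
    while (isset($node['classifier'])) {
        $class  = ($node['classifier'])($docTerms);   // local statistics, local "stop words"
        $path[] = $class;
        if (!isset($node['children'][$class])) {
            break;                                    // reached a leaf category
        }
        $node = $node['children'][$class];
    }
    return $path;   // e.g. ['Sports', 'Football']
}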

5.4.5.3 Dynamic Classification

Most traditional classification methods treat classification as a “one off” task: they take a document collection, process it and finish. For a web directory this cannot be a solution, since it updates its document collection constantly (adds, edits and removes documents from it), as well as evolves the classification structure. Furthermore, classification of documents changes with time - sometimes instances are moved from one class to another by editors not because they were wrongly classified initially, but because the perception of what they belong to may have changed. In effect, we have four separate changes (or evolutions) happening simultaneously:

• The classification process is trying to match web documents to nodes in an ontology which slowly evolves (human editors add/delete nodes from it).

• The collection of documents also changes: documents are added/deleted from it.

• The documents themselves change.

• The editors' perception of the documents changes: they classify a document in a certain category now and later re-classify it into another.

The question of ontology evolution has been studied in terms of creating different ontologies for different periods and then comparing them [Enkhsaikhan et al., 2007], which does not, however, facilitate a gradually changing system such as a web directory. An algorithm was proposed to deal with streaming addition and removal of documents [Katakis et al., 2005] which couples an incremental feature ranking method with an incremental learning algorithm that can consider different subsets of the feature vector during prediction; for evolving classes, though, it would need to be retrained from scratch. Analogous to our problem, but not identical, is the task of classifying streaming data: in it, new data arrives continuously and needs to be classified continuously, but the classification criteria themselves do not evolve with time; another difference is that in a web directory, not only new instances but also old instances need to be re-classified in case the underlying ontology changed. One approach to this task [Gao et al., 2007] addresses the evolving ontology (or "drifting concepts" as they call it) by processing the data instances in batches. It tries to build balanced training sets by
under-sampling negative examples for classes, then trains an ensemble of classifiers on them. However, this whole structure is rebuilt from scratch with the arrival of the next batch of data. Ensemble learning is computationally expensive in itself, and retraining the whole structure at relatively small intervals makes the approach prohibitively expensive in our case. We decided to develop our own approach to the problem, addressing all four aspects of dynamic classification at the same time.

5.4.5.4 Approaches to Training

Before starting work on our learning algorithm, we need to see what general approaches exist to learning from existing training data. Intuitively, it is logical to learn from all existing examples. However, many classifiers use different heuristic criteria to minimise training by selecting only a small subset of the data to train on. It is worthwhile to see what approaches spam filters use for learning, since they also use a Bayesian filter (though not a multinomial one) and have been in widespread use for many years, processing real world data. They can employ one of the following training strategies [Zdziarski, 2005]:

• TEFT (train-everything) is the classic approach of all text classifiers, including Bayesian, and is also the most intuitive - learn from all the data. However, it is computationally expensive and (as our experiments also show) does not always provide the best results. The experience of spam filters using this approach is that it suffers from unbalanced classes (where spam is much more than non-spam) and only works well if the ratio is not worse than 70:30, which is decidedly not the case in our situation.

• TOE (train-on-error) - the algorithm runs the classifier part first, then compares the result to a (manual) label if one exists, and learns from the instance only if it has made an error (the so-called if it isn’t broken, don’t fix it philosophy). Such filters are more “static” than TEFT filters, i.e. - they take longer to learn. On the other hand, they work better for large datasets and for highly unbalanced classes, which matches our case.

• TUM (train-until-mature) filters try to be the middle ground between TEFT and TOE - initially they learn on everything, then they stop learning and only retrain when they make a mistake. As with the others, they need labelled data in order to recognise a mistake, plus some heuristics to decide when a token (term) is mature so as to stop learning it.

• TUNE (train-until-no-errors) algorithms learn until they make no mistakes, or very few mistakes. The downside is that when they start making new mistakes, they
have to be retrained over the whole document corpus. Selecting the error acceptance threshold is a matter of an (arbitrary) parameter setting.

The work of spam filters is an online process, in the sense that they work on a stream of documents. Experimentally, they have found that “Train on everything” takes up large resources while not delivering the best classification results. “Train until mature” and “Train until no errors” require occasional re-training when the nature of the underlying data changes (which would be a frequent occurrence in our case). They also have the added complexity that old instances have to be preserved in order to be used during such re-trainings, which proved to be intractable in real life. This leaves “train on error” as the policy of choice. Every email user is familiar with the user interface of this policy: when reading emails (in a mail client or in web mail), there is usually a “Spam” button which the user can press to indicate that a document is spam. Since it was shown to the user in the first place, apparently the filter has not classified the instance as spam, so the pressing of this button indicates an error and forces the classifier to “train” on the instance; going to the “Spam” folder and marking messages from it as “Not spam” has the opposite effect and removes false positives. The main effect of this policy is that the classifier in fact sees only “borderline” documents, which represent a small subset of the data collection (and the better the classifier gets, the smaller this subset becomes). The approach is counter-intuitive and skews word counts significantly, but has been shown in practice to work best so it is being used in most industrial spam filters [Zdziarski, 2005].
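A minimal sketch of the train-on-error policy in the context of our directory: classify first, learn only when the prediction disagrees with the human label. The classifier interface (classify()/train()) is an assumption for illustration.

<?php
// Sketch: train-on-error (TOE). Only "borderline" (misclassified) instances
// are learned from, which keeps training cheap on large, unbalanced data.
function processLabelledInstance(object $classifier, array $docTerms, string $label): void
{
    $predicted = $classifier->classify($docTerms);
    if ($predicted !== $label) {
        $classifier->train($docTerms, $label);   // learn only from mistakes
    }
}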

5.5 Dataset and Experimental Study

Having settled on the general approaches to most issues, we conducted experiments testing various aspects of their work in order to find at least one setup which would yield a working system exemplifying our proposed solution.

5.5.1 Test Setup

A benchmarking mechanism was implemented to record classification results for each algorithm, classification and training times (in microseconds) and the dictionary size of each algorithm's training set. To avoid bias from a fixed ordering of algorithms (file cache, database cache etc.), they were called in random order when classifying each instance. Preparatory operations (reading the document, fetching IDF values etc.) were common for all algorithms and were not taken into account. The test-bed system was designed to allow comparison of the performance of algorithms and their variations while excluding systemic errors and bias. More specifically, the system consists of a preparatory module which prepares a classification/training cycle which is common for all tested algorithms, a classification/training part and a common
statistics/analysis part. The preparatory module downloads documents from the web, filters them for noise and generates the term vector for every document. It then creates a classification queue, which is randomly initialised for every test run to avoid bias from the ordering of documents. Such bias could occur, as the Open Directory data has some idiosyncrasies. Document numbering in it is by category, i.e.: documents 1 to 224,030 are all "Art", the next 222,963 are "Business" etc. If we read a batch of documents in ascending order from the database, all of them would be from the same category, which reflects extremely negatively on some algorithms (see discussion on MNB-SPDA below). To avoid this, our system feeds random documents from the queue to the tested algorithms. Algorithm order also changes randomly for each instance to avoid bias due to database and other caches. The system then records classification success/failure, the document's category (in order to provide class-specific accuracy data), classification time and training time (if the algorithm requires training on the instance), to allow comparisons in terms of computation costs as well as classification success.

After a test run, we can thus compare algorithms in terms of:

• Overall accuracy (success rate of the classifier).

• Class-specific accuracy (separate success rate for each class).

• Computation cost for classification.

• Computation cost for training.

• Storage requirements for training data.

5.5.2 Available Data

For our proof-of-concept prototype, we used a data dump from the Open Directory Project (see Table 5.1) containing a list of 4,228,645 labelled URLs (dated October 2009) which have been classified by human editors into 763,529 categories. We downloaded 149,178 documents from these URLs and trained a hierarchical structure of Multinomial Naïve Bayesian classifiers using these documents as training data. We downloaded a further 9,517 documents for validation, plus some others to expand the search database (the latter are not taken into account when benchmarking algorithms since we do not have manual labels for them). ODP data is very noisy (classification errors, web decay etc.); we did not perform a clean-up so as not to skew experiment data, hence the relatively low accuracy results reported. We expanded classification only over the top three levels of the English-language directory, with a total of 474 classifiers and 6,368 classes.

Document numbers above may seem sufficient for learning but, as noted by previous researchers [Liu et al., 2005], data from major directories is in fact very sparse. 76% of the categories of Yahoo! Directory contain less than 5 documents; similarly, ODP has

an average of 5.54 documents per category. This is due to a very fragmented categorisation structure where many nodes are either too specific and have almost no content that matches them, or are "placeholders" serving only to organise lower-level branches, with no content of their own. As we have already mentioned, an omission by design (of ODP, Yahoo! and all similar directories) is that sites are supposed to be indexed only in one node of the directory tree - e.g., if a site is classified in the Business: Investing: Exchanges subcategory, it is not listed in the higher levels, i.e. Business and Business: Investing, nor in any lower levels. This makes the above situation even worse, since even high-level nodes such as Business are practically empty. In our implementation we did the same as previous researchers did [Liu et al., 2005] - we "folded" high level categories by including in them the content of their sub-categories, both to assist end-users when browsing the structure, and to increase training data dramatically and enable efficient learning.

Archive size: 380 MB
Archive format: RDF
Total URLs: 4,228,645
Unique hosts: 3,057,663
Most pages from one host: 79,762 (geocities.com)
Errors on download: 7.11%
Categories: 763,529
Cross-links between categories: 110,516
Average categories a document is listed in: 1.02
Languages: 18

Table 5.1: The Open Directory data.

We did not use the ODP document descriptions for classification, but the actual text from the downloaded documents. They were passed through a filter which stripped HTML tags and applied further filtering as described above (discarded very short and very long phrases, and very short and very long words). This filtered text was then used by all algorithm variations to eliminate error from differing pre-processing.
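
A minimal sketch of such a pre-processing filter is shown below; the regular expressions and the length thresholds are illustrative assumptions, not the exact values used in our implementation.

import re

def filter_document(html, min_word=3, max_word=30, min_phrase=2, max_phrase=50):
    text = re.sub(r"<[^>]+>", " ", html)                  # crude stripping of HTML tags
    phrases = re.split(r"[.!?;:\n]+", text)               # split the remaining text into phrases
    terms = []
    for phrase in phrases:
        words = re.findall(r"[A-Za-z]+", phrase.lower())
        if not (min_phrase <= len(words) <= max_phrase):  # discard very short and very long phrases
            continue
        terms.extend(w for w in words if min_word <= len(w) <= max_word)   # and very short/long words
    return terms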

5.5.2.1 Noise in the Data

An important problem with the ODP data has to be noted, namely the significant level of noise it contains. This noise is of several types, each of which has a different impact and has to be dealt with differently:

• Dead links - more than 7% of listed sites returned a 404 HTTP response code (Document not found error) on download, or were unavailable (connection timeout). These we removed automatically.

• Web decay - sites that have been removed but either generate soft 404 errors or now serve different content, not what was categorised by ODP. They do not issue an error response code and are the most difficult to remove, as they require a human's decision for each instance. They harm the classifier by making it learn wrong classifications.

• Intentional noise - legitimate-looking sites that embed noise within their own code (comment spam, hidden text due to search engine optimisation techniques, hacked sites with hidden link farming code etc.). Same as above, only more difficult to spot and remove.

• Entry pages - sites that have only a logo and an "Enter" link on the homepage indexed by ODP, which therefore has no usable text to classify. They add noise to the classifier's training phase and significantly reduce its success rate at the classification phase.

• Classification duplication - although contrary to ODP policy, many sites have been indexed in more than one category. The average number of labels per URL is 1.02, so the level of classification noise is much lower than in Yahoo! Directory (2.23), but this is still a problem affecting both the classifier's training and its overall success rate. We chose to ignore this for now, but in a future implementation it should be accepted as additional data and not noise, which would mean implementing a fuzzy classifier.

• Wrong classification - some sites have been classified into a category not suitable for them. This one is highly subjective and the initial intuition was to just ignore it, but it turned out to present the most serious problem of all, since it is the reason for the dominance of the "Regional" category (a category which, in our opinion, should not exist at all - it should be a separate hierarchy).

5.5.2.2 Dimensionality of the Data

In theory, the document collection should only contain words from a limited dictionary (provided we only deal with documents in one language, and we only experimented with the English part of the directory). In practice though, there are many occurrences of names of people or places, foreign language words inserted into English texts, foreign language sites wrongly categorised in an English-language node, documents in more than one language (featuring parallel translations of a text), Olde English (medieval and earlier) texts, typos, and deliberate attempts at filter poisoning (or Bayesian poisoning [Graham-Cumming, 2006; Zdziarski, 2005]) such as large collections of "words" like "btsnwgdguf", bringing the overall dictionary to over half a million words for our (rather small and deliberately limited to one language) sample of documents. Our full dictionary had 593,718 words, 388,329 of which had a Document Frequency (DF) of one (i.e. they were seen in only one document). For comparison, the 20 newsgroups document collection (http://people.csail.mit.edu/jrennie/20Newsgroups/) usually used for benchmarking text classification techniques has a dictionary of slightly more than 61,000 words. Ignoring words that occur in fewer than 10 documents brought the dictionary used for training to a more manageable 62,492 words (about 2% less in the case of train-on-error variations).
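
The Document Frequency cut-off can be expressed in a few lines; the sketch below is illustrative and assumes each document has already been reduced to a list of terms.

from collections import Counter

def prune_by_document_frequency(documents, min_df=10):
    # documents: iterable of term lists; returns the set of words kept for training
    df = Counter()
    for terms in documents:
        df.update(set(terms))                    # count each word at most once per document
    return {w for w, n in df.items() if n >= min_df}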

5.5.2.3 Unbalanced Data

The most significant problem with the ODP data though is that classes are highly unbalanced in a number of ways.

Firstly, they are unbalanced in the number of instances they contain: we found that the largest top-level English-language category ("Regional") contained 1.10 million instances, while the smallest ("News") had fewer than 9,000. Thus, the dominant category had 42.4% of all instances, with the remaining instances spread across 14 categories.

Secondly, classes are unbalanced in the average length of the documents they contain - for example, documents in the "Business" category of our sample had an average of 96.54 unique terms each (after filtering), while the "Adult" category (rather surprisingly) had more - 121.76, and "News" (not so surprisingly) had 235.05.

Furthermore, as mentioned above, the MNB model's assumed term independence does not always hold - some terms have a strong correlation, like "San" and "Francisco" in texts on American cities [Rennie et al., 2003], creating some bias in the algorithm. If this bias applied to terms evenly distributed across classes, it would just lower the success rate in general. What happens in reality though is that different classes violate the independence assumption to different degrees (the third type of imbalance); word count normalisation cannot compensate for this, since it accounts for the length of documents only and not for the actual terms they contain.

Classes are unbalanced also by the fact that the "Regional" category contains practically a bit of everything - local business, local entertainment, local news etc. - so its keywords to a large extent coincide with those from all other classes, unlike for example the "Adult" or "Business" classes, which have more distinctive dictionaries. A high word occurrence rate for common terms, coupled with the high prior probability of the class itself, leads to a situation where the "classic" MNB (with no modifications) predicts the dominant class for almost every instance and has a success rate of ≈ 1 for that class and 0 for some others.

The combination of all these factors results in significant variation in classification success from class to class. Each factor is usually tackled by a separate normalisation, but the interaction between them is not well studied. After some experiments, we eventually decided to work from the opposite end and compensate not for each factor separately, but for their cumulative effect as measured by the class-specific error rate.


5.5.3 Algorithm Testing and Experimental Results

We conducted a large number of (mostly unsuccessful) experiments with a variety of classification methods and a number of variations on their “textbook” versions, which finally allowed us to modify one algorithm enough to make it usable for our purposes.

5.5.3.1 Algorithm Variations Tried

Before we finally settled on our modified version of the MNB classifier, we tried and compared the following classification methods (some of them discussed in more detail below):

1. Single-layer neural network with constant learning rate. Result: slow training, slightly inferior accuracy. No added benefits as compared to Bayesian classifiers. Conclusion: reluctant write-off (would be used if we did not need those additional benefits).

2. Classification based on TF-IDF-weighted term vectors, no learning. Result: on the plus side - no training time, fast classification, but inferior accuracy. Conclusion: would be usable in another context, and we would have used it if not for the next variation.

3. DCC: Classification based on discretised TF-IDF-weighted term vectors, no learning. Result: on the plus side - no training time, very fast classification, but inferior accuracy. Conclusion: usable in another context, such as the front-end where users may want a different point in the fast/accurate compromise. (discussion below)

4. MNB: Bayesian classification by the classic formula, with “train on everything” policy. Result: slow training, disastrous accuracy. Conclusion: total write-off (problems beyond repair). (discussion below)

5. MNB-TOE: Bayesian classification by the classic formula, with "train on error" policy. Result: fast training, good accuracy, high error variance between classes. Conclusion: write-off (repairing its problems resulted in algorithms below).


6. MNB-WN: Bayesian classification by the classic formula, with normalised word counts and “train on everything” policy. Result: very good accuracy but high error variance between classes. Conclusion: very good, but not best for our purposes. (discussion below)

7. Bayesian classification by the classic formula, with "train on everything" policy and random deletion of some training data every iteration. Result: slow training, disastrous accuracy, no real advantage in random "forgetting" to achieve dynamic classification. Conclusion: total write-off (problems beyond repair).

8. Bayesian classification by the classic formula, but ignoring the prior distribution of texts over classes. Result: disastrous accuracy. Conclusion: total write-off (problems beyond repair).

9. Bayesian classification by the classic formula, with "train on everything" policy and an added weight for frequently-trained terms ("confidence amplification") equal to log2 of the number of instances the term was trained on. Result: no improvement. Conclusion: reluctant write-off (good idea, but did not work in practice).

10. Bayesian classification by the classic formula, with "train on error" policy, using stochastic partial prior distributions (from trained instances only, cumulative over N iterations). Result: overall improvement in accuracy, but subject to "jerks" between iterations, some of which led to disastrous accuracy. Conclusion: total write-off (unusable in this version; repairing problems resulted in algorithm below). (discussion below)

11. Bayesian classification by the classic formula, with "train on error" policy, using stochastic partial prior distributions (from trained instances only, over a sliding window of the last N trainings). Results: improved accuracy, algorithm is stable as opposed to previous variation. Conclusion: not good enough yet (repairing problems resulted in algorithm below).

12. MNB-SPDA: Bayesian classification by the classic formula, with “train on error” policy, using stochastic partial prior distributions (from trained instances only, over a sliding window of the last N trainings), with normalised word counts.


Results: fast training, not so fast classification, best achieved accuracy, lowest inter-class accuracy variation (i.e. this algorithm is equally reliable in all categories and not only in those with the most instances).

Conclusion: this was our chosen winner. (discussion below)

13. MNB-SPDWNA: Bayesian classification by the classic formula, with “train on error” policy, using stochastic partial prior distributions (from trained instances only, over a sliding window of the last N trainings), with normalised word counts where normalisation is over the last N trained documents.

Results: a failed attempt to improve the previous algorithm; the second normalisation has the potential to throw it into a positive feedback loop in some circumstances from which it does not recover.

Conclusion: total write-off (fundamental problems beyond repair). (discussion below)

5.5.3.2 Algorithm Selection

In this study, an important part of our goal is to build a viable web directory. We can do this by improving existing classification methods, which in their classic form are not suitable for dynamically changing data and suffer from unbalanced classes; these are the issues we need to address when deciding which one we should use. Before comparing the algorithms we tried, and explaining why we settled on our own modification, we have to stress that our system does not rely exclusively on this algorithm, or any specific algorithm at all; it could work with any other method of creating a stable, reliable and consistent classification structure. Furthermore, our system consists of a distributed personal part (user side) and a public part (server side), which do not have to be created by the same algorithm: multiple personal parts can be implemented using a variety of algorithms, and multiple public parts can also be implemented, using the same or different algorithms. Since data exchange between the two sides is extremely simplified and involves only the end product of these algorithms, it does not really matter what they are.

Before we finally settled on the Multinomial Naïve Bayesian classifier, we tried some other options. A simple neural net was almost good enough in terms of accuracy, but was finally abandoned due to a basic trait that neural networks have: when they produce results, these results are not "explainable" - they are based on a combination of internal weights which cannot be "reverse-engineered" into an explanation or used for any reasoning. On the other hand, MNB is based on individual term weights; much of the additional functionality of our system (e.g. the "floating query") relies on these and cannot be replicated by a neural net. So, while a neural net could probably be calibrated enough to work for the main task of text classification, we would need additional tools for the other tasks, which we decided was not an optimal solution.


We also tried an approach similar to the "search engine classification model" [Anagnostopoulos et al., 2006]. It is based only on a fast search in the database for all terms contained in a document. Every term has values for every category, so the algorithm just sums these and announces the category with the highest combined sum as the winner (in effect, this compares the term vector for the document to a term vector for the category; since the term vector for the category is a "centroid" in clustering terminology, we called the classifier a "centroid" classifier, and its discretised version a "Discretised Centroid Classifier" - DCC in comparison tables below). These values can be "real" (floating point) pseudo-TF-IDF over the categories:

TF\text{-}IDF_{ic} = TF \times IDF = \frac{n_i}{\sum_k n_k} \times \log \frac{D}{|\{ d \in d_c : w_i \in d \}|} \qquad (5.3)

where TF-IDF_{ic} is the value for a word w_i in a category, n_i is the number of documents in that category d_c that contain the word, n_k is the number of all word occurrences in the category and D is the total number of documents. When normally calculating TF-IDF, it should not be d_c in the denominator but the number of categories. However, in a big enough collection (such as the web) a word would most probably be seen in every category at least once, so this value would become meaningless. Using d_c instead manages to extract the most meaningful terms for every category.

After we calculate this value for every word for every category, we end up with values which are not comparable since they are not normalised; normalising them properly is a hard task given all the broken independence assumptions discussed above, and the reason we abandoned the real values and applied discretisation. We "normalise" with a brute-force method: in every category we sort terms in order of ascending TF-IDF_{ic}, then substitute the value with the order number. Thus, if we have a dictionary of 100,000 terms, the least important of them for a category would have a value of 0 in that category, the next would have 1 and so on, and the most important would have a value of 100,000. To classify, we just sum these values for every category and find the winner as the category with the highest count; this is faster than the initial variant as it sums integers as opposed to floating point values, and much faster than MNB, which has to calculate and sum logarithms of floating-point values. The method is crude, but turns out to be practical for the same reason MNB works: it only needs to find the winner and not to estimate probabilities precisely. On the plus side, it also lacks a "training" phase; all the above "learning" operations can be done with two fairly simple SQL queries in the database. Unfortunately, it is not as accurate as our final version of the MNB, so we decided not to use it for the back-end of the system. However, it can be used for the front-end and provide end-users with a faster alternative if they prefer fast to accurate search.
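
The Python sketch below illustrates the Discretised Centroid Classifier described above. It is a simplified illustration: scoring unseen words as 0 and relying on max() for tie-breaking are assumptions, and the real implementation performs these steps with SQL queries rather than in-memory dictionaries.

import math
from collections import defaultdict

def build_dcc(category_docs):
    # category_docs: category -> list of documents, each a list of terms
    D = sum(len(docs) for docs in category_docs.values())        # total number of documents
    ranks = {}
    for cat, docs in category_docs.items():
        n_total = sum(len(d) for d in docs)                      # all word occurrences in the category
        df = defaultdict(int)                                    # documents in this category containing the word
        for d in docs:
            for w in set(d):
                df[w] += 1
        # pseudo-TF-IDF of equation (5.3)
        tfidf = {w: (df[w] / n_total) * math.log(D / df[w]) for w in df}
        ordered = sorted(tfidf, key=tfidf.get)                   # ascending importance
        ranks[cat] = {w: rank for rank, w in enumerate(ordered)} # discretise: value -> order number
    return ranks

def dcc_classify(ranks, terms):
    # sum the integer ranks of the document's terms per category; the highest sum wins
    scores = {cat: sum(r.get(w, 0) for w in terms) for cat, r in ranks.items()}
    return max(scores, key=scores.get)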

Having thus found that the Multinomial Naïve Bayesian classifier remains the most probable basis for our tool, we tried to modify it to fit our purposes.


Firstly, in order to maintain a "live" web directory, it is apparent that the classification task is not "one off"; rather, it has to be repeated again and again, indefinitely. Consequently, we do classification continuously, in an indefinite number of iterations (Algorithm 1); when we finish classifying all existing instances, we just re-initialise and start again. During the re-initialisation process, we take into account any changes that have happened in the meantime, such as new instances added to categories, old ones removed, changes made to document descriptions or classification etc. For technical reasons, we actually make a "snapshot" of the data as it is at that moment, and the classifier works with that snapshot and not the "live" data.

Algorithm 1: Multinomial Naïve Bayes (MNB)
foreach iteration do
    Calculate prior distributions over whole collection;
    Apply weight decay;
    Prepare classification queue;
    Initialise iteration;
    (Classify Queue)
    foreach document do
        foreach class do
            Calculate log-likelihoods of words in document;
            Add global log-likelihood of class;
        end
        Find class with highest probability;
        Announce winner class;
        Save results;
    end
end
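
As an illustration of the classification step inside the inner loop of Algorithm 1, the Python sketch below scores a document against each class by summing word log-likelihoods and adding the class's log prior. The smoothing constant for unseen words is an illustrative assumption.

import math

def mnb_classify(doc_terms, class_priors, word_likelihoods, unseen=1e-9):
    # class_priors: class -> P(c); word_likelihoods: class -> {word: P(w|c)}
    best_class, best_score = None, float("-inf")
    for c, prior in class_priors.items():
        score = math.log(prior)                               # global log-likelihood of the class
        likelihoods = word_likelihoods[c]
        for w in doc_terms:
            score += math.log(likelihoods.get(w, unseen))     # log-likelihood of each word in the document
        if score > best_score:
            best_class, best_score = c, score
    return best_class                                         # the "winner" class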

Secondly, we had to make MNB more accurate. As already discussed, the most serious problem with it is that it tends to incorrectly favour large classes, leading in our case to a situation where it places almost every instance in the dominant class ("Regional"), achieving an accuracy of 100% there, and nothing at all in some other classes, meaning an accuracy of zero (see Table 5.4).

One reason for such behaviour is the high word occurrence rates in the dominant class, which skew the “document evidence” in its favour. Initially, we tried to apply some standard mitigation methods. We discounted all word counts by the TF-IDF measure, eliminating some noise from common words. This did not help (results for MNB in the table are for MNB with TF-IDF weighting applied). Applying the “train on error” policy did not improve it noticeably either.

We then applied word count normalisation [Frank and Bouckaert, 2006] to compensate for the varying average length of documents between classes:


n'_{wd} = \alpha \times \frac{n_{wd}}{\sum_{w'} \sum_{d \in D_c} n_{w'd}} \qquad (5.4)

where the n_{w'd} are class-specific word counts (the number of occurrences of the word w' in documents d of the part of the corpus D_c belonging to class c), and the normalised count n'_{wd} replaces w_i in the "classic" MNB equation (4.6). We took the smoothing parameter α (vector length measured in the l1 norm) to be equal to 1, as that was shown to work well [Frank and Bouckaert, 2006]. We refer to this variation as MNB-WN (Multinomial Naïve Bayes with Word Normalisation) in the text below. The algorithm (Algorithm 2) now involves an extra step, calculating normalisation based on word occurrences in each class (once per iteration), plus an additional operation at the classification of each instance.

Algorithm 2: MNB with Word Normalisation (MNB-WN)
foreach iteration do
    Calculate prior distributions over whole collection;
    Apply weight decay;
    Calculate normalisation based on word occurrences in each class;
    Prepare classification queue;
    Initialise iteration;
    (Classify Queue)
    foreach document do
        foreach class do
            Calculate log-likelihoods of words in document;
            Apply normalisations;
            Add global log-likelihood of class;
        end
        Find class with highest probability;
        Announce winner class;
        Save results;
    end
end

Term frequency distribution also varies greatly between classes, so we normalised these values by the l2 norm which provides some “subtle benefits” for classification [Rennie et al., 2003] (in our experiments - to the tune of a 2% increase in accuracy). In

f_i = \frac{n_i}{\sum_k n_k} \qquad (5.5)

(term frequency of word i equals its count divided by word counts of all k words in the document), we normalise f_i as


f'_i = \frac{f_i}{\sqrt{\sum_k (f_k)^2}} \qquad (5.6)

and use this value as the TF part of our TF-IDF estimation.
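
A compact Python sketch of these normalisations is given below; it is illustrative only, and the exact place where each normalised value enters the classifier follows equation (5.4) and the TF-IDF weighting described above. Empty inputs are assumed not to occur.

import math

def l2_normalised_tf(term_counts):
    # equations (5.5)-(5.6): term frequencies of one document, normalised by the l2 norm
    total = sum(term_counts.values())
    tf = {w: n / total for w, n in term_counts.items()}              # equation (5.5)
    norm = math.sqrt(sum(f * f for f in tf.values()))
    return {w: f / norm for w, f in tf.items()}                      # equation (5.6)

def word_count_normalisation(class_word_counts, alpha=1.0):
    # equation (5.4): scale each class's word counts so they sum to alpha,
    # compensating for the varying average document length per class
    normalised = {}
    for c, counts in class_word_counts.items():
        total = sum(counts.values())
        normalised[c] = {w: alpha * n / total for w, n in counts.items()}
    return normalised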

Coupled with the “train on error” policy, this improved the situation dramatically in terms of overall accuracy, although classification errors still varied widely between classes (see results for algorithm MNB-WN in Table 5.4).

Additionally, it was apparent from the start that the most important task would be to make the algorithm we use dynamic, in the sense of handling dynamic data and dynamic classification. If just the data was dynamic, this would have been a streaming algorithm, which would not present such a challenge. However, dynamic classification where the classes and their definitions evolve over time means that we have to a) constantly return to instances already classified since they could have been re-classified later (which we were already doing in constant iterations over the available data), and b) find some effective mechanism to re-train the classifiers when the classification of an item changes or classes are added or deleted.

The first thing we did was to introduce "weight decay" of learned word weights. This is used in some "real world" solutions, for example the junk mail filter in the Thunderbird 3 mail client (part of the Mozilla suite). However, in Thunderbird it is triggered by a rather unreliable event - the dictionary reaching a particular size, which is a) an arbitrarily-set parameter, and b) not guaranteed to happen at all (for example, if the document collection is from a restricted domain with a small dictionary). In our case, we decided to use our classification iterations, since they are guaranteed to happen at regular intervals. After every iteration, we divide word counts (for words contained in each category) by 2, and delete from the database words that remain with a word count of 1. This deals with words that have not been used for some time (e.g. when the documents that contain them were removed from a category or from the collection altogether) and is especially useful in protecting the classifier from Bayesian poisoning. It also takes care of forgetting classifications which have changed: e.g. if a document was moved from one class to another, its word count in the original category will "decay" with time and will eventually be removed.
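
The decay step itself is simple; the Python sketch below assumes integer word counts kept in nested dictionaries (the prototype stores them in a database instead).

def apply_weight_decay(class_word_counts):
    # class_word_counts: class -> {word: count}
    for c, counts in class_word_counts.items():
        decayed = {w: n // 2 for w, n in counts.items()}                      # halve every learned count
        class_word_counts[c] = {w: n for w, n in decayed.items() if n > 1}    # drop words whose count falls to 1 (or 0)
    return class_word_counts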

However, we still wanted a faster method of forgetting old classifications. As already stated, "train on error" classifiers are more "static" than "train on everything" ones, which means not only that they learn slowly, but also that they unlearn slowly; for example, they will recover very slowly after a Bayesian poisoning attack or a re-classification of documents by human editors (which has the same effect on the classifier). The first intuition was to periodically delete some of the existing training data (again, an approach used by some spam filters), in the hope of finding some balance point where the algorithm quickly "forgets" something it no longer needs, such as an old classification rule, while at the same time remembering enough to remain acceptably accurate with the instances that have not changed - hence our experiments with random deletion of some data after every iteration. The experiments were unsuccessful - these deletions either had no visible effect (if too few), or harmed the classifier so much that it needed full retraining (if too many).

Algorithm 3: MNB-WN with TOE, values recalculated at start of iteration (failed attempt)
foreach iteration do
    Calculate prior distributions over errors during last N iterations;
    Apply weight decay;
    Calculate normalisation based on word occurrences in each class;
    Prepare classification queue;
    Initialise iteration;
    (Classify Queue)
    foreach document do
        foreach class do
            Calculate log-likelihoods of words in document;
            Apply normalisations;
            Add global log-likelihood of class;
        end
        Find class with highest probability;
        Announce winner class;
        Save results;
    end
end

We have to specifically note also that class prior distributions in this case were recalculated at the start of each iteration, and that (the classifier being "train on error") they were in fact not class distributions of all instances but class distributions of errors - which is a very different distribution, as already noted. As such, they are also subject to much more change: e.g., if the classifier has done very well predicting instances from the most populous class during an iteration, then it has not trained on it too often and the prior distribution and word counts for this class suddenly become too low when the next pass is initialised (Algorithm 3). This leads to significant fluctuations in performance between iterations - success rates drop sharply after an update, then on the next update the class becomes dominant again (since during this pass the algorithm has made too many errors on it). The end result is a "see-saw" between these two states, which is unacceptable and needed fixing (increasing the value of N makes it smoother, but brings the algorithm closer to the original static MNB-WN variant with classifier life-time values).

Stepping back from the specifics, we also have to make the general observation that this variant was already significantly different from the textbook algorithm. The "classic" MNB uses the class prior distribution as an important part of its probability estimations; "prior distribution" in it means the "distribution of documents into the respective classes", where "documents" is taken to mean all documents in the collection. Most "real world" spam filters though have found it is better to employ the train-on-error policy [Zdziarski, 2005] even though it contradicts the theoretical model. Due to this policy, the classifier in effect sees only a limited part of the collection and bases its statistics on it only. Both the prior distributions and the word counts it uses are not those of the whole collection but those of the errors. Since classes in the case of spam are very unbalanced and this approach unbalances them even further, spam filters try to compensate for this with a heuristic giving a false positive error more weight than a false negative error. This heuristic is parameter-based and there is no methodology to calculate this parameter - it is based on experiments or personal preferences (where a spam filter can be tuned by a person). This parameter is a pre-set value and does not change in the life of the filter, irrespective of how it affects its efficiency or what data it processes.

In our case the problem arose not so much from the unbalanced classes, but from the re-initialisation of the document corpus statistics at the start of each iteration. It was apparent that we needed a smoother transition which should not subject the classifier to such “jerks” in performance after each update of values. It was also apparent that industrial spam filters obtain some benefit from breaking the rules, although the theoretical basis of this has not been well-studied.

We decided to do the obvious and break the theoretical rules even further to see if there would be any additional benefits. The most immediate way to do this was to completely ignore prior class distributions: if spam filters become better by not using the “real” distributions, maybe we can do better by using none at all? In hindsight, the idea was stupid: prior distributions are there for a reason; this set of experiments failed (the classifier ignoring prior distributions was even worse than the “classic” MNB with no normalisations). Working from the opposite direction, we tried to boost the document evidence instead, multiplying some word counts by a “confidence factor” (giving preference to frequently seen terms, or in other words - classification rules with higher support). This did not work either.

We then decided to break away from the “iteration” paradigm, since it is an artificially introduced construct and it is not logical to expect any benefits from it. Instead, we decided to use a constantly updated “rolling count” of values over errors (a “sliding window” over the last N errors), which eventually allowed us to achieve an algorithm acceptable for our purposes.

There are three sets of data used to classify an instance: word count distribution by class, prior distribution of classes and word count normalisation by class. The question was to which of them to apply the sliding window approach.

The first set we did not try to manipulate. We use a word count of all errors (for all time), decaying after each iteration. Changing that would add enormous complexity to the database, since we would have to record not only the word counts themselves, but also when each individual incrementation of the counter happened, so that we could undo it later (when it "slides" out of the window).

The second set of data, the prior distribution of classes, we redefined as being based on the last N trainings of the classifier, i.e. we calculate the class prior distribution over the last N errors (we found that N = 10,000 works well enough, and increasing it has no visible benefits) - Algorithm 4.

Algorithm 4: MNB with Stochastic Prior Distribution Adjustment - MNB-SPDA
foreach iteration do
    Apply weight decay;
    Calculate normalisation based on word occurrences in each class;
    Prepare classification queue;
    Initialise iteration;
    (Classify Queue)
    foreach document do
        Calculate prior distributions over last N classification errors (from error log);
        foreach class do
            Calculate log-likelihoods of words in document;
            Apply normalisations;
            Add current log-likelihood of class;
        end
        Find class with highest probability;
        Announce winner class;
        Compare with manual label;
        if winner class and manual label are different then
            Update error log;
        end
        Save results;
    end
end
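
The "Calculate prior distributions over last N classification errors" step can be sketched in Python as follows. The Laplace-style smoothing of classes with no recent errors is an illustrative assumption; the prototype performs the equivalent aggregation with an SQL query over the error log.

from collections import Counter, deque

class StochasticPrior:
    def __init__(self, classes, window=10000):
        self.classes = classes
        self.errors = deque(maxlen=window)      # the error log, a sliding window of the last N errors

    def record_error(self, true_class):
        self.errors.append(true_class)          # called only when the classifier was wrong

    def priors(self):
        # class prior distribution estimated over the recent errors, not over the whole collection
        counts = Counter(self.errors)
        total = len(self.errors) + len(self.classes)
        return {c: (counts[c] + 1) / total for c in self.classes}   # +1 smoothing (assumption)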

In effect, our statistics are based on a non-random, stochastically selected and dynamic sample of the data (which is why we called the algorithm MNB with Stochastic Prior Distribution Adjustment, or MNB-SPDA). This actually changes the fundamental workings of the algorithm: it is no longer a batch-classifying static algorithm; instead, it is now dynamic in the sense that it needs to keep classifying (and make some new errors!) in order to learn, and adapts to the data. The "sliding window" approach introduces a negative feedback loop: problematic classes become over-represented in this stochastic sample (there are disproportionately more errors in them) so their prior probability is adjusted upwards (see Table 5.3) to compensate for the problems leading to low accuracy (note that we do not even need to know what the problems are). Of course, a price has to be paid: increased accuracy in some classes happens at the expense of classes where the classifier was previously more accurate. Nevertheless, overall accuracy increases and - more importantly - error variation between classes drops roughly fourfold, meaning that the classifier is now equally reliable for all classes. This approach leads to some important consequences that have to be noted:

• The algorithm now has a learning stage: we have to keep classifying (and making errors) until it converges to a dynamically stable state. We found experimentally that this happens after four or five iterations over the whole document corpus.

• Classification order matters! If we do not randomise iterations properly and, for example, start classifying with instances of one class only, the algorithm can easily get unbalanced, since this class becomes over-represented and can go into wild fluctuations that are difficult to recover from (see Figure 5.2). We had to make sure iterations are randomised properly by adding an extra routine to the iteration initialisation procedure which re-ordered instances for classification every time (using the in-built random number generator of the database).

• The method deals automatically with any imbalances, noise and other problems in the data, compensating for them with one general correction instead of several separate normalisations for each type of imbalance. There are no heuristics or parameter settings to modify for its various use cases (which are a major issue for industrial spam filters [Zdziarski, 2005]). We can apply it in every part of the category tree without modification.

The method has three drawbacks: a somewhat increased complexity (in the need to maintain the "sliding window" of errors), the sacrifice of accuracy in the dominant class and the issue of properly randomised initialisation. However, we felt these trade-offs were acceptable as a) the algorithm is still better overall than the alternatives, and b) it is equally reliable for all classes (as already discussed, this is extremely important from the point of view of usability and end-user perception, where the user would find the whole system useless if it was making errors in his preferred category). We now had an algorithm which was acceptably, and evenly, accurate (given the noisy data used from the Open Directory, its results may not seem impressive but were nevertheless better than the others; with data from better controlled sources it would be much better).

Understandably, we tried to improve it even further. We already said that there are three sets of data that can be manipulated using our "sliding window" approach. The one which we had not tried yet was word count normalisation. We now did it not over the full document collection but over the last N documents the classifier trained on (i.e. made a mistake on) - we now did Stochastic Prior Distribution Adjustment and Word Normalisation - MNB-SPDWNA (Algorithm 5). However, it turned out that MNB-SPDWNA has an important and fundamental drawback: the value for normalised class word counts is in the denominator part of equation 5.4. If, for any reason, the algorithm makes no errors or very few errors in one class, the values for it become too low. Since we are dividing by them, we increase its score and the algorithm starts predicting this class for all subsequent instances.


Algorithm 5: MNB with Stochastic Prior Distribution Adjustment and Word Normalisation - MNB-SPDWNA
foreach iteration do
    Apply weight decay;
    Prepare classification queue;
    Initialise iteration;
    (Classify Queue)
    foreach document do
        Calculate prior distributions over last N classification errors (from error log);
        Calculate normalisation based on word occurrences in the last N classification errors (from error log);
        foreach class do
            Calculate log-likelihoods of words in document;
            Apply normalisations;
            Add current log-likelihood of class;
        end
        Find class with highest probability;
        Announce winner class;
        Compare with manual label;
        if winner class and manual label are different then
            Update error log;
        end
        Save results;
    end
end


Figure 5.2: Consequences of bad initialisation. Vertical lines represent new iterations.

The consequences are illustrated in Figure 5.2, which shows the result of a bad initialisation in a test run. We did not randomise the training sequence well enough and one of the classes was not represented in the first 100,000 trained instances. The algorithm made more than 10,000 errors in that time, which resulted in an adjusted prior distribution where this class was not present. Its word normalisation values then became extremely large. Interestingly, for the next 30,000 instances the algorithm actually exhibited a better success rate than before, then it became drastically unsuccessful. The initial improvement was due to weight decay being applied, which also made the respective word counts much lower and lessened the effect of the wrong normalisation. This also happened at the end of the next iteration when weight decay was applied again, but to a lesser extent since weights were large already. It then converged to a somewhat higher (but still much worse than normal) success rate and remained there after successive iterations, i.e. it did not recover from the error. In contrast, the MNB-WN variant suffered a short and sharp drop in performance which it quickly overcame with increasing word counts as more documents were classified. Later applications of weight decay, as expected, did not affect it.

We got similar outcomes on a number of other test runs even with correct initialisation, meaning that the MNB-SPDWNA algorithm in its pure form is too unreliable. There are two stable states to which it can converge: the optimal state with an even error distribution, which is a form of dynamic equilibrium that needs to be maintained constantly; and the worst case, where the last N errors do not include an error in one of the classes - in this case the algorithm predicts this class for most instances, thereby making no further errors in it and entering a positive feedback loop from which it cannot recover.

We tried to introduce some form of correction by giving the stochastic word normalisation a weight of 80%, with the remaining 20% being the usual word normalisation over the whole collection. The thinking was that this would act as the Laplace correction for missing values in MNB and would introduce some negative feedback. Unfortunately, this only partly mitigated the positive feedback effect and did not make the algorithm usable, so we had to write it off. We finally decided to use our MNB-SPDA algorithm for the back-end classifier of our system.

5.5.3.3 Algorithm Comparisons and Discussion

Apart from our variations using a dynamic stochastic sample of the last N unsuccessfully classified documents, all tested algorithms are static - i.e., they use the whole document collection and produce virtually the same results at each iteration (variation between iterations is within 0.01% due to randomised order of testing). Our first intuition was that our MNB-SPDA algorithm would keep getting better with each successive iteration, but this does not happen. On further reflection, the explanation for this is simple: the negative feedback loop works towards minimising error variance between classes and not errors themselves, so the algorithm can never be completely error-free. MNB-SPDA converges to a stable state after four or five iterations and its accuracy does not change significantly thereafter. Variation between iterations is within 2.5% for MNB-SPDA - it consistently beats MNB-WN in accuracy, but with a varying margin (results reported here are from an average-to-low iteration).

MNB      MNB-TOE   MNB-WN   MNB-SPDA   MNB-SPDWNA   DCC
47.91%   67.86%    69.44%   70.40%     42.84%       68.33%

Table 5.2: Average classification accuracy

The reported results in Table 5.2 show that the "classic" MNB algorithm without any corrections performs badly. It is interesting to note though how it fails: it is extremely good at guessing the dominant category and extremely bad at some of the others (with literally zero success rate for the "News" and "Reference" categories, and close to zero for "Business" and "Shopping" - see error distributions in Table 5.4). The only variant performing worse (except MNB-SPDWNA) was the experiment we conducted with a train-on-error algorithm using a prior over errors updated once per iteration. The train-on-error policy by itself improves the situation enormously - a jump from 47.91% to 67.86%, which explains why spam filters use it. Word count normalisation [Frank and Bouckaert, 2006] makes results even better - 69.44%. The best results were exhibited by MNB-SPDA, which had a 70.40% average accuracy rate. This may not sound impressive if compared to classification results reported elsewhere, but we consider it a success given the enormously noisy data we worked with.

The prior distribution over the whole collection, and the values to which MNB-SPDA converged after 8 iterations, can be seen in Table 5.3. The last column provides an explanation of the improved results of the algorithm. What we see in it is an artificial "boosting" of categories with diluted content (e.g. "Society" and "Arts", which are difficult to define) where the classifier had problems, while clear-cut categories such as "Adult" and "Games" are corrected downwards since the algorithm can guess them anyway based on document evidence alone, no matter what the prior is. The "Regional" category was also adjusted downwards, for the opposite reason - its high prior probability makes it easier to guess. The end result of the adjustment is significantly higher accuracy in the problematic categories at the expense of a slight worsening of results elsewhere. This provides a much smoother spread of the error across categories: while the standard algorithms are very good in the dominant category and extremely poor in some of the others, MNB-SPDA has a much smaller variation in its accuracy and is not significantly worse in any category.

Category     All      Stochastic   Boost
Arts         9.51%    11.46%       +20.50%
Business     9.06%    9.52%        +5.08%
Computers    4.97%    5.94%        +19.52%
Games        2.20%    2.00%        -9.09%
Health       2.37%    2.74%        +15.61%
Home         1.25%    1.18%        -5.60%
News         0.25%    0.24%        -0.04%
Recreation   3.97%    4.14%        +4.28%
Reference    2.19%    2.23%        +1.83%
Regional     42.07%   35.59%       -15.40%
Science      4.61%    4.97%        +7.81%
Shopping     3.53%    4.14%        +17.28%
Society      8.60%    10.98%       +27.67%
Sports       3.85%    3.92%        +1.82%
Adult        1.58%    0.95%        -39.87%

Table 5.3: Stochastic adjustment to distribution. The first column shows the real distribution of instances, the second column shows what distribution the MNB-SPDA algorithm converged to, and the third is the boost it gave to problematic categories (or the handicap where it made negative corrections to categories in which it was too successful).

For the purposes of building a web directory, we think it is much more valuable to have equal treatment of all categories (i.e. - equal error levels) than higher overall accuracy. Luckily though, MNB-SPDA provides both.

If we look at the results in terms of computation costs, MNB-SPDA also saves significantly on training time (more than a fourfold reduction). A comparison of total training times (normalised) can be seen in Table 5.5.

There is also a small saving in the training set: while (all of the) train-on-everything algorithms used 61,255 words, MNB-SPDA used 59,832 to train its classifier. Both of these savings - in time and in storage requirements - are due to the fact that the algorithm trains only on a subset of the documents. With a less noisy dataset, where the algorithm would make fewer errors, these savings would be much higher. It can be argued that our algorithm performs a form of dimensionality reduction by feature selection, as it uses fewer features (words) than are available. On the negative side, it has to be said that MNB-SPDA does need several iterations to converge, so it takes some time to reach this optimal state. However, in a real-world web directory where dynamic classification will go on indefinitely (for years), losing several hours initially will not be a major consideration.

Category     MNB      TOE      WN       SPDA
Arts         23.00%   66.27%   69.25%   65.03%
Business     00.59%   49.88%   39.92%   68.02%
Computers    13.12%   54.10%   68.06%   65.47%
Games        10.88%   49.83%   76.40%   71.36%
Health       6.12%    41.10%   61.66%   68.10%
Home         14.61%   40.90%   56.97%   73.51%
News         0.00%    0.99%    48.34%   72.19%
Recreation   0.98%    39.96%   38.31%   68.63%
Reference    0.00%    22.78%   47.00%   71.70%
Regional     99.87%   89.91%   87.29%   74.67%
Science      15.94%   55.59%   60.12%   66.80%
Shopping     0.07%    32.60%   54.71%   68.13%
Society      14.87%   62.30%   46.05%   60.82%
Sports       2.65%    50.25%   66.95%   72.25%
Adult        17.20%   54.00%   86.88%   83.22%
Deviation    0.2394   0.1945   0.1493   0.0500

Table 5.4: Class-specific success rates. Note how MNB-SPDA has sacrificed accuracy in the Regional class in order to improve all others.

MNB    TOE    WN     SPDA   DCC
4.13   1.13   4.17   1.00   0.00

Table 5.5: Algorithm training times, normalised. MNB-SPDA trains less than the ordinary train-on-error variant because it makes fewer errors to train on. DCC has no training phase.

MNB    TOE    WN     SPDA   DCC
1.00   1.01   1.01   1.46   0.18

Table 5.6: Algorithm classification times, normalised.

MNB-SPDA is also somewhat heavier in classification cost; classification times can be compared in Table 5.6. This is due to the fact that the algorithm performs an adjustment to its prior distributions after each training, which adds some complexity. The algorithm has to maintain a stack with the last N errors and summarise it before classifying each instance. It is only a simple SQL query - a “SELECT ... GROUP BY ...” performed over a relatively small table (10,000 instances in our case); however, this is an operation other algorithms do not need. In this area, computation can be optimised in the future.


5.6 The Floating Query

Because of the way we separate our system into a hierarchical tree of independent classifiers with their own independent document collections, we can also afford to add a feature which we consider one of the most important aspects of the web exploration model we propose: the "floating query". What we mean by this is that, while the user navigates the directory structure, the search query also changes depending on the current context. The query starts as a list of keywords with some initial weights, which are then multiplied by the local weight of each keyword in the category which the user is currently in. Thus, the query vector is re-evaluated with every click the user makes, adapting it to the current context. An idealised example can be seen in Table 5.7 (values are not realistic but chosen for illustration purposes only).

             Original   English   Equipment   Aerospace
and          0.01       0.92      0.00        0.00
repair       0.47       0.69      0.15        0.06
radar        0.23       0.03      0.89        0.08
helicopter   0.20       0.61      0.85        0.98

Table 5.7: Evolution of the original search query as the user moves through the English-language category into Equipment and Aerospace subcategories.

Note how the word and is initially distinctive, as it is a useful indicator to distinguish between English and other languages, then becomes a stop-word (practically disappears) as all documents in the category become English-language only. In the top level of English-language sites, repair is a good discriminator between Equipment and other sites, but then becomes common within that context and its importance shrinks accordingly. In the Aerospace category, helicopter becomes important as it distinguishes different types of Aerospace equipment.
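
A minimal Python sketch of the floating-query re-weighting is given below. The interpretation of the columns of Table 5.7 as per-category local weights, and the concrete numbers, are used purely for illustration.

def reweight_query(query, category_weights):
    # multiply each keyword's current weight by its local weight in the category just entered
    return {term: w * category_weights.get(term, 0.0) for term, w in query.items()}

# Illustrative walk-through with made-up weights in the spirit of Table 5.7:
query = {"and": 0.5, "repair": 0.5, "radar": 0.5, "helicopter": 0.5}
for local in ({"and": 0.92, "repair": 0.69, "radar": 0.03, "helicopter": 0.61},   # English
              {"and": 0.00, "repair": 0.15, "radar": 0.89, "helicopter": 0.85},   # Equipment
              {"and": 0.00, "repair": 0.06, "radar": 0.08, "helicopter": 0.98}):  # Aerospace
    query = reweight_query(query, local)   # the query vector adapts at every click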

5.7 User Testing, or Lack Thereof

Since our main claim is that we will improve the usability of the web for users, it would have been nice to measure how well we have managed to do so. The lack of such measurements is a regrettable, but unavoidable omission of this work.

Before we explain why, it is important to see what frameworks exist for measuring "usability", or user's satisfaction. When assessing search engines, researchers utilise two categories of effectiveness measures: system-based and user-based. The system-based approach relies on objective measurements of system performance like precision (the fraction of retrieved instances that are relevant to the query) and recall (the fraction of relevant instances that are retrieved). If we can define which instances are relevant to the query, we can easily evaluate any search algorithm by comparing its results with the results of other algorithms. However, we have already seen that defining "relevance" is a non-trivial task, and that modern search engines specifically do not try to equate "relevance to a query" with "documents containing the terms of the query". On the other hand, our contribution diverges from this even further: one of our goals being to "broaden the horizons" of the user by offering him information he did not ask for, we are in fact deliberately not trying to achieve high precision or recall, so comparing on that basis would be pointless. The user-based approach emphasises users' subjective evaluations of the search engines, such as perceptions and satisfaction. Even though these measures are not as well defined as those of the system-based approach, it is accepted that a methodology based on the end-user point of view is also essential for evaluating search engine effectiveness [Chuang and Wu, 2007]. Typically such a methodology involves a survey with a relatively representative group of users. Surveys are usually conducted by giving a group of people tasks to perform, then measuring how well the search engine assists them with these tasks. However, it has been found [Russell and Grimes, 2007] that users behave differently when doing their own tasks and not "artificially" assigned tasks: they spend more time, make fewer queries and issue different kinds of queries overall. The problems with the approach are that:

• It is highly subjective and reflects both the user's bias and the bias of the person who has set up the experiments. It may be a subconscious bias, but an experimenter would tend to select experiments that illustrate a particular point of view or usage scenario.

• Satisfaction depends on the user’s prior expectations. The user being happy or not is not a measure of how well a search engine is doing but of how well it is doing in comparison to what the user expected it to do.

• Surveyed people often behave as if they are tested and try to say or do what they think is expected of them.

Furthermore, the system we propose is not directly comparable to any existing system - search engine or web directory; it also requires a significant investment on the part of the user before it can reveal its full potential, and then it would do that in a limited number of usage scenarios. As we have already said, our system cannot (and is not designed to) supplant existing systems but is an addition to them. As such, we decided there is nothing to compare and we did not attempt an end-user evaluation of the prototype.


6 Conclusion

In this chapter, we summarise the current state of the dominant web information-finding solutions, what major issues exist and what we propose to do about them. We also outline some areas for future improvement which remained outside the scope of this work.

6.1 State of the Art

Modern information-finding solutions evolved from a "system-centric" approach, where search was based exclusively on the data available and not on the person searching it. Since their very beginning, the main problem these solutions faced was the issue of the relevance of information resources to a particular query. However, "relevance" has not even been properly defined yet, let alone a way found to sort resources by it. Search engines have tried to mitigate this in two ways:

• The concept of "relevance" has been substituted with "importance", which is easier to define from a system-centric point of view. This definition is based on graph analysis, where a resource's importance is judged by the number of other resources linking to it, weighted by their own importance.

• Efforts at various types of personalisation are made, based on “implicit feedback” such as the user’s prior search history, geographic location, click patterns etc.

Both of these approaches have their problems:


• Many “important” resources are not more relevant to a query than “not important” resources, since “importance” does not in fact correlate strongly with “relevance”. This approach has also had the unfortunate side effect of creating a whole new industry, that of “search engine optimisation” which aims to increase the importance of online resources by exploiting various features of the algorithms that calculate it. This, in turn, has made finding relevant resources even more difficult and has brought about the appearance of enormous amounts of “digital garbage” (content published not for human readers but solely for the purpose of deceiving search engines). Search engines have thus made web search an “adversarial” exercise.

• Collecting meaningful feedback on which to base personalisation faces many challenges, both technical and ethical/legal. There are many technical barriers including not enough access to users' information, and the storage and processing capacity needed to utilise what is available. On the other hand, users find the collection of such information an invasion of privacy and try to circumvent its collection where they can; to some extent collecting such information has also become, or is currently becoming, illegal in many jurisdictions.

Much more serious though are some shortcomings on a higher level. The first of these comes from two aspects of modern search engines: that they limit the number of search results shown to users, and that their ranking algorithms are fixed (with regard to the particular user) so that they always rank search results the same way for the same user. The consequence of this is that some documents will never be shown to this user because the search engine ranks them low in “importance”, even though the user submits search queries describing these documents correctly. In other words, even using the correct queries, users cannot find some information which is in fact relevant to these queries. What is particularly troubling in this case is that users are not aware of this limitation or the ranking factors leading to it: they remain under the impression that their search is comprehensive when in fact it is not.

An even more worrying aspect is the so-called "filter bubble", which has only lately come to be noticed or discussed at all. By applying personalisation based on "implicit feedback", search engines limit the user's access to information. Since this happens "behind the scenes", the user is not aware of this limitation and believes that the search engine's results are representative of all the available information - which they increasingly are not, with the introduction of more and more personalised search results. Since a human's personality is shaped by experience, and since this experience is increasingly shaped by search engine algorithms, we are entering a world where these algorithms are shaping the personality of most people on the planet without them even being aware of this. However, no development at any of the dominant search providers has tried to inform users about these limitations, give them control of their "profiling" or allow them to influence the workings of ranking algorithms.


6.2 Proposed Solution

In this work we propose the Web Exploration Engine as an alternative information-finding model for web data. Our model is inherently personalised; it is based on a distributed system where every user's cognitive background is modelled by a personal agent, which then communicates the relevant part of that model to the exploration engine. Unlike other proposed personal solutions though, our system does not rely on an existing search engine but proposes to create a new one (or modify an existing one) which accommodates such personalised search requests. Our system also allows the user full control over the search process by allowing him to explicitly define the search context; where modern search providers try to guess algorithmically (by implicit feedback) a search query's context or the user's personal context, our model offers a large number of pre-computed contexts from which the user can easily choose with a click, then elaborate further by supplying relevance feedback. This model allows the user to not only deeply research a topic, but also to easily expand the research into related topics and discover new information. Our approach to some major aspects of information finding differs from the ones on which the dominant search providers are based:

• Where current solutions attempt to find the context without asking the user for assistance (an approach with severe limitations and open to many errors), we explicitly invite the user to supply both the current search context and the general "cognitive context", in a way that preserves privacy (a sketch of what such a context-carrying request might look like follows this list).

• In the current state of the Web, "information finding" has become synonymous with "search". However, information finding is more than search: search serves only the "information locating" part of the information-finding need; information discovery is a separate task not currently addressed by any major provider. We propose a giant web directory where all web documents have been pre-classified into a large number of categories; users can explore them and discover information, or search within a particular category and locate information. The selection of this category allows the user to refine the search context with arbitrary granularity.

• Users are generally limited in their interaction with a search engine by the "keyword search" paradigm: they are supposed to describe a document they seek (and presumably have no knowledge of) with words that the document contains. In contrast to this, our model allows users to explore the web without any keywords, by browsing or by guiding the exploration engine with relevance feedback alone.

• Our result rankings are not based on graph analysis or other data external to the ranked resources. This, together with the fact that we calculate relevance locally, within the context of each category, will make the generation of digital garbage unprofitable if our model becomes widespread.


• By allowing users to simply browse into any (arbitrarily deep) topic, or to conduct a guided search where "guidance" is both explicit and visible to the user, we eliminate the "filter bubble" limiting what users can find.
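For illustration only, the following minimal Python sketch shows what such a context-carrying search request, assembled by the personal agent and sent to the exploration engine, might contain. The structure and all field names are hypothetical and do not represent the prototype's actual interface; they merely make the idea of a weighted term vector plus an explicitly selected category concrete.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ExplorationRequest:
    """A hypothetical search request carrying explicit, user-supplied context."""
    # Weighted term vector of arbitrary length, re-weighted by the personal
    # agent to reflect the part of the user's cognitive context relevant now.
    terms: Dict[str, float]
    # Category path explicitly selected by the user in the directory;
    # an empty list would mean "search in the general context".
    category_path: List[str] = field(default_factory=list)
    # Optional document-level relevance feedback from the current session.
    liked_docs: List[str] = field(default_factory=list)
    disliked_docs: List[str] = field(default_factory=list)

request = ExplorationRequest(
    terms={"ontology": 0.9, "matching": 0.6, "alignment": 0.4},
    category_path=["Computers", "Artificial Intelligence", "Knowledge Representation"],
    liked_docs=["doc:4711"],
)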

Our model has several advantages over existing search engine and web directory models:

• The only ranking parameter the system uses (the ratio between α, β and γ in feedback weighting) is user-adjustable, so users have full control: we make no assumptions on their behalf (a sketch of this feedback weighting is given after this list).

• Our Exploration Engine handles the user's search in a session-based manner, where the search is developed over time in small increments; this allows better control on the part of the user, and supplies the "information scent" which current systems notably lack.

• The dynamic nature of our classifier automatically takes care of changes in classification (where editors move documents between categories), as well as Bayesian poisoning and other noise.

• The independent nature of the individual classifiers in our hierarchical classification tree allows a relatively easy implementation of a distributed system.

• Some of the features of our method facilitate more effective human editing by pointing the editors to the most problematic entries requiring attention.

• Heuristic pattern-based classification adds orders of magnitude more content to the categories, which is also used as training data for automated classification.
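The feedback weighting referred to in the first advantage above is, in spirit, the classic Rocchio scheme [Rocchio, 1971], with the α : β : γ ratio left under the user's control. The following Python sketch is a simplified illustration under the assumption of plain bag-of-words term vectors; the default values of α, β and γ are conventional placeholders, not the prototype's actual formula.

from collections import defaultdict
from typing import Dict, Iterable

def rocchio_update(query: Dict[str, float],
                   relevant: Iterable[Dict[str, float]],
                   non_relevant: Iterable[Dict[str, float]],
                   alpha: float = 1.0,
                   beta: float = 0.75,
                   gamma: float = 0.15) -> Dict[str, float]:
    """Rocchio-style feedback: the user-adjustable alpha/beta/gamma ratio
    controls how strongly explicit feedback reshapes the running query."""
    relevant, non_relevant = list(relevant), list(non_relevant)
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Negative weights are dropped so the query remains a positive term vector.
    return {t: w for t, w in updated.items() if w > 0}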

We believe that the introduction of our classification method for dynamic data [Kalinov et al., 2010a], and our approach to information locating and information discovery as a whole [Kalinov et al., 2010b], make the revival of web directories practical and will allow them to easily grow into Web Exploration Engines.

6.3 Summary of Contributions

We addressed several important aspects of modern search engines, web directories and personalisation solutions, where we believe our proposal offers some advantages over the current state of the art. Some of the contributions were developed as answers to the research sub-questions we formulated initially, while others arose incidentally in the process of finding those answers.

1. Most personal assistant agents and similar solutions accept a search engine "as is": they send requests to it as if they were a human who can only type a relatively small and flat list of keywords. On the other hand, this is also the only type of


search request that search engines currently accept. We introduce infrastructure on the server side which accepts detailed requests from users (or their personalisation agents) in the form of weighted term vectors of arbitrary length; this allows far greater expressiveness of the query: it can now express not just a simple search but a detailed research query, carrying with it some information about the user who issued it and the context in which it was issued. This is one part of the answer to our research sub-questions number 1 and 2.

2. Our system enables end-users to create viable models of their own general cognitive context (their personal "knowledge base") by ordering documents they have accessed into an arbitrary number of specific contexts; users build their own ontologies, according to their own views, as opposed to trying to fit the user's model into a pre-built ontology, as current state-of-the-art systems do. Our system then uses this knowledge in a meaningful manner, by transmitting to the server at search time only the part of that context which is relevant to the current search (outgoing search queries are re-weighted to reflect this context). This is the other part of the answer to our research sub-questions number 1 and 2.

3. We propose a simple ontology matching scheme where the arbitrary ontology on the user side of our system is dynamically matched to the static ontology on the server side. Users receive a path through the server-side hierarchical structure, so they can select the granularity of response they prefer (i.e. how deep in the hierarchy they want to go). This dynamic matching corrects for the impact of the user's cognitive bias: even if users have a different perception of how the resources they search for should be classified, the system will still guide them along a path to these resources (although possibly a different path from the one they expected).

4. Users of our system do not have to issue queries at all. Instead, they can browse a pre-ordered hierarchy of topics and discover information they did not know existed, and so would not have searched for and found at a traditional search provider. This could be beneficial for everyday users, but most of all for "knowledge workers", and could be a significant improvement in the field of e-Research. This is the answer to our research sub-question number 3.

5. When issuing a search query, users receive not only results relevant to the query but also some suggestions from topics similar to the one they searched for. This provides an integrated recommendation feature as part of our Web Exploration Engine. This is the answer to our research sub-question number 4.

6. Unlike the dominant search engine model, in our model all web documents are reachable, both by simple browsing and by focused search. This is the answer to our research sub-question number 5.


7. In the process of developing an automated classifier for dynamically changing data, we modified the Multinomial Naïve Bayes classifier and introduced a negative feedback loop into it, thus changing it fundamentally. Our modification, MNB-SPDA, exhibits better overall accuracy, as well as a smoother spread of errors between classes (it is equally reliable for all classes, unlike the original MNB). A sketch of the unmodified MNB baseline is given after this list.

8. We also developed a fast classifier based on a discretised class centroid, which is not as accurate as MNB-SPDA but is three times faster, so it is preferable in situations where users value speed over accuracy.

9. Our approach of using an automatically pre-built hierarchy of topics not only allows browsing this hierarchy, but also enables a unique feature of our system: the "floating query". The system re-evaluates the user's query at every point of the topical hierarchy using local context values, and re-orders the query terms by importance accordingly (a sketch of this category-local re-weighting follows this list). A similar approach applies to document ranking in search results: it is also category- (i.e. context-) specific, so the same document is ranked differently against the same keywords depending on the current search context. In effect, users of our system have access to a large number of nested specialised search engines. This partially contributes to the answer to our research sub-question number 5, in the context of a keyword-search use case.

10. Users in our model can assist the search with explicit, document-level positive or negative feedback, allowing much better control over the process. When the user clicks a button next to a displayed result, all the (weighted) keywords contained in that document are added to the query: a one-click query expansion adding the user's feedback to his current search session. This is the answer to our research sub-question number 6.

11. Our implementation enables users to develop a search, over a period of time, into a personalised deep research, which they can re-visit at a later point in time and enhance by adding their own content (users can add their own snippets of data to search results, even if the exploration engine lacks this data or does not consider it relevant). This is the answer to our research sub-question number 7.

12. The browser add-on we envisage as the user-side part of our distributed system is also a useful tool in its own right, recording the user's browsing history into a structured, searchable database. This can become a viable alternative to current "bookmark" browser features and a true "personal knowledge base management tool". This is the answer to our research sub-question number 8.

13. In making our system distributed and moving the user profiling part of it to the user’s computer, we address (or completely side-step) the issues of privacy and legal


compliance which personalisation on the side of the search provider involves. In our case all personal data remains with the user, as does its processing. This is the answer to our research sub-question number 9. Incidentally, this is also a major technical advantage for the server-side part of our system, since it shifts massive storage and computing demands away from it.

14. Unlike the dominant graph-analysis methods, our result ranking relies on features of the document itself and not on external factors. This renders ineffective most of the arsenal of the search engine optimisation industry: "keyword bombing" and "link farming" become irrelevant, while "keyword stacking" is counter-productive, as adding more keywords to a document dilutes the TF-IDF values of all of these keywords, so the document will not rank highly for any of them. Furthermore, any optimisation will work in one category only, as every category has its own IDF values; a document cannot be "optimised" for one category without harming its ranking in all the (millions of) others: we make the bad guys' work millions of times more complicated. If the model becomes widespread, it may deter further creation of "digital garbage". This is the answer to our research sub-question number 10.

15. From an ethical point of view, modern personalised search solutions using implicit feedback have a huge problem in that they tend to create a “filter bubble”: their users are unwittingly limited in the information they receive because the search engine decides what is suitable for them. In the long run, instead of the user shaping the personal solution, the opposite will happen: by deciding what the person can learn, the machine will shape his personality. Our model does not have this problem. In it no documents are hidden from the user, no matter what the system thinks about their “suitability”; documents may be re-ordered due to explicit or implied preference, but all of them are reachable, even those deemed not suitable.
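For reference, the Multinomial Naïve Bayes classifier on which contribution 7 builds is sketched below in minimal Python, with log-priors and Laplace-smoothed term likelihoods. This is only the unmodified textbook baseline; MNB-SPDA itself, with its negative feedback loop, is not reproduced here.

import math
from collections import defaultdict
from typing import Dict

class MultinomialNB:
    """Textbook Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    def __init__(self) -> None:
        self.class_doc_counts: Dict[str, int] = defaultdict(int)
        self.term_counts = defaultdict(lambda: defaultdict(int))
        self.class_total_terms: Dict[str, int] = defaultdict(int)
        self.vocabulary = set()
        self.total_docs = 0

    def train(self, doc: Dict[str, int], label: str) -> None:
        """doc is a bag of words: term -> count."""
        self.total_docs += 1
        self.class_doc_counts[label] += 1
        for term, count in doc.items():
            self.term_counts[label][term] += count
            self.class_total_terms[label] += count
            self.vocabulary.add(term)

    def classify(self, doc: Dict[str, int]) -> str:
        def log_score(label: str) -> float:
            score = math.log(self.class_doc_counts[label] / self.total_docs)
            denom = self.class_total_terms[label] + len(self.vocabulary)
            for term, count in doc.items():
                score += count * math.log((self.term_counts[label][term] + 1) / denom)
            return score
        return max(self.class_doc_counts, key=log_score)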
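The "floating query" of contribution 9 can be illustrated by category-local IDF re-weighting: the same query terms receive different weights (and a different order of importance) in different branches of the topic hierarchy, because each category keeps its own document-frequency statistics. The sketch below is a simplification under that assumption, with hypothetical example values, and does not claim to match the prototype's exact weighting.

import math
from typing import Dict, List, Tuple

def reweight_query(terms: Dict[str, float],
                   category_df: Dict[str, int],
                   category_size: int) -> List[Tuple[str, float]]:
    """Re-weight query terms with the IDF of the current category, so the
    query 'floats': its term ordering changes from category to category."""
    weighted = []
    for term, base_weight in terms.items():
        df = category_df.get(term, 0)
        idf = math.log((category_size + 1) / (df + 1)) + 1.0  # smoothed IDF
        weighted.append((term, base_weight * idf))
    # Terms that are rare within this category float to the top of the query.
    return sorted(weighted, key=lambda tw: tw[1], reverse=True)

# Hypothetical local statistics for a "Machine Learning" category:
local_df = {"bayes": 40, "classifier": 300, "poisoning": 3}
print(reweight_query({"bayes": 1.0, "classifier": 1.0, "poisoning": 1.0},
                     local_df, category_size=1000))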

6.4 Future Work

Our proposed Web Exploration Engine needs much further work before it becomes practical at a "real world" scale. It needs to be significantly improved in several aspects.

From a technical perspective, the existing prototype needs to address its current scalability issues and be expanded to accommodate billions of documents. Classification should also be optimised for faster computation. The current classification algorithm, while practical and usable, may not be optimal, so other options should also be investigated; these should include the possibility of "boosted learning", where the algorithm learns not only from human-labelled data but also from data which was automatically labelled with high classifier confidence (a minimal sketch of this idea appears after the list of usability enhancements below). A future system will need to incorporate a commercial-grade web spider and to improve the filtering of HTML documents into meaningful text for

indexing/categorisation, in terms of better filtering of common site elements and better vector representations of the text (improving tokenizer rules). Keyword search needs to be improved by the detection and removal of some "outlier" keywords from documents, which currently have a negative impact, as mentioned in our discussion of indexing.

From an editorial point of view, it has to be recognised that the system requires a significant amount of human labour to provide the classification labels on which the algorithm can train. Even though the system assists editors and makes their job much easier compared to other existing web directories, the cost of acquiring information is still too high (especially when compared to search engines). In future, a collaborative filter may be implemented to allow end-users to assist with classification; if properly implemented, it may completely solve the current scarcity of labelled data.

From a general point of view, we also have to recognise the ambiguity of some data, which we have currently chosen to ignore. Where a document might reasonably be said to belong to two or more categories to some extent, our system currently places it in one category only; if we have two or more labels for it (as is the case with some documents indexed by the Open Directory), we treat them equally. In future, the system should allow fuzzy membership of documents in categories, with a floating-point value, so that a document may belong, for example, 46% to category A, 32% to category B and 22% to category C. This has many implications which will have to be addressed in a major re-working of the classification algorithm and the hierarchical classifier structure.

The system also needs to be tested with real-world users in a real-world scenario (with a realistic database covering the whole web), which was outside the scope of this work due to the magnitude of the task and its implications. Some usability enhancements which can be added in the future include:

• Better visualisation of the category browsing model, including dynamic (AJAX-based) display and re-ordering of search results.

• Roaming user profiles which users will be able to use across different browsers or devices.

• The personal agent part of the system should be expanded to become an integrated personal information management tool/portal organising all the digital information belonging to the user (not only his web browsing history but also other digital texts, media, communications, contacts etc.).
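The "boosted learning" mentioned in the technical outlook above is essentially a self-training loop. The minimal Python sketch below assumes a classifier exposing a train method (as in the Multinomial Naive Bayes sketch in Section 6.3) together with some externally supplied confidence estimate; the 0.95 threshold is an arbitrary illustrative value, not one determined in this work.

from typing import Callable, Dict, List, Tuple

def self_training_round(classifier,
                        labelled: List[Tuple[Dict[str, int], str]],
                        unlabelled: List[Dict[str, int]],
                        confidence: Callable[[object, Dict[str, int]], Tuple[str, float]],
                        threshold: float = 0.95):
    """One round of 'boosted learning': documents the classifier labels with
    high confidence are promoted into the training set; the rest are left
    for human editors."""
    for doc, label in labelled:
        classifier.train(doc, label)
    still_unlabelled = []
    for doc in unlabelled:
        label, conf = confidence(classifier, doc)
        if conf >= threshold:
            classifier.train(doc, label)      # accept as pseudo-labelled data
        else:
            still_unlabelled.append(doc)      # defer to human editors
    return classifier, still_unlabelled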

From a commercial perspective, the system needs to be integrated with an existing major search engine, so as to use the infrastructure which is already in place and the data which has already been collected. We believe this has the potential to revolutionise the way people find information on the web.

Bibliography

Allan, J. (1996). Incremental Relevance Feedback for Information Filtering. In Proceedings of the 19th International Conference on Research and Development in Information Retrieval (SIGIR 96), pages 270–278. ACM Press. 72

Alpaydin, E. (2004). Introduction to Machine Learning (Adaptive Computation and Machine Learning). The MIT Press. 86

Anagnostopoulos, A., Broder, A. Z., and Punera, K. (2006). Effective and Efficient Classification on a Search-Engine Model. In Proceedings of the International Conference on Information and Knowledge Management, pages 208–217. ACM Press. 90, 125

Arnaud, D. (2004). Study of a Document Classification Framework Using Self-Organizing Maps. Technical report, Technological Educational Institute of Thessalonica. 103

Arrington, M. (2009). The End Of Hand Crafted Content. Tech Crunch. 48

Attardi, G., Gullí, A., Dato, D., and Tani, C. (1999). Towards Automated Categorization and Abstracting of Web Sites. http://www.di.unipi.it/~gulli/papers/submitted/automated_classification.htm. 90

Aula, A., Jhaveri, N., and Käki, M. (2005). Information Search and Re-access Strategies of Experienced Web Users. In Proceedings of the 14th international conference on World Wide Web, pages 583–592, New York, NY, USA. ACM Press. 43

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press Books, Harlow. 80

Bar-Yossef, Z., Broder, A. Z., Kumar, R., and Tomkins, A. (2004). Sic transit gloria telae: Towards an understanding of the web’s decay. In Proceedings of the 13th conference on World Wide Web, pages 328–337, New York, NY, USA. 15, 30

Bar-Yossef, Z., Keidar, I., and Schonfeld, U. (2007). Do Not Crawl in the DUST: Different URLs with Similar Text. In Proceedings of the 16th International World Wide Web Conference (WWW2007). 9, 12, 16, 48


Barbosa, L. and Freire, J. (2007). An Adaptive Crawler for Locating Hidden-Web Entry Points. In Proceedings of the 16th International World Wide Web Conference (WWW2007). 16, 25

Berger, H. and Merkl, D. (2005). A Comparison of Support Vector Machines and Self-Organizing Maps for e-Mail Categorization. In Simoff, S., Williams, G., Galloway, J., and Kolyshkina, I., editors, Proceedings of the 4th Australasian Data Mining Conference, pages 189–204. University of Technology Sydney. 103

Berkhin, P. (2002). Survey Of Clustering Data Mining Techniques. Technical report, Accrue Software, San Jose, CA. 108

Berners-Lee, T. (1989). Information Management: A Proposal. Technical report, CERN, Geneva, Switzerland. 8

Birukov, A., Blanzieri, E., and Giorgini, P. (2005). Implicit: An Agent-based Recommendation System for Web Search. AAMAS, pages 618–624. 42, 75

Blei, D. M., Jordan, M. I., and Ng, A. Y. (2003a). Hierarchical Bayesian Models for Applications in Information Retrieval. Bayesian Statistics, 7:25–44. 84

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003b). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022. 84

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117. 14

Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2):3–10. 13

Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., and Zien, J. (2003). Efficient Query Evaluation using a Two-Level Retrieval Process. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM ’03, pages 426–434, New York, NY, USA. ACM. 23

Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. N. (2006). Learning to Rank Using Gradient Descent. In Raedt, L. D. and Wrobel, S., editors, ICML, pages 89–96. ACM. 17

Buscher, G., Dengel, A., and van Elst, L. (2008). Query Expansion Using Gaze-Based Feedback on the Subdocument Level. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 387–394, New York, NY, USA. ACM. 76

Bush, V. (1945). As We May Think. Atlantic Monthly. 12

Business Insider (2011). Leaked: AOL's Master Plan. http://www.businessinsider.com/the-aol-way/. 48


Cao, H., Jiang, D., Pei, J., Chen, E., and Li, H. (2009). Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs. In Proceedings of the 18th International World Wide Web Conference (WWW2009), pages 191–200, New York, NY, USA. ACM. 18

Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P. (1998). Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies. The VLDB Journal, 7(3):163–178. 90

Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: a Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, OSDI '06, Berkeley, CA, USA. USENIX Association. 36

Charikar, M. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388. 16

Chen, L. and Sycara, K. (1998). WebMate: A Personal Agent for Browsing and Searching. In Sycara, K. P. and Wooldridge, M., editors, Proceedings of the 2nd International Conference on Autonomous Agents, pages 132–139, New York. ACM Press. 39, 44

Chierichetti, F., Kumar, R., and Raghavan, P. (2009). Compressed Web Indexes. In Proceedings of the 18th International World Wide Web Conference (WWW2009), pages 451–451. 92

Chirita, P.-A., Firan, C. S., and Nejdl, W. (2007). Personalized Query Expansion for the Web. In Proceedings of the 30th annual international ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR '07, pages 7–14, New York, NY, USA. ACM. 38, 77

Cho, J. and Garcia-Molina, H. (2000). The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 200–209, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. 14

Cho, J. and Roy, S. (2004). Impact of Search Engines on Page Popularity. In Proceedings of the 13th International Conference on World Wide Web, pages 20–29, New York, NY, USA. ACM. 49

Chuang, Y.-L. and Wu, L.-L. (2007). User-Based Evaluations of Search Engines: Hygiene Factors and Motivation Factors. In 40th Hawaii International Conference on Systems Science (HICSS-40 2007), page 82. IEEE Computer Society. 139


Cilibrasi, R. and Vitanyi, P. M. B. (2007). The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, 19:370. 85

Cutting, D. R., Pedersen, J. O., Karger, D., and Tukey, J. W. (1992). Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318–329. 87

Daoud, M., Tamine, L., and Boughanem, M. (2010). A Personalized Graph-Based Document Ranking Model Using a Semantic User Profile. In Proceedings of the Conference on User Modeling, Adaptation and Personalization (UMAP), pages 171–182. Springer. 77, 78

Das, A., Datar, M., Garg, A., and Rajaram, S. (2007). Google News Personalization: Scalable Online Collaborative Filtering. In Proceedings of the 16th International World Wide Web Conference (WWW2007). 42, 50, 75, 96

Dean, J. and Ghemawat, S. (2010). System and Method for Efficient Large-scale Data Processing. 81

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 41(6):391–407. 16, 23, 38, 46, 82, 83

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39(1):1–38. 111

Dervin, B. (1997). Given a Context by any Other Name: Methodological Tools for Taming the Unruly Beast. In Proceedings of the International Conference on Information Seeking in Context, pages 13–38, London, UK. Taylor Graham Publishing. 34

Ding, C. and He, X. (2004). k-means Clustering via Principal Component Analysis. In Proceedings of the twenty-first international conference on Machine learning, New York, NY, USA. ACM. 87

Domingos, P. and Pazzani, M. J. (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning, 29(2-3):103–130. 92

Edelman, B. (2003). Domains Reregistered for Distribution of Unrelated Content: A Case Study of 'Tina's Free Live Webcam'. http://cyber.law.harvard.edu/people/edelman/renewals/. 12, 15, 48

Enkhsaikhan, M., Wong, W., Liu, W., and Reynolds, M. (2007). Measuring Data-Driven Ontology Changes using Text Mining. In Christen, P., Kennedy, P. J., Li, J., Kolyshkina, I., and Williams, G. J., editors, Sixth Australasian Data Mining Conference (AusDM 2007), volume 70 of CRPIT, pages 39–46, Gold Coast, Australia. ACS. 115


Etzioni, O. (1996). Moving Up the Information Food Chain: Deploying Softbots on the World Wide Web. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference, pages 1322–1326, Menlo Park. AAAI Press / MIT Press. 44

Fetterly, D. (2007). Adversarial Information Retrieval: The Manipulation of Web Content. ACM Computing Reviews. 49

Fetterly, D., Manasse, M., Najork, M., and Wiener, J. (2003). A large-scale Study of the Evolution of Web Pages. In Proceedings of the 12th international conference on World Wide Web (WWW2003), WWW2003, pages 669–678, New York, NY, USA. ACM. 12, 15, 48

Frank, E. and Bouckaert, R. R. (2006). Naïve Bayes for Text Classification with Unbalanced Classes. In Proc 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, pages 503–510. Springer. 92, 126, 127, 135

Gao, J., Fan, W., Han, J., and Yu, P. S. (2007). A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions. In Proceedings of the Seventh SIAM International Conference on Data Mining - SDM'07. SIAM. 115

Gauch, S. (2010). Keynote Talk: A Rock is not a Rock. Proceedings of the 9th International Conference on Adaptivity, Personalization and Fusion of Heterogeneous Information - RIAO 2010. 78

Glen Jeh and Jennifer Widom (2002). Scaling Personalized Web Search. In Proceedings of the 12th International World Wide Web Conference, pages 271–279. ACM Press. 36

Google (2004). Systems and Methods for Associating a Keyword with a User Interface Area. United States Patent 7,272,601. 38

Google (2011a). How does Google Collect and Rank Results? http://www.google.com/librariancenter/articles/0512_01.html. 81

Google (2011b). Microsoft's Bing uses Google search results and denies it. http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-google-search.html. 4

Gord Hotchkiss (2007). Marissa Mayer Interview on Personalization. http://www.outofmygord.com/archive/2007/02/23/Marissa-Mayer-Interview-on-Personalization.aspx. 25, 35

Gordon-Murnane, L. (2006). Social Bookmarking, Folksonomies, and Web 2.0 Tools. Searcher Mag Database Prof, 14(6):26–38. 16


Goren-Bar, D., Kuflik, T., and Lev, D. (2000). Supervised Learning for Automatic Classification of Documents using Self-Organizing Maps. In DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries. 88

Gorrell, G. (2006). Generalized Hebbian Algorithm for Dimensionality Reduction in Natural Language Processing. PhD thesis. 85

Graham-Cumming, J. (2006). Does Bayesian Poisoning Exist? http://www.virusbtn.com/spambulletin/archive/2006/02/sb200602-poison. 120

Delphi Group (2004). Information Intelligence: Intelligent Classification and the Enterprise Taxonomy Practice. Technical report, Delphi Group. 21, 29

Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., and Pedersen, J. (2006). Link Spam Detection Based on Mass Estimation. In Proceedings of the 32nd international conference on Very Large Data Bases (VLDB 06), pages 439–450. VLDB Endowment. 47

Gyöngyi, Z. and Garcia-Molina, H. (2005). Web Spam Taxonomy. 47, 49

Haveliwala, T. H. (2002). Topic-Sensitive PageRank. In Proceedings of the 11th International World Wide Web Conference, pages 517–526. 36

Hinton, G. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507. 85, 109

Hofmann, T. (1999). Probabilistic latent semantic indexing. In SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57, New York, NY, USA. ACM. 84

Hosseini, M. and Abolhassani, H. (2007). Hierarchical Co-clustering for Web Queries and Selected URLs. In Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., and Godart, C., editors, Proceedings of the 8th International Conference on Web Information Systems Engineering - WISE 2007, volume 4831 of Lecture Notes in Computer Science, pages 653–662. Springer. 84

ICANN (2010). New gTLD Program. http://www.icann.org/en/topics/new-gtld-program.htm. 11

International Telecommunication Union (2008). ICT Indicators Database / Market Information and Statistics. http://www.itu.int/ITU-D/ict/statistics/. 3

Kalinov, P., Stantic, B., and Sattar, A. (2010a). Building a Dynamic Classifier for Large Text Data Collections. In Shen, H. T. and Bouguettaya, A., editors, Proceedings of the 21st Australasian Database Conference, volume 104 of CRPIT, pages 113–122. Australian Computer Society. i, 144


Kalinov, P., Stantic, B., and Sattar, A. (2010b). Let’s Trust Users - It is Their Search. In Huang, J. X., King, I., Raghavan, V. V., and Rueger, S., editors, Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, volume 1 of WI-IAT 2010, pages 176–179, Los Alamitos, CA, USA. IEEE Computer Society. i, 144

Kamvar, S. D., Haveliwala, T. H., and Golub, G. H. (2003). Adaptive Methods for the Computation of PageRank. Technical report. 35

Kang, B. H., Kim, Y. S., and Choi, Y. J. (2007). Does Multi-user Document Classification Really Help Knowledge Management? In Orgun, M. A. and Thornton, J., editors, 20th Australian Joint Conference on Artificial Intelligence, volume 4830 of Lecture Notes in Computer Science, pages 327–336. Springer. 31

Kantardzic, M. (2002). Data Mining: Concepts, Models, Methods, and Algorithms. Wiley-IEEE Press. 79, 87

Kaski, S. (1998). Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering. In IJCNN98 International Joint Conference on Neural Networks, volume 1, pages 413–418. Piscataway, NJ: IEEE. 83

Katakis, I., Tsoumakas, G., and Vlahavas, I. (2005). On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams. Advances in Informatics, pages 338–348. 115

Kleinberg, J. M. (1999). Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46(5):604–632. 17, 72

Kobsa, A. (2007). Privacy-Enhanced Web Personalization. In Brusilovsky, P., Kobsa, A., and Nejdl, W., editors, The Adaptive Web, Methods and Strategies of Web Personalization, volume 4321 of Lecture Notes in Computer Science, pages 628–670. Springer. 43

Kohonen, T. (1995). Self-Organizing Maps. Springer Verlag, Berlin. 87

Kohonen, T. (2001). Self-Organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ, USA. 111

Kohonen, T., Hynninen, J., Kangas, J., and Laaksonen, J. (1996). SOM PAK: The Self-Organizing Map program package. Technical report, Helsinki University of Technology, Laboratory of Computer and Information Science. 87, 89, 110

Koikkalainen, P. and Oja, E. (1990). Self-Organizing Hierarchical Feature Maps. In International Joint Conference on Neural Networks, volume 2, pages 279–285, Piscataway, NJ, USA. IEEE Service Center. 89


Kusumoto, H. and Takefuji, Y. (2006). O(log2M) Self-Organizing Map Algorithm Without Learning of Neighborhood Vectors. IEEE Transactions on Neural Networks, 17(6):1656–1661. 89

Lagus, K., Kaski, S., and Kohonen, T. (2004). Mining Massive Document Collections by the WEBSOM Method. Inf. Sci., 163(1-3):135–156. 108

Lang, K. (1995). NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331–339. Morgan Kaufmann Publishers Inc.: San Mateo, CA, USA. 82, 83, 90

Lee, U., Liu, Z., and Cho, J. (2005). Automatic identification of user goals in Web search. In Proceedings of the 14th international conference on World Wide Web, pages 391–400. ACM Press. 49, 50, 75

Liang, W., Yiping, G., and Ming, F. (2004). Internet Search Engine Evolution: the DRIS System. Communications Magazine, IEEE, 42(11):30–30. 12

Lieberman, H. (1995). Letizia: An Agent That Assists Web Browsing. In Mellish, C. S., editor, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), pages 924–929, Montreal, Quebec, Canada. Morgan Kaufmann Publishers Inc.: San Mateo, CA, USA. 37, 39, 77

Lieberman, H., Fry, C., and Weitzman, L. (2001). Why surf alone? Exploring the Web with Reconnaissance Agents. Communications of the ACM, 44(8):69–75. 20, 44

Liu, F., Yu, C., and Meng, W. (2002). Personalized Web Search by Mapping User Queries to Categories. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 558–565, New York, NY, USA. ACM. 75

Liu, F., Yu, C., and Meng, W. (2004). Personalized Web Search For Improving Retrieval Effectiveness. IEEE Transactions on Knowledge and Data Engineering, 16:28–40. 40, 61, 77, 78

Liu, T., Yang, Y., Wan, H., Zeng, H., Chen, Z., and Ma, W. (2005). Support Vector Machines Classification with a Very Large-Scale Taxonomy. SIGKDD Explorations, 7:2005. 90, 118, 119

LiveInternet (2010). Search Engine Stats - Russian Internet. http://www.liveinternet.ru/stat/ru/searches.html?slice=ru. 21

Ma, Z., Pant, G., and Sheng, O. R. L. (2007). Interest-Based Personalized Search. ACM Transactions on Information Systems, 25. 38, 39, 77, 78

MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California Press. 87


Manku, G. S., Jain, A., and Sarma, A. D. (2007). Detecting Near-Duplicates for Web Crawling. In Proceedings of the 16th International World Wide Web Conference (WWW2007). 12, 48

Matthijs, N. and Radlinski, F. (2011). Personalizing Web Search using Long Term Browsing History. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM 2011, pages 25–34, New York, NY, USA. ACM. 39

McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naïve Bayes Text Classification. In AAAI-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press. 92

MessageLabs - Symantec Corporation (2010). MessageLabs Intelligence Monthly Report, September 2010. 8

metamend (2010). What Is The Google Dance? http://www.metamend.com/google-dance.html. 11

Minsky, M. (1974). A Framework for Representing Knowledge. MIT-AI Laboratory Memo 306. 78

Nakayama, K., Hara, T., and Nishio, S. (2007). Wikipedia Mining for an Association Web Thesaurus Construction. In Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., and Godart, C., editors, Proceedings of the 8th International Conference on Web Information Systems Engineering - WISE 2007, volume 4831 of Lecture Notes in Computer Science, pages 322–334. Springer. 85

Nielsen, J. (2004). When Search Engines Become Answer Engines. http://www.useit.com/alertbox/20040816.html. 50, 59

Nielsen, J. (2006). Search Engines as Leeches on the Web. http://www.useit.com/alertbox/search_engines.html. 24, 64

Nielsen, J. (2011). International Usability: Big Stuff the Same, Details Differ. http://www.useit.com/alertbox/international-sites.html. 37

Nielsen//NetRatings (2007). Free data and rankings. http://www.nielsen-online.com/resources.jsp?section=pr_netv. 50

Oja, M., Kaski, S., and Kohonen, T. (2001). Bibliography of Self-Organizing Map (SOM) Papers: 1998-2001 Addendum. http://www.cis.hut.fi/research/refs/. 108

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Digital Library Technologies Project. 13, 16

Pariser, E. (2011). The Filter Bubble: What the Internet Is Hiding from You. Penguin Group USA. 51, 64


Pasi, G. (2010). Keynote Talk: Issues on Preference-Modelling and Personalization in Information Retrieval. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 1:4. xiii, 32, 33, 41

Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in space. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 6(2):559–572. 83

Pirolli, P. and Card, S. (1995). Information Foraging in Information Access Environments. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’95, pages 51–58, New York, NY, USA. ACM Press/Addison-Wesley Publishing Co. 34, 59, 64

Qiu, F. and Cho, J. (2006). Automatic Identification of User Interest for Personalized Search. In Proceedings of the 15th international conference on World Wide Web (WWW2006), pages 727–736, New York, NY, USA. ACM Press. 50, 75

Rauber, A. (1999). LabelSOM: on the Labeling of Self-Organizing Maps. In Proceedings of the International Joint Conference on Neural Networks, volume 5, pages 3527–32, Piscataway, NJ. IEEE Service Center. 89

Rauber, A., Merkl, D., and Dittenbach, M. (2002). The Growing Hierarchical Self-Organizing Maps: Exploratory Analysis of High-Dimensional Data. In IEEE Transactions on Neural Networks, volume 13(6), pages 1331–1341. 109

Rennie, J. D. M., Shih, L., Teevan, J., and Karger, D. R. (2003). Tackling the Poor Assumptions of Naïve Bayes Text Classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, pages 616–623. 92, 121, 127

Richardson, M., Prakash, A., and Brill, E. (2006). Beyond PageRank: Machine Learning for Static Ranking. In Proceedings of the 15th international conference on World Wide Web (WWW2006), pages 707–715, New York, NY, USA. ACM Press. 17, 18

Robertson, S. (2004). Understanding inverse document frequency: On Theoretical Arguments for IDF. Journal of Documentation, 60(5):503–520. 16

Robertson, S. E. and Jones, S. K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146. 71

Rocchio, J. (1971). The SMART Retrieval System, chapter Relevance Feedback in Information Retrieval, pages 313–323. Prentice Hall, Englewood Cliffs, NJ. 71

Russell, D. M. and Grimes, C. (2007). Assigned Tasks are not the Same as Self-chosen Web Search Tasks. In 40th Hawaii International Conference on Systems Science (HICSS-40 2007), page 83. IEEE Computer Society. 139


Salakhutdinov, R. and Hinton, G. (2007). Semantic Hashing. Online. http://www.cs.utoronto.ca/~hinton/. 83, 85

Schön, D. (1983). The Reflective Practitioner: How Professionals Think in Action. Basic Books, New York. 94

Sieg, A., Mobasher, B., and Burke, R. (2007). Web Search Personalization with Ontological User Profiles. In Proceedings of the 16th ACM conference on Conference on Information and Knowledge Management, CIKM ’07, pages 525–534, New York, NY, USA. ACM. 37, 77, 78

Sieg, A., Mobasher, B., Lytinen, S., and Burke, R. (2003). Concept Based Query Enhancement in the ARCH Search Agent. In Proceedings of the International Conference on Internet Computing, pages 613–619. 67, 71

Speretta, M. and Gauch, S. (2009). Miology: A Web Application for Organizing Personal Domain Ontologies. In Proceedings of the 2009 International Conference on Information, Process, and Knowledge Management, pages 159–161, Washington, DC, USA. IEEE Computer Society. 78

Su, M.-C. and Chang, H.-T. (2000). Fast Self-Organizing Feature Map Algorithm. IEEE-NN, 11(3):721. 89

Suchanek, F. M., Kasneci, G., and Weikum, G. (2008). YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6:203–217. 78

Sun, J., Zeng, H., Liu, H., Lu, Y., and Chen, Z. (2005). CubeSVD: A novel approach to personalized web search. In Proceedings of the 14th international conference on World Wide Web, pages 652–662. ACM Press. 50

Ultsch, A. (2005). Clustering with SOM: U*C. In Proceedings of the Workshop on Self-Organizing Maps, pages 75–82, Paris, France. 87

Vivisimo Inc. (2006). Tagging vs. Clustering in Enterprise Search. http://vivisimo.com/html/download-tagging. 18, 38

Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University. 91

Witten, I. H., Gori, M., and Numerico, T. (2006). Web Dragons: Inside the Myths of Search Engine Technology. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. 24

Wong, W., Liu, W., and Bennamoun, M. (2006). Featureless Similarities for Terms Clustering Using Tree-Traversing Ants. In PCAR '06: Proceedings of the 2006 International Symposium on Practical Cognitive Agents and Robots, pages 177–191, New York, NY, USA. ACM. 85

Xue, G. R., Xing, D., Yang, Q., and Yu, Y. (2008). Deep Classification in Large-Scale Text Hierarchies. In Myaeng, S.-H., Oard, D. W., Sebastiani, F., Chua, T.-S., and Leong, M.-K., editors, Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 619–626, New York, NY, USA. ACM. 67, 90, 114

Yahoo! (2007). Why is your crawler asking for strange URLs that have never existed on my site? http://help.yahoo.com/l/uk/yahoo/search/webcrawler/slurp-10.html. 15

Yang, B. and Jeh, G. (2006). Retroactive Answering of Search Queries. In Proceedings of the 15th international conference on World Wide Web (WWW2006), pages 457–466, New York, NY, USA. ACM. 75

Yoshida, T., Nakamura, S., and Tanaka, K. (2007). WeBrowSearch: Toward Web Browser with Autonomous Search. In Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., and Godart, C., editors, Proceedings of the 8th International Conference on Web Information Systems Engineering, WISE’07, pages 135–146, Berlin, Heidelberg. Springer-Verlag. 21, 25, 38

Zawodny, J. (2003). PageRank is Dead. http://jeremy.zawodny.com/blog/archives/000751.html. 48

Zdziarski, J. A. (2005). Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. No Starch Press. 116, 117, 120, 130, 132

Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley. 80, 92
