Improving Search Engines Via Classification

Zheng Zhu
May 2011

A Dissertation Submitted to Birkbeck College, University of London in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Department of Computer Science & Information Systems
Birkbeck College
University of London

Declaration

This thesis is the result of my own work, except where explicitly acknowledged in the text.

Zheng Zhu

Abstract

In this dissertation, we study the problem of how search engines can be improved by making use of classification. Given a user query, traditional search engines output a list of results that are ranked according to their relevance to the query. However, the ranking is independent of the topic of the document, so results on different topics are not grouped together in the output of a search engine. This can be problematic, as the user must scroll through many irrelevant results until his/her information need is satisfied. This might arise when the user is a novice or has superficial knowledge about the domain of interest, but more typically it is due to the query being short and ambiguous.

One solution is to organise search results via categorization, in particular, classification. We designed a target testing experiment on a controlled data set, which showed that classification-based search could improve the user's search experience in terms of the number of results the user would have to inspect before satisfying his/her query.

In our investigation of classification to organise search results, we consider not only the classification of search results, but also query classification. In particular, we investigate the case where the enrichment of the training and test queries is asymmetrical. We also make use of a large search engine log to provide a comprehensive topic-specific analysis of search engine queries. Finally, we study the problem of ranking the classes using some new features derived from the class.
The contribution of this thesis is the investigation of classification-based search in terms of ranking the search results. This allows us to analyze the effect of a classifier's performance on a classification-based search engine along with different user interaction models. Moreover, it allows the evaluation of class-based ranking methods.

Publications

Publications relating to the thesis

1. Zheng Zhu, Ingemar J Cox, Mark Levene: "Ranked-Listed or Categorized Results in IR: 2 Is Better Than 1". The 13th International Conference on Natural Language and Information Systems, 2008, pages 111-123, London, United Kingdom - Chapter 3
2. Judit Bar-Ilan, Zheng Zhu and Mark Levene: "Topic-specific analysis of search queries". The 2009 workshop on Web Search Click Data (WSCD '09), pages 35-42, Barcelona, Spain. An extended version appeared as Judit Bar-Ilan, Zheng Zhu and Mark Levene: "Topical Analysis of Search Queries", in the 10th Bar-Ilan Symposium on the Foundations of Artificial Intelligence - Chapter 4
3. Zheng Zhu, Mark Levene, Ingemar J Cox: "Query classification using asymmetric learning". The 2nd International Conference on the Applications of Digital Information and Web Technologies, 2009, pages 518-524, London, United Kingdom - Chapter 4
4. Zheng Zhu, Mark Levene, Ingemar J Cox: "Ranking Classes of Search Engine Results". The International Conference on Knowledge Discovery and Information Retrieval, 2010, Valencia, Spain - Chapter 5

Acknowledgements

I am deeply grateful to my supervisors, Prof. Mark Levene and Prof. Ingemar J Cox, for their continued support, their patient guidance and their faith in my work throughout these years. Whenever I needed any help, their doors were always open. Many thanks are due to my colleagues at Birkbeck, University of London. Without their collaboration and discussions, I could not have finished our research work.
I would particularly like to thank Dell Zhang, Rajesh Pampapathi, Phil Gregg, Aisling Traynor and Simon Upton for their help and support. Dr. Iadh Ounis and Dr. Tassos Tombros gave me critical comments during the defense, which will make me think a lot more deeply about my work from here on in. Special thanks are due to Dr. Kai Yu and Sean Zhang. Their encouragement and suggestions guided me towards starting my Ph.D. at the very beginning. I would also like to thank Martyn Harris for the final proofreading. Last but not least, I would like to express my gratitude and love to my wife, Lei Zhu, and my parents, YongGang Zhu and XiangYa Zheng. Their enthusiasm, constant encouragement and patience have accompanied me through this period and every other period. I would never have finished this work without their support.

Contents

1 Introduction
  1.1 Current State of the Art in Search
  1.2 What is Lacking in Search
  1.3 Classification as a Way to Organise Search Results
  1.4 Outline of the Thesis
2 Background
  2.1 Web Information Retrieval
    2.1.1 Retrieval Models
    2.1.2 Evaluation Metric
    2.1.3 Text Analysis Process
  2.2 Machine Learning
    2.2.1 Unsupervised Learning
    2.2.2 Supervised Learning
  2.3 Web Content Clustering and Classification
    2.3.1 Web Content Clustering
    2.3.2 Web Content Classification
    2.3.3 Search Results Classification
3 A Model for Classification-Based Search Engines
  3.1 Problem Description
  3.2 Classification-Based Search Engines
    3.2.1 Class Rank
    3.2.2 Document Rank
    3.2.3 Scrolled-Class Rank (SCR)
    3.2.4 In-Class Rank (ICR)
    3.2.5 Out-Class/Scrolled-Class Rank (OSCR) and Out-Class/Revert Rank (ORR)
    3.2.6 Classification
  3.3 Experimental Methodology
    3.3.1 Target Testing
    3.3.2 Automatic query generation
    3.3.3 User/Machine models
  3.4 Experiments
    3.4.1 Experimental results
  3.5 Discussion
4 Query Classification
  4.1 Introduction
  4.2 Related Work
  4.3 Query Enrichment
    4.3.1 Pseudo-Relevance Feedback
    4.3.2 Term Co-occurrence from Secondary Sources
  4.4 The SVM Query Classifier
    4.4.1 Data representation
    4.4.2 Multi-label, Multi-class Classification
    4.4.3 Evaluation Criteria
  4.5 Experiments
    4.5.1 Experimental Data Set
    4.5.2 Experimental Setting
    4.5.3 Experimental Results
  4.6 Topic Analysis of Search Queries
    4.6.1 Results
  4.7 Discussion
5 Ranking Classes of Search Engine Results
  5.1 Introduction
    5.1.1 Related Work
  5.2 Class-Based Ranking Method
    5.2.1 Query-Dependent Rank
    5.2.2 Query-Independent Rank
    5.2.3 Additional Class Ranking Models
  5.3 Experiments
    5.3.1 Experimental Setting
    5.3.2 Experimental Methodology
    5.3.3 Experimental Results
  5.4 Learning to Rank
  5.5 A Proof of Concept for a Classification-Based Search Engine
    5.5.1 Classification-Based Search Engines - an Architecture
    5.5.2 Classification-Based Search Engines - an Implementation
  5.6 Discussion
6 Concluding Remarks and Future Directions

List of Figures

2.1 Real information need, perceived information need, request, and query [90]
2.2 Text analysis process
2.3 Optimal separating hyperplane in a two-dimensional space
3.1 Conceptual illustration of a classification-based IR system
3.2 The Illustration of Ranks
3.3 Results using the Open Directory oracle classifier
3.4 Results for the KNN classifier with non-disjoint classes
3.5 Results for the KNN classifier in real case
4.1 Distribution of number of clicks per query for the different categories
4.2 Average clickthrough rate of clicked queries
4.3 Clickthrough position - all clickthroughs
4.4 Clickthrough position - first clickthrough for each clicked query
4.5 Daily volume of the queries in specific categories
5.1 The cumulative distribution of the rank of click through based on ranking position (multiple-click)
5.2 The Cumulative Distribution of Click Through Based on Ranking Position (one-click)
5.3 An architecture of a classification-based metasearch engine
5.4 Google Search User Interface
5.5 Search Engine Results Pages (SERP)
