QUERY CLASSIFICATION BASED ON A NEW APPROACH

by Li Shujie

Thesis submitted in partial fulfillment of the requirements for the Degree of Master of Science (Statistics)

Acadia University Fall Convocation 2009

© by Li Shujie, 2009

This thesis by Li Shujie was defended successfully in an oral examination on August 21, 2009.

The examining committee for the thesis was:

Dr. Anthony Tong, Chair

Dr. Crystal Linkletter, External Reader

Dr. Wilson Lu, Internal Reader

Dr. Hugh Chipman and Dr. Pritam Ranjan, Supervisors

Dr. Jeff Hooper, Department Head

This thesis is accepted in its present form by the Division of Research and Graduate Studies as satisfying the thesis requirements for the degree Master of Science (Statistics).


I, Li Shujie, grant permission to the University Librarian at Acadia University to reproduce, loan or distribute copies of my thesis in microform, paper or electronic formats on a non-profit basis. I, however, retain the copyright in my thesis.

Author

Supervisor

Date

Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Query Classification
    1.1.1 What is query classification?
    1.1.2 Why is query classification useful?
  1.2 Relevant Work
  1.3 My Approach

2 Data and Terminology
  2.1 Information Retrieval
    2.1.1 Bag of words assumption
    2.1.2 Document frequency and term frequency
  2.2 Information Theory
  2.3 GenieKnows Taxonomy
    2.3.1 Topics
    2.3.2 Original data
    2.3.3 Multi-word taxonomy

3 Feature Selection
  3.1 Feature Selection Using Chi-Square Statistic
    3.1.1 Penalized feature selection

4 Query Expansion
  4.1 Word Similarity
    4.1.1 Cosine similarity
    4.1.2 Smoothed KL divergence
  4.2 The Advantage of Using Feature Words

5 Query Classification
  5.1 Naive Bayes Classification Method
    5.1.1 Naive Bayes Bernoulli model
    5.1.2 Naive Bayes multinomial model
  5.2 Dirichlet/Multinomial Model

6 Experiments
  6.1 KDD Data
    6.1.1 Some problems in using KDD Cup 2005 queries
    6.1.2 Precision, recall and F1 value
    6.1.3 The number of returned topics
  6.2 Experiment Results
    6.2.1 Choosing α and k
    6.2.2 Notation
    6.2.3 F1 values for the KDD-Cup data
    6.2.4 Comparison of word similarity measures
    6.2.5 Comparison of three classification methods
    6.2.6 Comparison of feature word penalty parameters
    6.2.7 Comparison with the KDD-Cup 2005 competitors

7 Conclusion and Future Work

A Appendix (Feature Words)

List of Tables

2.1 Top-level topics in the GenieKnows taxonomy
2.2 Taxonomy extract for topic Arts/Entertainment
2.3 Multi-word taxonomy extract for topic Arts/Entertainment

6.1 KDD-Cup categories and GenieKnows topics
6.2 Number of changes to the feature words set for various values of penalty parameter α
6.3 Notation
6.4 F1 values: Cos+NBMUL
6.5 F1 values: Cos+NBBER
6.6 F1 values: Cos+DIRI
6.7 F1 values: KL+NBMUL
6.8 F1 values: KL+NBBER
6.9 F1 values: KL+DIRI
6.10 KDD-Cup 2005 results

A.1 Feature Words for Topic 1
A.2 Feature Words for Topic 2
A.3 Feature Words for Topic 3
A.4 Feature Words for Topic 4
A.5 Feature Words for Topic 5
A.6 Feature Words for Topic 6
A.7 Feature Words for Topic 7
A.8 Feature Words for Topic 8
A.9 Feature Words for Topic 9
A.10 Feature Words for Topic 10
A.11 Feature Words for Topic 11
A.12 Feature Words for Topic 12
A.13 Feature Words for Topic 13
A.14 Feature Words for Topic 14
A.15 Feature Words for Topic 15
A.16 Feature Words for Topic 16
A.17 Feature Words for Topic 17
A.18 Feature Words for Topic 18

List of Figures

1.1 Query classification system

6.1 Number of changed feature words compared to the no-penalty case
6.2 F1 values for the naive Bayes multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.3 F1 values for the naive Bayes Bernoulli model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.4 F1 values for the Dirichlet/multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.5 F1 values for the models using smoothed KL divergence. The black lines indicate the naive Bayes multinomial model, the green lines the naive Bayes Bernoulli model, and the red lines the Dirichlet/multinomial model
6.6 F1 values for the models using cosine similarity. The black lines indicate the naive Bayes multinomial model, the green lines the naive Bayes Bernoulli model, and the red lines the Dirichlet/multinomial model
6.7 F1 values for methods using smoothed KL divergence. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial. The colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)
6.8 F1 values for methods using cosine similarity. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial. The colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)

Abstract

Query classification is an important yet challenging problem for search engines and e-commerce companies. In this thesis, I develop a query classification system based on a novel query expansion approach and several classification methods. The proposed methodology is used to classify queries based on a taxonomy (a database of words and their corresponding topic classifications). The taxonomy used was obtained from GenieKnows, a search engine company in Halifax, Canada.

The query classification system can be divided into three phases: feature selection, query expansion, and query classification. The first phase uses a chi-square statistic to select a subset of “feature words” from the GenieKnows taxonomy; the second phase uses cosine similarity and Kullback-Leibler divergence to find the feature words most similar to the query, which are used to expand it; and the third phase applies one of three classification methods (a naive Bayes multinomial model, a naive Bayes Bernoulli model, and a Dirichlet/multinomial model) to classify the expanded queries.

Data from the KDD-Cup 2005 competition are used to test the performance of the proposed query classification system, and the experiments show that it performs well.

Acknowledgments

There are many people who deserve thanks for helping me during my study at Acadia. The last two years in this department and at GenieKnows have been an unforgettable experience for me.

First and foremost, I express my sincere gratitude to my supervisors Dr. Hugh Chipman and Dr. Pritam Ranjan. They not only provided guidance, direction, and funding, but also encouraged me throughout my study at Acadia. My thesis could not have been finished without their gracious help. I also express my gratitude to my committee members, Dr. Crystal Linkletter, Dr. Wilson Lu, Dr. Jeff Hooper and Dr. Anthony Tong.

Mathematics of Information Technology and Complex Systems (MITACS) and GenieKnows provided me with an eight-month internship, and this thesis is closely related to that internship at GenieKnows. Many thanks to MITACS and GenieKnows. Dr. Tony Abou-Assaleh, my internship supervisor at GenieKnows, provided a lot of support and instruction during my internship. Philip O'Brien and Dr. Luo Xiao reviewed my thesis and gave me many valuable suggestions. Dr. Luo Xiao also helped familiarize me with the GenieKnows data and gave me great help during my preliminary research for this thesis. I will always be grateful to them for their help.

Chapter 1

Introduction

1.1 Query Classification

1.1.1 What is query classification?

In search engines (e.g., Google or Yahoo) and e-commerce companies (e.g., Amazon.com or ebay.com), users type queries to obtain information. Queries are short pieces of text, such as a search term typed into a search engine. Query classification aims to classify queries into a set of target topics.

Suppose there are five target topics {education, shopping, restaurant, statistics, computer science}, and two queries “Acadia University” and “coffee”. Our main objective is to classify the two queries into one or more target topics. For instance, it is reasonable to classify the query “Acadia University” into the topic “education” and “coffee” into the topic “restaurant”.

1.1.2 Why is query classification useful?

Query classification can help e-commerce companies learn user preferences. If a user types five queries on an e-commerce company's website: {pattern recognition, statistical computing, statistical inference, linear models, data mining}, all these


queries should be classified into the topics “statistics” or “computer science”. From the information provided by query classification, the e-commerce company can infer that this user is interested in statistics or computer science. This is valuable information, which is often used to build user profiles for recommending products according to the user's preferences.

Query classification can also help search engines retrieve appropriate results in response to user queries. The problem of query classification has received significant attention from researchers and industry. For example, Topic-Sensitive PageRank (Haveliwala, 2002) improves the famous PageRank algorithm (Brin and Page, 1998) by classifying pages into 16 topics from the Open Directory Project (www.dmoz.org). In the next section, relevant work in query classification is presented.

1.2 Relevant Work

Some research has been done to address the query classification problem. Shen et al. (2005) used an ensemble-search based approach to query classification in the KDD-Cup 2005 competition. Beitzel et al. (2005a) examined three approaches to classifying queries: “matching against labeled queries”, “supervised learning of classifiers”, and “mining of selectional preference rules from unlabeled query logs”. Beitzel et al. (2005b) used computational linguistics to develop a semi-supervised learning method that takes advantage of vast amounts of unlabeled queries. Bhandari and Davison (2007) took advantage of features of the search results for a given query, such as snippets, page content and titles, for its classification. Beitzel et al. (2007) examined two issues in query classification: (i) whether to classify queries before or after search results have been retrieved, and (ii) whether to train the classifier explicitly from classified queries or using a document taxonomy.

The work cited above focuses on classifying queries into topics. Other work classifies queries into non-topic categories. For example, Kang and Kim (2003) classified queries into three non-topic categories: “the topic relevance task”,

“the homepage finding task”, and “the service finding task”. Rose and Levinson (2004) found that “navigational” searches are not prevalent and that most queries are “resource-seeking”. Broder et al. (2007) focused on rare queries; their methodology has two phases, the construction of a “document classifier” and then the development of a “query classifier”.

1.3 My Approach

Although the research introduced in the last section provides some methodologies for the query classification problem, in this thesis I introduce a new query classification system based on a new query expansion approach. Query classification can be considered a special case of text classification, since queries are short pieces of text; however, queries are typically so short that they are ambiguous, which makes them hard to classify directly.

In my thesis, I consider 18 topics (based on the application; see Chapter 2), and try to classify queries into these topics. The main idea is to extend each query to create a short “text”, and then to classify the “text” into one or more of the 18 topics by applying classification methods. The proposed approach can be summarized by answering the following three questions.

1. Where can we find a source of candidate words to expand a query?

2. How do we expand a query using these candidate words?

3. How do we classify a query after it has been expanded?

To answer these three questions, I design a query classification system containing three parts: feature selection, query expansion, and query classification.

The first part answers the first question. I select a subset of feature words for each topic. This process takes advantage of the GenieKnows taxonomy (a database of words and their appropriate classification, used throughout this thesis) and will be discussed in Chapter 2. These feature words are considered as a source of candidate words to

expand each query. The penalized feature selection used in this part is novel.

The second part of the query classification system answers the second question. For each word in a query, I measure the similarity between the word and all the feature words found in the first part. Two kinds of similarity score, cosine similarity and a novel smoothed KL divergence, are used to measure the degree of similarity. For each word in a query, the similarity scores of the feature words are sorted so that the most similar feature words can be chosen to expand the query. The expanded query consists of the original words and the most similar feature words.

The third part answers the third question. Three kinds of classification methods, the naive Bayes multinomial model, the naive Bayes Bernoulli model, and a novel Dirichlet/multinomial model, are used to classify the expanded query.

By applying a classification method, an expanded query receives a score for each of the 18 topics. These 18 scores are sorted, and the topics with the top scores are considered the topics to which the query belongs.

My query classification system is outlined in Figure 1.1.

Figure 1.1: Query classification system

Chapter 2

Data and Terminology

In this chapter, some necessary notation and concepts from information theory and information retrieval are reviewed. The GenieKnows taxonomy is also introduced.

2.1 Information Retrieval

Information retrieval studies how to help users find valuable information from a large collection of text documents. This section reviews some basic ideas from information retrieval: the bag of words assumption, document frequency, and term frequency. Liu (2007, Chapter 6) provides an excellent introduction to these ideas.

2.1.1 Bag of words assumption

One of the most basic questions in information retrieval is how to represent a document. The bag-of-words assumption is widely used to answer this question. This assumption treats each document as a “bag of words”: the positions of the words in the document are ignored, and a document is described by its set of distinct words.

Example 2.1. Suppose a document contains two sentences: “Data mining is not data compression.” and “Data mining is mining the data.”. By applying the bag-of-words assumption, the document can be represented as {data, mining, compression}. Words such as “is”, “not” and “the” are ignored; such insignificant words are considered “stopwords” (Liu, 2007).

2.1.2 Document frequency and term frequency

The term frequency (tf) of a word is the number of times the word appears in a document. In Example 2.1, there are only 3 distinct words: “data”, “mining”, and “compression”. We can represent the document as: “data, mining, compression”, and calculate the corresponding term frequencies as (4, 3, 1).

Now for a collection of documents, we can define document frequency. For any given word, the document frequency (df) of the word is the number of documents in which the word appears at least once.
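To make these definitions concrete, here is a minimal Python sketch (my illustration, not code from the thesis) that computes term and document frequencies for the document of Example 2.1; the toy stopword list is an assumption for this example only.

```python
from collections import Counter

STOPWORDS = {"is", "not", "the"}  # toy stopword list for this example

def bag_of_words(document):
    """Tokenize, lowercase, strip punctuation, and drop stopwords."""
    tokens = [t.strip(".,").lower() for t in document.split()]
    return [t for t in tokens if t not in STOPWORDS]

docs = ["Data mining is not data compression. Data mining is mining the data."]

# Term frequency: counts of each word within one document.
tf = Counter(bag_of_words(docs[0]))
print(tf)  # Counter({'data': 4, 'mining': 3, 'compression': 1})

# Document frequency: number of documents containing each word at least once.
df = Counter()
for d in docs:
    df.update(set(bag_of_words(d)))
print(df)  # with a single document, every df count is 1
```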

2.2 Information Theory

Information theory originated with Shannon (1948) when he was working at Bell Labs. This section reviews two concepts from information theory: entropy and KL divergence.

Suppose we have a discrete random variable X taking one of n categorical values, with corresponding probability mass function p(X) = (p1, p2, . . . , pn). The entropy H(X) of the discrete random variable X is a measure of the amount of uncertainty associated with the value of X, and can be defined as

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i. \tag{2.1}$$

Kullback-Leibler (KL) divergence (also known as relative entropy) is another important concept in information theory, often used to compare two distributions. Suppose we have two random variables X and Y with probability mass functions p(X) = (p1, p2, . . . , pn) and p(Y) = (q1, q2, . . . , qn). Then the KL divergence between X and Y can be defined as

$$KL(X, Y) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}. \tag{2.2}$$

KL divergence measures the dissimilarity between the distributions of X and Y: if X and Y are similar, the KL divergence will be small.
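As a small illustration (mine, not from the thesis), the following Python sketch implements (2.1) and (2.2) using natural logarithms; the thesis does not fix the base of the logarithm, so nats are an assumption here.

```python
import math

def entropy(p):
    """H(X) in (2.1); terms with p_i = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(X, Y) in (2.2); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.4, 0.3, 0.3]
print(entropy(p))           # about 1.04 nats
print(kl_divergence(p, q))  # about 0.02, small because p and q are similar
```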

2.3 GenieKnows Taxonomy

Chakrabarti (2003) defines a taxonomy as “a large and complex class hierarchy”. The GenieKnows taxonomy is built from YellowPage data, and it summarizes information such as the document frequency and term frequency of the words within each topic.

GenieKnows is a vertical search engine company (www..ca), and their taxonomy serves the GenieKnows local search engine. Local search engines return local business information to users. For example, for the query “coffee”, a local search engine will return only local coffee businesses and merchandise to users, while a general search engine, like Google, will return all kinds of web pages, perhaps including a web page about how to make coffee.

My query classification system is based on the GenieKnows taxonomy; however, it can also use other kinds of taxonomy, and so, depending on the taxonomy, it can work for other kinds of search engines.

2.3.1 Topics

In the GenieKnows taxonomy, the topics form a two-level hierarchical structure. The top level includes the 18 topics listed in Table 2.1.

Table 2.1: Top-level topics in the GenieKnows taxonomy

Topic ID  Topic Name
1   Arts/Entertainment
2   Automotive
3   Business/Professional Services
4   Community/Government
5   Computers/Electronics
6   Construction/Contractors
7   Education
8   Food/Dining
9   Health/Medicine
10  Home/Garden
11  Industry/Agriculture
12  Legal/Financial Services
13  Media/Communications
14  Personal Care/Services
15  Real Estate
16  Shopping
17  Sports/Recreation
18  Travel/Transportation

The bottom level includes 273 sub-topics. For example, for the topic “Food/Dining”, some of the sub-topics are “bakeries”, “coffee shops”, “convenience stores”, “health food”, etc. In my research, only the 18 top-level topics are considered: my task is to classify each query into one or more of these 18 topics. The same approach could also be applied to the 273 sub-topics.

2.3.2 Original data

The GenieKnows taxonomy comes from the Internet YellowPage data (http://www.yellowpages.ca/business/), which includes all the businesses in North America. In the YellowPage data, businesses have been classified into the hierarchical categories described above. For each business, there are 67 fields describing the business, such as “Business name”, “Telephone”, etc. In this work, only three fields are used to build the taxonomy: “Business name”, “Services” and

“Captions”.

At GenieKnows, researchers extracted all the words, except stopwords, from these YellowPage fields to build the taxonomy data. For each topic, all the words occurring in the topic are listed. An extract of the GenieKnows taxonomy for topic 1 (Arts/Entertainment) is shown in Table 2.2.

Table 2.2: Taxonomy extract for topic Arts/Entertainment

Words (wi)  g(wi)      h(wi)
selection   0.00125    191
rolled      6.967E-6   48231
roller      6.53E-5    3546
blenko      3.483E-6   120578
liqr        6.1E-6     34451
saginaw     1.045E-5   20096

In Table 2.2, the first column lists the words that appear in this topic. The second column presents g(wi), a measure of term frequency, and the third column presents h(wi), a measure of document frequency. For proprietary reasons, the exact expressions for g(wi) and h(wi) are not provided.

2.3.3 Multi-word taxonomy

The GenieKnows multi-word taxonomy summarizes the co-occurrence frequencies of pairs of words. Part of the multi-word taxonomy for Topic 1 (Arts/Entertainment) is shown in Table 2.3.

Table 2.3: Multi-word taxonomy extract for topic Arts/Entertainment

wi        wj             f(wi, wj)
antique   windsor        4.0
circle    entertainment  5.0
carpet    oriental       5.0
richard   wright         4.0
annual    recital        7.0
magic     shop           17.0
age       fun            23.0
piano     sharp          7.0
bass      sam            4.0
kirkland  photography    3.0

The entries in the third column, f(wi, wj), are functions of the number of times the two words wi and wj co-occur in the topic “Arts/Entertainment”. For proprietary reasons, the exact expression for f(wi, wj) is not provided.

Chapter 3

Feature Selection

This chapter introduces the first part of my query classification system: feature selection. For each topic, some words in the taxonomy are highly related to the topic and yet diverse enough to capture the variation among all the words associated with the topic. Such words are called the feature words of the topic. Feature words for all the topics are collected to build a feature words dictionary, which is used to expand each query. By constructing a feature words dictionary, I am able to represent queries in terms of a set of under 200 words, rather than the over 100,000 words in the taxonomy.

3.1 Feature Selection Using Chi-Square Statistic

The chi-square (χ2) statistic is one of the most popular methods for feature selection in text categorization (Yang and Pedersen, 1997). It measures the dependence between a word wi and a topic Tj.

Yang and Pedersen (1997) and Chakrabarti (2003) provide a chi-square based statistic for selecting feature words from a collection of documents. Consider a 2 × 2 frequency table corresponding to occurrence/non-occurrence of word wi (wi / w̄i) and occurrence/non-occurrence of topic Tj (Tj / T̄j), where j = 1, 2, . . . , 18.

Word/Topic   Tj   T̄j
wi           A    B
w̄i           C    D

Let N = A + B + C + D be the total number of documents. The frequency table gives estimates p(Tj) = (A + C)/N, p(T̄j) = (B + D)/N, p(wi) = (A + B)/N, and p(w̄i) = (C + D)/N. Pearson's χ2 statistic is

$$\chi^2(T_j, w_i) = \frac{(A - N\,p(w_i)\,p(T_j))^2}{N\,p(w_i)\,p(T_j)} + \frac{(B - N\,p(w_i)\,p(\bar{T}_j))^2}{N\,p(w_i)\,p(\bar{T}_j)} + \frac{(C - N\,p(\bar{w}_i)\,p(T_j))^2}{N\,p(\bar{w}_i)\,p(T_j)} + \frac{(D - N\,p(\bar{w}_i)\,p(\bar{T}_j))^2}{N\,p(\bar{w}_i)\,p(\bar{T}_j)}, \tag{3.1}$$

which can be simplified (Chakrabarti, 2003) to

$$\chi^2(T_j, w_i) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}, \tag{3.2}$$

which is used for feature selection in information retrieval. For a given topic Tj (j = 1, 2, . . . , 18), I use (3.2) to calculate the chi-square statistic between word wi and topic Tj for every word in the taxonomy. These chi-square values are sorted, and the 11 words with the largest χ2 values are chosen as the feature words for topic Tj.
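To make the selection step concrete, here is a Python sketch (my illustration, not code from the thesis) that scores words by (3.2) and keeps the top 11; it assumes the 2 × 2 counts (A, B, C, D) have already been tallied from the taxonomy, and the function and variable names are mine.

```python
def chi_square(A, B, C, D):
    """Pearson chi-square for a 2x2 word/topic table, simplified form (3.2)."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def top_feature_words(tables, n=11):
    """tables maps each word to its (A, B, C, D) counts for a fixed topic T_j;
    returns the n words with the largest chi-square statistic."""
    scores = {w: chi_square(*t) for w, t in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```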

3.1.1 Penalized feature selection

If we inspect the feature words selected by the chi-square statistic, we find that within some topics several feature words have very similar meanings. To overcome this problem, I propose penalized feature selection (PFS) as a generalization of the chi-square method.

For example, without penalization, topic 4 in the taxonomy “Community/Government” has the feature words: “church”, “baptist”, “tax”, “ministry”, “christ”, “god”, “methodist”, “filing”, “united”, “efile”, “christian”. Among these feature words, “church”, “baptist”, “ministry”, “christ”, “god”, and “christian” have very similar meanings and are not diverse enough to capture all the words associated with the topic “Community/Government”.

The example highlights a potential problem with the feature selection method: the chi-square statistic measures the dependence between topic Tj and word wi, but does not account for dependence among the words themselves. In some cases, such as the example presented here, redundant words may be selected. The objective of PFS is to make the selected feature words as diverse as possible.

The selection is a sequential process. For a given topic Tj, suppose L (1 ≤ L ≤ 10) words have been selected as feature words, coded as wk1, wk2, . . . , wkL; the indices k1, . . . , kL identify the L words. As the (L+1)-st feature word, I choose the word wi which maximizes the PFS score

$$PFS(w_i, T_j) = \chi^2(T_j, w_i) - \alpha \sum_{l=1}^{L} \chi^2(w_i, w_{k_l}). \tag{3.3}$$

PFS adds a penalty term to (3.2). Each term χ2(wi, wkl) in the penalty, measuring the dependence between words wi and wkl, is calculated via (3.2) using the multi-word taxonomy described in Section 2.3.3, and the parameter α controls the amount of penalization. In penalized feature selection, the first feature word is still the word with the largest chi-square statistic. If two words have similar meanings, they will tend to have a larger chi-square statistic with each other; that is, χ2(wi, wkl) will be large if word wi is dependent on the previously selected word wkl. So if a word wi has a meaning similar to some selected feature words wkl, the penalty term α Σ_{l=1}^{L} χ2(wi, wkl) will tend to be large. The penalty term therefore discourages feature words with similar meanings, and makes the selected feature words as diverse as possible.
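The greedy procedure in (3.3) can be sketched as follows (my illustration; the precomputed dictionaries chi_topic, holding the χ2(Tj, wi) values, and chi_pair, holding the pairwise χ2 values from the multi-word taxonomy, are assumed names):

```python
def penalized_feature_selection(words, chi_topic, chi_pair, alpha, n=11):
    """Greedily select n feature words by maximizing the PFS score (3.3).
    chi_pair.get((w, v), 0.0) returns 0 for pairs that never co-occur."""
    selected = []
    candidates = set(words)
    while len(selected) < n and candidates:
        def pfs(w):
            penalty = sum(chi_pair.get((w, v), 0.0) for v in selected)
            return chi_topic[w] - alpha * penalty
        best = max(candidates, key=pfs)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With an empty `selected` list the penalty vanishes, so the first chosen word is indeed the one with the largest chi-square statistic, matching the description above.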

For example, in topic 4 (Community/Government), if α is set to 0.002, the selected feature words are: “church”, “baptist”, “tax”, “efile”, “lcsw”, “filing”, “cremation”, “investigation”, “methodist”, “electronic”, “club”. With the help of penalized feature selection, these feature words are more diverse. The feature words corresponding to a variety of α values for all 18 topics are listed in the Appendix, and the choice of values for α is discussed in Section 6.2.1.

Chapter 4

Query Expansion

In this chapter, I introduce the second part of the query classification system: query expansion. First, I introduce two methods to define word similarity in Section 4.1. These two methods will be used to measure the similarity between each query word and all the feature words. Then, the k most similar feature words will be added to the query to form a small text. In Section 4.2, I explain why only feature words are considered as candidates to generate an expanded query.

4.1 Word Similarity

Two words are considered to be similar if they have similar meaning. For example, “restaurant” and “food” are similar words, but “restaurant” and “computer” are not. To measure similarity between two words, I use two methods: cosine similarity and smoothed KL divergence. Both methods are based on the GenieKnows taxonomy.

4.1.1 Cosine similarity

Cosine similarity can be used to measure the similarity between two words. Let a word wi be expressed as a vector vwi given by

$$v_{w_i} = [df_{i,1}, df_{i,2}, \ldots, df_{i,18}], \tag{4.1}$$


where dfi,j (j = 1, . . . , 18) represents the document frequency of word wi in topic Tj. The cosine similarity between two words wi and wk can be defined as

$$SC(w_i, w_k) = \frac{\sum_{j=1}^{18} df_{i,j}\, df_{k,j}}{\sqrt{\sum_{j=1}^{18} df_{i,j}^2}\, \sqrt{\sum_{j=1}^{18} df_{k,j}^2}}, \tag{4.2}$$

which can also be written as

$$SC(w_i, w_k) = \cos(\theta) = \frac{\langle v_{w_i}, v_{w_k} \rangle}{\| v_{w_i} \| \, \| v_{w_k} \|}, \tag{4.3}$$

where θ is the angle between the vectors vwi and vwk, ⟨vwi, vwk⟩ is their dot product, and ‖vwi‖ denotes the magnitude of the vector vwi.

Since the document frequency values in vwi and vwk are non-negative, 0 ≤ SC(wi, wk) = cos(θ) ≤ 1. Values of SC(wi, wk) close to 1 indicate that the two words wi and wk are similar, whereas values close to 0 mean that they are not. This measure provides a relative ranking of the closeness of words to a specific word.

Example 4.1. To compute the similarity between “restaurant”, “food”, and “computer”, I express these three words in document frequency vector form, as in (4.1):

$$v_{w_{\text{restaurant}}} = [12631, 2221, \ldots, 3830],$$
$$v_{w_{\text{food}}} = [4496, 5157, \ldots, 395],$$
$$v_{w_{\text{computer}}} = [1497, 16529, \ldots, 398].$$

The cosine similarity of the word “restaurant” with the other two words is

$$SC(\text{restaurant}, \text{food}) = \frac{12631 \times 4496 + \cdots + 3830 \times 395}{\sqrt{12631^2 + \cdots + 3830^2}\, \sqrt{4496^2 + \cdots + 395^2}} = 0.93,$$

$$SC(\text{restaurant}, \text{computer}) = \frac{12631 \times 1497 + \cdots + 3830 \times 398}{\sqrt{12631^2 + \cdots + 3830^2}\, \sqrt{1497^2 + \cdots + 398^2}} = 0.53.$$

Clearly, “restaurant” and “food” are more similar than “restaurant” and “computer”.
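A minimal sketch of (4.2) over the 18-dimensional document frequency vectors (my illustration, not code from the thesis):

```python
import math

def cosine_similarity(df_i, df_k):
    """SC(w_i, w_k) in (4.2): cosine of the angle between two df vectors."""
    dot = sum(a * b for a, b in zip(df_i, df_k))
    norm_i = math.sqrt(sum(a * a for a in df_i))
    norm_k = math.sqrt(sum(b * b for b in df_k))
    return dot / (norm_i * norm_k) if norm_i and norm_k else 0.0
```

Applied to the full 18-entry vectors above, this would reproduce the values 0.93 and 0.53 of Example 4.1.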

4.1.2 Smoothed KL divergence

Kullback-Leibler (KL) divergence (or relative entropy) was introduced in Section 2.2. It is commonly used to measure the dissimilarity between two distribution functions. Here, we consider the distributions of two words across the 18 topics of the GenieKnows taxonomy.

Let pi,j denote the relative document frequency of wi, given by

$$p_{i,j} = \frac{df_{i,j}}{\sum_{j=1}^{18} df_{i,j}}; \tag{4.4}$$

then the vector vwi can be re-expressed as a probability mass function corresponding to the word wi,

$$v_{w_i} = [p_{i,1}, p_{i,2}, \ldots, p_{i,18}]. \tag{4.5}$$

Thus, the KL divergence between two words wi and wk can be represented as

$$KL(w_i, w_k) = \sum_{j=1}^{18} p_{i,j} \log\!\left(\frac{p_{i,j}}{p_{k,j}}\right). \tag{4.6}$$

A problem with KL divergence is that it can be infinite if pk,j = 0 and the corresponding pi,j ≠ 0. To solve this problem, I use the idea of smoothing from Manning, Raghavan and Schutze (2008) and add a smoothing term of 0.5 to (4.4). The modified pi,j is given by

$$p_{i,j} = \frac{df_{i,j} + 0.5}{\sum_{j=1}^{18} df_{i,j} + 0.5 \times 18}. \tag{4.7}$$

Since the smallest non-zero document frequency (df) is 1, the smaller smoothing term 0.5 is used. The KL divergence expression in (4.6) is used with (4.7) to calculate the similarity between two words.

KL divergence is not symmetric, which means KL(wi, wk) ≠ KL(wk, wi). In the problem under consideration, the feature words are considered as “reference” or “baseline” words, so the first word wi in the KL divergence score in (4.6) is the feature word and the second word wk is the query word.

In cosine similarity measurement, a large value (close to 1) indicates that two words are similar; in smoothed KL divergence, a smaller value indicates that two words are more similar.

Example 4.2. Consider the setup of Example 4.1. To compute the smoothed KL divergence between “restaurant”, “food”, and “computer”, I first express these three words as vectors of adjusted probabilities, as in (4.7):

$$v_{w_{\text{restaurant}}} = [0.073, 0.013, \ldots, 0.022],$$
$$v_{w_{\text{food}}} = [0.039, 0.044, \ldots, 0.003],$$
$$v_{w_{\text{computer}}} = [0.011, 0.118, \ldots, 0.00285].$$

The smoothed KL divergence of the word “restaurant” with the other two words is then

$$KL(\text{restaurant}, \text{food}) = 0.073 \log\frac{0.073}{0.039} + \cdots + 0.022 \log\frac{0.022}{0.003} = 0.182,$$

$$KL(\text{restaurant}, \text{computer}) = 0.073 \log\frac{0.073}{0.011} + \cdots + 0.022 \log\frac{0.022}{0.00285} = 3.872.$$

Since (“restaurant”, “food”) has a smaller KL divergence value than (“restaurant”, “computer”), “restaurant” is more similar to “food” than to “computer”.
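A sketch of the smoothed measure, combining (4.7) with (4.6) (my illustration; the function names are mine):

```python
import math

def smoothed_pmf(df_vector, smoothing=0.5):
    """p_{i,j} in (4.7): document frequencies smoothed by 0.5 and normalized."""
    total = sum(df_vector) + smoothing * len(df_vector)
    return [(df + smoothing) / total for df in df_vector]

def smoothed_kl(df_feature, df_query):
    """KL(w_i, w_k) in (4.6) on smoothed probabilities (4.7); the feature
    word is the first argument, as prescribed in Section 4.1.2."""
    p = smoothed_pmf(df_feature)
    q = smoothed_pmf(df_query)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because every smoothed probability is strictly positive, the infinite-divergence problem described above cannot occur.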

4.2 The Advantage of Using Feature Words

In the proposed methodology, only feature words, instead of all the words in the GenieKnows taxonomy, are used as candidates to expand queries. The main reason is to avoid adding noisy words to a query: words in the taxonomy that are mistakenly selected as the most similar to the query words are called noisy words.

Example 4.3. For the query “bill”, if I use all the words in the GenieKnows taxonomy as candidates, the 9 most similar words using cosine similarity are

stan, steve, doug, rick, curt, dan, holley, tom, ron. (4.8)

Almost all of these are noisy words, i.e., they are not related to the query “bill”. However, if only the feature words are considered as candidates, the most similar words for “bill” using cosine similarity (with penalty parameter α set to 0.0002) are

repair, home, supply, system, estimate, consulting, storage, control, electrical. (4.9)

Using smoothed KL divergence (again with α = 0.0002), the most similar words for “bill” are

repair, home, supply, system, estimate, consulting, storage, control, electrical. (4.10)

Thus, (4.9) and (4.10) are more reasonable than (4.8), which uses all the words in the taxonomy. The expanded query is treated as a small text; for instance, the query “bill” would be expanded to 10 words (“bill” plus the 9 words listed above).

In summary, for each word in the query (ignoring stopwords like “the”, “as”, “of”, . . . ), I use cosine similarity or smoothed KL divergence to measure the similarity between the query word and all the feature words, and the k most similar feature words are added to the query. If a query has m words, it will be expanded to m × (k + 1) = m × k + m words, where m is the number of words in the original query and m × k is the number of feature words added. In the next chapter, I present classification methods that can be applied to classify the expanded query (i.e., a small text with m × (k + 1) words).
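To tie the chapter together, a minimal sketch of the expansion step (my illustration; the function and parameter names are mine, and `similarity` stands for either of the two measures above, with a KL divergence negated so that larger always means more similar):

```python
def expand_query(query_words, feature_words, similarity, k, stopwords=frozenset()):
    """Expand a query to m*(k+1) words: each non-stopword query word keeps
    itself plus its k most similar feature words."""
    expanded = []
    for w in query_words:
        if w in stopwords:
            continue
        ranked = sorted(feature_words, key=lambda f: similarity(w, f), reverse=True)
        expanded.append(w)
        expanded.extend(ranked[:k])
    return expanded
```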

Chapter 5

Query Classification

In this chapter, three classification methods are introduced. The objects to be classified are the queries that have been expanded, not the original queries. Expanded queries consist of the original query words and the feature words that have been added. For instance, in Example 4.3, if cosine similarity is used in conjunction with feature words, the expanded query for “bill” would be the ten words: “bill”, “repair”, “home”, “supply”, “system”, “estimate”, “consulting”, “storage”, “control”, and “electrical”.

Three classification methods are applied to classify an expanded query: two naive Bayes classifiers and a Dirichlet/multinomial model. The naive Bayes classifiers are based on the Bernoulli and multinomial distributions. For more details on Bayes models, see McCallum and Nigam (1998), Chakrabarti (2003), Liu (2007), and Manning, Raghavan and Schutze (2008).

5.1 Naive Bayes Classification Method

The objective here is to classify an expanded query Q into one or more topics. Suppose the expanded query consists of M words, Q = {w1, w2, . . . , wM}; the probability that the expanded query Q belongs to topic Tj is denoted p(Tj|Q).


An expanded query is considered as a text, and in the naive Bayes classification method, the text is thought to be generated by a two-step process:

1. a topic Tj is randomly picked for the expanded query with prior probability p(Tj), where Σ_{j=1}^{18} p(Tj) = 1;

2. after a topic Tj is chosen, its topic-conditional distribution p(Q|Tj) is used to generate the expanded query.

By Bayes rule,

$$p(T_j|Q) \propto p(T_j) \times p(Q|T_j). \tag{5.1}$$

This Bayes classification method is called “naive” because, given a topic Tj, the words in the expanded query are assumed to be independent. So, (5.1) can be written as

$$p(T_j|Q) \propto p(T_j) \times p(w_1|T_j) \times p(w_2|T_j) \times \cdots \times p(w_M|T_j), \tag{5.2}$$

where p(Tj) represents the prior probability, Π_{i=1}^{M} p(wi|Tj) is the likelihood function, and thus p(Tj|Q) is the posterior probability that the given expanded query Q belongs to topic Tj. The prior probability p(Tj) can be estimated as

$$\hat{p}(T_j) = \frac{N_j}{N}, \tag{5.3}$$

where Nj denotes the number of documents in topic Tj, and N represents the total number of documents in the GenieKnows taxonomy. The i-th term in the likelihood function, p(wi|Tj), is the probability that the word wi occurs in topic Tj. The key problem in the naive Bayes classification method is how to estimate p(wi|Tj).

Next, I present two models for estimating p(wi|Tj): (a) the naive Bayes Bernoulli model and (b) the naive Bayes multinomial model.

5.1.1 Naive Bayes Bernoulli model

Let topic Tj have Nj documents, and let Zir (r = 1, 2, . . . , Nj) be independent Bernoulli random variables such that Zir = 1 if wi is in the r-th document of topic Tj, and 0 otherwise. Then dfij = Σ_{r=1}^{Nj} Zir denotes the observed number of documents in topic Tj that contain word wi, and the corresponding random variable DFij can be modeled by a binomial distribution with parameters Nj and p(wi|Tj). That is,

$$P(DF_{ij} = df_{ij}) = \frac{N_j!}{(N_j - df_{ij})!\, df_{ij}!}\, p(w_i|T_j)^{df_{ij}}\, (1 - p(w_i|T_j))^{N_j - df_{ij}}, \tag{5.4}$$

where Nj is the number of documents in topic Tj. The maximum likelihood estimate of p(wi|Tj) is

$$\hat{p}(w_i|T_j) = \frac{df_{ij}}{N_j}, \tag{5.5}$$

where dfij denotes the observed document frequency of word wi in topic Tj.

Occasionally, for rare words that do not appear in a topic Tj but appear in an expanded query, dfij can be equal to zero. As a result, when (5.5) is used in (5.2), the posterior probability in (5.2) becomes zero. This problem can be resolved by applying Bayesian methods.

In the Bayesian method, we assume the prior distribution on p(wi|Tj) is a beta distribution, with probability density function

$$f(p(w_i|T_j) \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p(w_i|T_j)^{a-1}\, (1 - p(w_i|T_j))^{b-1}, \tag{5.6}$$

where a, b > 0. The prior distribution has mean

$$\hat{p}(w_i|T_j) = \frac{a}{a+b}, \tag{5.7}$$

which is our best estimate of p(wi|Tj) without having seen the data. I take a = b = 1 in my research. Next, we combine the prior information (5.7) with the likelihood estimate (5.5) of p(wi|Tj).

Let the joint pdf of p(wi|Tj) and DFij be f(p(wi|Tj), DFij). Then the posterior density (the distribution of p(wi|Tj) given dfij) is

$$f(p(w_i|T_j) \mid df_{ij}) = \frac{f(p(w_i|T_j), df_{ij})}{f(df_{ij})} = \frac{\Gamma(N_j + a + b)}{\Gamma(df_{ij} + a)\,\Gamma(N_j - df_{ij} + b)}\, p(w_i|T_j)^{df_{ij}+a-1}\, (1 - p(w_i|T_j))^{N_j - df_{ij} + b - 1}, \tag{5.8}$$

which is also a beta distribution (Bishop, 2006). The posterior mean of p(wi|Tj) is given by

$$\hat{p}(w_i|T_j) = \frac{df_{ij} + a}{N_j + a + b} = \frac{df_{ij} + 1}{N_j + 2}, \tag{5.9}$$

which combines the prior information in (5.7) and the likelihood information in (5.5). In (5.9), even if dfij = 0, a very small non-zero estimate 1/(Nj + 2) is obtained, so that the estimate of p(Tj|Q) is positive. By applying (5.2), for an expanded query with M words,

$$\hat{p}(T_j|Q) = \frac{N_j}{N} \times \prod_{i=1}^{M} \frac{df_{ij} + 1}{N_j + 2}. \tag{5.10}$$
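As an illustration of how (5.10) might be computed, here is a Python sketch (mine, not the thesis's implementation); `df`, `N_topic`, and `N_total` are hypothetical containers for the taxonomy quantities, and the scoring is done in log space to avoid numerical underflow of the long product.

```python
import math

def log_posterior_bernoulli(expanded_query, topic, df, N_topic, N_total):
    """Unnormalized log of (5.10): log prior plus the log posterior-mean
    Bernoulli estimates (df_ij + 1) / (N_j + 2). df[(w, topic)] holds the
    document frequency of word w in the topic, defaulting to 0."""
    score = math.log(N_topic[topic] / N_total)
    for w in expanded_query:
        score += math.log((df.get((w, topic), 0) + 1) / (N_topic[topic] + 2))
    return score
```

Since the log is monotone, ranking topics by this log score is equivalent to ranking them by (5.10).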

5.1.2 Naive Bayes multinomial model

Recall that tfij represents the observed number of occurrences of word wi in all documents belonging to topic Tj. Let TFij be the corresponding random variable, and Uj be the number of unique words in topic Tj. Then Σ_{k=1}^{Uj} tfkj represents the number of occurrences of all the words in topic Tj. A multinomial distribution can be used to model the joint distribution of (TF1j, TF2j, . . . , TFUjj). The joint pmf is

$$P(TF_{1j} = tf_{1j}, \ldots, TF_{U_j j} = tf_{U_j j}) = \frac{\left(\sum_{k=1}^{U_j} tf_{kj}\right)!}{\prod_{i=1}^{U_j} (tf_{ij}!)} \times \prod_{i=1}^{U_j} p(w_i|T_j)^{tf_{ij}}. \tag{5.11}$$

The maximum likelihood estimate (McCallum and Nigam, 1998) for p(wi|Tj) is

$$\hat{p}(w_i|T_j) = \frac{tf_{ij}}{\sum_{k=1}^{U_j} tf_{kj}}. \tag{5.12}$$

As in the naive Bayes Bernoulli model, when tfij = 0, p(Tj|Q) will equal 0. The problem occurs for infrequently occurring words that do not appear in a topic Tj but appear in an expanded query.

McCallum and Nigam (1998) use a smoothed estimate

$$\hat{p}(w_i|T_j) = \frac{tf_{ij} + 1}{\sum_{k=1}^{U_j} tf_{kj} + U_j} \tag{5.13}$$

to avoid this problem. This estimate can also be obtained by applying the Bayesian method, similar to the discussion in Section 5.1.1: we assume the prior distribution on p(wi|Tj) is a Dirichlet distribution, which extends the beta distribution from the binary case to the multi-class case. The posterior distribution is still a Dirichlet distribution, and (5.13) is its posterior mean for p(wi|Tj) (McCallum and Nigam, 1998).

However, from the GenieKnows taxonomy I only have the data for computing the ratio tfij / Σ_{k=1}^{Uj} tfkj, and do not have the individual tfij and Σ_{k=1}^{Uj} tfkj, so I cannot use (5.13) to estimate p(wi|Tj). As an alternative, I use

$$\hat{p}(w_i|T_j) = \max\left\{ \frac{tf_{ij}}{\sum_{k=1}^{U_j} tf_{kj}},\; \epsilon \right\}, \tag{5.14}$$

where ε = 1 × 10^{-11} is a small constant. If tfij = 0, ε is the estimate of p(wi|Tj). The smoothing constant ε is chosen very small so that it avoids the zero problem due to rare words described above, but does not influence the non-zero probabilities.

Thus, for an expanded query with M words, p(Tj|Q) can be estimated by the naive Bayes multinomial model as

$$\hat{p}(T_j|Q) = \frac{N_j}{N} \times \prod_{i=1}^{M} \max\left\{ \frac{tf_{ij}}{\sum_{k=1}^{U_j} tf_{kj}},\; \epsilon \right\}. \tag{5.15}$$
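A corresponding sketch of (5.15), again in log space (my illustration; `rel_tf`, `N_topic`, and `N_total` are assumed container names):

```python
import math

EPSILON = 1e-11  # the smoothing constant from (5.14)

def log_posterior_multinomial(expanded_query, topic, rel_tf, N_topic, N_total):
    """Unnormalized log of (5.15). rel_tf[(w, topic)] holds the ratio
    tf_ij / sum_k tf_kj, the only term-frequency quantity available from
    the taxonomy, defaulting to 0.0 for unseen words."""
    score = math.log(N_topic[topic] / N_total)
    for w in expanded_query:
        score += math.log(max(rel_tf.get((w, topic), 0.0), EPSILON))
    return score
```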

So far in this chapter, I have proposed two methods for estimating p(wi|Tj), one using the naive Bayes Bernoulli model and the other using the naive Bayes multinomial model. The estimated p(wi|Tj) is used in (5.2) to compute the desired posterior probability, p(Tj|Q). Next, I propose an alternative, direct approach for estimating p(Tj|Q).

5.2 Dirichlet/Multinomial Model

Suppose each expanded query has M words, and let qfj (j = 1, 2, . . . , 18) be the observed number of query words that are also feature words of topic Tj. Let QFj be the corresponding random variable, and let Σ_{j=1}^{18} qfj = M*. Note that M* could be larger than M, since some topics share the same feature words. A multinomial distribution can be used to model the distribution of (QF1, QF2, . . . , QF18) with parameters M* and μ1, μ2, . . . , μ18, where μj = p(Tj|Q). The joint pmf of (QF1, QF2, . . . , QF18) is given by

$$P(QF_1 = qf_1, \ldots, QF_{18} = qf_{18}) = \frac{M^{*}!}{\prod_{j=1}^{18} qf_j!} \times \prod_{j=1}^{18} \mu_j^{qf_j}. \tag{5.16}$$

The maximum likelihood estimate for μj = p(Tj|Q) is

$$\hat{\mu}_j = \frac{qf_j}{M^*}. \tag{5.17}$$

To apply the Bayesian method, we assume the prior distribution for μj is a Dirichlet distribution with hyperparameters a1 = c × N1/N, . . . , a18 = c × N18/N, and joint probability density function

$$f(\mu_1, \mu_2, \ldots, \mu_{18} \mid a_1, \ldots, a_{18}) = \frac{\Gamma\!\left(\sum_{j=1}^{18} a_j\right)}{\prod_{j=1}^{18} \Gamma(a_j)} \times \prod_{j=1}^{18} \mu_j^{a_j - 1}, \tag{5.18}$$

where c is a constant, Nj is the number of documents in topic Tj, and N is the number of documents in the GenieKnows taxonomy, so that Σ_{j=1}^{18} Nj = N.

The expected value of μj under the prior distribution is

$$\hat{\mu}_j = \frac{c \times \frac{N_j}{N}}{c \times \sum_{j=1}^{18} \frac{N_j}{N}} = \frac{N_j}{N}, \tag{5.19}$$

which is our best estimate of μj = p(Tj|Q) without having seen the data. Next, we combine the prior information (5.19) with the likelihood estimate (5.17) of μj = p(Tj|Q). Let the joint distribution of {μ1, μ2, . . . , μ18} and {QF1, . . . , QF18} be f(μ1, μ2, . . . , μ18, QF1, . . . , QF18). The posterior distribution is still a Dirichlet distribution (Bishop, 2006),

$$f(\mu_1, \ldots, \mu_{18} \mid qf_1, \ldots, qf_{18}) = \frac{f(\mu_1, \ldots, \mu_{18}, qf_1, \ldots, qf_{18})}{f(qf_1, \ldots, qf_{18})} = \frac{\Gamma\!\left(\sum_{j=1}^{18} (a_j + qf_j)\right)}{\prod_{j=1}^{18} \Gamma(a_j + qf_j)} \times \prod_{j=1}^{18} \mu_j^{a_j + qf_j - 1}. \tag{5.20}$$

Setting c = 1 gives the posterior mean of μj = p(Tj|Q) as

$$\hat{\mu}_j = \hat{p}(T_j|Q) = \frac{\frac{N_j}{N} + qf_j}{1 + M^*}, \quad j = 1, \ldots, 18. \tag{5.21}$$

The choice of c = 1 corresponds to a reasonably uninformative prior. With c = 1, the topic with the largest probability estimate in (5.21) will correspond to the topic with the most query words (largest qfj). In this sense, the choice c = 1 is uninformative.
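A sketch of (5.21) with c = 1 (my illustration; the container names are assumptions):

```python
def dirichlet_posterior(qf, N_topic, N_total):
    """(5.21) with c = 1: qf[j] counts the query words that are feature
    words of topic j; returns the posterior means of p(T_j | Q)."""
    M_star = sum(qf)
    return [(N_topic[j] / N_total + qf[j]) / (1 + M_star) for j in range(len(qf))]
```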

In this chapter, (5.10), (5.15), and (5.21) provide three ways to estimate p(Tj|Q). For each expanded query, the values of p(Tj|Q) for the 18 topics are sorted, and the topics corresponding to the largest values are chosen as the topics to which the query should be classified. The choice of the number of topics to which a query is classified is discussed in Section 6.1.3.

Chapter 6

Experiments

In this chapter, I study the performance of my proposed query classification method on the Knowledge Discovery and Data Mining (KDD) Cup 2005 data, using the two word similarity measures described in Chapter 4 and the three classification methods described in Chapter 5.

6.1 KDD Data

KDD-Cup 2005 competition data is used to test the performance of my query classification system. The test queries are taken from the KDD-Cup 2005 competition (http://www.sigkdd.org/kdd2005/kddcup.html).

In this competition, 37 teams participated, and they were required to classify 800,000 queries into 67 categories. From these queries, 800 were randomly chosen and labeled by hand; this labeling was assumed to be the exact classification of these queries. The performance of each team was measured by their ability to classify these 800 queries into their appropriate categories. Teams were expected to classify all 800,000 queries without knowing which were the 800 queries with “exact” labels.


6.1.1 Some problems in using KDD Cup 2005 queries

Before the query classification method can be applied to the KDD-Cup data, the dataset must be processed in a number of ways. In this section, I outline several problems with the raw data. First, 98 of the 800 labeled queries do not exist in the GenieKnows taxonomy (e.g., the query “1939”), so they are not considered in my experiment; this leaves 800 − 98 = 702 queries. Second, the 800 labeled queries are intended for general search engines, whereas the GenieKnows taxonomy is for a local search engine (local search was introduced in Section 2.3), so the comparison is somewhat unfair, and some queries are inappropriate for my query classification system. Finally, the categories of KDD-Cup 2005 are slightly different from the topics of the GenieKnows taxonomy, so I matched them manually (Table 6.1). Each of the 18 topics corresponds to at least one KDD category.

Table 6.1: KDD-Cup categories and GenieKnows topics

GenieKnows Topic | KDD-Cup Category
Computers and Electronics | Computers/Hardware
Computers and Electronics | Computers/Internet and Intranet
Computers and Electronics | Computers/Mobile Computing
Computers and Electronics | Computers/Networks and Telecommunication
Computers and Electronics | Computers/Security
Computers and Electronics | Computers/Software
Computers and Electronics | Computers/Other
Media and Communications | Computers/Multimedia
Arts and Entertainment | Entertainment/Celebrities
Arts and Entertainment | Entertainment/Games and Toys
Arts and Entertainment | Entertainment/Humor and Fun
Arts and Entertainment | Entertainment/Movies
Arts and Entertainment | Entertainment/Music
Arts and Entertainment | Entertainment/Pictures and Photos
Arts and Entertainment | Entertainment/Radio
Arts and Entertainment | Entertainment/TV
Arts and Entertainment | Entertainment/Other
Arts and Entertainment | Information/Arts and Humanities
Industry and Agriculture | Information/Companies and Industries
Education | Information/Science and Technology
Education | Information/Education
Legal and Financial Services | Information/Law and Politics
Community and Government | Information/Local and Regional
Community and Government | Information/References and Libraries
Shopping | Living/Book and Magazine
Automotive | Living/Car and Garage
Business and Professional Services | Living/Career and Jobs
Personal Care and Services | Living/Dating and Relationships
Community and Government | Living/Family and Kids
Shopping | Living/Fashion and Apparel
Legal and Financial Services | Living/Finance and Investment
Food and Dining | Living/Food and Cooking
Home and Garden | Living/Furnishing and Houseware
Shopping | Living/Gifts and Collectables
Health and Medicine | Living/Health and Fitness
Home and Garden | Living/Landscaping and Gardening
Community and Government | Living/Pets and Animals
Real Estate | Living/Real Estate
Community and Government | Living/Religion and Belief
Home and Garden | Living/Tools and Hardware
Travel and Transportation | Living/Travel and Vacation
Business and Professional Services | Online Community/Chat and Instant Messaging
Business and Professional Services | Online Community/Forums and Groups
Business and Professional Services | Online Community/Homepages
Business and Professional Services | Online Community/People Search
Business and Professional Services | Online Community/Personal Services
Shopping | Shopping/Auctions and Bids
Shopping | Shopping/Stores and Products
Shopping | Shopping/Buying Guides and Researching
Shopping | Shopping/Lease and Rent
Shopping | Shopping/Bargains and Discounts
Sports and Recreation | Sports/American Football
Sports and Recreation | Sports/Auto Racing
Sports and Recreation | Sports/Baseball
Sports and Recreation | Sports/Basketball
Sports and Recreation | Sports/Hockey
Media and Communications | Sports/News and Scores
Sports and Recreation | Sports/Schedules and Tickets
Sports and Recreation | Sports/Soccer
Sports and Recreation | Sports/Tennis
Sports and Recreation | Sports/Olympic Games
Sports and Recreation | Sports/Outdoor Recreations
Construction and Contractors | Living/Tools and Hardware

These 702 queries were classified by applying the approaches described in Chapters 3, 4, and 5. Before presenting the results, I review the concepts of precision, recall and F1 value, which are used to compare the performance of the two word similarity methods and the three classification methods.

6.1.2 Precision, recall and F1 value

In the information retrieval community, precision, recall and F1 value are the most common ways to evaluate the performance of a retrieval system. In the KDD-Cup 2005 competition, precision and recall are defined as

$$Precision = \frac{\sum_j \text{number of queries correctly labeled as } T_j}{\sum_j \text{number of queries labeled as } T_j}, \tag{6.1}$$

$$Recall = \frac{\sum_j \text{number of queries correctly labeled as } T_j}{\sum_j \text{number of queries whose topics are labeled by human experts as } T_j}. \tag{6.2}$$

In (6.1) and (6.2), the numerators are the same, but the denominators are different. Example 6.1 explains these two ideas more clearly.

Example 6.1. Consider two queries Q1 and Q2, and suppose there are six topics. Let Q1 belong to three topics: 2, 3, 6, and Q2 belong to four topics: 1, 2, 3, 4. Now suppose the query classification system classifies Q1 into two topics: 2, 4, and Q2 into three topics: 2, 3, 6.

Compared with the correct answer, we can see that for Q1 only one topic (2) is correctly identified, and for Q2 two topics (2 and 3) are correctly identified. So the numerators in (6.1) and (6.2) equal 1 + 2 = 3.

The denominator in (6.1) is 2 + 3 = 5, since the query classification system classifies Q1 into 2 topics (only 1 of which is correct) and Q2 into 3 topics (only 2 of which are correct). In (6.2), the denominator is 3 + 4 = 7, since the correct number of topics for Q1 is 3 and for Q2 is 4. So precision = 3/5 and recall = 3/7.

There is a tradeoff between precision and recall. If the number of topics returned by the query classification system increases, the recall in (6.2) tends to increase, since its denominator is fixed while its numerator tends to increase. However, the precision in (6.1) tends to decrease, since both its denominator and numerator increase, but the denominator typically increases more than the numerator.

In the information retrieval community, the F1 value is designed to provide the harmonic mean of precision and recall:

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}. \tag{6.3}$$
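The following Python sketch (mine, not from the thesis) implements (6.1), (6.2), and (6.3) and reproduces Example 6.1:

```python
def evaluate(predicted, truth):
    """Precision, recall, and F1 per (6.1)-(6.3); `predicted` and `truth`
    map each query to its set of topic labels."""
    correct = sum(len(predicted[q] & truth[q]) for q in truth)
    n_pred = sum(len(predicted[q]) for q in truth)
    n_true = sum(len(truth[q]) for q in truth)
    precision, recall = correct / n_pred, correct / n_true
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example 6.1: the true topics and the system's returned topics.
truth = {"Q1": {2, 3, 6}, "Q2": {1, 2, 3, 4}}
predicted = {"Q1": {2, 4}, "Q2": {2, 3, 6}}
print(evaluate(predicted, truth))  # (0.6, 0.4285..., 0.5), i.e. 3/5, 3/7, 1/2
```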

6.1.3 The number of returned topics

In the KDD-Cup 2005 data, after mapping the categories to the GenieKnows topics, I find that the correct number of topics per query ranges between one and six. As discussed in Section 6.1.2, how many topics to return for each query is an important choice in my query classification system. To compare my system with the KDD-Cup 2005 results, I make the number of returned topics equal to the correct number of topics labeled by KDD-Cup 2005, which means that in (6.1) and (6.2) the denominators are the same, so precision = recall = F1 value.

6.2 Experiment Results

6.2.1 Choosing α and k

Penalized feature selection was introduced in Section 3.1.1, where α is the tuning parameter that controls the amount of penalty; α = 0 means no penalty. Penalized feature selection with different values of α produces different feature words. I recorded how many feature words change, compared with the no-penalization case, for different α values. The numbers of changes are shown in Table 6.2 (note: the total number of feature words is 11 × 18 = 198) and are also depicted in Figure 6.1.

Table 6.2: Number of changes to the feature words set for various values of penalty parameter α. Up to 198 changes are possible.

α         Number of changes
0         0
0.000015  11
0.00002   13
0.00004   21
0.00006   27
0.00008   35
0.0001    39
0.0002    57
0.0006    93
0.001     98
0.002     109
0.008     118
0.01      123

In subsequent sections, the α values 0 (no penalty), 0.000015, 0.00004, 0.00006, 0.0001, 0.0002, 0.0006, and 0.002 will be used, because the numbers of changed feature words for these α values are well separated. For example, 0.0006 and 0.002 are chosen but 0.001 is not, since the numbers of changes corresponding to 0.0006 and 0.001 are 93 and 98, a very small difference, whereas the number of changes corresponding to 0.002 is 109, which differs more from 93.

In Chapter 4, the k most similar feature words are added to each query to form a small text. I consider values of k from the set {4, 7, 10, 13, 16, 19, 22, 25, 28, 30}. My experiment finds that, as k increases, the F1 values either (1) first increase and then decrease, or (2) strictly decrease. This grid of k values therefore enables a near-optimal k to be selected.

Figure 6.1: Number of changed feature words compared to the no-penalty case

6.2.2 Notation

In this section, I introduce some notation that will be used later in this chapter. There are two methods to measure word similarity (cosine similarity and smoothed KL divergence), denoted Cos and KL. There are three classification methods to classify queries (naive Bayes Bernoulli, naive Bayes multinomial, and Dirichlet/multinomial), denoted NBBER, NBMUL and DIRI. Table 6.3 lists the notation and the corresponding meanings.

Table 6.3: Notation

Notation   Meaning
Cos+NBMUL  cosine similarity + naive Bayes multinomial model
Cos+NBBER  cosine similarity + naive Bayes Bernoulli model
Cos+DIRI   cosine similarity + Dirichlet/multinomial model
KL+NBMUL   smoothed KL divergence + naive Bayes multinomial model
KL+NBBER   smoothed KL divergence + naive Bayes Bernoulli model
KL+DIRI    smoothed KL divergence + Dirichlet/multinomial model
α          penalty parameter
k          number of feature words to be added for each query word

6.2.3 F1 values for the KDD-Cup data

The F1 values for the classification of the 702 queries into their appropriate topics, using different combinations of α and k for each of the six methods, are shown in Tables 6.4 - 6.9. Higher F1 values correspond to better performance. Several observations comparing the methods are made in the following sections.

Table 6.4: F1 values: Cos+NBMUL

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
k=4   0.356  0.355     0.354    0.349    0.354   0.351   0.353   0.358
k=7   0.363  0.359     0.362    0.36     0.347   0.336   0.338   0.342
k=10  0.36   0.366     0.36     0.352    0.347   0.348   0.35    0.347
k=13  0.358  0.364     0.362    0.358    0.355   0.352   0.35    0.353
k=16  0.364  0.362     0.358    0.355    0.353   0.342   0.343   0.343
k=19  0.358  0.364     0.366    0.36     0.348   0.354   0.339   0.346
k=22  0.355  0.359     0.363    0.353    0.346   0.342   0.34    0.345
k=25  0.358  0.358     0.36     0.356    0.354   0.355   0.345   0.342
k=28  0.354  0.36      0.36     0.358    0.351   0.345   0.333   0.339
k=30  0.355  0.364     0.358    0.355    0.356   0.342   0.33    0.336

Table 6.5: F1 values: Cos+NBBER

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
k=4   0.39   0.389     0.392    0.379    0.381   0.373   0.378   0.388
k=7   0.4    0.4       0.398    0.395    0.388   0.382   0.392   0.394
k=10  0.4    0.407     0.399    0.393    0.39    0.389   0.396   0.402
k=13  0.402  0.4       0.396    0.387    0.388   0.379   0.39    0.399
k=16  0.399  0.396     0.388    0.384    0.383   0.38    0.39    0.392
k=19  0.397  0.4       0.395    0.385    0.389   0.38    0.384   0.395
k=22  0.393  0.396     0.392    0.386    0.38    0.377   0.38    0.387
k=25  0.39   0.391     0.389    0.379    0.378   0.373   0.38    0.388
k=28  0.38   0.39      0.382    0.37     0.368   0.369   0.379   0.385
k=30  0.38   0.386     0.383    0.369    0.372   0.366   0.376   0.383

Table 6.6: F1 values: Cos+DIRI

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
4     0.347  0.35      0.344    0.339    0.347   0.34    0.331   0.32
7     0.359  0.36      0.362    0.361    0.355   0.354   0.347   0.339
10    0.365  0.37      0.357    0.364    0.357   0.358   0.352   0.345
13    0.379  0.382     0.376    0.37     0.364   0.368   0.355   0.34
16    0.386  0.384     0.375    0.372    0.351   0.357   0.343   0.347
19    0.393  0.393     0.376    0.377    0.364   0.362   0.345   0.353
22    0.392  0.386     0.374    0.368    0.364   0.362   0.342   0.357
25    0.394  0.388     0.373    0.37     0.363   0.358   0.344   0.356
28    0.382  0.38      0.363    0.36     0.358   0.358   0.345   0.356
30    0.375  0.376     0.364    0.361    0.353   0.354   0.338   0.35

Table 6.7: F1 values: KL+NBMUL

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
4     0.375  0.368     0.376    0.364    0.378   0.36    0.365   0.368
7     0.38   0.381     0.38     0.374    0.377   0.375   0.368   0.373
10    0.372  0.378     0.375    0.37     0.378   0.37    0.362   0.377
13    0.37   0.369     0.372    0.371    0.375   0.364   0.365   0.378
16    0.362  0.368     0.378    0.365    0.373   0.368   0.359   0.373
19    0.356  0.361     0.362    0.369    0.368   0.369   0.363   0.372
22    0.356  0.36      0.361    0.372    0.378   0.368   0.362   0.386
25    0.362  0.37      0.37     0.372    0.38    0.375   0.36    0.379
28    0.364  0.361     0.372    0.373    0.376   0.37    0.358   0.376
30    0.357  0.357     0.366    0.364    0.37    0.371   0.366   0.38

Table 6.8: F1 values: KL+NBBER

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
4     0.394  0.388     0.388    0.385    0.394   0.375   0.369   0.372
7     0.4    0.4       0.399    0.388    0.403   0.389   0.368   0.377
10    0.399  0.394     0.397    0.388    0.395   0.38    0.374   0.39
13    0.399  0.402     0.4      0.389    0.398   0.376   0.373   0.389
16    0.4    0.405     0.4      0.396    0.396   0.374   0.377   0.396
19    0.403  0.399     0.408    0.393    0.398   0.386   0.383   0.401
22    0.404  0.402     0.398    0.395    0.404   0.388   0.381   0.395
25    0.397  0.392     0.394    0.39     0.402   0.39    0.374   0.39
28    0.396  0.392     0.396    0.392    0.398   0.39    0.379   0.396
30    0.398  0.394     0.4      0.394    0.4     0.393   0.385   0.403

Table 6.9: F1 values: KL+DIRI

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
4     0.37   0.372     0.368    0.35     0.366   0.353   0.329   0.32
7     0.389  0.392     0.38     0.372    0.383   0.359   0.337   0.331
10    0.399  0.391     0.393    0.38     0.39    0.373   0.337   0.332
13    0.403  0.407     0.41     0.396    0.402   0.38    0.342   0.338
16    0.399  0.413     0.41     0.403    0.399   0.372   0.34    0.332
19    0.394  0.403     0.406    0.394    0.4     0.385   0.339   0.342
22    0.398  0.4       0.405    0.403    0.399   0.39    0.336   0.347
25    0.396  0.398     0.405    0.404    0.39    0.387   0.338   0.349
28    0.399  0.4       0.405    0.409    0.392   0.394   0.336   0.348
30    0.396  0.403     0.402    0.398    0.399   0.386   0.336   0.344

6.2.4 Comparison of word similarity measures

In Chapter 4, two methods for measuring word similarity (cosine similarity and smoothed KL divergence) were introduced. Here I compare these two methods. Figures 6.2 to 6.4 display the F1 values from Tables 6.4 - 6.9 for the three classification methods, using both cosine similarity and smoothed KL divergence. Various values of (k, α) are represented in the plots. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity.

Figure 6.2: F1 values for the naive Bayes multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity

Figure 6.3: F1 values for the naive Bayes Bernoulli model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity

Figure 6.4: F1 values for the Dirichlet/multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity

For the naive Bayes multinomial model in Figure 6.2, we can see that smoothed KL divergence performs better than cosine similarity, since all the red lines lie above the green lines.

For the naive Bayes Bernoulli model in Figure 6.3, smoothed KL divergence still performs better than cosine similarity. However, as the penalty parameter increases, the difference between smoothed KL divergence and cosine similarity decreases; in some cases, such as α = 0.0006, cosine similarity is slightly better than smoothed KL divergence.

For the Dirichlet/multinomial model in Figure 6.4, smoothed KL divergence performs better than cosine similarity when 0 ≤ α ≤ 0.0002, while cosine similarity performs better when α = 0.0006 and 0.002.

From these figures, we can conclude that smoothed KL divergence seems to be a better choice than cosine similarity.
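For reference, the sketch below shows the two measures under comparison, applied to per-topic count vectors for a pair of words. The add-one smoothing is an assumed stand-in for the smoothing scheme of Section 4.1.2, and the count vectors are made up.

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    def smoothed_kl(u, v):
        # Add-one smooth, normalize to distributions, then compute KL(p || q);
        # a smaller divergence means the two words are more similar.
        p = [a + 1 for a in u]
        q = [b + 1 for b in v]
        sp, sq = sum(p), sum(q)
        return sum((a / sp) * math.log((a / sp) / (b / sq))
                   for a, b in zip(p, q))

    u, v = [3, 0, 5], [2, 1, 4]  # made-up per-topic counts for two words
    print(cosine(u, v), smoothed_kl(u, v))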

6.2.5 Comparison of three classification methods

In Figures 6.5 and 6.6, the F1 values for the three classification methods are shown, one figure for each similarity measure. Black lines represent the naive Bayes multinomial model, green lines the naive Bayes Bernoulli model, and red lines the Dirichlet/multinomial model.

Figure 6.5: F1 values for the models using smoothed KL divergence. The black lines indicate the naive Bayes multinomial model, the green lines the naive Bayes Bernoulli model, and the red lines the Dirichlet/multinomial model

Figure 6.6: F1 values for the models using cosine similarity. The black lines indicate the naive Bayes multinomial model, the green lines the naive Bayes Bernoulli model, and the red lines the Dirichlet/multinomial model

For smoothed KL divergence, from Figure 6.5 we can see that when 0 ≤ α ≤ 0.0002, naive Bayes multinomial performs the worst, while naive Bayes Bernoulli and Dirichlet/multinomial perform similarly. When α = 0.0006 and 0.002, naive Bayes Bernoulli performs the best, Dirichlet/multinomial performs the worst, and naive Bayes multinomial lies between them.

For cosine similarity, from Figure 6.6 we can see that naive Bayes Bernoulli performs the best, naive Bayes multinomial performs the worst, and Dirichlet/multinomial lies between them.

In summary, naive Bayes Bernoulli is the strongest performer.

6.2.6 Comparison of feature word penalty parameters

In this section, I discuss whether the penalized feature selection introduced in Section 3.1.1 is helpful at all. Figures 6.7 and 6.8 show the F1 values for different values of the penalty parameter α; in these figures, different colors represent different penalty parameter values.

Figure 6.7: F1 values for methods using smoothed KL divergence. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial model. The different colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)

Figure 6.8: F1 values for methods using cosine similarity. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial model. The different colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)

In Figures 6.7 and 6.8, the black line (corresponding to no penalty) is not always the top line, which indicates that penalized feature selection can be helpful. In particular, α = 0.000015 (green), 0.00004 (red), and 0.00006 (yellow) perform better than α = 0 (no penalty) in many cases.

6.2.7 Comparison with the KDD-Cup 2005 competitors

In this section, I compare my outcomes with the results from the KDD-Cup 2005 competition, taken from the website http://www.sigkdd.org/kdd2005/kddcup.html. The F1 values for the best submissions are given in Table 6.10. In those results, precision and recall are not equal to the F1 values, but in my experiment precision = recall = F1. Since the F1 value is the harmonic mean of precision and recall, I use it as the measure for comparison. From Table 6.10, three teams achieved an F1 value above 0.4: ID 22 (F1 = 0.444), ID 37 (F1 = 0.426), and ID 8 (F1 = 0.405).
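Since Table 6.10 lists precision and F1 but not recall, recall can be recovered by inverting the harmonic mean. For example, for submission ID 22,

\[
F_1 = \frac{2PR}{P+R}
\quad\Rightarrow\quad
R = \frac{P\,F_1}{2P - F_1},
\qquad
R_{22} = \frac{0.4141 \times 0.4444}{2(0.4141) - 0.4444} \approx 0.480,
\]

so that submission's recall was noticeably higher than its precision.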

Some of these queries are not well suited to the GenieKnows taxonomy, since the GenieKnows taxonomy serves a local search engine, whereas the KDD-Cup queries are typically typed into a general search engine (e.g., Google). Even in this unfair circumstance, my query classification system can still compete with the top three results in the KDD-Cup 2005. The largest F1 value in my research is 0.413 (corresponding to KL+DIRI with k = 16 and α = 0.000015), which would rank among the top three in the KDD-Cup 2005 competition. In addition, most of the F1 values from my query classification system are above 0.36, while only four teams in the KDD-Cup 2005 competition achieved F1 values above 0.36. So my query classification system performs well, even though it uses a taxonomy designed for a local search engine to handle queries intended for a general search engine.

Table 6.10: KDD-Cup 2005 results

Submission ID  Precision  F1
1              0.145099   0.146839
2              0.116583   0.139732
3              0.339435   0.309754
4              0.110885   0.124228
5              0.31068    0.085639
6              0.254815   0.246264
7              0.263953   0.306359
8              0.454068   0.405453
9              0.264312   0.306612
10             0.334048   0.342248
11             0.107045   0.116521
12             0.196117   0.207787
13             0.326408   0.357127
14             0.317308   0.312812
15             0.271791   0.26545
16             0.050918   0.060285
17             0.264009   0.218436
18             0.206167   0.247854
19             0.136541   0.127008
20             0.127784   0.126848
21             0.340883   0.34009
22             0.414067   0.444395
23             0.237661   0.250293
24             0.244565   0.258035
25             0.753659   0.205391
26             0.255726   0.274579
27             0.206919   0.205302
28             0.148503   0.17614
29             0.171081   0.1985
30             0.145467   0.173173
31             0.108305   0.108174
32             0.16962    0.232654
33             0.469353   0.255096
34             0.198284   0.191618
35             0.32075    0.384136
36             0.211284   0.129937
37             0.423741   0.426123

Chapter 7

Conclusion and Future Work

In this thesis, I developed a new query classification system using the GenieKnows taxonomy. The system consists of three parts: in the first part, I selected a subset of feature words for each of the 18 topics; in the second part, I expanded each query by adding its most similar feature words; and in the third part, the expanded query was classified. Although the GenieKnows taxonomy is designed for local search, the experiments show that the proposed query classification system performs very well, even when dealing with queries intended for a general search engine.
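As a summary of how the three parts fit together, here is a minimal sketch of the pipeline; it is an illustration rather than the thesis implementation. The arguments similarity and score_topic are hypothetical stand-ins for the word-similarity measure of Chapter 4 and the per-topic scoring of one of the Chapter 5 classifiers, with k and n (the number of returned topics) as tuning parameters.

    def classify_query(query, feature_words, similarity, score_topic,
                       topics, k=16, n=3):
        # Part 2: expand each query word with its k most similar feature
        # words, forming a small text (Part 1, feature selection, is assumed
        # to have produced feature_words already).
        expanded = query.lower().split()
        for word in list(expanded):
            ranked = sorted(feature_words, key=lambda f: similarity(word, f),
                            reverse=True)
            expanded += ranked[:k]
        # Part 3: score every topic against the expanded text and return the
        # n highest-scoring topics.
        return sorted(topics, key=lambda t: score_topic(t, expanded),
                      reverse=True)[:n]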

However, there are some issues that need further attention.

1. In this thesis, I used KDD-Cup 2005 queries to test the performance of my query classification system. In the future, we could find a set of local queries to test its performance.

2. We could try other kinds of taxonomy, for example the GenieKnows game taxonomy, the Wikipedia taxonomy, or the Google taxonomy, to test its performance.

3. I intend to compare the performance of my query classification system with the existing methodologies introduced in Section 1.2.

4. My query classification system did not consider "personalization". A query may have different meanings: for the query "apple", one user may want information about the fruit "apple", while another may want information about the "apple" company. In the future, we could design a query classification system that considers the different information needs of different users and provides "personalized" classification results; a minimal sketch follows this list. A possible approach is to change the prior probability in (5.10) and (5.15) in a "personalized" way: instead of using Nj/N as the prior probability, each user could have his or her own "personalized" prior probability based on his or her search history.
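The following sketch illustrates that idea, assuming a user's history is simply the list of topics of his or her past queries. The backoff weight that blends the personal and global priors, and the topic names and counts, are my own illustrative assumptions.

    from collections import Counter

    def personalized_prior(user_history, topic_counts, weight=0.8):
        # topic_counts[j] plays the role of Nj, so nj / total is the global
        # prior Nj/N; the personal prior is the share of the user's past
        # queries in topic j, blended with the global prior.
        total = sum(topic_counts.values())
        hist = Counter(user_history)
        n_user = sum(hist.values())
        prior = {}
        for topic, nj in topic_counts.items():
            global_p = nj / total
            user_p = hist[topic] / n_user if n_user else global_p
            prior[topic] = weight * user_p + (1 - weight) * global_p
        return prior

    history = ["Computers/Internet", "Computers/Internet", "Food/Dining"]
    counts = {"Computers/Internet": 40, "Food/Dining": 60}
    print(personalized_prior(history, counts))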

Appendix A

Appendix (Feature Words)

Table A.1: Feature Words for Topic 1

0              0.000015       0.00004        0.00006
photography    photography    photography    photography
antique        antique        antique        antique
music          music          music          music
studio         studio         studio         studio
art            art            art            art
museum         museum         gallery        gallery
lounge         gallery        lounge         lounge
gallery        lounge         museum         museum
tavern         tavern         tavern         tavern
piano          piano          piano          piano
photo          photo          photo          photo

0.0001         0.0002         0.0006         0.002
photography    photography    photography    photography
antique        antique        antique        antique
music          music          music          music
studio         studio         art            lounge
art            art            gallery        art
gallery        gallery        portrait       orchestra
lounge         lounge         orchestra      tuning
museum         portrait       lounge         passport
portrait       entertainment  tuning         clown
bar            bar            passport       highschool
piano          piano          clown          jockey

Table A.2: Feature Words for Topic 2

0           0.000015    0.00004     0.00006
auto        auto        auto        auto
tire        tire        tire        tire
car         car         car         car
automotive  automotive  automotive  truck
truck       truck       truck       automotive
body        body        body        body
part        part        part        part
towing      repair      repair      repair
repair      towing      towing      towing
brake       motor       motor       motor
motor       brake       brake       sale

0.0001      0.0002      0.0006      0.002
auto        auto        auto        auto
tire        tire        tire        part
car         car         part        repair
truck       truck       repair      tuneup
automotive  automotive  car         goodwrench
body        part        tuneup      automobile
part        body        goodwrench  car
repair      repair      automobile  tire
towing      tuneup      automotive  automotive
tuneup      goodwrench  truck       truck
automobile  automobile  body        body

Table A.3: Feature Words for Topic 3

0           0.000015    0.00004     0.00006
realty      realty      realty      realty
marketing   marketing   marketing   marketing
storage     storage     storage     plumbing
plumbing    plumbing    plumbing    storage
funeral     funeral     funeral     funeral
sign        sign        real        real
real        real        pest        pest
school      pest        sign        printing
pest        printing    printing    laundry
embroidery  estate      estate      estate
printing    embroidery  laundry     staffing

0.0001      0.0002      0.0006      0.002
realty      realty      realty      realty
marketing   marketing   marketing   plumbing
plumbing    plumbing    plumbing    funeral
storage     funeral     funeral     pest
funeral     storage     storage     embroidery
real        pest        pest        laundromat
pest        real        laundry     catering
laundry     laundry     catering    auto
printing    printing    taxi        tailor
estate      staffing    upholstery  publishing
staffing    auto        taxidermy   taxi

Table A.4: Feature Words for Topic 4

0          0.000015   0.00004        0.00006
church     church     church         church
baptist    baptist    baptist        baptist
tax        tax        tax            tax
ministry   ministry   methodist      methodist
christ     christ     ministry       ministry
god        methodist  christ         filing
methodist  filing     filing         christ
filing     united     efile          efile
united     efile      united         united
efile      god        christian      christian
christian  christian  center         center

0.0001     0.0002     0.0006         0.002
church     church     church         church
baptist    baptist    baptist        baptist
tax        tax        tax            tax
methodist  methodist  efile          efile
filing     efile      methodist      lcsw
efile      filing     filing         filing
united     united     lcsw           cremation
christ     lcsw       cremation      investigation
ministry   christian  investigation  methodist
christian  center     electronic     electronic
lcsw       electronic center         club

Table A.5: Feature Words for Topic 5

0           0.000015    0.00004     0.00006
computer    computer    computer    computer
printing    printing    printing    printing
software    software    software    software
technology  technology  technology  technology
data        data        system      system
system      system      data        printer
solution    solution    printer     viru
network     network     viru        data
wireless    printer     network     network
printer     wireless    wireless    spyware
web         viru        solution    wireless

0.0001      0.0002      0.0006      0.002
computer    computer    computer    computer
printing    printing    printing    viru
software    software    software    copy
system      system      viru        letterhead
technology  viru        printer     envelope
viru        printer     letterhead  hosting
printer     spyware     hosting     press
spyware     letterhead  spyware     computing
letterhead  hosting     computing   newsletter
hosting     newsletter  newsletter  brochure
newsletter  brochure    brochure    microsoft

Table A.6: Feature Words for Topic 6

0             0.000015      0.00004       0.00006
construction  construction  construction  construction
plumbing      plumbing      plumbing      plumbing
heating       heating       heating       heating
residential   residential   residential   roofing
roofing       roofing       roofing       residential
commercial    electric      electric      electric
architect     commercial    contractor    contractor
electric      architect     commercial    builder
contractor    contractor    builder       estimate
insured       builder       concrete      concrete
concrete      concrete      estimate      air

0.0001        0.0002        0.0006        0.002
construction  construction  construction  construction
plumbing      plumbing      heating       heating
heating       heating       plumbing      contractor
roofing       contractor    contractor    locksmith
residential   electric      locksmith     building
contractor    roofing       building      plumbing
electric      estimate      estimate      estimate
builder       locksmith     electric      electrical
estimate      building      electrical    kitchen
building      builder       builder       electric
locksmith     electrical    roof          roof

Table A.7: Feature Words for Topic 7

0           0.000015    0.00004     0.00006
school      school      school      school
elementary  elementary  elementary  elementary
district    district    district    district
academy     academy     academy     academy
preschool   preschool   preschool   preschool
college     college     college     college
middle      middle      education   education
learning    learning    middle      learning
education   education   learning    middle
montessori  gallery     gallery     gallery
gallery     high        art         art

0.0001      0.0002      0.0006      0.002
school      school      school      school
elementary  elementary  elementary  elementary
district    district    district    district
academy     college     kumon       kumon
preschool   academy     perk        perk
college     education   education   training
education   preschool   training    education
learning    gallery     gallery     gallery
gallery     kumon       college     art
middle      perk        art         high
art         art         high        center

Table A.8: Feature Words for Topic 8

0           0.000015     0.00004      0.00006
restaurant  restaurant   restaurant   restaurant
pizza       pizza        pizza        pizza
cafe        cafe         cafe         cafe
food        food         food         food
liquor      liquor       liquor       liquor
deli        deli         grocery      grocery
market      grocery      market       market
grocery     market       deli         deli
bakery      bakery       coffee       coffee
coffee      coffee       grill        grill
grill       grill        mexican      store

0.0001      0.0002       0.0006       0.002
restaurant  restaurant   restaurant   restaurant
pizza       pizza        pizza        pizza
cafe        cafe         food         food
food        food         grocery      grocery
liquor      liquor       liquor       mastercard
grocery     grocery      mastercard   store
market      market       store        convenience
deli        mastercard   market       liquor
coffee      store        cafe         sandwiche
store       coffee       convenience  market
mastercard  convenience  coffee       coffee

Table A.9: Feature Words for Topic 9

0             0.000015      0.00004       0.00006
md            md            md            md
dd            dd            dd            dd
dr            dr            dr            dr
dentistry     dentistry     dentistry     dentistry
surgery       surgery       surgery       surgery
dental        dental        dental        dental
chiropractic  chiropractic  chiropractic  chiropractic
dmd           dmd           dmd           dmd
medicine      medicine      medicine      medicine
patient       patient       patient       patient
dc            dc            health        health

0.0001        0.0002        0.0006        0.002
md            md            md            md
dd            dd            dd            dd
dr            dr            dr            dentistry
dentistry     dentistry     dentistry     chiropractic
surgery       surgery       chiropractic  dental
dental        chiropractic  dental        disease
chiropractic  dental        surgery       medical
medicine      patient       medical       whitening
patient       health        health        canal
dmd           medical       care          lense
health        care          whitening     health

Table A.10: Feature Words for Topic 10

0            0.000015     0.00004      0.00006
furniture    furniture    furniture    furniture
residential  residential  residential  residential
door         door         door         door
interior     interior     interior     interior
carpet       carpet       pest         pest
pest         pest         carpet       carpet
flower       window       window       window
window       flower       flower       flower
commercial   estimate     estimate     estimate
estimate     florist      florist      florist
florist      commercial   cleaning     cleaning

0.0001       0.0002       0.0006       0.002
furniture    furniture    furniture    furniture
residential  pest         pest         pest
interior     door         florist      florist
pest         interior     interior     tile
door         carpet       siding       siding
carpet       window       insured      insured
window       estimate     control      control
estimate     appliance    appliance    floral
florist      florist      laminate     feed
flower       control      feed         appraisal
appliance    insured      appraisal    welding

Table A.11: Feature Words for Topic 11

0              0.000015       0.00004        0.00006
feed           feed           feed           feed
metal          metal          metal          metal
supply         supply         supply         supply
hardware       hardware       hardware       engineer
environmental  environmental  engineer       hardware
engineer       engineer       environmental  environmental
consulting     consulting     consulting     consulting
meat           meat           meat           teamster
steel          local          teamster       beauty
engineering    steel          beauty         meat
local          beauty         local          industrial

0.0001         0.0002         0.0006         0.002
feed           feed           feed           feed
metal          metal          metal          metal
supply         supply         supply         environmental
engineer       engineer       engineer       meat
hardware       consulting     crane          beauty
environmental  hardware       teamster       propane
consulting     teamster       syndicat       teamster
teamster       syndicat       exploration    aircraft
syndicat       environmental  afl            syndicat
exploration    exploration    aflcio         exploration
afl            afl            local          afl

Table A.12: Feature Words for Topic 12

0          0.000015   0.00004     0.00006
law        law        law         law
atty       atty       atty        atty
attorney   attorney   attorney    attorney
insurance  insurance  insurance   insurance
financial  financial  financial   financial
cpa        cpa        agency      agency
agency     agency     criminal    criminal
criminal   criminal   cpa         injury
divorce    injury     injury      efile
injury     filing     efile       filing
filing     efile      filing      cpa

0.0001     0.0002     0.0006      0.002
law        law        law         law
atty       atty       atty        atty
attorney   attorney   insurance   insurance
insurance  insurance  attorney    financial
financial  financial  financial   efile
agency     efile      efile       death
criminal   filing     death       electronic
efile      injury     wrongful    fdic
injury     agency     filing      felony
filing     death      electronic  misdemeanor
death      criminal   fdic        wrongful

Table A.13: Feature Words for Topic 13

0              0.000015       0.00004        0.00006
advertising    advertising    advertising    advertising
communication  communication  communication  communication
fm             fm             fm             fm
tv             tv             tv             radio
radio          radio          radio          tv
wireless       wireless       wireless       wireless
cellular       cellular       cellular       cellular
production     production     magazine       magazine
magazine       magazine       production     production
video          video          video          video
publication    publication    publication    publication

0.0001         0.0002         0.0006         0.002
advertising    advertising    advertising    advertising
communication  communication  communication  fm
fm             radio          wireless       communication
radio          wireless       radio          wireless
tv             tv             publisher      broadcasting
wireless       cellular       firm           publisher
cellular       magazine       vcr            film
magazine       video          duplication    vcr
video          publisher      lcd            duplication
production     publication    tv             lcd
publication    answering      video          radio

Table A.14: Feature Words for Topic 14

0               0.000015        0.00004         0.00006
hair            hair            hair            hair
salon           salon           salon           salon
nail            nail            nail            nail
beauty          beauty          beauty          beauty
barber          barber          barber          barber
therapy         therapy         therapy         therapy
massage         care            care            care
care            massage         massage         massage
tanning         tanning         tanning         physical
styling         physical        physical        tanning
coiffure        rehabilitation  rehabilitation  rehabilitation

0.0001          0.0002          0.0006          0.002
hair            hair            hair            hair
salon           salon           salon           salon
nail            nail            nail            beauty
beauty          beauty          beauty          nail
barber          barber          barber          rehabilitation
therapy         care            rehabilitation  physical
care            therapy         care            assisted
massage         physical        physical        child
physical        assisted        assisted        men
rehabilitation  rehabilitation  men             stable
tanning         men             stable          counseling

Table A.15: Feature Words for Topic 15

0            0.000015     0.00004      0.00006
realty       realty       realty       realty
apartment    apartment    apartment    apartment
real         real         real         real
estate       estate       estate       estate
property     property     property     property
apt          apt          apt          apt
realtor      title        title        title
title        assisted     assisted     assisted
development  realtor      home         activity
assisted     management   activity     home
management   development  management   multiple

0.0001       0.0002       0.0006       0.002
realty       realty       realty       realty
apartment    apartment    apartment    apartment
real         real         real         real
estate       estate       estate       property
property     property     property     estate
title        assisted     activity     activity
assisted     activity     acreage      acreage
apt          acreage      multiple     multiple
activity     multiple     assisted     assisted
acreage      title        home         home
multiple     home         appraiser    appraiser

Table A.16: Feature Words for Topic 16

0         0.000015  0.00004    0.00006
gift      gift      gift       gift
jewelry   jewelry   jewelry    jewelry
jeweler   jeweler   jeweler    jeweler
flower    flower    flower     flower
florist   florist   florist    florist
boutique  boutique  shoe       shoe
book      shoe      book       book
shoe      book      fashion    fashion
ftd       pawn      tobacco    tobacco
pawn      tobacco   pawn       pawn
tobacco   fashion   store      store

0.0001    0.0002    0.0006     0.002
gift      gift      gift       gift
jewelry   jewelry   jewelry    ftd
jeweler   flower    ftd        carpet
flower    jeweler   carpet     heating
florist   florist   fashion    seiko
shoe      carpet    heating    comic
book      shoe      seiko      laminate
carpet    fashion   comic      jewelry
fashion   book      store      store
store     store     laminate   teleflora
heating   heating   teleflora  fashion

Table A.17: Feature Words for Topic 17

0           0.000015    0.00004     0.00006
golf        golf        golf        golf
locksmith   locksmith   locksmith   locksmith
dance       dance       dance       dance
martial     martial     martial     martial
club        club        club        club
fitness     fitness     marine      marine
marine      marine      fitness     fitness
karate      campground  campground  campground
campground  yoga        bait        bait
marina      bait        yoga        yoga
yoga        course      course      course

0.0001      0.0002      0.0006      0.002
golf        golf        golf        golf
locksmith   locksmith   locksmith   locksmith
dance       dance       dance       dance
martial     martial     martial     martial
club        club        club        marine
marine      marine      marine      club
campground  campground  campground  campground
fitness     bait        tae         tae
bait        yoga        yoga        hobby
tae         fitness     surveying   surveying
yoga        tae         firearm     firearm

Table A.18: Feature Words for Topic 18

0          0.000015   0.00004    0.00006
travel     travel     travel     travel
inn        inn        inn        inn
motel      motel      motel      motel
limousine  limousine  limousine  limousine
hotel      hotel      hotel      hotel
resort     resort     resort     resort
tour       tour       tour       tour
trucking   trucking   trucking   trucking
cruise     cruise     moving     moving
moving     moving     cruise     cruise
breakfast  breakfast  breakfast  breakfast

0.0001     0.0002     0.0006     0.002
travel     travel     travel     travel
inn        inn        inn        inn
motel      motel      motel      limousine
limousine  limousine  limousine  motel
hotel      hotel      trucking   trucking
tour       tour       tour       moving
resort     resort     hotel      hbo
trucking   trucking   moving     ambulance
moving     moving     resort     transit
cruise     breakfast  hbo        forwarding
breakfast  cruise     ambulance  inroom

Bibliography

Beitzel, S. M., Jensen, E. C., Lewis, D. D., Chowdhury, A. (2005a), Automatic Web Query Classification Using Labeled and Unlabeled Training Data, in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 581-582.

Beitzel, S. M., Jensen, E. C., Lewis, D. D., Chowdhury, A., Kolcz, A. and Frieder, O. (2005b), Improving Automatic Query Classification via Semi-supervised Learning, in IEEE ICDM 2005, page 42-49.

Beitzel, S. M., Jensen, E. C., Chowdhury, A., Frieder, O. (2007), Varying Approaches to Topical Web Query Classification, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 783-784.

Berry, M. W. and Browne, M. (2005), Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.

Bhandari, S. K. and Davison, B. D. (2007), Leveraging Search Engine Results for Query Classification. Technical Report LU-CSE-07-013, Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA, 18015.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer Science+Business Media, LLC.

Brin, S. and Page, L. (1998), The Anatomy of a Large-scale Hypertextual Web Search Engine, in Proceedings of the 7th International World Wide Web Conference, page 108-117.

Broder, A., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., Zhang, T. (2007), Robust Classification of Rare Queries Using Web Knowledge, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Chakrabarti, S. (2003), Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers.

Chuang, S. L. and Chien, L. F. (2003), Automatic Query Taxonomy Generation for Information Retrieval Applications, Online Information Review, Volume 27, Issue 4, page 243-255.

Haveliwala, T. H. (2002), Topic-Sensitive PageRank, in Proceedings of the 11th International World Wide Web Conference, page 517-526.

Kang, I. and Kim, G. (2003), Query Type Classification for Web Document Retrieval, in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 64-71.

Liu, B. (2007), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer, Berlin, New York.

Manning, C. D., Raghavan, P. and Schütze, H. (2008), Introduction to Information Retrieval, Cambridge University Press.

McCallum, A. and Nigam, K. (1998), A Comparison of Event Models for Naive Bayes Text Classification, in AAAI-98 Workshop on Learning for Text Categorization.

Qiu, F. and Cho, J. (2006), Automatic Identification of User Interest For Personalized Search, in Proceedings of the 15th International World Wide Web Conference, page 727-736.

Robert, C. P. (2007), The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, second edition, Springer.

Rose, D. E. and Levinson, D. (2004), Understanding User Goals in Web Search, in Proceedings of the 13th International World Wide Web Conference, page 13-19.

Shannon, C. (1948), A Mathematical Theory of Communication, Bell System Technical Journal.

Shen, D., Pan, R., Sun, J. T., Pan, J. J., Wu, K., Yin, J. and Yang, Q. (2005), Q2C@UST: Our Winning Solution to Query Classification in KDDCUP 2005. SIGKDD Explorations, Volume 7, Issue 2.

Shen, D., Sun, J. T., Yang, Q., and Chen, Z. (2006), Building Bridges for Web Query Classification, in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 131-138.

Yang, Y. and Pedersen, J. O. (1997), A Comparative Study on Feature Selection in Text Categorization, in Proceedings of the 14th International Conference on Machine Learning, page 412-420.
