QUERY CLASSIFICATION BASED ON A NEW APPROACH

by Li Shujie

Thesis submitted in partial fulfillment of the requirements for the Degree of Master of Science (Statistics)

Acadia University Fall Convocation 2009

© by Li Shujie, 2009

This thesis by Li Shujie was defended successfully in an oral examination on August 21, 2009.

The examining committee for the thesis was:

Dr. Anthony Tong, Chair

Dr. Crystal Linkletter, External Reader

Dr. Wilson Lu, Internal Reader

Dr. Hugh Chipman and Dr. Pritam Ranjan, Supervisors

Dr. Jeff Hooper, Department Head

This thesis is accepted in its present form by the Division of Research and Graduate Studies as satisfying the thesis requirements for the degree Master of Science (Statistics).


I, Li Shujie, grant permission to the University Librarian at Acadia University to reproduce, loan or distribute copies of my thesis in microform, paper or electronic formats on a non-profit basis. I, however, retain the copyright in my thesis.

Author

Supervisor

Date

Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Query Classification
    1.1.1 What is query classification?
    1.1.2 Why is query classification useful?
  1.2 Relevant Work
  1.3 My Approach

2 Data and Terminology
  2.1 Information Retrieval
    2.1.1 Bag of words assumption
    2.1.2 Document frequency and term frequency
  2.2 Information Theory
  2.3 GenieKnows Taxonomy
    2.3.1 Topics
    2.3.2 Original data
    2.3.3 Multi-word taxonomy

3 Feature Selection
  3.1 Feature Selection Using Chi-Square Statistic
    3.1.1 Penalized feature selection

4 Query Expansion
  4.1 Word Similarity
    4.1.1 Cosine similarity
    4.1.2 Smoothed KL divergence
  4.2 The Advantage of Using Feature Words

5 Query Classification
  5.1 Naive Bayes Classification Method
    5.1.1 Naive Bayes Bernoulli model
    5.1.2 Naive Bayes multinomial model
  5.2 Dirichlet/Multinomial Model

6 Experiments
  6.1 KDD Data
    6.1.1 Some problems in using KDD Cup 2005 queries
    6.1.2 Precision, recall and F1 value
    6.1.3 The number of returned topics
  6.2 Experiment Results
    6.2.1 Choosing α and k
    6.2.2 Notation
    6.2.3 F1 values for the KDD-Cup data
    6.2.4 Comparison of word similarity measures
    6.2.5 Comparison of three classification methods
    6.2.6 Comparison of feature word penalty parameters
    6.2.7 Comparison with the KDD-Cup 2005 competitors

7 Conclusion and Future Work

A Appendix (Feature Words)

List of Tables

2.1 Top-level topics in the GenieKnows taxonomy
2.2 Taxonomy extract for topic Arts/Entertainment
2.3 Multi-word taxonomy extract for topic Arts/Entertainment

6.1 KDD-Cup categories and GenieKnows topics
6.2 Number of changes to the feature words set for various values of penalty parameter α
6.3 Notation
6.4 F1 values: Cos+NBMUL
6.5 F1 values: Cos+NBBER
6.6 F1 values: Cos+DIRI
6.7 F1 values: KL+NBMUL
6.8 F1 values: KL+NBBER
6.9 F1 values: KL+DIRI
6.10 KDD-Cup 2005 results

A.1 Feature Words for Topic 1
A.2 Feature Words for Topic 2
A.3 Feature Words for Topic 3
A.4 Feature Words for Topic 4
A.5 Feature Words for Topic 5
A.6 Feature Words for Topic 6
A.7 Feature Words for Topic 7
A.8 Feature Words for Topic 8
A.9 Feature Words for Topic 9
A.10 Feature Words for Topic 10
A.11 Feature Words for Topic 11
A.12 Feature Words for Topic 12
A.13 Feature Words for Topic 13
A.14 Feature Words for Topic 14
A.15 Feature Words for Topic 15
A.16 Feature Words for Topic 16
A.17 Feature Words for Topic 17
A.18 Feature Words for Topic 18

List of Figures

1.1 Query classification system

6.1 Number of changed feature words compared to the no-penalty case
6.2 F1 values for the naive Bayes multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.3 F1 values for the naive Bayes Bernoulli model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.4 F1 values for the Dirichlet/multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.5 F1 values for the models using smoothed KL divergence. The black lines indicate the naive Bayes multinomial model, the green lines the naive Bayes Bernoulli model, and the red lines the Dirichlet/multinomial model
6.6 F1 values for the models using cosine similarity. The black lines indicate the naive Bayes multinomial model, the green lines the naive Bayes Bernoulli model, and the red lines the Dirichlet/multinomial model
6.7 F1 values for methods using smoothed KL divergence. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial. The colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)
6.8 F1 values for methods using cosine similarity. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial. The colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)

Abstract

Query classification is an important yet challenging problem for search engines and e-commerce companies. In this thesis, I develop a query classification system based on a novel query expansion approach and several classification methods. The proposed methodology is used to classify queries based on a taxonomy (a database of words and their corresponding topic classifications). The taxonomy used was obtained from GenieKnows, a search engine company in Halifax, Canada.

The query classification system can be divided into three phases: feature selection, query expansion, and query classification. The first phase uses a chi-square statistic to select a subset of “feature words” from the GenieKnows taxonomy; the second phase uses cosine similarity and Kullback-Leibler divergence to find the feature words most similar to the query, which are used to expand it; and the third phase applies one of three classification methods (a naive Bayes multinomial model, a naive Bayes Bernoulli model, and a Dirichlet/multinomial model) to classify the expanded queries.

Data from the KDD-Cup 2005 competition are used to test the performance of the proposed query classification system, and the experiments show that it performs well.

Acknowledgments

There are many people who deserve thanks for helping me during my study at Acadia. The last two years in this department and at GenieKnows have been an unforgettable experience for me.

First and foremost, I express my sincere gratitude to my supervisors Dr. Hugh Chipman and Dr. Pritam Ranjan. They not only provided guidance, direction, and funding, but also encouraged me throughout my study at Acadia. My thesis could not have been finished without their gracious help. I also express my gratitude to my committee members, Dr. Crystal Linkletter, Dr. Wilson Lu, Dr. Jeff Hooper and Dr. Anthony Tong.

Mathematics of Information Technology and Complex Systems (MITACS) and GenieKnows provided me with an eight-month internship, and this thesis is closely related to that internship at GenieKnows. Many thanks to MITACS and GenieKnows. Dr. Tony Abou-Assaleh, my internship supervisor at GenieKnows, provided a lot of support and instruction during my internship. Philip O'Brien and Dr. Luo Xiao reviewed my thesis and gave me many valuable suggestions. Dr. Luo Xiao also helped familiarize me with the GenieKnows data and gave me great help during my preliminary research for this thesis. I will always be grateful to them for their help.

Chapter 1

Introduction

1.1 Query Classification

1.1.1 What is query classification?

In search engines (e.g., Google or Yahoo) and e-commerce companies (e.g., Amazon.com or ebay.com), users type queries to obtain information. Queries are short pieces of text, such as a search term typed into a search engine. Query classification aims to classify queries into a set of target topics.

Suppose there are five target topics {education, shopping, restaurant, statistics, computer science}, and two queries “Acadia University” and “coffee”. Our main objective is to classify the two queries into one or more target topics. For instance, it is reasonable to classify the query “Acadia University” into the topic “education” and “coffee” into the topic “restaurant”.

1.1.2 Why is query classification useful?

Query classification can help e-commerce companies learn user preferences. If a user types five queries on an e-commerce company's website: {pattern recognition, statistical computing, statistical inference, linear models, data mining}, all these


queries should be classified into the topics “statistics” or “computer science”. From the information provided by query classification, the e-commerce company can infer that this user is interested in statistics or computer science. This is valuable information, which is often used to build user profiles for recommending products according to the user's preferences.

Query classification can also help search engines retrieve appropriate results in response to user queries. The problem of query classification has received significant attention from researchers and industry. For example, Topic-Sensitive PageRank (Haveliwala, 2002) improves the famous PageRank algorithm (Brin and Page, 1998) by classifying pages into 16 topics from the Open Directory Project (www.dmoz.org). In the next section, relevant work in query classification is presented.

1.2 Relevant Work

Some research has been done to address the query classification problem. Shen et al. (2005) used an ensemble-search based approach to query classification in the KDD-Cup 2005 competition. Beitzel et al. (2005a) examined three approaches to classifying queries: “matching against labeled queries”, “supervised learning of classifiers”, and “mining of selectional preference rules from unlabeled query logs”. Beitzel et al. (2005b) used computational linguistics to develop a semi-supervised learning method that takes advantage of vast amounts of unlabeled queries. Bhandari and Davison (2007) took advantage of features of the search results for a given query, such as snippets, page content and titles, for its classification. Beitzel et al. (2007) examined two issues in query classification: (i) whether to classify queries before or after search results have been retrieved, and (ii) whether to train the classifier explicitly from classified queries or using a document taxonomy.

The work cited above focuses on classifying queries into topics. Other work classifies queries into non-topic categories. For example, Kang and Kim (2003) classified queries into three non-topic categories: “the topic relevance task”,

“the homepage finding task”, and “the service finding task”. Rose and Levinson (2004) found that “navigational” searches are not prevalent and that most queries are “resource-seeking”. Broder et al. (2007) focused on rare queries; their methodology has two phases, the construction of a “document classifier” and then the development of a “query classifier”.

1.3 My Approach

Although the research introduced in the last section provides some methodologies for the query classification problem, in this thesis I introduce a new query classification system based on a new query expansion approach. Query classification can be considered a special case of text classification, since queries are short pieces of text; however, queries are typically so short that they are ambiguous, which makes them hard to classify directly.

In my thesis, I consider 18 topics (based on the application; see Chapter 2), and try to classify queries into these topics. The main idea is to extend each query to create a short “text”, and then to classify the “text” into one or more of the 18 topics by applying classification methods. The proposed approach can be summarized by answering the following three questions.

1. Where can we find a source of candidate words to expand a query?

2. How do we expand a query using these candidate words?

3. How do we classify a query after it has been expanded?

To answer these three questions, I design a query classification system containing three parts: feature selection, query expansion, and query classification.

The first part answers the first question. I select a subset of feature words for each topic. This process takes advantage of the GenieKnows taxonomy (a database of words and their appropriate classification, used throughout this thesis) and will be discussed in Chapter 2. These feature words are considered as a source of candidate words to

expand each query. The penalized feature selection used in this part is novel.

The second part of the query classification system answers the second question. For each word in a query, I measure the similarity between the word and all the feature words found in the first part. Two kinds of similarity score, cosine similarity and a novel smoothed KL divergence, are used to measure the degree of similarity. For each word in a query, the similarity scores of the feature words are sorted so that the most similar feature words can be chosen to expand the query. The expanded query consists of the original words and the most similar feature words.

The third part answers the third question. Three kinds of classification methods, the naive Bayes multinomial model, the naive Bayes Bernoulli model, and a novel Dirichlet/multinomial model, are used to classify the expanded query.

By applying a classification method, an expanded query receives a score for each of the 18 topics. These 18 scores are sorted, and the topics with the top scores are considered the topics to which the query belongs.

My query classification system is outlined in Figure 1.1.

Figure 1.1: Query classification system

Chapter 2

Data and Terminology

In this chapter, some necessary notation and concepts from information theory and information retrieval are reviewed. The GenieKnows taxonomy is also introduced.

2.1 Information Retrieval

Information retrieval studies how to help users find valuable information from a large collection of text documents. This section reviews some basic ideas from information retrieval: the bag of words assumption, document frequency, and term frequency. Liu (2007, Chapter 6) provides an excellent introduction to these ideas.

2.1.1 Bag of words assumption

One of the most basic questions in information retrieval is how to represent a document. The bag-of-words assumption is widely used to answer this question. This assumption treats each document as a “bag of words”: the positions of the words in the document are ignored, and a document is described by its set of distinct words.

Example 2.1. Suppose a document contains two sentences: “Data mining is not data compression.” and “Data mining is mining the data.”. By applying the bag-of-words assumption, the document can be represented as {data, mining, compression}. Words such as “is”, “not” and “the” are ignored; such insignificant words are considered “stopwords” (Liu, 2007).

2.1.2 Document frequency and term frequency

The term frequency (tf) of a word is the number of times the word appears in a document. In Example 2.1, there are only 3 distinct words: “data”, “mining”, and “compression”. We can represent the document as: “data, mining, compression”, and calculate the corresponding term frequencies as (4, 3, 1).

Now for a collection of documents, we can define document frequency. For any given word, the document frequency (df) of the word is the number of documents in which the word appears at least once.
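To make these definitions concrete, here is a minimal Python sketch (my illustration, not code from the thesis) that computes term and document frequencies for the document of Example 2.1; the toy stopword list is an assumption for this example only.

```python
from collections import Counter

STOPWORDS = {"is", "not", "the"}  # toy stopword list for this example

def bag_of_words(document):
    """Tokenize, lowercase, strip punctuation, and drop stopwords."""
    tokens = [t.strip(".,").lower() for t in document.split()]
    return [t for t in tokens if t not in STOPWORDS]

docs = ["Data mining is not data compression. Data mining is mining the data."]

# Term frequency: counts of each word within one document.
tf = Counter(bag_of_words(docs[0]))
print(tf)  # Counter({'data': 4, 'mining': 3, 'compression': 1})

# Document frequency: number of documents containing each word at least once.
df = Counter()
for d in docs:
    df.update(set(bag_of_words(d)))
print(df)  # with a single document, every df count is 1
```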

2.2 Information Theory

Information theory originated with Shannon (1948) when he was working at Bell Labs. This section reviews two concepts from information theory: entropy and KL divergence.

Suppose we have a discrete random variable X taking one of n categorical values, with corresponding probability mass function p(X) = (p1, p2, . . . , pn). The entropy H(X) of the discrete random variable X is a measure of the amount of uncertainty associated with the value of X, and can be defined as

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i. \tag{2.1}$$

Kullback-Leibler (KL) divergence (also known as relative entropy) is another important concept in information theory, often used to compare two distributions. Suppose we have two random variables X and Y with probability mass functions p(X) = (p1, p2, . . . , pn) and p(Y) = (q1, q2, . . . , qn). Then the KL divergence between X and Y can be defined as

$$KL(X, Y) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}. \tag{2.2}$$

KL divergence measures the dissimilarity between the distributions of X and Y: if X and Y are similar, the KL divergence will be small.
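As a small illustration (mine, not from the thesis), the following Python sketch implements (2.1) and (2.2) using natural logarithms; the thesis does not fix the base of the logarithm, so nats are an assumption here.

```python
import math

def entropy(p):
    """H(X) in (2.1); terms with p_i = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(X, Y) in (2.2); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.4, 0.3, 0.3]
print(entropy(p))           # about 1.04 nats
print(kl_divergence(p, q))  # about 0.02, small because p and q are similar
```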

2.3 GenieKnows Taxonomy

Chakrabarti (2003) defines a taxonomy as “a large and complex class hierarchy”. The GenieKnows taxonomy is built from YellowPage data, and it summarizes information such as the document frequency and term frequency of the words within each topic.

GenieKnows is a vertical search engine company (www..ca), and their taxonomy serves the GenieKnows local search engine. Local search engines return local business information to users. For example, for the query “coffee”, a local search engine will return only local coffee businesses and merchandise to users, while a general search engine, like Google, will return all kinds of web pages, perhaps including a web page about how to make coffee.

My query classification system is based on the GenieKnows taxonomy; however, it can also use other kinds of taxonomy, and so, depending on the taxonomy, it can work for other kinds of search engines.

2.3.1 Topics

In the GenieKnows taxonomy, the topics form a two-level hierarchical structure. The top level includes the 18 topics listed in Table 2.1.

Table 2.1: Top-level topics in the GenieKnows taxonomy

Topic ID  Topic Name
1   Arts/Entertainment
2   Automotive
3   Business/Professional Services
4   Community/Government
5   Computers/Electronics
6   Construction/Contractors
7   Education
8   Food/Dining
9   Health/Medicine
10  Home/Garden
11  Industry/Agriculture
12  Legal/Financial Services
13  Media/Communications
14  Personal Care/Services
15  Real Estate
16  Shopping
17  Sports/Recreation
18  Travel/Transportation

The bottom level includes 273 sub-topics. For example, for the topic “Food/Dining”, some of the sub-topics are “bakeries”, “coffee shops”, “convenience stores”, “health food”, etc. In my research, only the 18 top-level topics are considered: my task is to classify each query into one or more of these 18 topics. The same approach could also be applied to the 273 sub-topics.

2.3.2 Original data

The GenieKnows taxonomy comes from the Internet YellowPage data (http://www.yellowpages.ca/business/), which includes all the businesses in North America. In the YellowPage data, businesses have been classified into the hierarchical categories described above. For each business, there are 67 fields describing the business, such as “Business name”, “Telephone”, etc. In this work, only three fields are used to build the taxonomy: “Business name”, “Services” and

“Captions”.

At GenieKnows, researchers extracted all the words, except stopwords, from these YellowPage fields to build the taxonomy data. For each topic, all the words occurring in the topic are listed. An extract of the GenieKnows taxonomy for topic 1 (Arts/Entertainment) is shown in Table 2.2.

Table 2.2: Taxonomy extract for topic Arts/Entertainment

Words (wi)  g(wi)      h(wi)
selection   0.00125    191
rolled      6.967E-6   48231
roller      6.53E-5    3546
blenko      3.483E-6   120578
liqr        6.1E-6     34451
saginaw     1.045E-5   20096

In Table 2.2, the first column lists the words that appear in this topic. The second column presents g(wi), a measure of term frequency, and the third column presents h(wi), a measure of document frequency. For proprietary reasons, the exact expressions for g(wi) and h(wi) are not provided.

2.3.3 Multi-word taxonomy

The GenieKnows multi-word taxonomy summarizes the co-occurrence frequencies of pairs of words. Part of the multi-word taxonomy for Topic 1 (Arts/Entertainment) is shown in Table 2.3.

Table 2.3: Multi-word taxonomy extract for topic Arts/Entertainment

wi        wj             f(wi, wj)
antique   windsor        4.0
circle    entertainment  5.0
carpet    oriental       5.0
richard   wright         4.0
annual    recital        7.0
magic     shop           17.0
age       fun            23.0
piano     sharp          7.0
bass      sam            4.0
kirkland  photography    3.0

The entries in the third column, f(wi, wj), are functions of the number of times the two words wi and wj co-occur in the topic “Arts/Entertainment”. For proprietary reasons, the exact expression for f(wi, wj) is not provided.

Chapter 3

Feature Selection

This chapter introduces the first part of my query classification system: feature selection. For each topic, some words in the taxonomy are highly related to the topic and yet diverse enough to capture the variation among all the words associated with the topic. Such words are called the feature words of the topic. Feature words for all the topics are collected to build a feature words dictionary, which is used to expand each query. By constructing a feature words dictionary, I am able to represent queries in terms of a set of under 200 words, rather than the over 100,000 words in the taxonomy.

3.1 Feature Selection Using Chi-Square Statistic

The chi-square (χ2) statistic is one of the most popular methods for feature selection in text categorization (Yang and Pedersen, 1997). It measures the dependence between a word wi and a topic Tj.

Yang and Pedersen (1997) and Chakrabarti (2003) provide a chi-square based statistic for selecting feature words from a collection of documents. Consider a 2 × 2 frequency table corresponding to occurrence/non-occurrence of word wi (wi / w̄i) and occurrence/non-occurrence of topic Tj (Tj / T̄j), where j = 1, 2, . . . , 18.

Word/Topic   Tj   T̄j
wi           A    B
w̄i           C    D

Let N = A + B + C + D be the total number of documents. The frequency table gives estimates p(Tj) = (A + C)/N, p(T̄j) = (B + D)/N, p(wi) = (A + B)/N, and p(w̄i) = (C + D)/N. Pearson's χ2 statistic is

$$\chi^2(T_j, w_i) = \frac{(A - N\,p(w_i)\,p(T_j))^2}{N\,p(w_i)\,p(T_j)} + \frac{(B - N\,p(w_i)\,p(\bar{T}_j))^2}{N\,p(w_i)\,p(\bar{T}_j)} + \frac{(C - N\,p(\bar{w}_i)\,p(T_j))^2}{N\,p(\bar{w}_i)\,p(T_j)} + \frac{(D - N\,p(\bar{w}_i)\,p(\bar{T}_j))^2}{N\,p(\bar{w}_i)\,p(\bar{T}_j)}, \tag{3.1}$$

which can be simplified (Chakrabarti, 2003) to

$$\chi^2(T_j, w_i) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}, \tag{3.2}$$

which is used for feature selection in information retrieval. For a given topic Tj (j = 1, 2, . . . , 18), I use (3.2) to calculate the chi-square statistic between word wi and topic Tj for every word in the taxonomy. These chi-square values are sorted, and the 11 words with the largest χ2 values are chosen as the feature words for topic Tj.
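To make the selection step concrete, here is a Python sketch (my illustration, not code from the thesis) that scores words by (3.2) and keeps the top 11; it assumes the 2 × 2 counts (A, B, C, D) have already been tallied from the taxonomy, and the function and variable names are mine.

```python
def chi_square(A, B, C, D):
    """Pearson chi-square for a 2x2 word/topic table, simplified form (3.2)."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def top_feature_words(tables, n=11):
    """tables maps each word to its (A, B, C, D) counts for a fixed topic T_j;
    returns the n words with the largest chi-square statistic."""
    scores = {w: chi_square(*t) for w, t in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```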

3.1.1 Penalized feature selection

If we inspect the feature words selected by the chi-square statistic, we find that within some topics several feature words have very similar meanings. To overcome this problem, I propose penalized feature selection (PFS) as a generalization of the chi-square method.

For example, without penalization, topic 4 in the taxonomy “Community/Government” has the feature words: “church”, “baptist”, “tax”, “ministry”, “christ”, “god”, “methodist”, “filing”, “united”, “efile”, “christian”. Among these feature words, “church”, “baptist”, “ministry”, “christ”, “god”, and “christian” have very similar meanings and are not diverse enough to capture all the words associated with the topic “Community/Government”.

The example highlights a potential problem with the feature selection method: the chi-square statistic measures the dependence between topic Tj and word wi, but does not account for dependence among the words themselves. In some cases, such as the example presented here, redundant words may be selected. The objective of PFS is to make the selected feature words as diverse as possible.

The selection is a sequential process. For a given topic Tj, suppose L (1 ≤ L ≤ 10) words have been selected as feature words, coded as wk1, wk2, . . . , wkL; the indices k1, . . . , kL identify the L words. As the (L+1)-st feature word, I choose the word wi which maximizes the PFS score

$$PFS(w_i, T_j) = \chi^2(T_j, w_i) - \alpha \sum_{l=1}^{L} \chi^2(w_i, w_{k_l}). \tag{3.3}$$

PFS adds a penalty term to (3.2). Each term χ2(wi, wkl) in the penalty, measuring the dependence between words wi and wkl, is calculated via (3.2) using the multi-word taxonomy described in Section 2.3.3, and the parameter α controls the amount of penalization. In penalized feature selection, the first feature word is still the word with the largest chi-square statistic. If two words have similar meanings, they will tend to have a larger chi-square statistic with each other; that is, χ2(wi, wkl) will be large if word wi is dependent on the previously selected word wkl. So if a word wi has a meaning similar to some selected feature words wkl, the penalty term α Σ_{l=1}^{L} χ2(wi, wkl) will tend to be large. The penalty term therefore discourages feature words with similar meanings, and makes the selected feature words as diverse as possible.
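The greedy procedure in (3.3) can be sketched as follows (my illustration; the precomputed dictionaries chi_topic, holding the χ2(Tj, wi) values, and chi_pair, holding the pairwise χ2 values from the multi-word taxonomy, are assumed names):

```python
def penalized_feature_selection(words, chi_topic, chi_pair, alpha, n=11):
    """Greedily select n feature words by maximizing the PFS score (3.3).
    chi_pair.get((w, v), 0.0) returns 0 for pairs that never co-occur."""
    selected = []
    candidates = set(words)
    while len(selected) < n and candidates:
        def pfs(w):
            penalty = sum(chi_pair.get((w, v), 0.0) for v in selected)
            return chi_topic[w] - alpha * penalty
        best = max(candidates, key=pfs)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With an empty `selected` list the penalty vanishes, so the first chosen word is indeed the one with the largest chi-square statistic, matching the description above.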

For example, in topic 4 (Community/Government), if α is set to 0.002, the selected feature words are: “church”, “baptist”, “tax”, “efile”, “lcsw”, “filing”, “cremation”, “investigation”, “methodist”, “electronic”, “club”. With the help of penalized feature selection, these feature words are more diverse. The feature words corresponding to a variety of α values for all 18 topics are listed in the Appendix, and the choice of values for α is discussed in Section 6.2.1.

Chapter 4

Query Expansion

In this chapter, I introduce the second part of the query classification system: query expansion. First, I introduce two methods to define word similarity in Section 4.1. These two methods will be used to measure the similarity between each query word and all the feature words. Then, the k most similar feature words will be added to the query to form a small text. In Section 4.2, I explain why only feature words are considered as candidates to generate an expanded query.

4.1 Word Similarity

Two words are considered to be similar if they have similar meaning. For example, “restaurant” and “food” are similar words, but “restaurant” and “computer” are not. To measure similarity between two words, I use two methods: cosine similarity and smoothed KL divergence. Both methods are based on the GenieKnows taxonomy.

4.1.1 Cosine similarity

Cosine similarity can be used to measure the similarity between two words. Let a word wi be expressed as a vector vwi given by

$$v_{w_i} = [df_{i,1}, df_{i,2}, \ldots, df_{i,18}], \tag{4.1}$$


where dfi,j (j = 1, . . . , 18) represents the document frequency of word wi in topic Tj. The cosine similarity between two words wi and wk can be defined as

$$SC(w_i, w_k) = \frac{\sum_{j=1}^{18} df_{i,j}\, df_{k,j}}{\sqrt{\sum_{j=1}^{18} df_{i,j}^2}\, \sqrt{\sum_{j=1}^{18} df_{k,j}^2}}, \tag{4.2}$$

which can also be written as

$$SC(w_i, w_k) = \cos(\theta) = \frac{\langle v_{w_i}, v_{w_k} \rangle}{\| v_{w_i} \| \, \| v_{w_k} \|}, \tag{4.3}$$

where θ is the angle between the vectors vwi and vwk, ⟨vwi, vwk⟩ is their dot product, and ‖vwi‖ denotes the magnitude of the vector vwi.

Since the document frequency values in vwi and vwk are non-negative, 0 ≤ SC(wi, wk) = cos(θ) ≤ 1. Values of SC(wi, wk) close to 1 indicate that the two words wi and wk are similar, whereas values close to 0 mean that they are not. This measure provides a relative ranking of the closeness of words to a specific word.

Example 4.1. To compute the similarity between “restaurant”, “food”, and “computer”, I express these three words in document frequency vector form, as in (4.1):

$$v_{w_{\text{restaurant}}} = [12631, 2221, \ldots, 3830],$$
$$v_{w_{\text{food}}} = [4496, 5157, \ldots, 395],$$
$$v_{w_{\text{computer}}} = [1497, 16529, \ldots, 398].$$

The cosine similarity of the word “restaurant” with the other two words is

$$SC(\text{restaurant}, \text{food}) = \frac{12631 \times 4496 + \cdots + 3830 \times 395}{\sqrt{12631^2 + \cdots + 3830^2}\, \sqrt{4496^2 + \cdots + 395^2}} = 0.93,$$

$$SC(\text{restaurant}, \text{computer}) = \frac{12631 \times 1497 + \cdots + 3830 \times 398}{\sqrt{12631^2 + \cdots + 3830^2}\, \sqrt{1497^2 + \cdots + 398^2}} = 0.53.$$

Clearly, “restaurant” and “food” are more similar than “restaurant” and “computer”.
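A minimal sketch of (4.2) over the 18-dimensional document frequency vectors (my illustration, not code from the thesis):

```python
import math

def cosine_similarity(df_i, df_k):
    """SC(w_i, w_k) in (4.2): cosine of the angle between two df vectors."""
    dot = sum(a * b for a, b in zip(df_i, df_k))
    norm_i = math.sqrt(sum(a * a for a in df_i))
    norm_k = math.sqrt(sum(b * b for b in df_k))
    return dot / (norm_i * norm_k) if norm_i and norm_k else 0.0
```

Applied to the full 18-entry vectors above, this would reproduce the values 0.93 and 0.53 of Example 4.1.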

4.1.2 Smoothed KL divergence

Kullback-Leibler (KL) divergence (or relative entropy) was introduced in Section 2.2. It is commonly used to measure the dissimilarity between two distribution functions. Here, we consider the distributions of two words across the 18 topics of the GenieKnows taxonomy.

Let pi,j denote the relative document frequency of wi, given by

$$p_{i,j} = \frac{df_{i,j}}{\sum_{j=1}^{18} df_{i,j}}; \tag{4.4}$$

then the vector vwi can be re-expressed as a probability mass function corresponding to the word wi,

$$v_{w_i} = [p_{i,1}, p_{i,2}, \ldots, p_{i,18}]. \tag{4.5}$$

Thus, the KL divergence between two words wi and wk can be represented as

$$KL(w_i, w_k) = \sum_{j=1}^{18} p_{i,j} \log\!\left(\frac{p_{i,j}}{p_{k,j}}\right). \tag{4.6}$$

A problem with KL divergence is that it can be infinite if pk,j = 0 and the corresponding pi,j ≠ 0. To solve this problem, I use the idea of smoothing from Manning, Raghavan and Schutze (2008) and add a smoothing term of 0.5 to (4.4). The modified pi,j is given by

$$p_{i,j} = \frac{df_{i,j} + 0.5}{\sum_{j=1}^{18} df_{i,j} + 0.5 \times 18}. \tag{4.7}$$

Since the smallest non-zero document frequency (df) is 1, the smaller smoothing term 0.5 is used. The KL divergence expression in (4.6) is used with (4.7) to calculate the similarity between two words.

KL divergence is not symmetric, which means KL(wi, wk) ≠ KL(wk, wi). In the problem under consideration, the feature words are considered as “reference” or “baseline” words, so the first word wi in the KL divergence score in (4.6) is the feature word and the second word wk is the query word.

In cosine similarity measurement, a large value (close to 1) indicates that two words are similar; in smoothed KL divergence, a smaller value indicates that two words are more similar.

Example 4.2. Consider the setup of Example 4.1. To compute the smoothed KL divergence between “restaurant”, “food”, and “computer”, I first express these three words as vectors of adjusted probabilities, as in (4.7):

$$v_{w_{\text{restaurant}}} = [0.073, 0.013, \ldots, 0.022],$$
$$v_{w_{\text{food}}} = [0.039, 0.044, \ldots, 0.003],$$
$$v_{w_{\text{computer}}} = [0.011, 0.118, \ldots, 0.00285].$$

The smoothed KL divergence of the word “restaurant” with the other two words is then

$$KL(\text{restaurant}, \text{food}) = 0.073 \log\frac{0.073}{0.039} + \cdots + 0.022 \log\frac{0.022}{0.003} = 0.182,$$

$$KL(\text{restaurant}, \text{computer}) = 0.073 \log\frac{0.073}{0.011} + \cdots + 0.022 \log\frac{0.022}{0.00285} = 3.872.$$

Since (“restaurant”, “food”) has a smaller KL divergence value than (“restaurant”, “computer”), “restaurant” is more similar to “food” than to “computer”.
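A sketch of the smoothed measure, combining (4.7) with (4.6) (my illustration; the function names are mine):

```python
import math

def smoothed_pmf(df_vector, smoothing=0.5):
    """p_{i,j} in (4.7): document frequencies smoothed by 0.5 and normalized."""
    total = sum(df_vector) + smoothing * len(df_vector)
    return [(df + smoothing) / total for df in df_vector]

def smoothed_kl(df_feature, df_query):
    """KL(w_i, w_k) in (4.6) on smoothed probabilities (4.7); the feature
    word is the first argument, as prescribed in Section 4.1.2."""
    p = smoothed_pmf(df_feature)
    q = smoothed_pmf(df_query)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because every smoothed probability is strictly positive, the infinite-divergence problem described above cannot occur.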

4.2 The Advantage of Using Feature Words

In the proposed methodology, only feature words, instead of all the words in the GenieKnows taxonomy, are used as candidates to expand queries. The main reason is to avoid adding noisy words to a query: words in the taxonomy that are mistakenly selected as the most similar to the query words are called noisy words.

Example 4.3. For the query “bill”, if I use all the words in the GenieKnows taxonomy as candidates, the 9 most similar words using cosine similarity are

stan, steve, doug, rick, curt, dan, holley, tom, ron. (4.8)

Almost all of these are noisy words, i.e., they are not related to the query “bill”. However, if only the feature words are considered as candidates, the most similar words for “bill” using cosine similarity (with penalty parameter α set to 0.0002) are

repair, home, supply, system, estimate, consulting, storage, control, electrical. (4.9)

Using smoothed KL divergence (again with α = 0.0002), the most similar words for “bill” are

repair, home, supply, system, estimate, consulting, storage, control, electrical. (4.10)

Thus, (4.9) and (4.10) are more reasonable than (4.8), which uses all the words in the taxonomy. The expanded query is treated as a small text; for instance, the query “bill” would be expanded to 10 words (“bill” plus the 9 words listed above).

In summary, for each word in the query (ignoring stopwords like “the”, “as”, “of”, . . . ), I use cosine similarity or smoothed KL divergence to measure the similarity between the query word and all the feature words, and the k most similar feature words are added to the query. If a query has m words, it will be expanded to m × (k + 1) = m × k + m words, where m is the number of words in the original query and m × k is the number of feature words added. In the next chapter, I present classification methods that can be applied to classify the expanded query (i.e., a small text with m × (k + 1) words).
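To tie the chapter together, a minimal sketch of the expansion step (my illustration; the function and parameter names are mine, and `similarity` stands for either of the two measures above, with a KL divergence negated so that larger always means more similar):

```python
def expand_query(query_words, feature_words, similarity, k, stopwords=frozenset()):
    """Expand a query to m*(k+1) words: each non-stopword query word keeps
    itself plus its k most similar feature words."""
    expanded = []
    for w in query_words:
        if w in stopwords:
            continue
        ranked = sorted(feature_words, key=lambda f: similarity(w, f), reverse=True)
        expanded.append(w)
        expanded.extend(ranked[:k])
    return expanded
```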

Chapter 5

Query Classification

In this chapter, three classification methods are introduced. The objects to be classified are the queries that have been expanded, not the original queries. Expanded queries consist of the original query words and the feature words that have been added. For instance, in Example 4.3, if cosine similarity is used in conjunction with feature words, the expanded query for “bill” would be the ten words: “bill”, “repair”, “home”, “supply”, “system”, “estimate”, “consulting”, “storage”, “control”, and “electrical”.

Three classification methods are applied to classify an expanded query: two naive Bayes classifiers and a Dirichlet/multinomial model. The naive Bayes classifiers are based on the Bernoulli and multinomial distributions. For more details on Bayes models, see McCallum and Nigam (1998), Chakrabarti (2003), Liu (2007), and Manning, Raghavan and Schutze (2008).

5.1 Naive Bayes Classification Method

The objective here is to classify an expanded query Q into one or more topics. Suppose the expanded query consists of M words, Q = {w1, w2, . . . , wM}; the probability that the expanded query Q belongs to topic Tj is denoted p(Tj|Q).


An expanded query is considered as a text, and in the naive Bayes classification method, the text is thought to be generated by a two-step process:

1. a topic Tj is randomly picked for the expanded query with prior probability p(Tj), where Σ_{j=1}^{18} p(Tj) = 1;

2. after a topic Tj is chosen, its topic-conditional distribution p(Q|Tj) is used to generate the expanded query.

By Bayes rule,

$$p(T_j|Q) \propto p(T_j) \times p(Q|T_j). \tag{5.1}$$

This Bayes classification method is called “naive” because, given a topic Tj, the words in the expanded query are assumed to be independent. So, (5.1) can be written as

$$p(T_j|Q) \propto p(T_j) \times p(w_1|T_j) \times p(w_2|T_j) \times \cdots \times p(w_M|T_j), \tag{5.2}$$

where p(Tj) represents the prior probability, Π_{i=1}^{M} p(wi|Tj) is the likelihood function, and thus p(Tj|Q) is the posterior probability that the given expanded query Q belongs to topic Tj. The prior probability p(Tj) can be estimated as

$$\hat{p}(T_j) = \frac{N_j}{N}, \tag{5.3}$$

where Nj denotes the number of documents in topic Tj, and N represents the total number of documents in the GenieKnows taxonomy. The i-th term in the likelihood function, p(wi|Tj), is the probability that the word wi occurs in topic Tj. The key problem in the naive Bayes classification method is how to estimate p(wi|Tj).

Next, I present two models for estimating p(wi|Tj): (a) the naive Bayes Bernoulli model and (b) the naive Bayes multinomial model.

5.1.1 Naive Bayes Bernoulli model

Let topic Tj have Nj documents, and let Zir (r = 1, 2, . . . , Nj) be independent Bernoulli random variables such that Zir = 1 if wi is in the r-th document of topic Tj, and 0 otherwise. Then dfij = Σ_{r=1}^{Nj} Zir denotes the observed number of documents in topic Tj that contain word wi, and the corresponding random variable DFij can be modeled by a binomial distribution with parameters Nj and p(wi|Tj). That is,

$$P(DF_{ij} = df_{ij}) = \frac{N_j!}{(N_j - df_{ij})!\, df_{ij}!}\, p(w_i|T_j)^{df_{ij}}\, (1 - p(w_i|T_j))^{N_j - df_{ij}}, \tag{5.4}$$

where Nj is the number of documents in topic Tj. The maximum likelihood estimate of p(wi|Tj) is

$$\hat{p}(w_i|T_j) = \frac{df_{ij}}{N_j}, \tag{5.5}$$

where dfij denotes the observed document frequency of word wi in topic Tj.

Occasionally, for rare words that do not appear in a topic Tj but appear in an expanded query, dfij can be equal to zero. As a result, when (5.5) is used in (5.2), the posterior probability in (5.2) becomes zero. This problem can be resolved by applying Bayesian methods.

In the Bayesian method, we assume the prior distribution on p(wi|Tj) is a beta distribution, with probability density function

$$f(p(w_i|T_j) \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p(w_i|T_j)^{a-1}\, (1 - p(w_i|T_j))^{b-1}, \tag{5.6}$$

where a, b > 0. The prior distribution has mean

$$\hat{p}(w_i|T_j) = \frac{a}{a+b}, \tag{5.7}$$

which is our best estimate of p(wi|Tj) without having seen the data. I take a = b = 1 in my research. Next, we combine the prior information (5.7) with the likelihood estimate (5.5) of p(wi|Tj).

Let the joint pdf of p(wi|Tj) and DFij be f(p(wi|Tj), DFij). Then the posterior density (the distribution of p(wi|Tj) given dfij) is

$$f(p(w_i|T_j) \mid df_{ij}) = \frac{f(p(w_i|T_j), df_{ij})}{f(df_{ij})} = \frac{\Gamma(N_j + a + b)}{\Gamma(df_{ij} + a)\,\Gamma(N_j - df_{ij} + b)}\, p(w_i|T_j)^{df_{ij}+a-1}\, (1 - p(w_i|T_j))^{N_j - df_{ij} + b - 1}, \tag{5.8}$$

which is also a beta distribution (Bishop, 2006). The posterior mean of p(wi|Tj) is given by

$$\hat{p}(w_i|T_j) = \frac{df_{ij} + a}{N_j + a + b} = \frac{df_{ij} + 1}{N_j + 2}, \tag{5.9}$$

which combines the prior information in (5.7) and the likelihood information in (5.5). In (5.9), even if dfij = 0, a very small non-zero estimate 1/(Nj + 2) is obtained, so that the estimate of p(Tj|Q) is positive. By applying (5.2), for an expanded query with M words,

$$\hat{p}(T_j|Q) = \frac{N_j}{N} \times \prod_{i=1}^{M} \frac{df_{ij} + 1}{N_j + 2}. \tag{5.10}$$
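As an illustration of how (5.10) might be computed, here is a Python sketch (mine, not the thesis's implementation); `df`, `N_topic`, and `N_total` are hypothetical containers for the taxonomy quantities, and the scoring is done in log space to avoid numerical underflow of the long product.

```python
import math

def log_posterior_bernoulli(expanded_query, topic, df, N_topic, N_total):
    """Unnormalized log of (5.10): log prior plus the log posterior-mean
    Bernoulli estimates (df_ij + 1) / (N_j + 2). df[(w, topic)] holds the
    document frequency of word w in the topic, defaulting to 0."""
    score = math.log(N_topic[topic] / N_total)
    for w in expanded_query:
        score += math.log((df.get((w, topic), 0) + 1) / (N_topic[topic] + 2))
    return score
```

Since the log is monotone, ranking topics by this log score is equivalent to ranking them by (5.10).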

5.1.2 Naive Bayes multinomial model

Recall that tfij represents the observed number of occurrences of word wi in all documents belonging to topic Tj. Let TFij be the corresponding random variable, and Uj be the number of unique words in topic Tj. Then Σ_{k=1}^{Uj} tfkj represents the number of occurrences of all the words in topic Tj. A multinomial distribution can be used to model the joint distribution of (TF1j, TF2j, . . . , TFUjj). The joint pmf is

$$P(TF_{1j} = tf_{1j}, \ldots, TF_{U_j j} = tf_{U_j j}) = \frac{\left(\sum_{k=1}^{U_j} tf_{kj}\right)!}{\prod_{i=1}^{U_j} (tf_{ij}!)} \times \prod_{i=1}^{U_j} p(w_i|T_j)^{tf_{ij}}. \tag{5.11}$$

The maximum likelihood estimate (McCallum and Nigam, 1998) for p(wi|Tj) is

$$\hat{p}(w_i|T_j) = \frac{tf_{ij}}{\sum_{k=1}^{U_j} tf_{kj}}. \tag{5.12}$$

As in the naive Bayes Bernoulli model, when tfij = 0, p(Tj|Q) will equal 0. The problem occurs for infrequently occurring words that do not appear in a topic Tj but appear in an expanded query.

McCallum and Nigam (1998) use a smoothed estimate

$$\hat{p}(w_i|T_j) = \frac{tf_{ij} + 1}{\sum_{k=1}^{U_j} tf_{kj} + U_j} \tag{5.13}$$

to avoid this problem. This estimate can also be obtained by applying the Bayesian method, similar to the discussion in Section 5.1.1: we assume the prior distribution on p(wi|Tj) is a Dirichlet distribution, which extends the beta distribution from the binary case to the multi-class case. The posterior distribution is still a Dirichlet distribution, and (5.13) is its posterior mean for p(wi|Tj) (McCallum and Nigam, 1998).

However, from the GenieKnows taxonomy I only have the data for computing the ratio tfij / Σ_{k=1}^{Uj} tfkj, and do not have the individual tfij and Σ_{k=1}^{Uj} tfkj, so I cannot use (5.13) to estimate p(wi|Tj). As an alternative, I use

$$\hat{p}(w_i|T_j) = \max\left\{ \frac{tf_{ij}}{\sum_{k=1}^{U_j} tf_{kj}},\; \epsilon \right\}, \tag{5.14}$$

where ε = 1 × 10^{-11} is a small constant. If tfij = 0, ε is the estimate of p(wi|Tj). The smoothing constant ε is chosen very small so that it avoids the zero problem due to rare words described above, but does not influence the non-zero probabilities.

Thus, for an expanded query with M words, p(Tj|Q) can be estimated by the naive Bayes multinomial model as

$$\hat{p}(T_j|Q) = \frac{N_j}{N} \times \prod_{i=1}^{M} \max\left\{ \frac{tf_{ij}}{\sum_{k=1}^{U_j} tf_{kj}},\; \epsilon \right\}. \tag{5.15}$$
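A corresponding sketch of (5.15), again in log space (my illustration; `rel_tf`, `N_topic`, and `N_total` are assumed container names):

```python
import math

EPSILON = 1e-11  # the smoothing constant from (5.14)

def log_posterior_multinomial(expanded_query, topic, rel_tf, N_topic, N_total):
    """Unnormalized log of (5.15). rel_tf[(w, topic)] holds the ratio
    tf_ij / sum_k tf_kj, the only term-frequency quantity available from
    the taxonomy, defaulting to 0.0 for unseen words."""
    score = math.log(N_topic[topic] / N_total)
    for w in expanded_query:
        score += math.log(max(rel_tf.get((w, topic), 0.0), EPSILON))
    return score
```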

So far in this chapter, I have proposed two methods for estimating p(wi|Tj), one using the naive Bayes Bernoulli model and the other using the naive Bayes multinomial model. The estimated p(wi|Tj) is used in (5.2) to compute the desired posterior probability, p(Tj|Q). Next, I propose an alternative, direct approach for estimating p(Tj|Q).

5.2 Dirichlet/Multinomial Model

Suppose each expanded query has M words, and let qfj (j = 1, 2, . . . , 18) be the observed number of query words that are also feature words of topic Tj. Let QFj be the corresponding random variable, and let Σ_{j=1}^{18} qfj = M*. Note that M* could be larger than M, since some topics share the same feature words. A multinomial distribution can be used to model the distribution of (QF1, QF2, . . . , QF18) with parameters M* and μ1, μ2, . . . , μ18, where μj = p(Tj|Q). The joint pmf of (QF1, QF2, . . . , QF18) is given by

$$P(QF_1 = qf_1, \ldots, QF_{18} = qf_{18}) = \frac{M^{*}!}{\prod_{j=1}^{18} qf_j!} \times \prod_{j=1}^{18} \mu_j^{qf_j}. \tag{5.16}$$

The maximum likelihood estimate for μj = p(Tj|Q) is

$$\hat{\mu}_j = \frac{qf_j}{M^*}. \tag{5.17}$$

To apply the Bayesian method, we assume the prior distribution for μj is a Dirichlet distribution with hyperparameters a1 = c × N1/N, . . . , a18 = c × N18/N, and joint probability density function

$$f(\mu_1, \mu_2, \ldots, \mu_{18} \mid a_1, \ldots, a_{18}) = \frac{\Gamma\!\left(\sum_{j=1}^{18} a_j\right)}{\prod_{j=1}^{18} \Gamma(a_j)} \times \prod_{j=1}^{18} \mu_j^{a_j - 1}, \tag{5.18}$$

where c is a constant, Nj is the number of documents in topic Tj, and N is the number of documents in the GenieKnows taxonomy, so that Σ_{j=1}^{18} Nj = N.

The expected value of μj under the prior distribution is

$$\hat{\mu}_j = \frac{c \times \frac{N_j}{N}}{c \times \sum_{j=1}^{18} \frac{N_j}{N}} = \frac{N_j}{N}, \tag{5.19}$$

which is our best estimate of μj = p(Tj|Q) without having seen the data. Next, we combine the prior information (5.19) with the likelihood estimate (5.17) of μj = p(Tj|Q). Let the joint distribution of {μ1, μ2, . . . , μ18} and {QF1, . . . , QF18} be f(μ1, μ2, . . . , μ18, QF1, . . . , QF18). The posterior distribution is still a Dirichlet distribution (Bishop, 2006),

$$f(\mu_1, \ldots, \mu_{18} \mid qf_1, \ldots, qf_{18}) = \frac{f(\mu_1, \ldots, \mu_{18}, qf_1, \ldots, qf_{18})}{f(qf_1, \ldots, qf_{18})} = \frac{\Gamma\!\left(\sum_{j=1}^{18} (a_j + qf_j)\right)}{\prod_{j=1}^{18} \Gamma(a_j + qf_j)} \times \prod_{j=1}^{18} \mu_j^{a_j + qf_j - 1}. \tag{5.20}$$

Setting c = 1 gives the posterior mean of μj = p(Tj|Q) as

$$\hat{\mu}_j = \hat{p}(T_j|Q) = \frac{\frac{N_j}{N} + qf_j}{1 + M^*}, \quad j = 1, \ldots, 18. \tag{5.21}$$

The choice of c = 1 corresponds to a reasonably uninformative prior. With c = 1, the topic with the largest probability estimate in (5.21) will correspond to the topic with the most query words (largest qfj). In this sense, the choice c = 1 is uninformative.
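A sketch of (5.21) with c = 1 (my illustration; the container names are assumptions):

```python
def dirichlet_posterior(qf, N_topic, N_total):
    """(5.21) with c = 1: qf[j] counts the query words that are feature
    words of topic j; returns the posterior means of p(T_j | Q)."""
    M_star = sum(qf)
    return [(N_topic[j] / N_total + qf[j]) / (1 + M_star) for j in range(len(qf))]
```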

In this chapter, (5.10), (5.15), and (5.21) provide three ways to estimate p(Tj|Q). For each expanded query, the values of p(Tj|Q) for the 18 topics are sorted, and the topics corresponding to the largest values are chosen as the topics to which the query should be classified. The choice of the number of topics to which a query is classified is discussed in Section 6.1.3.

Chapter 6

Experiments

In this chapter, I study the performance of my proposed query classification method on the Knowledge Discovery and Data Mining (KDD) Cup 2005 data, using the two word similarity measures described in Chapter 4 and the three classification methods described in Chapter 5.

6.1 KDD Data

KDD-Cup 2005 competition data is used to test the performance of my query classification system. The test queries are taken from the KDD-Cup 2005 competition (http://www.sigkdd.org/kdd2005/kddcup.html).

In this competition, 37 teams participated, and they were required to classify 800,000 queries into 67 categories. From these queries, 800 were randomly chosen and labeled by hand; this labeling was assumed to be the exact classification of these queries. The performance of each team was measured by their ability to classify these 800 queries into their appropriate categories. Teams were expected to classify all 800,000 queries without knowing which were the 800 queries with “exact” labels.


6.1.1 Some problems in using KDD Cup 2005 queries

Before the query classification method can be applied to the KDD-Cup data, the dataset must be processed in a number of ways. In this section, I outline several problems with the raw data. First, 98 of the 800 labeled queries do not exist in the GenieKnows taxonomy (e.g., the query “1939”), so they are not considered in my experiment; this leaves 800 − 98 = 702 queries. Second, the 800 labeled queries are intended for general search engines, whereas the GenieKnows taxonomy is for a local search engine (local search was introduced in Section 2.3), so the comparison is somewhat unfair, and some queries are inappropriate for my query classification system. Finally, the categories of KDD-Cup 2005 are slightly different from the topics of the GenieKnows taxonomy, so I matched them manually (Table 6.1). Each of the 18 topics corresponds to at least one KDD category.

Table 6.1: KDD-Cup categories and GenieKnows topics

GenieKnows Topic | KDD-Cup Category
Computers and Electronics | Computers/Hardware
Computers and Electronics | Computers/Internet and Intranet
Computers and Electronics | Computers/Mobile Computing
Computers and Electronics | Computers/Networks and Telecommunication
Computers and Electronics | Computers/Security
Computers and Electronics | Computers/Software
Computers and Electronics | Computers/Other
Media and Communications | Computers/Multimedia
Arts and Entertainment | Entertainment/Celebrities
Arts and Entertainment | Entertainment/Games and Toys
Arts and Entertainment | Entertainment/Humor and Fun
Arts and Entertainment | Entertainment/Movies
Arts and Entertainment | Entertainment/Music
Arts and Entertainment | Entertainment/Pictures and Photos
Arts and Entertainment | Entertainment/Radio
Arts and Entertainment | Entertainment/TV
Arts and Entertainment | Entertainment/Other
Arts and Entertainment | Information/Arts and Humanities
Industry and Agriculture | Information/Companies and Industries
Education | Information/Science and Technology
Education | Information/Education
Legal and Financial Services | Information/Law and Politics
Community and Government | Information/Local and Regional
Community and Government | Information/References and Libraries
Shopping | Living/Book and Magazine
Automotive | Living/Car and Garage
Business and Professional Services | Living/Career and Jobs
Personal Care and Services | Living/Dating and Relationships
Community and Government | Living/Family and Kids
Shopping | Living/Fashion and Apparel
Legal and Financial Services | Living/Finance and Investment
Food and Dining | Living/Food and Cooking
Home and Garden | Living/Furnishing and Houseware
Shopping | Living/Gifts and Collectables
Health and Medicine | Living/Health and Fitness
Home and Garden | Living/Landscaping and Gardening
Community and Government | Living/Pets and Animals
Real Estate | Living/Real Estate
Community and Government | Living/Religion and Belief
Home and Garden | Living/Tools and Hardware
Travel and Transportation | Living/Travel and Vacation
Business and Professional Services | Online Community/Chat and Instant Messaging
Business and Professional Services | Online Community/Forums and Groups
Business and Professional Services | Online Community/Homepages
Business and Professional Services | Online Community/People Search
Business and Professional Services | Online Community/Personal Services
Shopping | Shopping/Auctions and Bids
Shopping | Shopping/Stores and Products
Shopping | Shopping/Buying Guides and Researching
Shopping | Shopping/Lease and Rent
Shopping | Shopping/Bargains and Discounts
Sports and Recreation | Sports/American Football
Sports and Recreation | Sports/Auto Racing
Sports and Recreation | Sports/Baseball
Sports and Recreation | Sports/Basketball
Sports and Recreation | Sports/Hockey
Media and Communications | Sports/News and Scores
Sports and Recreation | Sports/Schedules and Tickets
Sports and Recreation | Sports/Soccer
Sports and Recreation | Sports/Tennis
Sports and Recreation | Sports/Olympic Games
Sports and Recreation | Sports/Outdoor Recreations
Construction and Contractors | Living/Tools and Hardware

These 702 queries were classified by applying the approaches described in Chapters 3, 4, and 5. Before presenting the results, I review the concepts of precision, recall and F1 value, which are used to compare the performance of the two word similarity methods and the three classification methods.

6.1.2 Precision, recall and F1 value

In the information retrieval community, precision, recall and F1 value are the most common ways to evaluate the performance of a retrieval system. In the KDD-Cup 2005 competition, precision and recall are defined as

$$Precision = \frac{\sum_j \text{number of queries correctly labeled as } T_j}{\sum_j \text{number of queries labeled as } T_j}, \tag{6.1}$$

$$Recall = \frac{\sum_j \text{number of queries correctly labeled as } T_j}{\sum_j \text{number of queries whose topics are labeled by human experts as } T_j}. \tag{6.2}$$

In (6.1) and (6.2), the numerators are the same, but the denominators are different. Example 6.1 explains these two ideas more clearly.

Example 6.1. Consider two queries Q1 and Q2, and suppose there are six topics. Let Q1 belong to three topics: 2, 3, 6, and Q2 belong to four topics: 1, 2, 3, 4. Now suppose the query classification system classifies Q1 into two topics: 2, 4, and Q2 into three topics: 2, 3, 6.

Compared with the correct answer, we can see that for Q1 only one topic (2) is correctly identified, and for Q2 two topics (2 and 3) are correctly identified. So the numerators in (6.1) and (6.2) equal 1 + 2 = 3.

The denominator in (6.1) is 2 + 3 = 5, since the query classification system classifies Q1 into 2 topics (only 1 of which is correct) and Q2 into 3 topics (only 2 of which are correct). In (6.2), the denominator is 3 + 4 = 7, since the correct number of topics for Q1 is 3 and for Q2 is 4. So precision = 3/5 and recall = 3/7.

There is a tradeoff between precision and recall. If the number of topics returned by the query classification system increases, the recall in (6.2) tends to increase, since its denominator is fixed while its numerator tends to increase. However, the precision in (6.1) tends to decrease, since both its denominator and numerator increase, but the denominator typically increases more than the numerator.

In the information retrieval community, the F1 value is designed to provide the harmonic mean of precision and recall:

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}. \tag{6.3}$$
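The following Python sketch (mine, not from the thesis) implements (6.1), (6.2), and (6.3) and reproduces Example 6.1:

```python
def evaluate(predicted, truth):
    """Precision, recall, and F1 per (6.1)-(6.3); `predicted` and `truth`
    map each query to its set of topic labels."""
    correct = sum(len(predicted[q] & truth[q]) for q in truth)
    n_pred = sum(len(predicted[q]) for q in truth)
    n_true = sum(len(truth[q]) for q in truth)
    precision, recall = correct / n_pred, correct / n_true
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example 6.1: the true topics and the system's returned topics.
truth = {"Q1": {2, 3, 6}, "Q2": {1, 2, 3, 4}}
predicted = {"Q1": {2, 4}, "Q2": {2, 3, 6}}
print(evaluate(predicted, truth))  # (0.6, 0.4285..., 0.5), i.e. 3/5, 3/7, 1/2
```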

6.1.3 The number of returned topics

In the KDD-Cup 2005 data, after mapping the categories to the GenieKnows topics, I find that the correct number of topics per query ranges between one and six. As discussed in Section 6.1.2, how many topics to return for each query is an important choice in my query classification system. To compare my system with the KDD-Cup 2005 results, I make the number of returned topics equal to the correct number of topics labeled by KDD-Cup 2005, which means that in (6.1) and (6.2) the denominators are the same, so precision = recall = F1 value.

6.2 Experiment Results

6.2.1 Choosing α and k

Penalized feature selection was introduced in Section 3.1.1, where α is the tuning parameter that controls the amount of penalty; α = 0 means no penalty. Penalized feature selection with different values of α produces different feature words. I recorded how many feature words change, compared with the no-penalization case, for different α values. The numbers of changes are shown in Table 6.2 (note: the total number of feature words is 11 × 18 = 198) and are also depicted in Figure 6.1.

Table 6.2: Number of changes to the feature words set for various values of penalty parameter α. Up to 198 changes are possible.

α         Number of changes
0         0
0.000015  11
0.00002   13
0.00004   21
0.00006   27
0.00008   35
0.0001    39
0.0002    57
0.0006    93
0.001     98
0.002     109
0.008     118
0.01      123

In subsequent sections, the α values 0 (no penalty), 0.000015, 0.00004, 0.00006, 0.0001, 0.0002, 0.0006, and 0.002 will be used, because the numbers of changed feature words for these α values are well separated. For example, 0.0006 and 0.002 are chosen but 0.001 is not, since the numbers of changes corresponding to 0.0006 and 0.001 are 93 and 98, a very small difference, whereas the number of changes corresponding to 0.002 is 109, which differs more from 93.

In Chapter 4, the k most similar feature words are added to each query to form a small text. I consider values of k from the set {4, 7, 10, 13, 16, 19, 22, 25, 28, 30}. My experiment finds that, as k increases, the F1 values either (1) first increase and then decrease, or (2) strictly decrease. This grid of k values therefore enables a near-optimal k to be selected.

Figure 6.1: Number of changed feature words compared to the no-penalty case

6.2.2 Notation

In this section, I introduce some notation that will be used later in this chapter. There are two methods to measure word similarity (cosine similarity and smoothed KL divergence), denoted Cos and KL. There are three classification methods to classify queries (naive Bayes Bernoulli, naive Bayes multinomial, and Dirichlet/multinomial), denoted NBBER, NBMUL and DIRI. Table 6.3 lists the notation and the corresponding meanings.

Table 6.3: Notation

Notation   Meaning
Cos+NBMUL  cosine similarity + naive Bayes multinomial model
Cos+NBBER  cosine similarity + naive Bayes Bernoulli model
Cos+DIRI   cosine similarity + Dirichlet/multinomial model
KL+NBMUL   smoothed KL divergence + naive Bayes multinomial model
KL+NBBER   smoothed KL divergence + naive Bayes Bernoulli model
KL+DIRI    smoothed KL divergence + Dirichlet/multinomial model
α          penalty parameter
k          number of feature words to be added for each query word

6.2.3 F1 values for the KDD-Cup data

The F1 values for the classification of the 702 queries into their appropriate topics, using different combinations of α and k for each of the six methods, are shown in Tables 6.4 - 6.9. Higher F1 values correspond to better performance. Several observations comparing the methods are made in the following sections.

Table 6.4: F1 values: Cos+NBMUL

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
k=4   0.356  0.355     0.354    0.349    0.354   0.351   0.353   0.358
k=7   0.363  0.359     0.362    0.36     0.347   0.336   0.338   0.342
k=10  0.36   0.366     0.36     0.352    0.347   0.348   0.35    0.347
k=13  0.358  0.364     0.362    0.358    0.355   0.352   0.35    0.353
k=16  0.364  0.362     0.358    0.355    0.353   0.342   0.343   0.343
k=19  0.358  0.364     0.366    0.36     0.348   0.354   0.339   0.346
k=22  0.355  0.359     0.363    0.353    0.346   0.342   0.34    0.345
k=25  0.358  0.358     0.36     0.356    0.354   0.355   0.345   0.342
k=28  0.354  0.36      0.36     0.358    0.351   0.345   0.333   0.339
k=30  0.355  0.364     0.358    0.355    0.356   0.342   0.33    0.336

Table 6.5: F1 values: Cos+NBBER

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
k=4   0.39   0.389     0.392    0.379    0.381   0.373   0.378   0.388
k=7   0.4    0.4       0.398    0.395    0.388   0.382   0.392   0.394
k=10  0.4    0.407     0.399    0.393    0.39    0.389   0.396   0.402
k=13  0.402  0.4       0.396    0.387    0.388   0.379   0.39    0.399
k=16  0.399  0.396     0.388    0.384    0.383   0.38    0.39    0.392
k=19  0.397  0.4       0.395    0.385    0.389   0.38    0.384   0.395
k=22  0.393  0.396     0.392    0.386    0.38    0.377   0.38    0.387
k=25  0.39   0.391     0.389    0.379    0.378   0.373   0.38    0.388
k=28  0.38   0.39      0.382    0.37     0.368   0.369   0.379   0.385
k=30  0.38   0.386     0.383    0.369    0.372   0.366   0.376   0.383

Table 6.6: F1 values: Cos+DIRI

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
4     0.347  0.35      0.344    0.339    0.347   0.34    0.331   0.32
7     0.359  0.36      0.362    0.361    0.355   0.354   0.347   0.339
10    0.365  0.37      0.357    0.364    0.357   0.358   0.352   0.345
13    0.379  0.382     0.376    0.37     0.364   0.368   0.355   0.34
16    0.386  0.384     0.375    0.372    0.351   0.357   0.343   0.347
19    0.393  0.393     0.376    0.377    0.364   0.362   0.345   0.353
22    0.392  0.386     0.374    0.368    0.364   0.362   0.342   0.357
25    0.394  0.388     0.373    0.37     0.363   0.358   0.344   0.356
28    0.382  0.38      0.363    0.36     0.358   0.358   0.345   0.356
30    0.375  0.376     0.364    0.361    0.353   0.354   0.338   0.35

Table 6.7: F1 values: KL+NBMUL

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
4     0.375  0.368     0.376    0.364    0.378   0.36    0.365   0.368
7     0.38   0.381     0.38     0.374    0.377   0.375   0.368   0.373
10    0.372  0.378     0.375    0.37     0.378   0.37    0.362   0.377
13    0.37   0.369     0.372    0.371    0.375   0.364   0.365   0.378
16    0.362  0.368     0.378    0.365    0.373   0.368   0.359   0.373
19    0.356  0.361     0.362    0.369    0.368   0.369   0.363   0.372
22    0.356  0.36      0.361    0.372    0.378   0.368   0.362   0.386
25    0.362  0.37      0.37     0.372    0.38    0.375   0.36    0.379
28    0.364  0.361     0.372    0.373    0.376   0.37    0.358   0.376
30    0.357  0.357     0.366    0.364    0.37    0.371   0.366   0.38

Table 6.8: F1 values: KL+NBBER

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
4     0.394  0.388     0.388    0.385    0.394   0.375   0.369   0.372
7     0.4    0.4       0.399    0.388    0.403   0.389   0.368   0.377
10    0.399  0.394     0.397    0.388    0.395   0.38    0.374   0.39
13    0.399  0.402     0.4      0.389    0.398   0.376   0.373   0.389
16    0.4    0.405     0.4      0.396    0.396   0.374   0.377   0.396
19    0.403  0.399     0.408    0.393    0.398   0.386   0.383   0.401
22    0.404  0.402     0.398    0.395    0.404   0.388   0.381   0.395
25    0.397  0.392     0.394    0.39     0.402   0.39    0.374   0.39
28    0.396  0.392     0.396    0.392    0.398   0.39    0.379   0.396
30    0.398  0.394     0.4      0.394    0.4     0.393   0.385   0.403

Table 6.9: F1 values: KL+DIRI

k/α   0      0.000015  0.00004  0.00006  0.0001  0.0002  0.0006  0.002
4     0.37   0.372     0.368    0.35     0.366   0.353   0.329   0.32
7     0.389  0.392     0.38     0.372    0.383   0.359   0.337   0.331
10    0.399  0.391     0.393    0.38     0.39    0.373   0.337   0.332
13    0.403  0.407     0.41     0.396    0.402   0.38    0.342   0.338
16    0.399  0.413     0.41     0.403    0.399   0.372   0.34    0.332
19    0.394  0.403     0.406    0.394    0.4     0.385   0.339   0.342
22    0.398  0.4       0.405    0.403    0.399   0.39    0.336   0.347
25    0.396  0.398     0.405    0.404    0.39    0.387   0.338   0.349
28    0.399  0.4       0.405    0.409    0.392   0.394   0.336   0.348
30    0.396  0.403     0.402    0.398    0.399   0.386   0.336   0.344

6.2.4 Comparison of word similarity measures

In Chapter 4, two methods for measuring word similarity (cosine similarity and smoothed KL divergence) were introduced. Here I compare these two methods. Figures 6.2 to 6.4 display the F1 values from Tables 6.4 - 6.9 for the three classification methods, using both cosine similarity and smoothed KL divergence. Various values of (k, α) are represented in the plots. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity.

Figure 6.2: F1 values for the naive Bayes multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity

Figure 6.3: F1 values for the naive Bayes Bernoulli model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity

Figure 6.4: F1 values for the Dirichlet/multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity

For the naive Bayes multinomial model in Figure 6.2, we can see that smoothed KL divergence performs better than cosine similarity, since all the red lines lie above the green lines.

For the naive Bayes Bernoulli model in Figure 6.3, smoothed KL divergence still performs better than cosine similarity. However, as the penalty parameter increases, the difference between smoothed KL divergence and cosine similarity decreases; in some cases, such as α = 0.0006, cosine similarity is slightly better than smoothed KL divergence.

For the Dirichlet/multinomial model in Figure 6.4, smoothed KL divergence performs better than cosine similarity when 0 ≤ α ≤ 0.0002, while cosine similarity performs better when α = 0.0006 and 0.002.

From these figures, we can conclude that smoothed KL divergence seems to be a better choice than cosine similarity.
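For reference, the sketch below shows the two measures under comparison, applied to per-topic count vectors for a pair of words. The add-one smoothing is an assumed stand-in for the smoothing scheme of Section 4.1.2, and the count vectors are made up.

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    def smoothed_kl(u, v):
        # Add-one smooth, normalize to distributions, then compute KL(p || q);
        # a smaller divergence means the two words are more similar.
        p = [a + 1 for a in u]
        q = [b + 1 for b in v]
        sp, sq = sum(p), sum(q)
        return sum((a / sp) * math.log((a / sp) / (b / sq))
                   for a, b in zip(p, q))

    u, v = [3, 0, 5], [2, 1, 4]  # made-up per-topic counts for two words
    print(cosine(u, v), smoothed_kl(u, v))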

6.2.5 Comparison of three classification methods

In Figures 6.5 and 6.6, the F1 values for the three classification methods are shown, one figure for each similarity measure. Black lines represent the naive Bayes multinomial model, green lines the naive Bayes Bernoulli model, and red lines the Dirichlet/multinomial model.

Figure 6.5: F1 values for the models using smoothed KL divergence. The black lines indicate the naive Bayes multinomial model, the green lines the naive Bayes Bernoulli model, and the red lines the Dirichlet/multinomial model

Figure 6.6: F1 values for the models using cosine similarity. The black lines indicate the naive Bayes multinomial model, the green lines the naive Bayes Bernoulli model, and the red lines the Dirichlet/multinomial model

For smoothed KL divergence, from Figure 6.5 we can see that when 0 ≤ α ≤ 0.0002, naive Bayes multinomial performs the worst, while naive Bayes Bernoulli and Dirichlet/multinomial perform similarly. When α = 0.0006 and 0.002, naive Bayes Bernoulli performs the best, Dirichlet/multinomial performs the worst, and naive Bayes multinomial lies between them.

For cosine similarity, from Figure 6.6 we can see that naive Bayes Bernoulli performs the best, naive Bayes multinomial performs the worst, and Dirichlet/multinomial lies between them.

In summary, naive Bayes Bernoulli is the strongest performer.

6.2.6 Comparison of feature word penalty parameters

In this section, I discuss whether the penalized feature selection introduced in Section 3.1.1 is helpful at all. Figures 6.7 and 6.8 show the F1 values for different values of the penalty parameter α; in these figures, different colors represent different penalty parameter values.

Figure 6.7: F1 values for methods using smoothed KL divergence. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial model. The different colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)

Figure 6.8: F1 values for methods using cosine similarity. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial model. The different colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)

In Figures 6.7 and 6.8, the black line (corresponding to no penalty) is not always the top line, which indicates that penalized feature selection can be helpful. In particular, α = 0.000015 (green), 0.00004 (red), and 0.00006 (yellow) perform better than α = 0 (no penalty) in many cases.

6.2.7 Comparison with the KDD-Cup 2005 competitors

In this section, I compare my outcomes with the results from the KDD-Cup 2005 competition, taken from the website http://www.sigkdd.org/kdd2005/kddcup.html. The F1 values for the best submissions are given in Table 6.10. In those results, precision and recall are not equal to the F1 values, but in my experiment precision = recall = F1. Since the F1 value is the harmonic mean of precision and recall, I use it as the measure for comparison. From Table 6.10, three teams achieved an F1 value above 0.4: ID 22 (F1 = 0.444), ID 37 (F1 = 0.426), and ID 8 (F1 = 0.405).
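Since Table 6.10 lists precision and F1 but not recall, recall can be recovered by inverting the harmonic mean. For example, for submission ID 22,

\[
F_1 = \frac{2PR}{P+R}
\quad\Rightarrow\quad
R = \frac{P\,F_1}{2P - F_1},
\qquad
R_{22} = \frac{0.4141 \times 0.4444}{2(0.4141) - 0.4444} \approx 0.480,
\]

so that submission's recall was noticeably higher than its precision.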

Some of these queries are not well suited to the GenieKnows taxonomy, since the GenieKnows taxonomy serves a local search engine, whereas the KDD-Cup queries are typically typed into a general search engine (e.g., Google). Even in this unfair circumstance, my query classification system can still compete with the top three results in the KDD-Cup 2005. The largest F1 value in my research is 0.413 (corresponding to KL+DIRI with k = 16 and α = 0.000015), which would rank among the top three in the KDD-Cup 2005 competition. In addition, most of the F1 values from my query classification system are above 0.36, while only four teams in the KDD-Cup 2005 competition achieved F1 values above 0.36. So my query classification system performs well, even though it uses a taxonomy designed for a local search engine to handle queries intended for a general search engine.

Table 6.10: KDD-Cup 2005 results

Submission ID  Precision  F1
1              0.145099   0.146839
2              0.116583   0.139732
3              0.339435   0.309754
4              0.110885   0.124228
5              0.31068    0.085639
6              0.254815   0.246264
7              0.263953   0.306359
8              0.454068   0.405453
9              0.264312   0.306612
10             0.334048   0.342248
11             0.107045   0.116521
12             0.196117   0.207787
13             0.326408   0.357127
14             0.317308   0.312812
15             0.271791   0.26545
16             0.050918   0.060285
17             0.264009   0.218436
18             0.206167   0.247854
19             0.136541   0.127008
20             0.127784   0.126848
21             0.340883   0.34009
22             0.414067   0.444395
23             0.237661   0.250293
24             0.244565   0.258035
25             0.753659   0.205391
26             0.255726   0.274579
27             0.206919   0.205302
28             0.148503   0.17614
29             0.171081   0.1985
30             0.145467   0.173173
31             0.108305   0.108174
32             0.16962    0.232654
33             0.469353   0.255096
34             0.198284   0.191618
35             0.32075    0.384136
36             0.211284   0.129937
37             0.423741   0.426123

Chapter 7

Conclusion and Future Work

In this thesis, I developed a new query classification system using the GenieKnows taxonomy. The system consists of three parts: in the first part, I selected a subset of feature words for each of the 18 topics; in the second part, I expanded each query by adding its most similar feature words; and in the third part, the expanded query was classified. Although the GenieKnows taxonomy is designed for local search, the experiments show that the proposed query classification system performs very well, even when dealing with queries intended for a general search engine.
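As a summary of how the three parts fit together, here is a minimal sketch of the pipeline; it is an illustration rather than the thesis implementation. The arguments similarity and score_topic are hypothetical stand-ins for the word-similarity measure of Chapter 4 and the per-topic scoring of one of the Chapter 5 classifiers, with k and n (the number of returned topics) as tuning parameters.

    def classify_query(query, feature_words, similarity, score_topic,
                       topics, k=16, n=3):
        # Part 2: expand each query word with its k most similar feature
        # words, forming a small text (Part 1, feature selection, is assumed
        # to have produced feature_words already).
        expanded = query.lower().split()
        for word in list(expanded):
            ranked = sorted(feature_words, key=lambda f: similarity(word, f),
                            reverse=True)
            expanded += ranked[:k]
        # Part 3: score every topic against the expanded text and return the
        # n highest-scoring topics.
        return sorted(topics, key=lambda t: score_topic(t, expanded),
                      reverse=True)[:n]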

However, there are some issues that need further attention.

1. In this thesis, I used KDD-Cup 2005 queries to test the performance of my query classification system. In the future, we could find a set of local queries to test its performance.

2. We could try other kinds of taxonomy, for example the GenieKnows game taxonomy, the Wikipedia taxonomy, or the Google taxonomy, to test its performance.

3. I intend to compare the performance of my query classification system with the existing methodologies introduced in Section 1.2.

4. My query classification system did not consider "personalization". A query may have different meanings: for the query "apple", one user may want information about the fruit "apple", while another may want information about the "apple" company. In the future, we could design a query classification system that considers the different information needs of different users and provides "personalized" classification results; a minimal sketch follows this list. A possible approach is to change the prior probability in (5.10) and (5.15) in a "personalized" way: instead of using Nj/N as the prior probability, each user could have his or her own "personalized" prior probability based on his or her search history.
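The following sketch illustrates that idea, assuming a user's history is simply the list of topics of his or her past queries. The backoff weight that blends the personal and global priors, and the topic names and counts, are my own illustrative assumptions.

    from collections import Counter

    def personalized_prior(user_history, topic_counts, weight=0.8):
        # topic_counts[j] plays the role of Nj, so nj / total is the global
        # prior Nj/N; the personal prior is the share of the user's past
        # queries in topic j, blended with the global prior.
        total = sum(topic_counts.values())
        hist = Counter(user_history)
        n_user = sum(hist.values())
        prior = {}
        for topic, nj in topic_counts.items():
            global_p = nj / total
            user_p = hist[topic] / n_user if n_user else global_p
            prior[topic] = weight * user_p + (1 - weight) * global_p
        return prior

    history = ["Computers/Internet", "Computers/Internet", "Food/Dining"]
    counts = {"Computers/Internet": 40, "Food/Dining": 60}
    print(personalized_prior(history, counts))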

Appendix A

Appendix (Feature Words)

Table A.1: Feature Words for Topic 1

0              0.000015       0.00004        0.00006
photography    photography    photography    photography
antique        antique        antique        antique
music          music          music          music
studio         studio         studio         studio
art            art            art            art
museum         museum         gallery        gallery
lounge         gallery        lounge         lounge
gallery        lounge         museum         museum
tavern         tavern         tavern         tavern
piano          piano          piano          piano
photo          photo          photo          photo

0.0001         0.0002         0.0006         0.002
photography    photography    photography    photography
antique        antique        antique        antique
music          music          music          music
studio         studio         art            lounge
art            art            gallery        art
gallery        gallery        portrait       orchestra
lounge         lounge         orchestra      tuning
museum         portrait       lounge         passport
portrait       entertainment  tuning         clown
bar            bar            passport       highschool
piano          piano          clown          jockey

Table A.2: Feature Words for Topic 2

0           0.000015    0.00004     0.00006
auto        auto        auto        auto
tire        tire        tire        tire
car         car         car         car
automotive  automotive  automotive  truck
truck       truck       truck       automotive
body        body        body        body
part        part        part        part
towing      repair      repair      repair
repair      towing      towing      towing
brake       motor       motor       motor
motor       brake       brake       sale

0.0001      0.0002      0.0006      0.002
auto        auto        auto        auto
tire        tire        tire        part
car         car         part        repair
truck       truck       repair      tuneup
automotive  automotive  car         goodwrench
body        part        tuneup      automobile
part        body        goodwrench  car
repair      repair      automobile  tire
towing      tuneup      automotive  automotive
tuneup      goodwrench  truck       truck
automobile  automobile  body        body

Table A.3: Feature Words for Topic 3

0           0.000015    0.00004     0.00006
realty      realty      realty      realty
marketing   marketing   marketing   marketing
storage     storage     storage     plumbing
plumbing    plumbing    plumbing    storage
funeral     funeral     funeral     funeral
sign        sign        real        real
real        real        pest        pest
school      pest        sign        printing
pest        printing    printing    laundry
embroidery  estate      estate      estate
printing    embroidery  laundry     staffing

0.0001      0.0002      0.0006      0.002
realty      realty      realty      realty
marketing   marketing   marketing   plumbing
plumbing    plumbing    plumbing    funeral
storage     funeral     funeral     pest
funeral     storage     storage     embroidery
real        pest        pest        laundromat
pest        real        laundry     catering
laundry     laundry     catering    auto
printing    printing    taxi        tailor
estate      staffing    upholstery  publishing
staffing    auto        taxidermy   taxi

Table A.4: Feature Words for Topic 4

0          0.000015   0.00004        0.00006
church     church     church         church
baptist    baptist    baptist        baptist
tax        tax        tax            tax
ministry   ministry   methodist      methodist
christ     christ     ministry       ministry
god        methodist  christ         filing
methodist  filing     filing         christ
filing     united     efile          efile
united     efile      united         united
efile      god        christian      christian
christian  christian  center         center

0.0001     0.0002     0.0006         0.002
church     church     church         church
baptist    baptist    baptist        baptist
tax        tax        tax            tax
methodist  methodist  efile          efile
filing     efile      methodist      lcsw
efile      filing     filing         filing
united     united     lcsw           cremation
christ     lcsw       cremation      investigation
ministry   christian  investigation  methodist
christian  center     electronic     electronic
lcsw       electronic center         club

Table A.5: Feature Words for Topic 5

0           0.000015    0.00004     0.00006
computer    computer    computer    computer
printing    printing    printing    printing
software    software    software    software
technology  technology  technology  technology
data        data        system      system
system      system      data        printer
solution    solution    printer     viru
network     network     viru        data
wireless    printer     network     network
printer     wireless    wireless    spyware
web         viru        solution    wireless

0.0001      0.0002      0.0006      0.002
computer    computer    computer    computer
printing    printing    printing    viru
software    software    software    copy
system      system      viru        letterhead
technology  viru        printer     envelope
viru        printer     letterhead  hosting
printer     spyware     hosting     press
spyware     letterhead  spyware     computing
letterhead  hosting     computing   newsletter
hosting     newsletter  newsletter  brochure
newsletter  brochure    brochure    microsoft

Table A.6: Feature Words for Topic 6

0             0.000015      0.00004       0.00006
construction  construction  construction  construction
plumbing      plumbing      plumbing      plumbing
heating       heating       heating       heating
residential   residential   residential   roofing
roofing       roofing       roofing       residential
commercial    electric      electric      electric
architect     commercial    contractor    contractor
electric      architect     commercial    builder
contractor    contractor    builder       estimate
insured       builder       concrete      concrete
concrete      concrete      estimate      air

0.0001        0.0002        0.0006        0.002
construction  construction  construction  construction
plumbing      plumbing      heating       heating
heating       heating       plumbing      contractor
roofing       contractor    contractor    locksmith
residential   electric      locksmith     building
contractor    roofing       building      plumbing
electric      estimate      estimate      estimate
builder       locksmith     electric      electrical
estimate      building      electrical    kitchen
building      builder       builder       electric
locksmith     electrical    roof          roof

Table A.7: Feature Words for Topic 7

0           0.000015    0.00004     0.00006
school      school      school      school
elementary  elementary  elementary  elementary
district    district    district    district
academy     academy     academy     academy
preschool   preschool   preschool   preschool
college     college     college     college
middle      middle      education   education
learning    learning    middle      learning
education   education   learning    middle
montessori  gallery     gallery     gallery
gallery     high        art         art

0.0001      0.0002      0.0006      0.002
school      school      school      school
elementary  elementary  elementary  elementary
district    district    district    district
academy     college     kumon       kumon
preschool   academy     perk        perk
college     education   education   training
education   preschool   training    education
learning    gallery     gallery     gallery
gallery     kumon       college     art
middle      perk        art         high
art         art         high        center

Table A.8: Feature Words for Topic 8

0           0.000015     0.00004      0.00006
restaurant  restaurant   restaurant   restaurant
pizza       pizza        pizza        pizza
cafe        cafe         cafe         cafe
food        food         food         food
liquor      liquor       liquor       liquor
deli        deli         grocery      grocery
market      grocery      market       market
grocery     market       deli         deli
bakery      bakery       coffee       coffee
coffee      coffee       grill        grill
grill       grill        mexican      store

0.0001      0.0002       0.0006       0.002
restaurant  restaurant   restaurant   restaurant
pizza       pizza        pizza        pizza
cafe        cafe         food         food
food        food         grocery      grocery
liquor      liquor       liquor       mastercard
grocery     grocery      mastercard   store
market      market       store        convenience
deli        mastercard   market       liquor
coffee      store        cafe         sandwiche
store       coffee       convenience  market
mastercard  convenience  coffee       coffee

Table A.9: Feature Words for Topic 9

0             0.000015      0.00004       0.00006
md            md            md            md
dd            dd            dd            dd
dr            dr            dr            dr
dentistry     dentistry     dentistry     dentistry
surgery       surgery       surgery       surgery
dental        dental        dental        dental
chiropractic  chiropractic  chiropractic  chiropractic
dmd           dmd           dmd           dmd
medicine      medicine      medicine      medicine
patient       patient       patient       patient
dc            dc            health        health

0.0001        0.0002        0.0006        0.002
md            md            md            md
dd            dd            dd            dd
dr            dr            dr            dentistry
dentistry     dentistry     dentistry     chiropractic
surgery       surgery       chiropractic  dental
dental        chiropractic  dental        disease
chiropractic  dental        surgery       medical
medicine      patient       medical       whitening
patient       health        health        canal
dmd           medical       care          lense
health        care          whitening     health

Table A.10: Feature Words for Topic 10

0            0.000015     0.00004      0.00006
furniture    furniture    furniture    furniture
residential  residential  residential  residential
door         door         door         door
interior     interior     interior     interior
carpet       carpet       pest         pest
pest         pest         carpet       carpet
flower       window       window       window
window       flower       flower       flower
commercial   estimate     estimate     estimate
estimate     florist      florist      florist
florist      commercial   cleaning     cleaning

0.0001       0.0002       0.0006       0.002
furniture    furniture    furniture    furniture
residential  pest         pest         pest
interior     door         florist      florist
pest         interior     interior     tile
door         carpet       siding       siding
carpet       window       insured      insured
window       estimate     control      control
estimate     appliance    appliance    floral
florist      florist      laminate     feed
flower       control      feed         appraisal
appliance    insured      appraisal    welding

Table A.11: Feature Words for Topic 11

0              0.000015       0.00004        0.00006
feed           feed           feed           feed
metal          metal          metal          metal
supply         supply         supply         supply
hardware       hardware       hardware       engineer
environmental  environmental  engineer       hardware
engineer       engineer       environmental  environmental
consulting     consulting     consulting     consulting
meat           meat           meat           teamster
steel          local          teamster       beauty
engineering    steel          beauty         meat
local          beauty         local          industrial

0.0001         0.0002         0.0006         0.002
feed           feed           feed           feed
metal          metal          metal          metal
supply         supply         supply         environmental
engineer       engineer       engineer       meat
hardware       consulting     crane          beauty
environmental  hardware       teamster       propane
consulting     teamster       syndicat       teamster
teamster       syndicat       exploration    aircraft
syndicat       environmental  afl            syndicat
exploration    exploration    aflcio         exploration
afl            afl            local          afl

Table A.12: Feature Words for Topic 12

0          0.000015   0.00004     0.00006
law        law        law         law
atty       atty       atty        atty
attorney   attorney   attorney    attorney
insurance  insurance  insurance   insurance
financial  financial  financial   financial
cpa        cpa        agency      agency
agency     agency     criminal    criminal
criminal   criminal   cpa         injury
divorce    injury     injury      efile
injury     filing     efile       filing
filing     efile      filing      cpa

0.0001     0.0002     0.0006      0.002
law        law        law         law
atty       atty       atty        atty
attorney   attorney   insurance   insurance
insurance  insurance  attorney    financial
financial  financial  financial   efile
agency     efile      efile       death
criminal   filing     death       electronic
efile      injury     wrongful    fdic
injury     agency     filing      felony
filing     death      electronic  misdemeanor
death      criminal   fdic        wrongful

Table A.13: Feature Words for Topic 13

0              0.000015       0.00004        0.00006
advertising    advertising    advertising    advertising
communication  communication  communication  communication
fm             fm             fm             fm
tv             tv             tv             radio
radio          radio          radio          tv
wireless       wireless       wireless       wireless
cellular       cellular       cellular       cellular
production     production     magazine       magazine
magazine       magazine       production     production
video          video          video          video
publication    publication    publication    publication

0.0001         0.0002         0.0006         0.002
advertising    advertising    advertising    advertising
communication  communication  communication  fm
fm             radio          wireless       communication
radio          wireless       radio          wireless
tv             tv             publisher      broadcasting
wireless       cellular       firm           publisher
cellular       magazine       vcr            film
magazine       video          duplication    vcr
video          publisher      lcd            duplication
production     publication    tv             lcd
publication    answering      video          radio

Table A.14: Feature Words for Topic 14

0               0.000015        0.00004         0.00006
hair            hair            hair            hair
salon           salon           salon           salon
nail            nail            nail            nail
beauty          beauty          beauty          beauty
barber          barber          barber          barber
therapy         therapy         therapy         therapy
massage         care            care            care
care            massage         massage         massage
tanning         tanning         tanning         physical
styling         physical        physical        tanning
coiffure        rehabilitation  rehabilitation  rehabilitation

0.0001          0.0002          0.0006          0.002
hair            hair            hair            hair
salon           salon           salon           salon
nail            nail            nail            beauty
beauty          beauty          beauty          nail
barber          barber          barber          rehabilitation
therapy         care            rehabilitation  physical
care            therapy         care            assisted
massage         physical        physical        child
physical        assisted        assisted        men
rehabilitation  rehabilitation  men             stable
tanning         men             stable          counseling

Table A.15: Feature Words for Topic 15

0            0.000015     0.00004      0.00006
realty       realty       realty       realty
apartment    apartment    apartment    apartment
real         real         real         real
estate       estate       estate       estate
property     property     property     property
apt          apt          apt          apt
realtor      title        title        title
title        assisted     assisted     assisted
development  realtor      home         activity
assisted     management   activity     home
management   development  management   multiple

0.0001       0.0002       0.0006       0.002
realty       realty       realty       realty
apartment    apartment    apartment    apartment
real         real         real         real
estate       estate       estate       property
property     property     property     estate
title        assisted     activity     activity
assisted     activity     acreage      acreage
apt          acreage      multiple     multiple
activity     multiple     assisted     assisted
acreage      title        home         home
multiple     home         appraiser    appraiser

Table A.16: Feature Words for Topic 16

0         0.000015  0.00004    0.00006
gift      gift      gift       gift
jewelry   jewelry   jewelry    jewelry
jeweler   jeweler   jeweler    jeweler
flower    flower    flower     flower
florist   florist   florist    florist
boutique  boutique  shoe       shoe
book      shoe      book       book
shoe      book      fashion    fashion
ftd       pawn      tobacco    tobacco
pawn      tobacco   pawn       pawn
tobacco   fashion   store      store

0.0001    0.0002    0.0006     0.002
gift      gift      gift       gift
jewelry   jewelry   jewelry    ftd
jeweler   flower    ftd        carpet
flower    jeweler   carpet     heating
florist   florist   fashion    seiko
shoe      carpet    heating    comic
book      shoe      seiko      laminate
carpet    fashion   comic      jewelry
fashion   book      store      store
store     store     laminate   teleflora
heating   heating   teleflora  fashion

Table A.17: Feature Words for Topic 17

0           0.000015    0.00004     0.00006
golf        golf        golf        golf
locksmith   locksmith   locksmith   locksmith
dance       dance       dance       dance
martial     martial     martial     martial
club        club        club        club
fitness     fitness     marine      marine
marine      marine      fitness     fitness
karate      campground  campground  campground
campground  yoga        bait        bait
marina      bait        yoga        yoga
yoga        course      course      course

0.0001      0.0002      0.0006      0.002
golf        golf        golf        golf
locksmith   locksmith   locksmith   locksmith
dance       dance       dance       dance
martial     martial     martial     martial
club        club        club        marine
marine      marine      marine      club
campground  campground  campground  campground
fitness     bait        tae         tae
bait        yoga        yoga        hobby
tae         fitness     surveying   surveying
yoga        tae         firearm     firearm

Table A.18: Feature Words for Topic 18

0          0.000015   0.00004    0.00006
travel     travel     travel     travel
inn        inn        inn        inn
motel      motel      motel      motel
limousine  limousine  limousine  limousine
hotel      hotel      hotel      hotel
resort     resort     resort     resort
tour       tour       tour       tour
trucking   trucking   trucking   trucking
cruise     cruise     moving     moving
moving     moving     cruise     cruise
breakfast  breakfast  breakfast  breakfast

0.0001     0.0002     0.0006     0.002
travel     travel     travel     travel
inn        inn        inn        inn
motel      motel      motel      limousine
limousine  limousine  limousine  motel
hotel      hotel      trucking   trucking
tour       tour       tour       moving
resort     resort     hotel      hbo
trucking   trucking   moving     ambulance
moving     moving     resort     transit
cruise     breakfast  hbo        forwarding
breakfast  cruise     ambulance  inroom

Bibliography

Beitzel, S. M., Jensen, E. C., Lewis, D. D., Chowdhury, A. (2005a), Automatic Web Query Classification Using Labeled and Unlabeled Training Data, in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 581-582.

Beitzel, S. M., Jensen, E. C., Lewis, D. D., Chowdhury, A., Kolcz, A. and Frieder, O. (2005b), Improving Automatic Query Classification via Semi-supervised Learning, in IEEE ICDM 2005, page 42-49.

Beitzel, S. M., Jensen, E. C., Chowdhury, A., Frieder, O. (2007), Varying Approaches to Topical Web Query Classification, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 783-784.

Berry, M. W. and Browne, M. (2005), Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.

Bhandari, S. K. and Davison, B. D. (2007), Leveraging Search Engine Results for Query Classification. Technical Report LU-CSE-07-013, Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA, 18015.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer Science+Business Media, LLC.

Brin, S. and Page, L. (1998), The Anatomy of a Large-scale Hypertextual Web Search Engine, in Proceedings of the 7th International World Wide Web Conference, page 108-117.

Broder, A., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., Zhang, T. (2007), Robust Classification of Rare Queries Using Web Knowledge, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Chakrabarti, S. (2003), Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers.

Chuang, S. L. and Chien, L. F. (2003), Automatic Query Taxonomy Generation for Information Retrieval Applications, Online Information Review, Volume 27, Issue 4, page 243-255.

Haveliwala, T. H. (2002), Topic-Sensitive PageRank, in Proceedings of the 11th International World Wide Web Conference, page 517-526.

Kang, I. and Kim, G. (2003), Query Type Classification for Web Document Retrieval, in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 64-71.

Liu, B. (2007), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer, Berlin, New York.

Manning, C. D., Raghavan, P. and Schütze, H. (2008), Introduction to Information Retrieval, Cambridge University Press.

McCallum, A. and Nigam, K. (1998), A Comparison of Event Models for Naive Bayes Text Classification, in AAAI-98 Workshop on Learning for Text Categorization.

Qiu, F. and Cho, J. (2006), Automatic Identification of User Interest For Personalized Search, in Proceedings of the 15th International World Wide Web Conference, page 727-736.

Robert, C. P. (2007), The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, second edition, Springer.

Rose, D. E. and Levinson, D. (2004), Understanding User Goals in Web Search, in Proceedings of the 13th International World Wide Web Conference, page 13-19.

Shannon, C. (1948), A Mathematical Theory of Communication, Bell System Technical Journal.

Shen, D., Pan, R., Sun, J. T., Pan, J. J., Wu, K., Yin, J. and Yang, Q. (2005), Q2C@UST: Our Winning Solution to Query Classification in KDDCUP 2005. SIGKDD Explorations, Volume 7, Issue 2.

Shen, D., Sun, J. T., Yang, Q., and Chen, Z. (2006), Building Bridges for Web Query Classification, in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 131-138.

Yang, Y. and Pedersen, J. O. (1997), A Comparative Study on Feature Selection in Text Categorization, in Proceedings of the 14th International Conference on Machine Learning, page 412-420.
