Robust Classification of Rare Queries Using Web Knowledge

Andrei Broder, Marcus Fontoura, Evgeniy Gabrilovich, Amruta Joshi, Vanja Josifovski, Tong Zhang
Yahoo! Research, 2821 Mission College Blvd, Santa Clara, CA 95054
{broder | marcusf | gabr | amrutaj | vanjaj | tzhang}@yahoo-inc.com

ABSTRACT

We propose a methodology for building a practical robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real-time with the query volume of a commercial web search engine. We use a blind feedback technique: given a query, we determine its topic by classifying the web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregation account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—relevance feedback, search process

General Terms
Algorithms, Measurement, Performance, Experimentation

Keywords
Query classification, Web search, blind relevance feedback

1. INTRODUCTION

In its 12 year lifetime, web search has grown tremendously: it has simultaneously become a factor in the daily life of maybe a billion people and at the same time an eight billion dollar industry fueled by web advertising. One thing, however, has remained constant: people use very short queries. Various studies estimate the average length of a search query between 2.4 and 2.7 words, which by all accounts can carry only a small amount of information. Commercial search engines do a remarkably good job in interpreting these short strings, but they are not (yet!) omniscient. Therefore, using additional external knowledge to augment the queries can go a long way in improving the search results and the user experience.

At the same time, better understanding of query meaning has the potential of boosting the economic underpinning of Web search, namely, online advertising, via the sponsored search mechanism that places relevant advertisements alongside search results. For instance, knowing that the query "SD450" is about cameras while "nc4200" is about laptops can obviously lead to more focused advertisements even if no advertiser has specifically bidden on these particular queries.

In this study we present a methodology for query classification, where our aim is to classify queries onto a commercial taxonomy of web queries with approximately 6000 nodes. Given such classifications, one can directly use them to provide better search results as well as more focused ads. The problem of query classification is extremely difficult owing to the brevity of queries. Observe, however, that in many cases a human looking at a search query and the search query results does remarkably well in making sense of it. Of course, the sheer volume of search queries does not lend itself to human supervision, and therefore we need alternate sources of knowledge about the world. For instance, in the example above, "SD450" brings pages about Canon cameras, while "nc4200" brings pages about Compaq laptops, hence to a human the intent is quite clear.

Search engines index colossal amounts of information, and as such can be viewed as very comprehensive repositories of knowledge. Following the heuristic described above, we propose to use the search results themselves to gain additional insights for query interpretation. To this end, we employ the pseudo relevance feedback paradigm, and assume the top search results to be relevant to the query. Certainly, not all results are equally relevant, and thus we use elaborate voting schemes in order to obtain reliable knowledge about the query. For the purpose of this study we first dispatch the given query to a general web search engine, and collect a number of the highest-scoring URLs. We crawl the Web pages pointed to by these URLs, and classify these pages. Finally, we use these result-page classifications to classify the original query. Our empirical evaluation confirms that using Web search results in this manner yields substantial improvements in the accuracy of query classification.

Note that in a practical implementation of our methodology within a commercial search engine, all indexed pages can be pre-classified using the normal text-processing and indexing pipeline. Thus, at run-time we only need to run the voting procedure, without doing any crawling or classification. This additional overhead is minimal, and therefore the use of search results to improve query classification is entirely feasible at run-time.

Another important aspect of our work lies in the choice of queries. The volume of queries in today's search engines follows the familiar power law, where a few queries appear very often while most queries appear only a few times. While individual queries in this long tail are rare, together they account for a considerable mass of all searches. Furthermore, the aggregate volume of such queries provides a substantial opportunity for income through on-line advertising.¹

¹ In the above examples, "SD450" and "nc4200" represent fairly old gadget models, and hence there are advertisers placing ads on these queries. However, in this paper we mainly deal with rare queries which are extremely difficult to match to relevant ads.

Searching and advertising platforms can be trained to yield good results for frequent queries, including auxiliary data such as maps, shortcuts to related structured information, successful ads, and so on. However, the "tail" queries simply do not have enough occurrences to allow statistical learning on a per-query basis. Therefore, we need to aggregate such queries in some way, and to reason at the level of aggregated query clusters. A natural choice for such aggregation is to classify the queries into a topical taxonomy. Knowing which taxonomy nodes are most relevant to the given query will aid us to provide the same type of support for rare queries as for frequent queries. Consequently, in this work we focus on the classification of rare queries, whose correct classification is likely to be particularly beneficial.

Early studies in query interpretation focused on query augmentation through external dictionaries [22]. More recent studies [18, 21] also attempted to gather some additional knowledge from the Web. However, these studies had a number of shortcomings, which we overcome in this paper. Specifically, earlier works in the field used very small query classification taxonomies of only a few dozens of nodes, which do not allow ample specificity for online advertising [11]. They also used a separate ancillary taxonomy for Web documents, so that an extra level of indirection had to be employed to establish the correspondence between the ancillary and the main taxonomies [18].

The main contributions of this paper are as follows. First, we build the query classifier directly for the target taxonomy, instead of using a secondary auxiliary structure; this greatly simplifies taxonomy maintenance and development. The taxonomy used in this work is two orders of magnitude larger than that used in prior studies. The empirical evaluation demonstrates that our methodology for using external knowledge achieves greater improvements than those previously reported. Since our taxonomy is considerably larger, the classification problem we face is much more difficult, making the improvements we achieve particularly notable. We also report the results of a thorough empirical study of different voting schemes and different depths of knowledge (e.g., using search summaries vs. entire crawled pages). We found that crawling the search results yields deeper knowledge and leads to greater improvements than mere summaries. This result is in contrast with prior findings in query classification [20], but is supported by research in mainstream text classification [5].
2. METHODOLOGY

Our methodology has two main phases. In the first phase, we construct a document classifier for classifying search results into the same taxonomy into which queries are to be classified. In the second phase, we develop a query classifier that invokes the document classifier on search results, and uses the latter to perform query classification.

2.1 Building the document classifier

In this work we used a commercial classification taxonomy of approximately 6000 nodes used in a major US search engine (see Section 3.1). Human editors populated the taxonomy nodes with labeled examples that we used as training instances to learn a document classifier in phase 1.

Given a taxonomy of this size, the computational efficiency of classification is a major issue. Few machine learning algorithms can efficiently handle so many different classes, each having hundreds of training examples. Suitable candidates include the nearest neighbor and the Naive Bayes classifier [3], as well as prototype formation methods such as Rocchio [15] or centroid-based [7] classifiers. A recent study [5] showed centroid-based classifiers to be both effective and efficient for large-scale taxonomies, and consequently we used a centroid classifier in this work.
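To make the centroid approach concrete, the sketch below shows a minimal centroid classifier of the kind cited above (not the production classifier used in this work): each class is represented by the mean of the L2-normalized term vectors of its labeled training documents, and a new document is scored against every centroid by cosine similarity. The vector representation, helper names, and the top-3 cutoff are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def build_centroids(train_vectors, train_labels):
    """Average the L2-normalized document vectors of each class."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for vec, label in zip(train_vectors, train_labels):
        vec = vec / (np.linalg.norm(vec) + 1e-12)
        sums[label] = vec if sums[label] is None else sums[label] + vec
        counts[label] += 1
    return {c: sums[c] / counts[c] for c in sums}

def classify_document(doc_vector, centroids, top_n=3):
    """Return the top-n classes by cosine similarity to the class centroids."""
    doc_vector = doc_vector / (np.linalg.norm(doc_vector) + 1e-12)
    scores = {c: float(np.dot(doc_vector, centroid) /
                       (np.linalg.norm(centroid) + 1e-12))
              for c, centroid in centroids.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
```

In practice, sparse vectors and an inverted index over centroid terms would presumably be needed for this to scale to a taxonomy of roughly 6000 classes.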
2.2 Query classification by search

Having developed a document classifier for the query taxonomy, we now turn to the problem of obtaining a classification for a given query based on the initial search results it yields. Let us assume that there is a set of documents D = d_1 ... d_m indexed by a search engine. The search engine can then be represented by a function f(q, d) = similarity(q, d) that quantifies the affinity between a query q and a document d. Examples of such affinity scores used in this paper are rank—the rank of the document in the ordered list of search results; static score—the score of the goodness of the page regardless of the query (e.g., PageRank); and dynamic score—the closeness of the query and the document.

Query classification is determined by first evaluating the conditional probabilities of all possible classes P(C_j|q), and then selecting the alternative with the highest probability, C_max = arg max_{C_j in C} P(C_j|q). Our goal is to estimate the conditional probability of each possible class using the search results initially returned by the query. We use the following formula that incorporates classifications of individual search results:

  P(C_j|q) = \sum_{d \in D} P(C_j|q, d) \, P(d|q) = \sum_{d \in D} \frac{P(q|C_j, d)}{P(q|d)} \, P(C_j|d) \, P(d|q).

We assume that P(q|C_j, d) ≈ P(q|d), that is, the probability of a query given a document can be determined without knowing the class of the query. This is the case for the majority of queries that are unambiguous. Counter-examples are queries like 'jaguar' (animal and car brand) or 'apple' (fruit and computer manufacturer), but such ambiguous queries cannot be classified by definition, and usually consist of common words. In this work we concentrate on rare queries, which tend to contain rare words, be longer, and match fewer documents; consequently, in our setting this assumption mostly holds. Using this assumption, we can write

  P(C_j|q) = \sum_{d \in D} P(C_j|d) \, P(d|q).

The conditional probability of a classification for a given document, P(C_j|d), is estimated using the output of the document classifier (Section 2.1). While P(d|q) is harder to compute, we consider the underlying relevance model for ranking documents given a query. This issue is further explored in the next section.

2.3 Classification-based relevance model

In order to describe a formal relationship of classification and ad placement (or search), we consider a model for using classification to determine ad (or search) relevance. Let a be an ad and q be a query; we denote by R(a, q) the relevance of a to q. This number indicates how relevant the ad a is to query q, and can be used to rank ads a for a given query q. In this paper, we consider the following approximation of the relevance function:

  R(a, q) ≈ R_C(a, q) = \sum_{C_j \in C} w(C_j) \, s(C_j, a) \, s(C_j, q).    (1)

The right-hand side expresses how we use the classification scheme C to rank ads, where s(c, a) is a scoring function that specifies how likely a is in class c, and s(c, q) is a scoring function that specifies how likely q is in class c. The value w(c) is a weighting term for category c, indicating the importance of category c in the relevance formula.

This relevance function is an adaptation of traditional word-based retrieval rules. For example, we may let categories be the words in the vocabulary. We take s(C_j, a) as the word counts of C_j in a, s(C_j, q) as the word counts of C_j in q, and w(C_j) as the IDF term weighting for word C_j. With such choices, the method given by (1) becomes the standard TFIDF retrieval rule.

If we take s(C_j, a) = P(C_j|a), s(C_j, q) = P(C_j|q), and w(C_j) = 1/P(C_j), and assume that q and a are independently generated given a hidden concept C, then we have

  R_C(a, q) = \sum_{C_j \in C} P(C_j|a) P(C_j|q) / P(C_j) = \sum_{C_j \in C} P(C_j|a) P(q|C_j) / P(q) = P(q|a) / P(q).

That is, the ads are ranked according to P(q|a). This relevance model has been employed in various statistical language modeling techniques for information retrieval. The intuition can be described as follows. We assume that a person searches for an ad a by constructing a query q: the person first picks a concept C_j according to the weights P(C_j|a), and then constructs a query q with probability P(q|C_j) based on the concept C_j. For this query generation process, the ads can be ranked based on how likely the observed query is generated from each ad.

It should be mentioned that in our case, each query and ad can have multiple categories. For simplicity, we denote by C_j a random variable indicating whether q belongs to category C_j. We use P(C_j|q) to denote the probability of q belonging to category C_j. Here the sum \sum_{C_j \in C} P(C_j|q) may not equal one. We then consider the following ranking formula:

  R_C(a, q) = \sum_{C_j \in C} P(C_j|a) \, P(C_j|q).    (2)

We assume the estimation of P(C_j|a) is based on an existing text-categorization system (which is known). Thus, we only need to obtain estimates of P(C_j|q) for each query q.

Equation (2) is the ad relevance model that we consider in this paper, with unknown parameters P(C_j|q) for each query q. In order to obtain their estimates, we use search results from major US search engines, where we assume that the ranking formula in (2) gives good ranking for search. That is, top results ranked by search engines should also be ranked high by this formula. Therefore, given a query q and the top K result pages d_1(q), ..., d_K(q) from a major search engine, we fit the parameters P(C_j|q) so that R_C(d_i(q), q) have high scores for i = 1, ..., K. It is worth mentioning that using this method we can only compute the relative strength of P(C_j|q), but not the scale, because scale does not affect ranking. Moreover, it is possible that the parameters estimated may be of the form g(P(C_j|q)) for some monotone function g(·) of the actual conditional probability P(C_j|q). Although this may change the meaning of the unknown parameters that we estimate, it does not affect the quality of using the formula to rank ads. Nor does it affect query classification with appropriately chosen thresholds. In what follows, we consider two methods to compute the classification information P(C_j|q).
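As an illustration of how Equation (2) can be applied once both class distributions are available, the following sketch ranks candidate ads for a query by the inner product of their class distributions. The dictionary-based representation and the function names are assumptions made for the example, not part of the paper's system.

```python
def relevance_score(ad_classes, query_classes):
    """R_C(a, q) = sum over C_j of P(C_j|a) * P(C_j|q), as in Equation (2)."""
    shared = set(ad_classes) & set(query_classes)
    return sum(ad_classes[c] * query_classes[c] for c in shared)

def rank_ads(candidate_ads, query_classes):
    """Order candidate ads (each a dict of class -> P(C_j|a)) by decreasing relevance."""
    return sorted(candidate_ads, key=lambda ad: -relevance_score(ad, query_classes))
```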
2.4 The voting method

We would like to compute P(C_j|q) so that R_C(d_i(q), q) are high for i = 1, ..., K and R_C(d, q) are low for a random document d. Assume that the vector [P(C_j|d)]_{C_j \in C} is random for an average document; then the condition that \sum_{C_j \in C} P(C_j|q)^2 is small implies that R_C(d, q) is also small when averaged over d. Thus, a natural method is to maximize \sum_{i=1}^{K} w_i R_C(d_i(q), q) subject to \sum_{C_j \in C} P(C_j|q)^2 being small, where w_i are weights associated with each rank i:

  \max_{[P(\cdot|q)]} \left[ \sum_{i=1}^{K} w_i \sum_{C_j \in C} P(C_j|d_i(q)) \, P(C_j|q) \; - \; \lambda \sum_{C_j \in C} P(C_j|q)^2 \right],

where we assume \sum_{i=1}^{K} w_i = 1, and λ > 0 is a tuning regularization parameter. The optimal solution is

  P(C_j|q) = \frac{1}{2\lambda} \sum_{i=1}^{K} w_i \, P(C_j|d_i(q)).

Since both P(C_j|d_i(q)) and P(C_j|q) belong to [0, 1], we may just take λ = 0.5 to align the scale. In the experiments, we will simply take uniform weights w_i. A more complex strategy is to let w depend on d as well:

  P(C_j|q) = \sum_{d} w(d, q) \, g(P(C_j|d)),

where g(x) is a certain transformation of x. In this general formulation, w(d, q) may depend on factors other than the rank of d in the search engine results for q. For example, it may be a function of r(d, q), where r(d, q) is the relevance score returned by the underlying search engine. Moreover, if we are given a set of hand-labeled training category/query pairs (C, q), then both the weights w(d, q) and the transformation g(·) can be learned using standard classification techniques.
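With λ = 0.5 and uniform weights, the closed-form solution above amounts to averaging the class distributions of the top-K search results. A minimal sketch of this voting step, assuming each result's distribution P(C_j|d_i(q)) is available from the document classifier of Section 2.1 as a dict of class names to probabilities:

```python
from collections import defaultdict

def vote(result_class_dists, weights=None, lam=0.5):
    """P(C_j|q) = (1 / (2*lam)) * sum_i w_i * P(C_j|d_i(q)); uniform w_i by default."""
    k = len(result_class_dists)
    if weights is None:
        weights = [1.0 / k] * k          # uniform weights summing to 1
    query_classes = defaultdict(float)
    for w_i, dist in zip(weights, result_class_dists):
        for c, p in dist.items():
            query_classes[c] += w_i * p / (2.0 * lam)
    return dict(query_classes)
```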
2.5 Discriminative classification

We can treat the problem of estimating P(C_j|q) as a classification problem, where for each q we label d_i(q) for i = 1, ..., K as positive data, and the remaining documents as negative data. That is, we assign label y_i(q) = 1 for d_i(q) when i ≤ K, and label y_i(q) = −1 for d_i(q) when i > K.

In this setting, the classification scoring rule for a document d_i(q) is linear. Let x_i(q) = [P(C_j|d_i(q))]_{C_j \in C} and w = [P(C_j|q)]_{C_j \in C}; then \sum_{C_j \in C} P(C_j|q) P(C_j|d_i(q)) = w · x_i(q). The values P(C_j|d) are the features for the linear classifier, and [P(C_j|q)] is the weight vector, which can be computed using any linear classification method. In this paper, we consider estimating w using logistic regression [17] as follows:

  P(\cdot|q) = \arg\min_{w} \sum_{i} \ln\left(1 + e^{-w \cdot x_i(q) \, y_i(q)}\right).
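A sketch of this discriminative variant, assuming the feature vectors x_i(q) = [P(C_j|d_i(q))] are given as rows of a matrix and the labels are ±1; plain gradient descent on the logistic loss stands in here for whatever solver was actually used:

```python
import numpy as np

def fit_query_class_weights(X, y, steps=500, lr=0.1):
    """Minimize sum_i ln(1 + exp(-y_i * w.x_i)) over w, as in Section 2.5.

    X: (n, num_classes) array of P(C_j|d_i) features; y: array of labels in {+1, -1}.
    Returns w, interpreted (up to scale) as the class affinities of the query.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the logistic loss: -sum_i x_i * y_i / (1 + exp(margin_i))
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).sum(axis=0)
        w -= lr * grad
    return w
```

As noted in Section 2.3, the resulting w is only meaningful up to a monotone transformation, which is sufficient for ranking classes and for thresholded classification.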
3. EVALUATION

In this section, we evaluate our methodology that uses Web search results for improving query classification.

3.1 Taxonomy

Our choice of taxonomy was guided by a Web advertising application. Since we want the classes to be useful for matching ads to queries, the taxonomy needs to be elaborate enough to facilitate ample classification specificity. For example, classifying all medical queries into one node will likely result in poor ad matching, as both "sore foot" and "flu" queries will end up in the same node. The ads appropriate for these two queries are, however, very different. To avoid such situations, the taxonomy needs to provide sufficient discrimination between common commercial topics. Therefore, in this paper we employ an elaborate taxonomy of approximately 6000 nodes, arranged in a hierarchy with median depth 5 and maximum depth 9. Figure 1 shows the distribution of categories by taxonomy levels. Human editors populated the taxonomy with labeled queries (approximately 150 queries per node), which were used as a training set; a small fraction of queries have been assigned to more than one category.

[Figure 1: Number of categories by level. Histogram of the number of taxonomy categories (0–2000) at each taxonomy level (0–10).]

3.2 Digression: the basics of sponsored search

To discuss our set of evaluation queries, we need a brief introduction to some basic concepts of Web advertising. Sponsored search (or paid search) advertising is placing textual ads on the result pages of web search engines, with ads being driven by the originating query. All major search engines (Google, Yahoo!, and MSN) support such ads and act simultaneously as a search engine and an ad agency. These textual ads are characterized by one or more "bid phrases" representing those queries where the advertisers would like to have their ad displayed. (The name "bid phrase" comes from the fact that advertisers bid various amounts to secure their position in the tower of ads associated to a query. A discussion of bidding and placement mechanisms is beyond the scope of this paper [13].)

However, many searches do not explicitly use phrases that someone bids on. Consequently, advertisers also buy "broad" matches, that is, they pay to place their advertisements on queries that constitute some modification of the desired bid phrase. In broad match, several syntactic modifications can be applied to the query to match it to the bid phrase, e.g., dropping or adding words, synonym substitution, etc. These transformations are based on rules and dictionaries. As advertisers tend to cover high-volume and high-revenue queries, broad-match queries fall into the tail of the distribution with respect to both volume and revenue.

3.3 Data sets

We used two representative sets of 1000 queries. Both sets contain queries that cannot be directly matched to advertisements, that is, none of the queries contains a bid phrase (this means we eliminated practically all popular queries). The first set of queries can be matched to at least one ad using broad match as described above. Queries in the second set cannot be matched even by broad match, and therefore the search engine used in our study does not currently display any advertising for them. In a sense, these are even more rare queries and further away from common queries. As a measure of query rarity, we estimated their frequency in a month's worth of query logs for a major US search engine; the median frequency was 1 for queries in Set 1 and 0 for queries in Set 2.

The queries in the two sets differ in their classification difficulty. In fact, queries in Set 2 are difficult to interpret even for human evaluators. Queries in Set 1 have on average 3.50 words, with the longest one having 11 words; queries in Set 2 have on average 4.39 words, with the longest query of 81 words. Recent studies estimate the average length of web queries to be just under 3 words,² which is lower than in our test sets. As another measure of query difficulty, we measured the fraction of queries that contain quotation marks, as the latter assist query interpretation by meaningfully grouping the words. Only 8% of queries in Set 1 and 14% in Set 2 contained quotation marks.

² http://www.rankstat.com/html/en/seo-news1-most-people-use-2-word-phrases-in-search-engines.html

3.4 Methodology and evaluation metrics

The two sets of queries were classified into the target taxonomy using the techniques presented in Section 2. Based on the confidence values assigned, the top 3 classes for each query were presented to human evaluators. These evaluators were trained editorial staff who possessed knowledge about the taxonomy. The editors considered every query-class pair, and rated them on a scale of 1 to 4, with 1 meaning the classification is highly relevant and 4 meaning it is irrelevant for the query. About 2.4% of queries in Set 1 and 5.4% of queries in Set 2 were judged to be unclassifiable (e.g., random strings of characters), and were consequently excluded from evaluation. To compute evaluation metrics, we treated classifications with ratings 1 and 2 as correct, and those with ratings 3 and 4 as incorrect.

We used standard evaluation metrics: precision, recall, and F1. In what follows, we plot precision-recall graphs for all the experiments. For comparison with other published studies, we also report precision and F1 values corresponding to complete recall (R = 1). Owing to the lack of space, we only show graphs for query Set 1; however, we show the numerical results for both sets in the tables.
3.5 Results

We compared our method to a baseline query classifier that does not use any external knowledge. Our baseline classifier expanded queries using standard query-expansion techniques, grouped their terms using a phrase recognizer, boosted certain phrases in the query based on their statistical properties, and performed classification using the nearest-neighbor approach. This baseline classifier is actually a production version of the query classifier running in a major US search engine.

In our experiments, we varied the values of pertinent parameters that characterize the exact way of using search results. In what follows, we start with a general assessment of the effect of using Web search results. We then proceed to exploring more refined techniques, such as using only search summaries versus actually crawling the returned URLs. We also experimented with using different numbers of search results per query, as well as with varying the number of classifications considered for each search result. For lack of space, we only show graphs for Set 1 queries and omit the graphs for Set 2 queries, which exhibit similar phenomena.

3.5.1 The effect of external knowledge

Queries by themselves are very short and difficult to classify. We use top search engine results for collecting background knowledge for queries. We employed two major US search engines, and used their results in two ways, either only summaries or the full text of crawled result pages. Figure 2 and Table 1 show that such extra knowledge considerably improves classification accuracy. Interestingly, we found that search engine A performs consistently better with full-page text, while search engine B performs better when summaries are used.

[Figure 2: The effect of external knowledge. Precision-recall curves (precision 0.4–1.0 over recall 0.1–1.0) for the baseline and for Engine A/B with full-page and summary contexts.]

Table 1: The effect of using external knowledge
  Engine     Context     Prec. (Set 1)   F1 (Set 1)   Prec. (Set 2)   F1 (Set 2)
  A          full-page   0.72            0.84         0.509           0.721
  B          full-page   0.706           0.827        0.497           0.665
  A          summary     0.586           0.744        0.396           0.572
  B          summary     0.645           0.788        0.467           0.638
  Baseline               0.534           0.696        0.365           0.536

3.5.2 Aggregation techniques

There are two major ways to use search results as additional knowledge. First, individual results can be classified separately, with subsequent voting among the individual classifications. Alternatively, individual search results can be bundled together as one meta-document and classified as such using the document classifier. Figure 3 presents the results of these two approaches. When full-text pages are used, the technique using individual classifications of search results evidently outperforms the bundling approach by a wide margin. However, in the case of summaries, bundling together is found to be consistently better than individual classification. This is because summaries by themselves are too short to be classified correctly individually, but when bundled together they are much more stable.

[Figure 3: Voting vs. Bundling. Precision-recall curves for the baseline, bundled full-page, voting full-page, bundled summary, and voting summary variants.]
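The two aggregation strategies can be summarized in code as follows; here `classify_document` is assumed to be any function mapping a text to a list of (class, score) pairs, and `vote` is the averaging step sketched in Section 2.4. Both names are placeholders for this illustration.

```python
def classify_by_voting(result_texts, classify_document, vote):
    """Classify each search result separately, then vote among the outputs."""
    per_result = [dict(classify_document(text)) for text in result_texts]
    return vote(per_result)

def classify_by_bundling(result_texts, classify_document):
    """Concatenate the results into one meta-document and classify it once."""
    meta_document = " ".join(result_texts)
    return dict(classify_document(meta_document))
```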
3.5.3 Full page text vs. summary

To summarize the two preceding sections, background knowledge for each query is obtained by using either the full-page text or only the summaries of the top search results. Full-page text was found to be more useful in conjunction with voted classification, while summaries were found to be useful when bundled together. The best results overall were obtained with full-page results classified individually, with subsequent voting used to determine the final query classification. This observation differs from findings by Shen et al. [20], who found summaries to be more useful. We attribute this distinction to the fact that the queries we used in this study are tail ones, which are rare and difficult to classify.

3.5.4 Varying the number of classes per search result

We also varied the number of classifications per search result, i.e., each result was permitted to have either 1, 3, or 5 classes. Figure 4 shows the corresponding precision-recall graphs for both full-page and summary-only settings. As can be readily seen, all three variants produce very similar results. However, the precision-recall curve for the 1-class experiment has higher fluctuations. Using 3 classes per search result yields a more stable curve, while with 5 classes per result the precision-recall curve is very smooth. Thus, as we increase the number of classes per result, we observe higher stability in query classification.

[Figure 4: Varying the number of classes per page. Precision-recall curves for the baseline and for 1, 3, and 5 classes per result, in both full-page and summary settings.]

3.5.5 Varying the number of search results obtained

We also experimented with different numbers of search results per query. Figure 5 and Table 2 present the results of this experiment. In line with our intuition, we observed that classification accuracy steadily rises as we increase the number of search results used from 10 to 40, with a slight drop as we continue to use even more results (50). This is because using too few search results provides too little external knowledge, while using too many results introduces extra noise.

[Figure 5: Varying the number of results per query. Precision-recall curves for the baseline and for 10, 20, 30, 40, and 50 search results per query.]

Table 2: Varying the number of search results
  Number of results   Precision   F1
  baseline            0.534       0.696
  10                  0.706       0.827
  20                  0.751       0.857
  30                  0.796       0.886
  40                  0.807       0.893
  50                  0.798       0.887

Using a paired t-test, we assessed the statistical significance of the improvements due to our methodology versus the baseline. We found the results to be highly significant (p < 0.0005), thus confirming the value of external knowledge for query classification.

3.6 Voting versus alternative methods

As explained in Section 2.2, one may use several methods to classify queries from search engine results based on our relevance model. As we have seen, the voting method works quite well. In this section, we compare the performance of voting over the top ten search results to the following two methods:

• A: Discriminative learning of query classification based on logistic regression, described in Section 2.5.

• B: Learning weights based on the quality score returned by a search engine. We discretize the quality score s(d, q) of a query/document pair into {high, medium, low}, learn the three weights w on a set of training queries, and test the performance on holdout queries. The classification formula, as explained at the end of Section 2.4, is P(C_j|q) = \sum_{d} w(s(d, q)) P(C_j|d).

Method B requires a training/testing split. Neither voting nor method A requires such a split; however, for consistency, we randomly draw 50-50 training/testing splits ten times, and report the mean performance ± standard deviation on the test split for all three methods. For this experiment, instead of precision and recall, we use DCG-k (k = 1, 5), popular in search engine evaluation. The DCG (discounted cumulated gain) metric, described in [8], is a ranking measure where the system is asked to rank a set of candidates (in our case, the judged categories for each query), and computes for each query q:

  DCG_k(q) = \sum_{i=1}^{k} g(C_i(q)) / \log_2(i + 1),

where C_i(q) is the i-th category for query q ranked by the system, and g(C_i) is the grade of C_i: we assign grades of 10, 5, 1, and 0 to the 4-point judgment scale described earlier to compute DCG. The discounting choice of log_2(i + 1) is conventional and does not have particular importance. The overall DCG of a system is the DCG averaged over queries.

We use this metric instead of precision/recall in this experiment because it can directly handle multi-grade output. Therefore, as a single metric, it is convenient for comparing the methods. Note that the precision/recall curves used in the earlier sections yield some additional insights not immediately apparent from the DCG numbers.
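For reference, a direct implementation of the DCG-k measure as defined above, with the 10/5/1/0 grade mapping of the 4-point editorial scale; the data layout (a ranked list of editorial judgments per query) is an assumption of the example.

```python
import math

# grades assigned to the 4-point editorial judgments (1 = highly relevant ... 4 = irrelevant)
GRADE = {1: 10.0, 2: 5.0, 3: 1.0, 4: 0.0}

def dcg_k(ranked_judgments, k):
    """DCG_k(q) = sum_{i=1..k} g(C_i(q)) / log2(i + 1)."""
    return sum(GRADE[j] / math.log2(i + 1)
               for i, j in enumerate(ranked_judgments[:k], start=1))

def mean_dcg(per_query_judgments, k):
    """Average DCG_k over a set of queries, as reported in Table 3."""
    return sum(dcg_k(j, k) for j in per_query_judgments) / len(per_query_judgments)
```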
Results from our experiments are given in Table 3. The oracle method is the best ranking of categories for each query after seeing the human judgments. It cannot be achieved by any realistic algorithm, but is included here as an absolute upper bound on DCG performance. The simple voting method performs very well in our experiments. The more complicated methods may lead to moderate performance gains (especially method A, which uses the discriminative training of Section 2.5). However, both methods are computationally more costly, and the potential gain is minor enough to be neglected. This means that as a simple method, voting is quite effective.

Table 3: Voting and alternative methods
  Set 1
  Method     DCG-1          DCG-5
  Oracle     7.58 ± 0.19    14.52 ± 0.40
  Voting     5.28 ± 0.15    11.80 ± 0.31
  Method A   5.48 ± 0.16    12.22 ± 0.34
  Method B   5.36 ± 0.18    12.15 ± 0.35

  Set 2
  Method     DCG-1          DCG-5
  Oracle     5.69 ± 0.18    9.94 ± 0.32
  Voting     3.50 ± 0.17    7.80 ± 0.28
  Method A   3.63 ± 0.23    8.11 ± 0.33
  Method B   3.55 ± 0.18    7.99 ± 0.31

We can observe that method B, which uses the quality score returned by a search engine to adjust the importance weights of returned pages for a query, does not yield appreciable improvement. This implies that putting equal weights (voting) performs similarly to putting higher weights on higher-quality documents and lower weights on lower-quality documents (method B), at least for the top search results. It may be possible to improve this method by including other page features that can differentiate top-ranked search results; however, the effectiveness would require further investigation, which we did not test. We may also observe that the performance on Set 2 is lower than that on Set 1, which means queries in Set 2 are harder than those in Set 1.

3.7 Failure analysis

We scrutinized the cases when external knowledge did not improve query classification, and identified three main causes for such lack of improvement. (1) Queries containing random strings, such as telephone numbers—these queries do not yield coherent search results, and so the latter cannot help classification (around 5% of queries were of this kind). (2) Queries that yield no search results at all; there were 8% such queries in Set 1 and 15% in Set 2. (3) Queries corresponding to recent events, for which the search engine did not yet have ample coverage (around 5% of queries). One notable example of such queries are entire names of news articles—if the exact article has not yet been indexed by the search engine, search results are likely to be of little use.

4. RELATED WORK

Even though the average length of search queries is steadily increasing over time, a typical query is still shorter than 3 words. Consequently, many researchers studied possible ways to enhance queries with additional information.

One important direction in enhancing queries is through query expansion. This can be done either using electronic dictionaries and thesauri [22], or via relevance feedback techniques that make use of a few top-scoring search results. Early work in information retrieval concentrated on manually reviewing the returned results [16, 15]. However, the sheer volume of queries nowadays does not lend itself to manual supervision, and hence subsequent works focused on blind relevance feedback, which basically assumes the top returned results to be relevant [23, 12, 4, 14].
More recently, studies in query augmentation focused on classification of queries, assuming such classifications to be beneficial for more focused query interpretation. Indeed, Kowalczyk et al. [10] found that using query classes improved the performance of document retrieval.

Studies in the field pursue different approaches for obtaining additional information about the queries. Beitzel et al. [1] used semi-supervised learning as well as unlabeled data [2]. Gravano et al. [6] classified queries with respect to geographic locality in order to determine whether their intent is local or global.

The 2005 KDD Cup on web query classification inspired yet another line of research, which focused on enriching queries using Web search engines and directories [11, 18, 20, 9, 21]. The KDD task specification provided a small taxonomy (67 nodes) along with a set of labeled queries, and posed a challenge to use this training data to build a query classifier. Several teams used the Web to enrich the queries and provide more context for classification. The main research questions of this approach are (1) how to build a document classifier, (2) how to translate its classifications into the target taxonomy, and (3) how to determine the query class based on document classifications.

The winning solution of the KDD Cup [18] proposed using an ensemble of classifiers in conjunction with searching multiple search engines. To address issue (1) above, their solution used the Open Directory Project (ODP) to produce an ODP-based document classifier. The ODP hierarchy was then mapped into the target taxonomy using word matches at individual nodes. A document classifier was built for the target taxonomy by using the pages in the ODP taxonomy that appear in the nodes mapped to the particular target node. Thus, Web documents were first classified with respect to the ODP hierarchy, and their classifications were subsequently mapped to the target taxonomy for query classification.

Compared to this approach, we solved the problem of document classification directly in the target taxonomy by using the queries to produce a document classifier, as described in Section 2. This simplifies the process and removes the need for mapping between taxonomies. This also streamlines taxonomy maintenance and development. Using this approach, we were able to achieve good performance in a very large scale taxonomy. We also evaluated a few alternatives for combining individual document classifications when actually classifying the query.

In a follow-up paper [19], Shen et al. proposed a framework for query classification based on bridging between two taxonomies. In this approach, the problem of not having a document classifier for web results is solved by using a training set available for documents with a different taxonomy. For this, an intermediate taxonomy with a training set (ODP) is used. Then several schemes are tried that establish a correspondence between the taxonomies or allow for mapping of the training set from the intermediate taxonomy to the target taxonomy. As opposed to this, we built a document classifier for the target taxonomy directly, without using documents from an intermediate taxonomy. While we were not able to directly compare the results due to the use of different taxonomies (we used a much larger taxonomy), our precision and recall results are consistently higher even over the hardest query set.

5. CONCLUSIONS

Query classification is an important information retrieval task. Accurate classification of search queries is likely to benefit a number of higher-level tasks such as Web search and ad matching. Since search queries are usually short, by themselves they usually carry insufficient information for adequate classification accuracy. To address this problem, we proposed a methodology for using search results as a source of external knowledge. To this end, we send the query to a search engine, and assume that a plurality of the highest-ranking search results are relevant to the query. Classifying these results then allows us to classify the original query with substantially higher accuracy.

The results of our empirical evaluation definitively confirmed that using the Web as a repository of world knowledge contributes valuable information about the query, and aids in its correct classification. Notably, our method exhibits significantly higher accuracy than methods described in prior studies.³ Compared to prior studies, our approach does not require any auxiliary taxonomy, and we produce a query classifier directly for the target taxonomy. Furthermore, the taxonomy used in this study is approximately 2 orders of magnitude larger than those used in prior works.

³ Since the field of query classification does not yet have established and agreed upon benchmarks, direct comparison of results is admittedly tricky.
We also experimented with different values of the parameters that characterize our method. When using search results, one can either use only the summaries of the results provided by the search engine, or actually crawl the result pages for even deeper knowledge. Overall, query classification performance was the best when using the full crawled pages (Table 1). These results are consistent with prior studies [5], which found that using full crawled pages is superior for document classification to using only brief summaries. Our findings, however, are different from those reported by Shen et al. [19], who found summaries to yield better results. We attribute our observations to using a more elaborate voting scheme among the classifications of individual search results, as well as to using a more difficult set of rare queries.

In this study we used two major search engines, A and B. Interestingly, we found notable distinctions in the quality of their output. Notably, for engine A the overall results were better when using the full crawled pages of the search results, while for engine B it seems to be more beneficial to use the summaries of results. This implies that while the quality of search results returned by engine A is apparently better, engine B does a better job in summarizing the pages.

We also found that the best results were obtained by using full crawled pages and performing voting among their individual classifications. For a classifier that is external to the search engine, retrieving full pages may be prohibitively costly, in which case one might prefer to use summaries to gain computational efficiency. On the other hand, for the owners of a search engine, full-page classification is much more efficient, since it is easy to preprocess all indexed pages by classifying them once onto the (fixed) taxonomy. Then, page classifications are obtained as part of the meta-data associated with each search result, and query classification can be nearly instantaneous.

When using summaries, it appears that better results are obtained by first concatenating individual summaries into a meta-document, and then using its classification as a whole. We believe the reason for this observation is that summaries are short and inherently noisier, and hence their aggregation helps to correctly identify the main theme. Consistent with our intuition, using too few search results yields useful but insufficient knowledge, and using too many search results leads to inclusion of marginally relevant Web pages. The best results were obtained when using the 40 top search hits.

In this work, we first classify search results, and then use their classifications directly to classify the original query. Alternatively, one can use the classifications of search results as features in order to learn a second-level classifier. In Section 3.6, we did some preliminary experiments in this direction, and found that learning such a secondary classifier did not yield considerable advantages. We plan to further investigate this direction in our future work.

It is also essential to note that implementing our methodology incurs little overhead. If the search engine classifies crawled pages during indexing, then at query time we only need to fetch these classifications and do the voting.

To conclude, we believe our methodology for using Web search results holds considerable promise for substantially improving the classification accuracy of Web search queries. This is particularly important for rare queries, for which little per-query learning can be done, and in this study we showed that such scarceness of information can be addressed by leveraging the knowledge found on the Web. We believe our findings will have immediate applications to improving the handling of rare queries, both for improving the search results as well as for yielding better matched advertisements. In our further research we also plan to make use of session information in order to leverage knowledge about previous queries to better classify subsequent ones.
6. REFERENCES

[1] S. Beitzel, E. Jensen, O. Frieder, D. Grossman, D. Lewis, A. Chowdhury, and A. Kolcz. Automatic web query classification using labeled and unlabeled training data. In Proceedings of SIGIR'05, 2005.
[2] S. Beitzel, E. Jensen, O. Frieder, D. Lewis, A. Chowdhury, and A. Kolcz. Improving automatic query classification via semi-supervised learning. In Proceedings of ICDM'05, 2005.
[3] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, 1973.
[4] E. Efthimiadis and P. Biron. UCLA-Okapi at TREC-2: Query expansion experiments. In TREC-2, 1994.
[5] E. Gabrilovich and S. Markovitch. Feature generation for text categorization using world knowledge. In IJCAI'05, pages 1048–1053, 2005.
[6] L. Gravano, V. Hatzivassiloglou, and R. Lichtenstein. Categorizing web queries according to geographical locality. In CIKM'03, 2003.
[7] E. Han and G. Karypis. Centroid-based document classification: Analysis and experimental results. In PKDD'00, September 2000.
[8] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In SIGIR'00, 2000.
[9] Z. Kardkovacs, D. Tikk, and Z. Bansaghi. The ferrety algorithm for the KDD Cup 2005 problem. In SIGKDD Explorations, volume 7. ACM, 2005.
[10] P. Kowalczyk, I. Zukerman, and M. Niemann. Analyzing the effect of query class on document retrieval performance. In Proc. Australian Conf. on AI, pages 550–561, 2004.
[11] Y. Li, Z. Zheng, and H. Dai. KDD CUP-2005 report: Facing a great challenge. In SIGKDD Explorations, volume 7, pages 91–99. ACM, December 2005.
[12] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In SIGIR'98, pages 206–214, 1998.
[13] M. Moran and B. Hunt. Search Engine Marketing, Inc.: Driving Search Traffic to Your Company's Web Site. Prentice Hall, Upper Saddle River, NJ, 2005.
[14] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC-3, 1995.
[15] J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313–323. Prentice Hall, 1971.
[16] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. JASIS, 41(4):288–297, 1990.
[17] T. Santner and D. Duffy. The Statistical Analysis of Discrete Data. Springer-Verlag, 1989.
[18] D. Shen, R. Pan, J. Sun, J. Pan, K. Wu, J. Yin, and Q. Yang. Q2C@UST: Our winning solution to query classification in KDDCUP 2005. In SIGKDD Explorations, volume 7, pages 100–110. ACM, 2005.
[19] D. Shen, R. Pan, J. Sun, J. Pan, K. Wu, J. Yin, and Q. Yang. Query enrichment for web-query classification. ACM TOIS, 24:320–352, July 2006.
[20] D. Shen, J. Sun, Q. Yang, and Z. Chen. Building bridges for web query classification. In SIGIR'06, pages 131–138, 2006.
[21] D. Vogel, S. Bickel, P. Haider, R. Schimpfky, P. Siemen, S. Bridges, and T. Scheffer. Classifying search engine queries using the web as background knowledge. In SIGKDD Explorations, volume 7. ACM, 2005.
[22] E. Voorhees. Query expansion using lexical-semantic relations. In SIGIR'94, 1994.
[23] J. Xu and W. Bruce Croft. Improving the effectiveness of information retrieval with local context analysis. ACM TOIS, 18(1):79–112, 2000.