Learning to Rewrite Queries

Yunlong He, Yahoo! Research, Sunnyvale, CA
Jiliang Tang, Michigan State University, East Lansing, MI
Hua Ouyang, Yahoo! Research, Sunnyvale, CA
Changsung Kang, Yahoo! Research, Sunnyvale, CA
Dawei Yin, Yahoo! Research, Sunnyvale, CA
Yi Chang, Yahoo! Research, Sunnyvale, CA

ABSTRACT

It is widely known that there exists a semantic gap between web documents and user queries, and bridging this gap is crucial to advance information retrieval systems. The task of query rewriting, which aims to alter a given query to a rewrite query that can close the gap and improve information retrieval performance, has attracted increasing attention in recent years. However, the majority of existing query rewriters are not designed to boost search performance, and consequently their rewrite queries could be sub-optimal. In this paper, we propose a learning to rewrite framework that consists of a candidate generating phase and a candidate ranking phase. The candidate generating phase gives us the flexibility to reuse most existing query rewriters, while the candidate ranking phase allows us to explicitly optimize search relevance. Experimental results on a commercial search engine demonstrate the effectiveness of the proposed framework. Further experiments are conducted to understand the important components of the proposed framework.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM'16, October 24-28, 2016, Indianapolis, IN, USA
© 2016 ACM. ISBN 978-1-4503-4073-1/16/10. $15.00
DOI: http://dx.doi.org/10.1145/2983323.2983835

1 Introduction

Users of the Internet typically play two roles in commercial search engines. They are information creators who generate web documents, and they are also information consumers who retrieve documents for their information needs. It is well known, however, that there is a "lexical chasm" [18] between user queries and web documents because they use different language styles and vocabularies. As a consequence, search engines may be unable to retrieve relevant documents even when the issued queries describe users' information needs perfectly. For example, a user forms the query "how much tesla", but web documents in search engines use the expression "price tesla". Therefore it has become increasingly important for search engines to intelligently satisfy users' information needs by understanding the intent behind their queries.

Query rewriting (QRW), which aims to alter a given query to alternative queries that can improve relevance performance by reducing such mismatches, is a critical task in modern search engines and has attracted increasing attention in recent years [6, 14, 18, 8]. For example, the authors of [14] propose to modify search queries based on typical substitutions that web searchers make to their queries, while in [18, 8] query rewriting is treated as a machine translation problem and statistical machine translation (SMT) models are trained on query-snippet pairs. However, the vast majority of existing algorithms focus either on correlations between queries and rewrites or on bridging the language gap between user queries and web documents, which could be sub-optimal for the goal of query rewriting: improving search relevance performance. For example, since an SMT model aims to translate a sentence in a source language into a fluent and grammatically correct sentence in a target language, it prefers to rewrite the query "how much tesla" as "how much is the tesla" instead of "price tesla". Therefore it is necessary to explicitly consider ranking relevance when developing query rewriting methods.

In this paper, we propose a learning to rewrite framework that consists of (1) candidate generating and (2) candidate ranking, as shown in Figure 1. Given a query, we create possible candidates via a set of candidate generators in the candidate generating phase; given a learning target, we train a scoring function to rank these candidates in the candidate ranking phase. The advantages of this framework are two-fold. First, the candidate generating phase not only allows us to reuse most existing query rewriting algorithms as candidate generators but also enables us to explore recent advanced techniques such as deep learning to build new candidate generators. Second, the candidate ranking phase gives us the flexibility to choose the target to optimize for query rewriting; for example, in this work we want to boost relevance performance, so we choose the relevance target to rank candidates.

Our contributions are summarized below:

• Define the problem of learning to rewrite formally;
• Propose a learning to rewrite framework that is flexible enough to incorporate existing query rewriting algorithms and to optimize relevance performance explicitly; and
• Conduct experiments on a commercial search engine to demonstrate the effectiveness of the proposed framework.

Figure 1: Flow chart of the proposed learning to rewrite framework

The rest of the paper is organized as follows. In Section 2, we briefly review the related work. In Section 3, we formally define the problem and introduce the proposed framework in detail. In Section 4, we present the empirical evaluation with discussion. In Section 5, we conclude and discuss future work.

2 Related Work

Query expansion and rewriting have long been important research topics in information retrieval [3]. Xu and Croft [23] studied using the top-ranked documents retrieved by the original query to expand the query. This method is sensitive to the initial ranking results and does not learn from user-generated data. Later approaches [6, 7, 14] focus on using user query logs to generate query expansions by collecting signals such as clickthrough rate [6, 24], co-occurrence in search sessions [14], or query similarity based on click graphs [7, 1]. Since search logs contain query-document pairs clicked by millions of users, the term correlations reflect the preference of the majority of users. However, the correlation-based method, as pointed out by [19], suffers from low precision, partly because the correlation model does not explicitly capture context information and is susceptible to noise. More recently, natural language technology in the form of statistical machine translation (SMT) [19, 18, 8, 9] has been introduced for the query expansion and rewriting problems. In the SMT system all component models are properly smoothed using sophisticated techniques to avoid sparse data problems, while the correlation model relies on pure counts of term frequencies. However, the SMT system is used as a black box instead of being fully tuned for the task of query rewriting.

Another related research topic is query recommendation. The authors of [12] use search session data to find query pairs that frequently co-occur in the same sessions, which are used as suggestions for each other. Baeza-Yates et al. [2] propose to rank the clustered queries according to two principles: i) the similarity of the queries to the input query, and ii) the support of the suggested query, which measures the magnitude of answers returned in the past to this query that have attracted the attention of users. A follow-up work [4] combines various click-based, topic-based and session-based ranking strategies and uses supervised learning to maximize the semantic similarity between the query and the recommendations. Cao et al. [5] addressed the data sparseness issue by summarizing queries into concepts by clustering a click-through bipartite. Then, from session data, a concept sequence suffix tree is constructed as the query suggestion model.

Recently, deep learning techniques [16, 21] have been applied to query processing and machine translation tasks. For example, the authors of [11] generate distributed language models for queries to improve relevance in sponsored search. The authors of [21] applied recurrent neural networks to machine translation tasks and achieved state-of-the-art performance compared to traditional SMT systems. A hierarchical recurrent encoder-decoder method is proposed in [20] for the task of query auto-completion.

In this work, our approach is formulated as a learning to rewrite framework whose key component is candidate ranking. The candidate ranking phase is therefore similar to the learning to rank framework in many aspects, such as loss functions [25, 22] and ranking features [15].

3 Learning to Rewrite Framework

The query rewriting problem aims to find the query rewrites of a given query for the purpose of improving the relevance of the information retrieval system. The proposed framework formulates the query rewriting problem as an optimization problem of finding a scoring function F(q, r) which assigns a score for any pair of query q and its rewrite candidate [...]

[...] with traditional techniques. Therefore a candidate generator based on deep learning techniques could be complementary to traditional ones and provide potentially better candidates. The Recurrent Neural Network (RNN) is a neural sequence model that achieves state-of-the-art performance on many important sequential learning tasks. The long short-term memory (LSTM) network is one of the most popular RNN instances. It can learn long-range temporal dependencies and mitigate the vanishing gradient problem. We propose to use the Sequence-to-Sequence LSTM model [21] to build a new candidate generator.
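The two-phase structure described in the text (a set of candidate generators followed by a scoring function F(q, r) that ranks the candidates) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the two generators, the substitution table, the toy document vocabulary, and the function names (`substitution_generator`, `insertion_generator`, `generate_candidates`, `rank_candidates`, `toy_score`) are all hypothetical, and `toy_score` is a hand-written stand-in for a scoring function that the paper would learn from a relevance target.

```python
from typing import Callable, Iterable, List, Tuple

# A candidate generator maps a query to zero or more rewrite candidates.
CandidateGenerator = Callable[[str], Iterable[str]]
# A scoring function F(q, r) assigns a score to a (query, rewrite) pair.
ScoringFunction = Callable[[str, str], float]


def substitution_generator(query: str) -> List[str]:
    """Hypothetical generator: phrase substitutions from a hand-made table."""
    table = {"how much": "price"}
    return [query.replace(src, dst) for src, dst in table.items() if src in query]


def insertion_generator(query: str) -> List[str]:
    """Hypothetical generator: mimics a fluency-oriented (SMT-like) rewriter."""
    if "how much" in query:
        return [query.replace("how much", "how much is the")]
    return []


def generate_candidates(query: str,
                        generators: Iterable[CandidateGenerator]) -> List[str]:
    """Phase 1: pool candidates from all generators, dropping duplicates."""
    seen, pooled = set(), []
    for generator in generators:
        for rewrite in generator(query):
            if rewrite != query and rewrite not in seen:
                seen.add(rewrite)
                pooled.append(rewrite)
    return pooled


def rank_candidates(query: str, candidates: Iterable[str],
                    score: ScoringFunction) -> List[Tuple[str, float]]:
    """Phase 2: score each (query, rewrite) pair and sort best-first."""
    scored = [(rewrite, score(query, rewrite)) for rewrite in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(query: str, rewrite: str) -> float:
    """Stand-in for a learned F(q, r): the fraction of rewrite tokens that
    appear in a toy document vocabulary (an assumption for illustration)."""
    doc_vocab = {"price", "tesla"}
    tokens = rewrite.split()
    if not tokens:
        return 0.0
    return sum(token in doc_vocab for token in tokens) / len(tokens)


if __name__ == "__main__":
    q = "how much tesla"
    pool = generate_candidates(q, [substitution_generator, insertion_generator])
    for rewrite, value in rank_candidates(q, pool, toy_score):
        print(f"{value:.2f}  {rewrite}")
```

Keeping generators and the scoring function as separate callables mirrors the flexibility the text claims for the framework: an existing rewriter can be plugged in as another generator without touching the ranking step, and the ranking target can be changed by swapping the scoring function alone.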
