A Query Suggestion Method Combining TF-IDF and Jaccard Coefficient for Interactive Web Search
www.sciedu.ca/air Artificial Intelligence Research 2015, Vol. 4, No. 2
Published by Sciedu Press. ISSN 1927-6974, E-ISSN 1927-6982

ORIGINAL RESEARCH

Suthira Plansangket,∗ John Q Gan
School of Computer Science and Electronic Engineering, University of Essex, United Kingdom

Received: May 10, 2015 Accepted: July 27, 2015 Online Published: August 6, 2015
DOI: 10.5430/air.v4n2p119 URL: http://dx.doi.org/10.5430/air.v4n2p119

∗Correspondence: Suthira Plansangket; Email: [email protected]; Address: School of Computer Science and Electronic Engineering, University of Essex, United Kingdom.

Abstract

This paper proposes a query suggestion method combining two ranked retrieval methods: TF-IDF and Jaccard coefficient. Four performance criteria plus user evaluation have been adopted to evaluate this combined method in terms of ranking and relevance from different perspectives. Two experiments have been conducted using eighty carefully designed test queries related to eight topics. One experiment evaluates the quality of the query suggestions generated by the proposed method. The other evaluates how much the query suggestions improve the relevance of returned documents in interactive web search, and thus the effectiveness of the developed method. The experimental results show that the method developed in this paper is the best method for query suggestion among the methods evaluated, significantly outperforming the most popularly used TF-IDF method. In addition, the query suggestions generated by the proposed method significantly improve the relevance of returned documents in interactive web search in terms of increasing the precision or the number of highly relevant documents.

Key Words: Query suggestion, Query expansion, Information retrieval, Search engine, Performance evaluation

1 Introduction

Internet search engines play the most important role in finding information from the web. One of the great challenges faced by search engines is to understand users' information needs precisely, since users usually submit very short (only a couple of words) and imprecise queries.[1] Most existing search engines retrieve information by finding exact keywords. Sometimes, users do not know the precise vocabulary of the topic to be searched, and they do not know how search algorithms work so as to produce proper queries.[2]

One solution to these problems is to devise a query suggestion module in search engines, which helps users in their searching activities. Kelly et al.[3] pointed out that query suggestions were useful when users ran out of ideas or faced a cold-start problem. Kato et al.[4] analysed three types of logs in Microsoft's search engine Bing and found that query suggestions are often used when the original query is a rare query or a single-term query, or after the user has clicked on several URLs in the first search result page. Furthermore, Carpineto and Romano[5] found that one advantage of query expansion is that there is more chance for a relevant document that does not contain the original query terms to be retrieved. Niu and Kelly[6] found that when searching for the most difficult topic, users save significantly more documents retrieved by query suggestions than by user-generated queries. There exist many query suggestion methods that extract query related terms or features from log files, ontologies, and documents returned from search engines and use them to generate query suggestions. More related work will be described in the next section.

This paper proposes a query suggestion method combining two ranked retrieval methods, TF-IDF and Jaccard coefficient, and evaluates the method using several performance criteria as well as users' judgement, in terms of the quality of the generated query suggestions and the improvement of relevance of the returned documents in interactive web search. Comprehensive comparative experiments have demonstrated the effectiveness of the method developed in this paper.

2 Related work

2.1 Query expansion and reformulation

Query expansion is a technique to expand the query with related words and is widely used for query suggestion. It aims to improve the overall recall of the relevant documents.[7,8] Query reformulation, or dynamic query suggestion, forms new queries using certain models and is more complex than query expansion.[8–10] This paper mainly addresses query expansion.

2.2 Explicit and implicit feedback

Relevance feedback plays an important role in query suggestion. There are two major categories of relevance feedback. Explicit feedback is provided directly by users, which is expensive and time consuming. On the other hand, implicit feedback is derived by the system.[7] The system derives the feedback information from several sources of features, such as log files, web documents, and ontologies. This paper focuses on query suggestion methods based on implicit relevance feedback.

There are many studies on query suggestion using log files, from which users' search behaviours and information needs can be derived.[1,4,8,11–18] Various ontologies have been applied to create knowledge-driven models for generating query suggestions, such as WordNet,[19,20] Wikipedia,[21] ODP and YAGO.[22–25] Query suggestions can also be generated from query related features extracted from web documents returned by search engines.[2] There are some studies on query suggestion that combined query log and web search results[26] or combined query log and ontology.[27]

2.3 Ranked retrieval models

In ranked retrieval models, the system returns an ordered list of top matching documents with respect to a query. Typical ranked retrieval methods include Jaccard coefficient and Term frequency - inverse document frequency (TF-IDF),[28] which will be described in more detail in the next section.

In information retrieval, ranked retrieval methods are used to order relevant documents with respect to a query. Similarly, highly relevant query suggestions are preferable to appear first in query suggestions.[29] Therefore, it is reasonable to adapt ranked retrieval methods for query suggestion.

3 Methods

3.1 TF-IDF

TF-IDF[7] is the most popular term weighting scheme in information retrieval. The TF-IDF score of a term in a set of documents is calculated as follows:

$$\mathrm{tfidf}_i = \sum_{j=1}^{N} w_{ij} \tag{1}$$

$$w_{ij} = \begin{cases} (1 + \log f_{i,j}) \times \log \dfrac{N}{n_i}, & \text{if } f_{i,j} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $f_{i,j}$ is the frequency of term $i$ in document $j$, $n_i$ is the number of documents in which term $i$ appears, and $N$ is the total number of available documents.

TF-IDF has been used to measure word relatedness.[30] Therefore, it can be applied to identify terms in the documents returned from search engines, which are mostly relevant to the original query, as query suggestions.

3.2 Jaccard coefficient

Jaccard coefficient[28] is a measure of overlap of two returned documents $D_1$ and $D_2$, which are represented as vectors of terms and may not have the same size. Jaccard coefficient has been used to measure the similarity between search texts.[8] Kulkarni and Caragea[31] used this method to compute semantic relatedness between two concept clouds.

The Jaccard coefficient for a length-normalized model is calculated as follows:

$$\mathrm{Jaccard}(D_1, D_2) = \frac{|D_1 \cap D_2|}{\sqrt{|D_1 \cup D_2|}} \tag{3}$$

where $\cap$ represents intersection and $\cup$ union. In this paper, $D_1$ and $D_2$ are bags of words which contain query suggestion candidates that are selected from words which appear in at least two returned documents. In mathematics, the notion of multiset or bag is a generalization of the notion of set, in which members are allowed to appear more than once. The intersection or union of multisets is a multiset in general.[32] If a query suggestion candidate is from more than two returned documents, its Jaccard coefficient can be extended as follows:

$$\mathrm{Jaccard}(D_1, D_2, \cdots, D_M) = \frac{|D_1 \cap D_2 \cap \cdots \cap D_M|}{\sqrt{|D_1 \cup D_2 \cup \cdots \cup D_M|}} \tag{4}$$

In this paper, for each query suggestion candidate, the $M$ documents that contain this suggestion term are identified, and then the Jaccard coefficient is calculated as the score to rank this candidate.

3.3 A new method based on the combination of TF-IDF and Jaccard coefficient

Our initial experiment found that the TF-IDF method was capable of producing suggestions relevant to the user's original query, whilst Jaccard coefficient was good at ranking the suggestions. Therefore, a method called Tfjac is proposed in this paper, which selects terms from the combination of the top ten candidate words from the TF-IDF method and up to ten candidate words from the Jaccard coefficient method. The process starts with finding duplicate words from both methods. If the number of these words is less than ten, more candidate words from the Jaccard coefficient method are added.

[...] processed as follows. First of all, not the whole document, but only the title and snippet content in each document are considered. After that, all HTML tags are removed and all contents are separated into tokens. Thirdly, since the most selective terms for query suggestions should be nouns,[7,34] only nouns are considered for suggestions.

There are two experiments in this paper. The first experiment aims to evaluate the quality of the query suggestions generated by the method combining TF-IDF and Jaccard coefficient, in comparison with the TF-IDF and Jaccard coefficient methods respectively. A simple search engine using the Google API has been implemented in this
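
To make the scoring in Sections 3.1 to 3.3 concrete, the following Python sketch computes the TF-IDF score of Equations (1) and (2), the multiset Jaccard score of Equations (3) and (4), and a Tfjac-style merge. It is an illustration under stated assumptions, not the authors' implementation: each returned document is assumed to be already reduced to a list of noun tokens, the logarithm is taken as the natural logarithm (the paper does not state the base), ties are broken alphabetically, and the duplicate words are assumed to be listed in Jaccard rank order. The function names `tfidf_scores`, `jaccard_score`, and `tfjac` are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Equations (1)-(2): sum over documents of (1 + log f) * log(N / n)."""
    n_docs = len(docs)
    # n_i: number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, freq in Counter(doc).items():
            # Natural log is an assumption; the paper does not give the base.
            scores[term] += (1 + math.log(freq)) * math.log(n_docs / doc_freq[term])
    return dict(scores)


def jaccard_score(term, docs):
    """Equations (3)-(4) over the M documents (as multisets) containing `term`."""
    bags = [Counter(doc) for doc in docs if term in doc]
    if len(bags) < 2:
        # Candidates must appear in at least two returned documents.
        return 0.0
    inter, union = bags[0], bags[0]
    for bag in bags[1:]:
        inter = inter & bag  # multiset intersection (minimum counts)
        union = union | bag  # multiset union (maximum counts)
    return sum(inter.values()) / math.sqrt(sum(union.values()))


def tfjac(docs, k=10):
    """Merge the top-k TF-IDF words with up to k Jaccard-ranked words."""
    tfidf = tfidf_scores(docs)
    top_tfidf = sorted(tfidf, key=lambda t: (-tfidf[t], t))[:k]

    doc_freq = Counter(term for doc in docs for term in set(doc))
    jac = {t: jaccard_score(t, docs) for t, n in doc_freq.items() if n >= 2}
    ranked_jac = sorted(jac, key=lambda t: (-jac[t], t))

    # Start with words found by both methods (assumed order: Jaccard rank).
    top_tfidf_set = set(top_tfidf)
    suggestions = [t for t in ranked_jac[:k] if t in top_tfidf_set]
    # If fewer than k duplicates, add more Jaccard candidates in rank order.
    for term in ranked_jac:
        if len(suggestions) >= k:
            break
        if term not in suggestions:
            suggestions.append(term)
    return suggestions
```

With three toy documents such as `[["jaguar", "car", "speed", "car"], ["jaguar", "car", "engine"], ["jaguar", "cat", "animal"]]`, a term that occurs in every document (`jaguar`) receives a TF-IDF score of zero because $\log(N/n_i) = 0$, while it still receives a non-zero Jaccard score. This is one way to see why the two rankings can disagree and why combining them may be useful.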