Query Classification Based on a New Query Expansion Approach

by Li Shujie

Thesis submitted in partial fulfillment of the requirements for the Degree of Master of Science (Statistics)

Acadia University
Fall Convocation 2009

© by Li Shujie, 2009

This thesis by Li Shujie was defended successfully in an oral examination on August 21, 2009.

The examining committee for the thesis was:

Dr. Anthony Tong, Chair
Dr. Crystal Linkletter, External Reader
Dr. Wilson Lu, Internal Reader
Dr. Hugh Chipman and Dr. Pritam Ranjan, Supervisors
Dr. Jeff Hooper, Department Head

This thesis is accepted in its present form by the Division of Research and Graduate Studies as satisfying the thesis requirements for the degree Master of Science (Statistics).

I, Li Shujie, grant permission to the University Librarian at Acadia University to reproduce, loan or distribute copies of my thesis in microform, paper or electronic formats on a non-profit basis. I, however, retain the copyright in my thesis.

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Query Classification
    1.1.1 What is query classification?
    1.1.2 Why is query classification useful?
  1.2 Relevant Work
  1.3 My Approach
2 Data and Terminology
  2.1 Information Retrieval
    2.1.1 Bag of words assumption
    2.1.2 Document frequency and term frequency
  2.2 Information Theory
  2.3 GenieKnows Taxonomy
    2.3.1 Topics
    2.3.2 Original data
    2.3.3 Multi-word taxonomy
3 Feature Selection
  3.1 Feature Selection Using Chi-Square Statistic
    3.1.1 Penalized feature selection
4 Query Expansion
  4.1 Word Similarity
    4.1.1 Cosine similarity
    4.1.2 Smoothed KL divergence
  4.2 The Advantage of Using Feature Words
5 Query Classification
  5.1 Naive Bayes Classification Method
    5.1.1 Naive Bayes Bernoulli model
    5.1.2 Naive Bayes multinomial model
  5.2 Dirichlet/Multinomial Model
6 Experiments
  6.1 KDD Data
    6.1.1 Some problems in using KDD Cup 2005 queries
    6.1.2 Precision, recall and F1 value
    6.1.3 The number of return topics
  6.2 Experiment Results
    6.2.1 Choosing α and k
    6.2.2 Notation
    6.2.3 F1 values for the KDD-Cup data
    6.2.4 Comparison of word similarity measures
    6.2.5 Comparison of three classification methods
    6.2.6 Comparison of feature word penalty parameters
    6.2.7 Comparison with the KDD-Cup 2005 competitors
7 Conclusion and Future Work
A Appendix (Feature Words)

List of Tables

2.1 Top-level topics in the GenieKnows taxonomy
2.2 Taxonomy extract for topic Arts/Entertainment
2.3 Multi-words taxonomy extract for topic Arts/Entertainment
6.1 KDD-CUP Categories and Genieknows Topics
6.2 Number of changes to feature words set for various values of penalty parameter α. Up to 198 changes are possible
6.3 Notation
6.4 F1 values: Cos+NBMUL
6.5 F1 values: Cos+NBBER
6.6 F1 values: Cos+DIRI
6.7 F1 values: KL+NBMUL
6.8 F1 values: KL+NBBER
6.9 F1 values: KL+DIRI
6.10 KDD-Cup 2005 results
A.1 Feature Words for Topic 1
A.2 Feature Words for Topic 2
A.3 Feature Words for Topic 3
A.4 Feature Words for Topic 4
A.5 Feature Words for Topic 5
A.6 Feature Words for Topic 6
A.7 Feature Words for Topic 7
A.8 Feature Words for Topic 8
A.9 Feature Words for Topic 9
A.10 Feature Words for Topic 10
A.11 Feature Words for Topic 11
A.12 Feature Words for Topic 12
A.13 Feature Words for Topic 13
A.14 Feature Words for Topic 14
A.15 Feature Words for Topic 15
A.16 Feature Words for Topic 16
A.17 Feature Words for Topic 17
A.18 Feature Words for Topic 18

List of Figures

1.1 Query Classification system
6.1 Number of change of feature words compared to the no penalty case
6.2 F1 values for the Naive Bayes multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.3 F1 values for the Naive Bayes Bernoulli model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.4 F1 values for the Dirichlet/multinomial model. The red lines indicate smoothed KL divergence, and the green lines indicate cosine similarity
6.5 F1 values for the models using smoothed KL divergence. The black lines indicate naive Bayes multinomial model, the green lines indicate naive Bayes Bernoulli model, and the red lines indicate Dirichlet/multinomial model
6.6 F1 values for the models using Cosine similarity. The black lines indicate naive Bayes multinomial model, the green lines indicate naive Bayes Bernoulli model, and the red lines indicate Dirichlet/multinomial model
6.7 F1 values for methods using smoothed KL divergence. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial model. The different colors represent different penalty parameters α: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)
6.8 F1 values for methods using cosine similarity. Top: naive Bayes multinomial; middle: naive Bayes Bernoulli; bottom: Dirichlet/multinomial model. The different colors represent different penalty parameters: black (α=0), green (α=0.000015), red (α=0.00004), yellow (α=0.00006), pink (α=0.0001), blue (α=0.0002), orange (α=0.0006), purple (α=0.002)

Abstract

Query classification is an important and yet challenging problem for the search engine industry and e-commerce companies. In this thesis, I develop a query classification system based on a novel query expansion approach and classification methods.
The proposed methodology is used to classify queries based on a taxonomy (a database of words and their corresponding topic classification). The taxonomy used was obtained from GenieKnows, a vertical search engine company in Halifax, Canada. The query classification system can be divided into three phases: feature selection, query expansion, and query classification. The first phase uses a chi-square statistic to select a subset of "feature words" from the GenieKnows taxonomy. The second phase uses cosine similarity and Kullback-Leibler divergence to find "feature words" similar to the query for query expansion. The third phase introduces three classification methods (the naive Bayes multinomial model, the naive Bayes Bernoulli model and the Dirichlet/multinomial model) to classify the expanded queries. Results from the KDD-Cup 2005 competition are used to test the performance of the proposed query classification system. The experiment shows that the performance of the query classification system is quite good.

Acknowledgments

There are many people who deserve thanks for helping me during my study at Acadia. The last two years in this department and at GenieKnows have been an unforgettable experience for me.

First and foremost, I express my sincere gratitude to my supervisors Dr. Hugh Chipman and Dr. Pritam Ranjan. They not only provided me with guidance, direction, and funding, but also encouraged me during my study at Acadia. My thesis could not have been finished without their gracious help. I also express my gratitude to my committee members, Dr. Crystal Linkletter, Dr. Wilson Lu, Dr. Jeff Hooper and Dr. Anthony Tong.

Mathematics of Information Technology and Complex Systems (MITACS) and GenieKnows provided me with an eight-month internship, and this thesis is closely related to my internship at GenieKnows. Many thanks to MITACS and GenieKnows.

Dr. Tony Abou-Assaleh, my internship supervisor at GenieKnows, provided a lot of support and instruction during my internship. Philip O'Brien and Dr. Luo Xiao reviewed my thesis and gave me many valuable suggestions. Dr. Luo Xiao helped familiarize me with the GenieKnows data and gave me great help during my preliminary research for this thesis. I will always be grateful to them for their help.

Chapter 1 Introduction

1.1 Query Classification

1.1.1 What is query classification?

Users of search engines (e.g., Google or Yahoo) and e-commerce sites (e.g., Amazon.com or ebay.com) type queries to obtain information. Queries are short pieces of text, such as a search term typed into a search engine. Query classification aims to classify queries into a set of target topics. Suppose there are five target topics {education, shopping, restaurant, statistics, computer science}, and two queries "Acadia University" and "coffee". Our main objective is to classify the two queries into one or more target topics. For instance, it is reasonable to classify the query "Acadia University" into the topic "education" and "coffee" into the topic "restaurant".

1.1.2 Why is query classification useful?

Query classification can help e-commerce companies learn user preferences. Suppose a user types five queries on an e-commerce company's website: {pattern recognition, statistical computing, statistical inference, linear models, data mining}. All of these queries should be classified into the topic "statistics" or "computer science".
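
The toy example from Section 1.1.1 can be made concrete with a small sketch. The code below is not the system developed in this thesis: it skips the feature selection and query expansion phases entirely, and it uses a generic multinomial naive Bayes score with a uniform prior and add-one smoothing, which may differ from the models described in Chapter 5. The word lists in TAXONOMY and the names train, log_score and classify are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy (topic -> words filed under that topic).
# These word lists are invented for illustration; they are NOT GenieKnows data.
TAXONOMY = {
    "education": ["university", "college", "school", "acadia", "student"],
    "shopping": ["store", "price", "buy", "sale"],
    "restaurant": ["coffee", "menu", "cafe", "pizza"],
    "statistics": ["inference", "regression", "statistical", "models"],
    "computer science": ["algorithm", "computing", "mining", "data"],
}


def train(taxonomy):
    """Count word occurrences per topic and collect the overall vocabulary."""
    counts = {topic: Counter(words) for topic, words in taxonomy.items()}
    vocab = {word for words in taxonomy.values() for word in words}
    return counts, vocab


def log_score(query, topic, counts, vocab):
    """Log-likelihood of the query words under one topic.

    Assumes a uniform prior over topics and add-one (Laplace) smoothing,
    so words unseen under a topic still get a small non-zero probability.
    """
    total = sum(counts[topic].values())
    score = 0.0
    for word in query.lower().split():
        score += math.log((counts[topic][word] + 1) / (total + len(vocab)))
    return score


def classify(query, counts, vocab, n_topics=1):
    """Return the n_topics highest-scoring topics for the query."""
    ranked = sorted(
        counts,
        key=lambda topic: log_score(query, topic, counts, vocab),
        reverse=True,
    )
    return ranked[:n_topics]


COUNTS, VOCAB = train(TAXONOMY)
```

With this toy taxonomy, classify("Acadia University", COUNTS, VOCAB) returns ["education"] and classify("coffee", COUNTS, VOCAB) returns ["restaurant"], matching the classification suggested in the text. A query whose words do not appear in the toy taxonomy at all, such as "pattern recognition", receives nearly identical scores for every topic, so its ranking carries no real information.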