Taxonomy Based Image Retrieval


DEGREE PROJECT, IN COMPUTER SCIENCE, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Taxonomy Based Image Retrieval Using Data from Multiple Sources

JIMMY LARSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)

Master’s Thesis at CSC
KTH - Royal Institute of Technology, Sweden
Supervisor: Hedvig Kjellström
Examiner: Danica Kragic
TRITA xxx yyyy-nn

Acknowledgements

This page is dedicated to everyone who has been involved in this work. I would like to start by thanking my professor at the university, Hedvig Kjellström, for accepting yet another project while already having many other projects to supervise, as well as for the feedback and time that she has given me. I would like to thank Danica Kragic, my examiner, who accepted the position of examiner for this work while already having many other tasks to attend to. I would like to thank my supervisors at Findwise, Martin Nycander, Birger Rydback and Simon Stenström, for their help and supervision during my time at the company. I would also like to thank Findwise for accepting this project. Finally, I would like to thank my family, who supported me during this time and who kept pushing me forward.

Abstract

With a multitude of images available on the Internet, how do we find what we are looking for? This project tries to determine how much the precision and recall of search queries are improved by applying a word taxonomy to traditional Text-Based Image Search and Content-Based Image Search. By applying a word taxonomy to different data sources, a strong keyword filter and a keyword extender were implemented and tested. The results show that, depending on the implementation, either the precision or the recall can be increased. By using a similar approach in real-life implementations, it is possible to force images with higher precision to the front while keeping a high recall value, thus increasing the experienced relevance of image search.

Referat

Taxonomy-Based Image Search

With the number of images now available on the Internet, how can we still find what we are looking for? This thesis tries to determine how much image precision and image recall can be increased by applying a word taxonomy to traditional Text-Based Image Search and Content-Based Image Search. By applying a word taxonomy to different data sources, a strong word filter and a module that extends word lists can be created and tested. The results indicate that, depending on the implementation, either the precision or the recall can be improved. By using a similar method in a real-world scenario, it is therefore possible to move images with high precision further forward in the result list while maintaining high recall, thereby increasing the perceived relevance of image search.

Contents

Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Concept
  1.2 Abbreviations
  1.3 Problem Statement
    1.3.1 Research Question
    1.3.2 Hypothesis
  1.4 Contributions
  1.5 Delimitations

I Background

2 Image Retrieval
  2.1 Early History
  2.2 Search Users
  2.3 Presentation

3 Content-Based Image Retrieval
  3.1 Semantic Gap
  3.2 Features
    3.2.1 Global Features
    3.2.2 Local Features
    3.2.3 Image Segmentation
  3.3 Visual Signature
  3.4 Learning Approaches
    3.4.1 Relevance Feedback
    3.4.2 Support Vector Machines
    3.4.3 Artificial Neural Networks
    3.4.4 Convolutional Neural Networks
    3.4.5 Random Forest

4 Text-Based Image Retrieval
  4.1 Relevant Information in Text
    4.1.1 Metadata
  4.2 Current Techniques
    4.2.1 Term Frequency and Inverse Document Frequency
    4.2.2 Natural Language Processing
    4.2.3 Part-of-Speech Tagging
    4.2.4 Stop Words
    4.2.5 Stemming and Lemmatization
  4.3 Indexing
  4.4 Word Taxonomy

5 Related Work

II Method

6 Architecture
  6.1 Data Retrieval
  6.2 Extraction of Relevant Information
  6.3 Content-Based Image Retrieval
    6.3.1 Classification
  6.4 Text-Based Image Retrieval
    6.4.1 Natural Language Processing
    6.4.2 Data Clean-Up
  6.5 WordNet Evaluation
  6.6 Search Platform
    6.6.1 Filters, Stemming and Tokenizing

7 Evaluation Method
  7.1 Evaluation Data
  7.2 Classifier Evaluation
  7.3 Evaluation Formulas
  7.4 Baseline and Comparison

III Results and Discussion

8 Results
  8.1 Average Precision
  8.2 Average Recall
  8.3 Average F Measures

9 Discussion and Conclusion
  9.1 Conclusions
  9.2 Future Work

Bibliography

Appendices
  A Tags

List of Figures

1.1 The system concept portraying a simplified view of the full system.
4.1 A taxonomy example.
6.1 An extended figure portraying the system architecture.
6.2 Example of data that is extracted from a post. (Original blog post from Bites @ Animal Planet)
7.1 An image containing a single animal. Tag: sloth. (Figure from Bites @ Animal Planet)
7.2 An image containing two or more animals. Tags: dog, fox. (Figure from Bites @ Animal Planet)
7.3 A figure about positives and negatives. (Original figure from http://en.wikipedia.org/wiki/Precision_and_recall)
8.1 A graph of the averages.
8.2 A graph of the average precision scores.
8.3 A graph of the recall scores.
8.4 A graph of the F2 scores.
8.5 A graph of the F0.5 scores.
8.6 A graph of the F1 scores.

List of Tables

8.1 Average precision scores.
8.2 Average recall scores.
8.3 Average F scores.

Chapter 1

Introduction

Within the multitude of images currently available on the Internet, how can we possibly find what we are looking for? With image retrieval now being applied to health and medical applications [29, 40] as well as military and traffic surveillance [25], rapid progress is not only inevitable but also fascinating.

Image retrieval has, for a couple of decades now, been a discipline with a constant flow of research. Image retrieval is about making the images that are part of a computer system easily accessible through means such as search engines. In the late 1970s, the focus of image retrieval lay in what we today call Text-Based Image Retrieval (TBIR), also known as Context-Based Image Retrieval, Meta-Data Image Retrieval or Keyword-Based Image Retrieval. Initially, with databases being quite small, methods focused on so-called Database Management Systems (DBMS) [42], in which a user or an administrator would determine appropriate keywords for the images and store the keywords in a database. With the rapid growth of data available online, however, the problem of individual subjectivity became evident: the different people determining keywords each had their own subjective view of which keywords were appropriate. Another problem was the sheer amount of data that had to be processed. Manually determining keywords was no longer a feasible option, and thus Content-Based Image Retrieval (CBIR) was proposed.

Content-Based Image Retrieval is a discipline with its origin in the field of Computer Vision. It works by determining what an image may portray from the image itself (its features, colors, textures and so on) and comparing that to already known data.
The Text-Based Image Retrieval of today, on the other hand, is a field in which the information that surrounds an image is used as a basis for different Natural Language Processing (NLP) algorithms in order to determine, at least to some degree, what the image may portray.

While Content-Based and Text-Based Image Retrieval might not be enough on their own, a strict word taxonomy applied to the results of such modern image retrieval systems might drastically increase the image retrieval precision. For this work, the Extended Java WordNet Library [1] will be used. The original WordNet [39], which in essence is a lexical database for several languages, has the capability to, given an input word, output information about said word. The output includes a description of the word, direct hyponyms of the word, inherited hypernyms and sister terms. The inherited hypernyms specifically can be seen as a tree structure with nodes and leaves. By using the WordNet tree structure on the results from the Content-Based Image Retrieval component and the results from the Text-Based Image Retrieval component, it is possible to find words that have nodes in common. The common nodes can then be used to augment the existing data or to filter out data that could be considered noise. As such, increasing the image recall at the cost of image precision, or increasing the image precision at the cost of image recall, should be possible. This work will therefore test different implementations of the WordNet tree structure on the Content- and Text-Based output to measure the precision and recall of a Taxonomy-Based Image Retrieval (TaBIR) system.

Following Chapter 1, which contains the introductory sections, the image retrieval background is split into three chapters.
This was done in order to avoid confusion when talking about image retrieval from the viewpoint of two similar, yet very different methodologies, namely Content-Based Image Retrieval as opposed to Text-Based Image Retrieval.
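
The common-node idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the taxonomy is a tiny hand-written child-to-parent table rather than WordNet data, and the names PARENT, keyword_filter, keyword_extender and the min_depth threshold are all assumptions made for illustration. A keyword from the text side is treated as related to a label from the image side when their lowest common hypernym lies deep enough in the tree; the filter keeps only related keywords, while the extender adds the common nodes themselves.

```python
# Toy taxonomy: child -> parent (hypernym). Hand-written for illustration;
# this is NOT WordNet data and not the structure used in the thesis.
PARENT = {
    "dog": "canine", "fox": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore", "carnivore": "mammal",
    "sloth": "mammal", "mammal": "animal", "animal": "entity",
    "car": "vehicle", "vehicle": "entity",
}


def hypernym_path(word):
    """Return [word, parent, ..., root]; an unknown word yields [word]."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Number of steps from node up to the root of its path."""
    return len(hypernym_path(node)) - 1


def lowest_common_node(a, b):
    """First node on a's hypernym path that also lies on b's path, else None."""
    on_b = set(hypernym_path(b))
    for node in hypernym_path(a):
        if node in on_b:
            return node
    return None


def common_nodes(text_keywords, image_labels, min_depth):
    """Map each text keyword to the sufficiently specific nodes it shares with any label."""
    shared = {}
    for word in text_keywords:
        for label in image_labels:
            node = lowest_common_node(word, label)
            if node is not None and depth(node) >= min_depth:
                shared.setdefault(word, set()).add(node)
    return shared


def keyword_filter(text_keywords, image_labels, min_depth=3):
    """Keep only text keywords that share a specific node with an image label."""
    return set(common_nodes(text_keywords, image_labels, min_depth))


def keyword_extender(text_keywords, image_labels, min_depth=3):
    """Keep all text keywords and add the shared nodes as extra keywords."""
    extra = set()
    for nodes in common_nodes(text_keywords, image_labels, min_depth).values():
        extra |= nodes
    return set(text_keywords) | extra


def precision_recall(retrieved, relevant):
    """Set-based precision and recall; 0.0 when the denominator is empty."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


text = {"dog", "car", "sloth"}   # keywords from the text surrounding an image
labels = {"fox"}                 # label from a (hypothetical) image classifier
truth = {"dog", "canine"}        # made-up set of relevant keywords

print(sorted(keyword_filter(text, labels)))    # ['dog']
print(sorted(keyword_extender(text, labels)))  # ['canine', 'car', 'dog', 'sloth']
print(precision_recall(keyword_filter(text, labels), truth))    # (1.0, 0.5)
print(precision_recall(keyword_extender(text, labels), truth))  # (0.5, 1.0)
```

On this toy data the filter raises precision while leaving recall unchanged, and the extender raises recall at the cost of precision, which mirrors the trade-off stated in the introduction. The numbers depend entirely on the invented taxonomy, threshold and relevance set, so they say nothing about the results reported later in the thesis.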