Measuring Author Influence on Information Retrieval

Total Page:16

File Type:pdf, Size:1020Kb

Measuring Author Influence on Information Retrieval Social Impact Retrieval: Measuring Author Influence on Information Retrieval James Lanagan A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) to the Dublin City University School of Computing Supervisor: Prof. Alan F. Smeaton June 2009 Declaration I hereby certify that this material, which I now submit for assessment on the pro- gramme of study leading to the award of Doctor of Philosophy is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work. Signed: ID No: Date: ACKNOWLEDGEMENTS Firstly I would like to thank my supervisor Alan Smeaton for the help he has given me in finding a direction for my research. I must thank him also for the effort he has put in to the balancing act of knowing when to let me follow my own ideas, whilst still providing that direction. A collection of people helped me through this thesis, all of whom provided me with intelligent insight and the odd push to keep me focussed and confident that the research could be done: During the creation of both my initial systems, Hyowon Lee was of immeasurable help in advising and tweaking the ideas that I had for how the systems should look. Throughout this thesis Paul Ferguson and Pete Wilkins have both provided valu- able knowledge and help in getting through my experiments, providing me with expert judgements, and lastly with perhaps some questionable sporting wisdom. Colum Foley, Adam Bermingham, Cathal Gurrin, and Gareth Jones all listened to my thoughts towards the later months of my thesis and helped to push me over the finish line. A word of thanks to all those both past and present within the CDVP who have helped me enjoy my PhD experience and in particular: Georgina, Sin´ead,Kieran, Mike, Kevin, Phil, Fabrice, Sandra, Neil, Edel, Daragh and Niamh. Mike Christel first got me into the information retrieval game, and put me in contact with Alan and the CDVP. For this I am always grateful. Mick Cooney for all the advice and direction he has provided thanks to his incred- ible knowledge of all things statistical or otherwise. I must also thank most sincerely my parents Geoff and Jane for all they have done for me through my many adventures. Your unwavering support, along with that of my sister Sarah has meant so much. I am truly blessed. Lastly, I thank my girlfriend Linda for all that she has put up with through these last few years. It has been made easier each day by your constant patience and encouragement and I could not have done this without you. Thank you from the bottom of my heart. i ABSTRACT The increased presence of technologies collectively referred to as Web 2.0 mean the entire process of new media production and dissemination has moved away from an authorcentric approach. Casual web users and browsers are increasingly able to play a more active role in the information creation process. This means that the traditional ways in which information sources may be validated and scored must adapt accordingly. In this thesis we propose a new way in which to look at a user's contributions to the network in which they are present, using these interactions to provide a measure of authority and centrality to the user. This measure is then used to attribute an query- independent interest score to each of the contributions the author makes, enabling us to provide other users with relevant information which has been of greatest interest to a community of like-minded users. This is done through the development of two algorithms; AuthorRank and MessageRank. We present two real-world user experiments which focussed around multimedia an- notation and browsing systems that we built; these systems were novel in themselves, bringing together video and text browsing, as well as free-text annotation. Using these systems as examples of real-world applications for our approaches, we then look at a larger-scale experiment based on the author and citation networks of a ten year period of the ACM SIGIR conference on information retrieval between 1997-2007. We use the citation context of SIGIR publications as a proxy for annotations, constructing large social networks between authors. Against these networks we show the effectiveness of incorporating user generated content, or annotations, to improve information retrieval. ii TABLE OF CONTENTS ACKNOWLEDGEMENTS ............................ i ABSTRACT ..................................... ii LIST OF TABLES ................................. ix LIST OF FIGURES ................................ xi I INTRODUCTION ............................... 1 1.1 Social Information Retrieval . .1 1.1.1 User Generated Content and the Social Web . .3 1.2 Thesis Hypothesis . .4 1.3 Research Objectives . .5 1.4 Thesis Organisation . .6 II BACKGROUND & RELATED WORK ................. 8 2.1 Information Retrieval . .9 2.1.1 An Information System . 10 2.1.2 Retrieval Strategies . 13 2.1.3 Linkage Analysis . 20 2.1.4 Evaluating Retrieval Performance . 24 2.1.5 Combining Sources of Evidence . 26 2.2 Social Network Analysis . 29 2.2.1 The Small World . 30 iii 2.2.2 Social Network Models . 31 2.3 Trust and Authority . 33 2.3.1 A Trust Framework . 34 2.3.2 Reputation . 36 2.3.3 Propagation of Trust . 37 2.4 Data Quality . 40 2.4.1 Defining Data . 41 2.4.2 Data Systems . 42 2.4.3 Classifications of Data Quality . 44 2.5 Annotation . 48 2.5.1 Physical Vs. Digital . 49 2.5.2 Annotations as Queries . 51 2.5.3 Annotations as Hyper-links . 51 2.5.4 Grouping Annotations . 52 III WEB 2.0 ..................................... 54 3.1 Web 2.0: People Talking to People . 54 3.1.1 Vote for me . 55 3.1.2 Tag, you're it. 57 3.1.3 Social Commentary . 59 3.2 Two novel Web 2.0 systems . 60 3.2.1 SportsAnno . 61 iv 3.2.2 Summarising Sporting Events . 65 3.2.3 Annoby . 73 3.2.4 Comparison of SportsAnno and Annoby Usage . 82 3.3 Conclusions . 89 IV EXPERIMENTAL SETUP ......................... 90 4.1 Citation . 91 4.1.1 Citation Indexing . 92 4.1.2 Citation Analysis . 94 4.2 An Annotated Corpus . 98 4.2.1 The SIGIR Corpus . 99 4.3 Deriving Annotations from Scientific Citations . 103 4.3.1 System Description . 104 4.3.2 An XML Collection . 108 4.4 Graph Theory . 110 4.5 Corpus Analysis . 112 4.5.1 Citation Network of SIGIR Publications . 114 4.5.2 Citation Network of SIGIR Authors . 115 4.6 Network measures . 119 4.6.1 The Value of Authorship . 122 4.7 An Hypothesis Re-visited . 126 V EXPERIMENTAL SETUP II ........................ 128 v 5.1 Creation of a Ground Truth . 128 5.1.1 Document Selection . 129 5.1.2 Expert Rankings . 133 5.1.3 A Combined Expert Ground-Truth Ranking . 136 5.2 Additional Ranking Sources . 137 5.3 Comparisons of Rankings . 139 5.3.1 Comparing Google Scholar to Our Experts . 141 5.3.2 Harnessing Community Expertise . 144 VI EXPERIMENTS ................................ 146 6.1 A Search System . 146 6.1.1 Document Relevance . 148 6.1.2 A TF-IDF Baseline . 148 6.2 Author Value Revisited . 150 6.3 Calculation of Author Feature Contributions . 156 6.3.1 Combination of Author Features . 167 6.4 Calculation of AuthorRank Weights . 170 6.5 The Contribution of Single Messages . 173 6.6 Calculation of Message Feature Contributions . 177 6.6.1 Combination of Message Features . 181 6.7 Calculation of MessageRank Weights . 186 6.7.1 The Performance of MessageRank . 189 vi 6.8 Comparisons with the SportsAnno Corpus . 191 6.8.1 Collection of a Ground-Truth . 192 6.8.2 Searching Against the SportsAnno Corpus . 195 6.8.3 Using Ar and Mr to Re-Rank . 197 VII CONCLUSIONS AND SUMMARY .................... 200 7.1 Hypothesis Re-visited . 200 7.2 Research Objectives Re-visited . 201 7.2.1 Annotation . 201 7.2.2 Utility . 202 7.3 Conclusions . 203 7.3.1 Annotation Creation . 203 7.3.2 Expert Opinion . 204 7.3.3 Author Network Features . 205 7.3.4 Citation/Annotation Network . 205 7.4 Considerations . 206 7.5 Directions for Future Work . 208 7.6 Summary . 210 APPENDIX A | ANNOBY USER QUESTIONNAIRE ....... 212 APPENDIX B | KENDALL'S CO-EFFICIENT OF CONCORDANCE 216 APPENDIX C | XML AND MPEG-7 .................. 219 vii REFERENCES ................................... 221 viii LIST OF TABLES 1 Dimensions of Data Quality according to the Intuitive Approach . 45 2 Dimensions of Data Quality according to the Empirical Approach . 47 5 PDF file error Statistics for the extended SIGIR corpus . 101 6 PDF file error Statistics for the extended SIGIR corpus . 105 7 Statistics for the different graphs of SIGIR and extended SIGIR corpora114 8 PageRank for citation of SIGIR papers within the extended SIGIR corpus116 9 Statistics For The Top-10 Authors By Co-Authorship . 118 10 Top 5 Authors by Co-Author and Publication Counts . 119 11 Statistics for the top-10 authors by citation within our extended SIGIR corpus . 120 12 Number of papers per year for each of the topics chosen. 130 13 Number of documents returned by Google Scholar queries restricted to the years 1997-2007, and only the ACM SIGIR publication series. 132 14 Rankings assigned for collaborative feedback documents by the experts.
Recommended publications
  • Basic Concepts in Information Retrieval
    1 Basic concepts of information retrieval systems Introduction The term ‘information retrieval’ was coined in 1952 and gained popularity in the research community from 1961 onwards.1 At that time the organizing function of information retrieval was seen as a major advance in libraries that were no longer just storehouses of books, but also places where the information they hold is catalogued and indexed.2 Subsequently, with the introduction of computers in information handling, there appeared a number of databases containing bibliographic details of documents, often married with abstracts, keywords, and so on, and consequently the concept of information retrieval came to mean the retrieval of bibliographic information from stored document databases. Information retrieval is concerned with all the activities related to the organization of, processing of, and access to, information of all forms and formats. An information retrieval system allows people to communicate with an information system or service in order to find information – text, graphic images, sound recordings or video that meet their specific needs. Thus the objective of an information retrieval system is to enable users to find relevant information from an organized collection of documents. In fact, most information retrieval systems are, truly speaking, document retrieval systems, since they are designed to retrieve information about the existence (or non-existence) of documents relevant to a user query. Lancaster3 comments that an information retrieval system does not inform (change the knowledge of) the user on the subject of their enquiry; it merely informs them of the existence (or non-existence) and whereabouts of documents relating to their request.
    [Show full text]
  • Analysis of Social Networks with Missing Data (Draft: Do Not Cite)
    Analysis of social networks with missing data (Draft: do not cite) G. Kossinets∗ Department of Sociology, Columbia University, New York, NY 10027. (Dated: February 4, 2003) We perform sensitivity analyses to assess the impact of missing data on the struc- tural properties of social networks. The social network is conceived of as being generated by a bipartite graph, in which actors are linked together via multiple interaction contexts or affiliations. We discuss three principal missing data mecha- nisms: network boundary specification (non-inclusion of actors or affiliations), survey non-response, and censoring by vertex degree (fixed choice design). Based on the simulation results, we propose remedial techniques for some special cases of network data. I. INTRODUCTION Social network data is often incomplete, which means that some actors or links are missing from the dataset. In a normal social setting, much of the incompleteness arises from the following main sources: the so called boundary specification problem (BSP); respondent inaccuracy; non-response in network surveys; or may be inadvertently introduced via study design (Table I). Although missing data is abundant in empirical studies, little research has been conducted on the possible effect of missing links or nodes on the measurable properties of networks at large.1 In particular, a revision of the original work done primarily in the 1970-80s [4, 17, 21] seems necessary in the light of recent advances that brought new classes of networks to the attention of the interdisciplinary research community [1, 3, 30, 37, 40, 41]. Let us start with a few examples from the literature to illustrate different incarnations of missing data in network research.
    [Show full text]
  • The Seven Ages of Information Retrieval
    International Federation of Library Associations and Institutions UNIVERSAL DATAFLOW AND TELECOMMUNICATIONS CORE PROGRAMME OCCASIONAL PAPER 5 THE SEVEN AGES OF INFORMATION RETRIEVAL Michael Lesk Bellcore March, 1996 International Federation of Library Associations and Institutions UNIVERSAL DATAFLOW AND TELECOMMUNICATIONS CORE PROGRAMME The IFLA Core Programme on Universal Dataflow and Telecommunications (UDT) seeks to facilitate the international and national exchange of electronic data by providing the library community with pragmatic approaches to resource sharing. The programme monitors and promotes the use of relevant standards, promotes the use of relevant technologies and monitors relevant policy issues in an effort to overcome barriers to the electronic transfer of data in library fields. CONTACT INFORMATION Mailing Address: IFLA International Office for UDT c/o National Library of Canada 395 Wellington Street Ottawa, CANADA K1A 0N4 UDT Staff Contacts: Leigh Swain, Director Email: [email protected] Phone: (819) 994-6833 or Louise Lantaigne, Administration Officer Email: [email protected] Phone: (819) 994-6963 Fax: (819) 994-6835 Email: [email protected] URL: http://www.ifla.org/udt/ Occasional papers are available electronically at: http://www.ifla.org/udt/op/ UDT Occasional Papers # 5 Universal Dataflow and Telecommunications Core Programme International Federation of Library Associations and Institutions The Seven Ages of Information Retrieval Michael Lesk Bellcore [email protected] March, 1996 ABSTRACT analysis. This dates to a memo by Warren Weaver in 1949 [Weaver 1955] thinking about the success of Vannevar Bush's 1945 article set a goal of fast access computers in cryptography during the war, and to the contents of the world's libraries which looks suggesting that they could translate languages.
    [Show full text]
  • Graph Theory and Social Networks Spring 2014 Notes
    Graph Theory and Social Networks Spring 2014 Notes Kimball Martin April 30, 2014 Introduction Graph theory is a branch of discrete mathematics (more specifically, combinatorics) whose origin is generally attributed to Leonard Euler's solution of the K¨onigsberg bridge problem in 1736. At the time, there were two islands in the river Pregel, and 7 bridges connecting the islands to each other and to each bank of the river. As legend goes, for leisure, people would try to find a path in the city of K¨onigsberg which traversed each of the 7 bridges exactly once (see Figure1). Euler represented this abstractly as a graph∗, and showed by elementary means that no such path exists. Figure 1: The Seven Bridges of K¨onigsberg (Source: Wikimedia Commons) Intuitively, a graph is just a set of objects which are connected in some way. The objects are called vertices or nodes. Pictorially, we usually draw the vertices as circles, and draw a line between two vertices if they are connected or related (in whatever context we have in mind). These lines are called edges or links. Here are a few examples of abstract graphs. This is a graph with 8 vertices connected in a circle. ∗In this course, graph does not mean the graph of a function, as in calculus. It is unfortunate, but these two very basic objects in mathematics have the same name. 1 Graph Theory/Social Networks Introduction Kimball Martin (Spring 2014) 1 2 8 3 7 4 6 5 This is a graph on 5 vertices, where all pairs of vertices are connected.
    [Show full text]
  • Graph Theory and Social Networks - Part I
    Graph Theory and Social Networks - part I ! EE599: Social Network Systems ! Keith M. Chugg Fall 2014 1 © Keith M. Chugg, 2014 Overview • Summary • Graph definitions and properties • Relationship and interpretation in social networks • Examples © Keith M. Chugg, 2014 2 References • Easley & Kleinberg, Ch 2 • Focus on relationship to social nets with little math • Barabasi, Ch 2 • General networks with some math • Jackson, Ch 2 • Social network focus with more formal math © Keith M. Chugg, 2014 3 Graph Definition 24 CHAPTER 2. GRAPHS A A • G= (V,E) • V=set of vertices B B C D C D • E=set of edges (a) A graph on 4 nodes. (b) A directed graph on 4 nodes. Figure 2.1: Two graphs: (a) an undirected graph, and (b) a directed graph. Modeling of networks Easley & Kleinberg • will be undirected unless noted otherwise. Graphs as Models of Networks. Graphs are useful because they serve as mathematical Vertex is a person (ormodels ofentity) network structures. With this in mind, it is useful before going further to replace • the toy examples in Figure 2.1 with a real example. Figure 2.2 depicts the network structure of the Internet — then called the Arpanet — in December 1970 [214], when it had only 13 sites. Nodes represent computing hosts, and there is an edge joining two nodes in this picture Edge represents a relationshipif there is a direct communication link between them. Ignoring the superimposed map of the • U.S. (and the circles indicating blown-up regions in Massachusetts and Southern California), the rest of the image is simply a depiction of this 13-node graph using the same dots-and-lines style that we saw in Figure 2.1.
    [Show full text]
  • Information Retrieval (Text Categorization)
    Information Retrieval (Text Categorization) Fabio Aiolli http://www.math.unipd.it/~aiolli Dipartimento di Matematica Pura ed Applicata Università di Padova Anno Accademico 2008/2009 Dip. di Matematica F. Aiolli - Information Retrieval 1 Pura ed Applicata 2008/2009 Text Categorization Text categorization (TC - aka text classification) is the task of buiding text classifiers, i.e. sofware systems that classify documents from a domain D into a given, fixed set C = {c 1,…,c m} of categories (aka classes or labels) TC is an approximation task , in that we assume the existence of an ‘oracle’, a target function that specifies how docs ought to be classified. Since this oracle is unknown , the task consists in building a system that ‘approximates’ it Dip. di Matematica F. Aiolli - Information Retrieval 2 Pura ed Applicata 2008/2009 Text Categorization We will assume that categories are symbolic labels; in particular, the text constituting the label is not significant. No additional knowledge of category ‘meaning’ is available to help building the classifier The attribution of documents to categories should be realized on the basis of the content of the documents. Given that this is an inherently subjective notion, the membership of a document in a category cannot be determined with certainty Dip. di Matematica F. Aiolli - Information Retrieval 3 Pura ed Applicata 2008/2009 Single-label vs Multi-label TC TC comes in two different variants: Single-label TC (SL) when exactly one category should be assigned to a document The target function in the form f : D → C should be approximated by means of a classifier f’ : D → C Multi-label TC (ML) when any number {0,…,m} of categories can be assigned to each document The target function in the form f : D → P(C) should be approximated by means of a classifier f’ : D → P(C) We will often indicate a target function with the alternative notation f : D × C → {-1,+1} .
    [Show full text]
  • Information Retrieval System: Concept and Scope MODULE - 5B INFORMATION RETRIEVAL SYSTEM
    Information Retrieval System: Concept and Scope MODULE - 5B INFORMATION RETRIEVAL SYSTEM 15 Notes INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find information relevant to the task at hand. In view of this, information retrieval (IR) deals with the representation, storage, organization of/and access to information items. Here, types of information items include documents, Web pages, online catalogues, structured records, multimedia objects, etc. Chief goals of the IR are indexing text and searching for useful documents in a collection. Libraries were among the first institutions to adopt IR systems for retrieving information. In this lesson, you will be introduced to the importance, definitions and objectives of information retrieval. You will also study in detail the concept of subject approach to information, process of information retrieval, and indexing languages. 15.2 OBJECTIVES After studying this lesson, you will be able to: define information retrieval; understand the importance and need of information retrieval system; explain the concept of subject approach to information; LIBRARY AND INFORMATION SCIENCE 321 MODULE - 5B Information Retrieval System: Concept and Scope INFORMATION RETRIEVAL SYSTEM illustrate the process of information retrieval; and differentiate between natural, free and controlled indexing languages. 15.3 INFORMATION RETRIEVAL (IR) Notes The term ‘information retrieval’ was coined by Calvin Mooers in 1950. It gained popularity in the research community from 1961 onwards, when computers were introduced for information handling. The term information retrieval was then used to mean retrieval of bibliographic information from stored document databases.
    [Show full text]
  • A Note on the Importance of Collaboration Graphs
    Int. J. of Mathematical Sciences and Applications, Vol. 1, No. 3, September 2011 Copyright Mind Reader Publications www.journalshub.com A Note on the Importance of Collaboration Graphs V.Yegnanarayanan1 and G.K.Umamaheswari2 1Senior Professor, Department of Mathematics, Velammal Engineering College,Ambattur-Red Hills Road, Chennai - 600 066, India. Email id:[email protected] 2Research Scholar, Research and Development Centre, Bharathiar University, Coimbatore-641046, India. Abstract Numerous challenging problems in graph theory has attracted the attention and imagination of researchers from physics, computer science, engineering, biology, social science and mathematics. If we put all these different branches one into basket, what evolves is a new science called “Network Science”. It calls for a solid scientific foundation and vigorous analysis. Graph theory in general and the collaboration graphs, in particular are well suited for this task. In this paper, we give a overview of the importance of collaboration graphs with its interesting background. Also we study one particular type of collaboration graph and list a number of open problems. Keywords :collaboration graph, network science, erdos number AMS subject Classification: 05XX, 68R10 1. Introduction In the past decade, graph theory has gone through a remarkable shift and a profound transformation. The change is in large part due to the humongous amount of information that we are confronted with. A main way to sort through massive data sets is to build and examine the network formed by interrelations. For example, Google’s successful web search algorithms are based on the www graph, which contains all web pages as vertices and hyperlinks as edges.
    [Show full text]
  • A Philosophical Wish List for Research in Music Information Retrieval Cynthia M
    A Philosophical Wish List for Research in Music Information Retrieval Cynthia M. Grund Institute of Philosophy, Education and the Study of Religions - Philosophy University of Southern Denmark, Odense [email protected] Abstract what might otherwise seem utterly intractable questions. Within a framework provided by the traditional trio What insights could MIR-research provide into the consisting of metaphysics, epistemology and ethics, a first great questions of philosophy? The answer lies in the stab is made at a wish list for MIR-research from a particular facility which the tools of MIR possess for philosophical point of view. Since the tools of MIR are dealing with language as a sonic phenomenon, thus 1 equipped to study language and its use from a purely sonic providing yet another revolution in the linguistic turn. A standpoint, MIR research could result in another revealing search through the reams of literature produced regarding revolution within the linguistic turn in philosophy. the role of language in philosophical endeavor reveals that the language framework is virtually always a written one. Keywords: Philosophy and MIR, language as spoken, Little or no attention has been paid to the mechanisms at memory work in thinking, learning and communicating in a context 2 1. Introduction and Brief Setting of the Stage which is virtually of an exclusively oral, sonic. Since the overwhelming majority of time during which our thinking, Philosophy wrestles with questions regarding learning and communicative skills developed was metaphysics,
    [Show full text]
  • An Architecture for Information Retrieval Agents* Craig A
    From: AAAI Technical Report SS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved. An Architecture for Information Retrieval Agents* Craig A. Knoblock and Yigal Arens Information Sciences Institute University of Southern California 4676 Admiralty Way Marina del Rey, CA 90292, USA {KNOBLOCK,ARENS}QISI.ED Abstract sources that are available to it. Given an informa- tion request, an agent identifies an appropriate set of With the vast number of information resources information sources, generates a plan to retrieve and available today, a critical problem is howto lo- process the data, uses knowledge about the data to re- cate, retrieve and process information. It would formulate the plan, and then executes it. This paper be impractical to build a single unified system that describes our approach to the issues of representation, combines all of these information resources. A communication, problem solving, and learning, and de- more promising approach is to build specialized scribes how this approach supports multiple, collabo- information retrieval agents that provide access rating information retrieval agents. to a subset of the information resources and can send requests to other information retrieval agents Representing the Knowledge of an when needed. In this paper we present an archi- tecture for building such agents that addresses the Agent issues of representation, communication, problem Each information agent is specialized to a particular solving, and learning. Wealso describe how this area of expertise. This provides a modular organiza- architecture supports agents that are modular, ex- tion of the vast numberof information sources and pro- tensible, flexible and efficient, vides a clear delineation of the types of queries each agent can handle.
    [Show full text]
  • Semantic Information Retrieval Based on Wikipedia Taxonomy
    International Journal of Computer Applications Technology and Research Volume 2– Issue 1, 77-80, 2013, ISSN: 2319–8656 Semantic Information Retrieval based on Wikipedia Taxonomy May Sabai Han University of Technology Yatanarpon Cyber City Myanmar Abstract: Information retrieval is used to find a subset of relevant documents against a set of documents. Determining semantic similarity between two terms is a crucial problem in Web Mining for such applications as information retrieval systems and recommender systems. Semantic similarity refers to the sameness of two terms based on sameness of their meaning or their semantic contents. Recently many techniques have introduced measuring semantic similarity using Wikipedia, a free online encyclopedia. In this paper, a new technique of measuring semantic similarity is proposed. The proposed method uses Wikipedia as an ontology and spreading activation strategy to compute semantic similarity. The utility of the proposed system is evaluated by using the taxonomy of Wikipedia categories. Keywords: information retrieval; semantic similarity; spreading activation strategy; wikipedia taxonomy; wikipedia categories 1. INTRODUCTION fashion than a search engine and with more coverage than WordNet. And Wikipedia articles have been categorized by Information in WWW are scattered and diverse in nature. So, providing a taxonomy, categories. This feature provides the users frequently fail to describe the information desired. hierarchical structure or network. Wikipedia also provides Traditional search techniques are constrained by keyword articles link graph. So many researches has recently used based matching techniques. Hence low precision and recall is Wikipedia as an ontology to measure semantic similarity. obtained [2]. Many natural language processing applications must estimate the semantic similarity of pairs of text We propose a method to use structured knowledge extracted fragments provided as input, e.g.
    [Show full text]
  • Introduction to Information Retrieval
    DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome. 1 1 Boolean retrieval The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, INFORMATION information retrieval might be defined thus: RETRIEVAL Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar pro- fessional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email.1 Information retrieval is fast becoming the dominant form of information access, overtaking traditional database- style searching (the sort that is going on when a clerk says to you: “I’m sorry, I can only look up your order if you can give me your Order ID”). IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term “unstructured data” refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to main- tain product inventories and personnel records.
    [Show full text]