A Survey on How to Cross-Reference Web Information Sources Joe Raad, Aurélie Bertaux, Christophe Cruz

A Survey on how to Cross-Reference Web Information Sources Joe Raad, Aurélie Bertaux, Christophe Cruz To cite this version: Joe Raad, Aurélie Bertaux, Christophe Cruz. A Survey on how to Cross-Reference Web Information Sources. Science and Information Conference, Jul 2015, Londres, United Kingdom. pp.609 - 618, 10.1109/SAI.2015.7237206. hal-01199440 HAL Id: hal-01199440 https://hal.archives-ouvertes.fr/hal-01199440 Submitted on 2 Feb 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. A Survey on how to Cross-Reference Web Information Sources Joe Raad1, Aurelie Bertaux2, and Christophe Cruz3 CheckSem Department, Le2i Laboratory University of Burgundy Dijon, France [email protected], [email protected], [email protected] Abstract—The goal of giving information a well-defined express a single event in thousands of ways in natural meaning is currently shared by different research communities. language sentences. Therefore, paraphrase identification, Once information has a well-defined meaning, it can be searched which determines whether or not two formally distinct strings and retrieved more effectively. Therefore, this paper is a survey about the methods that compare different textual information are similar in meaning is an effective way to solve this sources in order to determine whether they address a similar problem. Furthermore, in order to identify whether two information or not. The improvement of the studied methods will expressions are paraphrases or not, these techniques rely on eventually lead to increase the efficiency of documentary evaluating the semantic measure between these expressions. research. In order to achieve this goal, the first category of Next section presents the definition and general concepts of methods focuses on semantic measure definitions. A second semantic measure. Section three deals with the different category of methods focuses on paraphrase identification techniques, and the last category deals with improving event approaches of cross-referencing information by discussing the extraction techniques. A general discussion is given at the end of following main approaches: Semantic measure definitions, the paper presenting the advantages and disadvantages of these paraphrase identification and event extraction. Finally, before methods. concluding, the last section discusses the limits and the complementarity of these methods. Keywords—Cross-Reference Web Information Sources; Documentary Research; Event Extraction; Paraphrase II. CONTEXT Identification; Semantic Measures; Semantic Relatedness; Similarity Definition. In order to compare cross-referencing methods of web information sources, one must try to evaluate the Semantic I. INTRODUCTION Measure between concepts. Semantic Measure is normally a philosophic term. It is a point of view which differs from one The quantity of the data on the web is increasing day after person to another regarding the semantic links strengths day in an immense way. This property is mainly due to the between two concepts. Trying to computerize this philosophic ease of creating web sites and the importance of social medias. term, in order to compare different textual information, is a In fact, and according to the analysis of Lawrence and Giles complex task and requires to perform high-level language [31], the page number of World Wide Web doubles every two processing. However, the evaluation of the Semantic Measure years. As a result, the difficulty in analyzing and retrieving between two concepts depends firstly on the type of the information on the web increases with the growth of the semantic links, and secondly on the type of knowledge number of pages. This problem proves the necessity of resources. integrating knowledge in the web. This leads us to the purpose A. Semantic Measure Types of converting the current web, dominated by unstructured and In order to compare two concepts, and precisely two textual semi-structured documents into a “web of data” that can be information sources in the case of documentary research, one processed and analyzed by machines. must evaluate the semantic measure between these sources. This paper focuses on one aspect, which is how to improve However, semantic measure is a generic term covering several cross-referencing of web information sources. This basically concepts: means linking the different textual information sources that Semantic relatedness, which is the most general share similar meanings. In order to reach this goal, one must semantic link between two concepts. Two concepts focus on extracting knowledge, and precisely extracting events do not have to share a common meaning to be from different web information sources. However, extracting considered semantically related or close, they can be useful structured representation of events from a disorganized linked by a functional relationship or frequent source like the web is a challenging problem. As one can 1 association relationship like meronym or antonym However, the use of ontologies shows some limitations in concepts. their low coverage (e.g. containing few proper names) and in (e.g. Pilot “is related to” Airplane) their need for experts to supply their knowledge, which makes Semantic similarity, which is a specific case of it difficult to be expanded and updated. semantic relatedness. Two concepts are considered similar if they share common meanings and DBPedia characteristics, like synonym, hyponym and DBPedia is a project aiming to extract structured hypernym concepts. (e.g. Old “is similar to” Ancient) contents from Wikipedia, by allowing users to query Semantic distance, is the inverse of the semantic sophisticated queries containing relationships and properties. relatedness, as it indicates how much two concepts Its structure combined with a very large amount of data, are unrelated to one another. makes DBPedia a reliable and important source of knowledge for several applications including semantic measurements. B. Knowledge Resources Types The evaluation of the semantic measure strongly depends on the level of knowledge of the person comparing different concepts, which is the knowledge resource in the case of a machine trying to perform this task. For instance, the relationship of the two terms “Titanic” and “Avatar” does not exist at all for a given person. But, another person identifies them as related since these terms are both movie titles. Furthermore, a movie addict strongly relates these two terms, as they are not only movie titles, but these movies also share the same writer and director. Therefore, we can see the influence and the importance of the knowledge resources (level of knowledge for humans) in the evaluation of the semantic measure between two concepts. Figure 1. A Fragment of (is-a) Relation in WordNet Different knowledge sources are used in different approaches. Some of the approaches use only one knowledge source, while other approaches use several ones. Among the 2) Semi-Structured Sources most common sources, we can find structured sources like ontologies and the DBPedia, semi-structured sources like Wikipedia Wikipedia and unstructured sources like textual data from the Wikipedia is the most popular internet encyclopedia, and web. the seventh most popular website in the world according to 1) Structured Sources Alexa’s latest web data analysis [32]. Therefore, being the Internet’s largest and most popular general reference work Ontologies makes Wikipedia one of the most used knowledge resource in The word ontology is generally used to mean different this field. Though the fact that it covers most of the fields and things, e.g. glossaries & data dictionaries, thesauri & domains, and contains lot of proper nouns. It is not as taxonomies, schemas & data models, and formal ontologies & important as the web to discover and evaluate semantic inference. These different databases have the same goal, to relatedness. Actually you cannot always find semantic related provide information on the meaning of elements. One of the terms in the same Wikipedia article. most used ontologies is the large English lexical database WordNet. In WordNet, there are four commonly used 3) Unstructured Sources semantic relations for nouns, which are hyponym/hypernym (is-a), part meronym/part holonym (part-of), member Web meronym/member holonym (member-of) and substance The web is now the biggest and the fastest growing source meronym/substance holonym (substance-of). A fragment of of information on earth. Therefore, it is one of the most (is-a) relation between concepts in WordNet is shown in important knowledge resource. Even though the number of Figure1. We can also find many other popular general purpose experts is small compared to the total number of internet users, which can make it sometimes an unreliable source of ontologies like YAGO and SENSUS, and some domain information, it can still be easy to remove the noise in order to specific ontologies like UMLS and MeSH (for biomedical and extract relevant information. health

A Survey on How to Cross-Reference Web Information Sources Joe Raad, Aurélie Bertaux, Christophe Cruz

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support