Information Extraction Meets the Semantic Web: a Survey

Total Page:16

File Type:pdf, Size:1020Kb

Information Extraction Meets the Semantic Web: a Survey Undefined 0 (2016) 1 1 IOS Press Information Extraction meets the Semantic Web: A Survey Editor(s): Andreas Hotho, Julius-Maximilians-Universität Würzburg, Würzburg, Germany Solicited review(s): Dat Ba Nguyen, Vietnam National University, Hanoi, Vietnam; Simon Scerri, University of Bonn, Germany; 2 Anonymous Reviewers Open review(s): Michelle Cheatham, Wright State University, Dayton, Ohio, USA Jose L. Martinez-Rodriguez a, Aidan Hogan b and Ivan Lopez-Arevalo a a Cinvestav Tamaulipas, Ciudad Victoria, Mexico E-mail: {lmartinez,ilopez}@tamps.cinvestav.mx b IMFD Chile; Department of Computer Science, University of Chile, Chile E-mail: [email protected] Abstract. We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Se- mantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Seman- tic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables. Keywords: Information Extraction, Entity Linking, Keyword Extraction, Topic Modeling, Relation Extraction, Semantic Web 1. Introduction able on the Web – information that is constantly chang- ing – for it to be feasible to apply manual annotation to The Semantic Web pursues a vision of the Web even a significant subset of what might be of relevance. where increased availability of structured content en- While the amount of structured data available on ables higher levels of automation. Berners-Lee [20] the Web has grown significantly in the past years, described this goal as being to “enrich human read- there is still a significant gap between the coverage able web data with machine readable annotations, al- of structured and unstructured data available on the lowing the Web’s evolution as the biggest database in Web [248]. Mika referred to this as the semantic the world”. However, making annotations on informa- gap [205], whereby the demand for structured data on tion from the Web is a non-trivial task for human users, the Web outstrips its supply. For example, in an anal- particularly if some formal agreement is required to ysis of the 2013 Common Crawl dataset, Meusel et ensure that annotations are consistent across sources. al. [201] found that of the 2.2 billion webpages con- Likewise, there is simply too much information avail- sidered, 26.3% contained some structured metadata. 0000-0000/16/$00.00 © 2016 – IOS Press and the authors. All rights reserved 2 J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web Thus, despite initiatives like Linking Open Data [274], language that is founded on one of the core Semantic Schema.org [200,204] (promoted by Google, Mi- Web standards: RDF/RDFS/OWL/SKOS/SPARQL.2 crosoft, Yahoo, and Yandex) and the Open Graph Pro- By Information Extraction methods, we focus on the tocol [127] (promoted by Facebook), this semantic gap extraction and/or linking of three main elements from is still observable on the Web today [205,201]. an (unstructured or semi-structured) input source. As a result, methods to automatically extract or en- hance the structure of various corpora have been a core 1. Entities: anything with named identity, typically topic in the context of the Semantic Web. Such pro- an individual (e.g., Barack Obama, 1961). cesses are often based on Information Extraction meth- 2. Concepts: a conceptual grouping of elements. We ods, which in turn are rooted in techniques from areas consider two types of concepts: such as Natural Language Processing, Machine Learn- – Classes: a named set of individuals (e.g., ing and Information Retrieval. The combination of U.S. President(s)); techniques from the Semantic Web and from Informa- – Topics: categories to which individuals or tion Extraction can be seen from two perspectives: on documents relate (e.g, U.S. Politics). the one hand, Information Extraction techniques can 3. Relations: an n-ary tuple of entities (n ≥ 2) with a be applied to populate the Semantic Web, while on the predicate term denoting the type of relation (e.g., other hand, Semantic Web techniques can be applied marry(Barack Obama;Michele Obama;Chicago). to guide the Information Extraction process. In some cases, both aspects are considered together, where an More formally, we can consider entities as atomic el- existing Semantic Web ontology or knowledge-base is ements from the domain, concepts as unary predi- used to guide the extraction, which further populates cates, and relations as n-ary (n ≥ 2) predicates. We the given ontology and/or knowledge-base (KB).1 take a rather liberal interpretation of concepts to in- In the past years, we have seen a wealth of research clude both classes based on set-theoretic subsumption dedicated to Information Extraction in a Semantic Web of instances (e.g., OWL classes [135]), as well as top- setting. While many such papers come from within ics that form categories over which broader/narrower the Semantic Web community, many recent works relations can be defined (e.g., SKOS concepts [206]). have come from other communities, where, in partic- This is rather a practical decision that will allow us to ular, general-knowledge Semantic Web KBs – such draw together a collective summary of works in the in- as DBpedia [170], Freebase [26] and YAGO2 [138]– terrelated areas of Terminology Extraction, Keyword have been broadly adopted as references for enhanc- Extraction, Topic Modeling, etc., under one heading. ing Information Extraction tasks. Given the wide va- Returning to “extracting and/or linking”, we con- riety of works emerging in this particular intersection sider the extraction process as identifying mentions re- from various communities (sometimes under different ferring to such entities/concepts/relations in the un- nomenclatures), we see that a comprehensive survey structured or semi-structured input, while we consider is needed to draw together the techniques proposed in the linking process as associating a disambiguated such works. Our goal is then to provide such a survey. identifier in a Semantic Web ontology/KB for a men- tion, possibly creating one if not already present and Survey Scope: This survey provides an overview of using it to disambiguate and link further mentions. published works that directly involve both Information Extraction methods and Semantic Web technologies. Information Extraction Tasks: The survey deals with Given that both are very broad areas, we must be rather various Information Extraction tasks. We now give an explicit in our inclusion criteria. introductory summary of the main tasks considered With respect to Semantic Web technologies, to be (though we note that the survey will delve into each included in the scope of a survey, a work must make task in much more depth later): non-trivial use of an ontology, knowledge-base, tool or Named Entity Recognition: demarcate the locations of mentions of entities in an input text: 1Herein we adopt the convention that the term “ontology” refers – aka. Entity Recognition, Entity Extraction; primarily to terminological knowledge, meaning that it describes classes and properties of the domain, such as person, knows, coun- try, etc. On the other hand, we use the term “KB” to refer to primar- 2Works that simply mention general terms such as “semantic” or ily “assertional knowledge”, which describes specific entities (aka. “ontology” may be excluded by this criteria if they do not also di- individuals) of the domain, such as Barack Obama, China, etc. rectly use or depend upon a Semantic Web standard. J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web 3 – e.g., in the sentence “Barack Obama was Topic Labeling: For clusters of words identified as born in Hawaii”, mark the underlined abstract topics, extract a single term or phrase that phrases as entity mentions. best characterizes the topic; Entity Linking: associate mentions of entities with – aka. Topic Identification, esp. when linked an appropriate disambiguated KB identifier: with an ontology/KB identifier; often used – involves, or is sometimes synonymous with, for the purposes of Text Classification; Entity Disambiguation;3 often used for the – e.g., identify that the topic f “cancer”, purposes of Semantic Annotation; “breast”, “doctor”, “chemotherapy” g is – e.g., associate “Hawaii” with the DBpedia best characterized with the term “cancer” identifier dbr:Hawaii for the U.S. state (potentially linked to dbr:Cancer for the (rather than the identifier for various songs disease and not, e.g., the astrological sign). 4 or books by the same name). Relation Extraction: Extract potentially n-ary rela- Terminology Extraction: extract the main phrases tions (for n ≥ 2) from an unstructured (i.e., text) that denote concepts relevant to a given domain or semi-structured (e.g., HTML table) source; described by a corpus, sometimes inducing hier- – a goal of the area of Open Information Ex- archical relations between concepts; traction; – aka.
Recommended publications
  • Information Extraction from Electronic Medical Records Using Multitask Recurrent Neural Network with Contextual Word Embedding
    applied sciences Article Information Extraction from Electronic Medical Records Using Multitask Recurrent Neural Network with Contextual Word Embedding Jianliang Yang 1 , Yuenan Liu 1, Minghui Qian 1,*, Chenghua Guan 2 and Xiangfei Yuan 2 1 School of Information Resource Management, Renmin University of China, 59 Zhongguancun Avenue, Beijing 100872, China 2 School of Economics and Resource Management, Beijing Normal University, 19 Xinjiekou Outer Street, Beijing 100875, China * Correspondence: [email protected]; Tel.: +86-139-1031-3638 Received: 13 August 2019; Accepted: 26 August 2019; Published: 4 September 2019 Abstract: Clinical named entity recognition is an essential task for humans to analyze large-scale electronic medical records efficiently. Traditional rule-based solutions need considerable human effort to build rules and dictionaries; machine learning-based solutions need laborious feature engineering. For the moment, deep learning solutions like Long Short-term Memory with Conditional Random Field (LSTM–CRF) achieved considerable performance in many datasets. In this paper, we developed a multitask attention-based bidirectional LSTM–CRF (Att-biLSTM–CRF) model with pretrained Embeddings from Language Models (ELMo) in order to achieve better performance. In the multitask system, an additional task named entity discovery was designed to enhance the model’s perception of unknown entities. Experiments were conducted on the 2010 Informatics for Integrating Biology & the Bedside/Veterans Affairs (I2B2/VA) dataset. Experimental results show that our model outperforms the state-of-the-art solution both on the single model and ensemble model. Our work proposes an approach to improve the recall in the clinical named entity recognition task based on the multitask mechanism.
    [Show full text]
  • Knowledge Extraction by Internet Monitoring to Enhance Crisis Management
    El Hamali Knowledge Extraction by Internet Monitoring to Enhance Crisis Management Knowledge Extraction by Internet Monitoring to Enhance Crisis Management El Hamali Samiha Nadia Nouali-TAboudjnent, Omar Nouali National School for Computer Science ESI, Research Center of Scientific and Technique Algiers, Algeria Information CERIST, Algeria [email protected] {nnouali, onouali}@cerist.dz ABSTRACT This paper presents our work on developing a system for Internet monitoring and knowledge extraction from different web documents which contain information about disasters. This system is based on ontology of the disasters domain for the knowledge extraction and it presents all the information extracted according to the kind of the disaster defined in the ontology. The system disseminates the information extracted (as a synthesis of the web documents) to the users after a filtering based on their profiles. The profile of a user is updated automatically by interactively taking into account his feedback. Keywords Crisis management, internet monitoring, knowledge extraction, ontology, information filtering. INTRODUCTION Several crises and disasters happened in the last decades. After the southern Asian tsunami, the Katrina hurricane in the United States, and the Kashmir earthquake, people throughout the world used the Web as a source of information and a means of communication. In the hours and days immediately following a disaster, a lot of information becomes available on the web. With a single Google search, 17 millions “emergency management” sites and documents have been listed (Min & Peishih, 2008). The user finds a lot of information which comes from different sources, for instance : during the first 3 days of tsunami disaster in south-east Asia, 4000 reports were gathered by the Google News service, and within a month of the incident, 200000 distinct news reports from several online sources (Yiming, et al., 2007 IEEE).
    [Show full text]
  • The History and Recent Advances of Natural Language Interfaces for Databases Querying
    E3S Web of Conferences 229, 01039 (2021) https://doi.org/10.1051/e3sconf/202122901039 ICCSRE’2020 The history and recent advances of Natural Language Interfaces for Databases Querying Khadija Majhadi1*, and Mustapha Machkour1 1Team of Engineering of Information Systems, Faculty of Sciences Agadir, Morocco Abstract. Databases have been always the most important topic in the study of information systems, and an indispensable tool in all information management systems. However, the extraction of information stored in these databases is generally carried out using queries expressed in a computer language, such as SQL (Structured Query Language). This generally has the effect of limiting the number of potential users, in particular non-expert database users who must know the database structure to write such requests. One solution to this problem is to use Natural Language Interface (NLI), to communicate with the database, which is the easiest way to get information. So, the appearance of Natural Language Interfaces for Databases (NLIDB) is becoming a real need and an ambitious goal to translate the user’s query given in Natural Language (NL) into the corresponding one in Database Query Language (DBQL). This article provides an overview of the state of the art of Natural Language Interfaces as well as their architecture. Also, it summarizes the main recent advances on the task of Natural Language Interfaces for databases. 1 Introduction of recent Natural Language Interfaces for databases by highlighting the architectures they used before For several years, databases have become the preferred concluding and giving some perspectives on our future medium for data storage in all information management work.
    [Show full text]
  • BIOINFORMATICS Doi:10.1093/Bioinformatics/Bti144
    Vol. 00 no. 0 2004, pages 1–11 BIOINFORMATICS doi:10.1093/bioinformatics/bti144 Solving and analyzing side-chain positioning problems using linear and integer programming Carleton L. Kingsford, Bernard Chazelle and Mona Singh∗ Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, 35, Olden Street, Princeton, NJ 08544, USA Received on August 1, 2004; revised on October 10, 2004; accepted on November 8, 2004 Advance Access publication … ABSTRACT set of possible rotamer choices (Ponder and Richards, 1987; Motivation: Side-chain positioning is a central component of Dunbrack and Karplus, 1993) for each Cα position on the homology modeling and protein design. In a common for- backbone. The goal is to choose a rotamer for each position mulation of the problem, the backbone is fixed, side-chain so that the total energy of the molecule is minimized. This conformations come from a rotamer library, and a pairwise formulation of SCP has been the basis of some of the more energy function is optimized. It is NP-complete to find even a successful methods for homology modeling (e.g. Petrey et al., reasonable approximate solution to this problem. We seek to 2003; Xiang and Honig, 2001; Jones and Kleywegt, 1999; put this hardness result into practical context. Bower et al., 1997) and protein design (e.g. Dahiyat and Mayo, Results: We present an integer linear programming (ILP) 1997; Malakauskas and Mayo, 1998; Looger et al., 2003). In formulation of side-chain positioning that allows us to tackle homology modeling, the goal is to predict the structure for a large problem sizes.
    [Show full text]
  • Injection of Automatically Selected Dbpedia Subjects in Electronic
    Injection of Automatically Selected DBpedia Subjects in Electronic Medical Records to boost Hospitalization Prediction Raphaël Gazzotti, Catherine Faron Zucker, Fabien Gandon, Virginie Lacroix-Hugues, David Darmon To cite this version: Raphaël Gazzotti, Catherine Faron Zucker, Fabien Gandon, Virginie Lacroix-Hugues, David Darmon. Injection of Automatically Selected DBpedia Subjects in Electronic Medical Records to boost Hos- pitalization Prediction. SAC 2020 - 35th ACM/SIGAPP Symposium On Applied Computing, Mar 2020, Brno, Czech Republic. 10.1145/3341105.3373932. hal-02389918 HAL Id: hal-02389918 https://hal.archives-ouvertes.fr/hal-02389918 Submitted on 16 Dec 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Injection of Automatically Selected DBpedia Subjects in Electronic Medical Records to boost Hospitalization Prediction Raphaël Gazzotti Catherine Faron-Zucker Fabien Gandon Université Côte d’Azur, Inria, CNRS, Université Côte d’Azur, Inria, CNRS, Inria, Université Côte d’Azur, CNRS, I3S, Sophia-Antipolis, France I3S, Sophia-Antipolis, France
    [Show full text]
  • Ontologies and Semantic Web for the Internet of Things - a Survey
    See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/312113565 Ontologies and Semantic Web for the Internet of Things - a survey Conference Paper · October 2016 DOI: 10.1109/IECON.2016.7793744 CITATIONS READS 5 256 2 authors: Ioan Szilagyi Patrice Wira Université de Haute-Alsace Université de Haute-Alsace 10 PUBLICATIONS 17 CITATIONS 122 PUBLICATIONS 679 CITATIONS SEE PROFILE SEE PROFILE Some of the authors of this publication are also working on these related projects: Physics of Solar Cells and Systems View project Artificial intelligence for renewable power generation and management: Application to wind and photovoltaic systems View project All content following this page was uploaded by Patrice Wira on 08 January 2018. The user has requested enhancement of the downloaded file. Ontologies and Semantic Web for the Internet of Things – A Survey Ioan Szilagyi, Patrice Wira MIPS Laboratory, University of Haute-Alsace, Mulhouse, France {ioan.szilagyi; patrice.wira}@uha.fr Abstract—The reality of Internet of Things (IoT), with its one of the most important task in an IoT system [6]. Providing growing number of devices and their diversity is challenging interoperability among the things is “one of the most current approaches and technologies for a smarter integration of fundamental requirements to support object addressing, their data, applications and services. While the Web is seen as a tracking and discovery as well as information representation, convenient platform for integrating things, the Semantic Web can storage, and exchange” [4]. further improve its capacity to understand things’ data and facilitate their interoperability. In this paper we present an There is consensus that Semantic Technologies is the overview of some of the Semantic Web technologies used in IoT appropriate tool to address the diversity of Things [4], [7]–[9].
    [Show full text]
  • Handling Semantic Complexity of Big Data Using Machine Learning and RDF Ontology Model
    S S symmetry Article Handling Semantic Complexity of Big Data using Machine Learning and RDF Ontology Model Rauf Sajjad *, Imran Sarwar Bajwa and Rafaqut Kazmi Department of Computer Science & IT, The Islamia University Bahawalpur, Bahawalpur 63100, Pakistan; [email protected] (I.S.B.); [email protected] (R.K.) * Correspondence: [email protected]; Tel.: +92-62-925-5466 Received: 3 January 2019; Accepted: 14 February 2019; Published: 1 March 2019 Abstract: Business information required for applications and business processes is extracted using systems like business rule engines. Since the advent of Big Data, such rule engines are producing rules in a big quantity whereas more rules lead to more complexity in semantic analysis and understanding. This paper introduces a method to handle semantic complexity in rules and support automated generation of Resource Description Framework (RDF) metadata model of rules and such model is used to assist in querying and analysing Big Data. Practically, the dynamic changes in rules can be a source of conflict in rules stored in a repository. It is identified during the literature review that there is a need of a method that can semantically analyse rules and help business analysts in testing and validating the rules once a change is made in a rule. This paper presents a robust method that not only supports semantic analysis of rules but also generates RDF metadata model of rules and provide support of querying for the sake of semantic interpretation of the rules. The results of the experiments manifest that consistency checking of a set of big data rules is possible through automated tools.
    [Show full text]
  • A Comparison of Knowledge Extraction Tools for the Semantic Web
    A Comparison of Knowledge Extraction Tools for the Semantic Web Aldo Gangemi1;2 1 LIPN, Universit´eParis13-CNRS-SorbonneCit´e,France 2 STLab, ISTC-CNR, Rome, Italy. Abstract. In the last years, basic NLP tasks: NER, WSD, relation ex- traction, etc. have been configured for Semantic Web tasks including on- tology learning, linked data population, entity resolution, NL querying to linked data, etc. Some assessment of the state of art of existing Knowl- edge Extraction (KE) tools when applied to the Semantic Web is then desirable. In this paper we describe a landscape analysis of several tools, either conceived specifically for KE on the Semantic Web, or adaptable to it, or even acting as aggregators of extracted data from other tools. Our aim is to assess the currently available capabilities against a rich palette of ontology design constructs, focusing specifically on the actual semantic reusability of KE output. 1 Introduction We present a landscape analysis of the current tools for Knowledge Extraction from text (KE), when applied on the Semantic Web (SW). Knowledge Extraction from text has become a key semantic technology, and has become key to the Semantic Web as well (see. e.g. [31]). Indeed, interest in ontology learning is not new (see e.g. [23], which dates back to 2001, and [10]), and an advanced tool like Text2Onto [11] was set up already in 2005. However, interest in KE was initially limited in the SW community, which preferred to concentrate on manual design of ontologies as a seal of quality. Things started changing after the linked data bootstrapping provided by DB- pedia [22], and the consequent need for substantial population of knowledge bases, schema induction from data, natural language access to structured data, and in general all applications that make joint exploitation of structured and unstructured content.
    [Show full text]
  • Rdfa in XHTML: Syntax and Processing Rdfa in XHTML: Syntax and Processing
    RDFa in XHTML: Syntax and Processing RDFa in XHTML: Syntax and Processing RDFa in XHTML: Syntax and Processing A collection of attributes and processing rules for extending XHTML to support RDF W3C Recommendation 14 October 2008 This version: http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014 Latest version: http://www.w3.org/TR/rdfa-syntax Previous version: http://www.w3.org/TR/2008/PR-rdfa-syntax-20080904 Diff from previous version: rdfa-syntax-diff.html Editors: Ben Adida, Creative Commons [email protected] Mark Birbeck, webBackplane [email protected] Shane McCarron, Applied Testing and Technology, Inc. [email protected] Steven Pemberton, CWI Please refer to the errata for this document, which may include some normative corrections. This document is also available in these non-normative formats: PostScript version, PDF version, ZIP archive, and Gzip’d TAR archive. The English version of this specification is the only normative version. Non-normative translations may also be available. Copyright © 2007-2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply. Abstract The current Web is primarily made up of an enormous number of documents that have been created using HTML. These documents contain significant amounts of structured data, which is largely unavailable to tools and applications. When publishers can express this data more completely, and when tools can read it, a new world of user functionality becomes available, letting users transfer structured data between applications and web sites, and allowing browsing applications to improve the user experience: an event on a web page can be directly imported - 1 - How to Read this Document RDFa in XHTML: Syntax and Processing into a user’s desktop calendar; a license on a document can be detected so that users can be informed of their rights automatically; a photo’s creator, camera setting information, resolution, location and topic can be published as easily as the original photo itself, enabling structured search and sharing.
    [Show full text]
  • A Meaningful Information Extraction System for Interactive Analysis of Documents Julien Maitre, Michel Ménard, Guillaume Chiron, Alain Bouju, Nicolas Sidère
    A meaningful information extraction system for interactive analysis of documents Julien Maitre, Michel Ménard, Guillaume Chiron, Alain Bouju, Nicolas Sidère To cite this version: Julien Maitre, Michel Ménard, Guillaume Chiron, Alain Bouju, Nicolas Sidère. A meaningful infor- mation extraction system for interactive analysis of documents. International Conference on Docu- ment Analysis and Recognition (ICDAR 2019), Sep 2019, Sydney, Australia. pp.92-99, 10.1109/IC- DAR.2019.00024. hal-02552437 HAL Id: hal-02552437 https://hal.archives-ouvertes.fr/hal-02552437 Submitted on 23 Apr 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. A meaningful information extraction system for interactive analysis of documents Julien Maitre, Michel Menard,´ Guillaume Chiron, Alain Bouju, Nicolas Sidere` L3i, Faculty of Science and Technology La Rochelle University La Rochelle, France fjulien.maitre, michel.menard, guillaume.chiron, alain.bouju, [email protected] Abstract—This paper is related to a project aiming at discover- edge and enrichment of the information. A3: Visualization by ing weak signals from different streams of information, possibly putting information into perspective by creating representa- sent by whistleblowers. The study presented in this paper tackles tions and dynamic dashboards.
    [Show full text]
  • The Application of Semantic Web Technologies to Content Analysis in Sociology
    THEAPPLICATIONOFSEMANTICWEBTECHNOLOGIESTO CONTENTANALYSISINSOCIOLOGY MASTER THESIS tabea tietz Matrikelnummer: 749153 Faculty of Economics and Social Science University of Potsdam Erstgutachter: Alexander Knoth, M.A. Zweitgutachter: Prof. Dr. rer. nat. Harald Sack Potsdam, August 2018 Tabea Tietz: The Application of Semantic Web Technologies to Content Analysis in Soci- ology, , © August 2018 ABSTRACT In sociology, texts are understood as social phenomena and provide means to an- alyze social reality. Throughout the years, a broad range of techniques evolved to perform such analysis, qualitative and quantitative approaches as well as com- pletely manual analyses and computer-assisted methods. The development of the World Wide Web and social media as well as technical developments like optical character recognition and automated speech recognition contributed to the enor- mous increase of text available for analysis. This also led sociologists to rely more on computer-assisted approaches for their text analysis and included statistical Natural Language Processing (NLP) techniques. A variety of techniques, tools and use cases developed, which lack an overall uniform way of standardizing these approaches. Furthermore, this problem is coupled with a lack of standards for reporting studies with regards to text analysis in sociology. Semantic Web and Linked Data provide a variety of standards to represent information and knowl- edge. Numerous applications make use of these standards, including possibilities to publish data and to perform Named Entity Linking, a specific branch of NLP. This thesis attempts to discuss the question to which extend the standards and tools provided by the Semantic Web and Linked Data community may support computer-assisted text analysis in sociology. First, these said tools and standards will be briefly introduced and then applied to the use case of constitutional texts of the Netherlands from 1884 to 2016.
    [Show full text]
  • Span Model for Open Information Extraction on Accurate Corpus
    Span Model for Open Information Extraction on Accurate Corpus Junlang Zhan1,2,3, Hai Zhao1,2,3,∗ 1Department of Computer Science and Engineering, Shanghai Jiao Tong University 2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China 3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University [email protected], [email protected] Abstract Sentence Repeat customers can purchase luxury items at reduced prices. Open Information Extraction (Open IE) is a challenging task Extraction (A0: repeat customers; can purchase; especially due to its brittle data basis. Most of Open IE sys- A1: luxury items; A2: at reduced prices) tems have to be trained on automatically built corpus and evaluated on inaccurate test set. In this work, we first alle- viate this difficulty from both sides of training and test sets. Table 1: An example of Open IE in a sentence. The extrac- For the former, we propose an improved model design to tion consists of a predicate (in bold) and a list of arguments, more sufficiently exploit training dataset. For the latter, we separated by semicolons. present our accurately re-annotated benchmark test set (Re- OIE2016) according to a series of linguistic observation and analysis. Then, we introduce a span model instead of previ- ous adopted sequence labeling formulization for n-ary Open long short-term memory (BiLSTM) labeler and BIO tagging IE. Our newly introduced model achieves new state-of-the-art scheme which built the first supervised learning model for performance on both benchmark evaluation datasets.
    [Show full text]