Knowledge Graphs and Big Data Processing


Knowledge Graphs and Big Data Processing
Valentina Janev, Damien Graux, Hajira Jabeen, Emanuel Sallinger (Eds.)
State-of-the-Art Survey. Lecture Notes in Computer Science 12072.

Founding Editors: Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany; Juris Hartmanis, Cornell University, Ithaca, NY, USA

Editorial Board Members: Elisa Bertino, Purdue University, West Lafayette, IN, USA; Wen Gao, Peking University, Beijing, China; Bernhard Steffen, TU Dortmund University, Dortmund, Germany; Gerhard Woeginger, RWTH Aachen, Aachen, Germany; Moti Yung, Columbia University, New York, NY, USA

More information about this series at http://www.springer.com/series/7409

Editors:
• Valentina Janev, Institute Mihajlo Pupin, University of Belgrade, Belgrade, Serbia
• Damien Graux, ADAPT SFI Centre, O'Reilly Institute, Trinity College Dublin, Dublin, Ireland
• Hajira Jabeen, CEPLAS, Botanical Institute, University of Cologne, Cologne, Germany
• Emanuel Sallinger, Institute of Logic and Computation, Faculty of Informatics, TU Wien, Wien, Austria, and University of Oxford, Oxford, UK

ISSN 0302-9743 | ISSN 1611-3349 (electronic)
ISBN 978-3-030-53198-0 | ISBN 978-3-030-53199-7 (eBook)
https://doi.org/10.1007/978-3-030-53199-7
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© The Editor(s) (if applicable) and The Author(s) 2020. This book is an open access publication.

Open Access: This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Data Analytics involves applying algorithmic processes to derive insights.
Nowadays it is used in many industries to allow organizations and companies to make better decisions, as well as to verify or disprove existing theories or models. The term data analytics is often used interchangeably with intelligence, statistics, reasoning, data mining, knowledge discovery, and others. In the era of big data, Big Data Analytics refers to the strategy of analyzing large volumes of data gathered from a wide variety of sources, including social networks, transaction records, videos, digital images, and different kinds of sensors.

The goal of this book is to introduce some of the definitions, methods, tools, frameworks, and solutions for big data processing, starting from the process of information extraction and knowledge representation, via knowledge processing and analytics, to visualization, sense-making, and practical applications. However, this book is not intended to cover the whole set of big data analytics methods or to provide a complete collection of references. Each chapter addresses a pertinent aspect of the data processing chain, with a specific focus on understanding Enterprise Knowledge Graphs, Semantic Big Data Architectures, and Smart Data Analytics solutions.

Chapter 1 characterizes the relevant aspects of the Big Data Ecosystem: the big data characteristics, the components needed for implementing end-to-end big data processing, and the need to use semantics to improve data management, integration, processing, and analytical tasks.

Chapter 2 gives an overview of different definitions of the term Knowledge Graph (KG). In this chapter, we take the position that precisely in the multitude of definitions lies one of the strengths of the area. We choose a particular perspective, which we call the layered perspective, and three views on Knowledge Graphs to guide the reader in a structured way.

Chapter 3 introduces the key technologies and business drivers for building big data applications and presents in detail several open-source tools and Big Data frameworks for handling Big Data.

The subsequent chapters discuss the knowledge processing chain, from Knowledge Graph creation (Chapter 4), via federated query processing (Chapter 5), to reasoning in Knowledge Graphs (Chapter 6). Chapter 7 presents the SANSA framework, which combines distributed analytics and semantic technologies into a scalable semantic analytics stack. Chapter 8 elaborates further on semantic data integration problems and presents COMET (COntextualized MoleculE-based matching Technique and framework) for matching contextually equivalent RDF entities from different sources into a set of 1-1 perfect matches between entities.

As the goal of the LAMBDA Project is to study the potentials, prospects, and challenges of Big Data Analytics in real-world applications, in addition to Chapter 1 (traffic management example), Chapter 9 discusses the role of big data in different industries. Finally, in Chapter 10, one sector is selected – the energy domain – and insight is given into potential applications of big data-oriented tools and analytical technologies for the control and monitoring of electricity production, distribution, and consumption.
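To make the data model at the heart of these chapters concrete before turning to the intended audience, here is a minimal sketch (ours, not from the book) of a knowledge graph as a set of subject-predicate-object triples, built and queried with Python's rdflib; the namespace and data are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Toy data, not from the book: a three-triple knowledge graph.
EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Belgrade, RDF.type, EX.City))
g.add((EX.Belgrade, EX.locatedIn, EX.Serbia))
g.add((EX.Belgrade, EX.population, Literal(1374000)))

# The same SPARQL idiom later chapters apply to much larger graphs.
for row in g.query("SELECT ?s WHERE { ?s a <http://example.org/City> }"):
    print(row.s)  # -> http://example.org/Belgrade
```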
This book is addressed to graduate students from technical disciplines, to professional audiences following continuing-education short courses, and to researchers from diverse areas following self-study courses. Basic skills in computer science, mathematics, and statistics are required.

June 2020

Valentina Janev
Damien Graux
Hajira Jabeen
Emanuel Sallinger

Acknowledgments

This book was prepared as part of the LAMBDA Project (Learning, Applying, Multiplying Big Data Analytics), funded by the European Union under grant agreement number 809965. The project aims at advancing the state of the art in Big Data Analytics and fostering excellence in the Big Data Ecosystem through a combination of training, research, and innovation activities. As the number of Big Data-related methods, tools, frameworks, and solutions grows, there is a need to systematize knowledge about the domain. Hence, within the LAMBDA project framework, an effort has been made to develop a new set of lectures and training materials based on state-of-the-art analysis and on education materials and courses offered by project partners. The lectures were presented at the LAMBDA Big Data Analytics Summer School (the first edition was held in Belgrade during June 17–19, 2019; the second edition was held online during June 16–17, 2020).

We are grateful to the esteemed keynote speakers: Prof. Dr. Sören Auer, Director of the German National Library for Science and Technology and Professor of Data Science and Digital Libraries at Leibniz Universität Hannover; Mr. Atanas Kiryakov, Chief Executive Officer of OntoText; Prof. Dr. Maria-Esther Vidal, Head of the Scientific Data Management Research Group, German National Library for Science and Technology; Prof. Dr. Georgios Paliouras, Head of the Division of Intelligent Information Systems of IIT of the National Centre of Scientific Research "Demokritos," Greece; Dr. Mariana Damova, Chief Executive Officer of Mozaika; and Dr. Gloria Bordogna, Senior Researcher at the Italian National Research Council IREA.

The authors acknowledge the infrastructure and support of the Ministry of Science and Technological Development of the Republic of Serbia. D. Graux acknowledges the support of the ADAPT SFI Centre for Digital Media Technology, funded by Science Foundation Ireland through the SFI Research Centres Programme and co-funded under the European Regional Development Fund (ERDF) through grant # 13/RC/2106. E. Sallinger acknowledges the support of the Vienna Science and Technology Fund (WWTF) grant VRG18-013 and the EPSRC programme grant EP/M025268/1.

Acronyms and Definitions

ABD    After Big Data
AI     Artificial Intelligence
BBD    Before Big Data
BDA    Big Data Analytics
CC     Cloud Computing
COMET  COntextualized MoleculE-based matching Technique
DBMS   Database Management System
DL     Deep Learning
DM     Data Mining
EB     Exabyte
HDFS   Hadoop Distributed File System
IEEE   Institute of Electrical and Electronics Engineers
Recommended publications
  • Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril
    Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. Michael Matheny, Sonoo Thadaney Israni, Mahnoor Ahmed, and Danielle Whicher, Editors. National Academy of Medicine, 500 Fifth Street, NW, Washington, DC 20001 (NAM.EDU). Prepublication copy, uncorrected proofs.

    NOTICE: This publication has undergone peer review according to procedures established by the National Academy of Medicine (NAM). Publication by the NAM signifies that it is the product of a carefully considered process and is a contribution worthy of public attention, but does not constitute endorsement of conclusions and recommendations by the NAM. The views presented in this publication are those of individual contributors and do not represent formal consensus positions of the authors' organizations; the NAM; or the National Academies of Sciences, Engineering, and Medicine.

    Library of Congress Cataloging-in-Publication Data to come. Copyright 2019 by the National Academy of Sciences. All rights reserved. Printed in the United States of America. Suggested citation: Matheny, M., S. Thadaney Israni, M. Ahmed, and D. Whicher, Editors. 2019. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. NAM Special Publication. Washington, DC: National Academy of Medicine.

    "Knowing is not enough; we must apply. Willing is not enough; we must do." – Goethe

    ABOUT THE NATIONAL ACADEMY OF MEDICINE: The National Academy of Medicine is one of three Academies constituting the National Academies of Sciences, Engineering, and Medicine (the National Academies). The National Academies provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions.
  • Search with Meanings: an Overview of Semantic Search Systems
    Search with Meanings: An Overview of Semantic Search Systems. Wang Wei, Payam M. Barnaghi, Andrzej Bargiela. School of Computer Science, University of Nottingham Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor, Malaysia. Email: feyx6ww; payam.barnaghi; [email protected]

    Abstract: Research on semantic search aims to improve conventional information search and retrieval methods, and facilitate information acquisition, processing, storage and retrieval on the semantic web. The past ten years have seen a number of implemented semantic search systems and various proposed frameworks. A comprehensive survey is needed to gain an overall view of current research trends in this field. We have investigated a number of pilot projects and corresponding practical systems, focusing on their objectives, methodologies and most distinctive characteristics. In this paper, we report our study and findings, based on which a generalised semantic search framework is formalised. Further, we describe issues with regard to future research in this area.

    The paper provides a survey to gain an overall view of the current research status. We classify our studied systems into several categories according to their most distinctive features, as discussed in the next section. The categorisation by no means prevents a system from being classified into other categories. Further, we limit the scope of the survey to Web and Intranet searching and browsing systems (also including some question answering and multimedia presentation generation systems). There are also a few other survey studies of semantic search research: Mäkelä provides a short survey concerning search methodologies [34]; Hildebrand et al. discuss the related research from three perspectives: query construction…
  • Automatic Knowledge Retrieval from the Web
    Automatic Knowledge Retrieval from the Web. Marcin Skowron and Kenji Araki. Graduate School of Information Science and Technology, Hokkaido University, Kita-ku Kita 14-jo Nishi 8-chome, 060–0814 Sapporo, Japan.

    Abstract. This paper presents a method of automatic knowledge retrieval from the web. The aim of the system that implements it is to automatically create entries for a knowledge database, similar to the ones provided by volunteer contributors. As only a small fraction of the statements accessible on the web can be treated as valid knowledge concepts, we considered a method for their filtering and verification, based on similarity measurements against the concepts found in a manually created knowledge database. The results demonstrate that the system can retrieve valid knowledge concepts both for topics that are described in the manually created database and for topics that are not covered there.

    1 Introduction. Despite years of research in the field of Artificial Intelligence, the creation of a machine with the ability to think is still far from realization. Although computer systems are capable of performing several complicated tasks that require human beings to extensively use their thinking capabilities, machines still cannot engage in really meaningful conversation or understand what people talk about. One of the main unresolved problems is the lack of machine-usable knowledge. Without it, machines cannot reason about the everyday world in a similar way to human beings. In the last decade we have witnessed a few attempts to create knowledge databases using various approaches: manual, machine learning, and mass collaboration of volunteer contributors.
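    The filtering-and-verification idea lends itself to a compact sketch. Below is a toy version (our illustration; the paper's actual similarity measure and data are not reproduced here) that accepts a web-mined statement only if its token overlap with some entry of a manually created knowledge base clears a threshold:

```python
# Jaccard-similarity filter in the spirit of the paper's verification
# step. The knowledge-base entries and threshold are invented.
KNOWN = ["a dog is a domestic animal", "water boils at 100 degrees"]

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def is_valid(candidate: str, threshold: float = 0.3) -> bool:
    # Keep the candidate only if it resembles something we already trust.
    return any(jaccard(candidate, known) >= threshold for known in KNOWN)

print(is_valid("a cat is a domestic animal"))  # True: overlaps the 'dog' entry
print(is_valid("random web spam text"))        # False: no overlap
```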
  • Automated Development of Semantic Data Models Using Scientific Publications
    University of New Mexico, UNM Digital Repository. Computer Science ETDs, Engineering ETDs. Spring 5-12-2018.

    Automated Development of Semantic Data Models Using Scientific Publications. Martha O. Perez-Arriaga, University of New Mexico. Follow this and additional works at https://digitalrepository.unm.edu/cs_etds (part of the Computer Engineering Commons).

    Recommended Citation: Perez-Arriaga, Martha O. "Automated Development of Semantic Data Models Using Scientific Publications." (2018). https://digitalrepository.unm.edu/cs_etds/89

    This Dissertation is brought to you for free and open access by the Engineering ETDs at UNM Digital Repository. It has been accepted for inclusion in Computer Science ETDs by an authorized administrator of UNM Digital Repository. For more information, please contact [email protected].

    Martha Ofelia Perez Arriaga, Candidate, Computer Science Department. This dissertation is approved, and it is acceptable in quality and form for publication. Approved by the Dissertation Committee: Dr. Trilce Estrada-Piedra, Chairperson; Dr. Soraya Abad-Mota, Co-chairperson; Dr. Abdullah Mueen; Dr. Sarah Stith.

    AUTOMATED DEVELOPMENT OF SEMANTIC DATA MODELS USING SCIENTIFIC PUBLICATIONS by Martha O. Perez-Arriaga, M.S., Computer Science, University of New Mexico, 2008. Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Computer Science, The University of New Mexico, Albuquerque, New Mexico, May 2018.

    Dedication: "The highest education is that which does not merely give us information but makes our life in harmony with all existence" – Rabindranath Tagore. I dedicate this work to the memory of my primary role models: my mother and grandmother, who always gave me a caring environment and stimulated my curiosity.
  • Table of Contents
    Table of Contents. International Journal on Semantic Web and Information Systems, Volume 14, Issue 2, April-June 2018. ISSN: 1552-6283, eISSN: 1552-6291. An official publication of the Information Resources Management Association.

    Research Articles:

    1 – Rapid Relevance Feedback Strategy Based on Distributed CBIR System. Jianxin Liao, Baoran Li, Jingyu Wang, Qi Qi, Jing Wang (Beijing University of Posts and Telecommunications, Beijing, China); Tonghong Li (Technical University of Madrid, Madrid, Spain)

    27 – Using the Linked Data Approach in European e-Government Systems: Example from Serbia. Valentina Janev, Vuk Mijović, Sanja Vraneš (The Mihajlo Pupin Institute, University of Belgrade, Belgrade, Serbia)

    47 – N-Dimensional Matrix-Based Ontology: A Novel Model to Represent Ontologies. Ahmad A. Kardan, Hamed Jafarpour (Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Tehran, Iran)

    70 – On the Graph Structure of the Web of Data. Alberto Nogales Moyano (Alcalá University, Alcalá de Henares, Spain)
  • Probabilistic Databases
    Probabilistic Databases. Dan Suciu (University of Washington), Dan Olteanu (University of Oxford), Christopher Ré (University of Wisconsin-Madison), and Christoph Koch (EPFL). Synthesis Lectures on Data Management, Series ISSN: 2153-5418, Morgan & Claypool Publishers. Series Editor: M. Tamer Özsu, University of Waterloo.

    Probabilistic databases are databases where the value of some attributes or the presence of some records are uncertain and known only with some probability. Applications in many areas such as information extraction, RFID and scientific data management, data cleaning, data integration, and financial risk assessment produce large volumes of uncertain data, which are best modeled and processed by a probabilistic database. This book presents the state of the art in representation formalisms and query processing techniques for probabilistic data. It starts by discussing the basic principles for representing large probabilistic databases, by decomposing them into tuple-independent tables, block-independent-disjoint tables, or U-databases. Then it discusses two classes of techniques for query evaluation on probabilistic databases. In extensional query evaluation, the entire probabilistic inference can be pushed into the database engine and, therefore, processed as effectively as the evaluation of standard SQL queries; the relational queries that can be evaluated this way are called safe queries. In intensional query evaluation, the probabilistic inference is performed over a propositional formula called the lineage expression: every relational query can be evaluated this way, but the data complexity dramatically depends on the query being evaluated, and can be #P-hard. The book also discusses some advanced topics in probabilistic data management such as top-k query processing, sequential probabilistic databases, indexing and materialized views, and Monte Carlo databases.
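    The extensional evaluation of a safe query can be made concrete with a small example. For a tuple-independent table, the probability that some satisfying tuple exists factorizes over the independent rows; the sketch below (toy data, our illustration, not from the book) evaluates such a safe existential query:

```python
from functools import reduce

# A tuple-independent probabilistic table: each row exists
# independently with its own marginal probability.
# (name, city, probability) -- invented toy data.
people = [
    ("ann",  "dublin", 0.9),
    ("bob",  "dublin", 0.5),
    ("carl", "wien",   0.7),
]

def prob_exists(rows, pred):
    """P(at least one satisfying tuple exists) = 1 - prod(1 - p_i).
    Exact here because rows are independent -- the property that
    makes this extensional plan correct for a safe query."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r[2]),
                        (r for r in rows if pred(r)), 1.0)

# P(EXISTS x: lives(x, 'dublin')) = 1 - (1-0.9)(1-0.5) = 0.95
print(prob_exists(people, lambda r: r[1] == "dublin"))
```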
  • Adaptive Schema Databases
    Adaptive Schema Databases. William Spoth (b), Bahareh Sadat Arab (i), Eric S. Chan (o), Dieter Gawlick (o), Adel Ghoneimy (o), Boris Glavic (i), Beda Hammerschmidt (o), Oliver Kennedy (b), Seokki Lee (i), Zhen Hua Liu (o), Xing Niu (i), Ying Yang (b). Affiliations: b – University at Buffalo; i – Illinois Inst. Tech.; o – Oracle. Contact: {wmspoth|okennedy|yyang25}@buffalo.edu, {barab|slee195|xniu7}@hawk.iit.edu, [email protected], {eric.s.chan|dieter.gawlick|adel.ghoneimy|beda.hammerschmidt|zhen.liu}@oracle.com

    ABSTRACT. The rigid schemas of classical relational databases help users in specifying queries and inform the storage organization of data. However, the advantages of schemas come at a high upfront cost through schema and ETL process design. In this work, we propose a new paradigm where the database system takes a more active role in schema development and data integration. We refer to this approach as adaptive schema databases (ASDs). An ASD ingests semi-structured or unstructured data directly using a pluggable combination of extraction and data integration techniques. Over time it discovers and adapts schemas for the ingested data using information provided by data integration and information…

    …if data arrives in unstructured or semi-structured form, then an ETL (i.e., Extract, Transform, and Load) process needs to be designed to translate the input data into relational form. Thus, classical relational systems require a lot of upfront investment. This makes them unattractive when upfront costs cannot be amortized, such as in workloads with rapidly evolving data or where individual elements of a schema are queried infrequently. Furthermore, in settings like data exploration, schema design simply takes too long to be practical. Schema-on-query is an alternative approach popularized by NoSQL and Big Data systems that avoids the upfront investment in schema design by performing data extraction and integration at query-time.
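    A minimal sketch of the schema-on-query idea the abstract contrasts with upfront ETL design (records and field names invented for illustration): ingest raw semi-structured rows first, then derive a candidate schema from whatever attributes and types were actually observed:

```python
import json
from collections import defaultdict

# Semi-structured input with drifting fields -- hypothetical records.
raw = [
    '{"id": 1, "name": "sensor-a", "temp": 21.5}',
    '{"id": 2, "name": "sensor-b", "humidity": 40}',
    '{"id": 3, "temp": 19.0, "unit": "C"}',
]

# Schema-on-query: instead of fixing a schema upfront, observe which
# attributes (and value types) occur and derive a schema lazily.
records = [json.loads(line) for line in raw]
observed = defaultdict(set)
for rec in records:
    for key, value in rec.items():
        observed[key].add(type(value).__name__)

for attr, types in sorted(observed.items()):
    optional = sum(attr in r for r in records) < len(records)
    print(f"{attr}: {'/'.join(sorted(types))}{' (nullable)' if optional else ''}")
```

    An ASD goes further by refining such inferred schemas over time; this sketch only shows the query-time discovery step.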
  • Information Extraction Using Natural Language Processing
    INFORMATION EXTRACTION USING NATURAL LANGUAGE PROCESSING. Cvetana Krstev, University of Belgrade, Faculty of Philology. Information Retrieval and/vs. Natural Language Processing: so close yet so far.

    Outline of the talk:
    ◦ Views on Information Retrieval (IR) and Natural Language Processing (NLP)
    ◦ IR and NLP in Serbia
    ◦ Language Resources (LR) in the core of NLP, at the University of Belgrade (4 representative resources)
    ◦ LR and NLP for Information Retrieval and Information Extraction (IE), at the University of Belgrade (4 representative applications)

    Wikipedia:
    ◦ Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing.
    ◦ Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

    Experts:
    ◦ Information Retrieval: "As an academic field of study, Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." – C. D. Manning, P. Raghavan, H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
    ◦ Natural Language Processing: the term 'Natural Language Processing' (NLP) is normally used to describe the function of software or hardware components in a computer system which analyze or synthesize spoken or written language.
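    As a concrete taste of information extraction on top of plain text, here is a minimal named-entity-recognition sketch using spaCy; it assumes the spaCy package and its small English model (en_core_web_sm) are installed, and the example sentence is invented:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Cvetana Krstev teaches at the University of Belgrade in Serbia.")

# NER is a typical IE step that NLP adds on top of keyword-based IR:
# it extracts structured (entity, type) pairs from unstructured text.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. 'University of Belgrade' ORG
```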
  • Open Mind Common Sense: Knowledge Acquisition from the General Public
    Open Mind Common Sense: Knowledge Acquisition from the General Public. Push Singh, Grace Lim, Thomas Lin, Erik T. Mueller, Travell Perkins, Mark Tompkins, Wan Li Zhu. MIT Media Laboratory, 20 Ames Street, Cambridge, MA 02139 USA. {push, glim, tlin, markt, wlz}@mit.edu, [email protected], [email protected]

    Abstract. Open Mind Common Sense is a knowledge acquisition system designed to acquire commonsense knowledge from the general public over the web. We describe and evaluate our first fielded system, which enabled the construction of a 400,000-assertion commonsense knowledge base. We then discuss how our second-generation system addresses weaknesses discovered in the first. The new system acquires facts, descriptions, and stories by allowing participants to construct and fill in natural language templates. It employs word-sense disambiguation and methods of clarifying entered knowledge, uses analogical inference to provide feedback, and allows participants to validate knowledge and in turn each other.

    1 Introduction. We would like to build software agents that can engage in commonsense reasoning about ordinary human affairs. …underpinnings for commonsense reasoning (Shanahan 1997), there has been far less work on finding ways to accumulate the knowledge to do so in practice. The most well-known attempt has been the Cyc project (Lenat 1995), which contains 1.5 million assertions built over 15 years at the cost of several tens of millions of dollars. Knowledge bases this large require a tremendous effort to engineer. With the exception of Cyc, this problem of scale has made efforts to study and build commonsense knowledge bases nearly non-existent within the artificial intelligence community.

    Turning to the general public. In this paper we explore a possible solution to this problem of scale, based on one critical observation: every ordinary person has common sense of the kind we want to…
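    The template-filling mechanism can be sketched compactly. The templates and slot names below are invented for illustration (the paper's actual templates are not reproduced here); each filled template yields both a contributor-facing sentence and a machine-usable assertion:

```python
# Toy sketch of template-based knowledge acquisition.
TEMPLATES = {
    "is_used_for": "A {thing} is used for {purpose}.",
    "is_a":        "A {thing} is a kind of {category}.",
}

def acquire(template_id: str, **slots):
    """Render the natural language sentence a contributor would see,
    and keep the raw assertion as a (subject, relation, object) triple."""
    sentence = TEMPLATES[template_id].format(**slots)
    triple = (slots["thing"], template_id,
              slots.get("purpose") or slots.get("category"))
    return sentence, triple

print(acquire("is_used_for", thing="kettle", purpose="boiling water"))
# ('A kettle is used for boiling water.', ('kettle', 'is_used_for', 'boiling water'))
```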
  • Semantic Search
    Semantic Search. Philippe Cudre-Mauroux.

    Definition. Semantic Search regroups a set of techniques designed to improve traditional document or knowledge base search. Semantic Search aims at better grasping the context and the semantics of the user query and/or of the indexed content by leveraging Natural Language Processing, Semantic Web and Machine Learning techniques to retrieve more relevant results from a search engine.

    Overview. Semantic Search is an umbrella term regrouping various techniques for retrieving more relevant content from a search engine. Traditional search techniques focus on ranking documents based on a set of keywords appearing both in the user's query and in the indexed content. Semantic Search, instead, attempts to better grasp the semantics (i.e., meaning) and the context of the user query and/or of the indexed content in order to retrieve more meaningful results. Semantic Search techniques can be broadly categorized into two main groups depending on the target content:
    • techniques improving the relevance of classical search engines, where the query consists of natural language text (e.g., a list of keywords) and results are a ranked list of documents (e.g., webpages);
    • techniques retrieving semi-structured data (e.g., entities or RDF triples) from a knowledge base (e.g., a knowledge graph or an ontology), given a user query formulated either as natural language text or using a declarative query language like SPARQL.

    Those two groups are described in more detail in the following section. For each group, a wide variety of techniques have been proposed, ranging from Natural Language Processing techniques that attach grammatical tags (such as noun, conjunction, or verb) to individual words…
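    The second group is easy to demonstrate end to end, as shown below. The sketch sends a declarative SPARQL query to DBpedia's public endpoint via the SPARQLWrapper package; it assumes that package is installed and the endpoint is reachable, and relies on the dbo:/dbr: prefixes that endpoint predefines:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Retrieve entities from a knowledge base instead of ranking documents.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?city ?population WHERE {
      ?city a dbo:City ;
            dbo:country dbr:Ireland ;
            dbo:populationTotal ?population .
    } ORDER BY DESC(?population) LIMIT 5
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```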
  • SEKI@Home, Or Crowdsourcing an Open Knowledge Graph
    SEKI@home, or Crowdsourcing an Open Knowledge Graph. Thomas Steiner (Universitat Politècnica de Catalunya – Department LSI, Barcelona, Spain; [email protected]) and Stefan Mirea (Computer Science, Jacobs University Bremen, Germany; [email protected]).

    Abstract. In May 2012, the Web search engine Google introduced the so-called Knowledge Graph, a graph that understands real-world entities and their relationships to one another. It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. Soon after its announcement, people started to ask for a programmatic method to access the data in the Knowledge Graph; however, as of today, Google does not provide one. With SEKI@home, which stands for Search for Embedded Knowledge Items, we propose a browser-extension-based approach to crowdsource the task of populating a data store to build an Open Knowledge Graph. As people with the extension installed search on Google.com, the extension sends extracted anonymous Knowledge Graph facts from Search Engine Results Pages (SERPs) to a centralized, publicly accessible triple store, and thus over time creates a SPARQL-queryable Open Knowledge Graph. We have implemented and made available a prototype browser extension tailored to the Google Knowledge Graph; note, however, that the concept of SEKI@home is generalizable to other knowledge bases.

    1 Introduction. 1.1 The Google Knowledge Graph. With the introduction of the Knowledge Graph, the search engine Google has made a significant paradigm shift towards "things, not strings" [7], as a post on the official Google blog states. Entities covered by the Knowledge Graph include landmarks, celebrities, cities, sports teams, buildings, movies, celestial objects, works of art, and more.
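    The crowdsourcing pipeline boils down to POSTing extracted facts to a shared triple store. The sketch below shows the idea with a SPARQL 1.1 Update request; the endpoint URL and URIs are hypothetical, not the project's actual service:

```python
import requests

# Hypothetical update endpoint standing in for the centralized store.
ENDPOINT = "https://example.org/openkg/update"

def submit_fact(subject: str, predicate: str, obj: str) -> None:
    """Send one extracted Knowledge Graph fact as a SPARQL 1.1 Update
    (form-encoded 'update' parameter, per the SPARQL protocol)."""
    update = f"INSERT DATA {{ <{subject}> <{predicate}> <{obj}> . }}"
    response = requests.post(ENDPOINT, data={"update": update})
    response.raise_for_status()

submit_fact("http://example.org/entity/Dublin",
            "http://example.org/prop/locatedIn",
            "http://example.org/entity/Ireland")
```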
  • Finding Interesting Itemsets Using a Probabilistic Model for Binary Databases
    Finding interesting itemsets using a probabilistic model for binary databases. Tijl De Bie, University of Bristol, Department of Engineering Mathematics, Queen's Building, University Walk, Bristol, BS8 1TR, UK. [email protected]

    ABSTRACT. A good formalization of interestingness of a pattern should satisfy two criteria: it should conform well to intuition, and it should be computationally tractable to use. The focus has long been on the latter, with the development of frequent pattern mining methods. However, it is now recognized that more appropriate measures than frequency are required. In this paper we report results in this direction for itemset mining in binary databases. In particular, we introduce a probabilistic model that can be fitted efficiently to any binary database, and that has a compact and explicit representation. We then show how this model enables the formalization of an intuitive and tractable interestingness measure.

    …elegance and efficiency of the algorithms to search for frequent patterns. Unfortunately, the frequency of a pattern is only loosely related to interestingness. The output of frequent pattern mining methods is usually an immense bag of patterns that are not necessarily interesting, and often highly redundant with each other. This has hampered the uptake of FPM and FIM in data mining practice. Recent research has shifted focus to the search for more useful formalizations of interestingness that match practical needs more closely, while still being amenable to efficient algorithms. As FIM is arguably the simplest special case of frequent pattern mining, it is not surprising that most of the recent work has focussed on itemset patterns, see e.g. …
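    The gap between frequency and interestingness is easy to see with the simplest possible probabilistic model: an independence model fitted from single-item marginals, used here as a stand-in for the richer model the paper introduces (data invented). An itemset is interesting when its observed support exceeds what the model expects:

```python
from itertools import combinations

# Toy binary database: each row is the set of items in one transaction.
transactions = [
    {"bread", "butter"}, {"bread", "butter", "milk"},
    {"milk"}, {"bread", "butter"}, {"bread"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / n

def interestingness(itemset):
    """Observed support divided by the support expected under an
    independence model fitted from single-item marginals ("lift")."""
    expected = 1.0
    for item in itemset:
        expected *= support({item})
    return support(itemset) / expected  # > 1 means more than expected

for pair in combinations(["bread", "butter", "milk"], 2):
    print(pair, round(interestingness(set(pair)), 2))
# ('bread', 'butter') scores 1.25: frequent AND surprising under the model.
```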