Knowledge Graphs and Big Data Processing


Knowledge Graphs and Big Data Processing
Valentina Janev, Damien Graux, Hajira Jabeen, Emanuel Sallinger (Eds.)
State-of-the-Art Survey. Lecture Notes in Computer Science 12072.

Founding Editors: Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany; Juris Hartmanis, Cornell University, Ithaca, NY, USA

Editorial Board Members: Elisa Bertino, Purdue University, West Lafayette, IN, USA; Wen Gao, Peking University, Beijing, China; Bernhard Steffen, TU Dortmund University, Dortmund, Germany; Gerhard Woeginger, RWTH Aachen, Aachen, Germany; Moti Yung, Columbia University, New York, NY, USA

More information about this series at http://www.springer.com/series/7409

Editors:
• Valentina Janev, Institute Mihajlo Pupin, University of Belgrade, Belgrade, Serbia
• Damien Graux, ADAPT SFI Centre, O'Reilly Institute, Trinity College Dublin, Dublin, Ireland
• Hajira Jabeen, CEPLAS, Botanical Institute, University of Cologne, Cologne, Germany
• Emanuel Sallinger, Institute of Logic and Computation, Faculty of Informatics, TU Wien, Wien, Austria, and University of Oxford, Oxford, UK

ISSN 0302-9743 | ISSN 1611-3349 (electronic)
ISBN 978-3-030-53198-0 | ISBN 978-3-030-53199-7 (eBook)
https://doi.org/10.1007/978-3-030-53199-7
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© The Editor(s) (if applicable) and The Author(s) 2020. This book is an open access publication.

Open Access: This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Data Analytics involves applying algorithmic processes to derive insights.
Nowadays it is used in many industries to allow organizations and companies to make better decisions, as well as to verify or disprove existing theories or models. The term data analytics is often used interchangeably with intelligence, statistics, reasoning, data mining, knowledge discovery, and others. In the era of big data, Big Data Analytics refers to the strategy of analyzing large volumes of data gathered from a wide variety of sources, including social networks, transaction records, videos, digital images, and different kinds of sensors.

The goal of this book is to introduce some of the definitions, methods, tools, frameworks, and solutions for big data processing, starting from the process of information extraction and knowledge representation, via knowledge processing and analytics, to visualization, sense-making, and practical applications. However, this book is not intended to cover the whole set of big data analytics methods or to provide a complete collection of references. Each chapter addresses a pertinent aspect of the data processing chain, with a specific focus on understanding Enterprise Knowledge Graphs, Semantic Big Data Architectures, and Smart Data Analytics solutions.

Chapter 1 characterizes the relevant aspects of the Big Data Ecosystem: the big data characteristics, the components needed for implementing end-to-end big data processing, and the need to use semantics to improve data management, integration, processing, and analytical tasks.

Chapter 2 gives an overview of different definitions of the term Knowledge Graph (KG). In this chapter, we take the position that precisely in the multitude of definitions lies one of the strengths of the area. We choose a particular perspective, which we call the layered perspective, and three views on Knowledge Graphs to guide the reader in a structured way.

Chapter 3 introduces the key technologies and business drivers for building big data applications and presents in detail several open-source tools and Big Data frameworks for handling Big Data.

The subsequent chapters discuss the knowledge processing chain, from Knowledge Graph creation (Chapter 4), via federated query processing (Chapter 5), to reasoning in Knowledge Graphs (Chapter 6). Chapter 7 presents the SANSA framework, which combines distributed analytics and semantic technologies into a scalable semantic analytics stack. Chapter 8 elaborates further on semantic data integration problems and presents COMET (COntextualized MoleculE-based matching Technique and framework) for matching contextually equivalent RDF entities from different sources into a set of 1-1 perfect matches between entities.

As the goal of the LAMBDA Project is to study the potentials, prospects, and challenges of Big Data Analytics in real-world applications, in addition to Chapter 1 (traffic management example), Chapter 9 discusses the role of big data in different industries. Finally, in Chapter 10, one sector is selected – the energy domain – and insight is given into potential applications of big data-oriented tools and analytical technologies for the control and monitoring of electricity production, distribution, and consumption.
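To make the data model at the heart of these chapters concrete before turning to the intended audience, here is a minimal sketch (ours, not from the book) of a knowledge graph as a set of subject-predicate-object triples, built and queried with Python's rdflib; the namespace and data are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Toy data, not from the book: a three-triple knowledge graph.
EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Belgrade, RDF.type, EX.City))
g.add((EX.Belgrade, EX.locatedIn, EX.Serbia))
g.add((EX.Belgrade, EX.population, Literal(1374000)))

# The same SPARQL idiom later chapters apply to much larger graphs.
for row in g.query("SELECT ?s WHERE { ?s a <http://example.org/City> }"):
    print(row.s)  # -> http://example.org/Belgrade
```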
This book is addressed to graduate students from technical disciplines, to professional audiences following continuing-education short courses, and to researchers from diverse areas following self-study courses. Basic skills in computer science, mathematics, and statistics are required.

June 2020

Valentina Janev
Damien Graux
Hajira Jabeen
Emanuel Sallinger

Acknowledgments

This book was prepared as part of the LAMBDA Project (Learning, Applying, Multiplying Big Data Analytics), funded by the European Union under grant agreement number 809965. The project aims at advancing the state of the art in Big Data Analytics and fostering excellence in the Big Data Ecosystem through a combination of training, research, and innovation activities. As the number of Big Data-related methods, tools, frameworks, and solutions grows, there is a need to systematize knowledge about the domain. Hence, within the LAMBDA project framework, an effort has been made to develop a new set of lectures and training materials based on state-of-the-art analysis and on education materials and courses offered by project partners. The lectures were presented at the LAMBDA Big Data Analytics Summer School (the first edition was held in Belgrade during June 17–19, 2019; the second edition was held online during June 16–17, 2020).

We are grateful to the esteemed keynote speakers: Prof. Dr. Sören Auer, Director of the German National Library for Science and Technology and Professor of Data Science and Digital Libraries at Leibniz Universität Hannover; Mr. Atanas Kiryakov, Chief Executive Officer of OntoText; Prof. Dr. Maria-Esther Vidal, Head of the Scientific Data Management Research Group, German National Library for Science and Technology; Prof. Dr. Georgios Paliouras, Head of the Division of Intelligent Information Systems of IIT of the National Centre of Scientific Research "Demokritos," Greece; Dr. Mariana Damova, Chief Executive Officer of Mozaika; and Dr. Gloria Bordogna, Senior Researcher at the Italian National Research Council IREA.

The authors acknowledge the infrastructure and support of the Ministry of Science and Technological Development of the Republic of Serbia. D. Graux acknowledges the support of the ADAPT SFI Centre for Digital Media Technology, funded by Science Foundation Ireland through the SFI Research Centres Programme and co-funded under the European Regional Development Fund (ERDF) through grant # 13/RC/2106. E. Sallinger acknowledges the support of the Vienna Science and Technology Fund (WWTF) grant VRG18-013 and the EPSRC programme grant EP/M025268/1.

Acronyms and Definitions

ABD    After Big Data
AI     Artificial Intelligence
BBD    Before Big Data
BDA    Big Data Analytics
CC     Cloud Computing
COMET  COntextualized MoleculE-based matching Technique
DBMS   Database Management System
DL     Deep Learning
DM     Data Mining
EB     Exabyte
HDFS   Hadoop Distributed File System
IEEE   Institute of Electrical and Electronics Engineers
Recommended publications
  • Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril
    Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. Michael Matheny, Sonoo Thadaney Israni, Mahnoor Ahmed, and Danielle Whicher, Editors. National Academy of Medicine, 500 Fifth Street, NW, Washington, DC 20001 (NAM.EDU). Prepublication copy, uncorrected proofs.

    NOTICE: This publication has undergone peer review according to procedures established by the National Academy of Medicine (NAM). Publication by the NAM signifies that it is the product of a carefully considered process and is a contribution worthy of public attention, but does not constitute endorsement of conclusions and recommendations by the NAM. The views presented in this publication are those of individual contributors and do not represent formal consensus positions of the authors' organizations; the NAM; or the National Academies of Sciences, Engineering, and Medicine.

    Library of Congress Cataloging-in-Publication Data to come. Copyright 2019 by the National Academy of Sciences. All rights reserved. Printed in the United States of America. Suggested citation: Matheny, M., S. Thadaney Israni, M. Ahmed, and D. Whicher, Editors. 2019. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. NAM Special Publication. Washington, DC: National Academy of Medicine.

    "Knowing is not enough; we must apply. Willing is not enough; we must do." – Goethe

    ABOUT THE NATIONAL ACADEMY OF MEDICINE: The National Academy of Medicine is one of three Academies constituting the National Academies of Sciences, Engineering, and Medicine (the National Academies). The National Academies provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions.
  • Search with Meanings: an Overview of Semantic Search Systems
    Search with Meanings: An Overview of Semantic Search Systems. Wang Wei, Payam M. Barnaghi, Andrzej Bargiela. School of Computer Science, University of Nottingham Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor, Malaysia. Email: feyx6ww; payam.barnaghi; [email protected]

    Abstract: Research on semantic search aims to improve conventional information search and retrieval methods, and facilitate information acquisition, processing, storage and retrieval on the semantic web. The past ten years have seen a number of implemented semantic search systems and various proposed frameworks. A comprehensive survey is needed to gain an overall view of current research trends in this field. We have investigated a number of pilot projects and corresponding practical systems, focusing on their objectives, methodologies and most distinctive characteristics. In this paper, we report our study and findings, based on which a generalised semantic search framework is formalised. Further, we describe issues with regard to future research in this area.

    The paper provides a survey to gain an overall view of the current research status. We classify our studied systems into several categories according to their most distinctive features, as discussed in the next section. The categorisation by no means prevents a system from being classified into other categories. Further, we limit the scope of the survey to Web and Intranet searching and browsing systems (also including some question answering and multimedia presentation generation systems). There are also a few other survey studies of semantic search research: Mäkelä provides a short survey concerning search methodologies [34]; Hildebrand et al. discuss the related research from three perspectives: query construction…
  • Automatic Knowledge Retrieval from the Web
    Automatic Knowledge Retrieval from the Web. Marcin Skowron and Kenji Araki. Graduate School of Information Science and Technology, Hokkaido University, Kita-ku Kita 14-jo Nishi 8-chome, 060–0814 Sapporo, Japan.

    Abstract. This paper presents a method of automatic knowledge retrieval from the web. The aim of the system that implements it is to automatically create entries for a knowledge database, similar to the ones provided by volunteer contributors. As only a small fraction of the statements accessible on the web can be treated as valid knowledge concepts, we considered a method for their filtering and verification, based on similarity measurements against the concepts found in a manually created knowledge database. The results demonstrate that the system can retrieve valid knowledge concepts both for topics that are described in the manually created database and for topics that are not covered there.

    1 Introduction. Despite years of research in the field of Artificial Intelligence, the creation of a machine with the ability to think is still far from realization. Although computer systems are capable of performing several complicated tasks that require human beings to extensively use their thinking capabilities, machines still cannot engage in really meaningful conversation or understand what people talk about. One of the main unresolved problems is the lack of machine-usable knowledge. Without it, machines cannot reason about the everyday world in a similar way to human beings. In the last decade we have witnessed a few attempts to create knowledge databases using various approaches: manual, machine learning, and mass collaboration of volunteer contributors.
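    The filtering-and-verification idea lends itself to a compact sketch. Below is a toy version (our illustration; the paper's actual similarity measure and data are not reproduced here) that accepts a web-mined statement only if its token overlap with some entry of a manually created knowledge base clears a threshold:

```python
# Jaccard-similarity filter in the spirit of the paper's verification
# step. The knowledge-base entries and threshold are invented.
KNOWN = ["a dog is a domestic animal", "water boils at 100 degrees"]

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def is_valid(candidate: str, threshold: float = 0.3) -> bool:
    # Keep the candidate only if it resembles something we already trust.
    return any(jaccard(candidate, known) >= threshold for known in KNOWN)

print(is_valid("a cat is a domestic animal"))  # True: overlaps the 'dog' entry
print(is_valid("random web spam text"))        # False: no overlap
```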
  • Automated Development of Semantic Data Models Using Scientific Publications
    University of New Mexico, UNM Digital Repository. Computer Science ETDs, Engineering ETDs. Spring 5-12-2018.

    Automated Development of Semantic Data Models Using Scientific Publications. Martha O. Perez-Arriaga, University of New Mexico. Follow this and additional works at https://digitalrepository.unm.edu/cs_etds (part of the Computer Engineering Commons).

    Recommended Citation: Perez-Arriaga, Martha O. "Automated Development of Semantic Data Models Using Scientific Publications." (2018). https://digitalrepository.unm.edu/cs_etds/89

    This Dissertation is brought to you for free and open access by the Engineering ETDs at UNM Digital Repository. It has been accepted for inclusion in Computer Science ETDs by an authorized administrator of UNM Digital Repository. For more information, please contact [email protected].

    Martha Ofelia Perez Arriaga, Candidate, Computer Science Department. This dissertation is approved, and it is acceptable in quality and form for publication. Approved by the Dissertation Committee: Dr. Trilce Estrada-Piedra, Chairperson; Dr. Soraya Abad-Mota, Co-chairperson; Dr. Abdullah Mueen; Dr. Sarah Stith.

    AUTOMATED DEVELOPMENT OF SEMANTIC DATA MODELS USING SCIENTIFIC PUBLICATIONS by Martha O. Perez-Arriaga, M.S., Computer Science, University of New Mexico, 2008. Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Computer Science, The University of New Mexico, Albuquerque, New Mexico, May 2018.

    Dedication: "The highest education is that which does not merely give us information but makes our life in harmony with all existence" – Rabindranath Tagore. I dedicate this work to the memory of my primary role models: my mother and grandmother, who always gave me a caring environment and stimulated my curiosity.
  • Table of Contents
    Table of Contents. International Journal on Semantic Web and Information Systems, Volume 14, Issue 2, April-June 2018. ISSN: 1552-6283, eISSN: 1552-6291. An official publication of the Information Resources Management Association.

    Research Articles:

    1 – Rapid Relevance Feedback Strategy Based on Distributed CBIR System. Jianxin Liao, Baoran Li, Jingyu Wang, Qi Qi, Jing Wang (Beijing University of Posts and Telecommunications, Beijing, China); Tonghong Li (Technical University of Madrid, Madrid, Spain)

    27 – Using the Linked Data Approach in European e-Government Systems: Example from Serbia. Valentina Janev, Vuk Mijović, Sanja Vraneš (The Mihajlo Pupin Institute, University of Belgrade, Belgrade, Serbia)

    47 – N-Dimensional Matrix-Based Ontology: A Novel Model to Represent Ontologies. Ahmad A. Kardan, Hamed Jafarpour (Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Tehran, Iran)

    70 – On the Graph Structure of the Web of Data. Alberto Nogales Moyano (Alcalá University, Alcalá de Henares, Spain)
  • Probabilistic Databases
    Probabilistic Databases. Dan Suciu (University of Washington), Dan Olteanu (University of Oxford), Christopher Ré (University of Wisconsin-Madison), and Christoph Koch (EPFL). Synthesis Lectures on Data Management, Series ISSN: 2153-5418, Morgan & Claypool Publishers. Series Editor: M. Tamer Özsu, University of Waterloo.

    Probabilistic databases are databases where the value of some attributes or the presence of some records are uncertain and known only with some probability. Applications in many areas such as information extraction, RFID and scientific data management, data cleaning, data integration, and financial risk assessment produce large volumes of uncertain data, which are best modeled and processed by a probabilistic database. This book presents the state of the art in representation formalisms and query processing techniques for probabilistic data. It starts by discussing the basic principles for representing large probabilistic databases, by decomposing them into tuple-independent tables, block-independent-disjoint tables, or U-databases. Then it discusses two classes of techniques for query evaluation on probabilistic databases. In extensional query evaluation, the entire probabilistic inference can be pushed into the database engine and, therefore, processed as effectively as the evaluation of standard SQL queries; the relational queries that can be evaluated this way are called safe queries. In intensional query evaluation, the probabilistic inference is performed over a propositional formula called the lineage expression: every relational query can be evaluated this way, but the data complexity dramatically depends on the query being evaluated, and can be #P-hard. The book also discusses some advanced topics in probabilistic data management such as top-k query processing, sequential probabilistic databases, indexing and materialized views, and Monte Carlo databases.
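    The extensional evaluation of a safe query can be made concrete with a small example. For a tuple-independent table, the probability that some satisfying tuple exists factorizes over the independent rows; the sketch below (toy data, our illustration, not from the book) evaluates such a safe existential query:

```python
from functools import reduce

# A tuple-independent probabilistic table: each row exists
# independently with its own marginal probability.
# (name, city, probability) -- invented toy data.
people = [
    ("ann",  "dublin", 0.9),
    ("bob",  "dublin", 0.5),
    ("carl", "wien",   0.7),
]

def prob_exists(rows, pred):
    """P(at least one satisfying tuple exists) = 1 - prod(1 - p_i).
    Exact here because rows are independent -- the property that
    makes this extensional plan correct for a safe query."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r[2]),
                        (r for r in rows if pred(r)), 1.0)

# P(EXISTS x: lives(x, 'dublin')) = 1 - (1-0.9)(1-0.5) = 0.95
print(prob_exists(people, lambda r: r[1] == "dublin"))
```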
  • Adaptive Schema Databases
    Adaptive Schema Databases. William Spoth (b), Bahareh Sadat Arab (i), Eric S. Chan (o), Dieter Gawlick (o), Adel Ghoneimy (o), Boris Glavic (i), Beda Hammerschmidt (o), Oliver Kennedy (b), Seokki Lee (i), Zhen Hua Liu (o), Xing Niu (i), Ying Yang (b). Affiliations: b – University at Buffalo; i – Illinois Inst. Tech.; o – Oracle. Contact: {wmspoth|okennedy|yyang25}@buffalo.edu, {barab|slee195|xniu7}@hawk.iit.edu, [email protected], {eric.s.chan|dieter.gawlick|adel.ghoneimy|beda.hammerschmidt|zhen.liu}@oracle.com

    ABSTRACT. The rigid schemas of classical relational databases help users in specifying queries and inform the storage organization of data. However, the advantages of schemas come at a high upfront cost through schema and ETL process design. In this work, we propose a new paradigm where the database system takes a more active role in schema development and data integration. We refer to this approach as adaptive schema databases (ASDs). An ASD ingests semi-structured or unstructured data directly using a pluggable combination of extraction and data integration techniques. Over time it discovers and adapts schemas for the ingested data using information provided by data integration and information…

    …if data arrives in unstructured or semi-structured form, then an ETL (i.e., Extract, Transform, and Load) process needs to be designed to translate the input data into relational form. Thus, classical relational systems require a lot of upfront investment. This makes them unattractive when upfront costs cannot be amortized, such as in workloads with rapidly evolving data or where individual elements of a schema are queried infrequently. Furthermore, in settings like data exploration, schema design simply takes too long to be practical. Schema-on-query is an alternative approach popularized by NoSQL and Big Data systems that avoids the upfront investment in schema design by performing data extraction and integration at query-time.
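    A minimal sketch of the schema-on-query idea the abstract contrasts with upfront ETL design (records and field names invented for illustration): ingest raw semi-structured rows first, then derive a candidate schema from whatever attributes and types were actually observed:

```python
import json
from collections import defaultdict

# Semi-structured input with drifting fields -- hypothetical records.
raw = [
    '{"id": 1, "name": "sensor-a", "temp": 21.5}',
    '{"id": 2, "name": "sensor-b", "humidity": 40}',
    '{"id": 3, "temp": 19.0, "unit": "C"}',
]

# Schema-on-query: instead of fixing a schema upfront, observe which
# attributes (and value types) occur and derive a schema lazily.
records = [json.loads(line) for line in raw]
observed = defaultdict(set)
for rec in records:
    for key, value in rec.items():
        observed[key].add(type(value).__name__)

for attr, types in sorted(observed.items()):
    optional = sum(attr in r for r in records) < len(records)
    print(f"{attr}: {'/'.join(sorted(types))}{' (nullable)' if optional else ''}")
```

    An ASD goes further by refining such inferred schemas over time; this sketch only shows the query-time discovery step.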
  • Information Extraction Using Natural Language Processing
    INFORMATION EXTRACTION USING NATURAL LANGUAGE PROCESSING. Cvetana Krstev, University of Belgrade, Faculty of Philology. Information Retrieval and/vs. Natural Language Processing: so close yet so far.

    Outline of the talk:
    ◦ Views on Information Retrieval (IR) and Natural Language Processing (NLP)
    ◦ IR and NLP in Serbia
    ◦ Language Resources (LR) in the core of NLP, at the University of Belgrade (4 representative resources)
    ◦ LR and NLP for Information Retrieval and Information Extraction (IE), at the University of Belgrade (4 representative applications)

    Wikipedia:
    ◦ Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing.
    ◦ Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

    Experts:
    ◦ Information Retrieval: "As an academic field of study, Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." – C. D. Manning, P. Raghavan, H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
    ◦ Natural Language Processing: the term 'Natural Language Processing' (NLP) is normally used to describe the function of software or hardware components in a computer system which analyze or synthesize spoken or written language.
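    As a concrete taste of information extraction on top of plain text, here is a minimal named-entity-recognition sketch using spaCy; it assumes the spaCy package and its small English model (en_core_web_sm) are installed, and the example sentence is invented:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Cvetana Krstev teaches at the University of Belgrade in Serbia.")

# NER is a typical IE step that NLP adds on top of keyword-based IR:
# it extracts structured (entity, type) pairs from unstructured text.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. 'University of Belgrade' ORG
```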
  • Open Mind Common Sense: Knowledge Acquisition from the General Public
    Open Mind Common Sense: Knowledge Acquisition from the General Public. Push Singh, Grace Lim, Thomas Lin, Erik T. Mueller, Travell Perkins, Mark Tompkins, Wan Li Zhu. MIT Media Laboratory, 20 Ames Street, Cambridge, MA 02139 USA. {push, glim, tlin, markt, wlz}@mit.edu, [email protected], [email protected]

    Abstract. Open Mind Common Sense is a knowledge acquisition system designed to acquire commonsense knowledge from the general public over the web. We describe and evaluate our first fielded system, which enabled the construction of a 400,000-assertion commonsense knowledge base. We then discuss how our second-generation system addresses weaknesses discovered in the first. The new system acquires facts, descriptions, and stories by allowing participants to construct and fill in natural language templates. It employs word-sense disambiguation and methods of clarifying entered knowledge, uses analogical inference to provide feedback, and allows participants to validate knowledge and in turn each other.

    1 Introduction. We would like to build software agents that can engage in commonsense reasoning about ordinary human affairs. …underpinnings for commonsense reasoning (Shanahan 1997), there has been far less work on finding ways to accumulate the knowledge to do so in practice. The most well-known attempt has been the Cyc project (Lenat 1995), which contains 1.5 million assertions built over 15 years at the cost of several tens of millions of dollars. Knowledge bases this large require a tremendous effort to engineer. With the exception of Cyc, this problem of scale has made efforts to study and build commonsense knowledge bases nearly non-existent within the artificial intelligence community.

    Turning to the general public. In this paper we explore a possible solution to this problem of scale, based on one critical observation: every ordinary person has common sense of the kind we want to…
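    The template-filling mechanism can be sketched compactly. The templates and slot names below are invented for illustration (the paper's actual templates are not reproduced here); each filled template yields both a contributor-facing sentence and a machine-usable assertion:

```python
# Toy sketch of template-based knowledge acquisition.
TEMPLATES = {
    "is_used_for": "A {thing} is used for {purpose}.",
    "is_a":        "A {thing} is a kind of {category}.",
}

def acquire(template_id: str, **slots):
    """Render the natural language sentence a contributor would see,
    and keep the raw assertion as a (subject, relation, object) triple."""
    sentence = TEMPLATES[template_id].format(**slots)
    triple = (slots["thing"], template_id,
              slots.get("purpose") or slots.get("category"))
    return sentence, triple

print(acquire("is_used_for", thing="kettle", purpose="boiling water"))
# ('A kettle is used for boiling water.', ('kettle', 'is_used_for', 'boiling water'))
```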
  • Semantic Search
    Semantic Search. Philippe Cudre-Mauroux.

    Definition. Semantic Search regroups a set of techniques designed to improve traditional document or knowledge base search. Semantic Search aims at better grasping the context and the semantics of the user query and/or of the indexed content by leveraging Natural Language Processing, Semantic Web and Machine Learning techniques to retrieve more relevant results from a search engine.

    Overview. Semantic Search is an umbrella term regrouping various techniques for retrieving more relevant content from a search engine. Traditional search techniques focus on ranking documents based on a set of keywords appearing both in the user's query and in the indexed content. Semantic Search, instead, attempts to better grasp the semantics (i.e., meaning) and the context of the user query and/or of the indexed content in order to retrieve more meaningful results. Semantic Search techniques can be broadly categorized into two main groups depending on the target content:
    • techniques improving the relevance of classical search engines, where the query consists of natural language text (e.g., a list of keywords) and results are a ranked list of documents (e.g., webpages);
    • techniques retrieving semi-structured data (e.g., entities or RDF triples) from a knowledge base (e.g., a knowledge graph or an ontology), given a user query formulated either as natural language text or using a declarative query language like SPARQL.

    Those two groups are described in more detail in the following section. For each group, a wide variety of techniques have been proposed, ranging from Natural Language Processing techniques that attach grammatical tags (such as noun, conjunction, or verb) to individual words…
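    The second group is easy to demonstrate end to end, as shown below. The sketch sends a declarative SPARQL query to DBpedia's public endpoint via the SPARQLWrapper package; it assumes that package is installed and the endpoint is reachable, and relies on the dbo:/dbr: prefixes that endpoint predefines:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Retrieve entities from a knowledge base instead of ranking documents.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?city ?population WHERE {
      ?city a dbo:City ;
            dbo:country dbr:Ireland ;
            dbo:populationTotal ?population .
    } ORDER BY DESC(?population) LIMIT 5
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```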
  • SEKI@Home, Or Crowdsourcing an Open Knowledge Graph
    SEKI@home, or Crowdsourcing an Open Knowledge Graph. Thomas Steiner (Universitat Politècnica de Catalunya – Department LSI, Barcelona, Spain; [email protected]) and Stefan Mirea (Computer Science, Jacobs University Bremen, Germany; [email protected]).

    Abstract. In May 2012, the Web search engine Google introduced the so-called Knowledge Graph, a graph that understands real-world entities and their relationships to one another. It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. Soon after its announcement, people started to ask for a programmatic method to access the data in the Knowledge Graph; however, as of today, Google does not provide one. With SEKI@home, which stands for Search for Embedded Knowledge Items, we propose a browser-extension-based approach to crowdsource the task of populating a data store to build an Open Knowledge Graph. As people with the extension installed search on Google.com, the extension sends extracted anonymous Knowledge Graph facts from Search Engine Results Pages (SERPs) to a centralized, publicly accessible triple store, and thus over time creates a SPARQL-queryable Open Knowledge Graph. We have implemented and made available a prototype browser extension tailored to the Google Knowledge Graph; note, however, that the concept of SEKI@home is generalizable to other knowledge bases.

    1 Introduction. 1.1 The Google Knowledge Graph. With the introduction of the Knowledge Graph, the search engine Google has made a significant paradigm shift towards "things, not strings" [7], as a post on the official Google blog states. Entities covered by the Knowledge Graph include landmarks, celebrities, cities, sports teams, buildings, movies, celestial objects, works of art, and more.
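    The crowdsourcing pipeline boils down to POSTing extracted facts to a shared triple store. The sketch below shows the idea with a SPARQL 1.1 Update request; the endpoint URL and URIs are hypothetical, not the project's actual service:

```python
import requests

# Hypothetical update endpoint standing in for the centralized store.
ENDPOINT = "https://example.org/openkg/update"

def submit_fact(subject: str, predicate: str, obj: str) -> None:
    """Send one extracted Knowledge Graph fact as a SPARQL 1.1 Update
    (form-encoded 'update' parameter, per the SPARQL protocol)."""
    update = f"INSERT DATA {{ <{subject}> <{predicate}> <{obj}> . }}"
    response = requests.post(ENDPOINT, data={"update": update})
    response.raise_for_status()

submit_fact("http://example.org/entity/Dublin",
            "http://example.org/prop/locatedIn",
            "http://example.org/entity/Ireland")
```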
  • Finding Interesting Itemsets Using a Probabilistic Model for Binary Databases
    Finding interesting itemsets using a probabilistic model for binary databases. Tijl De Bie, University of Bristol, Department of Engineering Mathematics, Queen's Building, University Walk, Bristol, BS8 1TR, UK. [email protected]

    ABSTRACT. A good formalization of interestingness of a pattern should satisfy two criteria: it should conform well to intuition, and it should be computationally tractable to use. The focus has long been on the latter, with the development of frequent pattern mining methods. However, it is now recognized that more appropriate measures than frequency are required. In this paper we report results in this direction for itemset mining in binary databases. In particular, we introduce a probabilistic model that can be fitted efficiently to any binary database, and that has a compact and explicit representation. We then show how this model enables the formalization of an intuitive and tractable interestingness measure.

    …elegance and efficiency of the algorithms to search for frequent patterns. Unfortunately, the frequency of a pattern is only loosely related to interestingness. The output of frequent pattern mining methods is usually an immense bag of patterns that are not necessarily interesting, and often highly redundant with each other. This has hampered the uptake of FPM and FIM in data mining practice. Recent research has shifted focus to the search for more useful formalizations of interestingness that match practical needs more closely, while still being amenable to efficient algorithms. As FIM is arguably the simplest special case of frequent pattern mining, it is not surprising that most of the recent work has focussed on itemset patterns, see e.g. …
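    The gap between frequency and interestingness is easy to see with the simplest possible probabilistic model: an independence model fitted from single-item marginals, used here as a stand-in for the richer model the paper introduces (data invented). An itemset is interesting when its observed support exceeds what the model expects:

```python
from itertools import combinations

# Toy binary database: each row is the set of items in one transaction.
transactions = [
    {"bread", "butter"}, {"bread", "butter", "milk"},
    {"milk"}, {"bread", "butter"}, {"bread"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / n

def interestingness(itemset):
    """Observed support divided by the support expected under an
    independence model fitted from single-item marginals ("lift")."""
    expected = 1.0
    for item in itemset:
        expected *= support({item})
    return support(itemset) / expected  # > 1 means more than expected

for pair in combinations(["bread", "butter", "milk"], 2):
    print(pair, round(interestingness(set(pair)), 2))
# ('bread', 'butter') scores 1.25: frequent AND surprising under the model.
```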