
Ref. Ares(2016)2259176 - 13/05/2016

Interoperability Landscaping Report December 28, 2015

Deliverable Code: D5.1
Version: 3 – Intermediary
Dissemination level: PUBLIC

H2020-EINFRA-2014-2015 / H2020-EINFRA-2014-2
Topic: EINFRA-1-2014 – Managing, preserving and computing with big research data
Research & Innovation action
Grant Agreement 654021

Interoperability Landscaping Report

Document Description

D5.1 – Interoperability Landscaping Report
WP5 – Interoperability Framework
WP participating organizations: ARC, UNIMAN, UKP-TUDA, INRA, EMBL-EBI, AK, LIBER, UvA, OU, EPFL, CNIO, USFD, GESIS, GRNET, Frontiers, UoS
Contractual Delivery Date: 12/2015
Actual Delivery Date: 12/2015
Nature: Report
Version: 1.0 (Draft)
Public Deliverable

Preparation slip
Authors: Listed in the document
Edited by: Piotr Przybyła (UNIMAN), Matthew Shardlow (UNIMAN), 20/12/2015
Reviewed by: John McNaught (UNIMAN), 20/12/2015; Natalia Manola (ARC), 22/12/2015
Approved by: Natalia Manola (ARC), 16/1/2016
For delivery: Mike Hatzopoulos (ARC)

Document change record
V0.1 (Draft version): Initial document structure – Piotr Przybyła, Matthew Shardlow (UNIMAN)
V0.2 (Draft version): Included initial content – Piotr Przybyła, Matthew Shardlow (UNIMAN)
V1.0 (First delivery): Applied corrections and added discussion – Piotr Przybyła, Matthew Shardlow (UNIMAN)

Table of Contents

1. INTRODUCTION 8

2. REPOSITORIES 11
2.1 DESCRIPTION OF RESOURCES 11
2.1.1 METADATA SCHEMAS & PROFILES 12
2.1.2 VOCABULARIES AND ONTOLOGIES FOR DESCRIBING SPECIFIC INFORMATION TYPES 19
2.1.3 MECHANISMS USED FOR THE IDENTIFICATION OF RESOURCES 20
2.1.4 SUMMARY 20
2.2 REGISTRIES/REPOSITORIES OF RESOURCES 21

3. TOOLS AND SERVICES 32
3.1 COMPONENT COLLECTIONS AND WORKFLOW ENGINES 32
3.1 NON-TEXT MINING AND GENERAL-PURPOSE WORKFLOW ENGINES 40
3.2 ANNOTATION EDITORS 48

4. CONTENT 54
4.1 CHARACTER STANDARDS 54
4.1.1 MAJOR STANDARDS 55
4.1.2 STANDARDS FOR SPECIFIC LANGUAGES 55
4.1.3 INFORMING THE 56
4.1.4 CHARACTER ENCODING 57
4.1.5 SUMMARY 58
4.2 ANNOTATION FORMATS 58
4.2.1 DOMAIN-DEPENDENT 60
4.2.2 GENERIC 60
4.2.3 SUMMARY 63
4.3 ONTOLOGIES, DICTIONARIES AND LEXICONS 63
4.3.1 FORMATS FOR ENCODING LEXICA, ONTOLOGIES, DICTIONARIES 63
4.3.2 ONTOLOGIES, DICTIONARIES AND LEXICA 66
4.3.3 SUMMARY 70

5. COMPUTING INFRASTRUCTURES 71
5.1 OVERVIEW 71
5.1.1 CLOUD DEPLOYMENT MODELS 72
5.1.2 CLOUD SERVICE MODELS 72
5.2 INTEROPERABILITY AND PORTABILITY 74
5.2.1 RESOURCE MANAGEMENT 75


5.2.2 DATA MANAGEMENT 76
5.2.3 IDENTITY MANAGEMENT 78
5.2.4 APPLICATION DEVELOPMENT AND DEPLOYMENT 80
5.2.5 APPLICATION PORTABILITY 83
5.2.6 CONTAINERS 86
5.2.7 CLOUD MARKETPLACES 87
5.3 CLOUD COMPUTING STANDARDS ORGANIZATIONS AND INITIATIVES 87
5.4 PRODUCTION AND COMMERCIAL CLOUD OFFERINGS 90
5.5 DISTRIBUTED PROCESSING 92
5.5.1 MAPREDUCE & 92
5.5.2 DISTRIBUTED PROCESSING PROVIDERS 93

6. LEGAL ISSUES 96
6.1 CONTENT LICENSING 97
6.2 TOOL LICENSES 98
6.3 SERVICE TERMS OF USE 99
6.4 DATA PROTECTION 99
6.5 LICENSE INTEROPERABILITY 100

7. DISCUSSION 102
7.1 RELATED EFFORTS 102
7.2 LIMITATIONS OF THIS STUDY 102
7.3 OVERVIEW OF THE FIELD 103
7.4 GAP ANALYSIS 104

8. REFERENCES 105

Table of tables

TABLE 1. COMPARISON OF METADATA SCHEMATA AND PROFILES. 13
TABLE 2. COMPARISON OF VOCABULARIES AND ONTOLOGIES FOR INFORMATION TYPES. 19
TABLE 3. WEB SERVICE REGISTRIES. 22
TABLE 4. AGGREGATORS OF SCHOLARLY PUBLICATIONS. 22
TABLE 5. REPOSITORIES, REGISTRIES & CATALOGUES OF LANGUAGE RESOURCES. 23
TABLE 6. REPOSITORIES & REGISTRIES OF SERVICES AND RESOURCES. 24
TABLE 7. LANGUAGE SPECIFIC REPOSITORIES. 25
TABLE 8. PLATFORMS/INFRASTRUCTURES INTEGRATING LANGUAGE RESOURCES & SERVICES. 26
TABLE 9. COMPARISON OF COMPONENT COLLECTIONS AND WORKFLOW ENGINES. 33
TABLE 10. GENERAL COMPARISON OF WORKFLOW ENGINES. 42
TABLE 11. TECHNICAL ASPECTS OF WORKFLOW ENGINES. 43
TABLE 12. GENERAL COMPARISON OF ANNOTATION EDITORS. 50
TABLE 13. SUMMARY TABLE OF CHARACTER STANDARDS (NON-EXHAUSTIVE LIST). 54
TABLE 14. SUMMARY TABLE OF ANNOTATION FORMATS. 59
TABLE 15. MAIN FORMATS FOR LEXICAL RESOURCES. 64
TABLE 16. NON-EXHAUSTIVE LIST OF POPULAR KNOWLEDGE RESOURCES. 67
TABLE 17. COMPARISON TABLE FOR CLOUD API LIBRARIES. 81
TABLE 18. FILE IMAGE FORMATS. 84
TABLE 19. CLOUD COMPUTING STANDARD ORGANIZATIONS AND INITIATIVES. 88
TABLE 20. INITIATIVES RELEVANT TO CLOUD COMPUTING INTEROPERABILITY ISSUES. 89
TABLE 21. COMPARISON BETWEEN HADOOP SOFTWARE AND PROVIDERS. 95

Disclaimer

This document contains a description of the OpenMinTeD project findings, work and products. Certain parts of it might fall under partner Intellectual Property Rights (IPR) rules; prior to using its content, please contact the consortium head for approval. In case you believe that this document harms in any way the IPR held by you as a person or as a representative of an entity, please notify us immediately. The authors of this document have taken every available measure in order for its content to be accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document hold any responsibility for consequences that might arise as a result of using its content. This publication has been produced with the assistance of the European Union. The content of this publication is the sole responsibility of the OpenMinTeD consortium and can in no way be taken to reflect the views of the European Union. The European Union is established in accordance with the Treaty on European Union (Maastricht). There are currently 28 Member States of the Union. It is based on the European Communities and the member states' cooperation in the fields of Common Foreign and Security Policy and Justice and Home Affairs. The five main institutions of the European Union are the European Parliament, the Council of Ministers, the European Commission, the Court of Justice and the Court of Auditors. (http://europa.eu.int/) OpenMinTeD is a project funded by the European Union (Grant Agreement No 654021).

Contribution

Editors
• Piotr Przybyła (UNIMAN)
• Matthew Shardlow (UNIMAN)

Authors
• Sophia Ananiadou (UNIMAN)
• Sophie Aubin (INRA)
• Robert Bossy (INRA)
• Louise Deléger (INRA)
• Richard Eckart Castilho (UKP-TUDA)
• Vangelis Floros (GRNET)
• Dimitris Galanis (ARC)
• Byron Georgantopoulos (GRNET)
• Mark Greenwood (USFD)
• Masoud Kiaeeha (UKP-TUDA)
• Penny Labropoulou (ARC)
• Thomas Margoni (UoS)
• Mandy Neumann (GESIS)
• Wim Peters (USFD)
• Stelios Piperidis (ARC)
• Prokopis Prokopidis (ARC)
• Piotr Przybyła (UNIMAN)
• Angus Roberts (USFD)
• Ian Roberts (USFD)
• Stavros Sachtouris (GRNET)
• Jochen Schirrwagen (OpenAIRE)
• Matthew Shardlow (UNIMAN)

Reviewers
• John McNaught (UNIMAN)
• Natalia Manola (ARC)

Publishable Summary

The goal of this document is to present the current state of the field of text and data mining (TDM) with a focus on interoperability issues. This will serve as a starting point for subsequent OpenMinTeD efforts concentrating on developing interoperability specifications for the future platform. The report starts by introducing the concept of interoperability and its importance for TDM. Section 2 describes existing repositories of resources, services and components, allowing users to find a solution that best suits their needs. We also survey methods for describing resources, e.g., metadata schemata and vocabularies. The next part focusses on tools and services – active elements used to process textual data and produce annotations. We cover various component collections, annotation editors and workflow engines, both general-purpose and those focussed on text mining. In Section 4 we concentrate on static resources, including text encoding standards, annotation formats and, finally, common representations used in developing ontologies, dictionaries and lexicons. Sections 5 and 6 describe additional aspects that have a strong influence on TDM, i.e., computing infrastructure and legal issues. Finally, we provide a discussion, in which we describe previous related efforts, explain the limitations of this study and analyse the current gaps that need to be addressed in order to realise the goals of OpenMinTeD.

1. Introduction

The Open Mining Infrastructure for Text and Data (OpenMinTeD) is a new European initiative which seeks to promote the cause of text and data mining (TDM). OpenMinTeD will work within the field of TDM to promote collaboration between the providers of TDM infrastructures, as well as working outside of the field to encourage uptake in other areas which may benefit from TDM. Service providers will benefit from this project through the standardisation of formats for TDM as well as the creation of new interoperable TDM workflows, which will seek to standardise existing content and allow previously incompatible services to work together. Service users will benefit through the provision of training and support for TDM tailored to their fields of expertise.

Interoperability is key to the goals of OpenMinTeD and to the content of this report. Interoperability is the ability of two or more information systems to integrate with each other, exchange information and use this information with an understanding of its structure and semantics. Two interoperable systems typically share common vocabularies for actions (applications) and data. Another important notion is portability: the ability to transfer artefacts between, and to use these artefacts across, different information systems. These artefacts can be applications, structured or unstructured data, files, user credentials, metadata, etc. Portability is considerably facilitated by common and well-defined data structures and encodings. In this sense, standardized workload and data formats can play a fundamental role. Portability is an enabling factor for interoperability, but it is not sufficient on its own.

Interoperability and portability are important in order to avoid vendor lock-in, which prohibits optimal utilization of existing resources. Vendor lock-in means that use of a particular TDM resource ties a user into a specific closed ecosystem. Although one ecosystem may have a high degree of interoperability within itself, there is very little scope for interoperability between ecosystems. A developer must write their own conversion tools to integrate many formats, tools and corpora. Fostering deeper interoperability within TDM is at the core of OpenMinTeD.

Interoperability is vital at multiple levels of granularity, such as the way that a TDM tool encodes input and output, the format of metadata that is associated with a resource in a repository, or the licences associated with software and content. Each of these levels must be addressed if we wish to promote a truly interoperable culture within TDM. We have addressed several pertinent levels, as seen through the chapters of this report. The report itself has been split into the following five chapters to help us understand and categorise the existing technologies:
• Repositories (Chapter 2) describes interoperability related to metadata, which can be used in the discovery of and interaction with resources.
• Tools and Services (Chapter 3) looks at tools which can be used for text mining and addresses interoperability issues at this level.
• Content (Chapter 4) considers the format of the data to be mined. Interoperability is considered at multiple levels, such as character standards and annotation formats.

• Computing infrastructures (Chapter 5) looks at the interoperability of the standards surrounding the hardware which is used for TDM. Parallelisation and distribution are key to improving the speed of data-intensive TDM work.
• Legal issues (Chapter 6) takes a step back from the details of TDM and considers what licensing concerns should be taken into account during the TDM process. We also consider how multiple licences can be consumed and digested to help a user understand the conditions and rights attached to their TDM content.

In Figure 1, we show a high-level view of the interactions between the chapters of this report. Repositories contain content which is processed using tools and services running on computing infrastructures. Legal issues are at the centre of the TDM process: they must be considered at each stage. Each of these areas has a host of associated technologies which have been created to facilitate the specific task at hand. Interoperability is of great importance for each area.

Figure 1. Overview of the interactions between the elements of interoperability

This landscaping report is written as a part of the fifth work package (WP5) of OpenMinTeD. In WP5, we aim to develop strong guidelines for a framework of interoperability. We will develop specifications for TDM interoperability, which will be digested within the project and promoted for uptake throughout the wider community. We will leverage knowledge of existing systems and tools to create guidelines which take into account the best practices already in the field. WP5 is divided into four working groups (WGs). WG1 is focussed on interoperability of resource metadata. WG2 looks at knowledge bases and language resources. WG3 looks at what must be put in place to enable licence interoperability. WG4 is concerned with interoperability of workflows and annotations. The interests of these four working groups are reflected throughout the report. In addition, we have included a section on the interoperability issues surrounding distributed computing and storage. This report serves as an initial step towards the much larger task of creating guidelines for an interoperability framework. In this document we have surveyed the landscape of TDM. We have looked at what tools exist currently as well as the levels of interoperability between tools. This

report will form the basis of D5.1 and is due in Month 7 of the project. Through the act of surveying the broad interoperability landscape of TDM, we hope to create a document which will improve understanding of the interoperability landscape as a whole. This document will be useful both for people involved with the OpenMinTeD project and for those outside the project who are interested in using TDM in their research. The report will be drawn upon by the working groups set up in Task 5.2. The working groups will develop interoperability specifications, which will later be crafted into guidelines for an interoperability framework (Task 5.3). Finally, the guidelines will be used to create the architectural design of the final platform (WP6).

2. Repositories

This section presents interoperability issues related to the discoverability of, and interaction with, the resources that are in focus for OpenMinTeD. Specifically, resources in the framework of the project are distinguished into:
• content:
  - textual content that will be mined, such as scholarly publications, web pages or other text corpora
  - language/knowledge resources, such as computational lexica, terminological resources, language models, computational grammars, etc., that are used as reference and/or ancillary resources in the creation and/or operation of web services
• software, mainly in the form of web services and workflows but also as downloadable tools, for text mining / language processing objectives.

The first subsection looks into schemas and vocabularies used for the description of resources and the second subsection considers registries and repositories where resources are stored.

2.1 Description of resources

The following section presents popular metadata schemas, application profiles, vocabularies and mechanisms that are used in the description and exchange of content and web services relevant to the project. As can be seen from the overview below, there exists a wide range of metadata schemas and profiles used for the description of resources (both content and software). This variety is, to a great extent, due to the divergent needs and requirements of the communities for which they are developed. Thus, schemas for publications originally come from publishers, librarians and archivists, while schemas for language/knowledge resources come from researchers. Initially, these mainly came from researchers involved in language research, as well as those working on language-related activities. Even closely related communities in the humanities have proposed different schemas. Currently, we also witness the crossing of the various communities as the objects of their interest expand to those of the other communities, e.g., in order to link publications and the datasets described in them. Differences between the schemas are attested at various levels, such as:
• types of information (e.g., identification, resource typing, provenance, classification, licensing, etc.) covered by the schema
• the granularity of the schema, ranging from detailed schemas to very general descriptions, including more or less mandatory and recommended description elements
• degree of freedom allowed for particular elements (e.g., use of free text statements vs. recommended values vs. entirely controlled vocabularies)
• use of alternative names for the same concept(s).

All of the above features, especially the degree of granularity and control of the element values, influence the discoverability of resources and, in consequence, the performance of the TDM process.

2.1.1 Metadata schemas & profiles

Metadata can generally be defined as data which is used to describe data. For example, metadata for scientific articles should include authors' names, date of publication, journal name, etc. To make this information exchangeable we use metadata schemas, which define common encoding formats. This section presents the most common metadata schemas and profiles used for the description of resources (content and services). They are presented without any particular qualitative or quantitative criteria, starting from the more general ones (for all kinds of digital resources) to the more specific ones (publications, language resources, web services). These are enumerated in Table 1 and further described afterwards.

Table 1. Comparison of metadata schemata and profiles.

Name | Status | Domain | Schema type
DC/DCMI | Last update June 2012 | Generic | metadata schema
OLAC | Last update May 2008 | Language Resources and Services | application profile
TEI | Actively Maintained | Documents | metadata schema
JATS | Actively Maintained | Journal Articles | metadata schema
OpenAIRE | Actively Maintained | Literature Repositories, Data Archives, Current Research Information Systems | application profile
RIOXX | Final Release | Open Access Publications | application profile
DataCite | Actively Maintained | Research Data and Publications | metadata schema
CrossRef | Actively Maintained | Research Data and Publications | metadata schema
CERIF | Actively Maintained | Research Information | metadata schema
DCAT | Last update January 2014 | Data Catalogues | vocabulary
CKAN | Actively Maintained | Generic | metadata schema
CMDI | Actively Maintained | Generic | infrastructure for metadata profiles
META-SHARE | Actively Maintained | Language Resources | metadata schema
LRE Map | Updated at each LREC (biennial) | Language Resources | metadata schema
ALVEO | Actively Maintained | Collections, items, documents | metadata schema
UIMA/GATE descriptors | Actively Maintained | Workflow Components | metadata schema
Maven Metadata / Maven Nexus Indexer | Actively Maintained | Software | profile
Docker | Actively Maintained | Software | metadata schema
SWSF | Last update September 2005 | Web Services | metadata framework (vocabulary)

DUBLIN CORE (DC) / DUBLIN CORE METADATA INITIATIVE (DCMI)

The Dublin Core Metadata Element Set, usually referred to as "DC", consists of 15 elements (properties) which are generic enough to be used for describing any kind of digital resource. It forms part of a larger set of metadata vocabularies and technical specifications maintained by the DCMI. The DCMI Metadata Terms ("DCMI-TERMS") is the full set of metadata vocabularies, including a vocabulary for classifying resources (the DCMI-TYPE), vocabulary encoding schemes and syntax encoding schemes. The terms in DCMI vocabularies are intended to be used in combination with terms from other compatible vocabularies, in the context of application profiles and on the basis of the DCMI Abstract Model (DCAM). DCMI-TERMS is the most widespread metadata vocabulary for describing digital resources, given that all metadata schemas are mapped to DC elements for exchange purposes. However, it has often been criticised as too minimal for expressing more elaborate descriptions as required by specific communities. This defect is often remedied through the use of extensions and qualified elements compatible with the DCMI specifications.
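To make the notion of a metadata record more concrete, the following minimal sketch (illustrative only, not taken from the DCMI documentation) shows how a Dublin Core description could be serialised as oai_dc XML in Python. The namespace URIs are the standard DC and OAI ones; the element values are invented.

    # Minimal sketch: serialising a Dublin Core description as oai_dc XML.
    # The example values are invented; the namespaces are the standard ones.
    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
    ET.register_namespace("dc", DC)
    ET.register_namespace("oai_dc", OAI_DC)

    record = ET.Element(f"{{{OAI_DC}}}dc")
    for element, value in [
        ("title", "Example corpus of open access abstracts"),
        ("creator", "Doe, Jane"),
        ("date", "2015-12-01"),
        ("type", "Dataset"),        # a DCMI-TYPE value
        ("language", "en"),
        ("rights", "CC-BY 4.0"),
    ]:
        ET.SubElement(record, f"{{{DC}}}{element}").text = value

    print(ET.tostring(record, encoding="unicode"))

Schemas such as OLAC (below) refine exactly this kind of record by constraining element values and adding domain-specific extensions.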

OPEN LANGUAGE ARCHIVES COMMUNITY (OLAC)

The OLAC metadata schema is meant for describing language resources and providing associated services (cf. the OAI protocols for harvesting). It is based on the DC qualified metadata set, with recommended extensions for specific linguistically relevant elements (e.g., language, discourse-type) and/or elements as applied in activities related to language resources (e.g., for participant role the recommended vocabulary includes types of roles associated with the development of language resources); user extensions are also allowed if they conform to the OLAC specifications. Due to the wide use of the harvesting protocol, the OLAC metadata schema is widespread in the communities targeted by OpenMinTeD.

TEXT ENCODING INITIATIVE (TEI)

The Text Encoding Initiative (TEI) consortium has developed a "standard for the representation of texts in digital form", currently the most widely used format in the area of humanities. The TEI P5 guidelines (the most recent version of TEI) include recommendations both for the bibliographic-style description of texts and text collections (external metadata in the form of "headers" – a notion that has influenced other metadata standards also) as well as for the representation of the internal structure of the texts themselves (form and content) and their annotations. The TEI header is of most relevance in the context of this section. It consists of a large set of metadata elements organized into five components; a minimal and a recommended level of encoding for the bibliographic information is also included in the guidelines. The TEI header is commonly used for text corpora (also as a base for the CES/XCES header).

JOURNAL ARTICLE TAG SUITE (JATS)

JATS is a suite of metadata schemas for publications (mainly, but not only, journal articles); it is an application of NISO Z39.96-2012, which defines a set of XML elements and attributes for tagging journal articles and describes three article models: Journal Archiving and Interchange, Journal Publishing and Article Authoring. There is also an extension, namely the Book Interchange Tag Suite (BITS), which caters for the exchange of book content between publishers and archives. The tags cover both external metadata and internal metadata (i.e., referring to the textual content of the publication).

OPENAIRE

OpenAIRE harvests from various types of sources (repositories of scholarly publications in oai_dc format, data archives in DataCite format, CRIS in CERIF-XML format). The current versions of the OpenAIRE Guidelines rely on the 'info:eu-repo' vocabularies and encoding schemes to be used in specific metadata elements; these are subject to revision. The processed metadata are exposed in a unified OpenAIRE format. In addition, publishing these metadata as LOD is work in progress and will be available in 2016.

RIOXX

Metadata schema for publications for the UK open access repositories; it includes DC elements and also references/refinements of other vocabularies (e.g., JAVR); a mapping to the OpenAIRE schema is provided.

DATACITE

The DataCite metadata schema is meant to describe research data for citation purposes. The schema includes a set of obligatory, recommended and optional elements and specifications for the use of controlled vocabularies for some of these elements. A detailed set of relations between versions of the research data is included in these specifications. Although it intends to cover all types of research data, it is more tuned to publications. An important feature of DataCite is the assignment of PIDs in the form of DOIs and the principles according to which these are assigned.

CROSSREF

CrossRef is an ID registry for scholarly publications, including research datasets; IDs are assigned in the form of DOIs and are used for citation and attribution purposes. The metadata schema includes basic bibliographic information.

COMMON EUROPEAN RESEARCH INFORMATION FORMAT (CERIF)

Data model and supporting tools catering for the description of research information, e.g., researchers, research organisations and research products (such as publications and datasets).

The current version is compatible with the OpenAIRE guidelines thus enabling the harvesting of metadata from CRIS (Current Research Information Systems).

DATA CATALOG VOCABULARY (DCAT)

DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogues published on the web. The schema caters for the description of catalogues, catalogue records as well as datasets included in the catalogues and their distributions, i.e., any accessible form of the dataset (e.g., as a downloadable file, as a web service that provides the data, etc.). To cater for the particular needs of the European public sector, a dedicated profile has been created, namely the DCAT Application Profile for data portals in Europe (DCAT-AP). It is extensively used for government data catalogues and is also growing popular in the wider Linked Data community.
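As a concrete illustration of the vocabulary, the sketch below describes a dataset and one distribution in DCAT using the rdflib library; the URIs, title and licence are invented, and the properties used are only a small subset of what DCAT (or DCAT-AP) offers.

    # Minimal DCAT sketch with rdflib: one dataset with one downloadable
    # distribution. All example URIs and literals are invented.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, DCTERMS

    DCAT = Namespace("http://www.w3.org/ns/dcat#")

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCTERMS)

    dataset = URIRef("http://example.org/dataset/abstracts-corpus")
    dist = URIRef("http://example.org/dataset/abstracts-corpus/zip")

    g.add((dataset, RDF.type, DCAT.Dataset))
    g.add((dataset, DCTERMS.title, Literal("Example corpus of abstracts", lang="en")))
    g.add((dataset, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
    g.add((dataset, DCAT.distribution, dist))

    g.add((dist, RDF.type, DCAT.Distribution))
    g.add((dist, DCAT.downloadURL, URIRef("http://example.org/files/abstracts.zip")))
    g.add((dist, DCAT.mediaType, Literal("application/zip")))

    print(g.serialize(format="turtle"))  # rdflib 6 returns a string here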

CKAN

CKAN is an open source data management system, very popular among open data communities. Its backbone consists of a limited but useful set of metadata elements generic enough to describe all types of datasets, including title, description, format(s) and licensing.

CLARIN COMPONENT METADATA INFRASTRUCTURE (CMDI)

The CLARIN (Common Language Resources and Technology Infrastructure) Research Infrastructure targets scholars from various Social Sciences and Humanities communities; to accommodate their needs and promote interoperability between their resources, it has endorsed a flexible mechanism for creating, storing and using various metadata schemas, the CMDI. From the theoretical point of view, the CMDI builds upon the notions of component and profile. Components group together semantically associated metadata elements and act as building blocks for describing resources (e.g., a communication component will include address, phone information, etc.); profiles are ready-made templates consisting of components and elements which can be used for the description of specific resource types (e.g., a profile for poems or for corpora used by linguists). CLARIN groups are encouraged to re-use components and profiles that have already been created by other groups (especially the recommended ones) instead of creating new ones. Semantic interoperability among the various components and profiles is achieved through the linking of the metadata elements (concepts) via the CLARIN Concept Registry. From the technical point of view, the CMDI is powered by the Component Registry, which is the shared registry for metadata profiles and components, and the CLARIN Concept Registry for the documentation and linking of concepts. The Component Registry currently hosts 182 profiles, but there is no data to indicate their usage.

META-SHARE

The META-SHARE metadata schema is used by the META-SHARE infrastructure to describe all kinds of language resources (mainly of interest to the Language Technology community), including datasets (e.g., corpora, ontologies, computational lexica, grammars, language models, etc.) and language processing tools/services (e.g., parsers, annotators, term extractors, etc.). Following the recommendations of CMDI, the schema is organised in the form of components, describing the full lifecycle of a language resource, from its production to its usage. A part of the metadata components is common to all resource types (containing administrative information, e.g., contact points, identification details, versioning, etc.), while metadata referring to technical information (e.g., text format, size and language for corpora, requirements for the input and output of tools/services, etc.) are specific to each resource type. The META-SHARE schema is implemented as an XML Schema Definition (XSD). There is also a CMDI-compatible version (included as four separate profiles in the CLARIN Component Registry) and an RDF/OWL version (work in progress, done in collaboration with the ld4lt group).

LRE MAP

The metadata schema that is used for the LRE Map, which is populated directly by authors who submit their publications to the LRE conferences, intends to cover mainly language resources, including datasets and software tools and services. It is minimal, i.e., it includes a limited set of elements, most of which are filled in with free text statements; post-monitoring assists in the normalisation of the user-provided values.

ALVEO

The ALVEO infrastructure collates the stored data into three levels, namely collections, items (equivalent to a record for a single communication event) and documents; each level is associated with a particular set of metadata. The entire metadata set is based upon the different metadata sets which are provided by the different sources; these are mapped, as far as possible, to a common vocabulary of metadata elements, exploiting the DC and OLAC schemas where feasible.

UIMA COMPONENT DESCRIPTOR / GATE DESCRIPTORS

UIMA and GATE provide their own metadata schemes and formats to describe processing resources, their input/output capabilities, resources used for processing, etc. The information is mainly focussed on aspects required during the composition and execution of a pipeline. Versioning (and dependencies) are less relevant.

MAVEN METADATA / MAVEN NEXUS INDEXER

Maven describes various important metadata about software, including ID (groupId, artifactId), version, human readable name, description, licences, dependencies, authors, and much more. Maven is widely used for distributing and obtaining software and libraries based on the Java platform.

The Maven repository ecosystem is also exploited by many other build systems (Ivy, sbt, etc.) and languages (Scala, Groovy, etc.). The Maven Nexus Indexer metadata captures all the metadata of Maven artefacts in a Maven repository. A repository (e.g., a mirror or proxy) or a client (e.g., an IDE) can download this metadata to facilitate search through the originating repositories' contents. An update mechanism is supported as well, such that it is not always necessary to download the full index.
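The value of these coordinates for interoperability comes from the fixed mapping between them and the layout of any Maven-compatible repository, which is what lets different build tools resolve the same artefact. The helper below illustrates that mapping; the base URL is the public Maven Central repository and the coordinates are chosen purely as an example.

    # Illustrative helper: mapping Maven coordinates (groupId, artifactId,
    # version) onto the standard repository layout. Dots in the groupId
    # become path segments; the default base URL is Maven Central.
    def artifact_url(group_id: str, artifact_id: str, version: str,
                     packaging: str = "jar",
                     repo: str = "https://repo1.maven.org/maven2") -> str:
        group_path = group_id.replace(".", "/")
        file_name = f"{artifact_id}-{version}.{packaging}"
        return f"{repo}/{group_path}/{artifact_id}/{version}/{file_name}"

    # e.g. the UIMA core library (coordinates used only as an example):
    print(artifact_url("org.apache.uima", "uimaj-core", "2.8.1"))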

DOCKER

Docker is a recent technology for building and distributing software images and running them in virtual containers (see Section 5.2.5). Like Maven, Docker uses repositories to publish images. In fact, some repository software systems that support Maven also support Docker (e.g., Artifactory). While Maven builds on the Java platform, Docker builds on Linux. Thus, Docker is theoretically also suitable for packaging otherwise difficult-to-deploy software, such as Python setups or native Linux programs, in a "portable" fashion. Docker images can depend on each other and can be composed such that, e.g., an image for a specific TDM tool contains only the tool and depends on other images that provide, e.g., the base operating system and a Python interpreter.

SEMANTIC WEB SERVICES FRAMEWORK (SWSF)

The SWSF intends to cover the description of web services for the Semantic Web; it includes two major components:
1. the Semantic Web Services Language (SWSL), which is used to specify formal characterizations of Web service concepts and descriptions of individual services, and
2. the Semantic Web Services Ontology, which presents a conceptual model by which Web services can be described, and an axiomatization, or formal characterization, of that model.

Harvesting protocols

To ensure interoperability between content repositories and archives, harvesting protocols are widely used. They are based on metadata exchange, which is accomplished by data providers exposing their metadata records according to a pre-defined structure and service providers harvesting these metadata. The most popular are:
• OAI-PMH is a protocol designed by the Open Archives Initiative (OAI), used for retrieving and aggregating (import/export) bibliographic metadata from archives in order to build services using the contents of many archives. Providers are required to offer metadata in Dublin Core format, but other formats are also allowed (a minimal harvesting sketch is given after this list).
• OAI-ORE is a standard, also designed by OAI, used for the description of aggregations of web resources. These aggregations may include resources with different media types for enriching content served to applications.
• ResourceSync is a general protocol for synchronising content (not necessarily metadata) between environments (e.g. client and server).
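A minimal harvesting client, sketched below, shows how little is needed to consume OAI-PMH: a ListRecords request with the oai_dc metadata prefix, followed by resumption-token paging. The endpoint URL is a placeholder, and a production harvester would add error handling and incremental (from/until) harvesting.

    # Minimal OAI-PMH harvesting sketch (ListRecords + resumptionToken paging).
    # The base URL is a placeholder endpoint, not a real repository.
    import xml.etree.ElementTree as ET
    import requests

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    BASE_URL = "https://repository.example.org/oai"

    def harvest(base_url=BASE_URL):
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        while True:
            root = ET.fromstring(requests.get(base_url, params=params).content)
            for record in root.iter(OAI + "record"):
                yield record
            token = root.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break  # no more pages
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    for rec in harvest():
        header = rec.find(OAI + "header/" + OAI + "identifier")
        print(header.text if header is not None else "(no identifier)")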

2.1.2 Vocabularies and ontologies for describing specific information types

Metadata schemas are not enough for a comprehensive description of resources, as we also need to know what particular fields mean. For example, a metadata schema for scientific articles may include a field called "author", but this raises several questions: Is every author a person? How do we uniquely identify a person? How do we encode his/her academic title? etc. However, when we associate the field with a particular element in a vocabulary or ontology, we can refer to it in case of doubt. The resources in Table 2, mainly controlled vocabularies, authority lists and ontologies, are presented insofar as they are widely used and/or can be used for improving existing schemas in recording information considered valuable for the project's objectives.

Table 2. Comparison of vocabularies and ontologies for information types.

Title | Description
PROV Ontology (PROV-O) | Ontology for declaring provenance information; implemented in OWL 2.0
Semantic Web for Research Communities (swrc) | Ontology for modeling entities of research communities such as persons, organizations, publications (bibliographic metadata) and their relationships.
Bibliographic Ontology (bibo) | Ontology for bibliographic information in RDF/XML format
BibJSON | Convention for representing bibliographic metadata in JSON; it makes it easy to share and use bibliographic metadata online.
Creative Commons Rights Expression Language (CC-REL) | The Creative Commons vocabulary for expressing licensing information in RDF and the formal ways licensing may be attached to resources.
Open Digital Rights Language (ODRL) | ODRL is a Rights Expression Language providing mechanisms to describe distribution and licensing information of digital content.
Software Package Data Exchange (SPDX) | SPDX is a metadata specification in RDF for the machine readable description of software packages with respect to their copyright status and licenses.
Dewey Decimal Classification (DDC) | Classification scheme, also widely used for the classification of documents.
Universal Decimal Classification (UDC) | Classification system widely used in libraries, bibliographic, documentation and information services.
Library of Congress Subject Headings (LCSH) | The classification system used for classifying material in the Library of Congress, which has spread to other libraries and organizations as well.
EuroVoc | Multilingual, multidisciplinary thesaurus covering the activities of the EU (in particular the European Parliament); it is extensively used for the classification of EU publications and, in general, public sector administrative documents.
Medical Subject Headings (MeSH) | Controlled vocabulary for classification relevant to the medical domain.
EDAM (EMBRACE Data and Methods) ontology | Ontology of well established, familiar concepts that are prevalent within bioinformatics, including types of data and data identifiers, data formats, operations and topics.
CASRAI dictionary | Dictionary on research output types
COAR Resource Type Vocabulary | Controlled vocabulary for types of digital resources, such as publications, research data, audio and video objects, etc.; the vocabulary is implemented in SKOS.

2.1.3 Mechanisms used for the identification of resources

This subsection presents mechanisms used for the identification of resources. This constitutes an important point in the discovery of resources, as it facilitates deduplication, versioning and the relation between resources (e.g., raw and annotated text, scholarly articles before and after review, corrected or enriched versions of a lexicon, etc.). The way resource IDs will be used in the project is important, as they can be used to link, but also to differentiate between, the processed versions in a workflow, as well as to identify parts of resources when used together, for instance, to create a collection or (in the case of web services) to form a workflow. Thus, it will be discussed also with other WGs. Here, we present the main mechanisms used for assigning Persistent Identifiers (PIDs). The assignment of unique PIDs allows us not only to refer to a resource unambiguously, but also to refer to it throughout time, even when resources are moved between different locations (URLs). The most popular PID systems that are already used for the resources the project targets are:
• URIs (Uniform Resource Identifiers), either in the form of URLs (Uniform Resource Locators) or URNs (Uniform Resource Names)
• Handle PIDs, which are abstract IDs assigned to a resource in accordance with the Handle schema (which is specified in RFCs); resource owners have to register their resources with a PID service, get the ID and add it to the description of the resource; a PID resolver (included in the system) redirects the end users to the location where the resource resides
• DOIs (Digital Object Identifiers): serial IDs used to uniquely identify digital resources; widely used for electronic documents, such as digitally published journal articles. The DOI system is based on the Handle system and a DOI can be accompanied by its own metadata. As with Handle PIDs, the resource owner adds the DOI to the description of the resource.
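In practice, Handle PIDs and DOIs are resolved through public HTTP resolvers. The sketch below (illustrative only) follows the redirect returned by the public https://doi.org resolver to find where a resource identified by a DOI currently lives; Handle PIDs resolve analogously via https://hdl.handle.net.

    # Minimal sketch: resolving a DOI via the public doi.org resolver.
    # DOIs are built on the Handle system, so Handle PIDs resolve analogously.
    import requests

    def resolve_doi(doi: str) -> str:
        """Return the landing URL currently registered for a DOI."""
        resp = requests.get(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
        resp.raise_for_status()          # unknown DOIs yield an HTTP error
        return resp.headers["Location"]  # the resolver answers with a redirect

    # The DOI of the DOI Handbook, used here purely as a well-known example:
    print(resolve_doi("10.1000/182"))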

2.1.4 Summary

We have seen that the world of metadata and resource descriptors is wide and varied. For the OpenMinTeD project to begin to address interoperability in this area we must take account of the following points:
• Dublin Core is a very widespread metadata schema, which is best suited to resource description in the general domain.
• If domain-specific metadata is required, then an appropriate schema should be selected (Table 1).
• If possible, an existing general metadata schema should be used to ensure compatibility with existing resources.
• If a new metadata schema is required (e.g. for a new domain), then an existing metadata schema should be extended to maximize the level of compatibility between the existing resources and the new resource.
• It may also be necessary to incorporate an ontology (Table 2) to help define the specific meanings of metadata fields.

Public Page 20 of 112 Interoperability Landscaping Report    The choice of ontology should depend upon the domain and the user's needs.

2.2 Registries/Repositories of resources

This section describes the online "places" where resources can be discovered. Some have also been presented in 2.1, from a different perspective, i.e., the schema adopted/recommended for the description of the resources they host, especially when this is used widely by a variety of resource providers. Below we give a brief description of each resource. The resources are organised into six categories:
1. Web Service Registries: online registries of web services that (at least partially) concern text processing. For each of them, we provide the number of available services, accessibility, supported standards, status, domain and market share (where possible and applicable). We also include references, when available.
2. Aggregators of Scholarly Publications: scholarly publications can be accessed through a wide range of sites and portals, e.g. publishers' sites, project sites, institutional and thematic repositories, etc. We outline only widespread repositories that aggregate content from other sources, mainly open access publications, given that they can function as central points of access.
3. Repositories, Registries & Catalogues of Language Resources: catalogues that contain descriptions of static resources that can be downloaded, such as corpora, dictionaries or lexicons. We include both catalogues that only contain descriptions of resources available elsewhere and repositories actually storing the data.
4. Repositories & Registries of Services and Resources: repositories that do not concentrate on services or resources only, but contain both types of items in a single interface. They are enumerated with the same set of attributes as the previous resources.
5. Language Specific Repositories: most of the repositories mentioned so far are multi-language, but a vast majority of content concerns only English. As other languages are also areas of active research, for some of them there exist separate resource repositories.
6. Platforms / Infrastructures Integrating Language Resources & Services: the final set of resources presents platforms / infrastructures that focus on the interaction between language resources and text processing services, i.e., they enable running of web services either on data included in the platform or uploaded by the user. They target users with low expertise in technology and try to provide ready-made solutions for processing data rather than discovering resources. The interest for OpenMinTeD lies in the ways interoperability between resources and services is accomplished.

Table 3. Web Service Registries.

Title | Records | Accessibility | Status | Domain | Supports
BioCatalogue | 1186 | open access/use and open registration | running (API and engine remain in beta stage and have not been updated since 2010) | closed (life sciences) | SOAP, REST
BiodiversityCatalogue | 61 | open access/use and open registration | running | closed (biodiversity) | SOAP, REST
AnnoMarket | 60 | paid for (customers can pay to use any service; third parties can upload their own services and data to sell) | running (beta; see description) | open | REST

Table 4. Aggregators of Scholarly Publications.

Title | Available records | Accessibility | Status | Domain
OpenAIRE | over 13,000,000 publications & 11,000 datasets | open access; repositories must be registered following a procedure if they are compatible with the OpenAIRE guidelines | running | open
Connecting Repositories (CORE) | over 25,000,000 open access articles | open access; only repositories offering Open Access publications are harvested, as identified through the OpenDOAR registry of repositories | running | open
Bielefeld Academic Search Engine (BASE) | over 80,000,000 documents from over 3,000 sources | open access | active | open

Table 5. Repositories, registries & catalogues of language resources.

Title | Available records | Accessibility | Status | Domain
ELRA Catalogue of Language Resources | 1137 | open browse; access to resources free or on fee, depending on the resource; restricted registration | running | language technology
LDC catalogue | over 900 resources | open browse; access to resources depending on resource; restricted registration | running | language technology
VEST Registry | 118 resources | open (anyone can browse resources and ask for registering his own) | running | agriculture, food, environment
AgroPortal | 35 resources | open (anyone can browse resources and register his own) | starting | agriculture, environment
BioPortal | 576 resources | open (anyone can browse resources and register his own) | running | biology, health

Table 6. Repositories & registries of services and resources.

Title | Available records | Accessibility | Status | Domain
META-SHARE | more than 2,600 | restricted (anyone can access but addition of new resources requires registering as a META-SHARE member) | running (continuous additions of resources and new nodes) | open, but mainly language technology
CLARIN Virtual Language Observatory (VLO) | 863,285 (items include whole datasets as well as single files contained in the datasets) | restricted (anyone can browse and access resources but records are harvested from specific sources) | running | mainly Social Sciences and Humanities
LRE Map | 3985 | closed (no option to add own resources) | abandoned (no new services since 2012, since it is updated through the submission of papers at LRE conferences) | language technology
ZENODO | over 18,000 publications, papers, etc., 1650 datasets and 5,300 software resources | open (anyone can browse resources and register his own) | running | open (research)
datahub | 10,630 datasets | open (anyone can browse resources and register his own) | running | open

Table 7. Language Specific Repositories.

Title | Available records | Accessibility | Status | Domain | Language
CLIP | 178 | restricted (anyone can access but adding new services is not possible via the web interface) | active (new items added constantly) | – | Polish
IULA-UPF CLARIN Competence Center | over 300 records | open access but closed registry; uploading of records restricted to IULA resources | – | open | Spanish
ORTOLANG | 19 corpora, 4 lexica, 7 tools | open access to all | active (open in 2015, new items added constantly) | open | French
CLARIN national repositories | – | open access but registry of new resources as decided by each centre | active | open, mainly catering for Social Sciences and Humanities | European languages
META-SHARE national nodes | – | open access but registry restricted to members | active | open but mainly Language Technology | mainly European languages

Table 8. Platforms/Infrastructures integrating language resources & services.

Title | Available records | Accessibility | Status | Domain
ALVEO | – | open use of services; uploading of services locked | running | open
Language Grid | 142 services | open use of services for non-profit and research; uploading of services for members | running | open
LAPPS Grid | 45 language processing & data analyzing services | open use of services; uploading of services locked | running; web services under active development | open
QT21 | – | open browsing and use of services, restricted registry | beta | open

BIOCATALOGUE

BioCatalogue is a registry of web services performing tasks within the domain of life sciences. It lets users register, browse services, gather favorites, improve their descriptions and so on. The services added to the registry are catalogued, included in a search engine and monitored. Some of the services perform text processing tasks; BioCatalogue is also integrated with Taverna and myExperiment.

BIODIVERSITYCATALOGUE

BioDiversityCatalogue is a registry based on BioCatalogue, focusing on services in the area of biodiversity. It provides the same functions as BioCatalogue, but contains far fewer services.

ANNOMARKET

AnnoMarket.com is an online marketplace for text analytics services. Users can buy and sell GATE-based text processing services on a pay-per-use basis, together with data. Services may be accessed via a web front end, or via a REST API. AnnoMarket provides account management, authentication, and job monitoring. The underlying platform is Amazon EC2, and it can therefore scale to handle terabytes of data. AnnoMarket functionality is being transferred to a research-only service, GATECloud, and a commercial service, S4.

OPENAIRE

OpenAIRE aims to bring together and link publications and research data. It has been created and continues to be supported by the EU through successive research projects in order to improve

the discoverability and reusability of research publications and data. Currently, OpenAIRE harvests from over 700 data sources that span over 6,000 repositories and Open Access journals.

CONNECTING REPOSITORIES (CORE)

The mission of CORE (COnnecting REpositories) is to aggregate all open access research outputs from repositories and journals worldwide and make them available to the public. In this way, CORE facilitates free unrestricted access to research for all. CORE harvests openly accessible content, available according to the definition of open access.

BIELEFELD ACADEMIC SEARCH ENGINE (BASE)

BASE harvests all kinds of academically relevant material from content sources which use OAI-PMH. It collects, normalizes, and indexes these data, enabling users to search and access full texts of articles.

ELRA CATALOGUE OF LANGUAGE RESOURCES

The ELRA Catalogue of Language Resources hosts descriptions of various resources, including spoken resources, written resources (corpora and lexica), terminological resources and multimedia/multimodal resources. Most of these are distributed via ELDA on a fee basis, although free resources are also available. ELRA also hosts the Universal Catalogue, which is a repository comprising information regarding Language Resources identified all over the world, either by the ELRA team or registered by users.

LDC CATALOGUE

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories in the US hosting, among other language resource related activities, a repository and distribution point for language resources.

VEST REGISTRY

The VEST Registry is maintained by the UN-FAO to serve as the reference directory of vocabularies, metadata standards and tools that can help manage information on agriculture. Each resource is described in terms of type, content, domain, format, history of creation, contact, links to other vocabularies and how it is used. Descriptions are available in different languages.

AGROPORTAL

AgroPortal is a repository of vocabularies in agriculture, including ontologies (OBO and OWL) and thesauri (SKOS), published according to Semantic Web principles. The repository builds on the BioPortal framework and comes with search, discovery, recommendation, mapping and annotation functionalities.

BIOPORTAL

BioPortal is a repository for accessing and sharing biomedical ontologies (OBO, OWL and UMLS) and, more recently, thesauri (SKOS) published according to Semantic Web principles. The repository comes with search, discovery, recommendation, mapping and annotation functionalities.

META-SHARE

META-SHARE is a multi-layer distributed infrastructure for sharing language data and related tools. It consists of Managing Nodes, which maintain and operate the central inventory of all resources available through the META-SHARE network, created on the basis of the metadata records which are harvested from all Other Nodes of the network. Resources are stored in the META-SHARE repositories and described by features like availability, license, contact person, data format, size or tools used to create them (cf. the META-SHARE metadata schema above). META-SHARE was developed within the META-NET network of excellence. It currently consists of 30 repositories.

CLARIN - VIRTUAL LANGUAGE OBSERVATORY (VLO)

The VLO exposes CMDI-compatible metadata records harvested from the CLARIN registered centres (cf. Centre Registry) and other sources using the OAI-PMH protocol. Resources include single text files, corpora (raw and annotated), images, videos, lexica, vocabularies, bibliographies, web services and tools, etc.

LRE MAP

The LRE Map was established at the LREC 2010 conference to gather information about existing and newly-created resources mentioned in scientific publications of language technology related conferences. Available description fields include: conference name, year, type, availability, modality, languages, type, license and URL. No description or download options are available via the LRE Map; a user needs to go to a resource's website to get more information. The LRE Map is populated mainly through the submission of papers at LREC, COLING, RANLP, EMNLP and other conferences.

ZENODO

Repository where individual researchers (or research communities) can upload and describe any kind of research output, from any discipline, in any format and size. There is a very limited set of metadata elements for their description (given the wide variety of resource types targeted). All uploads get a DOI and are stored in a cloud infrastructure.

DATAHUB

datahub is the free, powerful data management platform from the Open Knowledge Foundation. It is open to all individuals that wish to upload their datasets, with a limited set of metadata elements powered by the CKAN editor.

CLARIN NATIONAL REPOSITORIES

In the framework of the European CLARIN Research Infrastructure, a number of national repositories for language resources (datasets and tools/services) have been built or are in the process of being built or certified; the full list of certified centres (operating the repositories) is found at the CLARIN website, as well as the list of CLARIN national consortia. The repositories of each country focus on the language(s) spoken in it. Some of the repositories that present particularly interesting features are described in this section as separate entities. Certified centres allow harvesting of their metadata to the CLARIN Virtual Language Observatory.

META-SHARE NATIONAL NODES Around the META-NET project, three satellite projects have evolved for the creation of META- SHARE nodes that would focus on national languages. The three projects and corresponding partners that have created these nodes are:  CESAR (CEntral and South-eAst european Resources)67  META-NORD68  META-NET4U69 CLIP CLIP (Computational Linguistics in Poland) is a web portal maintained by the Institute of Computer Science in Warsaw, aiming at containing exhaustive information about resources and services developed for Polish text processing. It includes corpora of different kinds, dictionaries, tools for morphological and syntactic analysis, links to online services and much more.

IULA-UPF CLARIN COMPETENCE CENTER

The catalogue provides information in all areas of eHumanities, listing content resources (corpora, lexica as well as publications), language technology tools and research projects. It is possible to navigate all types of resources using the same query terms, since they are all linked together (e.g., find publications and datasets of the same authors, web services and the projects they were funded by, publications and web services relevant to a specific topic, etc.). It has been implemented following the philosophy of Linked Open Data (LOD), which makes it possible to query and link materials from different sources, both internal and external, such as DBpedia, VIAF (The Virtual International Authority File), BBPL and ISOCAT. The metadata model is based on the LOD version of the META-SHARE schema.

ORTOLANG

ORTOLANG (Outils et Ressources pour un Traitement Optimisé de la LANGue) is a web portal maintained by a consortium of French public institutes, aiming to offer a network infrastructure including a repository of French language data (corpora, lexicons, dictionaries, etc.) and readily available, well-documented tools for its processing. It includes corpora of different kinds, dictionaries, and tools for morphological and syntactic analysis. It also provides registered users with a workspace to improve their vocabularies. ORTOLANG is expected to become the French node of CLARIN.

Platforms/Infrastructures integrating language resources and services
The following subsection presents platforms/infrastructures that focus on the interaction between language resources and text processing services, i.e., those that enable web services to be run either on data included in the platform or on data uploaded by the user. They target users with low expertise in technology and try to provide ready-made solutions for processing data rather than for discovering resources. The interest for OpenMinTeD lies in the ways in which interoperability between resources and services is accomplished.

ALVEO
Alveo is a web-based data discovery application for examining collections of human communication data. It comprises the Alveo Data Collections, which are aggregated from other sources and configured for quick searching; access to them depends on the license of the original material. The data can be further analyzed by web services using the Galaxy Workflow Engine, the NeCTAR Research Cloud, the R statistical package or any other preferred tool or platform. The Galaxy Workflow Engine provides Alveo users with a user-friendly interface to run a range of text, audio and video analysis tools. Workflows defining a sequence of analysis steps can be created and then shared with other researchers. Interfaces are provided to the UIMA Java framework, Python and NLTK, and the Emu/R environment. Annotations are stored using RDF following a model based on the ISO LAF standard, and the API makes use of an emerging standard for annotation interchange based on JSON-LD.

LANGUAGE GRID
Language Grid is an online multilingual service platform which enables the easy registration and sharing of language services such as online dictionaries, bilingual corpora, and machine translators.

LAPPS GRID
The LAPPS Grid is a “meta-platform” through which users can combine previously incompatible tools, workflows, and other NLP components to create customized applications that best suit their needs. It enables the direct running of integrated web services and user-created workflows on the MASC and Gigaword corpora as well as on data uploaded by the user. The services are available through the website70. Interoperability between different annotation layers of the data resources is accomplished through LIF (Linguistic Interchange Format).

QT21
The QT21 repository builds upon the META-SHARE infrastructure71 and provides all the support mechanisms for the identification, acquisition, documentation, sharing, search and retrieval of machine-translation-related datasets and language processing tools, and/or their metadata-based descriptions. It enables the automatic processing of datasets with matching language processing web services.

Summary
Repositories are essential for both the storage of and access to content, services and other relevant TDM resources. Interoperability in this area is a key goal of OpenMinTeD in order to facilitate researchers' discovery of relevant resources and knowledge. The project considers the following key points as it continues to develop deeper interoperability in this area:
• There are multiple locations in which a resource may be stored and accessed (Tables 4-9).
• Metadata is not consistent across these locations and there is very little exchange of information between them.
• Some repositories focus on either web services or scholarly content, whereas others provide both.
• Some repositories are language specific and intend to help researchers working in a specific language.
• Currently, a user must search in many locations to discover relevant resources.
• Many repositories are updated frequently, meaning a user must check back often to find the latest content.

3. Tools and Services
In this section, we will look at tools and services which are available for TDM. In a typical TDM use case, a researcher is not necessarily a skilled computer scientist. They may come from one of many fields which experience high publication volume and hence the need for TDM. In contrast, the researcher may be skilled in computer science, but not have the time to build their own TDM system from the ground up. Even if the researcher does have the time and expertise to build their own system, they may wish to draw upon modules and services which others have produced as an output of their research. All of these cases can be addressed through the development of TDM tools and services, of which many already exist. We have divided the TDM tools and services into the following three categories:
1. Systems which are built specifically for text mining purposes. These systems typically contain large repositories of algorithmic modules which can be linked together to create a workflow (sometimes referred to as a pipeline) of operations.
2. Systems which are built for the creation and open sharing of scientific workflows. Although these systems are not built directly with TDM in mind, they can easily be used for TDM purposes.
3. Annotation editors, which are essential tools in TDM, where annotated data is vital for the input and output of tools.
Authentication and authorization are an important aspect of the tools and services that we are concerned with. This is discussed in Section 5.2.3.

3.1 Text mining component collections and workflow engines
Almost all text mining applications are formulated as a workflow of operational modules. Each module performs a specific task on the text and, when it is finished, the next module begins. Some modules may be found commonly across pipelines (file lookup, sentence splitting, parsing), whereas other modules may be less common, or even built specifically for the task at hand. A text mining workflow defines a common structure for these modules and facilitates the creation of a pipeline of existing modules, as well as helping with the integration of new modules. Table 9 gives a comparison of component collections and workflow engines, which are described in further detail below.

Table 9. Comparison of component collections and workflow engines.

Name | Implementation | Open source license | Status / Last release
Alvis72 | Java | in transition to Apache 2.0 | active73; 0.4, 2014-02-25
Apache cTAKES74 | Java / UIMA | Apache Software License 2.0 | active75; 3.2.2, 2015-05-30
Apache OpenNLP76 | Java | Apache Software License 2.0 | active77; 1.6.0, 2015-07-13
Apache UIMA78 | Java | Apache Software License 2.0 | active79; 2.8.1, 2015-08-11
Argo80 | Java / UIMA | closed source / web service | active; not publicly released/versioned; no public repository
Bluima81 | Java / UIMA | Apache Software License 2.0 | active82; 1.0.1, 2015-05-11
ClearTK83 | Java / UIMA | BSD 2-clause | active84; 2.0.0, 2014-05-07
DKPro Core85 | Java / UIMA | Apache Software License 2.0 / partially GPL | active86; 1.7.0, 2014-12-19
GATE Embedded87 | Java | LGPL | active88; 8.1, 2015-06-02
Heart of Gold89 | Java + Python | GNU Lesser General Public License 2.1 | active90; 1.5, 2008-07-10
JCoRe91 | Java / UIMA | LGPL | active92; components released individually
NLTK93 | Python | Apache Software License 2.0 | active94; 3.1, 2015-10-15

active - visible activity on a public repository within the last 12 months at the time of writing, or assurance of the developers.

ALVIS
AlvisNLP/ML is a pipeline for the semantic annotation of text documents. It integrates Natural Language Processing (NLP) tools for sentence and word segmentation, named-entity recognition, term analysis, semantic typing and relation extraction.

These tools rely on resources such as terminologies or ontologies for adaptation to the application domain. AlvisNLP/ML contains several tools for (semi-)automatic acquisition of these resources, using Machine Learning (ML) techniques. New components can be easily integrated into the pipeline. Part of this work has been funded by the European project Alvis95 and the French project Quaero96.

APACHE CTAKES
The Apache cTAKES project provides a collection of open-source Java-based UIMA components specialized for processing free text from electronic medical records. It was originally developed at the Mayo Clinic and is currently used by various institutions internationally. The components provided by the project include specialized versions of typical NLP tools such as tokenizers and part-of-speech taggers, but also very specific information extraction components, e.g., for detecting the smoking status of a patient. Several components are based on Apache OpenNLP or were adapted from the ClearTK project. Interoperability between the components is ensured by the cTAKES type system. Another level of interoperability is provided by the Unified Medical Language System (UMLS) vocabulary, which is used by the project.

APACHE OPENNLP
Apache OpenNLP is a Java-based framework for natural language processing based on machine learning. The primary ML algorithms used by OpenNLP are maximum entropy and perceptron; however, the framework was recently extended to make it easy to plug in further algorithms. The OpenNLP project provides some pre-trained models, but most of these are trained only on relatively small publicly available corpora and are of sub-standard quality. Users are therefore encouraged to train new models from suitable corpora. The OpenNLP project provides wrappers for the Apache UIMA framework for most of its components. These wrappers are designed to be type-system agnostic: annotation types to be created and features to be filled can be configured through parameters. However, the parametrization is not flexible enough for use with arbitrary type systems. OpenNLP elements can be wrapped as components both in UIMA and in GATE.

APACHE UIMA
Apache UIMA is the reference implementation of the UIMA OASIS standard designed for dealing with unstructured information. Core components of UIMA are its data model — the Common Analysis System (CAS) — and its workflow model. UIMA conceptually aims at fostering the sharing and reuse of analytics components. Another core aspect, however, is the scale-out capability provided by UIMA's modular design. Thus, UIMA can not only be used for small-scale local analytics, but can also be scaled up to large data and complex analytics using UIMA-AS, UIMA-DUCC, or similar workflow engines and deployment systems built on top of the UIMA workflow specification. To use UIMA for a particular kind of analysis, a type system must first be defined. UIMA analytics components relying on the same type system tend to be interoperable; however, there is no standard set of commonly used UIMA types.
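To give a concrete impression of how such an analysis is assembled in practice, the following minimal sketch builds and runs a small UIMA pipeline programmatically. It assumes the uimaFIT helper API (described below) and two DKPro Core wrappers around Apache OpenNLP; the component and type names are taken from that collection (version 1.7) rather than from UIMA itself, so they may differ in other component collections, and the corresponding model artifacts must be available on the classpath.

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;
import static org.apache.uima.fit.util.JCasUtil.select;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class MinimalUimaPipeline {
    public static void main(String[] args) throws Exception {
        // Create a CAS and set the text to be analysed.
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("Interoperability frameworks make text mining components reusable.");

        // Assemble and run a two-step pipeline: segmentation followed by POS tagging.
        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                createEngineDescription(OpenNlpPosTagger.class));

        // Read the results back from the CAS through the type system.
        for (Token token : select(jcas, Token.class)) {
            System.out.println(token.getCoveredText() + "\t" + token.getPos().getPosValue());
        }
    }
}

The same pipeline could in principle be expressed against any other UIMA component collection, but, as noted above, the annotation types produced would then carry different names and features.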

There are several additional tools that facilitate usage of UIMA, such as:
• uimaFIT97 is an alternative API to the Apache UIMA framework. It facilitates programmatic use of the framework and thus the embedding of UIMA-based workflows in NLP-enabled applications, as well as automated testing of UIMA-based workflows. The target user group is developers working with the UIMA framework. The uimaFIT Maven plugin enables the creation of UIMA component metadata by extracting and re-using relevant information from the Maven metadata and Java source code (e.g., names, versions, descriptions, copyright, input/output capabilities, etc.).
• Distributed UIMA Cluster Computing (DUCC)98 is a system that automates the scale-out of UIMA pipelines over computing clusters, while providing cluster administrators with scheduling and monitoring functionality for the deployed services.

ARGO
Argo is an online workflow editor built on top of the UIMA framework. It is built with the goal of analyzing and annotating text in a user-friendly environment. Argo provides its users with a library of components, and their own file space which can be used to store data for processing as well as results. Users can also save and share workflows within the Argo framework. Because Argo is built using UIMA components, its modules can be easily integrated with any other system that also uses the UIMA framework.

BLUIMA
Bluima is a collection of Java-based UIMA components for the extraction of neuroscientific content. It builds and uses models created specifically for biomedical NLP (BioNLP) and contains components such as tokenizers and lemmatizers specialized for this domain. The project also offers tools specific to neuroscience, e.g., a named entity recognizer for neuron or brain region mentions, and provides readers for several corpora from the neuroscientific domain. A special feature of Bluima is a simple mechanism to facilitate the building of pipelines by novice users. The CAS store, a MongoDB-based storage facility for annotated documents, is also a specific outcome of Bluima. It is used to manage annotated documents for repetitive use and incremental updates.

CLEARTK
ClearTK is a collection of components for the Apache UIMA framework developed at the University of Colorado at Boulder, USA. It is mostly known for its machine learning modules. ClearTK offers an API for feature extraction and integrates a range of different machine learning libraries which can be used largely interchangeably. This is rounded off by an evaluation framework supporting train/test splits, cross-validation, etc. ClearTK integrates a small set of third-party NLP tools such as ClearNLP, Stanford CoreNLP and Apache OpenNLP. Finally, it offers an original component for the annotation and extraction of events, times and temporal relations. Interoperability within ClearTK is based on the ClearTK type system.

DKPRO CORE
DKPro Core primarily integrates a wide range of third-party NLP tools with the UIMA framework and makes them interoperable based on the DKPro Core type system. Although the main target of DKPro Core is programmers, special attention is paid not only to integrating tools but also to facilitating their use. To this end, DKPro Core relies heavily on the simplified UIMA API provided by uimaFIT and uses the Maven repository infrastructure to distribute software and resources. As a result, it is not only possible to easily assemble processing pipelines and run them as part of experiments or embed them in applications, but also to implement pipelines as short standalone scripts, e.g., based on the Groovy or Jython languages. DKPro Core also provides readers and writers for many different text data formats. DKPro Core builds extensively on the repository infrastructure provided by the Maven ecosystem. Maven and UIMA metadata are combined for the discovery and deployment of DKPro Core components.

GATE EMBEDDED
GATE (General Architecture for Text Engineering) is a Java framework, and a set of associated tools, used for many natural language processing tasks. The framework (GATE Embedded) consists of an object library exposed through a small core API, optimized for inclusion in diverse applications, from simple scripted applications to desktop applications, web services and larger-scale cloud-based applications. The GATE object model considers applications to be a sequence (not necessarily linear) of processing resources, which are applied to language resources — data such as documents, corpora and ontologies. With one or two exceptions, most of the key components in the GATE object model have equivalents in the UIMA object model. The biggest difference between the two models is the way in which type systems are defined. GATE employs "typing by convention", in which a type system for sharing data between components is informally agreed and not always enforced, as opposed to UIMA's "typing by contract", in which the type system is formally defined and enforced. GATE processing resources exist for all NLP tasks, and wrappers are provided for many popular non-GATE tools. GATE-specific tools include a finite state transduction language for rapid prototyping and efficient implementation of shallow analysis methods (JAPE), tools for the extraction of training instances for machine learning, tools for evaluation and benchmarking, and OWL and RDF integration. Much of the functionality of GATE Embedded is available via a desktop development tool, GATE Developer. Tools are also available for deployment, for example, GATE Wasp for web service deployment, and GATE Cloud for cloud-based deployment.

HEART OF GOLD
Heart of Gold is primarily a middleware architecture for natural language processing components developed at the DFKI in Saarbrücken, Germany. The project was originally developed for the integration of various shallow NLP analysis components with the HPSG parser PET. It relies on Robust Minimal Recursion Semantics (RMRS) and an XML-based standoff annotation for the data model used to exchange information between different analytics

components. The open source Heart of Gold software is largely implemented in Java, although several of the integrated components are implemented in other languages such as C/C++. Components integrated with Heart of Gold include original components from DFKI such as the JTok tokenizer or the PET parser, but also third-party components such as LingPipe or TreeTagger.

JCORE
JCoRe is a collection of components for the Apache UIMA framework provided by the JULIE Lab at the department of computational linguistics of Friedrich-Schiller-Universität Jena, Germany. It was one of the first projects to provide a large and detailed UIMA type system. JCoRe provides a relatively low number of analysis components, often original developments from the JULIE Lab, but also wrappers for third-party tools like Apache OpenNLP. The project has been dormant for several years, but has recently been reactivated and migrated to GitHub.

NLTK
The Natural Language Toolkit (NLTK) is a Python-based framework for natural language processing. It integrates a number of NLP tools implemented in Python as part of NLTK and is able to interface with many other external tools. NLTK was developed from a teaching background and offers excellent documentation as well as a textbook. The framework does not offer a principled component-based architecture. Instead, tools are wrapped as Python functions and the Python language itself is used to compose them into NLP programs.

LINGPIPE
LingPipe is a commercial NLP software package offered by alias-i. It is Java-based, open source, and can be freely used by researchers under the AGPL license. For commercial users, alternative licenses are available. The package includes components for typical NLP tasks like segmentation or part-of-speech tagging, but also for higher-level semantic tasks such as sentiment detection. LingPipe does not offer a workflow model or a component model; every component has its own specific API. Thus, LingPipe is a typical candidate for being wrapped for use with an interoperability framework such as GATE or UIMA.

STANFORD CORENLP
The CoreNLP package is an aggregation of most of the NLP tools developed by the Stanford Natural Language Processing Group. It is a Java-based, open source implementation that can be used by researchers under the GPL license; commercial licensing options are also available. The package offers a simple and not separately advertised interoperability framework complete with a data model, type system, and component model. The components cover many NLP tasks ranging from tokenization to sentiment detection and relation extraction. The package is probably best known for the Stanford parser, which is actually a collection of many different parsers based on different algorithms for constituency parsing and dependency parsing. CoreNLP has been repeatedly wrapped for UIMA and GATE and is part of most component collections for these frameworks.
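As an illustration of the "simple and not separately advertised" programming model mentioned above, the following minimal sketch runs a CoreNLP pipeline and reads back tokens, part-of-speech tags and lemmas. The annotator names and classes correspond to the 3.x releases of the package, and the required models must be on the classpath.

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class MinimalCoreNlpExample {
    public static void main(String[] args) {
        // Select the annotators to run; each adds a further layer of annotation.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate a document and traverse the resulting sentence and token annotations.
        Annotation document = new Annotation("Stanford CoreNLP bundles many analysis tools.");
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t"
                        + token.get(CoreAnnotations.PartOfSpeechAnnotation.class) + "\t"
                        + token.get(CoreAnnotations.LemmaAnnotation.class));
            }
        }
    }
}

When CoreNLP is wrapped for UIMA or GATE, calls such as these are hidden behind the respective framework's component model and type system.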


TESLA
Tesla, the Text Engineering Software Laboratory, is a component framework designed for experimental computational linguistics. It shares several features with other frameworks, but focuses on use in science, research, and prototyping. Tesla uses an object-oriented annotation concept, known as the Tesla Role System (TRS), which allows developers to define linguistic roles by extending two Java interfaces that determine the methods an annotation offers and the way such annotations can be retrieved. A Tesla component consumes and produces one or more roles, allowing workflows to be defined through compatible Java interfaces only. The major benefit of the TRS is that developers can use the full coverage of the Java language to specify data structures (including parametrized methods, inheritance, etc.), such that they can focus on the development of problem-specific algorithms without the need to consider framework-related restrictions. Workflows generated with Tesla are virtual experiments, which can easily be published, adapted, or modified to tackle related problems. This simplifies scientific exchange not only within work groups, but also through specialized online platforms, such as MyExperiment. Tesla makes use of a client-server architecture: language resources (such as corpora or dictionaries) are stored on the server side, where experiments are executed as well. The client consists of several plugins, which enhance the framework with several Tesla-specific views and functions. These include a graphical experiment editor, views to manage corpora, roles, components and experiments, and evaluation editors, which provide different visualization mechanisms to analyze the results of an experiment (highlighted or bracketed text, tables or a word cloud), or to export the results.

U-COMPARE
U-Compare is the forerunner of the Argo system. It is similarly built using UIMA components and provides a GUI for the integration of UIMA modules from its own large repository. U-Compare allows users to easily run modules with similar functions but different implementations, in order to swiftly compare their functionality. The user need not worry about the compatibility of their modules, as the UIMA specifications ensure that modules interoperate. Because U-Compare is built using UIMA components, the modules from the repository can be easily integrated with any other system that also uses the UIMA framework.

ILSP NLP TOOLS/SERVICES
The ILSP suite of NLP tools and services mainly consists of a set of Java-based text analysis modules for the Greek language. The tools are wrapped as UIMA analysis engines and are based on a custom type system. More specifically, this UIMA tool set includes components for tokenization, sentence splitting, paragraph detection, POS tagging, lemmatization, dependency parsing, chunking, named entity recognition, etc. The UIMA AEs are offered either directly as UIMA Asynchronous Scaleout (UIMA-AS) services or indirectly via SOAP or REST services (via the QT21 repository). The ILSP NLP suite also includes GATE components based on JAPE rules for

named entity recognition and sentiment analysis (Greek and English). A UIMA-GATE bridge is also available for "linking" the Greek UIMA tools to the GATE components.

ILSP WORKFLOW MANAGER
The ILSP Workflow Manager (ILSP-WM) has been used in the QT21 platform and is currently being extended/adapted for the CLARIN infrastructure. It exposes a REST API which can be used to call and execute a specific workflow (e.g., {tokenization -> splitting -> POS tagging}) of an NLP tool suite. Several suites (for various languages) have been integrated in the workflow manager, such as OpenNLP (English, Portuguese, German), the ILSP tools (Greek, English), the Heart of Gold tools (German), Panacea-DCU (English), etc. Each suite offers a set of workflows (e.g., {tokenization -> splitting -> POS tagging}, {POS tagging -> lemmatization}, etc.) which are modelled as a directed acyclic graph (DAG). A workflow is executed by following a path in such a graph, i.e., by appropriately calling the respective tools. Interoperability between different suites (e.g., Panacea-DCU <-> English OpenNLP) is not supported in general, but it can be added by implementing appropriate converters. The ILSP-WM can process resources in various formats (e.g., XCES, TMX, MOSES, TEXT) and saves the outputs in the format used by the requested tool/suite. Currently, the manager is being extended so that it can be used on a Hadoop cluster when large resources have to be processed. It is written in Java.
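The exact REST interface of the ILSP-WM is not documented here, so the following sketch only illustrates the general pattern of invoking such a workflow service over HTTP from Java. The endpoint URL, the workflow identifier and the parameter names are hypothetical placeholders, not the actual ILSP-WM API.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class WorkflowRestCall {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and parameters: the real ILSP-WM API may differ.
        URL url = new URL("http://example.org/ilsp-wm/execute?workflow=tok-split-pos&lang=el");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");

        // Send the text to be processed as the request body.
        try (OutputStream out = conn.getOutputStream()) {
            out.write("Some input text to annotate.".getBytes(StandardCharsets.UTF_8));
        }

        // Read back the annotated result in whatever format the selected suite produces.
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}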

Summary
The software packages mentioned in this section roughly cover five categories, and most of the packages belong to more than one of them:
1. interoperability frameworks: provide a data exchange model, a component model for analytics components, a workflow model for creating pipelines, and a workflow engine for executing pipelines
2. component collections: collections of components based on an interoperability framework, including analytics components but also input/output converters
3. type systems: a data interchange vocabulary that enables interoperability between components, typically within a single component collection
4. analytics tools: standalone natural language processing tools that are typically not interoperable with one another and are thus wrapped for use within a specific interoperability framework
5. workbenches: user-oriented tools with a graphical user interface or web interface through which the user can build and execute analytics pipelines
In some cases, identifying which part of the software relates to which of the above categories is difficult. For example, Stanford CoreNLP does offer a type system and a workflow engine, but the separation is not reflected in the system architecture and advertised as clearly as in UIMA or GATE. In terms of interoperability frameworks, the Apache UIMA framework and the GATE framework appear to be the strongest and most widely used ones in our community.

Several of the presented software packages are UIMA-based and in principle interoperable at the level of workflows and data model. In particular, we can infer that the expressiveness and flexibility of the UIMA data model appears to fulfil the needs of the community. However, each of the UIMA-based software packages uses its own specific annotation type system. This means that things that are conceptually the same, e.g., tokens or sentences, have different names and often different properties and relations to each other. Consequently, the practical interoperability here is limited. It is also notable that most of the software is implemented in Java. The Java platform provides interoperability across most hardware and operating system platforms (e.g., Windows, Linux, OS X). It also facilitates interoperability between the different software packages. For example, Java-based component collections can more easily integrate other Java-based software than, e.g., software implemented in C/C++ or Python (although that is not impossible). The list given here is by no means exhaustive (e.g., FreeLing, ClearNLP, mate-tools, and many others are not discussed), but it is representative of the NLP-related workflow engines and analytics component collections available for research. Most of the listed software is entirely open source or at least available for researchers to use via web services. Most of the software presented here is in principle language-independent and is only specialized to a particular language or domain by models, e.g., machine learning classifiers trained for a specific language or domain, rules created for the extraction of specific information, domain-specific vocabularies and knowledge resources, etc. However, not all languages are in practice supported equally well. Models and resources for English are available in almost all software packages; further well-supported languages include German, French, Spanish, Chinese and Arabic, followed by a long tail of limited support for other languages. Giving a clear indication of the number of components for data conversion or analysis is thus difficult. For example, if one component can be parametrized with three different models for three different languages, should it be counted three times or only once? Some component collections offer separate wrappers for individual functions of a tool, e.g., for a tool that combines part-of-speech tagging and lemmatizing; other collections only offer a single wrapper offering both functionalities. Numbers found on the websites of the respective software packages use different counting strategies for components and are therefore hardly comparable.

3.1 Non-text mining and general-purpose workflow engines
As opposed to text-mining-specific workflow engines, many applications exist for the creation of general-purpose scientific workflows. These systems provide functionality for re-implementing experiments and for saving, exporting and sharing experimental code which can easily be re-run by other researchers who are interested in the given experimental results. These systems provide a means to build multi-step computational analyses akin to a recipe. They typically provide a workflow editor with a graphical user interface for specifying what data to operate on, what steps to take, and in what order to do them. A general purpose workflow engine can be

adapted for use in a TDM context by using TDM resources. Although general-purpose workflow engines create an internal form of interoperability at the level of the process model, where all components within a workflow will work together, workflow engines from different providers cannot typically be expected to interact. Also, interoperability at the level of the process model does not automatically entail interoperability at the level of the data model, annotation type system, data serialization format, etc. A comparison is given in the tables below.

Table 10. General comparison of workflow engines.

Name | Available modules | Accessibility / License | Status
ELKI99 | data mining algorithms100 including clustering, outlier detection, dataset statistics, benchmarking, etc. | open source101, GNU AGPLv3 | active (latest stable release: April 2014; preview release: August 2015)
Galaxy102 | genome research, data access and visualization components and some other components103 | open source104, AFL 3 | active (last stable release: March 2015)
Kepler105 | a large set of components stored in an unstructured manner at the analytical component repository106 | open source107, BSD license | active (last stable release: October 2015)
KNIME108 | univariate and multivariate statistics, data mining, time series, image processing, web analytics, text mining, network analysis, social media analysis | open source, GNU GPL3 | active (last stable release: July 2015)
Pegasus109 | N/A | open source110, Apache license 2 | active (last stable release: November 2015)
Pipeline Pilot111 | Chemistry, Biology and Materials Modeling & Simulation domains | closed source, proprietary license | active (last release: 2015)
Taverna112 | 2549 units in a variety of domains113 (e.g., sequence analysis, text mining, data retrieval, etc.) | open source, LGPL 2.1 (Apache license 2.0 in next release) | active (last stable release: April 2014); becoming part of the Apache Incubator
Triana114 | 600 units (audio, image, signal and text processing, and a number of components for physics studies) | open source115, Apache license 2 | outdated (last stable release: February 2011)

Table 11. Technical aspects of workflow engines.

Name | User interface | Initial purpose | Example usage domains | Component creation / calling | Deployment model | Programming language
ELKI | GUI, command line | research & teaching in data mining | cluster benchmarking | programming new Java components | local machine, compute cluster | Java
Galaxy | GUI, command line | genomics research | bioinformatics | command line tools can be integrated into Galaxy | local machine, server, cloud | Python
Kepler | GUI, command line | scientific workflow management system | bioinformatics, data monitoring | Java components, R scripts, Perl, Python, compiled C code, WSDL services | local machine, compute cluster | Java
KNIME | GUI | data mining (pharmaceutical research) | business intelligence, financial data analysis | Java, Perl, Python code fragments | local machine | Java (Eclipse plugin)
Pegasus | GUI, command line | scientific workflow management system | astronomy, bioinformatics, earthquake science | N/A | local machine, compute cluster, cloud | Java, Python, C
Pipeline Pilot | GUI | pharmaceutical and biotechnology | Chemicals, Energy, Consumer Packaged Goods, Aerospace | N/A | local machine | C++
Taverna | GUI, command line | scientific workflow management system | bioinformatics, astronomy, chemo-informatics, health informatics | WSDL, SOAP and REST services, Beanshell scripts, local Java API, R scripts | local machine, server | Java
Triana | GUI, command line | gravitational wave studies | signal processing | programming new Java components | local machine | Java

ELKI
ELKI (Environment for deveLoping KDD-applications supported by Index-structures) is a knowledge discovery in databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures. The ELKI framework is written in Java and built around a modular architecture. Most of the currently included algorithms relate to clustering, outlier detection and database indexes. A key concept of ELKI is to allow the combination of arbitrary algorithms, data types, distance functions and indexes, and to evaluate these combinations. ELKI makes extensive use of Java interfaces, so that new algorithms can be added to it easily. ELKI uses a service loader architecture to allow publishing extensions as separate JAR files.
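The service loader approach mentioned above is the standard JDK mechanism for discovering implementations packaged in separate JAR files. The sketch below illustrates that mechanism only; the interface name is invented for the example and is not one of ELKI's actual APIs.

import java.util.ServiceLoader;

// Illustrative extension point: not an actual ELKI interface.
interface ClusteringAlgorithm {
    String name();
}

public class ExtensionDiscovery {
    public static void main(String[] args) {
        // Implementations shipped in separate JARs are discovered at runtime,
        // provided each JAR lists its implementation classes in a file named
        // META-INF/services/ClusteringAlgorithm.
        ServiceLoader<ClusteringAlgorithm> loader = ServiceLoader.load(ClusteringAlgorithm.class);
        for (ClusteringAlgorithm algorithm : loader) {
            System.out.println("Found extension: " + algorithm.name());
        }
    }
}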

GALAXY
Galaxy is a scientific workflow, data integration, data analysis and publishing platform that aims to make computational biology accessible to research scientists who do not have computer programming experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system. Galaxy is open-source software implemented in the Python programming language. It is developed by the Galaxy team at Penn State and Johns Hopkins Universities, and by the Galaxy Community116. Galaxy is extensible, as new command line tools can be integrated and shared within the Galaxy ToolShed117. A notable example of extending Galaxy is the LAPPS Grid platform. This instantiation of the Galaxy workflow engine provides LAPPS Grid users with a user-friendly interface to create workflows for NLP experiments by connecting heterogeneous components. Galaxy is available as:
• Open-source software that can be downloaded, installed and customized to address specific needs. Galaxy can be installed locally or on a computing cloud infrastructure.
• Public web servers hosted by other organizations. Several organizations with their own Galaxy installation have also opted to make those servers available to others.

KEPLER
Kepler is an open-source software system for designing, executing, reusing, evolving, archiving and sharing scientific workflows. Kepler's facilities provide process and data monitoring, provenance information, and high-speed data transfer. Kepler includes a graphical workflow editor for composing workflows, a workflow engine for executing workflows within the GUI and independently from the command line, and a distributed computing option that allows workflow tasks to be distributed among compute nodes in a computer cluster or computing grid.

Kepler is a Java-based application. Its components can be shared by exporting the workflow or component into a Kepler Archive (KAR) file, which is an extension of the JAR file format from Java. The Component Repository118 is a centralized system for sharing Kepler workflows that is accessible via both a web portal and a web service interface. Given the URL of a web service description (WSDL), the generic WebService actor of Kepler can be instantiated to any particular operation specified in the service description. At the "plumbing" level it is often necessary to apply data transformations between two consecutive web services. Such data transformations are supported through various actors (components) in Kepler, e.g., XSLT and XQuery actors to apply transformations to XML data, or Perl and Python actors for text-based transformations. A distinctive characteristic of Kepler is that it separates the structure of the workflow model from its computation model, such that different models for the computation of a workflow can be bound to a given workflow graph. Kepler inherits several common models of computation from the Ptolemy system (aimed at the modeling and design of heterogeneous and concurrent processes in Java119), including Synchronous DataFlow (SDF), Continuous Time (CT), Process Network (PN), and Dynamic DataFlow (DDF), among others.

KNIME
KNIME (the Konstanz Information Miner) is a data analytics platform that allows users to perform data mining in order to analyze trends in data. KNIME was initially used in pharmaceutical research, but is also used in other areas like CRM (customer relationship management), customer data analysis, business intelligence and financial data analysis. KNIME allows users to visually create data flows (or pipelines), selectively execute some or all analysis steps, and later inspect the results and models. KNIME is written in Java, is based on Eclipse and makes use of Eclipse's extension mechanism to add plugins providing additional functionality. KNIME's core architecture allows the processing of large data volumes that are only limited by the available hard disk space.

PEGASUS
The Pegasus system allows workflow-based applications to execute in a number of different environments including desktops, compute clusters, grids and clouds. Pegasus has been used in a number of scientific domains including astronomy, bioinformatics, earthquake science, gravitational wave physics, ocean science, limnology and others. One important feature of Pegasus is that it employs a fault-tolerance mechanism which allows reconfiguration of workflow execution at run time or, if everything fails, a rescue image of the workflow can be stored. It cleans up storage as the workflow is executed, so that data-intensive workflows have enough space to execute on storage-constrained resources. Pegasus keeps track of what has been done (provenance), including the locations of data used and produced, and which software was used with which parameters.

PIPELINE PILOT
Built on the BIOVIA Foundation, Pipeline Pilot, a commercial workflow engine, enables scientists to rapidly create, test and publish scientific experiments. Using Pipeline Pilot, researchers with little or no software development experience can create scientific protocols that can be executed through a variety of interfaces, including BIOVIA Web Port, other BIOVIA solutions such as BIOVIA Electronic Lab Notebook, Isentris and Chemical Registration, and third-party applications such as SharePoint or custom-developed applications. Pipeline Pilot was used at first in the pharmaceutical and biotechnology industries and by academics and government agencies. Other industries then started to adopt it, but always in science-driven sectors such as Chemicals, Energy, Consumer Packaged Goods, Aerospace, Automotive and Electronics.

TAVERNA
Taverna is an open-source workflow management system aimed at scientists who want to improve the reusability and openness of their code. Taverna can access a large number of services in the fields of bioinformatics, astronomy, chemo-informatics, health informatics and others. Although many examples of using Taverna lie in the bioinformatics domain, it is actually domain independent (see the domain list120). It provides both a GUI for the less computer-literate user and a command line interface for higher-end users who require the best performance from their machine. Taverna is typically used within a web browser, although it is possible to install it as a remote server. Taverna can invoke generic WSDL-style Web services. Taverna can also access other types of Web services, such as BioMoby, BioMart and SoapLab services121. In addition, Taverna can invoke local Java services (Beanshell scripts), a local Java API (API Consumer), R scripts on an R server (R shell scripts), import data from a CSV or Excel spreadsheet, and interact with users in a browser. Taverna can also interact with a running workflow from a web browser, and has the ability to incorporate workflow fragments (reusable components) as services in a workflow.

TRIANA
Triana is an open source problem solving environment developed at Cardiff University. It was originally designed to help gravitational-wave scientists spontaneously create and run data analysis algorithms on the data at its source. There are several hundred components available for use within the core distribution of Triana. These include components for audio, image, signal and text processing, and a number of internal scripts for dynamic reconfiguration of distributed workflows. Since Triana is written in pure Java, it is platform independent. It provides a graphical workflow editor to help users create a workflow by connecting different components. Triana has a built-in wizard which makes it easy to create user interfaces without any need to write code. Triana supports single-run, continuous-run and stream processing execution models. It

supports grouping (aggregation of components) on-the-fly, without the need for compilation. This toolkit is highly configurable and customizable and can be used via the GUI or on the command line.

Summary
A summary of important technical aspects of these workflow engines is listed in Table 11. It might be a hard task to select the system that perfectly fits one's purpose; however, taking into account the unique characteristics of each system will help in the decision process. The initial design purpose of a system is one of the features that has a major effect on its overall usability. Among the discussed systems, Kepler, Pegasus and Taverna are those that aim to be general-purpose scientific workflow engines. Thus, it can be assumed that they are the least coupled with any particular domain, and the easiest to adapt to new domains. In contrast, ELKI and KNIME were originally created to perform data mining tasks; hence, their power resides in implementing and executing statistical algorithms. Other workflow engines were created for specific domain experiments and were later also applied to other domains. The next important classifying feature is the creation of new components for a workflow engine. All of the mentioned tools except Pipeline Pilot are implemented using Java or Python. This makes them platform independent, and also facilitates the implementation of new components. Kepler and Taverna additionally offer direct support for the integration of WSDL services as workflow components. Taverna goes one step further and also supports SOAP and REST services. In addition to ease of component development, the provision of a publicly accessible repository of workflow components is also important. In this respect, Kepler, Galaxy and Taverna are the only projects that offer a public central repository of components. In addition, the Kepler system enables the creation of a single executable KAR (Kepler Archive) file for a workflow, which conforms to the JAR (Java Archive) file format. ELKI does likewise by creating native JAR files. The deployment model and execution model of a workflow play a major role in the choice of a workflow engine. In this sense, ELKI, Kepler, Galaxy and Pegasus support executing workflows on a computing grid or cloud. Additionally, the fault tolerance of the Pegasus workflow engine is a feature that should not be neglected. This feature brings major benefits in cases where a processing-intensive workflow fails in the middle of execution. Another factor to be considered is the community effect: is there a strong developer community who maintains the product, and in which communities is the product being used? In this respect, we note that, recently, the LAPPS project and the Alveo project have adopted Galaxy. LAPPS is a project based in the USA that aims at creating a grid-like distributed architecture for NLP. The Alveo project has similar goals and is based in Australia. As discussed before, choosing a suitable workflow engine is not a trivial task. However, considering the general properties of the different systems enables an informed decision. It should also be noted that reviewing the domains in which these workflow engines have already been applied, and their example usage scenarios, will be greatly beneficial.

3.2 Annotation editors
In text mining, an annotation editor is a tool for visualizing, editing, deleting or adding annotations in text. Annotation editors are central to many TDM tasks. These tasks include: collection of example annotations for requirements gathering and application development; creation of standard corpora for system training and evaluation; visualization and error analysis of application output; correction of automatically created annotations as part of some workflow; and validation of annotations prior to release for use. It is clear, from the editors surveyed for this report, that tools described as annotation editors may cover everything from a tool for viewing a pre-specified set of documents and marking these with a limited set of annotations, to tools that include some element of user, project and even workflow management. Given that workflow and user management are the subject of separate discussions in this report, we will restrict our consideration of editor interoperability to the functions of viewing and editing annotations, and the way in which editing may interact with other parts of the text mining workflow.
There are two orthogonal aspects to this annotation editor interoperability. These aspects are data exchange, such as the format and formal typing of input and output, and interaction between the editor and other text mining processes, i.e., the workflow. As far as data exchange is concerned, current editors support a variety of formats and type systems, in some cases several. It is clear, however, that a common interchange format and the alignment of annotation types to a rich enough reference vocabulary would enable any compliant editor to interact with a common text mining platform, either through direct support by the editor, or by format and type conversion where this is not possible. Considering the interaction between the editor and a process or workflow, a scenario can be imagined in which any editor could take part in some workflow, with manual intervention. For example, a workflow reaches a point at which data is output from that workflow, manually loaded into an editor, manipulated in the editor, output from the editor, and finally the workflow is restarted. A more seamless integration may be possible were the editor to support some set of required operations through an API, including event dispatch and listening. An example of an editor that supports this is brat.
The type of annotation supported by the annotation editors surveyed varies. All editors examined for this report allow for the representation of annotations anchored in the text at one span, i.e., between two offsets in the text. Some allow for more sophisticated representations, such as annotations anchored at discontinuous spans, and annotations that are instead anchored to other annotations. The latter allows for the representation of relations between two or more other annotations, such as simple subject-verb-object triples, parse trees and co-reference chains. Typically, users are able to edit both the anchoring of the annotation to the text and the attributes of that annotation.
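As a minimal illustration of the annotation structures discussed above (single-span annotations anchored at two character offsets, and relations anchored to other annotations rather than to the text), the following sketch shows one possible in-memory representation; it is illustrative only, with invented class and field names, and does not correspond to the data model of any particular editor.

// A span annotation is anchored directly in the text by a pair of character offsets.
class SpanAnnotation {
    final String type;   // e.g. "Protein", "Person"
    final int begin;     // character offset where the span starts
    final int end;       // character offset where the span ends (exclusive)

    SpanAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    String coveredText(String documentText) {
        return documentText.substring(begin, end);
    }
}

// A relation is anchored to other annotations, e.g. a subject-verb-object
// triple, a parse tree edge or a co-reference link.
class RelationAnnotation {
    final String type;
    final SpanAnnotation source;
    final SpanAnnotation target;

    RelationAnnotation(String type, SpanAnnotation source, SpanAnnotation target) {
        this.type = type;
        this.source = source;
        this.target = target;
    }
}

public class AnnotationModelExample {
    public static void main(String[] args) {
        String text = "IL-2 activates T cells.";
        SpanAnnotation protein = new SpanAnnotation("Protein", 0, 4);
        SpanAnnotation cell = new SpanAnnotation("CellType", 15, 22);
        RelationAnnotation relation = new RelationAnnotation("activates", protein, cell);
        System.out.println(protein.coveredText(text) + " --" + relation.type + "--> " + cell.coveredText(text));
    }
}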

In addition to text-level annotations, many editors have some support for document-level metadata, and/or for annotations that are not anchored in the text. A few annotation editors allow for direct editing of an ontology associated with the text in some way, through the insertion of ontology instances that represent the textual annotation. Annotation editors typically constrain the end user to create and edit annotations according to some fixed schema or type system. Some editors also allow for ad-hoc, schema-less annotation, and others support ontology-backed schemas. Editors typically give the user access to a full document, and allow them to move around that whole document, supported by search tools. This "whole document" model contrasts with a model in which the end user sees only the annotation on which they are working, and its immediate context. Tools may support some way of partitioning annotations into sets, groups or layers within one document, and allow those partitions to be viewed independently and alongside each other. This partitioning serves several purposes, including the separation of annotations created by different annotators, and the separation of different types of annotation, for example syntactic annotations from semantic annotations. Most editors have some way of managing multiple documents, allowing users to move between documents, and sometimes to manage collections of documents. Alongside the handling of multiple documents and multiple partitions of annotations, editors may offer tools to compare either documents or partitions. Comparisons may vary from visualizations of document and annotation differences through to the provision of various metrics. Such comparisons may give the ability to reconcile differences between multiple sets of annotations, for example, looking at and selecting between the differences between two annotators, to curate a set of annotations that gives a consensus between annotators. The extent to which the above functionality is available to end users may be restricted by user group, and to facilitate this, some editors provide user accounts and user management, allowing for functional groups of users such as annotators, curators, managers and so on. Project management may be available on top of this, grouping annotators, corpora and annotation tasks into multiple projects. Many editors will also integrate with an automatic text mining workflow, enabling a user to correct the output of this workflow. In some cases, user corrections may be fed back into the automatic process, to improve classification and recognition algorithms, in an iterative, active learning process. Many editors exist; the most popular currently available systems, and a selection of some that are no longer available, are compared in Table 12 and described below. Some of these systems are stand-alone tools (such as the brat editor), whereas others exist as part of wider TDM frameworks (Argo AE, GATE Developer, etc.).

Table 12. General comparison of annotation editors.

Name | Parent framework | Accessibility / Licence | Status
Argo (Manual Annotation Editor)122 | Argo | closed source, non-commercial, available via web | active
WebAnno123 | - | Apache Software License 2.0 | active
GATE Developer124 | GATE | LGPL | active
GATE Teamware125 | GATE | AGPL | supported
Brat126 | - | MIT licence | active
Alvis AE127 | Alvis | closed source | active
Egas128 | - | unknown | active
WordFreak129 | - | Mozilla Public Licence | not active since 2008
tagtog130 | - | commercial, available via web | active
PubTator131 | - | unknown | active
MMAX132 | - | open source | not active since 2010
Knowtator133 | Protégé up to 3.5 | Mozilla Public Licence | not active since 2013

ARGO (MANUAL ANNOTATION EDITOR)
The Argo manual annotation editor allows an annotator to directly interact with the flow of a TDM architecture. By adding in the relevant component, a researcher can pause the flow of execution in a pipeline and observe the current annotations. They may then manipulate, correct, delete and create annotations at will. Once the researcher is happy with the annotations, the results can be passed back to the workflow for further processing and storage.

WEBANNO
WebAnno is a flexible, general-purpose web-based annotation tool for a wide range of linguistic annotations, including various layers of morphological, syntactic and semantic annotation. Additionally, custom annotation layers (spans, relations, chains) can be defined, allowing WebAnno to be used also for non-linguistic annotation tasks. Annotations can bear simple features (e.g., string, boolean, numeric) or point to other annotations (link features). Controlled vocabularies (tagsets) can be defined on string features, and constraint rules can further restrict possible feature values based on their context.

WebAnno is a multi-user tool supporting different roles such as annotator, curator and project manager. The progress and quality of annotation projects can be monitored and measured in terms of inter-annotator agreement. Multiple annotation projects can be conducted in parallel. Different modes of annotation are supported, including a correction mode to review externally pre-annotated data, and an automation mode in which WebAnno learns and offers annotation suggestions. WebAnno is based on UIMA, in particular on the UIMA CAS, for its data model and data persistence. Additionally, it has special support for the DKPro Core type system and makes use of DKPro Core readers and writers for data import/export. WebAnno uses the visualization module from brat with several custom extensions, e.g., paging (to allow editing of larger documents) or support for RTL languages.

GATE DEVELOPER
GATE Developer is the desktop annotation and development environment that ships with the GATE distribution. Documents, corpora and ontologies may be imported from many common formats, viewed and edited, and exported in a variety of formats. Documents may be edited and annotated without any constraints, or constrained against either a fixed schema or a suitable ontology. A suite of tools is provided for comparing corpora and documents with many standard metrics, and for visualizing the differences. GATE Developer is a single-user environment. While it is possible to run any GATE workflow from within Developer, there is no specific integration of automatic workflows and manual workflows.

GATE TEAMWARE
GATE Teamware is a web-based management platform for collaborative annotation and curation using the GATE framework. Teamware allows for the management of multiple distributed annotation projects, each of which may contain multiple corpora, user roles, users, annotation schemas, and workflows. Automatic and manual workflows may be mixed. Common manual steps include data preparation, manual annotation, curation and quality assurance. A suite of statistics and quality assurance tools is provided. GATE Teamware is used by several organizations but, while supported, it is not being actively developed.

BRAT
Brat is a collaborative annotation tool that allows multiple annotators to work simultaneously on the same document, viewing and editing each other's annotations. Its support for hierarchical document collections and flexible annotation scheme definitions can be used to conduct different annotation projects on a single installation. The types of annotation supported are primarily spans (including discontinuous spans) and relations; additional supported types are essentially variations thereof. Annotations can bear features of different types (e.g., Boolean, string), but for string features only controlled vocabularies are supported; free-text features are presently not supported. The visualization module of brat is very well suited for linguistic annotation at the level of tokens or

phrases and offers an intuitive user interface. At higher levels (sentences, paragraphs), the annotation layout becomes uncomfortable to work with. Brat is also mainly suited to smaller documents, as it loads the complete document and its annotations into the browser, which makes rendering sluggish for larger documents.

ALVIS AE AlvisAE (Alvis Annotation Editor) is an on-line annotation editor for the collective editing and visualization of annotations of entities, relations and groups. It includes a workflow for annotation campaign management. The annotations of the text entities are defined in an ontology that can be revised in parallel. AlvisAE also includes a tool for detection and resolution of annotation conflicts. Part of this work has been funded by the European project Alvis134 and the French project Quaero135.

EGAS Egas is a web-based tool for annotating and curating biomedical text. It allows both concept and relation annotation, and supports the calling of external automatic annotation services. It supports export to standard formats.

WORDFREAK WordFreak is designed to support both human and automatic annotation of text. It also supports active learning, with manual corrections being used to improve machine learning annotation models. The WordFreak user forums have not been very active since 2008, which was the last time a post was answered by a developer. There was one post in 2009, two in 2010, and one in 2014. The last release was in 2004. There does not appear to have been any code development since 2008.

TAGTOG Tagtog is an annotation editor for biomedical text that was developed for text-mining-assisted annotation of gene mentions in PLOS full-text articles. It can be used to annotate entities and relations, and to classify documents. Data linking to bioinformatics databases such as UniProt is supported. It also supports collaborative annotation and active learning.

PUBTATOR PubTator is a web-based tool for annotating biomedical entities and relations, specifically in PubMed abstracts. PubTator was created and is maintained by the US National Library of Medicine, and the underlying document collection is synchronised with PubMed. It allows the user to search for and create a corpus of PubMed abstracts. Annotations may be accessed via FTP or via a RESTful API.

MMAX

MMAX is a Java-based annotation tool that allows for the annotation of spans, and relations between spans. Annotation schemas may be defined. The MMAX mailing lists are not active, and the main web page has not been updated since 2010. The code has not been updated since 2010.

KNOWTATOR Knowtator is an annotation tool that integrates into the Protégé-frames ontology editor, and makes use of ontologies loaded in Protégé as schemas. Knowtator allows for the annotation of mentions in text as instances of ontology classes, and for relations to be created between entities and mentions. Knowtator can support multiple users, and the reconciliation of annotations between users, to create consensus annotations. Knowtator was developed specifically for the frames representation used in Protégé. When Protégé migrated to an OWL-only representation from version 4.0, Knowtator was not migrated. It therefore only supports Protégé, in frames mode, up to version 3.5, which was released in April 2013. The Knowtator forums were active up to the end of 2012, with only one message in 2013, and none since. The last code commit appears to be in November 2013.

SEMANTATOR Semantator is an annotation tool developed by the Mayo Clinic. It is similar to Knowtator in that it integrates with Protégé, the difference being that it integrates with OWL versions of Protégé from 4.1 onwards. In this sense, it can be seen as an OWL replacement for Knowtator. Functionality appears to be similar to Knowtator, without support for multiple annotators. The Semantator web site does not give licensing information. A link is provided for download, but this was broken as of November 2015. The last significant update to the project wiki was in 2012.

Summary Annotation editors are a fundamental tool in the processing of TDM annotations. They will form an important part of any platform that OpenMinTeD delivers. Although there are fewer fundamental interoperability issues here than in other sections, the choice of annotation editor will still deeply influence the interoperability decisions of all other aspects of the TDM process with OpenMinTeD.
- TDM workflow engines typically come with some kind of annotation editor.
- Standalone annotation editors also exist.
- Workflow-engine-specific annotation editors are designed to be interoperable with their parent framework, but may not be interoperable with other tools.
- The appropriate choice of annotation editor will vary widely between use cases.
- If a TDM workflow engine is being used, it may be sensible to use the annotation editor that comes packaged with that engine.
- A standalone annotation editor may be suitable if multiple frameworks are being used or if specific features are required (e.g. brat allows collaboration between annotators).

4. Content
In this chapter we focus our attention on different ways of encoding static content which are used in the text processing community. We start at the most basic level: character encoding standards in section 4.1, then proceed to formats for enriching text by annotation (4.2). Finally, we show the formats used to encode knowledge resources, such as ontologies, dictionaries and lexicons (4.3).

4.1 Character standards
Here we outline the most widely used standards for encoding individual characters. This may seem a trivial issue, but it still requires attention. It is easy to undervalue this problem when dealing with English or another language covered by the ASCII table, but the situation becomes more complicated when we need to handle accented or non-Latin scripts, and becomes a serious problem when dealing with the thousands of symbols that make up various Asian languages. As character encoding is the foundation of all following annotation standards, a lack of interoperability at this level forbids any further work. Table 13 gives a brief comparison of the area.
Table 13. Summary table of character standards (non-exhaustive list).
| Standard | Encodings | Covered scripts | Note on usage |
| Unicode | UTF-8, UTF-16, UTF-32 | 129 (almost all currently used) | Widely adopted worldwide |
| ISO 8859 | ISO-8859-1 (latin1), ISO-8859-2, ISO-8859-7, etc. | European languages, Arabic, Cyrillic, Turkish, Greek, Hebrew, Nordic, Baltic, Celtic, Thai | Used mainly in Europe |
| Windows code pages | Windows 1251, Windows 1252, Windows 1253, Windows 1256, etc. | European languages, Arabic, Cyrillic, Turkish, Greek, Hebrew, Baltic, Vietnamese | Superseded by Unicode but still supported within Windows |
| GB18030 | | Chinese | Used in mainland China |
| Big5 | | Chinese | Used in Taiwan, Hong-Kong and Macau |
| JIS | JIS X 0201, JIS X 0208, JIS X 0212, JIS X 0213 | Japanese | Used in Japan, mainly for e-mail communications |
| Shift-JIS | | Japanese | Used in Japan for web pages |
| Extended Unix Code (EUC) | EUC-JP, EUC-KR, EUC-CN, EUC-TW | Japanese, Korean, simplified Chinese | EUC-JP used for Japanese on Unix-like operating systems, EUC-KR widely used in Korea |
| KPS 9566 | | Korean | Used in North Korea |
| KOI8 | KOI8-R, KOI8-U | Cyrillic | KOI8-R used in Russia, KOI8-U used in Ukraine |

4.1.1 Major standards

UNICODE136 Unicode is a complex standard for character processing maintained by the Unicode Consortium. Its first version was published in 1991; the current edition (8.0) covers 129 modern and historic scripts. Unicode can be used with different encodings, which do not necessarily cover the whole character set and may result in different file sizes. The scope of Unicode includes a character mapping using 4 bytes per character (ISO-10646). However, the standard also specifies how to handle phenomena such as writing direction (left-to-right, right-to-left, bidirectional), combining characters, normalization, ligatures, transliteration and string collation (comparison). The character encoding aspect is orthogonal to the mapping; Unicode defines several alternate encodings: UTF-8, UTF-16, and UTF-32. UTF stands for Universal Transformation Format, and the number refers to the size in bits of the encoding unit. UTF-8 and UTF-16 are variable-width encodings: the number of bytes necessary to encode each character varies. Note that UTF-8 is a superset of ASCII and is often considered the preferred encoding for the web because it requires less bandwidth for most of the characters circulating on the net. UTF-32 maps directly to ISO-10646.
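To make the difference between these encoding forms concrete, the short Python sketch below (illustrative only) encodes the same string in UTF-8, UTF-16 and UTF-32 and compares the resulting byte lengths; the sample string is arbitrary.

```python
# Illustrative only: byte lengths differ because UTF-8 and UTF-16 are
# variable-width encodings, while UTF-32 uses four bytes per code point.
text = "Résumé — 日本語"

for encoding in ("utf-8", "utf-16", "utf-32"):
    encoded = text.encode(encoding)
    print(f"{encoding}: {len(encoded)} bytes")

# ASCII-only text is unchanged under UTF-8, illustrating that UTF-8 is a superset of ASCII.
assert "plain ascii".encode("utf-8") == b"plain ascii"
```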

ISO 8859 ISO-8859 is a collection of 15 character encodings supported by the ISO, aimed at encoding several scripts, especially variants of the Latin alphabet. All ISO-8859 encodings use 1 byte for each character and map to the same characters as ASCII up to code point 127. ISO-8859 encodings require little bandwidth but cover a rather limited character spectrum. Windows code pages are another set of character encodings, used in Microsoft Windows and covering mostly the same scripts as the ISO-8859 standards. They have been gradually superseded by Unicode.

4.1.2 Standards for specific languages Besides Unicode and the ISO standards, a number of encodings targeted to specific languages are also used.

GREEK Windows-1253 (defined by Microsoft) is an 8-bit character code which can be used for texts in monotonic Greek. It differs from the ISO standard ISO-8859-7 in a few code points, most notably the GREEK CAPITAL LETTER ALPHA WITH TONOS (Ά). Unicode is used in most modern systems and web applications focusing on Greek.


CHINESE GB18030 is a Chinese government standard for encoding Chinese characters. It can be considered as a Unicode Transformation Format, and it is a superset of ASCII and of GB2312 (an older encoding for Chinese characters). Big5 is another encoding for Chinese characters. It is used in Taiwan, Hong-Kong and Macau. It is a double-byte character set.

JAPANESE JIS encoding: Japanese Industrial Standards (JIS) define a set of character sets for the Japanese language. The most important are JIS X 0201, JIS X 0208, JIS X 0212 and JIS X 0213. Shift-JIS is a JIS-based encoding, which integrates the JIS X 0208 character set while being backward compatible with JIS X 0201. It is widely used for web pages. The Extended Unix Code (EUC) is a character encoding system for Japanese, Korean and simplified Chinese. EUC-JP is the variant for Japanese characters and is used in most Unix operating systems.

KOREAN EUC-KR (the Extended Unix Code for Korean) is a widely used character encoding in Korea. In North Korea, the KPS 9566 two-byte encoding is also used.

ARABIC Windows 1256 is the Microsoft code page for the Arabic script. Unicode is usually preferred in modern applications.

CYRILLIC Windows 1251 is an 8-bit character encoding designed for the Cyrillic alphabet (Russian, Bulgarian, Ukrainian, etc.). KOI8-R is an 8-bit character encoding designed for Russian. A variant of this encoding is KOI8-U, which supports Ukrainian characters.

4.1.3 Informing the character encoding Here we outline the different protocols used to inform a consumer which character encoding is used in a text stream. The text stream can originate from a file system, the network or a database. Currently, software that consumes a text stream has three ways to be aware of the character encoding:

1. The text stream format specification requires the use of a specific character encoding (e.g., JSON), or requires the character encoding name to be embedded in the stream (e.g., the XML declaration).
2. The protocol at the origin of the stream also delivers the character encoding as metadata (e.g., MIME types).
3. The software may guess the character encoding using heuristics that search for signature sequences in the stream; this method should be used as a last resort, since some character encodings are ambiguous as they share code points.
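The sketch below illustrates these three strategies in Python; it is a rough approximation for illustration only, and the use of the third-party chardet package for heuristic guessing is an assumption, not a recommendation of this report.

```python
import codecs

def detect_encoding(raw: bytes, content_type: str = "") -> str:
    """Illustrative sketch of the three strategies described above."""
    # 1. Format-level information: a Unicode byte-order mark or an embedded
    #    declaration such as <?xml version="1.0" encoding="ISO-8859-7"?>.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    head = raw[:200].decode("ascii", errors="ignore")
    if 'encoding="' in head:
        return head.split('encoding="', 1)[1].split('"', 1)[0]

    # 2. Protocol metadata, e.g. an HTTP/MIME header such as
    #    "Content-Type: text/plain; charset=utf-8".
    if "charset=" in content_type:
        return content_type.split("charset=", 1)[1].strip()

    # 3. Last resort: heuristic guessing, which is inherently ambiguous.
    try:
        import chardet  # third-party package, used here only as an example
        return chardet.detect(raw)["encoding"] or "utf-8"
    except ImportError:
        return "utf-8"  # fall back to the most widely used encoding
```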

4.1.4 Character encoding APIs
This is a survey of the libraries most used in our community for handling character encoding. These libraries are useful for reading or producing a stream in a specific encoding, for converting or transliterating from one encoding to another, and for guessing the encoding of a text stream. Well-designed libraries relieve TM developers from the time-consuming task of handling character encodings in detail.

JAVA STRINGS The Java String type from the JDK internally encodes characters using UTF-16. As long as no scripts requiring characters outside the Basic Multilingual Plane are used, this is functionally equivalent to manipulating Unicode strings. The JDK also includes libraries that implement some aspects of the Unicode standard (collation and normalization). It also includes a pluggable library for converting Strings from/to other encodings. In a modern architecture, Java Strings are decent, stable and mature, though incomplete.

ICU International Components for Unicode (ICU) is a library for software internationalization (C/C++ and Java). It provides solid support for various aspects of Unicode: normalization, formatting, collation, etc. ICU has additional functionality such as regular expressions for Unicode strings, time and date handling, tokenization and sentence segmentation.

PYTHON UNICODE STRINGS The Python programming language features a native Unicode string type. The language standard library features character encoding conversion.
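As a minimal illustration of the native Unicode support described above, the following Python snippet (illustrative only) converts text between encodings using the standard library.

```python
# Python 3 str objects are Unicode; encode()/decode() convert to and from bytes.
text = "Ελληνικά και Русский"

utf8_bytes = text.encode("utf-8")        # str -> bytes in a chosen encoding
assert utf8_bytes.decode("utf-8") == text

# Converting legacy single-byte data (here ISO-8859-1) to UTF-8:
latin1_bytes = "café".encode("iso-8859-1")
as_utf8 = latin1_bytes.decode("iso-8859-1").encode("utf-8")
```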

C WIDE CHARACTERS AND ICONV The C99 standard specifies wide character and wide string data types that allow large character sets to be represented. Even though it is not part of the standard specifications, most implementations map characters using ISO-10646. The set of functions that handle wide character strings is terse and does not comply with any Unicode specifications. iconv is a C library for converting streams of bytes from one character encoding to another. It understands a very wide range of encodings but does not offer any other functionality.

4.1.5 Summary
Character encoding may be an interoperability issue, since the text output of processing software might be broken due to wrong assumptions, especially about the repertoire of characters, the directionality, or the collation algorithm. Fortunately, Unicode has become a widely adopted universal standard, and there are many software libraries that handle the Unicode standard.
- For English, this is a largely solved problem.
- For other languages, it is important to use a character encoding which takes into account the wide breadth of characters used in written communication (see Table 13).
- A user must take an early decision about the scripts which they reasonably expect to be able to handle, in order to select an encoding which will suit their needs.
- Corpora, tools and other resources may place constraints on the types of character encodings that can be handled.

4.2 Annotation formats
Annotation is the process of adding supplemental information to a text in natural language. The annotations may be inserted in the text (in-line) or provided as a separate file (stand-off). They are produced as the output of an automatic or manual tool and may be:
- treated as an input for further automatic stages,
- visualized for interpretation by humans,
- stored as a corpus.
In each of these cases, we need the annotations to follow a precisely defined data format. This section is devoted to presenting such formats, both domain-dependent (4.2.1) and generic (4.2.2). Table 14 summarizes the different formats which are commonly used.


Table 14. Summary table of annotation formats.
| Name | Domain | File format | License | Libraries | Status and usage |
| BioC137 | biomedical | XML | Public domain | Yes | Actively developed, increasingly used in the BioNLP community |
| BioNLP shared task TSV138 | biomedical | TSV | None | None | Active, used in the context of the BioNLP shared tasks |
| Pubtator139 | biomedical | TSV | None | None | Active, used mainly with the Pubtator annotation tool |
| Twitter JSON format140 | social media | JSON | None | Yes | Used to represent Twitter messages |
| TEI141 | generic | XML | Creative Commons and BSD-2 | Yes | Active, used in the humanities |
| Open Annotation142 | generic | RDF | Creative Commons | None | Community Draft |
| CAS (UIMA)143 | generic | Several | Apache Licence | Yes | Actively developed, used in a variety of NLP pipelines and platforms |
| GATE annotation format144 | generic | Several | LGPL | Yes | Actively developed, widely used |
| BRAT format145 | generic | Tabular format | MIT license | None | Actively developed, used with the BRAT annotation tool for a variety of NLP tasks |
| LAF/GRAF146 | generic | XML | None | None | ISO standard |
| NIF147 | generic | RDF | Open license | None | Mature |
| BIO (IOB) | generic | TSV | None | None | Used for sequence labelling tasks |


4.2.1 Domain-dependent These annotation formats were created for usage within a specific domain. Such usages frequently involve annotation of entities in text and are most common in the area of life sciences.

BIOC BioC is a format to share and represent text annotations in the biomedical domain. It was developed to increase interoperability and reuse of biomedical data and tools. The BioC format is an XML format, specified by a DTD. The data model does not prescribe any semantic standards. The semantics associated with the data are specified through the use of “key” files, which developers can customize to suit their needs. Several BioC implementations (C++, Java, Python, Perl, etc.) have been developed, providing libraries to read/write BioC documents and annotations. BioC is used in the biomedical NLP community. It had a dedicated track (the Interoperability track) at the BioCreative IV and V challenges. A number of BioNLP tools have also been adapted to be compliant with the BioC format (e.g., MedPost, BioLemmatizer).

BIONLP SHARED TASK TSV This format is used to release the BioNLP Shared Task data sets. It is a human- and machine-readable stand-off format. The format focuses on annotation units: each annotation is represented on one line, with properties separated by tab characters. These properties depend on the kind of annotation (text-bound, relation, event). There are no written specifications for the format, and there is no official library for reading or producing this format. Its main advantage is that it is human-readable.
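As a rough illustration of the one-annotation-per-line, tab-separated layout described above, the Python sketch below parses a single text-bound annotation; the sample line and exact field layout are assumptions made for demonstration only.

```python
def parse_text_bound(line: str) -> dict:
    """Parse one text-bound annotation: ID, type with character offsets, covered text."""
    ann_id, type_and_span, covered_text = line.rstrip("\n").split("\t")
    ann_type, start, end = type_and_span.split(" ")
    return {"id": ann_id, "type": ann_type,
            "start": int(start), "end": int(end), "text": covered_text}

# Hypothetical example line in the style of the shared task data sets:
print(parse_text_bound("T1\tProtein 48 53\tBMP-6"))
```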

PUBTATOR FORMAT PubTator is a named-entity online service that has a proprietary format for exporting its predictions. The format is also used to release data for the BioCreative challenges. The format is similar to the BioNLP Shared Task format: it has fundamentally the same expressivity and focuses on human readability.

TWITTER JSON FORMAT This format is used to represent Twitter messages obtained via Twitter API. It provides a compact generic representation for stand-off annotation markup — annotations are represented as JSON objects giving the offsets in the text with the other properties of the object representing the annotation's features.

4.2.2 Generic These annotation formats are intended to be used for texts without any domain restrictions. They are also frequent in lower-level annotations, such as those concerning morphology and syntax.


TEI The Text Encoding Initiative (TEI) is a community within the digital humanities field that maintains a standard for representing text in digital form. The TEI guidelines (currently in version P5) define an XML-based format that uses around 500 components and well-defined concepts. Because of the size of the format, it also allows customizations, i.e., project-specific subsets of allowed tags for easier use (e.g., TEI Lite). The format is used very widely, especially for encoding large corpora, e.g., the British National Corpus, the Oxford Text Archive, the National Corpus of Polish and the Perseus Project.

OPEN ANNOTATION Open Annotation is a standard specification that embraces the point of view of the linked data and focuses on the shareability of annotations using the RDF notation. It defines a very general framework for representing annotations and provides a vocabulary to represent annotations on a wide variety of targets: documents, data entries or text spans. It does not however specify a vocabulary for common text processing annotations (e.g., tokens, sentences).

CAS (UIMA) The Common Analysis System (CAS) is the data model used by the UIMA framework to represent annotated data. It is centered around the concept of Feature Structures (sets of key/value pairs). Values can be of several primitive types (string, integer, boolean, etc.), be other feature structures, or array variations thereof. A type system allows types of Feature Structures to be defined and organized in an inheritance hierarchy where each type can inherit from exactly one other type. Several predefined types facilitate working with text (Annotation) and with multiple instances of primary data (subject of analysis/view). JCas is a mechanism that maps the language-independent CAS model into the Java language by generating Java classes for the types defined in the type system, directly exposing UIMA types as a Java API. While this JCas API provides a very convenient way of working with the CAS using Java, it is completely optional and should be avoided when implementing type-system-agnostic data conversion or analytics modules in UIMA. The CAS is a data model for which multiple serialization formats with different properties exist. For example, the XMI serialization format is XML-based, allowing UIMA to interface with UML-based tooling in an enterprise environment. Several binary formats exist, offering space-efficient encodings also suitable for communication across networks, e.g. between web services. UIMA component collections such as DKPro Core or ClearTK offer additional conversions between popular corpus formats and the UIMA CAS. However, these conversions are typically bound to the specific type system used by the component collection.

GATE ANNOTATION FORMAT

The GATE data model is based on the TIPSTER model. A key component is the FeatureBearer – a Java object that is associated with a set of key/value pairs. Features may be any Java class or interface. Various classes implement the FeatureBearer interface, including corpora, documents, and annotations on those documents. An annotation is an arc within a graph representing the document. As well as being FeatureBearers, annotations also have a start node and end node, which correspond to offsets within the document, an identifier, and a type string. Annotations are contained within sets. The GATE annotation model may be serialized in several forms, including GATE's native stand-off markup, a fast infoset implementation of this format, and various other database, Lucene and file-based serializations.

BRAT FORMAT Brat is an environment for annotating textual documents. Text annotations are stored in a stand-off tabular format. Each annotation is represented on one line with its properties separated by tabs. It is similar to the BioNLP shared task format.

LAF/GRAF The Linguistic Annotation Framework (LAF) is an ISO standard (ISO 24612:2012) that specifies an abstract data model for an exchange format to represent different kinds of linguistic annotations, and provides an XML pivot format for this model, the Graph Annotation Format (GrAF). GrAF is intended to enable interoperability among different formats for data and linguistic annotations and the systems that create and exploit them.

SYNAF (ISO 24615) provides a reference model for the representation of syntactic annotations.

MAF (ISO 24611) provides a reference format for the representation of morpho-syntactic annotations.

NIF The NLP Interchange Format is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.

BIO/BILOU The BIO (also referred to as IOB) format is a scheme to represent class labels in sequential labelling tasks such as chunking and named entity recognition. In this format, instances (usually tokens) are labelled as being either inside (I), at the beginning (B), or outside (O) of a given annotation. Several variations exist around this scheme, such as the BILOU (beginning, inside, last element, outside, unique) or the BIOE (beginning, inside, outside, end) representations. Such schemes are typically used with machine-learning algorithms (such as HMMs and CRFs).
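The short Python sketch below gives a constructed example of BIO labelling and shows how the labels can be collapsed back into entity annotations; the tokens, labels and entity types are invented for illustration.

```python
tokens = ["Barack", "Obama", "visited", "New", "York", "."]
labels = ["B-PER",  "I-PER", "O",       "B-LOC", "I-LOC", "O"]

def bio_to_spans(labels):
    """Collapse a BIO label sequence into (entity_type, (start, end)) token spans."""
    spans, start, current = [], None, None
    for i, label in enumerate(labels + ["O"]):     # sentinel flushes a trailing entity
        if label.startswith("B-") or label == "O":
            if current is not None:
                spans.append((current, (start, i)))
                current = None
            if label.startswith("B-"):
                start, current = i, label[2:]
    return spans

for entity_type, (start, end) in bio_to_spans(labels):
    print(entity_type, " ".join(tokens[start:end]))   # PER Barack Obama / LOC New York
```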

4.2.3 Summary
Annotation formats are of deep importance to any TDM workflow. They are the key to efficient information exchange between modular TDM components. When developing new components for the OpenMinTeD infrastructure, we must consider what annotation format to use and how this will affect future interoperability. A highly useful outcome of the OpenMinTeD project may be the recommendation of an interoperable format for adoption by the community. Some points to consider whilst making these decisions are as follows:
- The majority of annotation formats are generic and can be applied in a variety of contexts (Table 14).
- Domain-specific concerns may influence a user to choose a domain-limited format.
- There is very little interoperability between these formats, as each one uses its own conventions to express and encode information.
- When selecting an annotation format, a user should consider whether they will need the annotated text to interact with a resource which is encoded in one of the existing formats. If so, they should select that format to ensure interoperability. Compatibility between these formats is currently only possible through manual manipulation of the data.

4.3 Ontologies, dictionaries and lexicons
In this section we focus on static resources that play the role of knowledge sources. These may contain purely linguistic information (e.g., morphology dictionaries), general world knowledge (e.g., open-domain ontologies) or very specific domain knowledge. As it is not possible to enumerate all existing resources of that kind here, we instead aim to outline the most popular formats and resources from the point of view of interoperability.

4.3.1 Formats for encoding lexica, ontologies, dictionaries
This section describes formats used for encoding lexica, ontologies, dictionaries, terminological resources, etc. Table 15 compares the typical formats used for lexical resources.

Table 15. Main formats for lexical resources.
| Format | Target resource type | Characteristics | Software libraries |
| TMF/TBX148 | terminologies | ISO standard, use of ISO 12620 data categories | Yes, several languages, FOSS |
| LMF149 | lexica | ISO standard; common representation of lexical objects, including morphological, syntactic, and semantic aspects | |
| SKOS150 | thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, other types of controlled vocabulary | developed within the W3C framework, expressing the basic structure and content of concept schemes | RDF libraries |
| OWL151 | ontologies | part of the W3C's Semantic Web technology stack, based on Description Logics | Yes, several languages, FOSS |
| OBO152 | ontologies | subset of the concepts in the OWL description logic language + several extensions | Java, Artistic |
| Ontolex153 | lexica relative to ontologies | interoperability among existing models for linguistic and terminological description | Java (API for the Lemon model), BSD |
| TMX154 | translation memories | XML standard for the exchange of Translation Memory data | Several languages, FOSS |
| XLIFF155 | translation memories | XML format to standardize the exchange of localizable data | Several languages, FOSS |
| MLIF156 | translation memories | specification for representing multilingual data | |

TMF/TBX The Terminological Markup Framework (ISO 16642:2003) and the associated TermBase eXchange format (TBX157) capture the underlying structure and representation of computerized terminologies. Both formats make use of ISO 12620 data categories158.

LMF The Lexical Markup Framework (LMF; ISO/CD 24613) is a linguistic abstract meta-model that provides a common, standardized framework for the construction of computational lexicons. The LMF ensures the encoding of linguistic information in a way that enables reusability in different applications and for different tasks. The LMF provides a common, shared representation of lexical objects, including morphological, syntactic, and semantic aspects.

SKOS SKOS Core (Simple Knowledge Organization Systems) has been developed within the W3C framework, and provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, other types of controlled vocabulary, and also concept schemes embedded in glossaries and terminologies.

OWL OWL (Web Ontology Language) is designed to represent rich and complex knowledge about things, groups of things and relations between things. OWL is part of the W3C's Semantic Web technology stack, and is built upon RDF and RDFS. The OWL formal model is based on Description Logics. It is commonly used by certain scientific communities to implement vocabularies that carry anything from lightweight to heavyweight semantics.

OBO OBO format is the text file format used by OBO-Edit, the open source, platform independent application for viewing and editing ontologies. The concepts it models represent a subset of the concepts in the OWL description logic language, with several extensions for metadata modelling and the modelling of concepts that are not supported in DL languages. It is largely used in the biomedical and agricultural domains, among others.

ONTOLEX The Ontolex community develops models for the representation of lexica (and machine readable dictionaries) relative to ontologies, covering syntactic, morphological, semantic and pragmatic information. The Ontolex model is based on the Lemon model159 and establishes interoperability among existing models for linguistic and terminological description in order to uniformly represent and structure linguistic information.

TMX The Translation Memory eXchange is the vendor-neutral open XML standard for the exchange of Translation Memory data created by computer aided translation and localization tools. The purpose of TMX is to allow easier exchange of translation memory data between tools and/or translation vendors with little or no loss of critical data during the process. In existence since 1998, TMX is a certifiable standard format. TMX is developed and maintained by OSCAR (Open Standards for Container/Content Allowing Re-use), a LISA (Localization Industry Standards Association) Special Interest Group.

XLIFF The XML Localization Interchange File Format aims to define and promote, through extensible XML vocabularies, the adoption of a specification for the interchange of localizable software and document-based objects and related metadata. XLIFF is tool-neutral and enables users to mark up and capture localization information.

MLIF The Multilingual Information Framework (ISO 24616-2011) is a specification platform for a computer oriented representation of multilingual data within a large variety of applications such as translation memories, localization, computer aided translation, multimedia, or electronic document management.

4.3.2 Ontologies, dictionaries and lexica
In this section we focus on static resources that play the role of knowledge sources. These may contain purely linguistic information (e.g., morphology dictionaries), general world knowledge (e.g., open-domain ontologies) or very specific domain knowledge. As it is not possible to enumerate all existing resources of that kind here, we instead aim to outline the most popular ones from the point of view of interoperability, in Table 16.

Table 16. A non-exhaustive list of popular knowledge resources.
| Name | Resource type | Language | Size | Format | License |
| WordNet | Lexical database | English | 117,659 synsets (lexemes + sets of synonyms) | Proprietary | Free to use and redistribute, proprietary licence160 |
| Wikipedia | Internet encyclopaedia | Multilingual | 38M pages | Proprietary | GFDL 1.2161 |
| Wikidata | Knowledge base | Multilingual | 14,957,892 data items (2015-10-23) | Proprietary | CC0 (main namespace), CC- (other namespaces)162 |
| OpenCyc | Open source subset of the CYC knowledge base | | ~240,000 concepts and 2M facts | CycL, OWL | Apache Licence163 |
| DBpedia | Knowledge base and ontology | Multilingual | ~4.58 million entities | | Creative Commons and GNU FDL164 |
| Freebase | Collaborative knowledge base | Multilingual | ~23 million entities (2015-11-05) | Proprietary | Creative Commons165 |
| UBY | Large-scale lexical semantic resource combining several existing resources | Multilingual (mostly English and German) | | UBY-LMF | Open licenses and academic research license166 |
| YaGO | Large semantic knowledge base | Multilingual | over 10M entities and 120M facts | TSV, Turtle | Creative Commons167 |
| OLiA | Ontologies for linguistic annotation | Multilingual | 34 annotation models | OWL/DL | Creative Commons168 |
| GOLD | Ontology for linguistic description | | 500 concepts | XML, OWL | Creative Commons169 |
| UMLS | Metathesaurus gathering and linking multiple biomedical vocabularies | Multilingual | over 3.2 million concepts from over 190 vocabularies | Proprietary | Proprietary licence170 |
| Gene Ontology | Ontology focused on functional genetics | English | 44k terms | OBO | Creative Commons171 |
| Agrovoc | Thesaurus for agriculture and food sciences | Multilingual | 32k concepts | RDF | CC BY-NC-SA172 |
| Thesaurus for the Social Sciences (TheSoz) | Thesaurus for the Social Sciences | Multilingual (German, English, French, Russian) | 12k entries | RDF/XML, RDF/Turtle, Linked Data HTML | CC BY-NC-ND 3.0173 |

PRINCETON WORDNET174 Princeton WordNet is a database containing lexemes (lexical elements, i.e., words) and synsets (sets of synonymous lexemes). Synsets may be connected to each other with relations, such as hypernymy, hyponymy, meronymy, etc. Princeton WordNet is very widely used in numerous NLP tasks.
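As an illustration of how WordNet is typically consumed programmatically, the snippet below uses the NLTK interface to Princeton WordNet; it assumes NLTK is installed and the WordNet corpus has been downloaded, and the query word is arbitrary.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

synsets = wn.synsets("dog")                        # all synsets containing the lexeme "dog"
first = synsets[0]
print(first.definition())                          # gloss of the first sense
print([lemma.name() for lemma in first.lemmas()])  # synonymous lexemes in the synset
print(first.hypernyms())                           # relations between synsets, e.g. hypernymy
```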

WIKIPEDIA175 Wikipedia is a free access, free content internet encyclopedia, supported and hosted by the non- profit Wikimedia Foundation. Those who can access the site can edit most of its articles.

WIKIDATA176 Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is intended to provide a common source of certain data types (e.g., birth dates) which can be used by Wikimedia projects such as Wikipedia.

OPENCYC177 OpenCYC is an open source subset of the CYC knowledge base. CYC is an ambitious project, started in 1984, for maintaining a resource containing all common-sense knowledge. The CYC project is currently maintained by Cycorp, and its resources are not released as open source. The CYC and OpenCYC knowledge bases are represented in CycL, a proprietary language based on predicate logic. A projection of OpenCYC onto description logics is also distributed as an OWL file. OpenCYC features an inference engine, but only binaries are released. There have been several projects that aim at extracting the knowledge from text, or at querying the knowledge base in natural language. However, the OpenCYC knowledge base does not contain lexical resources.

DBPEDIA178 DBpedia is a crowd-sourced community project aiming at extracting structured information from Wikipedia. It facilitates posing semantic queries against Wikipedia, recognizing names in plain text and linking them to Wikipedia concepts. The DBpedia knowledge base for English contains 4.58 million entities, mostly classified in an ontology and described by numerous attributes. DBpedia has been created by extracting knowledge from Wikipedia, especially taking into account structured information, e.g., infoboxes, links and categories.
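To illustrate the kind of semantic querying mentioned above, the sketch below sends a SPARQL query to the public DBpedia endpoint using the third-party SPARQLWrapper package; the endpoint availability, the chosen resource and the property used are assumptions for demonstration.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?abstract WHERE {
        <http://dbpedia.org/resource/Text_mining>
            <http://dbpedia.org/ontology/abstract> ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"])
```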

FREEBASE179 Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members. It was an online collection of structured data harvested from many sources, including individual, user-submitted wiki contributions. Freebase aimed to create a global resource that allowed people (and machines) to access common information more effectively.

On 16 December 2014, Google announced that it would shut down Freebase over the succeeding six months and help with the move of the data from Freebase to Wikidata.

UBY180 UBY is a large-scale lexical semantic resource that combines a wide range of information from expert-constructed and collaboratively constructed resources. The current version of UBY contains 11 integrated resources. These include:
- English resources: WordNet, Wiktionary, Wikipedia, FrameNet, VerbNet
- German resources: German Wikipedia, German Wiktionary, OntoWiktionary, GermaNet and IMSLex-Subcat
- Multilingual resource: OmegaWiki
A subset of these resources is interlinked at the word sense level.

YAGO181 YAGO (Yet Another Great Ontology) is a very large semantic knowledge base, derived from Wikipedia, WordNet and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (such as persons, organizations and cities) and contains more than 120 million facts about these entities.

OLIA182 OLiA is a reference model containing reference data categories for (morpho-)syntax. It is informally interlinked with ISOcat and GOLD (see below), and it comes with a repository of multilingual ontologies covering the data categories of a range of resources that are linked to the OLiA model.

GOLD183 GOLD is a richly axiomatized ontology for descriptive linguistics. It is intended to capture the knowledge of a well-trained linguist, and can thus be viewed as an attempt to codify the general knowledge of the field.

UMLS184 UMLS (Unified Medical Language System) is the union of several biomedical terminologies. It is maintained by the US National Library of Medicine (NLM). It encompasses very diverse resources, mainly focused on medical practice and biology. All resources in UMLS are terminologies or nomenclatures; formal semantic knowledge is rare.

GENE ONTOLOGY185 Gene Ontology is a lexicalized taxonomy of concepts focused on functional genetics. It is developed by the Gene Ontology Consortium. Gene Ontology is widely used in the biomedical and bioinformatics community. Most major biology databases (UniProt, GenBank) have entries

that explicitly reference Gene Ontology concepts, which makes this resource an interesting source for data-mining experiments.

AGROVOC186 The AGROVOC thesaurus serves as a controlled vocabulary for the indexing of publications in agricultural science and technology. It is used by many libraries as well as by a great number of applications.

THESAURUS FOR THE SOCIAL SCIENCES187 The Thesaurus for the Social Sciences (Thesaurus Sozialwissenschaften) is a crucial instrument for content-oriented keyword search in GESIS databases. The list of keywords contains about 12,000 entries, of which more than 8,000 are descriptors (authorized keywords) and about 4,000 non-descriptors. Topics in all of the social science disciplines are included. The Thesaurus for the Social Sciences is also available in interactive form; users can choose between alphabetic and systematic lists and between English, German and Russian translations.

4.3.3 Summary
Similar to the annotation formats in Section 4.2, the format we use to store data is also of critical importance to the interoperability of our tools. Popular standards exist, which must be accounted for and intelligently incorporated into the OpenMinTeD platform. In order to do this, we must consider the following points:
- A knowledge source may come in any one of a variety of formats (Table 15, Table 16).
- Many popular knowledge sources come in a proprietary format which does not conform to any known standards (Table 16).
- Certain types of resources have very popular formats, which should be used where possible (e.g. OWL for ontologies, SKOS for thesauri).

5. Computing infrastructures
OpenMinTeD has set out with the mandate to develop a multidisciplinary Text and Data Mining (TDM) infrastructure. This infrastructure will be an electronic one, utilizing the capabilities of modern networks, computing resources and online services. In this chapter, we investigate the interoperability aspects of this kind of e-infrastructure, and in particular cloud computing. This focus is by no means restrictive: cloud computing is a term coined to encompass a wide range of e-infrastructure provisioning technologies, and as such it is safe to say that the OpenMinTeD TDM infrastructure will exploit such technologies in one or more forms. Moreover, many of the support services that will need to be deployed during the lifetime of the project (e.g., the continuous integration tools) will exploit cloud computing infrastructures. As such, cloud interoperability is an important issue both for the outcome as well as for the smooth execution of the project. This chapter first sets the landscape by going through some basic definitions regarding cloud computing and its architectural models. It then considers the problem of interoperability and portability and proceeds with relevant standards concerning cloud computing infrastructures. In particular, we look into cloud management standards on the infrastructure layer as well as interoperation- and portability-related standards on the application layer. The chapter closes with an overview of distributed processing patterns and solutions, and a brief state of the art of popular cloud computing solutions with an emphasis on their interoperability aspects.

5.1 Cloud computing overview
There are many definitions of cloud computing, depending on the context and the technologies used. A very broad and generally accepted definition is provided by NIST, which defines cloud computing as “a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [NIST 2011]. There are five distinctive characteristics that make a cloud service what it is:
1. On-demand self-service: users get what they want, when they want it, having control over the resource usage costs.
2. Broad network access: services are available remotely over the Internet.
3. Resource pooling: users transparently share access, typically in a multitenant fashion, to a perceived infinite amount of computing resources.
4. Rapid elasticity: users can manually or dynamically add and remove resources on demand, depending on their temporal requirements.
5. Measured service: resources are accounted for and detailed utilization information is preserved for purposes of billing and fair-share access. Users can find out exactly what they used, for how long and how much it cost.
Cloud computing technologies became possible due to the evolution of inexpensive server computers that facilitated the development of large physical computing infrastructures and the

exploitation of economies of scale. Additionally, the abundance of fast wide-area networks enabled distributed users to access computing resources quickly and reliably over the Internet. Moreover, one of the key enabling factors was the maturity of virtualization technologies that enabled access to and management of resources through software APIs.

5.1.1 Cloud deployment models
A cloud deployment model refers to the provisioning responsibilities and access control of cloud services. Four different deployment models are largely accepted:
1. Private Cloud. Deployed and maintained within the boundaries of a single organization. Only members of the organization can access and exploit it.
2. Public Cloud. Deployed by an organization and offered to a group of users outside the organization. It targets a broad range of use cases. It may implement a specific business model (e.g., pay-per-use) as a commercial offering, or it can provide open and free access to its resources.
3. Community Cloud. A variation of the Public Cloud in the sense that the end users are not restricted to the boundaries of a specific organization, but the services deployed target less generic, more well-defined use cases covering the requirements of a specific user community.
4. Hybrid Cloud. A private cloud that under specific circumstances can temporarily expand (burst) into a (usually commercial) Public Cloud in order to satisfy transient requirements and usage peaks.

5.1.2 Cloud service models
Cloud computing services are generally layered into three basic levels, described below.
INFRASTRUCTURE-AS-A-SERVICE (IAAS) The IaaS model provides an abstraction for raw computing infrastructure, such as servers, network bandwidth and storage systems, offered from a provider to a consumer on demand. This offering of virtualized resources is transparent to the end users (consumers), who have no administrative rights over the underlying cloud infrastructure; usually many users are “mapped” to one physical machine. The end user can create and manage virtual machines (VMs), i.e., servers that resemble real computers, selecting from a wide range of OS and hardware specifications. IT administrators are the most likely users of this model.
PLATFORM-AS-A-SERVICE (PAAS) The next abstraction layer on the stack, PaaS, is probably what most developers intuitively understand by the phrase “programming for the cloud”. A PaaS offers an environment on which developers can create and deploy applications without knowing how many processors or how much memory that application will be using, how the application will scale, etc. In addition, multiple programming models and specialized services (e.g., data access, authentication, payments) are offered as building blocks for new applications. The platform typically comprises

Application Programming Interfaces (APIs) for one or more supported languages. The PaaS model is mostly suitable for web application hosting.
SOFTWARE-AS-A-SERVICE (SAAS) Software applications reside at the top of the cloud computing stack. Services provided by this layer can be accessed by end users through any web-browser-enabled device. Therefore, consumers are increasingly shifting from standalone computer programs to online software services that offer the same functionality. Typical examples of these applications include word processing, email, etc. This model of delivering applications, known as Software as a Service (SaaS), alleviates the burden of software and hardware maintenance for customers and simplifies development and testing for providers. Figure 1 depicts the three cloud service models along with their most popular example use cases.

Figure 1. Cloud computing models at a glance.
Figure 2 depicts the separation of responsibilities between cloud consumers (subscribers) and providers (resource owners) for each X-aaS model. The left column illustrates the full control of the resource owner for an on-premises setup, while the last three columns display, from left to right, the gradual loss of control/ownership on the consumer's side.


Figure 2. Responsibilities matrix for cloud computing models.

5.2 Interoperability and portability
In the context of cloud computing, interoperability is the characteristic of two cloud infrastructures to understand each other's applications, data formats, and authentication and authorization artefacts, and in general to be able to use each other's APIs for all the above-mentioned functional aspects [CSCC 2014]. It is important to mention that the higher we move in the cloud hierarchy, the more challenging interoperability becomes. For example, on the IaaS level there are some well-defined standards for describing virtual machine images, application deployments, authentication credentials, etc. This is because on the infrastructure layer there are many open and common technologies in use. On the PaaS and SaaS levels, in contrast, services expose functionally and semantically diverse features that prevent the direct exchange of information and artefacts. In such cases it is necessary to develop custom solutions to allow the porting of applications and data between two cloud offerings. Vendor lock-in has been recognized as one of the key factors that prohibit the optimal utilization and widespread usage of cloud technologies. In this direction, cloud standards can play a catalyzing role in helping different clouds become interoperable and in ensuring that the investment of cloud users is not threatened. In the following paragraphs we take a closer look at various aspects of cloud interoperability and the respective cloud computing standards which, as of the writing of this document, are considered important for achieving interoperable cloud services [SEI 2012].

5.2.1 Resource management
One of the first aspects of cloud interoperability that the industry tried to facilitate was cloud resource management on the IaaS level. Interoperability in this case refers to the ability to manage the most common IaaS cloud resources, such as virtual machines, storage volumes and networks, programmatically, through common standard abstractions and APIs. This gives end users the ability to develop high-level applications and scripts that automate resource provisioning and other common resource management tasks. These APIs are an integral part of computing practices like DevOps, which fuses development and operations activities by applying an Infrastructure-as-Code approach. At this moment the most widely used standards in this area are OCCI (Open Cloud Computing Interface) and CIMI (Cloud Infrastructure Management Interface), which are part of international standardization processes. In parallel, commercial and open source cloud offerings like Amazon EC2 and OpenStack provide their own means of performing these tasks. Although the latter cannot be considered standards in the strict sense, they have become de facto tools for managing IaaS cloud resources and have been adopted by communities and tools outside the borders of Amazon AWS and the OpenStack initiative, respectively.

OCCI The Open Cloud Computing Interface (OCCI) comprises a set of open, community-led specifications delivered through the Open Grid Forum188. OCCI is a protocol and API for all kinds of management tasks. OCCI was originally initiated to create a remote management API for IaaS-model-based services, allowing for the development of interoperable tools for common tasks including deployment, autonomic scaling and monitoring. It has since evolved into a flexible API with a strong focus on integration, portability, interoperability and innovation, while still offering a high degree of extensibility. The current release of the Open Cloud Computing Interface is suitable to serve many other models in addition to IaaS, including, e.g., PaaS and SaaS.

CIMI Cloud Infrastructure Management Interface (CIMI) specification describes the model and protocol for management interactions between a cloud Infrastructure as a service (IaaS) provider and the consumers of an IaaS service. The basic resources of IaaS (machines, storage and networks) are modelled with the goal of providing consumer management access to an implementation of IaaS and facilitating portability between cloud implementations that support the specification. CIMI includes a Representational State Transfer (REST)-style protocol using HTTP. However, the underlying model is not specific to HTTP, and it is possible to map it to other protocols as well. CIMI addresses the management of the lifecycle of infrastructure provided by a provider. CIMI does not extend beyond infrastructure management to the control of the applications and services that the consumer chooses to run on the infrastructure provided as a service by the provider. Although CIMI may be to some extent applicable to other cloud service models, such as

Platform as a Service ("PaaS") or Storage as a Service ("SaaS"), these uses are outside the design goals of CIMI.

AMAZON WEB SERVICES EC2 API Amazon Web Services (AWS)189 remains one of the most feature-rich, popular and largest cloud computing infrastructures available. An ecosystem of tools and APIs has been developed over the years that offers end users the ability to programmatically manage all aspects of the service. For what concerns EC2 (Elastic Compute Cloud), the IaaS part of AWS, Amazon provides tools that enable the complete management of a VM's lifecycle, including management of persistent storage, public and private networks, monitoring of resources, etc. Although not a standard per se, the management model reflected by these tools has considerably influenced the inception of abstract cloud standards like OCCI and CIMI. In addition, it has triggered the development of open-source counterparts that aim to replicate the EC2 IaaS model and provide interoperability and cloud bursting capabilities between Amazon's commercial services and third-party private clouds.

OPENSTACK COMPUTE API OpenStack190 provides a rich set of APIs that support management of a complete IaaS stack including VMs (compute API), block storage, identity management, networking, etc. Similar to Amazon EC2 APIs, OpenStack’s APIs are not a standard nor do they implement any of the open standards available. Nevertheless, due to the popularity of OpenStack’s software platform and the large install base at the moment, OpenStack APIs can be used as a blueprint for designing cloud APIs for third party IaaS software stacks, enabling in this way interoperability with pure OpenStack implementations. This is, for example, the case with Synnefo, the software that powers GRNET’s Okeanos IaaS cloud, which along with a proprietary API implements OpenStack APIs and provides almost 100% compliance and thus compatibility with third party OpenStack client tools.

5.2.2 Data management
Data interoperability can be thought of in the context of the physical infrastructure (e.g., database migration), the virtual infrastructure (e.g., migrating VMs between IaaS clouds) and the service level (e.g., exchanging information between applications). The key interoperability issues concerning data management are migration and accessibility. Migration refers to the transfer of data and content between storage platforms. Accessibility is the ability to use and modify data and content stored in one cloud platform in the context of another, e.g., to store data of the same type in different clouds, to seamlessly exchange information between clouds and services, or to keep different storage spaces in sync for speed, backup and recovery. Cloud data are either ephemeral or persistent. Ephemeral data are usually related to other resources, e.g., the system disk volume of a virtual machine. Persistent data are stored as objects, files or databases and are available independently of other resources, e.g., files in storage

services or detachable/autonomous disk volumes. The transfer of persistent data is more straightforward than that of ephemeral data, due to the fact that context is essential for the latter. Another major challenge is data portability between services. In many cases, the structure of data is designed to fit a particular form of application processing, and a significant transformation is needed to produce data that can be handled by a different product. The aforementioned challenges are being tackled by common APIs and protocols proposed by the industry.

CDMI Cloud Data Management Interface is a protocol for data administration and accessing cloud storage, proposed by SNIA191. It defines RESTful HTTP operations for assessing the capabilities of the cloud storage system, allocating and accessing containers and objects, managing users and groups, implementing access control, attaching metadata, making arbitrary queries, using persistent queues, specifying retention intervals and holds for compliance purposes, using a logging platform, billing, moving data between cloud systems, and exporting data via other protocols (e.g., iSCSI, NFS). Transport security is obtained via TLS.

AMAZON S3 Amazon Simple Storage Service (S3) is a storage service of Amazon Web Services (AWS)192. Data objects are accessible through REST, SOAP and BitTorrent. There is also a developer API for implementing external applications. S3 has been the incentive for the implementation of competing standards and solutions by major organizations and companies in the industry (e.g., Google, OpenStack). S3 focuses on easy data management through the APIs. Due to the unique way files are stored, data migration requires external tools and data transformation and mapping. Many third-party applications and even storage services provide compatibility interfaces with S3. Amazon provides two more storage services: Glacier and EBS. Glacier promises high redundancy and availability but infrequent access, and is intended for archiving. Elastic Block Storage (EBS) is a block-storage service that provides volumes for EC2 VMs. Interoperability for both services is limited to exporting their data to S3.
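As an illustration of the API-centric data management mentioned above, the sketch below uses boto3, the AWS SDK for Python, to write and read an S3 object; the bucket and key names are placeholders and credentials are assumed to be configured in the environment.

```python
import boto3

s3 = boto3.client("s3")   # picks up credentials from the environment/AWS config

# Upload a small text object (bucket and key are hypothetical placeholders).
s3.put_object(Bucket="example-bucket", Key="corpus/document1.txt",
              Body="Some annotated text".encode("utf-8"))

# Download it again and decode the bytes.
response = s3.get_object(Bucket="example-bucket", Key="corpus/document1.txt")
print(response["Body"].read().decode("utf-8"))
```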

OPENSTACK OBJECT STORAGE
OpenStack Object Storage (Swift)193 is a storage standard proposed by OpenStack. Data are stored as hashed objects grouped in storage containers (similar to buckets). The design allows different storage backends to be used. Swift specifies a RESTful API and a Python library for uploading, downloading, accessing and modifying data and their properties, and libraries in other languages have been developed by external contributors. OpenStack Object Storage is widely adopted in the industry, either through compatibility layers or through independent implementations of the proposed standard (e.g., GRNET Pithos+). Stored objects can be transferred and mapped to other storage platforms with external tools.
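The RESTful API boils down to simple HTTP verbs on containers and objects. The sketch below assumes an authentication token and account storage URL have already been obtained (e.g., from Keystone); both values are placeholders.

    import requests

    # Placeholders: the storage URL and token would normally come from the identity service.
    STORAGE_URL = "https://swift.example.org/v1/AUTH_demo"
    HEADERS = {"X-Auth-Token": "<token>"}

    # Create a container, upload an object into it, then list the container's contents.
    requests.put(STORAGE_URL + "/landscaping", headers=HEADERS)
    requests.put(STORAGE_URL + "/landscaping/report.txt", headers=HEADERS, data=b"hello swift")
    listing = requests.get(STORAGE_URL + "/landscaping?format=json", headers=HEADERS)
    for entry in listing.json():
        print(entry["name"], entry["bytes"])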

OpenStack also proposes a Block Storage standard (Cinder) for storing and managing disk volumes. Depending on the implementation, volumes can be physically exported to Swift. Notably, Cinder allows for detachable volumes, thus providing persistent data accessible to multiple VMs in the same system.

MICROSOFT AZURE STORAGE
Azure is the proprietary cloud solution provided by Microsoft. The storage-related components are Blobs (similar to "objects")194, Tables (NoSQL data), Queues (message storage) and Files (similar to "volumes"). Microsoft provides programming libraries for Blob storage in most popular languages, a RESTful API and an OData API. Data migration between the various Azure storage components requires external tools (e.g., to copy data from Files to Blobs, the data have to be downloaded and re-uploaded). The same holds when migrating data to other cloud platforms.

GOOGLE CLOUD STORAGE
Google Cloud Storage195 is a storage IaaS offering by Google in which data are stored in objects and buckets. It provides a RESTful online file storage web service for hosting data on Google infrastructure, together with programming APIs in most popular languages. Google Cloud Storage features interfaces for interoperability with S3 and OpenStack-based services196: these allow different storage platforms to be used by Google-based applications (and vice versa) and include tools for transferring data between Google and S3. Other Google Cloud services, including applications deployed by users, have secure access to data stored in Google Cloud Storage. The service is free of charge up to a limit (extra space is charged) and requires a Google account.

5.2.3 Identity Management
Identity management resides at the heart of every cloud system. Typically, cloud infrastructure software relies on an identity management subsystem which implements authentication, access control and accounting policies at the IaaS level with respect to security and simplicity. Internally, a cloud identity manager should orchestrate the various cloud services (e.g., quota management, sharing, etc.). Externally, it controls federated authentication and access rights to the infrastructure. In a world of multiple and diverse cloud solutions, the ability to authenticate, identify and assess users across multiple identity managers is of paramount importance. The industry has designed and proposed a number of protocols, standards and methodologies to tackle this issue. Typically, they obtain trust by relying on external organizations or federations, and they are evaluated not only by the level of security but also by the level of flexibility they provide to external users and client applications.

PKI-BASED AUTHENTICATION
The term Public Key Infrastructure (PKI) describes a set of policies and procedures regarding the use of public-key encryption and digital certificates to facilitate the secure transfer of

information. PKI binds public keys to the respective user identities by means of a certificate authority (CA). A third-party validation authority (VA) can provide information on behalf of the CA, and a registration authority manages the binding of identities to the authorities. User certificates can also be self-signed, in which case trust is obtained through procedures unrelated to PKI. PKI allows the secure sharing of resources on diverse clouds with certified users, e.g., by providing TLS/SSL access to multiple VMs independently of the location or technology of the infrastructure, or by using encryption to maintain security and control access to data stored in diverse storage facilities of varied security levels. Virtual resource administrators can create and manage virtual wallets at the IaaS or PaaS level to simplify the management of user access. Most clouds provide tools for managing public keys at the IaaS level (e.g., AWS KMS, Okeanos Public Keys) or simply allow the creation of VMs with preset public-key access (e.g., OpenStack personality).
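For illustration, the sketch below verifies a VM's TLS certificate against a CA bundle, so that trust is anchored in the PKI rather than in the hosting cloud; the host name and CA bundle path are hypothetical.

    import socket
    import ssl

    # Hypothetical federation CA bundle and VM host name.
    context = ssl.create_default_context(cafile="federation-ca-bundle.pem")

    with socket.create_connection(("vm.example.org", 443)) as sock:
        with context.wrap_socket(sock, server_hostname="vm.example.org") as tls:
            # The handshake succeeds only if the certificate chains to the trusted CA
            # and matches the host name; print a few certificate fields.
            cert = tls.getpeercert()
            print(cert["subject"], cert["notAfter"])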

SAML 2.0
Security Assertion Markup Language 2.0 (SAML 2.0) is a standard ratified by OASIS for exchanging authentication and authorization data between security domains. It proposes an XML-based protocol that uses security tokens to pass information about a user between identity authorities and service providers. It supports cross-domain single sign-on (SSO) through the secure exchange of metadata between the user agent, the service provider and the identity authority, thus allowing diverse clouds to use a single identity authority.

SHIBBOLETH
Shibboleth is an SSO system designed for federations such as universities or public service organizations. Shibboleth 2.0 has an open source implementation for identity management and federated identity-based authentication and access control, based on SAML 2.0. Federated identity allows information about users to be shared between security domains and other organizations or federations (e.g., eduGAIN and European academic institutions). This allows for cross-domain SSO and removes the need for content providers to maintain user names and passwords.

OPENID
OpenID is a decentralized open standard proposed by the OpenID Foundation. It allows users to be authenticated by co-operating sites through a third-party service, focusing on providing users with identity consolidation functionality. Several major web services accept, and even provide, OpenIDs. It has been designed with user convenience in mind and facilitates global identities (as opposed to the federated identities of Shibboleth). The OpenID standard provides a framework for the communication that must take place between the identity provider and the OpenID acceptor. It also facilitates the transfer of user attributes from the OpenID identity provider to the relying party.

OAUTH 2.0
OAuth is an open standard for authorization, designed to provide secure access to client applications over HTTP. In principle, the client acquires access tokens from an authorization service with the approval of the resource owner, in order to securely access specific restricted resources provided by the latter. OAuth 2.0 is the latest version of OAuth and focuses on client-developer simplicity and fine-grained authorization flows for different types of applications, implemented through action-specific tokens. Many popular clouds provide OAuth authentication (e.g., Azure, OpenStack, Google Cloud).
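As a sketch of the mechanics, the snippet below performs an OAuth 2.0 "client credentials" grant and then presents the resulting bearer token to a protected API; all endpoints, identifiers and scopes are hypothetical.

    import requests

    # Hypothetical authorization server and client credentials.
    TOKEN_URL = "https://auth.example.org/oauth2/token"
    resp = requests.post(
        TOKEN_URL,
        data={"grant_type": "client_credentials", "scope": "storage.read"},
        auth=("my-client-id", "my-client-secret"),
    )
    access_token = resp.json()["access_token"]

    # The bearer token authorizes subsequent calls to the (hypothetical) cloud API.
    api = requests.get(
        "https://cloud.example.org/v2/servers",
        headers={"Authorization": "Bearer " + access_token},
    )
    print(api.status_code)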

5.2.4 Application development and deployment
When porting an application to a multi-cloud environment, one is often faced with a problem: how to integrate with different clouds that provide and implement different management APIs? This is where cloud abstraction libraries come into play. These libraries provide an abstraction layer to the programmer, making it possible to use the same API calls to communicate with different cloud providers at the IaaS layer. They also hide the peculiarities of a specific cloud provider by offering a unified abstract view of the IaaS layer capabilities (e.g., start a VM, stop a VM, attach extra disk space, etc.). Currently, there are several options when choosing an abstraction library for a project. Some of these libraries are language-specific; others are more generic and provide support for several languages. In principle, any cloud provider can potentially be supported by an abstraction library. Popular IaaS clouds like Amazon EC2, OpenStack, Rackspace, etc. are usually supported out of the box. More importantly, these libraries are open-source, extensible software, so anyone can write their own plugin to enable support for any other, less common API. Below is a sample of the most popular abstraction libraries currently available.

APACHE LIBCLOUD197
Libcloud is a Python library that abstracts away differences among multiple cloud provider APIs. The current version allows users to manage four different kinds of cloud resources: Cloud Servers, Cloud Storage, Load Balancers as a Service and DNS as a Service.
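A minimal sketch of the provider-neutral style of such libraries, using Libcloud's compute driver; the credentials and region are placeholders, and the same calls work unchanged if another supported driver (e.g., an OpenStack one) is substituted.

    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    # Obtain a driver class for a particular provider (placeholder credentials below).
    Driver = get_driver(Provider.EC2)          # e.g., Provider.OPENSTACK for an OpenStack cloud
    conn = Driver("ACCESS_KEY_ID", "SECRET_KEY", region="eu-west-1")

    # Provider-neutral operations: enumerate images, sizes and running nodes.
    images = conn.list_images()
    sizes = conn.list_sizes()
    for node in conn.list_nodes():
        print(node.name, node.state, node.public_ips)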

JCLOUDS198
jclouds is one of the most popular solutions, supporting Java and Clojure. The library currently provides plugins for 30 cloud providers and cloud software stacks, including Amazon, Azure, GoGrid, Ninefold, OpenStack, Rackspace and vCloud.

δ-CLOUD199
Deltacloud is an open source Apache project written in Ruby that provides not one but three different APIs for simple any-platform access to a large number of cloud providers. In particular, Deltacloud provides a RESTful API, a CIMI-compliant API and a proprietary API. Being RESTful

implies that many different languages can be used; the solution itself comes with client libraries for most popular languages. Table 17 depicts the main features of each cloud API solution (Euro-Par 2014).
Table 17. Comparison table for cloud API libraries.
Features | Libcloud | jclouds | δ-cloud
Type | Library | Library | Framework
Programming language | Python | Java | Ruby
Supported providers | 38 cloud providers | 30 cloud providers | 17 cloud providers

Supported operations | Compute, Storage, Network | Compute, Storage | Compute, Storage, Network
Platform integration | Drivers | Maven dependencies | Drivers
API | - | - | REST, CIMI, AWS
Other interfaces | - | - | Web dashboard, Ruby client

Automatic configuration management
Managing a fully distributed application on the cloud means frequent deployments of largely identical virtual servers. The need to automate their configuration and maintenance has led to the emergence of several "automatic configuration technologies", under the umbrella term "Infrastructure as Code". Each technology aims to simplify orchestration across the production environment in order to provide standard, consistent and portable deployments.

ANSIBLE200
Ansible is an open source tool used to deploy applications to remote nodes and provision servers in a repeatable way. It provides a common framework for pushing multi-tier applications and application artifacts using a push-model setup, although it can also be set up to operate in a master-client model. Ansible is built on playbooks, plain-text descriptions expressed in YAML, that can be applied to an extensive variety of systems for fully deploying an application. Ansible ships with a number of modules (called the 'module library') that can be executed directly on remote hosts or through playbooks. Ansible is SSH-based and does not require an agent running on the clients.

CHEF201
Chef operates on a master-client model, with a separate workstation used to control the master node. Based on the Ruby programming language, it uses the notion of a recipe, written in a Ruby-based DSL (with node data expressed in JSON), to define the set of operations to be applied on each client. Managing infrastructure with Chef is a matter of specifying what resources are needed and how they interact with one another. Chef can work in standalone or central-server mode.

PUPPET202
With a long history, Puppet is a tool with full-fledged configuration management capabilities. It is also based on Ruby, but uses a customized domain-specific language (DSL) closer to JSON. It runs in a master-client setup, requiring a central server to co-ordinate secrets, configurations and recipes. It utilizes the pull model: a process runs continuously, applying the rules on a regular basis by asking the server what rules to run. Puppet uses a declarative language to define its configuration items. This declarative nature creates an important distinction between Puppet and other configuration tools. A declarative language makes statements about each client's configuration state — for example, it declares that a package should be installed or a service should be started.

SALTSTACK203
Saltstack is a CLI-based tool that can be set up either in a master-client model or in a non-centralized model. Based on Python, it offers a push method and an SSH method of communicating with clients. Salt runs as a daemon process for both the master and the client. When the master process comes up, it provides a location (socket) where clients can "bind" and

watch for commands. The default configuration uses YAML as the standard data format. Salt uses client grouping and configuration templates in order to simplify control of the environment.

JUJU204
Juju is a service orchestration management tool that allows software to be quickly deployed, integrated and scaled on a wide choice of cloud services or servers. The central mechanism behind Juju is called charms. Charms can be written in any programming language that can be executed from the command line. A charm is a collection of YAML configuration files and a selection of "hooks". A hook is a naming convention used to install software, start/stop a service, manage relationships with other charms, etc.
Comparison
A comprehensive comparison of automatic configuration management tools can be found in Wikipedia205. For the purposes of the OpenMinTeD project, Ansible seems the closest to a one-size-fits-all choice. It has a user-friendly syntax for playbooks and configuration template files, needs no software installed on clients (agentless), and works on all major operating systems. It also comes with a wide range of modules included, and integrates with AWS, Docker and OpenStack.

5.2.5 Application portability
Standards for application portability enable the transfer of user workloads between cloud resource providers. In this section we consider standards at the IaaS layer. We start by looking into the various image formats used for storing VM images and their state. The ability to use the same image formats between clouds is fundamental for enabling portability. The next level is the ability to describe image contents and high-level information through standard metadata formats. In the latter context, XML-based standards like OVF and TOSCA have gained significant momentum and promise to solve many of the interoperability issues of IaaS clouds.
File image formats
Image files offer a means for low-level storage and representation of virtual machine contents. As such, they are a core artefact for storing applications, potentially along with state and data, and can be used for transferring them between different cloud providers. Table 18 summarizes the most popular file formats available.

Table 18. File image formats.
Format | Description
Raw | An unstructured disk image format. It contains a bit-by-bit representation of the disk; no compression is applied and no metadata information is included.
VHD | A common disk format used by virtual machine monitors from VMware, Xen, Microsoft, VirtualBox and others. A VHD file represents a virtual hard disk drive (HDD). It may contain what is found on a physical HDD, such as disk partitions and a file system, which in turn can contain files and folders. It is typically used as the hard disk of a virtual machine. The format was created by Connectix for their Virtual PC product, known as Microsoft Virtual PC since Microsoft acquired Connectix in 2003. Since June 2005, Microsoft has made the VHD Image Format Specification available to third parties under the Microsoft Open Specification Promise.
VMDK | A common disk format supported by many virtual machine monitors. Developed by VMware to mimic the operation of a physical disk. Virtual disks are stored as one or more files on the host computer or remote storage device, and appear to the guest operating system as standard disk drives.
VDI | Supported by the VirtualBox virtual machine monitor and the QEMU emulator.
QCOW | A file format for disk image files used by QEMU. It stands for "QEMU Copy On Write" and uses a disk storage optimization strategy that delays allocation of storage until it is actually needed. Files in QCOW format can contain a variety of disk images which are generally associated with specific guest operating systems.
ISO | An archive format for the data contents of an optical disc, such as a CD-ROM. A popular format for storing OS releases and as such commonly used for building VM golden images.
AKI | An Amazon kernel image.
ARI | An Amazon ramdisk image.
AMI | An Amazon machine image.

Portability of image formats depends on the specific virtual machine management platform. Some platforms accept more than one image format and allow direct import into what the platform considers the native image format; VirtualBox, for example, uses VDI as its native format but can also import and export VMDK images. Third-party utilities exist that allow the transformation of one image format into another, thus enabling VM and application portability. In the case of complex applications that rely on the composition of two or more VMs, a certain amount of effort is required to redefine the deployment schema when porting an application packaged as VMs to a different cloud.

OVF

The Open Virtualization Format (OVF) Specification describes an open, secure, portable, efficient and extensible format for the packaging and distribution of software to be run in virtual machines. The key properties of the format are the following:
- support for content verification and integrity checking based on industry-standard public key infrastructure, together with a basic scheme for the management of software licensing;
- support for validation of the entire package and of each virtual machine or metadata component of the OVF during the installation phases of the VM lifecycle management process; the package also carries relevant user-readable descriptive information that a virtualization platform can use to streamline the installation experience;
- support both for standard single-VM packages and for packages containing complex, multi-tier services consisting of multiple interdependent VMs;
- virtualization-platform neutrality, while also enabling platform-specific enhancements to be captured; OVF supports the full range of virtual hard disk formats used by hypervisors today and is extensible, which allows it to accommodate formats that may arise in the future; virtual machine properties are captured concisely and accurately;
- no reliance on the use of a specific host platform, virtualization platform or guest operating system;
- immediate usefulness and extensibility: the standard is designed to be extended as the industry moves forward with virtual appliance technology, and it supports and permits the encoding of vendor-specific metadata to support specific vertical markets;
- support for user-visible descriptions in multiple locales and for localization of the interactive processes during installation of an appliance, which allows a single packaged appliance to serve multiple market opportunities.
OVF has arisen from the collaboration of key vendors in the industry, and it is developed in an accepted industry forum as a future standard for portable virtual machines. It is not an explicit goal for OVF to be an efficient execution format: a hypervisor is allowed, but not required, to run software in virtual machines directly out of the Open Virtualization Format.

TOSCA
TOSCA (Topology and Orchestration Specification for Cloud Applications) is an OASIS open standard that defines an interoperable description of services and applications hosted on the cloud and elsewhere, including their components, relationships, dependencies, requirements and capabilities. As a result, it enables portability and automated management across cloud providers regardless of the underlying platform or infrastructure, thus expanding customer choice, improving reliability, and reducing cost and time-to-value. These characteristics also facilitate the portable, continuous delivery of applications (DevOps) across their entire lifecycle. In short, they enable a much higher level of agility and accuracy for business in the cloud. TOSCA enables businesses to capture and automate the use of expert knowledge so that service and application requirements can be automatically matched to corresponding cloud service provider capabilities, enabling a truly competitive ecosystem in which cloud platform and service providers can move beyond commoditization in order to compete, innovate and better serve the accelerating needs of cloud-based businesses.

5.2.6 Containers
Containers rely on Linux kernel namespaces and control groups (cgroups) to provide secure, sandboxed environments that are semantically equivalent to, but more lightweight and portable than, virtual machines. Typically, the host machine is controlled by a clean and minimal base Linux OS while any software stack runs in a container. The mechanism gives system operators, software deployers and developers plenty of flexibility, since they can configure and deploy diverse or even conflicting software stacks and versions on the same host, and it simplifies the process of launching and revoking them on demand. Containers are particularly useful for porting whole software stacks and setups from host to host with minimal effort or reconfiguration. Almost all container systems are currently limited to running on Linux, mostly because the Linux kernel provides the cgroups functionality, which allows limitation and prioritization of resources (CPU, memory, block I/O, network, etc.) and namespace isolation. There are initiatives to develop container-like systems on other platforms too, for instance FreeBSD jails, Solaris Containers and Docker for Windows.

DOCKER206
Docker is a technology and software platform providing a runtime engine for the deployment of Linux containers, typically over LXC. Every Docker container can be ported as-is to another host OS and is guaranteed to execute in exactly the same way. Beyond the runtime engine that creates and runs Docker containers (runtime environments) from images (build-time environments, described in plain-text "Dockerfiles"), the Docker software stack provides the following capabilities:
- Docker Compose defines multi-container applications in a plain-text (YAML) file
- Docker Hub, the company's hosted registry service for managing images
- Docker Registry provides open source Docker image distribution
- Docker Trusted Registry (DTR) supplies a private dedicated image registry
- Docker Swarm hosts clustering and container scheduling
- Kitematic, the desktop GUI for Docker.
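A small sketch of driving the engine programmatically, assuming the Docker SDK for Python (docker-py) is installed and a local Docker daemon is running; the image tag is only an example.

    import docker  # Docker SDK for Python (an assumption; any Docker client would do)

    client = docker.from_env()  # connect to the local Docker daemon

    # Pull an image and run a throw-away container; the same image behaves
    # identically on any Docker host, which is the portability argument above.
    output = client.containers.run(
        "python:3.9-slim",
        ["python", "-c", "print('hello from a container')"],
        remove=True,
    )
    print(output.decode())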

LXC207
Linux Containers (LXC) is an OS-level virtualization environment for running isolated Linux systems on a single Linux host. LXC provides operating-system-level virtualization through a virtual environment that has its own process and network space. It can be used as a Docker driver or independently. Starting with the LXC 1.0 release, it is possible to run containers as regular users on the host using "unprivileged containers".

VIRTUOZZO AND OPENVZ208
Virtuozzo is a proprietary OS-level virtualization system for Linux and Windows hosts, tailored to meet the needs of enterprises. OpenVZ209 is an open source OS-level virtualization system that provides containers, virtual private servers (VPSs) and virtual machines (VMs) using a single

patched Linux kernel, and therefore runs only on Linux. Both systems provide a high level of isolation, feature some hypervisor capabilities and allow for the migration of containers between hosts of the same type.

5.2.7 Cloud marketplaces
A cloud marketplace provides customers with access to software applications and services that are built on, integrate with or complement the cloud provider's offerings. A marketplace typically provides customers with native cloud applications and approved (certified) appliances created by third-party authors. Applications from third-party developers not only help the cloud provider fill niche gaps in its portfolio and meet the needs of more customers, but also give the customer peace of mind, knowing that all purchases from the vendor's marketplace will integrate with each other smoothly. Interoperability at the marketplace level implies that a potential end user can query multiple marketplaces from different cloud providers and that the appliances are usable in different clouds, not only with the provider hosting the marketplace. At the IaaS layer at least, this requires that marketplaces use open standard formats to store VM appliances, that they use common formats for expressing application metadata, and that the metadata are searchable through common vocabularies and query expressions. In the latter case, application description standards like TOSCA can play an important role in enabling marketplace interoperability and integration. Examples of cloud marketplaces include:
- AWS Marketplace: helps customers find, buy and use software and services that run in the Amazon Elastic Compute Cloud (EC2).
- Oracle Marketplace: offers a comprehensive list of apps for sales, service, marketing, talent management and human capital management.
- Microsoft Windows Azure Marketplace: an online market for buying and selling Software as a Service (SaaS) applications and research datasets.
- Salesforce.com's AppExchange: provides business apps for sales representatives and customer relationship management (CRM).

5.3 Cloud computing standards organizations and initiatives
A significant number of organizations and consortia are currently active in the area of cloud computing standardization. These groups typically focus on different levels of a cloud architecture, and they may or may not have a commercial interest in standards exploitation. In Table 19 and below we briefly present the most significant standardization initiatives, since it is worth keeping an eye on their work and areas of application. Table 20 summarizes initiatives and international projects whose mandate and goals are relevant to cloud computing interoperability issues.

Table 19. Cloud computing standard organizations and initiatives.
Acronym | Description | Cloud-related standards
SNIA | The Storage Networking Industry Association (SNIA) is an association of producers and consumers of storage networking products. Its members are dedicated to "ensuring that storage networks become complete and trusted solutions across the IT community". SNIA works towards this goal by forming and sponsoring technical work groups, by producing the Storage Developers Conference (SDC) and Data Storage Innovation (DSI) conferences, by building and maintaining a vendor-neutral Technology Center in Colorado Springs, Colorado, and by promoting activities that expand the breadth and quality of the storage networking market. | CDMI
OGF | The Open Grid Forum (OGF) is a leading standards development organization operating in the areas of grid, cloud and related forms of advanced distributed computing. The OGF community pursues these topics through an open process for the development, creation and promotion of relevant specifications and use cases. | OCCI
OASIS | OASIS is a non-profit consortium that drives the development, convergence and adoption of open standards for the global information society. OASIS promotes industry consensus and produces worldwide standards for security, Internet of Things, cloud computing, energy, content technologies, emergency management and other areas. The consortium has more than 5,000 participants representing over 600 organizations and individual members in more than 65 countries. | CAMP, IDCloud, SAF, TOSCA, CloudAuthZ, PACR
ETSI | ETSI, the European Telecommunications Standards Institute, produces globally applicable standards for Information and Communications Technologies (ICT), including fixed, mobile, radio, converged, broadcast and Internet technologies. ETSI is a not-for-profit organization with more than 800 member organizations worldwide, drawn from 64 countries and five continents. Members include the world's leading companies and innovative R&D organizations. | TC Cloud
DMTF | The DMTF (Distributed Management Task Force) is an industry standards organization working to simplify the manageability of network-accessible technologies through open and collaborative efforts by leading technology companies. The organization spans the globe with member companies and alliance partners from varied industry sectors, led by a board of directors from Broadcom Corporation; CA Technologies; Dell; Emerson Network Power; , Ltd.; Hewlett Packard Enterprise; Intel Corporation; Lenovo; Microsoft Corporation; NetApp; Software AG; Telecom Italia and VMware, Inc. | OVF, CIMI

Table 20. Initiatives relevant to cloud computing interoperability issues.
Acronym | Description
NIST210 | The US National Institute of Standards and Technology has an established cloud computing programme. The long-term goal of the programme is to provide thought leadership and guidance around the cloud computing paradigm to catalyse its use within industry and government. NIST aims to shorten the adoption cycle, which will enable near-term cost savings and an increased ability to quickly create and deploy enterprise applications. NIST aims to foster cloud computing systems and practices that support interoperability, portability and security requirements that are appropriate and achievable for important usage scenarios.
CloudWatch211 | CloudWATCH is a European observatory which focuses on cloud policies, standard profiles and services. It is an EU project funded under FP7. The project aims to increase awareness of cloud computing standards, continuously evaluates interoperability barriers and promotes standards adoption by cloud providers and cloud consumers alike. The project has established the CloudWatch Hub, an online registry of cloud-related activities in Europe, including EU-funded projects, standardization organizations, academic research, etc.
PaaSport212 | PaaSport is an EU-funded project under FP7. The project's goal is to enable European Cloud vendors (in particular SMEs) to roll out semantically interoperable PaaS offerings and to enable European software SMEs to deploy business applications on the best-matching Cloud PaaS and to seamlessly migrate these applications on demand. PaaSport contributes to aligning and interconnecting heterogeneous PaaS offerings, overcoming the vendor lock-in problem and lowering switching costs.
PaaSage213 | PaaSage is an open and integrated platform to support both the deployment and the design of Cloud applications, together with an accompanying methodology that allows model-based development, configuration, optimization and deployment of applications independently of the existing underlying Cloud infrastructure technology.

5.4 Production and commercial cloud offerings
Below we briefly survey popular cloud offerings which are significant either because of their breadth and popularity or, in the case of ~Okeanos, because of their direct relevance to the OpenMinTeD infrastructure.

AMAZON EC2 AND S3214
The Elastic Compute Cloud (EC2) is the IaaS part of the Amazon Web Services cloud and offers Xen-based virtual servers (instances) that can be instantiated from Amazon Machine Images (AMIs). Instances are available in a variety of sizes, operating systems, architectures and prices. CPU capacity of instances is measured in Amazon Compute Units. Each instance provides a certain amount of non-persistent disk space; a persistent disk service (Elastic Block Storage) allows virtual disks to be attached to instances. In summary, Amazon EC2 provides the following features:
- multiple data centres available in the US (East and West) and Europe
- CLI, web services (SOAP and Query), web-based and console UI
- access to VMs mainly via SSH (Linux) and Remote Desktop (Windows)
- automatic scaling
- load balancing.
Amazon Simple Storage Service (S3)215 provides online object storage for AWS users. Amazon S3 is managed through a simple web service and a command-line interface. It offers a range of storage classes designed for different use cases (general-purpose storage of frequently accessed data, long-lived but less frequently accessed data, and long-term archives). S3 can be used alone or together with other AWS services such as EC2 and AWS Identity and Access Management (IAM), as well as with third-party storage repositories and gateways. Amazon S3 is also integrated with Hadoop, and can be used as a storage backend instead of HDFS.

RACKSPACE216
Rackspace, a founder of OpenStack, offers a portfolio of IT services, including Managed Hosting and Cloud Computing. The company's IaaS software services that power its virtual infrastructure are the following:
- Cloud Servers provides computing power
- Cloud Networks creates an internal network of cloud servers
- Cloud Block Storage emulates a removable disk, which can be transferred from one cloud server to another
- Cloud Images preserves a consistent starting point for new Cloud Servers instances
- Cloud Files provides long-term storage for an unlimited number of objects
- additional services such as user authentication, load balancing and event monitoring.

MICROSOFT AZURE217
Microsoft Azure, the company's full cloud platform, provides scalable and highly available IaaS and PaaS services on a hybrid platform. The IaaS offering (named Virtual Machines) provides easy VM

deployment integrated with community-driven solutions like Chef, Puppet and Docker, along with other products like Oracle Database and Oracle WebLogic Server. Azure PaaS provides platform APIs to build and host applications on the cloud, for both Windows and Linux VMs. It uses the virtual hard disk (VHD) format, which enables interaction with other virtualization platforms (e.g., Microsoft Hyper-V, VirtualBox). The Microsoft platform originated from the goal of facilitating application deployment for developers familiar with Microsoft Visual Studio and the .NET programming environment, but has emerged as a platform for web applications hosted on the cloud. Languages supported: Java, Python, PHP, .NET, Node.js, Ruby, Mobile (iOS, Android, Windows) and Media Services.

GOOGLE CLOUD PLATFORM
Google provides both IaaS and PaaS solutions, running in Google data centers. Compute Engine218 offers tooling and workflow support for running VMs, scaling from single instances to global, load-balanced cloud computing. VMs come in multiple flavours with respect to hardware specifications and hard disk type (normal or SSD), support Linux and Windows, and can be combined with the Google (Docker) container engine. Google App Engine219 is Google's PaaS offering. It uses application containers with defined APIs and services, thus enabling users to develop and execute web apps on the Google infrastructure without having to manage servers. GAE provides four programming languages, various kinds of data storage (including MySQL databases and proprietary storage types), an on-demand pricing model, and proprietary APIs and SDKs. Languages supported: Java, Python, PHP and Go.

~OKEANOS/CYCLADES AND PITHOS+
Cyclades220 is the compute service of the ~Okeanos IaaS. It is implemented, supported and provided by GRNET as open source software, offering a compute service to academic and research communities. It is fully compatible with the OpenStack specifications and provides virtual machines built from predefined system images. It features advanced networking capabilities (private networks on demand, detachable IPs), a public key wallet, detachable disk volumes, a friendly web UI with in-browser graphical VM access, a RESTful API, Python API libraries and command-line tools for administration and development. Interoperability is maintained through compatibility with OpenStack, support for PKI at the VM level, the use of OAuth authentication and the association with other institutions through GEANT's Shibboleth mechanism. Pithos+221 is the storage service of the ~Okeanos IaaS. It is also implemented, supported and provided by GRNET as open source software, offering a storage service to academic and research communities. It is an OpenStack-compatible object storage service. Internally, it features data deduplication and object hashing. It is accessible through a RESTful API, a web UI, a Python API library and a command-line tool. The image service of Cyclades relies on Pithos+ to store

system images, while disk volumes can potentially be exported to Pithos+ storage space. Access to Pithos+ is controlled by the same mechanisms as in Cyclades.

DROPBOX222
Dropbox is a proprietary file hosting system provided by Dropbox Inc., targeted at end-user needs (syncing, sharing, client user experience). From a functional point of view, syncing and sharing are limited to the scope of a single local folder per user. Billing is flexible, with users starting with an adequate free service and a number of offers and expansion packages. It supports SSO and OAuth and provides APIs in many languages (Python, .NET, Java, JavaScript, PHP, Ruby). Dropbox also provides an API for integration with OpenStack storage.

BOX223
Box is a proprietary file storage and content management service. It offers a limited free service and a number of paid packages tailored to enterprise needs. It features sharing and collaboration tools and can be integrated with Google Apps. From the integration point of view, it does not support Linux or other Unix platforms (apart from Android) and it lacks an open authentication mechanism. The company has released the OpenBox client, which in some cases can overcome some of these interoperability limitations.

5.5 Distributed Processing
In this section we take a brief look at popular distributed processing platforms that are commonly used to provide cloud PaaS solutions, usually in the context of Big Data and analytics processing. As mentioned, standardization at this level is difficult to attain, due to the diversity of architectures and of the problems these solutions try to solve. Thus, the trend at this level is usually to adopt de facto and/or popular APIs and tools, in an attempt to ensure a wider application scope and a larger tool base. Table 21 offers a comparison of Hadoop software stack providers.

5.5.1 MapReduce & Apache Hadoop
MapReduce is the most prevalent abstraction model for batch big-data processing. It partitions the data into smaller parts and distributes them, together with the algorithmic code, for execution to the nodes of a larger cluster (Map phase). Following that, the results from each node are combined to compute the final solution (Reduce phase). MapReduce was originally used at Google in order to manage the huge data structures needed for web search. The most widely used open source implementation of the MapReduce paradigm is Apache Hadoop224. Now fully developed as an ecosystem [White 2015], Hadoop allows end users to write their MapReduce algorithms in many programming languages (Java, Python, C, among others), while offering the following (a minimal word-count example in the MapReduce style is given after the lists below):
- the Hadoop Distributed Filesystem (HDFS), for storing the cluster data, with built-in file replication capabilities
- the MapReduce programming model

- complete cluster, node and job management, so that every job is assigned to nodes, executed and monitored by the system
- the capability to scale out linearly by adding new nodes to the cluster
- no need for expensive hardware
Around this "basic" system, several other open source projects have been developed to offer a richer and broader range of applications, including:
- SQL-like queries (Hive, Pig)
- NoSQL databases (HBase)
- data transfer and streaming from/to HDFS (Flume, Sqoop, Kafka)
- workflow and orchestration (Oozie)
- machine learning libraries (Mahout)
- real-time and interactive data processing (Spark, Flink)
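To make the Map and Reduce phases concrete, the sketch below is the classic word count written with the mrjob library (an assumption; plain Hadoop Streaming scripts would work equally well). The mapper emits (word, 1) pairs and the reducer sums the counts per word.

    from mrjob.job import MRJob


    class MRWordCount(MRJob):
        """Count word occurrences across the input files."""

        def mapper(self, _, line):
            # Map phase: emit a (word, 1) pair for every token in the line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Reduce phase: sum the partial counts produced for each word.
            yield word, sum(counts)


    if __name__ == "__main__":
        # Runs locally by default; with "-r hadoop" it runs on a Hadoop cluster.
        MRWordCount.run()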

5.5.2 Distributed processing providers
AMAZON ELASTIC MAPREDUCE (EMR)
Amazon offers cloud big-data processing services through EMR, which is based on Hadoop and tightly coupled with EC2, S3 and other Amazon Web Services. Through a user-friendly web interface, users can create a multi-node Hadoop cluster with a variety of options (VM type, additional Hadoop software components, etc.) and submit jobs. Moreover, they can monitor the status of each job, as well as of the whole cluster. A major advantage of EMR is the integration with the S3 storage system: since Hadoop has built-in support for S3, processing power can be decoupled from the storage infrastructure.
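Clusters can also be managed programmatically; as a minimal sketch, the snippet below uses boto3 to list the EMR clusters that are currently running or waiting (region and credentials are assumed to be configured in the environment).

    import boto3  # AWS SDK for Python

    emr = boto3.client("emr", region_name="eu-west-1")  # example region

    # Enumerate clusters that are currently active.
    clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])
    for cluster in clusters["Clusters"]:
        print(cluster["Id"], cluster["Name"], cluster["Status"]["State"])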

CLOUDERA CDH
The Cloudera distribution (CDH) is offered by the company as an open source Hadoop solution; additional support is provided through commercial products. CDH contains a rich Hadoop ecosystem, with many components pre-installed. Two key differentiating factors for the CDH distribution are (1) the Cloudera Manager, a wizard for streamlining the creation of a multi-node Hadoop cluster, and (2) the inclusion of Hue (Hadoop User Interface), a unified web GUI platform for managing all of the installed software.

MAPR
MapR also follows the open-plus-commercial support model for Hadoop. The company contributes to Apache Hadoop projects like HBase, Pig, Hive and ZooKeeper. MapR was selected by Amazon to provide an upgraded version of Amazon's Elastic MapReduce (EMR) service and has also been selected by Google as a technology partner.

MapR provides three versions of its product, known as M3, M5 and M7. M3 is a free version of the M5 product with reduced availability features. M7 builds on M5, adding an HBase-compatible API directly in the file-system layer.

HORTONWORKS
Hortonworks develops and offers the Hortonworks Data Platform (HDP). The Hortonworks team actively contributes to the design, development and testing of Apache Hadoop. HDP provides the actual Apache-released versions of the components with all the latest enhancements to make the components interoperable in a production environment, together with an installer that deploys the complete Apache Hadoop stack to the user's cluster. Furthermore, the company has developed a Windows Hadoop stack.

Table 21. Comparison between Hadoop software and providers.
Features | Amazon | MapR | HortonWorks | Cloudera
Product name | Elastic MapReduce (EMR) | MapR | Hortonworks Data Platform | Cloudera Hadoop Distribution
Free edition | - | MapR M3 | HDP | CDH
Hadoop components | Hive, Hue, Mahout, Pig, Spark | Cascading, Drill, Flume, HBase, Hive, Hue, Impala, Juju, Mahout, Oozie, Pig, Spark, Sqoop, ZooKeeper | Hive, Pig, Spark, Zookeeper, HBase, Ambari, Storm, Kafka, Cascading | Hue, Hive, Oozie, Pig, Spark, Zookeeper, Avro, Flume, HBase, Solr, Sqoop, Mahout, Whirr
Storage systems | HDFS, S3 | HDFS, HttpFS | HDFS, WebHDFS | HDFS, Fuse-DFS
Security | Amazon account, EC2 keypair | Kerberos | Kerberos (authentication), XASecure (management) | Cloudera Manager, Kerberos, role-based administration and audit trails
Admin interface | Web and CLI | Administrative interfaces | Administrative interfaces | Cloudera Manager
Monitoring, administration and administrative tools | Centralized management and lifecycle management for Hadoop clusters | MapR Heatmap cluster alerting | - | -
Installation | Wizard | MapR quick installer | Apache Ambari, wizard-based deployment | Cloudera Manager

6. Legal Issues
TDM must be conducted in compliance with the applicable legal framework. The area of Intellectual Property law — with copyright and database rights in the spotlight — is likely the most relevant and problematic from this standpoint. The legal framework does not always present itself as a clear and coherent system of rules and exceptions, but rather as a patchwork of unclear rules and undefined exceptions. This situation often contributes to stifling the scientific activity of researchers and therefore needs to be addressed. Accordingly, this part of the report focuses on the relevant IP rights. Copyright in the content and/or in the database structure and the Sui Generis Database Right (SGDR) in qualifying DBs are likely to be present in most of the resources that miners intend to analyze. In these cases, it can be said that the resources are under an "all rights reserved" default rule, making it necessary to verify under which conditions the use and further distribution of the original and of the mined content is permitted. These conditions are usually, though not necessarily, contained in licenses or other documents intended to regulate the use of specific content, tools or services (copyright licenses, public licenses, Terms of Use, Acceptable Use Policies, Service Level Agreements, etc.) and build upon the aforementioned (IP) property rights (and exceptions). Unfortunately, in many instances the legal documents that regulate the use and reuse of the DB or other resources appear as lengthy and complex ad hoc (i.e., not standardized) legal agreements that the researcher is not prepared or trained to understand. This is not only a question of possessing the proper legal skills, but also a matter of transaction costs: even in those situations where a specifically trained lawyer is available, the number of legal documents to be analyzed and the lack of standardization in the documents, clauses and conditions employed sharply contrast with the scientific and academic need for clear, efficient and interoperable rules on the use and reuse of sources. An example can illustrate the situation. Even if some DBs are said to be Open Access, this term, despite having received a rather clear definition, is nonetheless inappropriately employed in a variety of forms that not only may imply different legal requirements but may even be in contrast with each other. More importantly, Open Access is a (fundamental) statement of principles that has to be properly translated into appropriate legal documents (licenses), whereas merely stating that a resource is OA only adds confusion and uncertainty in a field in deep need of the opposite. In other words, due to the inconsistent and inappropriate use of the term Open Access, it is often not possible to combine two resources released under the same "Open Access label", regardless of the intention of the right holders. While it is clear that the reason for such an inefficient situation does not rest with the concept of Open Access itself but rather with an incorrect use of the term, the resulting situation is one where the use and reuse of information is stifled, not facilitated. From an interoperability point of view, it is important to consider what happens when several resources with different licenses are required to interact. Each license may have different requirements and conditions regulating the use of the resulting content.
A lack of license standardization and interoperability is a clear stumbling block for researchers who wish to adopt TDM in their research. Deeper and clearer interoperability rules between these licenses are essential for the swift adoption of TDM outside of its own community.

In this section, we have categorized the different forms of licenses and/or service agreements into the following structure. Firstly, "Content licensing" covers licenses which are attached to specific corpora and other static resources. Secondly, "Tool licensing" looks at the licenses attached to different tools such as software. The third category, "Terms of Use", considers agreements which are attached to specific services, such as those provided as part of a web interface for TDM.

6.1 Content licensing
From a legal standpoint, content usually refers to data, alphabetic or numerical entries, texts, articles, papers, and collections of words such as vocabularies and corpora, usually organized in the form of databases. It is important to note that this "content" can receive different types of protection:
- copyright, usually on the single elements of the database when these are original works of authorship (e.g., scientific papers)
- the sui generis database right (SGDR) on databases that were made thanks to a "substantial investment".
As a matter of fact, copyright could also protect the database as such, but this is only possible when the database structure (the selection or arrangement of contents) is original in the sense of being the author's own intellectual creation. This latter situation is not common for many databases in the scientific field and, more importantly, the scope of protection only extends to the structure of the database and not to its content. Therefore, for the purpose of most, if not all, TDM activities, this form of protection is not relevant. What can represent a real barrier to TDM are the two other forms of protection: copyright on the elements of the database and the SGDR on the database itself. Copyright on the elements of a database (DB): copyright protects works of authorship such as scientific, literary or artistic works. Therefore, when a DB is composed of journal articles, original photographs, musical compositions, etc., these will most likely be protected by copyright. Other items such as sound recordings, non-original photographs (only in some cases), broadcasts, performances and fixations of films (e.g., the audiovisual recordings of birds hatching in their natural environment) can constitute protected subject matter even though, technically speaking, these do not constitute works protected by copyright, but "other subject matter" protected by rights related to copyright, also known as neighboring rights. Copyright prevents acts such as making copies (total or partial, permanent or temporary) and redistributing those copies in verbatim or modified form in the absence of authorization. Neighboring rights offer similar, though not identical, protection. The SGDR is a peculiar EU form of protection for databases, which are protected regardless of any originality. What is protected here is the "substantial investment", in quantitative or qualitative terms, that the maker of the database puts into it. This substantial investment can take the form of time, money, labor or any other resources spent in the making of a DB. Importantly, when talking about "making" the database, the substantial investment has to be in the obtaining,

verification and presentation of the data and not in their creation. So, for example, a football league cannot benefit from SGDR protection in the fixture lists of the teams playing in the league, as these data are considered to be created. The extent to which scientific databases can be said to be constituted by created or obtained data is not clearly settled in case law. In particular, the dichotomy between creating and obtaining data is not necessarily solved at the epistemological level. Consequently, given the likely — but not certain — presence of the aforementioned forms of protection (copyright in the content when it is, e.g., a collection of journal articles or photos, and SGDR protection when the maker of the DB put substantial investment into the obtaining, verification or presentation of the data), "content" has to be licensed under licenses capable of properly addressing the identified rights. In fact, when those rights are present, the default situation is that of "all rights reserved", and even if the database is publicly available on the Internet, acts such as reproduction and distribution are not permitted, unless of course specific exceptions and limitations to copyright apply. Licenses such as the Creative Commons Public License (CCPL) version 4 are a perfect example of a license type that properly addresses both copyright and the SGDR in the licensed work. In particular, by applying a CCPL 4.0 to a DB such as a website or a repository of journal articles, the licensor (the person who applies the license and who needs to be the right holder or be authorized by the right holder to do so) is giving permission to reuse: a) the SGDR in the database; b) copyright in the DB in the limited cases in which copyright applies to the DB structure; and c) copyright and/or related rights in the elements (works such as journal articles and original photographs) composing the DB. While other open content licenses may also achieve the same results (and WG3 will offer a license compatibility list), the convergence towards one, or a few, licenses that can be seen as a de facto standard is not only desirable but also essential in order to lower the transaction costs associated with license compatibility and therefore to facilitate the use and reuse of resources.

6.2 Tool licenses
In order to perform certain TDM activities, it may be necessary to use specific tools, usually software-based. Software is generally protected by copyright, and therefore the legal documents that regulate the use of software tools are usually copyright licenses. Some of these licenses have restrictive terms regarding the types of use that can be made of the software: not only what can be done with the software itself (e.g., prohibition on copying it and further distributing it), but also how it may be used (e.g., for personal or for commercial purposes). These types of licenses are usually referred to as End User License Agreements (EULAs), or sometimes as "commercial" or "proprietary" licenses. Other licenses are based on a different paradigm: not only should software be shared and improved, but no restrictions as to the type of use should be imposed. These licenses are usually referred to as FLOSS (Free/Libre Open Source Software) licenses. The most common in this category are the GNU GPL, BSD (and its variants), Apache, LGPL, Mozilla and Eclipse licenses.

In general terms, it is necessary to verify the requirements present in tool licenses in order to determine whether they pose any restriction on the types of use that a researcher involved in TDM activities may be carrying out. In particular, the "compatibility" of tool licenses with content licenses needs to be ascertained on a case-by-case basis, a situation that may lead to unnecessary legal barriers and associated costs (time, money, expertise) in TDM activities. The use of FLOSS licenses, which are based on the principles identified above, grants the highest degree of compatibility, as usually no restrictions on the use and reuse of content are present. Some of these licenses have a specific clause, called copyleft, which requires that derivative works of the software be released under the same license. It does not seem that the results of a TDM analysis can be considered derivative works of the software used, and therefore this condition should not apply to the TDM results. Nevertheless, it seems important to verify on a case-by-case basis that this technical assumption holds.

6.3 Service Terms of Use
Service licenses refer to the terms of use applied to specific services used to perform TDM on content which is found either on the service provider's servers or is contributed by the user. In the case of terms of use, what is regulated by the agreements is not the use of content or tools but that of a service. Importantly, in these cases, copyright, the SGDR and related rights do not play the key role seen above for content and tools. While the service may be based on the use of specific software or other subject matter covered by copyright, the usual nature of the transaction is one where there is no copying or redistribution of any such software or content (although at a more general level there may be a combination of all of the aforementioned). Therefore, from this point of view, it is perhaps more correct to refer to these agreements as Terms of Use (ToU, sometimes also referred to as Terms of Service (ToS), Terms and Conditions or Acceptable Use Policies). The main distinction with respect to copyright licenses (commonly referred to as just licenses) is that ToU do not usually base their functioning on an underlying property right such as copyright and related rights (although this is not a constant). This aspect may have important legal consequences for the enforceability of the licenses/agreements and for third-party effects (that is, whether the licenses have legal effects towards parties other than the party offering the agreement and the one accepting it). Sometimes, the use of these services is limited by the service provider to certain uses (e.g., non-commercial or scientific uses) or is otherwise restricted (in the maximum amount of data that can be processed, when, by whom, etc.). The enforceability of these agreements and their compatibility with the use of content that is distributed under specific licenses is an important aspect that will be covered by OpenMinTeD WG3. Additionally, WG3 will verify the need to develop a specific definition of OA-compliant ToU.

6.4 Data protection

Certain uses of databases may be limited by the presence of personal data, i.e., of any information relating to an identified or identifiable person. Personal data are particularly likely to be collected in the medical, financial, political, banking, insurance and educational fields, as well as in many others. When personal data are present, their "processing", which includes acts such as collection, recording, organization, storage, consultation and transmission, is subject to specific rules. Of particular relevance for present purposes is the principle that personal data should not be processed without the data subject's consent (unless other circumstances listed in Art. 7 of Directive 95/46/EC apply). The data subject's consent to the processing of his or her personal data must be specific to the type of information and to the type of use, meaning that any new use may require a new expression of consent from the data subject. Consequently, a general declaration whereby the data subject gives permission for the processing of data "for research purposes" may not be specific enough to satisfy the legal requirements (although it should be verified, in the light of national legislation, whether Member States have implemented specific rules in this field). In other words, the possibility to "contractualise" the data subject's consent in a standard form contract, as has been shown to be common practice in the copyright field, should be tested on a case-by-case basis, keeping in mind that this possibility, when available, may follow rules that are different from those identified for copyright purposes. FLOSS/OA licenses usually omit any reference to personal data in their scope. While this field requires further attention, our project will not deepen the analysis here (but will refer to other projects studying this area).

6.5 License interoperability

In the light of the above, it becomes of paramount importance to determine the level of interoperability, or compatibility, of different licenses. In particular, interoperability has to be present at two levels: i) within each of the categories identified above (i.e., interoperability of different content licenses, of different tool licenses and of different ToS); ii) across those categories (i.e., interoperability of content licenses with tool licenses and with ToS). The aspect of license interoperability is central to the analysis of WG3, which will concentrate on the legal and contractual requirements of different licenses. The analysis will furthermore attempt to determine the conditions under which interoperability can be achieved. An important parameter in this analysis is the copyright concept of derivative works. In fact, many licenses, and in particular FLOSS/OA licenses, impose certain restrictions on the creation of derivative works, namely the condition that the derivative works be licensed under the same license. This condition, known as copyleft or share-alike, is however only triggered in the case of derivative works; when two incompatibly licensed works do not form a derivative but are only "packaged" or distributed together, the copyleft/share-alike requirement is not triggered. An example of this can be seen in GNU GPL v2 and v3. Both are strong copyleft licenses, meaning that if a derivative work is formed on the basis of two pieces of software, one under GPL v2 and the other under GPL v3, the two licenses' requirements cannot be satisfied at the same time, since one piece of software requires derivatives to be released under GPL v2 and the other under GPL v3: GNU GPL v2 and v3 are not compatible. Yet this becomes an issue only when a derivative work is created (e.g., an executable formed from GPL v3 software and a statically linked GPL v2 library). In contrast, a common FLOSS operating system (e.g., Debian, Ubuntu, SUSE, etc.) can combine, in the same distribution, components licensed under different and incompatible licenses because, generally speaking, the OS is not a derivative work of the software composing it. Similarly, a music CD can contain songs under different and incompatible licenses (CC BY-SA v4, all rights reserved, etc.) without this becoming an issue. However, if two songs are combined into a new one (a derivative work), license compatibility becomes an issue. While the literature on license compatibility within the fields of content and tools is developing (although there are still areas that are largely unexplored), the field of ToS intra- and cross-compatibility remains largely unexplored.
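To make the distinction between mere aggregation and derivative works more concrete, the following sketch (illustrative only, and not the analysis that WG3 will carry out) encodes a toy, hand-crafted compatibility relation between a few licenses and shows that incompatibility blocks combination only when a derivative work is created. The license identifiers and the compatibility table are simplifying assumptions for illustration, not legal statements.

```python
# Illustrative sketch only; the compatibility table below is a deliberately
# simplified assumption and must not be read as legal advice.

# Pairs of licenses assumed here to be combinable in a single derivative work.
# Real-world compatibility is often directional and far more nuanced.
COMPATIBLE = {
    ("MIT", "GPL-2.0-only"),
    ("MIT", "GPL-3.0-only"),
    ("Apache-2.0", "GPL-3.0-only"),
    # GPL-2.0-only / GPL-3.0-only is deliberately absent: they are incompatible.
}

def compatible(lic_a: str, lic_b: str) -> bool:
    """True if the two licenses are assumed able to cover one derivative work."""
    return lic_a == lic_b or (lic_a, lic_b) in COMPATIBLE or (lic_b, lic_a) in COMPATIBLE

def can_combine(lic_a: str, lic_b: str, derivative: bool) -> bool:
    """Mere aggregation (packaging together) never triggers copyleft; derivatives do."""
    return compatible(lic_a, lic_b) if derivative else True

if __name__ == "__main__":
    # Shipping GPL v2 and GPL v3 components side by side in one distribution: fine.
    print(can_combine("GPL-2.0-only", "GPL-3.0-only", derivative=False))  # True
    # Statically linking them into one executable creates a derivative work: blocked.
    print(can_combine("GPL-2.0-only", "GPL-3.0-only", derivative=True))   # False
```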

7. Discussion

This section provides a discussion of the landscape introduced in the previous sections, highlighting related efforts, describing the limitations of this study and finally giving a higher-level overview.

7.1 Related efforts

This report is not the first effort by the natural language processing and text mining community towards the assessment and promotion of interoperability in the field. A similar document, entitled The Standards' Landscape Towards an Interoperability Framework [Monachini 2011], was produced as part of the Fostering Language Resources Network (FLaReNet) project. That project aimed at developing a shared policy for the domain at the EU level, and the report presents an overview of the standards landscape at that time. In addition to efforts carried out in the context of large international projects, individual members of the community frequently try to increase the level of interoperability in those areas of the landscape in which their interests lie. In the case of individual static TDM resources, this usually means choosing one of the recognized standards, while in the case of large ecosystems, design is undertaken with interoperability in mind. UIMA and GATE serve as examples of the latter: both frameworks have a deliberately simple core model of documents, annotations and features, on top of which components can build their own more complex models. Systems based on UIMA try to adhere to these principles. An example is Argo, a web-based workflow management system that supports multiple type systems, which means that arbitrary components may be integrated within its interface, as long as they follow the UIMA principles. Inevitably, trying to achieve small-scale interoperability between individual groups of tools leads to the repeated implementation of inter-format converters. This activity, which requires much time and effort on the part of the resource providers and diverts their attention from more important issues, could be avoided if universally accepted interoperability standards, rules and formats were adopted.
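As an illustration of the kind of simple core model referred to above, the sketch below defines a document with stand-off annotations (typed spans carrying arbitrary feature maps). It is only a schematic approximation of what UIMA and GATE share conceptually; the class and method names are our own and do not correspond to the actual UIMA or GATE APIs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Annotation:
    """A stand-off annotation: a typed span over the document text plus a feature map."""
    ann_type: str                              # e.g. "Token", "Sentence", "GeneMention"
    begin: int                                 # character offsets into the document text
    end: int
    features: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]

    def select(self, ann_type: str) -> List[Annotation]:
        return [a for a in self.annotations if a.ann_type == ann_type]

# A component only needs to read and add annotations of this shape, regardless of
# which other components produced them; richer type systems are layered on top.
doc = Document("p53 regulates apoptosis.")
doc.annotations.append(Annotation("GeneMention", 0, 3, {"id": "TP53"}))
print([doc.covered_text(a) for a in doc.select("GeneMention")])   # ['p53']
```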

7.2 Limitations of this study

This report, assessing the current state of the field, constitutes the first step towards overcoming the difficulties of interoperability highlighted above. However, it is important to note that it does have its limitations. First of all, during its preparation, our survey of the field of TDM revealed its truly vast nature. Since individual experts usually have very deep knowledge of their own sub-fields, the compilation of this report required intensive cooperation between experts to obtain a general and wide-ranging overview. Additionally, there is significant variation in the relative popularity and usage of the different items that constitute the TDM landscape (tools, standards, formats, etc.). For example, in the case of language processing software, there exist a small number of widely known large ecosystems, a significant number of more specialized tools and countless small programs with very few users. This makes it particularly challenging to carry out a broad, exhaustive and unbiased survey of the landscape, since only people who are regular users of a particular solution can thoroughly understand its limitations. Finally, it has been necessary to sub-divide the area of interest by classifying known items into categories, in order to facilitate comparison. Unfortunately, assigning items to pre-defined categories is frequently impossible, since some systems (especially large ones, such as GATE) concentrate on comprehensively fulfilling users' needs and therefore do not fit precisely into such categories. As a result, it should be understood that the breakdown provided in this report is only one possible way of classifying the landscape.

7.3 Overview of the field

If we were to describe the field in one word, it would be "diverse". No matter which type of element we consider, i.e., tools, standards, techniques, etc., we always encounter extremely varied approaches to the same problem. Unfortunately, many solutions are short-lived and/or lack regular maintenance, documentation or promotion. They seem to be developed for the purposes of individual research projects and/or to generate publications, after which they are abandoned or maintained only within their source environment. Frequently, in-house formats are used, or common techniques such as XML or RDF are employed, which are so general that they help little in terms of interoperability. There exist several large communities developing comprehensive ecosystems (e.g., Argo, Stanford NLP, GATE), all of which enforce strict rules to maintain interoperability, but only within their own ecosystem. Although bridging the gap between them poses a real challenge, efforts to address such problems do exist (e.g., UIMA-GATE wrappers) and help to facilitate the re-use of existing software. However, users are likely to encounter difficulties even before using a certain tool, since finding the best solution for a particular problem can also pose considerable challenges. Large component repositories aim to solve this problem, but they can still require considerable effort on the part of the user to find what they are looking for, because they lack hierarchical categorization, comprehensive metadata or advanced search mechanisms. When attempting to find a solution using metadata descriptions, a number of difficulties may arise. Firstly, a wide variety of terms and vocabulary may be used to describe components that have identical or similar features. Additionally, there are considerable discrepancies between the range and types of information represented in different metadata schemas. Finally, the schemas tend to be extremely coarse-grained, which means that the bulk of the details about each item described must be provided as free-text descriptions. This makes it extremely problematic to carry out inter-schema mapping. At the same time, it is important to strike a balance between the detail and usability of schemas: if a standard is too detailed, it may deter potential users. An example is the TEI guidelines: the excessive detail of the original specification, which runs to some 1,700 pages, has led to the creation of a reduced version of the standard, called TEI Lite.
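The following sketch illustrates the inter-schema mapping problem mentioned above, using two invented metadata records that describe the same hypothetical tool under two made-up schemas. The field names, record content and crosswalk are assumptions for illustration only; they do not correspond to any of the actual schemas surveyed earlier in this report.

```python
# Illustrative only: the schemas, field names and record content below are invented.

# The same tool described under hypothetical "schema A".
record_a = {
    "title": "ExampleTagger",
    "subject": "biomedical NLP",
    "description": "POS tagger; input: plain text (UTF-8); output: CoNLL format",
}

# Crosswalk from hypothetical "schema B" fields to schema A fields (None = no counterpart).
b_to_a = {
    "resourceName": "title",
    "domain": "subject",
    "inputFormat": None,    # only present as free text in record_a["description"]
    "outputFormat": None,
}

def crosswalk(record: dict, mapping: dict) -> dict:
    """Populate target-schema fields from source fields where a counterpart exists."""
    return {target: (record.get(source) if source else None)
            for target, source in mapping.items()}

print(crosswalk(record_a, b_to_a))
# {'resourceName': 'ExampleTagger', 'domain': 'biomedical NLP',
#  'inputFormat': None, 'outputFormat': None}
# The fine-grained fields stay empty: the information exists only as free text.
```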

7.4 Gap analysis

To conclude, we provide an analysis of what is currently missing in the landscape from the point of view of the OpenMinTeD goals. First of all, there is an urgent need for clear documentation, using a common format, that covers at least the most widely used types of solutions. At present, a newcomer to the TDM field has no "starting point", i.e., it can seem an overwhelming challenge to understand the basic principles or to choose the tool, ecosystem or standard that best suits their needs. Secondly, an open-domain TDM registry, facilitating effective and efficient search for the necessary tools, would also be of great help, but is currently lacking. Existing repositories generally cover only a very small part of the whole field, concentrating on a particular domain, language or type of item. Moreover, issues of interoperability should not tie the user into a single ecosystem. Currently available solutions do not make it possible to build workflows made up of components originating from different systems. Following on from this, there is a need for a clear presentation of the legal consequences of constructing a particular workflow consisting of static resources, components and services, each used under a specific license agreement. Additionally, the use of distributed computing remains rare in TDM, even though applying one of the many existing distributed computing techniques could provide parallel execution and seamless mobility of computing resources. It is not possible to ensure interoperability between all the solutions present in the current landscape (e.g., all annotation formats). Therefore, it is crucial to carefully select those solutions that have the largest number of users and that would benefit most from the new possibilities offered by OpenMinTeD. Finally, it is necessary to cooperate with standardization bodies and task forces that have previously been set up within the field and that may be active in the future. Among those definitely worth noting are the World Wide Web Consortium (W3C), the International Organization for Standardization (ISO), the Text Encoding Initiative (TEI) and the National Institute of Standards and Technology (NIST).


8. References

[NIST 2011] The NIST Definition of Cloud Computing, NIST Special Publication 800-145, http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
[Euro-Par 2014] Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II.
[White 2015] Tom White, Hadoop: The Definitive Guide, 4th edition, O'Reilly Media, 2015.
[O'Reilly 2012] O'Reilly Radar Team, Planning for Big Data, 2nd release, O'Reilly, 2012.
[SEI 2012] Grace A. Lewis, The Role of Standards in Cloud-Computing Interoperability, Software Engineering Institute, Technical Note CMU/SEI-2012-TN-012, October 2012.
[CSCC 2014] Cloud Standards Customer Council, Interoperability and Portability for Cloud Computing: A Guide, November 2014.
[Monachini 2011] Monica Monachini, Valeria Quochi, Nicoletta Calzolari, Núria Bel, Gerhard Budin, Tommaso Caselli, Khalid Choukri, Gil Francopoulo, Erhard Hinrichs, Steven Krauwer, Lothar Lemnitzer, Joseph Mariani, Jan Odijk, Stelios Piperidis, Adam Przepiorkowski, Laurent Romary, Helmut Schmidt, Hans Uszkoreit, Peter Wittenburg, The Standards' Landscape Towards an Interoperability Framework, July 2011.

1 http://dublincore.org/documents/dces/, http://dublincore.org/documents/dcmi-terms/
2 http://www.language-archives.org/OLAC/metadata.html
3 http://www.tei-c.org/Guidelines/
4 http://jats.nlm.nih.gov/
5 https://guidelines.openaire.eu/en/latest/
6 http://rioxx.net/v2-0-final/
7 https://www.datacite.org/sites/default/files/document/DataCite-MetadataSchema_V31_Final_8-24-2015_0.pdf
8 http://www.crossref.org/
9 http://eurocris.org/cerif/main-features-cerif, http://zenodo.org/record/17065#.VlRJkHYrKUk
10 http://www.w3.org/TR/vocab-dcat/
11 http://ckan.org/features/#search
12 http://www.clarin.eu/content/component-metadata, http://www.clarin.eu/ccr/
13 http://www.meta-net.eu/meta-share/metadata-schema, http://www.meta-share.org/portal/knowledgebase/home


14 http://www.resourcebook.eu/searchll.php
15 http://alveo.edu.au/about/, http://www.lrec-conf.org/proceedings/lrec2014/pdf/628_Paper.pdf
16 https://gate.ac.uk/userguide/chap:creole-model, http://uima.apache.org/d/uimaj-current/tools.html#ugr.tools.cde
17 http://blog.sonatype.com/2011/02/maven-indexer-sonatypes-donation-to-repository-search/
18 https://www.docker.com, https://hub.docker.com
19 http://www.w3.org/Submission/SWSF-SWSO/
20 http://api.openaire.eu/#d0e33
21 https://www.openarchives.org/pmh/
22 https://www.openarchives.org/ore/
23 http://www.openarchives.org/rs/toc
24 http://www.w3.org/TR/prov-o/
25 http://ontoware.org/swrc/
26 http://bibliontology.com/, http://purl.org/ontology/bibo/
27 http://okfnlabs.org/bibjson/
28 https://wiki.creativecommons.org/wiki/CC_REL
29 https://www.w3.org/community/odrl/model/2-1/
30 https://spdx.org
31 https://www.oclc.org/dewey.en.html
32 http://www.udcc.org/index.php/site/page?view=about
33 http://id.loc.gov/authorities/subjects.html
34 http://eurovoc.europa.eu/
35 https://www.nlm.nih.gov/mesh/introduction.html
36 http://edamontology.org/page
37 http://dictionary.casrai.org/Output_Types
38 https://www.coar-repositories.org/activities/repository-interoperability/ig-controlled-vocabularies-for-repository-assets/deliverables/
39 https://www.biocatalogue.org/
40 https://www.biodiversitycatalogue.org
41 https://annomarket.eu/
42 https://www.openaire.eu/
43 http://core.ac.uk/
44 http://base-search.net
45 http://catalog.elra.info/index.php
46 https://www.ldc.upenn.edu/
47 http://aims.fao.org/vest-registry
48 http://agroportal.lirmm.fr/
49 http://bioportal.bioontology.org/
50 http://www.meta-share.org/
51 https://vlo.clarin.eu
52 http://www.resourcebook.eu/
53 http://www.zenodo.org/
54 https://datahub.io/


55 http://clip.ipipan.waw.pl/LRT
56 http://lod.iula.upf.edu/index-en.html
57 https://www.ortolang.fr
58 https://centres.clarin.eu/
59 http://metashare.ilsp.gr/
60 http://alveo.edu.au/
61 http://langrid.org/en/index.html
62 http://www.lappsgrid.org/, http://galaxy.lappsgrid.org/
63 http://qt21.metashare.ilsp.gr/
64 http://universal.elra.info/
65 https://centres.clarin.eu
66 http://clarin.eu/content/national-consortia
67 http://www.meta-net.eu/projects/cesar/consortium
68 http://www.meta-net.eu/projects/meta-nord/consortium
69 http://www.meta-net.eu/projects/METANET4U/consortium
70 http://grid.anc.org:8080/service_manager/language-services
71 http://www.meta-share.eu/, http://www.meta-share.org/
72 https://migale.jouy.inra.fr/redmine/projects/alvisnlp
73 https://migale.jouy.inra.fr/redmine/projects/alvisnlp/repository
74 http://ctakes.apache.org
75 http://svn.apache.org/viewvc/ctakes/
76 http://opennlp.apache.org
77 http://svn.apache.org/viewvc/opennlp/
78 https://uima.apache.org
79 http://svn.apache.org/viewvc/uima/
80 http://argo.nactem.ac.uk
81 https://github.com/BlueBrain/bluima
82 https://github.com/BlueBrain/bluima
83 https://cleartk.github.io/cleartk
84 https://github.com/ClearTK/cleartk
85 https://dkpro.github.io/dkpro-core
86 https://github.com/dkpro/dkpro-core
87 https://gate.ac.uk
88 http://svn.code.sf.net/p/gate/code/gate/trunk/
89 http://heartofgold.dfki.de
90 https://heartofgold.opendfki.de/repos/trunk/
91 http://julielab.github.io
92 https://github.com/JULIELab
93 http://www.nltk.org
94 https://github.com/nltk/nltk
95 http://cordis.europa.eu/ist/kct/alvis_synopsis.htm
96 http://www.quaero.org/module_technologique/alvis-nlp-alvis-natural-language-processing/
97 https://uima.apache.org/uimafit.html
98 https://uima.apache.org/doc-uimaducc-whatitam.html


99 http://elki.dbs.ifi.lmu.de/
100 http://elki.dbs.ifi.lmu.de/wiki/Algorithms
101 http://elki.dbs.ifi.lmu.de/browser
102 https://galaxyproject.org/
103 https://toolshed.g2.bx.psu.edu/
104 https://github.com/galaxyproject/galaxy/
105 https://kepler-project.org/
106 http://library.kepler-project.org/kepler/
107 https://code.kepler-project.org/code/kepler/
108 http://www.knime.org/knime
109 https://pegasus.isi.edu
110 https://github.com/pegasus-isi/pegasus
111 http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/
112 http://www.taverna.org.uk/
113 https://www.biocatalogue.org
114 http://www.trianacode.org
115 https://github.com/CSCSI/Triana
116 https://wiki.galaxyproject.org/Community
117 https://wiki.galaxyproject.org/ToolShed
118 http://library.kepler-project.org/kepler/
119 http://ptolemy.eecs.berkeley.edu/
120 http://www.taverna.org.uk/introduction/taverna-in-use/by-domain/
121 http://www.taverna.org.uk/introduction/services-in-taverna/
122 http://argo.nactem.ac.uk/
123 https://webanno.github.io/webanno
124 https://gate.ac.uk
125 https://gate.ac.uk/teamware
126 http://brat.nlplab.org
127 https://bibliome.jouy.inra.fr/alvisae/demo/AlvisAE/
128 https://demo.bmd-software.com/egas
129 http://wordfreak.sourceforge.net
130 https://www.tagtog.net/
131 http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.html
132 http://mmax2.net
133 http://knowtator.sourceforge.net
134 http://cordis.europa.eu/ist/kct/alvis_synopsis.htm
135 http://www.quaero.org/module_technologique/alvisae-alvis-annotation-editor/
136 http://www.unicode.org/
137 http://bioc.sourceforge.net/
138 https://sites.google.com/site/bionlpst2013/file-formats


139 http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.html
140 https://dev.twitter.com/overview/api/tweets
141 http://www.tei-c.org/index.xml
142 http://www.openannotation.org/
143 https://uima.apache.org/
144 https://gate.ac.uk/
145 http://brat.nlplab.org/index.html
146 https://www.iso.org/obp/ui/#iso:std:37326:en
147 http://persistence.uni-leipzig.org/nlp2rdf/
148 http://www.loria.fr/projets/TMF/
149 http://www.lexicalmarkupframework.org/
150 http://www.w3.org/2004/02/skos/core/
151 http://www.w3.org/2001/sw/wiki/OWL
152 http://oboedit.org/
153 https://www.w3.org/community/ontolex/
154 http://en.wikipedia.org/wiki/Translation_Memory_eXchange
155 http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html
156 http://mlif.loria.fr/
157 http://www.ttt.org/oscarStandards/tbx/
158 http://www.ttt.org/clsframe/datcats.html
159 http://lemon-model.net/
160 http://wordnet.princeton.edu/wordnet/license/
161 https://en.wikipedia.org/wiki/Wikipedia:Copyrights
162 https://wikimediafoundation.org/wiki/Terms_of_Use
163 http://www.opencyc.org/license
164 http://wiki.dbpedia.org/about
165 https://developers.google.com/freebase/faq
166 https://dkpro.github.io/dkpro-uby/
167 http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/
168 http://nachhalt.sfb632.uni-potsdam.de/owl/
169 http://linguistics-ontology.org/info/about
170 https://uts.nlm.nih.gov//license.html
171 http://geneontology.org/page/use-and-license
172 http://aims.fao.org/standards/agrovoc/faq
173 http://www.gesis.org/en/services/research/thesauri-und-klassifikationen/social-science-thesaurus/
174 https://wordnet.princeton.edu/
175 https://en.wikipedia.org/
176 https://www.wikidata.org
177 http://www.opencyc.com/
178 http://wiki.dbpedia.org/
179 http://www.freebase.com/
180 https://dkpro.github.io/dkpro-uby/


181 http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago
182 http://nachhalt.sfb632.uni-potsdam.de/owl/
183 http://linguistics-ontology.org/
184 https://www.nlm.nih.gov/research/umls/
185 http://geneontology.org/
186 http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
187 http://www.gesis.org/en/services/research/thesauri-und-klassifikationen/social-science-thesaurus/
188 https://www.ogf.org
189 http://aws.amazon.com/
190 https://www.openstack.org
191 http://www.snia.org
192 https://aws.amazon.com/s3
193 http://docs.openstack.org/developer/swift/
194 https://azure.microsoft.com/en-us/services/storage/blobs/
195 https://cloud.google.com/storage/
196 https://cloud.google.com/storage/docs/interoperability
197 http://libcloud.apache.org
198 http://jclouds.incubator.apache.org
199 http://deltacloud.apache.org
200 http://www.ansible.com
201 https://www.chef.io
202 https://puppetlabs.com
203 http://saltstack.com
204 http://juju.ubuntu.com
205 https://en.wikipedia.org/wiki/Comparison_of_open-source_configuration_management_software
206 http://www.docker.com
207 http://linuxcontainers.org
208 http://openvz.org/Virtuozzo
209 http://openvz.org
210 http://www.nist.gov/itl/cloud/
211 http://www.cloudwatchhub.eu
212 http://www.paasport-project.eu
213 http://www.paasage.eu
214 https://aws.amazon.com/ec2/
215 https://aws.amazon.com/s3/
216 http://www.rackspace.com
217 https://azure.microsoft.com/
218 https://cloud.google.com/compute/
219 https://cloud.google.com/products/app-engine/
220 http://cyclades.okeanos.grnet.gr
221 http://pithos.okeanos.grnet.gr


222 http://dropbox.com
223 http://box.com
224 http://hadoop.apache.org
