Semantic Data Integration

Prof. Dr. Taysir H. Soliman and Marwa H. Abdel Reheim
Information Systems Department, Faculty of Computers and Information
Assiut University, Egypt
BioDialog Team

BioDialog Summer School, Hurghada 2017

Finding Relevant Data … A Big Problem

A Lot of Heterogeneous Data Everywhere?

• Text Data
• Image Data
• Excel Sheets
• Public Databases
• Signal Data

In 2007 Jim Gray described the effects of the Data Deluge on the sciences (Hey, Tansley, and Tolle 2009). Science was originally led by the experimental and theoretical paradigms, but some natural phenomena were not easily addressed by analytical models. Computational simulation then arose as a new paradigm, enabling scientists to deal with these complex phenomena. Ever larger amounts of data followed, from simulations and particularly from advanced exploration instruments (large-scale telescopes, particle colliders, etc.). Scientists were no longer interacting directly with the phenomena; instead they used powerful computational configurations to analyze the data gathered from simulations or captured by instruments. The sky maps built from Sloan Digital Sky Survey observations and the evidence found for the Higgs boson are just two success stories of yet another paradigm, which Gray called the fourth paradigm: eScience.

Example 1

• Find data relevant to a change of temperature affecting a kind of agriculture

Example 2

• Find data relevant to a publication in Excel sheets, e.g. Air Temperature??

Example 3

• Find relevant data on related web sites, e.g. GBIF: “auriculata” …

Data from GBIF

phylum        class            order           family           genus       species
Tracheophyta  Magnoliopsida    Myrtales        Lythraceae       Ammannia    Ammannia auriculata
Tracheophyta  Magnoliopsida    Myrtales        Lythraceae       Ammannia    Ammannia auriculata
Tracheophyta  Magnoliopsida    Ericales        Primulaceae      Anagallis   Anagallis arvensis
Tracheophyta  Magnoliopsida    Boraginales     Boraginaceae     Arnebia     Arnebia hispidissima
Tracheophyta  Magnoliopsida    Fabales         Fabaceae         Astragalus  Astragalus sieberi
Tracheophyta  Magnoliopsida    Fabales         Fabaceae         Astragalus  Astragalus sieberi
Ascomycota    Lecanoromycetes  Teloschistales  Teloschistaceae  Caloplaca   Caloplaca erythrina
Tracheophyta  Magnoliopsida    Cucurbitales    Cucurbitaceae    Cucurbita   Cucurbita maxima
Tracheophyta  Magnoliopsida    Asterales       Asteraceae       Centaurea   Centaurea scoparia

.csv data

Two oleananes from Ammannia auriculata Willd.
Gohar AA, Maatooq GT, Mrawan EM, Zaki AA, Takaya Y.

Abstract: Two new compounds, 3-β,15-α,23,28-tetrahydroxyolean-12-en-3-O-arabinopyranoside and 3-β,23,28-trihydroxyolean-12-en-3-O-β-D-glucopyranoside, were isolated from the aerial parts of Ammannia auriculata along with the known compounds kaempferol, β-sitosterol-3-O-β-D-glucoside, 2-α,3-β,23-trihydroxyolean-12-en-28-oic acid-28-O-β-D-glucopyranoside, quercetin, kaempferol-3-O-α-L-arabinofuranoside, kaempferol-3-O-β-D-xylopyranoside and ellagic acid. Structures of these compounds were elucidated on the basis of their spectroscopic data (NMR, UV, MS and IR spectra). The antioxidant activities of the total extract, the CH2Cl2, EtOAc and remaining aqueous fractions, together with compounds 1, 6 and 9, were comparable with that of the standard antioxidant, ascorbic acid.

Data from GBIF

phylum        class            order           family           genus       species
Tracheophyta  Magnoliopsida    Myrtales        Lythraceae       Ammannia    Ammannia auriculata
Tracheophyta  Magnoliopsida    Myrtales        Lythraceae       Ammannia    Ammannia auriculata
Tracheophyta  Magnoliopsida    Ericales        Primulaceae      Anagallis   Anagallis arvensis
Tracheophyta  Magnoliopsida    Boraginales     Boraginaceae     Arnebia     Arnebia hispidissima
Tracheophyta  Magnoliopsida    Fabales         Fabaceae         Astragalus  Astragalus sieberi
Tracheophyta  Magnoliopsida    Fabales         Fabaceae         Astragalus  Astragalus sieberi
Ascomycota    Lecanoromycetes  Teloschistales  Teloschistaceae  Caloplaca   Caloplaca erythrina
Tracheophyta  Magnoliopsida    Cucurbitales    Cucurbitaceae    Cucurbita   Cucurbita maxima
Tracheophyta  Magnoliopsida    Asterales       Asteraceae       Centaurea   Centaurea scoparia

.csv data
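The link between the GBIF table and the publication can be sketched in a few lines. This is only a minimal illustration that assumes the table is available as CSV text and that the publication is represented by its title. The function name `species_mentioned` and the three sample rows are chosen here for illustration.

```python
import csv
import io

# A few rows from the GBIF table, written out as CSV text.
GBIF_CSV = """phylum,class,order,family,genus,species
Tracheophyta,Magnoliopsida,Myrtales,Lythraceae,Ammannia,Ammannia auriculata
Tracheophyta,Magnoliopsida,Ericales,Primulaceae,Anagallis,Anagallis arvensis
Tracheophyta,Magnoliopsida,Fabales,Fabaceae,Astragalus,Astragalus sieberi
"""

# Title of the publication shown next to the table.
PUBLICATION_TITLE = "Two oleananes from Ammannia auriculata Willd."


def species_mentioned(csv_text, document):
    """Return the distinct species from the CSV that occur in the document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    text = document.lower()
    found = []
    for row in reader:
        name = row["species"]
        # Plain, case-insensitive substring matching; no semantics involved.
        if name.lower() in text and name not in found:
            found.append(name)
    return found


matches = species_mentioned(GBIF_CSV, PUBLICATION_TITLE)
print(matches)
```

Exact string matching of this kind fails on synonyms and spelling variants, which is one reason the rest of the lecture turns to ontologies and semantic annotation.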

Outline

• Semantic Data Integration
• Semantic Web
• Ontologies
• Semantic Data Annotation
• Hands-On Tutorial

Ammannia auriculata

The question is: do we need more integration? What data do we need to integrate, and how?

More Examples in the tutorial

Semantic Annotation Example

Time for Hands-On Tutorial

Semantic Data Integration
Marwa Hussein (Hands-on Tutorial)

Outline

• Introduction
• NCBO BioPortal
• Protégé
• Semantic Annotations
• RightField

Introduction

• “An ontology is an explicit specification of some topic.”

• A formal vocabulary of terms, and the relationships among them, for representing and communicating knowledge about some topic.

Introduction

• Classes

[Diagram: class Animal with the classes Carnivore and Herbivore below it]

Introduction

• Classes + Object Properties

[Diagram: classes Animal, Plant, Carnivore and Herbivore, connected by is_a and eats links]

Introduction

• Classes + Object Properties + Individuals

[Diagram: classes Animal, Plant, Carnivore and Herbivore, connected by is_a and eats links; individuals Lion and Antelope attached by is_a links to Carnivore and Herbivore]

Introduction

• An ontology together with some individuals is considered a knowledge base.
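The Animal example can be written down as a tiny knowledge base of (subject, predicate, object) triples. The following is a minimal sketch in plain Python, not the representation used by OWL or Protégé. The direction of the two `eats` links is an assumption, and the helper name `ancestors` is hypothetical. The slides use is_a both for class-to-class and individual-to-class links, so the sketch does the same, whereas OWL distinguishes subclass and instance relations.

```python
# Classes, object properties and individuals from the example, as triples.
TRIPLES = {
    ("Carnivore", "is_a", "Animal"),
    ("Herbivore", "is_a", "Animal"),
    ("Carnivore", "eats", "Animal"),   # assumed direction of the eats link
    ("Herbivore", "eats", "Plant"),    # assumed direction of the eats link
    ("Lion", "is_a", "Carnivore"),     # individual
    ("Antelope", "is_a", "Herbivore"), # individual
}


def ancestors(term, triples):
    """Follow is_a links upwards and return every class the term belongs to."""
    result = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "is_a" and obj not in result:
                result.add(obj)
                frontier.append(obj)
    return result


print(sorted(ancestors("Lion", TRIPLES)))
```

Even this small structure allows a simple inference: Lion is never stated to be an Animal, but following the is_a links shows that it is one.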

Outline

Introduction
• NCBO BioPortal
• Protégé
• Semantic Annotations
• RightField

NCBO BioPortal

• An open repository of biomedical ontologies.

• Ontologies are available in different representation formats (e.g. OWL, OBO, UMLS).

• Provides a wide range of tools, accessible via the BioPortal web site or the BioPortal web API.

• BioPortal also includes community features for adding notes, reviews, and even mappings to specific ontologies.

BioPortal - Tools

• BioPortal contains tools to:
  – Browse ontologies
  – Search terms
  – Browse mappings
  – Recommend ontologies

BioPortal - Ontology Browser

Ontology name

BioPortal - Term Searcher

Ontologies containing the melanoma concept
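The same term search is also available through the BioPortal web API. The sketch below only builds the request URL and does not contact the service. The base address and the `q`, `apikey` and `ontologies` parameters reflect the public REST API as commonly documented and should be checked against the current API documentation. The function name `build_search_url` is hypothetical and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

# Base address of the BioPortal REST API (assumption, see the note above).
API_BASE = "https://data.bioontology.org"


def build_search_url(term, api_key, ontologies=None):
    """Build a term-search URL; `ontologies` optionally restricts the search."""
    params = {"q": term, "apikey": api_key}
    if ontologies:
        # Ontologies are passed as a comma-separated list of acronyms.
        params["ontologies"] = ",".join(ontologies)
    return f"{API_BASE}/search?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder; a real key comes with a BioPortal account.
url = build_search_url("melanoma", "YOUR_API_KEY", ontologies=["ENVO"])
print(url)
```

The resulting URL can be opened with any HTTP client; the response is expected to be JSON listing the matching classes.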

BioPortal - Mappings Browser

Ontologies and the number of their mappings with the ENVO ontology

BioPortal - Ontology Recommender

Selected ontologies and their scores for each selection criterion

The Environment Ontology (ENVO)

• ENVO comprises classes (terms) referring to key environment types, which may be used to facilitate the retrieval and integration of a broad range of biological data.

ENVO Ontology

https://bioportal.bioontology.org/ontologies/ENVO

Outline

Introduction
NCBO BioPortal
• Protégé
• Semantic Annotations
• RightField

Protégé

• http://protege.stanford.edu/

• A free, open-source platform that provides its user community with a suite of tools to construct domain models and knowledge-based applications with ontologies.

Protégé - Biodiversity Ontology (BOF)

Class hierarchy

Description of each class

Protégé - Biodiversity Ontology (BOF)

Object Properties hierarchy

Description of each property

Protégé - Biodiversity Ontology (BOF)

Data Properties hierarchy

Description of each property

Protégé - Biodiversity Ontology (BOF)

Protégé - Biodiversity Ontology (BOF)

OntoGraf: to visualize the ontology

Outline

Introduction
NCBO BioPortal
Protégé
• Semantic Annotations
• RightField

Semantic Annotations

• Semantic annotation means attaching data to some other piece of data.

Semantic Annotation - An Example

• “Aristotle, the author of Politics, established the Lyceum”

• To semantically annotate this sentence:
  1. Analyze the text and identify the concepts:
     • Aristotle as a Person
     • Politics as a written work of political philosophy
  2. Classify and interlink the identified concepts in a semantic graph database:
     • Aristotle can be linked to his date of birth, his students and his works.
     • Politics can be linked to its subject, its date of creation, etc.

Semantic Annotation - An Example

• Algorithms will then be able to automatically:
  – Find out who tutored Alexander the Great.
  – Answer which of Plato’s pupils established the Lyceum.
  – Retrieve a list of political thinkers who lived between 380 and 310 BC.
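The first two questions can be answered with a very small graph of facts. This is a minimal sketch that stands in for a semantic graph database. The predicate names (`author_of`, `pupil_of`, `tutored`, and so on) and the helper `subjects` are illustrative assumptions rather than terms from a real vocabulary.

```python
# Facts about the annotated entities, stored as (subject, predicate, object).
FACTS = [
    ("Aristotle", "type", "Person"),
    ("Aristotle", "author_of", "Politics"),
    ("Aristotle", "established", "Lyceum"),
    ("Aristotle", "pupil_of", "Plato"),
    ("Aristotle", "tutored", "Alexander the Great"),
    ("Politics", "type", "Work of political philosophy"),
]


def subjects(predicate, obj, facts=FACTS):
    """Return every subject linked to `obj` by `predicate`."""
    return {s for s, p, o in facts if p == predicate and o == obj}


# Who tutored Alexander the Great?
tutors = subjects("tutored", "Alexander the Great")

# Which of Plato's pupils established the Lyceum?
founders = subjects("pupil_of", "Plato") & subjects("established", "Lyceum")

print(tutors, founders)
```

The second query combines two independent links by intersecting their results, which is the kind of join a graph query language performs over interlinked annotations.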

Outline

Introduction
NCBO BioPortal
Protégé
Semantic Annotations
• RightField

RightField

• An open-source tool.

• Creates semantically aware Excel spreadsheet templates, which scientists can reuse to collect and annotate their data.

• Presents the terms of the selected ontologies to users as a simple drop-down list.
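The idea behind the drop-down list, restricting a cell to terms from a chosen ontology, can be illustrated without RightField itself. The sketch below is not RightField's API (RightField is a desktop application). The vocabulary, the `EX:` identifiers and the function `annotate_column` are all hypothetical.

```python
# Hypothetical controlled vocabulary: labels mapped to made-up term IDs.
# A real template would take its terms and IDs from an ontology such as ENVO.
ALLOWED_TERMS = {
    "forest": "EX:0001",
    "desert": "EX:0002",
    "river": "EX:0003",
}


def annotate_column(values, vocabulary):
    """Pair each accepted cell value with its term ID; collect rejected ones."""
    annotated, rejected = [], []
    for value in values:
        key = value.strip().lower()
        if key in vocabulary:
            annotated.append((value, vocabulary[key]))
        else:
            rejected.append(value)
    return annotated, rejected


annotated, rejected = annotate_column(["Forest", "river ", "woods"], ALLOWED_TERMS)
print(annotated)
print(rejected)
```

A drop-down list prevents the rejected case from arising in the first place, because users can only pick values that already map to an ontology term.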

RightField – An Example

Ontologies to get annotations

Annotated terms


Tasks for Semantic Data Integration

• Choose some keywords related to the field of your project (e.g. the species, environmental conditions, etc.).

• Use the BioPortal “Ontology Recommender” tool to select appropriate ontologies for your keywords.

• Open one or more ontologies in Protégé and explore their classes and properties; you can also visualize them.

• Finally, use RightField to annotate your data, or simply your keywords, using the appropriate ontologies.

References

• https://bioportal.bioontology.org/
• http://protege.stanford.edu/
• Fensel, Dieter. Knowledge Acquisition, Modeling and Management: 11th European Workshop, EKAW'99, Dagstuhl Castle, Germany, May 26-29, 1999, Proceedings. No. 1621. Springer Science & Business Media, 1999.
• Buttigieg, Pier Luigi, et al. "The environment ontology: contextualising biological and biomedical entities." Journal of Biomedical Semantics 4.1 (2013): 43.
• Wolstencroft, K., Owen, S., Horridge, M., Krebs, O., Mueller, W., Snoep, J.L., du Preez, F., Goble, C.A. "RightField: embedding ontology annotation in spreadsheets." Bioinformatics 27.14 (2011): 2021-2022.