
SCALING ONTOLOGY ALIGNMENT

by

RYAN E. FRECKLETON

B.S.C.p.E, University of Colorado Colorado Springs, 2008

A thesis submitted to the Graduate Faculty of the

University of Colorado Colorado Springs

in partial fulfillment of the

requirements for the degree of

Master of Science in Computer Science

Department of Computer Science

2013

© Copyright by Ryan E. Freckleton 2013

All Rights Reserved

This thesis for the Master of Science in Computer Science degree by Ryan E. Freckleton has been approved for the Department of Computer Science by

Dr. Jugal Kalita, Chair

Dr. Charles Shub

Dr. Lisa Hines

Dr. Suzette Stoutenburg

Date

Freckleton, Ryan E. (M.S.C.S., Computer Science)

Scaling Ontology Alignment

Thesis directed by Professor Dr. Jugal Kalita

Abstract

As ontologies become more prevalent in biomedicine and other fields, effective ontology alignment is necessary for their economical and practical use. An ontology is a group of concepts derived from a corpus of knowledge. Ontology alignment determines the relationships between these concepts across different ontologies. Ontology alignment is therefore an area of active research, especially scaling ontology alignment, as the number and size of ontologies increase dramatically. This thesis describes an approach to and implementation of ontology alignment called Parallel Ontology Bridge, which maintains good alignment quality while increasing the scalability and speed of ontology alignment by matching linguistic and structural features in a support vector machine. This approach is based on Ontology Bridge [1] and provides the same advantages. It handles non-equivalence relationships very effectively and is a general approach to ontology alignment that can be used across many domains. Parallel Ontology Bridge increases scalability by using map-reduce, an approach to breaking down problems and running them in parallel. This thesis describes how this is done. Parallel Ontology Bridge is almost two orders of magnitude faster than Ontology Bridge and shows very good scalability while maintaining quality as measured through F-Measure. The results of Parallel Ontology Bridge are compared against several other scalability approaches, both with experimental data and theoretical maximum scalability. Parallel Ontology Bridge is significantly more scalable on the experimental data and maintains this advantage under theoretical analysis.

To my Montessori teacher, who always knew the joy of learning and understanding.

Acknowledgements

I’d like to acknowledge all the people who have positively affected the creation of this thesis: my employer, The MITRE Corporation, my coworkers, my advisory committee and my family. I’d like to especially thank Dr. Suzette Stoutenburg. Without her help and previous work in this area I would not have been able to create this thesis. I’d especially like to thank my mother, Irene Freckleton, and father, Grover Freckleton, for their emotional support as well as deep discussions on the concepts of ontology alignment and graphical presentation. I’d also like to thank my Aunt Karen, whose excitement was infectious and whose way with words helped make this thesis succinct and clear. My friend, Dr. Gregory Plett, gave me incomparable help and advice with typesetting. Tim Flink, my friend and fellow graduate student, saw architectural issues I was blind to. My friend and colleague Dr. Norman Facas gave unparalleled advice on organization and the appropriate layout of graphs and data. Thank you, Dr. Lisa Hines, for giving me one-on-one attention to get up to speed on biology and medicine. Dr. Charlie Shub, thank you for your continued support and focus. Your mentorship during my undergraduate studies prepared me for this thesis and my professional career. Finally, I’d like to thank my advisor, Dr. Jugal Kalita. His expertise in artificial intelligence has been unparalleled. Without his assistance, feedback and support this thesis would not have been possible to complete. It’s been a long, sometimes stressful journey on this path of knowledge. I appreciate all that you’ve done for me. Thank you.

Table of Contents

1 Background on Ontologies and Ontology Alignment
  1.1 Purpose
  1.2 Ontologies Used
  1.3 Example Applications
  1.4 Issues
  1.5 Scaling Ontology Alignment
  1.6 Achievements
  1.7 Focus of This Thesis
  1.8 Background and Organization of Thesis

2 Motivation

3 Survey of the State of the Art in Ontology Alignment
  3.1 Developments in the State of the Art
  3.2 Approach to Comparison and Analysis
  3.3 Comparing Approaches
  3.4 Comparison of Systems
    3.4.1 LOOM
    3.4.2 AROMA
    3.4.3 SOBOM
    3.4.4 Falcon AO
    3.4.5 Stoutenburg Ontology Bridge
      3.4.5.1 Branch and Bound Approach

  3.5 Survey Results

4 Definitions
  4.1 Performance Measures
  4.2 Tools
  4.3 Function Primitives
  4.4 Scaling Nomenclature
  4.5 Ontology Nomenclature

5 The Strategic Approach of Parallel Ontology Bridge
  5.1 Test Data
    5.1.1 Platelet Activation
    5.1.2 Mannose Binding
    5.1.3 Immune System
    5.1.4 Phenylalanine Conversion
    5.1.5 Bone Remodeling
    5.1.6 Bone Marrow
    5.1.7 Osteoblast Differentiation
    5.1.8 Osteoclast Differentiation
    5.1.9 Behavior
    5.1.10 Circadian Rhythm
  5.2 Attempts to Enhance Alignment
    5.2.1 Parallel Human Computation
    5.2.2 Information Entropy and Morpheme Based Extraction
  5.3 Summary of Parallel Ontology Bridge
    5.3.1 MapReduce
  5.4 Aligning Ontologies with MapReduce
  5.5 Architecture
  5.6 Comparison to Ontology Bridge
  5.7 Feature Extraction

  5.8 Implementation
    5.8.1 Issues With Java Implementation
  5.9 Parallel Ontology Bridge F-Measure
  5.10 Contributions

6 Scalability Results
  6.1 Scalability Metrics
  6.2 Scalability Experiments
  6.3 Other Systems Compared
    6.3.1 Gross’s Approaches
    6.3.2 Zhang’s Approach
  6.4 Input
  6.5 Hardware
  6.6 Comparison of Scalability
  6.7 Summary of Scalability

7 Conclusion
  7.1 Future Work

A Code Listings
  A.1 Align Implementation
  A.2 Parallel Implementation
  A.3 Primitives Implementation
  A.4 OpenCyc Implementation

Bibliography

List of Tables

2.1 Biomedical Ontology Sizes

3.1 Results From Literature

5.1 Human Computation Experiment
5.2 Chunking (segmenting strings based on information entropy) Results
5.3 Example Feature Extraction
5.4 Features
5.5 Unit Test Statement Coverage. This shows how much code in each one of these modules is covered by automated unit tests.
5.6 Cross-Validation Results

6.1 Input Data
6.2 Architecture Comparison
6.3 Scalability Metrics Comparison
6.4 Speed of Execution

List of Figures

1.1 Gene Ontology. The different shading represents different subdomains.
1.2 Mammalian Phenotype Ontology

4.1 Precision and Recall. Public domain image from WikiMedia

5.1 Ontology Alignment
5.2 Architecture
5.3 Performance of Ontology Bridge variants at various numbers of ontology pairs

6.1 Other Systems Curve Fits
6.2 Comparison of scalability of ontology alignment approaches based on data points
6.3 Comparison of extrapolated scalability of ontology alignment approaches

CHAPTER 1

Background on Ontologies and Ontology Alignment

1.1 Purpose

Modern civilization is based on a dynamic and changing foundation of information. There are 8.7 million species of lifeforms cataloged [2], 19 million articles on Wikipedia [3] in various languages and an endless number of entries in ontologies. Some of these fast-growing ontologies are in the biomedical field. An ontology is a group of concepts derived from a corpus of knowledge. Information about biomedicine is being updated on a daily basis [4]. At this rate the amount of information continues to increase and it is becoming more difficult for researchers to make sense of it [5]. Computer-based reasoning holds great promise as a solution to this problem of increasing information by allowing inferences and deductions about information and knowledge. To do this economically and practically, it is necessary to coordinate the effective development and reuse of ontologies. New tools are needed to meet these new challenges [5]. One of these new tools is ontology alignment. Ontology alignment relates two existing ontologies to each other. A general approach to ontology alignment is Ontology Bridge, described by Stoutenburg in [1]. Ontology Bridge finds multiple relationships between ontologies. Overlapping relationships and concepts are linked by semantic bridges based on linguistic features, semantic information and structure. It gives good results, but can be slow for large ontologies. This thesis builds on Ontology Bridge and increases its scalability and speed of execution. The implementation of this thesis is called Parallel Ontology Bridge.

Parallel Ontology Bridge improved upon the performance of Ontology Bridge by a factor of 19 to 48, depending on the test, while maintaining F-Measure. There are few approaches to scaling ontology alignment [6]. The scalability of Ontology Bridge was increased in [1] through a branch and bound optimization. In contrast, this thesis uses parallel execution to increase scalability. The results achieved with this parallelization were good enough that adding branch and bound was judged not to be beneficial: it would reduce the quality of alignment for no significant gain in speed. There are many purposes for ontology alignment [7]. By finding relationships between objects in disparate models, ontology alignment can be used for data integration. Agents that use logic programming can use ontology alignment to learn about new domains and integrate with systems. Aligned ontologies provide a basis for extracting information from natural language documents. Larger ontologies allow for more precise and detailed models of the world, which is the limiting factor in many of these application areas. Ontology alignment also enables interoperability of independently developed systems by providing an accurate shared semantic vocabulary [7]. As the size of the ontologies being aligned increases, the number of computations necessary grows quadratically. Therefore, effective approaches to scalability are paramount. The purpose of this thesis is to provide a solution to scaling ontology alignment.

1.2 Ontologies Used

The two ontologies used in this thesis are the Gene Ontology (GO) [4] (Figure 1.1) and the Mammalian Phenotype Ontology (MP) [8] (Figure 1.2). As of the publication of this thesis, it is estimated that the Gene Ontology consists of 32,481 OWL [9] classes. The MP ontology consists of 6,516 ontology classes. OpenCyc, a “universal” ontology, consists of 116,822 OWL classes.

Figure 1.1: Gene Ontology. The different shading represents different subdomains.

Figure 1.2: Mammalian Phenotype Ontology

The entirety of OpenCyc¹ [10] was used as an upper ontology in Parallel Ontology Bridge. To reduce the necessary execution time, 1,000-class subsets of the Gene Ontology and the Mammalian Phenotype Ontology were used for the majority of testing. The Gene Ontology is a dynamic, controlled vocabulary for eukaryotic cells. It is actively updated as daily discoveries are made. By gathering and sharing information on common genes and proteins, the hope is to help the grand unification of biology, which is the understanding of all organisms. The information in GO provides strong inference to the functions of other organisms. The goal is to enable the annotation of the genomes of all organisms using a shared system of nomenclature and understanding.

¹http://www.cyc.com/platform/opencyc — OpenCyc is an ontology and reasoning engine that aims to cover “common sense”. It defines concepts such as physical, temporal and conceptual entities and the relationships between them.

Individual species and organisms are not represented in GO, nor are specialized organs or body parts. Instead, knowledge from GO is transferred to these specific contexts through the use of species and anatomical databases. This transfer can be aided by ontology alignment. The Mammalian Phenotype Ontology provides a computationally accessible way to annotate phenotype information to individual genotypes through a shared vocabulary to describe concepts. Since annotations require more than simple vocabularies, it is implemented as an ontology. Phenotype information tends to be complex and incomplete, so these constraints are handled directly by MP. Both these ontologies are continuously updated and exist in complementary, but separate, domains. Alignment of these two ontologies will enable further and better annotation of genotype and gene expression information. Alignment will also benefit other uses of these ontologies, such as hypothesis generation and collaboration.

1.3 Example Applications

Examples of small-scale ontology alignment include database migration and interoperability between different systems. For example, the records of a library that uses the Dewey Decimal System need to be properly aligned with those of a library using a different system, such as the Library of Congress Classification. One approach to solving this problem would be to create an ontology describing each system and match them so that concepts could be translated between the systems.

1.4 Issues

Often ontology alignment occurs at a small scale and is assisted by humans. This approach is uneconomical for the large ontologies in the biomedical field. There already exist methods for matching ontologies [7]. Some use machine learning, such as AROMA [11], which finds associations, and Codi [12], which uses Markov logic to detect similarities [13]. Others use natural language processing, such as AgreementMaker [14], which finds lexical

similarity, and FALCON-AO [15], which uses statistics on virtual documents [13]. Large ontologies can take hours or days to complete alignment [16]. Doing ontology alignment by human means is infeasibly expensive and time consuming.

1.5 Scaling Ontology Alignment

Generally, the number of operations needed to align two ontologies, $o_1$ and $o_2$, grows as $O(m \cdot n)$, where

$$m = |o_1|, \qquad n = |o_2|.$$

The number of concepts in each ontology has to be compared to the concepts in the other ontology to determine relationships. Using the sizes of ontologies described in Section 1.2 gives a size on the order of

$$8 \times 10^6 \cdot 6 \times 10^5 \approx 5 \times 10^{12}$$

comparisons. Even if each comparison took only one microsecond, that would take approximately 60 days of processing. Since each concept pair in the alignment can be compared independently, this is amenable to a divide and conquer approach: first, using traditional branch and bound techniques as described in [1], and secondly, using standard distributed computing and parallelization [17]. This thesis focuses on the latter technique, reducing the processing time by executing the necessary comparisons in parallel. Parallel computation is now especially attractive because of the large amount of cloud computing resources available. The system used in this thesis was rented on Amazon Web Services for a total of $512.44 in cost and 172.8 hours of execution time.
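A quick back-of-envelope check of this estimate, using only the constants quoted above:

    # Back-of-envelope check of the processing estimate above.
    comparisons = 8e6 * 6e5          # roughly 5e12 concept pairs
    seconds = comparisons * 1e-6     # one microsecond per comparison
    print(seconds / 86400)           # about 55.6, i.e. roughly 60 days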

1.6 Achievements

The creation of Parallel Ontology Bridge achieved several advances to the state of the art. First, it is very scalable because it can take advantage of many processing units. The results described in this thesis use a 16 core machine, but it is theoretically scalable up to dozens of processing units. Secondly, it is quite fast. Parallel Ontology Bridge can process 25,000 ± 3,000 pairs per minute on the 16 core processor system. At this speed, it is able to complete the ontology alignment of the GO and MP ontologies, which have 210 million possible pairs, in 5.8 days. Due to a fault in the concurrency implementation, this entire run was not totally finished, but enough results exist to show the scalability and validity of the approach. The hardware used to run these tests was from Amazon’s new High Performance Computing² services, showing that cloud computing is an amenable fit for scaling ontology alignment. A single instance with 16 cores was created on the Amazon Elastic Compute Cloud, the software was uploaded and installed and the tests were executed. The method of parallelization used, dividing the problem using MapReduce [18] with some shared resources for ontology graph lookup, is simple and unique in this domain. The details of these results are elaborated in Chapter 6.

1.7 Focus of This Thesis

Ontology Bridge, described in [1], used feature extraction, upper ontologies and support vector machines [19] to align ontologies. Scalability was achieved through a branch and bound algorithm which sacrificed recall for speed. Parallel Ontology Bridge, described in this thesis, uses the same approach to aligning ontologies. However, instead of a branch and bound technique, Parallel Ontology Bridge uses parallelism to increase speed of execution without sacrificing alignment quality. Ontologies may have many types of relationships within them. Some items in ontologies are stated to be equivalent to each other; these are like synonyms.

²http://aws.amazon.com/hpc-applications/

Non-equivalence relationships include all others, such as hypernymy, hyponymy, antonymy or relationships that are ontology specific. The algorithm described here works with non-equivalence relationships. It can be expanded to equivalence relationships with no loss of generality. Stoutenburg’s work also described matching relationships defined in the ontologies. For simplicity, these ontology-defined relationships were omitted in this work; only hyponymy relationships were used.

1.8 Background and Organization of Thesis

The work described in this thesis is a novel and practical approach to scalable ontology alignment in the biomedical domain. First, the motivation for this work will be explained, then the state of the art will be reviewed. Finally, the novel work will be described. The approach described in this thesis provides a scalable, fast and accurate ontology alignment technique. This will be illustrated by examples in the biomedical domain.

CHAPTER 2

Motivation

There are many purposes for ontology alignment. According to the seminal source for this topic, [7], some of the most interesting relate to better reasoning over ontologies and integrating disparate data sources. Logic programming and artificial intelligence require a good model of a domain to work effectively. As such, they are limited by the accuracy of the ontologies they use. Better ontologies make logic programming and artificial intelligence effective and usable. More complete ontologies enable other applications. Some examples of these applications are machine translation, strong artificial intelligence and ontology extraction. Ontology alignment also enables interoperability of independently developed systems. By providing a shared semantic vocabulary, ontology alignment allows for the translation of data and information between systems. A simple example of this is two systems with different schemas. If the schemas are matched, they can interoperate by translating the appropriate fields. Because ontology alignment is generally an $O(n^2)$ problem, scaling ontology alignment is important [1]. As described in Chapter 5, the alignment process can easily be broken into parallel pieces. Biomedical ontologies are especially affected by scaling, since they tend to be large. Some biomedical ontologies and their sizes are given in Table 2.1. If a scalable approach to ontology alignment is not found, they cannot be aligned in a reasonable amount of time [20]. Alignments must be high quality to be applied effectively. If mistakes are included in the ontology, they will cause incorrect conclusions to be reached. Performance, scalability and precision are some of the key measures of quality for an ontology alignment algorithm. Performance is important because the ontologies must be aligned in a reasonable amount of time.

Ontology              Size
Influenza Ontology    1,368
Mammalian Phenotype   130,268
Gene Ontology         878,379
NCI                   1,758,354

Table 2.1: Biomedical Ontology Sizes

Specific use cases of ontology alignment include, but are not limited to:

• Catalog integration [7], offering products from different vendors on a single portal,

• Data integration [21], combining data sources into a single view for consumption,

• Data extraction from biomedical texts using natural language processing [5], for instance, building biomedical ontologies from research papers or textbooks,

• Peer-to-peer information sharing [22], such as between online agents which solve problems autonomously,

• Inference on biomedical information [23], like bioinformatics prediction,

• Data exchange among biomedical applications [24], for example health care databases,

• Computer reasoning with biomedical data [5], such as hypothesis generation,

• Decision support [25], such as automatic diagnostics,

• Federated databases [7], integrating multiple databases from different enterprises, and

• Encyclopedic knowledge [5][7], for example annotated Wikipedia and DBpedia.

The number of biomedical researchers interested in biomedical ontology has been rapidly expanding and it is difficult to make sense of the new biomedical information available [5]. Practitioners hope that ontology alignment tools will be incorporated into BioPortal [26], a website with many ontologies, and similar resources. Ontologies could enable the large majority of data produced by the spectrum of life sciences to be easily retrieved and understood by those working in these fields [27][5].

These benefits are similar to what happened in chemistry after the introduction of the periodic table. Scientists used the same symbols and categorizations for the elements, so researchers could understand each other’s experiments. With ontologies, biomedical information can similarly be understood through a universal model, allowing measurement and prediction across biomedical sub-domains. Ontology alignment is one of the tools needed to handle this huge task [5].

CHAPTER 3

Survey of the State of the Art in Ontology Alignment

3.1 Developments in the State of the Art

In the past few years there have been great advances in ontology alignment. Scalability has dramatically improved. Various research groups are increasing their efforts to align biomedical ontologies [20]. Previously, many automatic ontology matchers took hours or days to align larger ontologies; in contrast, modern systems, such as Falcon-AO, take only minutes to complete [6]. Fifteen different research groups took part in the OAEI¹ 2012 large scale ontology tests, more than twice the number of total participants in the 2004 OAEI [28][29]. Only recently have larger scale tests, those with tens or hundreds of thousands of items for ontology alignment, been created. This is still an area of active research and development [29]. Over the years both the accuracy and performance of large-scale ontology alignment have improved [28]. This is crucial because the size and number of ontologies used by biomedical researchers continue to increase rapidly [28].

3.2 Approach to Comparison and Analysis

Ontology alignment is similar to information retrieval in that quality is subjectively measured; there is no perfect mathematical definition of correctness [28]. Because of this,

¹The Ontology Alignment Evaluation Initiative — an annual “competition” for ontology alignment algorithm researchers, described in [28]

there can be multiple correct alignments for two given ontologies. For the purposes of the OAEI, it is assumed that there exists a unique and ideal reference alignment between any two ontologies [28]. Ontologies have distinct and sometimes contradictory ways of classifying data [30]. This makes it a challenge to align ontologies, especially when they are designed for different purposes. For example, WordNet [31] does not relate the words “renal” and “kidney” directly, but uses a special relationship called “pertainymy” to connect them [30]. Various ontology alignment techniques have been tested through the Ontology Alignment Evaluation Initiative since 2007. Recently they have started looking at scalability [28]. Table 3.1 shows a comparison of precision, recall and runtime for some of the algorithms compared below. These algorithms generally ran on 2.0 to 3.1 GHz processors with 2 to 4 GB of RAM. They were run against the Adult Mouse Anatomy (2,744 classes) [32] and the NCI Thesaurus (3,304 classes) [33] except when otherwise noted [20].

3.3 Comparing Approaches

The approaches described in Table 3.1 vary from heuristic approaches to machine learning. Some of them are very well described, while others have details which are opaque in the published literature [20]. In addition to the variation in approach, the ontologies and relationships used to test alignment also varied. This table gives a rough understanding of the diversity of results, performance and approaches in ontology alignment. Most of the state of the art systems take less than an hour to do equivalence relationship alignments on the OAEI $O(3{,}000 \cdot 3{,}000)$ test case. Precision tends to be much higher than recall, from 0.77 to 0.99. Recall is around 0.52 to 0.77, depending on the algorithm.

3.4 Comparison of Systems

The following subsections describe several ontology alignment projects selected for their high accuracy, simplicity or uniqueness of approach. They represent a summary of the state of the art for biomedical ontology alignment.

Table 3.1: Results From Literature

Project                          Precision  Recall  Runtime     Ontology Size
LOOM                             0.99       0.65    ?           O(3,000 · 3,000) [34]
AROMA                            0.775      0.678   ~1 minute   O(3,000 · 3,000) [20]
SOBOM                            0.952      0.777   19 minutes  O(3,000 · 3,000) [20]
Falcon AO                        0.964      0.591   12 minutes  O(3,000 · 3,000) [35]
Stoutenburg (super)              0.84       0.55    96 hours    O(1,000 · 1,000) [1]
Stoutenburg (sub)                0.93       0.54    96 hours    O(1,000 · 1,000) [1]
Stoutenburg (ontology defined)   0.62       0.52    96 hours    O(1,000 · 1,000) [1]

3.4.1 LOOM

Described in [34], LOOM is a simple approach to ontology alignment that uses string normalization and string comparison, producing highly precise results with good recall. Remarkably, this seemingly naive approach provides better results than some approaches that use machine learning. This is likely due to the ontologies selected, the Adult Mouse Anatomy and a part of the NCI Thesaurus. These ontologies have been going through a process of label harmonization, which has increased the correlation of concepts within them². Its simplicity and precision make this an interesting and practical approach for aligning biomedical ontologies. The LOOM approach compares ontology classes using string comparison in these two steps:

1. Normalize ontology class titles by removing all delimiters (spaces and punctuation) from strings and normalizing case.

2. Match strings approximately. Allow for a mismatch of no more than one character in strings with length greater than four and no mismatches for shorter strings.

This heuristic (2) can be replaced with an exact string match to boost precision. In specific instances precision was much higher than the OAEI reference alignment. Precision is the strength of this algorithm.
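As a concrete illustration, the following is a minimal Python sketch of this two-step matcher. It assumes that “a mismatch of no more than one character” means an edit distance of at most one; the function names are illustrative and are not taken from the LOOM implementation.

    import re

    def normalize(label):
        # Step 1: strip delimiters (spaces and punctuation) and normalize case.
        return re.sub(r"[\W_]+", "", label).lower()

    def within_one_edit(a, b):
        # True when a and b differ by at most one insertion, deletion or
        # substitution (edit distance <= 1).
        if abs(len(a) - len(b)) > 1:
            return False
        if len(a) > len(b):
            a, b = b, a
        i = j = edits = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                i += 1
                j += 1
                continue
            edits += 1
            if edits > 1:
                return False
            if len(a) == len(b):
                i += 1
            j += 1
        return edits + (len(b) - j) <= 1

    def loom_match(label_a, label_b):
        a, b = normalize(label_a), normalize(label_b)
        # Step 2: strings longer than four characters may mismatch by one
        # character; shorter strings must match exactly.
        return within_one_edit(a, b) if len(a) > 4 else a == b

    print(loom_match("Platelet-Activation", "platelet activation"))  # True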

²http://oaei.ontologymatching.org/2012/anatomy/

3.4.2 AROMA

AROMA [36] compares the vocabularies used to describe ontologies through statistical analysis. It measures the number of words used to describe concepts. If a concept A is described with a subset of the words used to describe B, that implies that A is a more generic type of B. These relations are found through association mining [37], an unsupervised machine learning algorithm that finds association rules. Association rules are inferences about which concepts are found together and which concepts imply other concepts. For example, the association rule

$$\{\text{French fries}, \text{shakes}\} \Rightarrow \{\text{Hamburger}\}$$

states that when people buy French fries and shakes at a restaurant they also buy hamburgers. These association rules are filtered using implication intensity, a measure of the number of expected and observed counter-examples [36]. This method is capable of finding hyponymy and hypernymy relationships. This method is incredibly fast; however, it does not have outstanding performance for either precision or recall. It is also one of the few methods besides [38] and [1] that describes hyponymy and hypernymy matching.

3.4.3 SOBOM

SOBOM [38] uses a series of steps for alignment. First, “anchor concepts” are found between the ontologies. These are concepts that have precise equivalence and are not leaf nodes. Using these anchors as roots, sub-ontologies are segmented and aligned. These sub-ontologies are matched using a similarity propagation graph. Second, additional semantic information is used to align non-superclass relationships. The details of these non-superclass relationship matches are not clearly explained or cited.

This method gives impressive results. The details of how this is accomplished are not elaborated in the paper. Whether this is a domain specific approach or can be used in other areas is not known. This is one of the few methods besides [36] and [1] that describes hyponymy and hypernymy matching.

3.4.4 Falcon AO

Falcon AO [15] combines graphical and linguistic methods for matching. First, good alignments are created on some objects; these are then expanded to match other items. This allows for a partitioning approach to scalability, reducing the number of comparisons as large ontologies are aligned. Some work has been done on using the VDoc algorithm of Falcon AO with a MapReduce framework to increase scalability further [39].

3.4.5 Stoutenburg Ontology Bridge

The Stoutenburg Ontology Bridge algorithm [1] uses a combination of support vector machines (SVMs) [40], upper ontologies and natural language processing. Pairs of concepts between the ontologies are enumerated and compared. Approximately two dozen features are extracted from each pair. These features are compared by using a radial-basis function SVM [41] which infers what relations exist between the concepts in the pair. The relationships supported are hyponymy, hypernymy and ontology-defined relations. Using SVMs has several drawbacks. They are relatively slow and there are only a few implementations available. Also, they require training and parameter tuning. The results from an SVM cannot be explained: a numerical value is provided and normalized, but it doesn’t necessarily reflect the quality of matches. Upper ontologies enable better matching by mapping the meaning of labels to deep semantic information. This allows for “common sense” reasoning about the ontologies as they’re being aligned. Finding relations in these upper ontologies is often slow because of their size and complexity. The OpenCyc project [10] software is especially complex and memory intensive.

3.4.5.1 Branch and Bound Approach

In addition to the primary approach above, [1] describes a branch and bound algorithm for scaling ontology alignment. This approach can trade recall for time, so it allows for high precision alignments with reduced execution time. This branch and bound algorithm relies on the ontologies being well structured in order to select ontology pairs to compare and align.

3.5 Survey Results

There still are only a few ontology alignment systems that handle non-equivalence relationships, specifically Ontology Bridge, SOBOM and AROMA. The methods used for alignment show a large diversity of approaches. Some systems, such as Falcon-AO, combine multiple approaches, which seems to give better results.

CHAPTER 4

Definitions

This chapter defines the technical terms used in this thesis.

4.1 Performance Measures

Ontology alignment uses precision, recall and F-measure to determine quality [42]. Algorithms produce results which are compared against a “reference alignment”. A reference alignment is an ontology alignment which has been verified to be correct. In Figure 4.1, the filled-in dots are all relevant pieces of information while the retrieved information is within the oval. Errors are shown in gray.

Recall Measures how many of the relevant relationships the algorithm found. In Figure 4.1, it is denoted by R, the white oval area divided by the gray area to the left.

Precision Measures whether the alignments found are relevant. In Figure 4.1, it is denoted by P, the white oval area divided by the gray oval area.

F-Measure Measures overall quality. It is the harmonic mean of precision and recall

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

Figure 4.1: Precision and Recall. Public domain image from WikiMedia

Defining these mathematically gives

$$\text{Recall}(A, R) = \frac{|R \cap A|}{|R|}$$

and

$$\text{Precision}(A, R) = \frac{|R \cap A|}{|A|}$$

where A is the set of alignments detected by the ontology alignment system (the dots and circles within the oval in Figure 4.1) and R is the set of true alignments (the black dots on the right hand side in Figure 4.1).
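These measures translate directly into code. The following is a minimal Python sketch, assuming alignments are represented as sets of tuples; the internal representation in an actual system may differ.

    # A and R are sets of alignment tuples, e.g. (concept_a, concept_b, relation).
    # Assumes non-empty A and R.
    def recall(A, R):
        return len(R & A) / len(R)

    def precision(A, R):
        return len(R & A) / len(A)

    def f_measure(A, R):
        p, r = precision(A, R), recall(A, R)
        return 2 * p * r / (p + r)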

4.2 Tools

SVM Support Vector Machine [43]. This is a machine learning algorithm that creates a high dimensional space based on many features. Using training data, it creates a maximum margin hyperplane between the classes it is to discriminate between. For this thesis, a two-class support vector machine is used. An SVM reports the distance from the separating margin between the two classes; in some cases this can be used to gauge how likely the classification is to be correct. For this thesis, the distance from the margin is discarded during classification. The report from the SVM is only used to determine which side of the margin the class is on (see the sketch after these definitions).

Upper Ontology An ontology that describes abstract or broad terms that can be used across contexts.
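The two-class SVM usage described above can be sketched with scikit-learn, which stands in here for the thesis implementation; the tiny training set is a placeholder.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.9]])  # feature vectors
    y = np.array([0, 1, 0, 1])                                      # relationship exists?
    clf = SVC(kernel="rbf").fit(X, y)

    distance = clf.decision_function([[0.2, 0.2]])  # distance from the margin, discarded
    side = clf.predict([[0.2, 0.2]])                # side of the margin: the only output used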

4.3 Function Primitives

map A function that takes in another function and executes it on all the items of a sequence.

    map(f, [a, b, c, ...]) = [f(a), f(b), f(c), ...]

reduce A function that takes in another function and executes it on a sequence, taking the previous result as the operand.

    reduce(f, [a, b, c, ...]) = f(a, f(b, f(c, ...)))

product A function that produces the Cartesian Product of two sequences.

    product([a, b, c, ...], [A, B, C, ...]) = [(a, A), (a, B), (a, C), (b, A), ...]
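All three primitives exist in Python’s standard library, which makes the definitions above concrete:

    from functools import reduce
    from itertools import product

    list(map(lambda x: x * x, [1, 2, 3]))    # [1, 4, 9]
    reduce(lambda a, b: a + b, [1, 2, 3])    # 6
    list(product(["a", "b"], ["A", "B"]))    # [('a', 'A'), ('a', 'B'), ('b', 'A'), ('b', 'B')]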

4.4 Scaling Nomenclature

parallelism Running multiple calculations concurrently on separate processors to reduce execution time.

parallelization Changing operations to work in parallel.

4.5 Ontology Nomenclature

Ontology An ontology consists of a vocabulary that describes a specific domain and the definitions of the terms in that vocabulary in a formal manner [7]. Ontologies model entities, assign their significance and group them based on relationships [44][30].

Ontology Alignment The process of creating a set of correspondences between ontologies is called ontology alignment [7]. Concepts in each ontology are related to one another by equivalence, hyponymy, hypernymy or other relations. This process is called schema matching when it is done with format schemas instead of ontologies [7].

Hyponymy The relationship of being more specific: “Dog is a hyponym of animal.”

Hypernymy The relationship of being more general: “Animal is a hypernym of dog.”

MP Mammalian Phenotype Ontology¹. This ontology covers mammalian phenotypes. Phenotypes are the physical attributes of organisms. Most of its data is based on experimental results from mice and rats. These experiments are on organisms that have been selectively bred, genetically engineered or mutated to show certain traits.

GO Gene Ontology². This ontology covers general information about gene expression, metabolism, and cellular processes. It can be used to annotate results from bioinformatic experiments. Much of the information in the Gene Ontology has come from comparing the genomes of various organisms.

¹http://www.informatics.jax.org/searches/MP_form.shtml
²http://www.geneontology.org/

CHAPTER 5

The Strategic Approach of Parallel Ontology Bridge

This chapter discusses the approach to ontology alignment used in the work of this thesis. It covers the experimental data, the algorithms and the implementation. Comparison with Ontology Bridge from [1] is emphasized. Parallel Ontology Bridge has the same general form as Ontology Bridge. Features are extracted, SVMs are used and non-equivalence relations are detected. The main difference is that Parallel Ontology Bridge executes these steps across multiple, concurrently running jobs.

5.1 Test Data

The biomedical reference alignments generated for [1] were used in development and testing. They provided the training data used throughout this thesis and were used to determine the quality of alignment. These test cases are relatively small, with around 10 concepts from each ontology. The hyponymy reference alignments were used for the testing of this work. There are 10 biomedical test cases covering various concepts and the relationships between them. These concepts are as follows:

5.1.1 Platelet Activation

The conditions under which platelets cohere to one another and related activities, e.g., scabbing. This can be triggered by various events, such as the platelet encountering collagen or other proteins. As a platelet activates, it changes shape to a more amorphous form, adheres

to other platelets and promotes coagulation reactions. This test case consists of 10 concepts from the GO ontology such as cell activation, blood coagulation and platelet activation and 4 concepts from the MP ontology such as abnormal platelet physiology and hematopoietic system phenotype.

5.1.2 Mannose Binding

The process of certain proteins binding to the surfaces of pathogens. A deficiency of mannose binding is associated with higher rates of infection. This test case consists of 6 concepts in GO relating to binding and 6 concepts in MP relating to immune system and protein physiology.

5.1.3 Immune System

The system of the body which fights pathogens and foreign elements. This test had 8 concepts related to immune response in GO, while the 7 MP concepts relate to immune system physiology, response and phenotype.

5.1.4 Phenylalanine Conversion

The processes of converting the amino acid phenylalanine into other amino acids, such as tyrosine. For this test, 18 concepts related to metabolic processes were selected from GO, while 6 concepts related to activity were selected from MP. These require semantic features to appropriately match.

5.1.5 Bone Remodeling

The process of mature bone tissue being removed from the skeleton and new tissue being formed in its place. This is done, respectively, by osteoclasts and osteoblasts. This test had 9 concepts in GO related to regulation, resorption and remodeling and 6 in MP related to remodeling, physiology and increase/decrease of resorption.

5.1.6 Bone Marrow

The tissue inside of bones which creates various cells and components of blood. This test case has 5 concepts in GO related to development and morphogenesis and 4 concepts in MP related to development and morphology.

5.1.7 Osteoblast Differentiation

How the cells that create bone tissue are created. This test has 7 concepts from GO about osteoblast differentiation, ossification and regulation of osteoblast differentiation. Only one concept from MP was selected, abnormal osteoblast differentiation.

5.1.8 Osteoclast Differentiation

How the cells that destroy bone tissue are created. This test has 4 concepts from GO about regulation of osteoclast differentiation and osteoclast differentiation. One concept from MP was used, abnormal osteoclast differentiation.

5.1.9 Behavior

The actions of an organism in response to stimulus. The concepts in both ontologies were generic: 4 concepts in GO related to behavior and regulation of behavior, and 2 concepts in MP, abnormal behavior and behavior phenotype.

5.1.10 Circadian Rhythm

Processes that have a daily cycle. In GO these concepts were circadian rhythm, regulation of circadian rhythm and response to external stimulus. In MP the single concept selected was abnormal circadian rhythm.

5.2 Attempts to Enhance Alignment

In addition to the primary thrust of increasing scalability, this thesis did some work on novel ways to enhance alignment quality. These results are described in this section.

5.2.1 Parallel Human Computation

In human computation, sometimes called crowd-sourcing, a large number of people solve a problem by connecting and collaborating through the Internet. A problem is broken down into pieces that can be done in small increments by many participants. This approach was attempted for ontology alignment. The Mechanical Turk¹ service of Amazon Web Services was used. Mechanical Turk offers a cheap crowd-sourcing platform for companies and individuals to manage tasks that can be delegated to users across the world. The labels and descriptions of potential matches were given to independent participants, who selected what relationships existed between the pairs. These experiments gave unsatisfactory results. The results were much worse than random selections of alignments. Only one participant out of twenty might get an alignment correct. This is likely due to a lack of expertise in the biomedical domain among Mechanical Turk participants. Unlike in simple domains, the population does not converge on a correct result. The details of this experiment are given in Table 5.1. Participants took less than a minute on average to determine results, and were given a 1 cent bounty. These tasks were based on the MP and GO ontology test cases from [1], which had full results to compare against. There are several approaches that may help with this. Using confusion matrices [45], an approach to handling noisy classifiers, has had good results in other domains. Breaking the problem into very small tasks, which are used as input to an SVM, may be helpful. This is similar to the approach used in [46] for crowd-sourced elaborate editing tasks and in [47], which uses crowd-sourcing for training an SVM.

5.2.2 Information Entropy and Morpheme Based Extraction

An information entropy and morpheme based extraction was also attempted. Inspired by [34], ontology concept labels were compared, weighing the characters by their information entropy. Based on this entropy, they were grouped into “chunks” that were compared. Words in titles were broken into chunks based on information entropy. Entropy measures how informative characters or strings of characters are in predicting the rest of the title.

¹https://www.mturk.com/

Table 5.1: Human Computation Experiment

Variable                  Value
Number of participants    200
Average time per task     47 seconds
Bounty per match          $0.01
Matches per participant   1
Total number of pairs     200

This was not very successful, as can be seen in Table 5.2. The baseline results are F-Measures for the test cases used with the approach described by Stoutenburg [1]. Top 4 Stems shows the performance of creating a feature vector based on the 4 most distinct stems extracted from the titles and descriptions. Chomp shows extractions at various cut-off levels. ? is shown for the cases where either the retrieved documents or the relevant documents retrieved were zero. The baseline is the F-Measure based on the existing features, without the addition of the morpheme based extraction.

Table 5.2: Chunking (segmenting strings based on information entropy) Results

Test Case                   Baseline  Top 4 Stems  Chomp Count 1  Chomp Count 2  Chomp Count 3
Behavior                    0.700     0.639        0.513          0.710          0.710
Bone Marrow                 0.700     ?            0.710          0.513          0.710
Bone Remodeling             0.707     ?            0.164          0.448          0.717
Immune System               0.707     ?            0.717          0.513          0.448
Mannose Binding             0.692     ?            ?              ?              0.717
Circadian Rhythm            0.709     ?            ?              ?              ?
Platelet Activation         0.424     ?            ?              ?              ?
Osteoclast Differentiation  0.623     ?            ?              ?              ?
Osteoblast Differentiation  0.685     ?            ?              ?              ?
Phenylalanine Conversion    0.707     ?            ?              ?              ?
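The chunking idea can be illustrated with a rough Python sketch that splits a label where the next character is hard to predict. The cut-off value and segmentation rules used in the thesis are not specified here, so everything below is an assumption:

    import math
    from collections import Counter, defaultdict

    def next_char_entropy(labels):
        # For each prefix seen in the corpus, measure the Shannon entropy
        # of the character that follows it.
        following = defaultdict(Counter)
        for label in labels:
            for i in range(len(label) - 1):
                following[label[:i + 1]][label[i + 1]] += 1
        entropies = {}
        for prefix, counts in following.items():
            total = sum(counts.values())
            entropies[prefix] = -sum((c / total) * math.log2(c / total)
                                     for c in counts.values())
        return entropies

    def chunk(label, entropies, cutoff=1.0):
        # Split the label after positions where the continuation is uncertain.
        chunks, start = [], 0
        for i in range(1, len(label)):
            if entropies.get(label[:i], 0.0) > cutoff:
                chunks.append(label[start:i])
                start = i
        chunks.append(label[start:])
        return chunks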

5.3 Summary of Parallel Ontology Bridge

The non-optimized approach of Ontology Bridge described in [48] can be summarized as follows:

1. For the two ontologies being aligned, the classes were paired from each ontology. This creates a Cartesian product of classes between the two ontologies.

2. For each pair of classes, features were extracted.

(a) Features include information in upper ontologies, linguistic features of labels and structural features of ontologies. These features are all numerical in nature.

(b) These features were normalized into feature vectors for use in a radial-basis function SVM.

3. Based on these features, pairs that have relationships were selected using an SVM. The SVM has two classes for each relationship: that the relationship exists or that it does not exist.

Parallel Ontology Bridge, the implementation of this thesis, re-implements the approach above with the additional enhancement of running feature extraction and pair selection with an SVM on multiple processors. This dramatically increases scalability. For additional performance, some feature calculations are optimized by storing precomputed information in lookup tables.
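The lookup-table optimization mentioned above amounts to memoizing expensive upper-ontology queries. A minimal sketch, with slow_synonym_query as a hypothetical stand-in for a real WordNet or OpenCyc lookup:

    synonym_table = {}

    def slow_synonym_query(word):
        return {word}  # placeholder for a real WordNet/OpenCyc lookup

    def cached_synonyms(word):
        # Compute each word's synonyms once; later pairs reuse the entry.
        if word not in synonym_table:
            synonym_table[word] = slow_synonym_query(word)
        return synonym_table[word]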

5.3.1 MapReduce

Parallel Ontology Bridge is an implementation of the Ontology Bridge algorithm which runs in parallel over multiple processing units. It does this by using MapReduce. MapReduce is described in [18]. The name MapReduce comes from the two concurrency primitives used during execution. The function map takes two parameters, a function and a sequence. It runs the function on every item in the sequence. The function map can be executed on multiple processors with little effort. Input data is partitioned, scheduled and executed on a number of workers. This is carried out by a master process which manages workers and assigns tasks. Each worker receives input data and processes it using the function passed into map. Similarly, reduce also has parameters that are a function and a sequence. But unlike map, reduce returns a single value by executing the function over the pairs of the sequence. Here MapReduce is illustrated with an example analysis of a sum of squares expression. The sum of squares is used when calculating various quantities in statistics. In compact notation it is

$$S = \sum_i^N x_i^2.$$

Calculating this by hand can be done by “unrolling” the summation and doing each operation in sequence:

$$\sum_i^N x_i^2 = x_1^2 + x_2^2 + x_3^2 + \dots + x_N^2.$$

This expression has a few interesting properties. One, each squaring of the terms is independent of all the others. We could give each $x_i^2$ term to a separate processor and then add them together. The other property of note is that items can be added in any order. We could divide the problem into pieces such as

$$\sum_i^N x_i^2 = \left(x_1^2 + x_2^2 + \dots + x_{\lfloor N/2 \rfloor}^2\right) + \left(x_{\lfloor N/2 \rfloor + 1}^2 + \dots + x_N^2\right).$$

This means the problem can be broken into small pieces, executed concurrently and reassembled easily with no change in the calculations that have occurred. MapReduce does this by rewriting the problem in terms of two higher order functions. Higher order functions are functions which take functions as arguments. For the squaring, rewriting the sequence

$$\left[x_1^2, x_2^2, x_3^2, \dots, x_N^2\right]$$

gives

$$f(x) = x^2$$

$$\Rightarrow \left[x_1^2, x_2^2, x_3^2, \dots, x_N^2\right] = [f(x_1), f(x_2), f(x_3), \dots, f(x_N)] = \mathrm{map}(f, [x_1, x_2, x_3, \dots, x_N]),$$

an expression in terms of map. Similarly, the addition can be rewritten in terms of reduce:

$$g(a, b) = a + b$$

$$\Rightarrow x_1 + x_2 + x_3 + \dots + x_N = g(x_1, g(x_2, g(x_3, g(\dots, x_N)))) = \mathrm{reduce}(g, [x_1, x_2, x_3, \dots, x_N]).$$
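The same example is only a few lines of Python; the process pool stands in for the master/worker scheme described above:

    from functools import reduce
    from multiprocessing import Pool

    def f(x):
        return x * x              # the mapped function

    def g(a, b):
        return a + b              # the reducing function

    if __name__ == "__main__":
        xs = list(range(1, 1001))
        with Pool() as pool:
            squares = pool.map(f, xs)   # squarings run in parallel
        print(reduce(g, squares))       # 333833500, reassembled in any order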

If an algorithm can be written in this form, using map and reduce, then it can be trivially parallelized across many processors with minimal contention between shared resources.

5.4 Aligning Ontologies with MapReduce

Ontology alignment can be thought of in the following manner (see Figure 5.1). Each ontology is represented as a graph. Each concept is represented as a vertex in a graph. A vertex in Ontology A is matched to a vertex in Ontology B. When a new relationship exists between these two vertices, it is represented as an edge.

Figure 5.1: Ontology Alignment

Rewriting Ontology Bridge in pseudo-code gives Algorithm 5.1. Algorithm 5.1 can be explained as follows:

Step 1 For each pair of classes between the two ontologies,

Step 2 extract features from them.

Step 3 If there is a relationship based on these features, add a match between these classes.

This approach can be rewritten using higher order functions, such as map (Algorithm 5.2, where align_pair is described in Algorithm 5.3).

Algorithm 5.1 Naive Matching

    for (c1, c2) in product(o1, o2):        # Step 1
        features = get_features(c1, c2)     # Step 2
        if has_relation(features):          # Step 3
            add_match(c1, c2)

Algorithm 5.2 Matching with higher-order functions

    map(align_pair, product(o1, o2))

Algorithm 5.3 align_pair implementation

    def align_pair(pair):
        c1, c2 = pair
        features = get_features(c1, c2)
        if has_relation(features):
            add_match(c1, c2)

In this work, map was implemented as a pool of processes, called “jobs”, running on separate processors, thereby allowing parallelization of ontology alignment. These jobs were given batches of 100 ontology class pairs to process at a time. These batches were collected in a “master” process which pushed jobs to waiting “worker” processes through an inter-process communication queue. Results were similarly fed back to the master through an inter-process communication queue. get_features and add_match were potential bottlenecks for parallelization because they made use of shared resources. A sketch of this batching scheme follows.
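This is a minimal sketch of the batched master/worker scheme using multiprocessing queues. The thesis implementation is built on joblib, so the structure and names here are illustrative; align_pair (from Algorithm 5.3) is assumed to return a match or None rather than recording it directly.

    from itertools import islice, product
    from multiprocessing import Process, Queue

    BATCH = 100  # pairs per job, as described above

    def worker(tasks, results):
        for batch in iter(tasks.get, None):          # None is the stop signal
            # align_pair is assumed to return a match or None
            results.put([m for m in map(align_pair, batch) if m is not None])

    def master(o1, o2, n_jobs=16):
        tasks, results = Queue(), Queue()
        workers = [Process(target=worker, args=(tasks, results))
                   for _ in range(n_jobs)]
        for w in workers:
            w.start()
        pairs = product(o1, o2)
        n_batches = 0
        while True:
            batch = list(islice(pairs, BATCH))
            if not batch:
                break
            tasks.put(batch)
            n_batches += 1
        for _ in workers:
            tasks.put(None)                          # one stop signal per worker
        matches = [m for _ in range(n_batches) for m in results.get()]
        for w in workers:
            w.join()
        return matches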

5.5 Architecture

Figure 5.2 shows the architecture, which is based on [1]. Two input ontologies are given to the system in OWL XML format; pairs are created from these ontologies and distributed to the worker jobs. These pairs consist of one class from each ontology. Each worker job extracts feature primitives and determines the relationships that exist using an SVM.

Figure 5.2: Architecture

Finally, these results from all the workers are combined to create a new alignment. The features extracted look up information in upper ontologies such as OpenCyc and WordNet. The code is organized into small modules, simplifying the system architecture and aiding debugging, analysis and modification. The code as implemented is in Appendix A. This system was run on a system with two Intel Xeon E5-2670 Sandy Bridge processors. Each one of these processors has 8 cores, giving a total of 16 processing units available. During execution, 16 worker jobs were created with one “master” process coordinating data between them. This gave the best performance, likely due to the master node taking advantage of the IO concurrency.

5.6 Comparison to Ontology Bridge

Scalability was evaluated in the same manner as [1]. Subsets of the ontologies being aligned were selected randomly, sampling 100, 500 and 1,000 classes from each ontology. The execution time was measured as these tests were run with various numbers of concurrent jobs. These results were analyzed to determine the scalability of the system in Chapter 6. The performance (i.e., time taken) of Ontology Bridge, Ontology Bridge with Branch and Bound and Parallel Ontology Bridge is shown in Figure 5.3. Ontology Bridge’s time increases quadratically with the number of class pairs, while Ontology Bridge with Branch and Bound does not grow as fast. The performance of Parallel Ontology Bridge is almost flat; although it grows quadratically, it grows at a much slower rate.

Figure 5.3: Performance of Ontology Bridge variants at various numbers of ontology pairs

5.7 Feature Extraction

As illustrated in Figure 5.2, classes are extracted from the ontologies. Features are extracted from these class entries and fed into a support vector machine. OpenCyc is accessed through a cache, significantly reducing the amount of time needed to use the OpenCyc ontologies to create features. Software timers and calculations of alignment metrics are also integrated. Various graphs and plain text output of results were generated for diagnosis. Features are turned into a vector sequence that is normalized during SVM processing. Table 5.3a shows an example ontology pair. The row “Origin Ontology” contains the ontology the class came from, the row “Class Label” contains the label given to the class and the row “Parent Classes” is a comma-delimited list of superclasses of the selected class. Table 5.3b shows the features extracted for these two classes.

(a) Ontology Pair for Feature Extraction

                 Class A                                      Class B
Origin Ontology  GO                                           MP
Class Label      negative regulation of platelet activation   abnormal platelet activation
Parent Classes                                                abnormal platelet physiology

(b) Feature Results

Primitive Name          Value
count_opencyc_synonyms  1
count_wordnet_synonym   4
has_matching_labels     0
has_same_first_word     0
has_same_last_word      1

Table 5.3: Example Feature Extraction

An example feature extraction would take the ontology pair shown in Table 5.3a and turn it into the vector [1, 4, 0, 0, 1] with the features in Table 5.3b. The individual vectors are made up of binary values, integers or real numbers. They are all normalized during SVM training to be in the range [0, 1]. It is relatively easy to create additional features. Table 5.4 lists the features used for this thesis. To allow for valid comparisons, these features are the same as those used in Stoutenburg’s work [1]. This required porting from Java to the Python programming language and some tuning by trial and error to find the appropriate normalization schemes. These features were originally created based on analysis of biomedical ontologies assisted by biology and medical experts. These features make up patterns of relationships between biomedical ontologies. For example, titles that end with the same words are likely related by hyponymy, such as “platelet activation” and “abnormal platelet activation”, and titles that have synonyms in them are likely to be equivalent or related. Since these features are consistent with the heuristics developed by these experts, they make a good choice for this domain. Linguistic features, such as the number of words in the labels; structural features, for example how many child concepts a class has; and features looked up in upper ontologies, like how many synsets the words in two labels have in WordNet or how many shared concepts they have in OpenCyc, are included in Ontology Bridge and Parallel Ontology Bridge.
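Two of the simplest linguistic primitives from Table 5.4 can be sketched directly; applied to the pair in Table 5.3a they reproduce the corresponding values in Table 5.3b. The real implementations in the primitives module may differ in detail.

    def has_same_first_word(label_a, label_b):
        return int(label_a.split()[0] == label_b.split()[0])

    def has_same_last_word(label_a, label_b):
        return int(label_a.split()[-1] == label_b.split()[-1])

    a = "negative regulation of platelet activation"
    b = "abnormal platelet activation"
    print(has_same_first_word(a, b))  # 0, as in Table 5.3b
    print(has_same_last_word(a, b))   # 1, as in Table 5.3b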

5.8 Implementation

The software was implemented in Python using the scikit-learn² and joblib³ libraries. The input ontologies were in Web Ontology Language (OWL) format. This is a W3C-standardized format for ontologies that can be serialized into XML and other text formats. Two approaches were used to access upper ontologies. WordNet was accessed through the Natural Language Toolkit (nltk⁴), further described in [50]. OpenCyc, which consists of an upper ontology in OWL format and a reasoning engine, was accessed through its ontology file. The ontology was extracted into a graph data structure with references to the necessary relations put into a fast lookup table. The reasoner was not used in this implementation. The scikit-learn library had the necessary interfaces to libSVM, an implementation of SVMs that performs well. It included grid search for the parameters of the SVM kernel. This was used to determine the constants during training. The joblib library had the implementation of MapReduce used in this thesis. The number of jobs used during a run was a parameter to this method. In addition, some joblib caching was used to reduce the load times of various files and training data. The software used for this thesis has progressed through several iterations of development. The final version is written in the Python programming language. Test-driven development, writing automated test cases before production code [51], has been used whenever possible. As such, a suite of unit tests exists for most of the functionality of this software. Table 5.5 shows the test coverage of the various modules of the implementation in this thesis.

2 http://scikit-learn.org/stable/
3 http://packages.python.org/joblib/
4 http://nltk.googlecode.com/svn/trunk/doc/howto/.html
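As a rough sketch of how the map step is distributed (the complete driver script appears in Listing A.2), joblib's Parallel and delayed helpers fan a function out over worker processes; classify_pair here is a hypothetical stand-in for feature extraction plus SVM classification:

from joblib import Parallel, delayed

def classify_pair(c1, c2):
    # Placeholder for feature extraction plus SVM classification.
    return (c1, c2, c1 == c2)

pairs = [("a", "b"), ("b", "b"), ("c", "d")]

# n_jobs controls how many worker processes run the map step.
results = Parallel(n_jobs=4)(delayed(classify_pair)(c1, c2)
                             for (c1, c2) in pairs)
print(results)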

Feature Primitive Name             Description
count_opencyc_hypernyms            Number of words that are hypernyms through OpenCyc
count_opencyc_hyponyms             Number of words that are hyponyms through OpenCyc
count_opencyc_synonyms             Number of words that are synonyms through OpenCyc
count_wordnet_hypernyms            Number of words that are hypernyms through WordNet
count_wordnet_hyponyms             Number of words that are hyponyms through WordNet
count_wordnet_synonym              Number of words that are synonyms through WordNet
has_matching_labels                Whether any labels match
has_opencyc_subclass_synonym       Whether there is a synonym of a concept subclass through OpenCyc
has_opencyc_superclass_hypernym    Whether there is a hypernym of a concept superclass through OpenCyc
has_opencyc_superclass_hyponym     Whether there is a hyponym of a concept superclass through OpenCyc
has_opencyc_superclass_synonym     Whether there is a synonym of a concept superclass through OpenCyc
has_opencyc_synonym                Whether any OpenCyc synonym exists
has_same_beginning                 Whether there is a shared substring at the start
has_same_ending                    Whether there is a shared substring at the end
has_same_first_word                Whether the first word is the same
has_same_label                     Whether the primary label matches
has_same_last_word                 Whether the last word is the same
has_stoilos_similarity             Stoilos similarity metric [49]
has_sub_prefix                     Whether a word starts with "sub" followed by a word from the other label
has_superclass_1                   Whether the first concept has superclasses
has_superclass_2                   Whether the second concept has superclasses
has_wordnet_hypernym               Whether any hypernym is shared in WordNet
has_wordnet_hyponym                Whether any hyponym is shared in WordNet
has_wordnet_subclass_synonym       Whether there is a synonym of a concept subclass through WordNet
has_wordnet_superclass_hyperym     Whether there is a hypernym of a concept superclass through WordNet
has_wordnet_superclass_hyponym     Whether there is a hyponym of a concept superclass through WordNet
has_wordnet_superclass_synonym     Whether there is a synonym of a concept superclass through WordNet
has_wordnet_synonym                Whether any WordNet synonym exists
is_opencyc_hypernym                Whether any OpenCyc hypernym exists
is_opencyc_hyponym                 Whether any OpenCyc hyponym exists

Table 5.4: Features

Module Name    Statements    Missed    Coverage
align          49            15        69%
import_csv     53            28        47%
opencyc        21            9         57%
primitives     245           149       39%
utils          25            13        48%
Total          393           214       54%

Table 5.5: Unit Test Statement Coverage. This shows how much code in each one of these modules is covered by automated unit tests.

There are 32 tests which take 1.85 seconds to run. These tests primarily cover the functionality of feature primitives, with a few tests for metric calculation, training data parsing and simple scenarios of alignment.

5.8.1 Issues With Java Implementation

Initially, this software was written in Java, taking advantage of several existing libraries to deal with ontologies. However, this proved untenable for the following reasons:

Jena This library, the industry standard for dealing with ontologies in Java, was unable to handle the large number of ontology pairs used in this thesis; its performance was poor. The Python equivalent used in this thesis, rdflib, easily handled large ontologies.

WordNet This resource was difficult to incorporate with the rest of the Java software. It uses external files in non-standard ways, which makes distributing it in a jar file very difficult. The WordNet functions are not idempotent or reentrant, which made multithreading very difficult. The Python library did not have these difficulties.

OpenCyc The author found the OpenCyc reasoning engine extremely slow to use. The alternative, the OpenCyc OWL file, was very large and cumbersome. This was mitigated by caching: the OpenCyc OWL file was loaded into a Python shelve and serialized into a B-tree based file, which allowed for very rapid lookup.
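A condensed sketch of this caching step, assuming a local copy of the OpenCyc OWL file named opencyc.owl (the full version appears in Listing A.4):

import shelve

import rdflib
from rdflib.namespace import RDFS

g = rdflib.Graph()
g.parse("opencyc.owl")  # hypothetical local path to the OpenCyc OWL file

db = shelve.open("opencyc.shelve")
for node in g.all_nodes():
    label = g.label(node)
    if not label:
        continue
    # Ancestors via the transitive closure of rdfs:subClassOf.
    db[str(label)] = [str(g.label(a)) for a in
                      g.transitive_objects(node, RDFS.subClassOf)
                      if g.label(a)]
db.close()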

5.9 Parallel Ontology Bridge F-Measure

As described in [1], cross-validation was used on the reference alignments to determine F-measure for the hyponymy relationship (shown in Table 5.6). Ten-way cross-validation separates the test data into 10 disjoint sets and runs tests on each one, using the remaining 9 sets as training data (a sketch of this procedure follows Table 5.6). These results are generally consistent with the best results from [1], which had an average F-measure of 0.68 for the same test cases. This implies that the implementation of Parallel Ontology Bridge performs as well as Ontology Bridge in terms of F-measure. Based on a cursory analysis of the data, more generic sub-domains, such as Behavior and Immune System, have better results than more specific sub-domains such as Mannose Binding and Bone Modeling. Phenylalanine Conversion is the only test case that exercises several of the OpenCyc-related features; because no relevant training data for those features is available from the other domains, it performs significantly more poorly than the others.

Table 5.6: Cross-Validation Results

(a) Parallel Ontology Bridge

Test Case                     F-Measure
Platelet Activation           0.707
Mannose Binding               0.685
Immune System                 0.623
Phenylalanine Conversion      0.424
Bone Modeling                 0.709
Bone Marrow                   0.692
Osteoblast Differentiation    0.707
Osteoclast Differentiation    0.707
Behavior                      0.7
Circadian Rhythm              0.7
Average                       0.67

(b) Ontology Bridge [1]

Test Case            F-Measure
Biomedical Test 1    0.7
Biomedical Test 2    0.69
Biomedical Test 3    0.58
Average              0.68
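A minimal sketch of such a ten-fold evaluation using a current scikit-learn API, with hypothetical feature vectors and labels standing in for the real training data:

import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(100, 5)        # hypothetical feature vectors
y = rng.randint(0, 2, 100)  # hypothetical relationship labels

# Each fold trains on 9/10 of the pairs and tests on the held-out tenth.
scores = cross_val_score(SVC(), X, y, scoring='f1',
                         cv=StratifiedKFold(n_splits=10))
print(scores.mean())        # average F-measure across the 10 folds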

5.10 Contributions

This thesis implemented Parallel Ontology Bridge, a parallel method of executing the Ontology Bridge algorithm. The algorithm is broken into map and reduce steps so that alignments of individual classes run concurrently. This approach shows good F-measure on hyponymy relationships and is generic enough to be used with other ontology alignment methods and features.

CHAPTER 6

Scalability Results

The scalability of a system refers to its ability to handle larger problems, sometimes by adding additional resources. In this thesis, it refers specifically to how a system performs as more computation resources are added.
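Two standard quantities, used informally throughout this chapter, are speedup and efficiency. For a workload that takes time $T(1)$ on one processing unit and $T(p)$ on $p$ units,

$$S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}.$$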

6.1 Scalability Metrics

For computer systems, there are two major metrics that affect scalability: contention delay and coherency delay. Contention delay is caused by sequential execution of computations; this can be due either to the structure of the problem or to contention over shared resources. Coherency delay is the time required for caches and memory to be updated with appropriate data; coherency delays are always due to the implementation of the system. These relationships are captured by Gunther's Universal Scalability Law [52], a model for scalability that will be used throughout this thesis. Gunther's Universal Scalability Law is

$$C(p) = \frac{p}{1 + \sigma\,(p-1) + \kappa\,p\,(p-1)}$$

where $p$ is the number of processing units, $\sigma$ is the contention delay and $\kappa$ is the coherency delay. To determine the scalability of a system, data points are fit to this model. The maximum performance of a system is at $p^*$ processing units, given by

$$p^* = \left\lfloor \sqrt{\frac{1-\sigma}{\kappa}} \right\rfloor.$$

Adding processing units beyond $p^*$ is counterproductive: performance either stays flat or degrades as the number of units grows above $p^*$.
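As a worked example with illustrative (made-up) parameter values, a system with $\sigma = 0.03$ and $\kappa = 10^{-4}$ would peak at

$$p^* = \left\lfloor \sqrt{\frac{1 - 0.03}{10^{-4}}} \right\rfloor = \lfloor 98.5 \rfloor = 98$$

processing units; adding a 99th unit would begin to reduce throughput.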

6.2 Scalability Experiments

To analyze the scalability of the Parallel Ontology Bridge system, several experiments were run with various numbers of computation units available. These data points were fit to Gunther's Universal Scalability Law using least squares regression, a technique for fitting curves by minimizing the squared error between the data points and the model. These data points are shown in Figure 6.1a. Similar data already existed for the other systems compared; these data are shown in the other sub-figures of Figure 6.1. All of these data come from running tests on an Amazon HPC instance, one of many cloud computing services offered by Amazon, with 60.5 GB of RAM and two Intel Xeon E5-2670, eight-core "Sandy Bridge" processors.

In addition to the subsets described below, a full alignment of GO and MP was attempted. This failed due to a rarely occurring contention fault: a mutable data structure in some of the library code that was used does not have an appropriate mutex. This full scalability test would have compared 211 million pairs in 5.8 days, but only 84 million pairs were successfully processed. Of these 84 million pairs, 8 million were shown to have a relationship. Since the algorithm remains unchanged from the test cases described in Section 5.9, additional human evaluation is unnecessary. For commercial application, additional tuning and testing would be necessary, but these tests should be sufficient to show the scalability of the approach.
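A sketch of this fitting procedure, assuming hypothetical measured speedups and using scipy's least squares curve fitting:

import numpy as np
from scipy.optimize import curve_fit

def usl(p, sigma, kappa):
    # C(p) = p / (1 + sigma*(p-1) + kappa*p*(p-1))
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

jobs = np.array([1, 2, 4, 8, 16])                # processing units used
speedup = np.array([1.0, 1.9, 3.6, 6.5, 10.8])   # hypothetical measurements

(sigma, kappa), _ = curve_fit(usl, jobs, speedup, p0=[0.01, 0.001])
p_star = int(np.sqrt((1 - sigma) / kappa))       # optimum number of units
print(sigma, kappa, p_star)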

6.3 Other Systems Compared

6.3.1 Gross’s Approaches

Figure 6.1c shows Gross's [53] intra-node system performance data points and the least squares fitted curve, while Figure 6.1b shows the same for Gross's inter-node system [53]. Intra-node means "within one node", in this case one computer, while inter-node means "between nodes", in this case multiple computers. Gross's approaches run various "matchers" on ontology pairs in a parallel manner on a single computer. Matchers include string similarity measures, structural comparisons and semantic lookups. Gross's inter-node system takes the same approach as Gross's intra-node system, except that it runs across multiple machines in a distributed system. This paper only describes the infrastructure to run the matchers, not how to combine their outputs. This system ran on the Adult Mouse Anatomy1 (MA) and NCI Thesaurus2 as its test ontologies. These ontologies are approximately 3000 concepts in size and are used as the Anatomy track of the OAEI.

6.3.2 Zhang’s Approach

Zhang's approach is a Hadoop-based system [39]. Its performance data points and least squares fitted curve are shown in Figure 6.1d. This approach constructs "virtual documents" and uses a term frequency-inverse document frequency (TF-IDF) [54] metric for ontology alignment, implemented in MapReduce. TF-IDF is a measure used in document retrieval that balances the frequency of specific words against their uniqueness across documents. This system ran on the FMA3 and GALEN4 ontologies.

1 http://www.obofoundry.org/cgi-bin/detail.cgi?id=adult_mouse_anatomy
2 http://ncit.nci.nih.gov/
3 http://sig.biostr.washington.edu/projects/fm/AboutFM.html
4 http://www.co-ode.org/galen/
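For reference, one common formulation of the TF-IDF weight of a term $t$ in a document $d$, given $N$ documents of which $\mathrm{df}(t)$ contain $t$, is

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}.$$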

Figure 6.1: Other Systems Curve Fits. (a) Parallel Ontology Bridge curve fit; (b) Gross inter-node; (c) Gross intra-node; (d) Zhang.

6.4 Input

Table 6.1 shows the ontologies used for these scalability tests and their sizes. The different input data should not affect the scalability results, since all the values used in the comparisons are normalized. Systems that use larger input datasets may, however, uncover coherency delays if the approach to cache invalidation is not optimal. The input data for Parallel Ontology Bridge was selected at random from all the possible pairs.

Table 6.1: Input Data

System                      Ontology                      Size
Parallel Ontology Bridge    Gene Ontology Subset          1,000
                            Mammalian Phenotype Subset    1,000
Gross Inter                 GO Molecular Function         9,395
                            GO Biological Process         17,104
Gross Intra                 Adult Mouse Anatomy           3,289
                            NCI Thesaurus (Anatomy)       2,737
Zhang                       FMA                           ~40,000
                            GALEN                         ~40,000

6.5 Hardware

Table 6.2 shows a comparison of the hardware used in each system. Zhang's Hadoop system had the most processors and RAM available. Gross's intra-node system had the fewest processors and the least RAM. Zhang's approach and Gross's inter-node approach were both distributed systems running over a network.

System                      CPUs           Cache    Memory       Architecture
Parallel Ontology Bridge    16x2.60 GHz    20 MB    60.5 GB      Single Computer
Gross Inter                 16x2.66 GHz    8 MB     4x4 GB       Network
Gross Intra                 4x2.66 GHz     8 MB     4 GB         Single Computer
Zhang                       40x2.4 GHz     12 MB    10x24 GB     Network

Table 6.2: Architecture Comparison

6.6 Comparison of Scalability

Figure 6.2 shows the scalability of Parallel Ontology Bridge, Gross's intra-node, Gross's inter-node and Zhang's Hadoop approaches. The ideal system is one that has no contention or coherency delay. Parallel Ontology Bridge starts to diverge from this ideal scalability at more than 15 concurrent jobs. Since Parallel Ontology Bridge is the closest to the ideal, it has the best scalability. Gross's inter-node approach is the next closest, so it shows the second-best scalability; it diverges from the ideal at more than 4 concurrent jobs. Gross's intra-node approach performs better than Zhang's Hadoop approach when fewer than 8 jobs are executed in parallel. The worse performance of these two approaches may be due to the lower-quality hardware used in these systems or to poorer software architecture that causes more contention.

Figure 6.2: Comparison of scalability of ontology alignment approaches based on data points

Figure 6.3 shows the system behavior extrapolated beyond the experimental data. Parallel Ontology Bridge has maximum effectiveness at approximately 60 jobs and a 30-times speedup. Gross's inter-node approach continues to increase in performance slowly, reaching maximum effectiveness only when well over a hundred thousand jobs execute in parallel. Gross's intra-node approach peaks early and then declines rapidly, while Zhang's approach starts to decline slowly at approximately 100 jobs.

The retrograde performance of Parallel Ontology Bridge is due to its coherency delay. Since Gross's inter-node approach has significantly lower coherency delay, but higher contention delay, its performance continues to improve slowly as more processors are added. Zhang's approach has poor coherency and contention delay compared to the other approaches; this is why it does not increase in performance early on and then plateaus.

Gross's intra-node approach has lower contention delay than coherency delay; this is why it shows improvement early on, but has significant retrograde performance as more processors are added.

Figure 6.3: Comparison of extrapolated scalability of ontology alignment approaches

Table 6.3 shows the scalability metrics for these systems. Parallel Ontology Bridge has very small contention delay and small coherency delay. In contrast, Gross's inter-node system has moderate contention delay and very low coherency delay. This is why Gross's inter-node approach does not scale as well as Parallel Ontology Bridge when running a few jobs, but continues to scale after Parallel Ontology Bridge shows retrograde performance; coherency is what causes performance to drop as the number of jobs increases. Gross's inter-node approach and Parallel Ontology Bridge had dramatically better scalability metrics than the other two approaches compared, and this advantage is evident in both graphs, experimentally at small scale and theoretically at large scale.
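This behavior also follows directly from the model: with $\kappa = 0$ the Universal Scalability Law reduces to Amdahl's law,

$$C(p) = \frac{p}{1 + \sigma\,(p-1)} \;\longrightarrow\; \frac{1}{\sigma} \quad \text{as } p \to \infty,$$

so a system with negligible coherency delay keeps improving, ever more slowly, as units are added, while any $\kappa > 0$ means the $p(p-1)$ term eventually dominates and forces a finite optimum $p^*$.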

Zhang's Hadoop approach has significantly more contention delay than any of the other systems compared. This is possibly due to the use of a single "housekeeping" node which holds all of the persisted data during execution.

Gross's intra-node approach has some contention delay and the largest coherency delay. This large coherency delay is what causes its retrograde performance to begin so early.

Parallel Ontology Bridge has the highest theoretical speedup of any system, a speedup of 30.6 when 60 processors are used, giving an efficiency of 51%. Gross's inter-node approach has the second-best speedup, reached at 128,309 processors, giving the worst efficiency of any system at 0.02%. Gross's intra-node system also has good efficiency. The efficiency of Gross's intra-node system and Parallel Ontology Bridge may be due to their communicating over a system bus instead of the network (see Table 6.2).

The overall speed of Parallel Ontology Bridge is much faster than that of Ontology Bridge. Table 6.4 shows the execution speed of Ontology Bridge (96 hours) versus that of Parallel Ontology Bridge (40 minutes).

6.7 Summary of Scalability

This chapter compared results of ontology alignment at the system level. This is the only appropriate comparison to make with existing results, since the underlying hardware and system architecture differ. The causes of scalability performance depend on many factors: how the problem is broken down, the processor architecture used, how the processors are connected, the speed of these connections, how contention is handled, and how much and what type of memory is available. The Parallel Ontology Bridge system showed better scalability than many similar approaches that have been published.

Table 6.3: Scalability Metrics Comparison

System                      Contention      Coherency       p*         Speedup at p*    Efficiency at p*
Parallel Ontology Bridge    1.93 x 10^-5    2.72 x 10^-4    60         30.6             51%
Gross Intra                 1.15 x 10^-4    1.23 x 10^-2    9          4.77             53%
Gross Inter                 3.36 x 10^-2    5.87 x 10^-11   128,309    29.7             0.02%
Zhang                       9.53 x 10^-2    1.15 x 10^-4    88         8.65             9.8%

Table 6.4: Speed of Execution

Project                     Runtime       Ontology Size
Ontology Bridge             96 hours      O(1,000 · 1,000) [1]
Parallel Ontology Bridge    40 minutes    O(1,000 · 1,000) [1]

CHAPTER 7

Conclusion

This work describes Parallel Ontology Bridge, an approach to ontology alignment that uses support vector machines to find non-equivalence relationships and that scales through parallelization. Parallel Ontology Bridge maintained the alignment quality of the previous work, Ontology Bridge, at an F-measure of 0.67, while reducing execution time from 96 hours to 40 minutes. In addition, it showed a theoretical scalability factor of 30 with an efficiency of 51%. This shows that Parallel Ontology Bridge is very scalable. The other systems compared only had a maximum scalability factor of 29 and efficiency of 9.8%. Like Ontology Bridge, ontologies were aligned by matching linguistic and structural features in a support vector machine. However, unlike Ontology Bridge, Parallel Ontology Bridge can be scaled across many processing units. By using MapReduce with shared memory, Parallel Ontology Bridge offers a simple, unique method of parallelizing ontology alignment which is domain independent.

This work explored the use of MapReduce, human computation, information entropy, morpheme-based extraction, and cloud computing for ontology alignment. MapReduce proved to be a straightforward method of implementing this parallelization. It is a clear match for the Ontology Bridge system and has been proven in industry; Google, for example, uses MapReduce for its PageRank algorithm, which determines the order of search results in its search engine. Parallel Ontology Bridge shows that a similar approach of investing in parallel infrastructure can solve the ontology alignment scalability problem. This parallel infrastructure can come from cloud providers such as Amazon1 or RackSpace2.

Scalable ontology alignment is a necessary technology for the Semantic Web [55], both in its full operation and for integrating with existing systems. Scalable ontology

1 http://aws.amazon.com
2 http://rackspace.com/cloud

alignment is still an unsolved problem in the field. Parallel Ontology Bridge shows a direction, mimicking some of the same developments that occurred in more straightforward document analysis and search engine development.

Parallel Ontology Bridge combines a modular design, MapReduce distribution of jobs, caching and cloud computing to provide an effective solution to scaling ontology alignment. With appropriate training data, it can automatically tune the support vector machine to optimally align ontologies, eliminating one form of manual tuning.

There were several challenges with this system. The features from Ontology Bridge had to be ported and appropriately tuned to reach the F-measure that Ontology Bridge showed; tuning primarily consisted of selecting appropriate feature normalization. The goal of Parallel Ontology Bridge was to match the F-measure of Ontology Bridge.

This thesis is the first paper in the ontology alignment community that discusses the theoretical scalability of systems. None of the other papers compared addresses it. Some papers describe approaches similar to this one, but none goes into the detail of analyzing theoretical scalability.

Success and failure of this project were based on two measures: scalability and F-measure. The project succeeded on both, maintaining F-measure while dramatically increasing scalability.

This thesis contributes an end-to-end parallelization technique to the ontology alignment community, as well as the first reported use of cloud computing resources for ontology alignment. The work was implemented and empirically shown to scale. The approach shows promise for scaling ontology alignment because of its good empirical results and the simplicity of the approach. There may be additional bottlenecks on different hardware systems that cannot be detected from the research in this thesis. Scaling ontology alignment is still a developing field [28] with great potential applications in biomedicine [30].

7.1 Future Work

Ontology alignment is a key component of enhancing the research uses of medical ontologies. Ontology alignment tools need to be better integrated into the workflows of ontology researchers. Effective research assistance and diagnostic tools need to be developed, as well as methods for on-demand or just-in-time processing of ontology alignment. This would enable integrated and up-to-date ontologies to be used in biomedical research without additional effort or cost on the part of the researchers. A well-aligned ontology is not useful if it remains stagnant.

For Parallel Ontology Bridge to be used broadly, an effective approach to gathering training data is necessary. Bootstrapping techniques for training data and using an incremental-learning SVM, which would allow interactive and continuous training, offer potential solutions to this problem.

There is future work to incorporate the results and approaches of other ontology alignment systems into the architecture described in this thesis. In addition, much more testing and research is needed to compare the results of this thesis to the Ontology Alignment Evaluation Initiative, specifically the physiology and scalability tracks. To do this, an approach to unsupervised learning for equivalence relationships may be necessary; unsupervised learning does not require data to be labeled, and labeling data can be a labor-intensive process.

Future work also includes modifying Parallel Ontology Bridge to be a distributed system. This may require further analysis and design of an approach to sharing data across multiple processing nodes.

APPENDIX A

Code Listings

A.1 Align Implementation

Listing A.1: align.py

from __future__ import division

import sys
from itertools import product, starmap

import numpy
import rdflib
from sklearn import svm, grid_search, metrics, preprocessing


class Bridge():
    def __init__(self, name, training, features, expected=[]):
        self.relation_name = name
        self.features = features
        self.classifier = Classifier(training)
        self.expected = expected
        self.results = []
        self.new_relations = []

    def f(self, c1, c2):
        # Extract one feature vector for a pair of classes and classify it.
        feature_vector = [feature(c1, c2, self.g1, self.g2)
                          for feature in self.features]
        match = self.classifier.classify(feature_vector)
        label_1 = str(self.g1.label(c1))
        label_2 = str(self.g2.label(c2))
        as_expected = (label_1, label_2) in self.expected
        return (c1, c2, feature_vector, int(match), as_expected)

    def align(self, g1, g2):
        self.g1 = g1
        self.g2 = g2
        o1 = get_classes(g1)
        o2 = get_classes(g2)

        self.results = list(starmap(self.f, product(o1, o2)))
        # Keep only the pairs the SVM predicts are related.
        self.new_relations = [(c1, c2) for (c1, c2, _, match, _)
                              in self.results if match]

    def test(self, test_data):
        tuples, labels = zip(*test_data)
        results = [self.classifier.classify(t) for t in tuples]
        return metrics.precision_recall_fscore_support(labels, results,
                                                       labels=[1, 0])


def progress(i, total_comparisons, found):
    sys.stdout.write('\r')
    sys.stdout.write("{:,}/{:,} comparisons made, {:,} found."
                     .format(i + 1, total_comparisons, found))
    sys.stdout.flush()


class Classifier():
    def __init__(self, data):
        # Grid search over RBF and linear kernels tunes C and gamma.
        tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                             'C': [1, 10, 100, 1000]},
                            {'kernel': ['linear'],
                             'C': [1, 10, 100, 1000]}]
        self.svm = grid_search.GridSearchCV(svm.SVC(), tuned_parameters)
        self.train(data)

    def classify(self, vector):
        result = self.svm.predict([vector])
        return result

    def train(self, data):
        tuples, labels = zip(*data)
        self.svm.fit(preprocessing.normalize(
            numpy.array(tuples, dtype=float)), labels)


def get_classes(g):
    return list(g.subjects(rdflib.RDF.type, rdflib.OWL.Class))

A.2 Parallel Implementation

Listing A.2: joblib_align.py

import itertools
from datetime import datetime

import rdflib
from rdflib import OWL, RDF
from sklearn.externals import joblib

import align
import import_csv
import primitives

# Cache expensive CSV imports on disk with joblib.Memory.
memory = joblib.Memory(cachedir="cache")
import_csv.import_csv = memory.cache(import_csv.import_csv)

features = primitives.members
relation = "hyponymy"
tc_names = ['01_platelet_activation',
            '02_mannose_binding',
            '03_immune_system',
            '04_phenylalanine_conversion',
            '05_bone_modeling',
            '06_bone_marrow',
            '07_osteoblast_differentiation',
            '08_osteoclast_differentiation',
            '09_behavior',
            '10_circadian_rhythm']

cases = [import_csv.import_csv(tc, relation, features) for tc in tc_names]


@memory.cache
def get_training(tc_name):
    # Train on every test case except the one being evaluated.
    other_cases = (case for (tc, case) in zip(tc_names, cases)
                   if tc != tc_name)
    training = sum(other_cases, [])
    return training


def parse_ont(filename):
    g = rdflib.Graph()
    g.parse(filename)
    return g


class OntologySubstitute():
    def __init__(self, filename):
        self.ont = parse_ont(filename)

    def objects(self, subject=None, predicate=None):
        return self.ont.objects(subject, predicate)

    def subjects(self, predicate=None, object=None):
        return self.ont.subjects(predicate, object)

    def label(self, subject, default=''):
        return self.ont.label(subject, default)


def grouper(n, iterable):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG') -> ABC DEF G
    args = [iter(iterable)] * n
    return (itertools.ifilter(None, g)
            for g in itertools.izip_longest(fillvalue=None, *args))


if __name__ == '__main__':
    tc_name = 'all'
    training = get_training(tc_name)
    g1 = OntologySubstitute("test_cases/" + tc_name + "/GOTestCase.owl")
    g2 = OntologySubstitute("test_cases/" + tc_name + "/MPTestCase.owl")
    g1_classes = g1.subjects(RDF.type, OWL.Class)
    g2_classes = g2.subjects(RDF.type, OWL.Class)
    pairs = itertools.product(g1_classes, g2_classes)
    c = align.Classifier(training)

    def f(c1, c2):
        # The map step: extract features and classify one pair of classes.
        feature_vector = [feature(c1, c2, g1, g2) for feature in features]
        match = c.classify(feature_vector)
        return (c1, c2, int(match))

    start = datetime.now()
    jobs = joblib.Parallel(n_jobs=16, verbose=1,
                           pre_dispatch='100*n_jobs')
    for subset in grouper(25000, pairs):
        results = jobs(joblib.delayed(f)(c1, c2)
                       for (c1, c2) in subset)
    end = datetime.now()
    print "time elapsed", str(end - start)
    for a, b, m in results:
        print str(a), str(b), m

A.3 Primitives Implementation

Listing A.3: primitives.py

"""
Evidence primitive extractors. All follow the generic form
f(c1, c2, o1, o2) where c1 is a class from ontology o1 and c2 is a
class from ontology o2.
"""
import inspect
import sys
from itertools import product, starmap

import rdflib
from nltk.corpus import wordnet
from rdflib import RDFS

import opencyc
from utils import get_labels

rdflib.plugin.register('text/xml', rdflib.plugin.Parser,
                       'rdflib.plugins.parsers.rdfxml', 'RDFXMLParser')

opencyc_db = opencyc.OpenCyc()


def has_same_label(c1, c2, o1, o2):
    return o1.label(c1, default=None) == o2.label(c2, default=None) != None


def count_wordnet_synonym(c1, c2, o1, o2):
    count = 0
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        count += len(set(wordnet.synsets(w1)) & set(wordnet.synsets(w2)))
    return count


def count_wordnet_hypernyms(c1, c2, o1, o2):
    count = 0
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        for s2 in wordnet.synsets(w2):
            count += sum(s1 in hypernyms(s2, 5)
                         for s1 in wordnet.synsets(w1))
    return count


def count_wordnet_hyponyms(c1, c2, o1, o2):
    # Unimplemented stub; False is treated as a count of zero.
    return False


def count_opencyc_hypernyms(c1, c2, o1, o2):
    count = 0
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        count += w1 in ancestors(w2)
    return count


def count_opencyc_hyponyms(c1, c2, o1, o2):
    count = 0
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        count += w2 in ancestors(w1)
    return count


def has_same_beginning(c1, c2, o1, o2):
    return any(l1.startswith(l2) or l2.startswith(l1)
               for (l1, l2) in label_pairs(c1, c2, o1, o2))


def has_same_ending(c1, c2, o1, o2):
    # Compare reversed labels so a shared suffix becomes a shared prefix.
    return any(l1[::-1].startswith(l2[::-1]) or l2[::-1].startswith(l1[::-1])
               for (l1, l2) in label_pairs(c1, c2, o1, o2))


def synonym(w1, w2):
    return w1 == w2 and w1 in opencyc_db


def count_opencyc_synonyms(c1, c2, o1, o2):
    return sum(synonym(w1, w2)
               for (w1, w2) in get_word_pairs(c1, c2, o1, o2))


def has_same_first_word(c1, c2, o1, o2):
    return any(l1.split()[0] == l2.split()[0]
               for (l1, l2) in label_pairs(c1, c2, o1, o2))


def has_same_last_word(c1, c2, o1, o2):
    return any(l1.split()[-1] == l2.split()[-1]
               for (l1, l2) in label_pairs(c1, c2, o1, o2))


def has_matching_labels(c1, c2, o1=None, o2=None):
    return any(l1 == l2 for (l1, l2) in label_pairs(c1, c2, o1, o2))


def has_sub_prefix(c1, c2, o1, o2):
    return any(starmap(sub_prefix, get_word_pairs(c1, c2, o1, o2)))


def has_superclass_1(c1, c2, o1, o2):
    return bool(list(get_parents(c1, o1)))


def has_superclass_2(c1, c2, o1, o2):
    return bool(list(get_parents(c2, o2)))


def has_opencyc_subclass_synonym(c1, c2, o1, o2):
    for subclass in subclasses(c1, o1):
        for (w1, w2) in get_word_pairs(subclass, c2, o1, o2):
            if w1 == w2 and w1 in opencyc_db:
                return True
    return False


def has_opencyc_superclass_hypernym(c1, c2, o1, o2):
    for superclass in get_parents(c1, o1):
        for (w1, w2) in get_word_pairs(superclass, c2, o1, o2):
            if w1 in ancestors(w2):
                return True
    return False


def has_opencyc_superclass_synonym(c1, c2, o1, o2):
    for superclass in get_parents(c1, o1):
        for (w1, w2) in get_word_pairs(superclass, c2, o1, o2):
            if w1 == w2 and w1 in opencyc_db:
                return True
    return False


def has_opencyc_superclass_hyponym(c1, c2, o1, o2):
    for superclass in get_parents(c1, o1):
        for (w1, w2) in get_word_pairs(superclass, c2, o1, o2):
            if w2 in ancestors(w1):
                return True
    return False


def subclasses(cls, ont):
    return ont.subjects(RDFS.subClassOf, cls)


def has_wordnet_synonym(c1, c2, o1, o2):
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        if set(wordnet.synsets(w1)) & set(wordnet.synsets(w2)):
            return True
    return False


def has_wordnet_hypernym(c1, c2, o1, o2):
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        for s1 in wordnet.synsets(w1):
            for s2 in wordnet.synsets(w2):
                if s1 in hypernyms(s2, 5):
                    return True
    return False


def has_wordnet_hyponym(c1, c2, o1, o2):
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        for s1 in wordnet.synsets(w1):
            for s2 in wordnet.synsets(w2):
                if s2 in hypernyms(s1, 5):
                    return True
    return False


def has_wordnet_subclass_synonym(c1, c2, o1, o2):
    for subclass in subclasses(c1, o1):
        for (w1, w2) in get_word_pairs(subclass, c2, o1, o2):
            if set(wordnet.synsets(w1)) & set(wordnet.synsets(w2)):
                return True
    return False


def has_wordnet_superclass_synonym(c1, c2, o1, o2):
    for superclass in get_parents(c1, o1):
        for (w1, w2) in get_word_pairs(superclass, c2, o1, o2):
            if set(wordnet.synsets(w1)) & set(wordnet.synsets(w2)):
                return True
    return False


def has_wordnet_superclass_hyponym(c1, c2, o1, o2):
    for superclass in get_parents(c1, o1):
        for (w1, w2) in get_word_pairs(superclass, c2, o1, o2):
            for s1 in wordnet.synsets(w1):
                for s2 in wordnet.synsets(w2):
                    if s2 in hypernyms(s1, 5):
                        return True
    return False


def has_wordnet_superclass_hyperym(c1, c2, o1, o2):
    for superclass in get_parents(c1, o1):
        for (w1, w2) in get_word_pairs(superclass, c2, o1, o2):
            for s1 in wordnet.synsets(w1):
                for s2 in wordnet.synsets(w2):
                    if s1 in hypernyms(s2, 5):
                        return True
    return False


def hypernyms(synset, level):
    # Collect hypernyms transitively up to the given number of levels.
    if level == 1:
        return synset.hypernyms()
    hyps = synset.hypernyms()
    for s in synset.hypernyms():
        hyps += hypernyms(s, level - 1)
    return hyps


def has_opencyc_synonym(c1, c2, o1, o2):
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        if w1 == w2 and w1 in opencyc_db:
            return True
    return False


def is_opencyc_hypernym(c1, c2, o1, o2):
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        return w1 in ancestors(w2)


def is_opencyc_hyponym(c1, c2, o1, o2):
    for (w1, w2) in get_word_pairs(c1, c2, o1, o2):
        return w2 in ancestors(w1)


def get_parents(cls, ont):
    return ont.objects(cls, RDFS.subClassOf)


def label_pairs(c1, c2, o1, o2):
    labels1 = labels_for(c1, o1)
    labels2 = labels_for(c2, o2)
    return product(labels1, labels2)


def get_word_pairs(c1, c2, o1, o2):
    for n1, n2 in label_pairs(c1, c2, o1, o2):
        for (w1, w2) in product(n1.split(), n2.split()):
            yield w1, w2


def labels_for(cls, ont):
    # Fall back to the URI fragment when no rdfs:label is available.
    labels = [cls.split("#")[-1]] if "#" in cls else [cls]
    if ont and get_labels(cls, ont):
        labels = get_labels(cls, ont)
    return labels


def ancestors(concept):
    if concept not in opencyc_db:
        return []
    return [c for c in opencyc_db[concept] if c != concept]


def sub_prefix(s1, s2):
    return s1 == "sub" + s2


def has_stoilos_similarity(c1, c2, o1, o2):
    for (l1, l2) in label_pairs(c1, c2, o1, o2):
        return stoilos_similarity(l1, l2)


def stoilos_similarity(st1, st2):
    if st1 is None or st2 is None:
        return -1

    s1 = st1.lower().replace('.', ' ').replace('_', ' ').replace(' ', '')
    s2 = st2.lower().replace('.', ' ').replace('_', ' ').replace(' ', '')

    L1 = len(s1)
    L2 = len(s2)

    if L1 == 0 and L2 == 0:
        return 1
    if L1 == 0 or L2 == 0:
        return 0

    # Repeatedly find and remove the longest common substring.
    common = 0
    best = 2
    while len(s1) > 0 and len(s2) > 0 and best != 0:
        best = 0
        l1 = len(s1)
        l2 = len(s2)
        startS1 = endS1 = startS2 = endS2 = 0
        i = 0
        while i < l1 and l1 - i > best:
            j = 0
            while l2 - j > best:
                k = i
                while j < l2 and s1[k] != s2[j]:
                    j += 1
                if j != l2:
                    # Extend the matching substring as far as it goes.
                    p = j
                    j += 1
                    k += 1
                    while j < l2 and k < l1 and s1[k] == s2[j]:
                        j += 1
                        k += 1
                    if k - i > best:
                        best = k - i
                        startS1, endS1 = i, k
                        startS2, endS2 = p, j
            i += 1
        # Remove the best common substring from both strings.
        s1 = s1[:startS1] + s1[endS1:]
        s2 = s2[:startS2] + s2[endS2:]
        common += best

    commonality = float(2 * common) / (L1 + L2)

    winkler = winklerImprovement(st1, st2, commonality)

    rest1 = L1 - common
    rest2 = L2 - common
    unmatchedS1 = max(rest1, 0) / float(L1)
    unmatchedS2 = max(rest2, 0) / float(L2)

    # Hamacher product
    suma = unmatchedS1 + unmatchedS2
    prod = unmatchedS1 * unmatchedS2
    p = 0.6  # for p = 1 it coincides with the algebraic product
    if (suma - prod) == 0:
        dissimilarity = 0
    else:
        dissimilarity = prod / (p + (1 - p) * (suma - prod))

    # Modification JE: normalize the returned value (instead of [-1, 1])
    result = commonality - dissimilarity + winkler
    return (result + 1) / 2


def winklerImprovement(s1, s2, commonality):
    n = min(len(s1), len(s2))
    i = 0
    for i in xrange(n):
        if s1[i] != s2[i]:
            break
    commonPrefixLength = min(4, i)
    winkler = commonPrefixLength * 0.1 * (1 - commonality)
    return winkler


# Collect every feature primitive defined in this module by name prefix.
members = [func for (name, func)
           in inspect.getmembers(sys.modules[__name__])
           if inspect.isfunction(func) and (name.startswith('has_') or
                                            name.startswith('count_') or
                                            name.startswith('is_'))]

A.4 OpenCyc Implementation

Listing A.4: opencyc.py

# OpenCyc shelve.
# Open OpenCyc and iterate through each concept that has a label.
# For each concept that has a label, find all the ancestors up to level N.
# Create a dictionary of them of the form
#     {key: [[parents], [grandparents], ...]}
#
# Phase two: open OpenCyc and get each thing that has a label; create a
# list of the transitive closure over OWL.subClassOf.
# Phase three: open OpenCyc and parse into lists of lists based on
# parents, grandparents etc. by doing a breadth-first traversal over
# OWL.subClassOf.
import shelve
import cPickle as pickle

import rdflib
from rdflib import RDFS


class OpenCyc():
    def __init__(self):
        self.shelf = shelve.open("opencyc.shelve")

    def __contains__(self, item):
        return item.encode('UTF-8') in self.shelf

    def __getitem__(self, item):
        return self.shelf[item.encode('UTF-8')]


if __name__ == "__main__":
    g = rdflib.Graph()
    g.parse("ontologies/opencyc-2012-05-10-readable.owl")
    pickle.dump(g, open('opencyc.pickle', 'w'))
    nodes = (n for n in g.all_nodes() if g.label(n))

    s = shelve.open("opencyc.shelve")
    for n in nodes:
        key = g.label(n).encode('UTF-8')
        # Ancestor labels via the transitive closure of rdfs:subClassOf.
        s[key] = [g.label(m).encode('UTF-8') for m in
                  g.transitive_objects(n, RDFS.subClassOf)
                  if g.label(m)]
    s.close()

Bibliography

[1] S. K. Stoutenburg, “Advanced ontology alignment: New methods for biomedical ontology alignment using non-equivalence relations,” Ph.D. dissertation, University of Colorado at Colorado Springs, 2009.

[2] L. Sweetlove. (2011) Number of species on earth tagged at 8.7 million. Nature. [Online]. Available: http://www.nature.com/news/2011/110823/full/news.2011.498.html

[3] (2012, September 17) Wikipedia live statistics page. [Online]. Available: http://stats.wikimedia.org/EN/TablesWikipediaZZ.htm#distribution

[4] T. G. O. Consortium, “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, May 2000.

[5] D. Rubin, N. Shah, and N. Noy, “Biomedical ontologies: a functional perspective,” Briefings in Bioinformatics, vol. 9, no. 1, pp. 75–90, 2008.

[6] P. Shvaiko and J. Euzenat, “Ontology matching: state of the art and future challenges,” IEEE Transactions on Knowledge and Data Engineering, vol. 99, 2012.

[7] J. Euzenat and P. Shvaiko, Ontology Matching. Springer, 2007.

[8] C. Smith, C. Goldsmith, J. Eppig et al., “The mammalian phenotype ontology as a tool for annotating, analyzing and comparing phenotypic information,” Genome Biol, vol. 6, no. 1, p. R7, 2005.

[9] D. L. McGuinness, F. Van Harmelen et al., “OWL Web Ontology Language overview,” W3C Recommendation, vol. 10, no. 2004-03, p. 10, 2004.

[10] C. Elkan and R. Greiner, “Building large knowledge-based systems: Representation and inference in the Cyc project: D. B. Lenat and R. V. Guha,” Artificial Intelligence, vol. 61, no. 1, pp. 41–52, 1993.

[11] J. David, F. Guillet, and H. Briand, “Matching directories and OWL ontologies with AROMA,” in Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006, pp. 830–831.

[12] J. Noessner and M. Niepert, “CODI: Combinatorial optimization for data integration: results for OAEI 2010,” Ontology Matching, p. 142, 2010.

[13] M. Hussain and S. Srivatsa, “A study of different ontology matching systems,” International Journal of Computer Applications (0975–8887), 2012.

[14] I. F. Cruz, F. P. Antonelli, and C. Stroe, “AgreementMaker: efficient matching for large real-world schemas and ontologies,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1586–1589, 2009.

[15] N. Jian, W. Hu, G. Cheng, and Y. Qu, “Falcon-AO: Aligning ontologies with Falcon,” in K-CAP 2005 Workshop on Integrating Ontologies, 2005, pp. 87–93.

[16] E. Rahm, “Towards large-scale schema and ontology matching,” in Schema Matching and Mapping. Springer, 2011, pp. 3–27.

[17] I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley, 1995.

[18] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[19] M. A. Hearst, S. Dumais, E. Osman, J. Platt, and B. Schölkopf, “Support vector machines,” Intelligent Systems and their Applications, IEEE, vol. 13, no. 4, pp. 18–28, 1998.

[20] J. Euzenat, A. Ferrara, L. Hollink, A. Isaac, C. Joslyn, V. Malaisé, C. Meilicke, A. Nikolov, J. Pane, M. Sabou, F. Scharffe, P. Shvaiko, V. Spiliopoulos, H. Stuckenschmidt, O. Šváb Zamazal, V. Svátek, C. Trojahn, G. Vouros, and S. Wang, “Results of the ontology alignment evaluation initiative 2009,” http://eprints.biblio.unitn.it/1807/1/006.pdf.

[21] N. F. Noy, “Semantic integration: a survey of ontology-based approaches,” SIGMOD record, vol. 33, no. 4, pp. 65–70, 2004.

[22] A. Doan, J. Madhavan, P. Domingos, and A. Halevy, “Ontology matching: A machine learning approach,” Handbook on Ontologies in Information Systems, pp. 397–416, 2004.

[23] B. Smith, W. Ceusters, B. Klagges, J. Köhler, A. Kumar, J. Lomax, C. Mungall, F. Neuhaus, A. L. Rector, and C. Rosse, “Relations in biomedical ontologies,” Genome biology, vol. 6, no. 5, p. R46, 2005.

[24] A. Johnson and C. O’Donnell, “An open access database of genome-wide association results,” BMC medical genetics, vol. 10, no. 1, p. 6, 2009.

[25] A. Gruzdz, A. Ihnatowicz, J. Siddiqi, and B. Akhgar, “Mining genes relations in microarray data combined with ontology in colon cancer automated diagnosis system,” World Academy of Science, Engineering and Technology, vol. 16, no. 26, pp. 140–144, 2006.

[26] N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L. Rubin, M.-A. Storey, C. G. Chute et al., “BioPortal: ontologies and integrated data resources at the click of a mouse,” Nucleic Acids Research, vol. 37, no. suppl 2, pp. W170–W173, 2009.

[27] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg, K. Eilbeck, A. Ireland, C. J. Mungall et al., “The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration,” Nature Biotechnology, vol. 25, no. 11, pp. 1251–1255, 2007.

[28] J. Euzenat, C. Meilicke, H. Stuckenschmidt, P. Shvaiko, and C. Trojahn, “Ontology alignment evaluation initiative: six years of experience,” Journal on Data Semantics XV, pp. 158–192, 2011.

[29] J. Aguirre, B. Grau, K. Eckert, J. Euzenat, A. Ferrara, R. van Hague, L. Hollink, E. Jimenez-Ruiz, C. Meilicke, A. Nikolov et al., “Results of the ontology alignment evaluation initiative 2012,” in Proc. 7th International Semantic Web Conference Workshop on Ontology Matching (OM), Boston, MA, 2012, pp. 73–115.

[30] O. Bodenreider and A. Burgun, “Biomedical ontologies,” Medical Informatics, pp. 211–236, 2005.

[31] G. A. Miller et al., “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.

[32] T. F. Hayamizu, M. Mangan, J. P. Corradi, J. A. Kadin, M. Ringwald et al., “The adult mouse anatomical dictionary: a tool for annotating and integrating data,” Genome biology, vol. 6, no. 3, p. R29, 2005.

[33] N. Sioutos, S. d. Coronado, M. W. Haber, F. W. Hartel, W.-L. Shaiu, and L. W. Wright, “NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information,” Journal of Biomedical Informatics, vol. 40, no. 1, pp. 30–43, 2007.

[34] A. Ghazvinian, N. F. Noy, and M. A. Musen, “Creating mappings for ontologies in biomedicine: Simple methods work,” in AMIA Annual Symposium Proceedings. San Francisco, CA: American Medical Informatics Association, 2009, p. 198.

[35] W. Hu, Y. Zhao, D. Li, G. Cheng, H. Wu, and Y. Qu, “Falcon-AO: results for OAEI 2007,” in Proceedings of the International Workshop on Ontology Matching, Busan, Korea, 2007.

[36] J. David, F. Guillet, and H. Briand, “Association rule ontology matching approach,” International Journal on Semantic Web and Information Systems, vol. 3, no. 2, pp. 27–49, 2007.

[37] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” in ACM SIGMOD Record, vol. 22, no. 2. ACM, 1993, pp. 207–216.

[38] P. Xu, Y. Wang, L. Cheng, and T. Zang, “Alignment results of SOBOM for OAEI 2010,” Ontology Matching, p. 203, 2010.

[39] H. Zhang, W. Hu, and Y. Qu, “VDoc+: a virtual document based approach for matching large ontologies using MapReduce,” Journal of Zhejiang University-Science C, vol. 13, no. 4, pp. 257–267, 2012.

[40] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.

[41] A. J. Smola, B. Schölkopf, and K.-R. Müller, “The connection between regularization operators and support vector kernels,” Neural Networks, vol. 11, no. 4, pp. 637–649, 1998.

[42] J. Euzenat, “Semantic precision and recall for ontology alignment evaluation,” in Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007, pp. 348–353.

[43] Z. Xuegong, “Introduction to statistical learning theory and support vector machines,” Acta Automatica Sinica, vol. 26, no. 1, pp. 32–42, 2000.

[44] R. Neches, R. Fikes, T. Finin, T. Gruber, R. Patil, T. Senator, and W. Swartout, “Enabling technology for knowledge sharing,” AI Magazine, vol. 12, no. 3, p. 36, 1991.

[45] V. Sheng, F. Provost, and P. Ipeirotis, “Get another label? improving data quality and data mining using multiple, noisy labelers,” in Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, NV: ACM, 2008, pp. 614–622.

[46] M. Bernstein, G. Little, R. Miller, B. Hartmann, M. Ackerman, D. Karger, D. Crowell, and K. Panovich, “Soylent: a word processor with a crowd inside,” in Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. New York City, NY: ACM, 2010, pp. 313–322.

[47] P. Wais, S. Lingamneni, D. Cook, J. Fennell, B. Goldenberg, D. Lubarov, D. Marin, and H. Simons, “Towards building a high-quality workforce with mechanical turk,” in Proceedings of Computational Social Science and the Wisdom of Crowds (NIPS), Vancouver, BC, 2010, pp. 1–5.

[48] S. K. Stoutenburg, J. Kalita, K. Ewing, and L. M. Hines, “Scaling alignment of large ontologies,” International journal of bioinformatics research and applications, vol. 6, no. 4, pp. 384–401, 2010.

[49] G. Stoilos, G. Stamou, and S. Kollias, “A string metric for ontology alignment,” in The Semantic Web–International Semantic Web Conference 2005. Galway, Ireland: Springer, 2005, pp. 624–637.

[50] E. Loper and S. Bird, “NLTK: the natural language toolkit,” in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ser. ETMTNLP ’02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 63–70. [Online]. Available: http://dx.doi.org/10.3115/1118108.1118117

[51] K. Beck, Test-driven development: by example. Addison-Wesley Professional, 2003.

[52] N. Gunther, Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services. Berlin; London: Springer, 2011.

[53] A. Gross, M. Hartung, T. Kirsten, and E. Rahm, “On matching large life science ontologies in parallel,” in Data Integration in the Life Sciences. Springer, 2010, pp. 35–49.

[54] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: min-hash and tf-idf weighting,” in Proceedings of the British Machine Vision Conference, vol. 3, 2008, p. 4.

[55] A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy, “Learning to match ontologies on the semantic web,” The VLDB Journal, vol. 12, no. 4, pp. 303–319, 2003.