DCO: A Mid Level Generic Data Collection Ontology

by

Joel Cummings

A Thesis presented to The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in Computer Science

© Joel Cummings, November, 2017

ABSTRACT

DCO: A Mid Level Generic Data Collection Ontology

Joel Cummings
University of Guelph, 2017

Advisor: Professor Deborah Stacey

Ontologies have established themselves as the major framework for knowledge transfer and sharing. They allow consistent understanding of data by both computers and human modellers. This is done through a standard representation of a domain's world view which captures the classes and relationships that exist in a particular domain. The interest in capturing a domain's world view has led to the creation of many ontologies as ontological developers create ontologies to model their world views. An ontology's world view is a major contributor to reuse and interoperation.

The significant number of ontologies produced has created a wealth of knowledge, but particular application or domain specific views create issues for others. The issue is communication and interoperation between ontologies. With so many different designs and terminologies it is difficult to make use of existing terms and instances within these ontologies without creating some way to relate or translate terminology.

This thesis tackles the problem of data collection among ontologies through answering the question: How does one model the domain of data collection using an ontology while maintaining a level of domain agnosticism such that the ontology can be reused for any domain? We propose that a mid level ontology design can be used to model the domain of data collection while remaining domain generic otherwise. Consequently, we present the Data Collection Ontology (DCO) and evaluate it to show that we enable reusability through its high level class hierarchies that allow domain level terminology to be represented within the DCO.

Direct contributions of this work include the Data Collection Ontology (DCO) and the DCO Survey Ontology, as well as the philosophy of Classifiers as a way to introduce reasoning and dynamic ontology support.

Contents

1 Introduction
  1.1 Motivation
    1.1.1 General Motivations
    1.1.2 Motivations for a Generic Data Collection Ontology
  1.2 Terminology
  1.3 Research Question and Hypotheses
    1.3.1 Domain Overlap
    1.3.2 Term Specificity
    1.3.3 Ontology Coverage
  1.4 Thesis Statement

2 Literature Review and Design Analysis
  2.1 Terminology Defined
  2.2 Classification Research
    2.2.1 Classifying Ontologies
    2.2.2 Upper Level Ontologies
  2.3 Problem Placement of Generic Data Collection
    2.3.1 Mid Level Ontologies as Placement
  2.4 Conclusion

3 Ontology Design
  3.1 BFO Discussion
  3.2 Design Intentions
    3.2.1 Classifiers Examined
  3.3 Competency Questions Defined
  3.4 Ontology Components
    3.4.1 Classes
    3.4.2 Relations
  3.5 Working Example
  3.6 Design Summary

4 Evaluation Methodology
  4.1 Research Hypotheses
    4.1.1 Domain Overlap
    4.1.2 Term Specificity
    4.1.3 Ontology Coverage
  4.2 Experiment 1: The EPA Fuel Economy Ontology
    4.2.1 Classes
    4.2.2 Relations
  4.3 Experiment 2: A Case Study of the Survey Ontology
    4.3.1 The Survey Ontology Premise
    4.3.2 Integrated Survey Ontology
    4.3.3 DCO Survey Ontology Variant
  4.4 Evaluation using Traditional Techniques
    4.4.1 The FOCA Methodology

5 Results
  5.1 Data Collection Ontology Evaluation
    5.1.1 Competency Question Evaluation
    5.1.2 FOCA Evaluation
  5.2 Comparing the Survey Ontologies
    5.2.1 FOCA Evaluation Table Notation Defined
    5.2.2 Evaluating Ontology Hypotheses
    5.2.3 Survey Ontology Evaluation Conclusions
  5.3 Conclusion

6 Conclusions
  6.1 Contributions
  6.2 Future Work and Limitations

A Diagrams
  A.1 Fuel Economy Ontology Structure
  A.2 Survey Ontology Structure
  A.3 Integrated Survey Ontology Class Structure
  A.4 DCO Survey Ontology Class Structure

List of Tables

2.1 The evaluated upper level ontology implementations summarized.
3.1 The top level classes of the Basic Formal Ontology (BFO) that divide major elements.
3.2 DCO Competency Questions to define the uses expected for the DCO implementation.
3.3 DCO Object Relations Summarized.
3.4 Object Relations Continued.
3.5 DCO Data Properties Summarized.
4.1 Subjects defined in the EPA Fuel Economy Ontology. These subjects are based on the aspects that the EPA captures on the vehicles tested.
4.2 EPA Ontology Relations. These relations are defined to match the terminology used by the EPA for the values captured.
4.3 Object Relations for the Vehicle Class.
4.4 DCO Survey Object Relation Locations within the DCO. Each relation is presented with its parent in the DCO.
4.5 DCO Survey Data Relation Locations within the DCO. Each relation is presented with its parent in the DCO.
4.6 DCO Survey Relations added to the DCO along with the DCO parent under which each is defined.
4.7 FOCA Goal 1 [14] defined along with a description used to evaluate each question.
4.8 FOCA Goal 2 [14] defined along with a description used to evaluate each question.
4.9 FOCA Goal 3 [14] defined along with a description used to evaluate each question.
4.10 FOCA Goal 4 [14] defined along with a description used to evaluate each question.
4.11 FOCA Goal 5 [14] defined along with a description used to evaluate each question.
5.1 Competency Question Justifications to assess whether the ontology implementation meets the requirements of each question.
5.2 Competency Question Justifications Continued.
5.3 Question Scores for the DCO Ontology.
5.4 DCO FOCA Goal 1 Justifications.
5.5 DCO FOCA Goal 2 Justifications.
5.6 DCO FOCA Goal 3 Justifications.
5.7 DCO FOCA Goal 4 Justifications.
5.8 DCO FOCA Goal 5 Justifications.
5.9 Goal 1 Questions and Justifications for Survey Ontologies.
5.10 Goal 2 Questions and Justifications for Survey Ontologies.
5.11 Goal 3 Questions and Justifications for Survey Ontologies.
5.12 Goal 4 Questions and Justifications for Survey Ontologies.
5.13 Goal 5 Questions and Justifications for Survey Ontologies.
5.14 Survey Ontology Sizes Compared.

List of Figures

2.1 The Ontology Hierarchy demonstrates how term specificity and structural design affect use case.
2.2 An example of how domain level ontologies can utilize one or more mid level ontologies to extend capability while reusing existing terms.
2.3 The BFO Hierarchy demonstrates the use of different ontological levels (as defined within the Ontology Hierarchy, see Figure 2.1) by BFO developers, making it suitable for mid level construction.
3.1 The Basic Formal Ontology's class structure in its OWL implementation. All terms under owl:Thing are defined in BFO as an OWL Class.
3.2 An example of classifiers that demonstrates the use of equivalency relations defined on the Classifier classes to reason the type of instances. This type can then be compared to the has expected type relation to determine consistency.
3.3 DCO Process Types. State Driven Processes represent those that are blocking and require one process to finish before the next starts. The Independent Process structure allows for parallel tasks.
3.4 DCO Classifier Type. Classifiers utilize the has expected type relation to compare to the type assigned by the OWL reasoner through equivalency relations assigned to a classifier to check consistency.
3.5 A partial model of the Vehicle Performance Ontology showing how DCO datums are used and linked to Subjects to specify units on captured instances.
3.6 The top level class structure of the DCO within the BFO. Classifiers exist at the entity level to support reasoning with any type.
3.7 The branch of BFO Continuant descendants defined in the DCO. Continuants represent entities that remain the same throughout time.
3.8 The branch of BFO Occurrent descendants defined in the DCO. Occurrents represent entities that exist within a period of time.
3.9 The list of DCO relations and their structures broken down by type. Relations are used to link DCO classes and instances.
4.1 EPA Ontology Classifiers that are used to group vehicles based on their combined passenger and cargo capacity.
4.2 Vehicle Ontology Structure. This defines the structure of the ontology using DCO Subjects, Measurement Units, and Datums to represent the domain specific EPA Fuel Economy content.
4.3 A Survey Question with Example Answer Formats demonstrating how answer formats can be linked to questions along with responses.
4.4 The total quality estimator of FOCA.
4.5 The total quality estimator of FOCA.
4.6 The FOCA Methodology Steps [14] for evaluating an ontology.
A.1 Fuel Economy Ontology Base.
A.2 Fuel Economy Ontology Continuant Tree Structure from BFO.
A.3 The Fuel Economy Ontology's class structure starting with the Occurrent Class.
A.4 Fuel Economy Ontology Data Relations.
A.5 Recreated Survey Ontology Object Relations.
A.6 The original Survey Ontology Class Structure.
A.7 The original Survey Ontology Data Relations.
A.8 The original Survey Ontology Data Relations.
A.9 Integrated Survey Ontology Base.
A.10 The Integrated Survey Ontology Class structure starting from the Classifier Class.
A.11 Integrated Survey Ontology Continuant.
A.12 Integrated Survey Ontology Occurrent.
A.13 Integrated Survey Ontology Data Relations.
A.14 Integrated Survey Ontology Object Relations.
A.15 DCO Survey Ontology Base.
A.16 DCO Survey Ontology Classifier Class Structure.
A.17 DCO Survey Ontology Continuant Class Structure.
A.18 DCO Survey Ontology Occurrent Class Structure.
A.19 Recreated Survey Ontology Data Relations.
A.20 Recreated Survey Ontology Object Relations.

Chapter 1

Introduction

Ontologies have established themselves as the major framework for knowledge transfer and sharing. They allow consistent understanding of data by both computers and human modellers. This is done through a standard representation of a domain's world view which captures the terms and relationships that exist in a particular domain. The interest in capturing a domain's world view has led to the creation of many ontologies as researchers and industrial workers create ontologies to model their world views. An ontology's world view is what can enable reuse and interoperation, that is, the ability for an ontology to be integrated with existing systems [45].

One of the major issues with ontological design is the balance of specificity along with the balance of domain related terminology that makes an ontology suitable for particular use cases [45]. Studies have shown that reuse is a major issue across domains, with only a small number of ontologies reusing terms and even fewer reusing ontologies in their entirety [42, 45]. This issue largely centers around hierarchies and generic terms which allow other developers to adapt them to their use case [45, 42]. Many domain level ontologies focus on defining terms that fit their use case or problem, but those terms often do not fit other domains or problem areas within that domain [45, 23]. As a result of these issues with reuse and the ability to adapt terms, many ontologies exist within the same domain that fail to make any impact in terms of reuse to others in the same field [37, 45, 42]. This results in considerable redefinition and reduced productivity where developers could otherwise leverage the work of others in their own solutions.

Solutions for reuse have been researched with generic terminology in the form of upper level ontologies [36, 35] that provide terms common to all ontologies and create a structure that allows ontology developers to organize their terms. However, upper level ontologies still leave a large gap between their high level definitions and the terms that exist in application or domain specific ontologies.

The result has been upper level designs that enable reuse and provide structures to those building ontologies in any domain. The issue with utilizing upper level ontologies is that it puts the onus of placing terms within that structure on users who may only be familiar with their particular domain or use case [45]. To developers not aware of the issue of reuse, the additional effort required to develop using a generic base may lead to developing from scratch [45]. This prevents wider use of upper level ontologies by developers and in turn reduces the likelihood of reuse of the resultant ontology by future developers.

This thesis tackles the problem of developing a bridge to reduce the gap between high level terms and domain specific terms for ontologies that capture data. We define a data collection ontology as an ontology that focuses on the processes and instances that occur as a result of data collection. To maintain reusability, a data collection ontology defines a world view which is based on data collection and will not include the terms of any particular domain. This will allow developers to extend its design to place their domain specific terms within a common hierarchy that can be shared among all data collection ontologies.

1.1 Motivation

Ontologies, being central to the Semantic Web, must also play a role in sharing or transferring information across domains and between ontologies. The ability for ontologies to be constructed using the same core for defining and capturing data would allow for much easier interoperation; interoperation being the transfer or sharing of data between systems and/or ontologies. We can separate these motivations into a few general motivations that span ontologies as a whole and then specific motivations for what a data collection ontology should provide to domain level developers.

1.1.1 General Motivations

• Potential for Reuse: Reuse is the overarching goal for ontologies [36, 37]. It is a key concept of several evaluation methods [14, 47] that suggest more general terminology be provided, allowing a greater number of ontologies to utilize the same terms. The potential for reuse allows researchers to make use of existing work and designs while tailoring their solutions to meet specific requirements. A domain agnostic data collection ontology more easily allows reuse through a structure that is recognizable across designs that extend it.

• Potential for Reasoning: We are cognizant that ontologies are designed to be reasoned with [16]. The ability to reason enables a more powerful ontology that can answer a greater number of questions while representing more complex data [16]. Reasoning support in terms of data collection can enable the ontology to determine where instances fit into the overall world view by dynamically assigning type. Assigning type through a reasoner avoids the need for external systems to perform the task outside of the ontology definitions. Additionally, reasoning can reduce the need for external systems to perform data validation through the use of range restrictions on instances [24, 16]; the sketch below illustrates both ideas.
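To make the reasoning motivation concrete, the following is a minimal sketch in OWL/RDF (Turtle syntax); every name in it (the example namespace, MeasurementRecord, hasReading) is a hypothetical illustration rather than an actual DCO term.

    @prefix :     <http://example.org/dco-sketch#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    :hasReading a owl:DatatypeProperty .

    # Equivalence axiom: any individual with a hasReading value in [0, 100]
    # is classified as a MeasurementRecord by the reasoner, with no external
    # system needed to assign the type.
    :MeasurementRecord a owl:Class ;
        owl:equivalentClass [
            a owl:Restriction ;
            owl:onProperty :hasReading ;
            owl:someValuesFrom [
                a rdfs:Datatype ;
                owl:onDatatype xsd:float ;
                owl:withRestrictions ( [ xsd:minInclusive "0.0"^^xsd:float ]
                                       [ xsd:maxInclusive "100.0"^^xsd:float ] )
            ]
        ] .

    # The reasoner infers that :record1 is a MeasurementRecord.
    :record1 :hasReading "42.5"^^xsd:float .

The same range restriction, asserted universally (owl:allValuesFrom) on a declared type, would let a reasoner flag out-of-range readings as inconsistent instead of leaving validation to external code.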

1.1.2 Motivations for a Generic Data Collection Ontology

• System Interoperation: Ontologies that capture data should support integration with existing or future systems to allow data transfer and the querying of terms and instances. As an example, the ontology could be used as a backend for a data collection application that stores its data into the ontology and/or can query the ontology about validity or the process used for data collection.

• Data Collection Domain: Ontologies that capture data tend to do so as a secondary purpose to modelling the domain they operate in [45]. This means they are not focused on the process itself and may miss important data collection components. By creating an ontology for generic data collection, a focus is placed on the collection itself, providing the basic components and organization for domain ontologies that require data collection.

1.2 Terminology

We will briefly define some key terms here to assist the reader with the research questions and hypotheses. More extensive definitions are given in Section 2.1.

A URI or Uniform Resource Identifier is a string, generally in the form of a URL [25], that in terms of an ontology defines the method of identifying classes, relations, datatypes, annotations, and instances. These URIs allow elements to be linked via relations as well as shared among other ontologies.

Namespace Through the use of URIs, namespaces can be defined by taking the base form of a URI. For example, in the URI http://schema.org/person, http://schema.org would be the base form of the URI. Using the base form, ontologies can define overloaded terms so that two ontologies may each define Person but with a different definition.

Relations allow ontology elements to be linked providing a human understandable way to describe classes.

Classes are used to define high level terms or elements that define the domain the ontology seeks to model, another common term is to call them Concepts. Classes should represent a general description.

Instances are concrete examples of one or more classes. Instances must have all relations defined at the class level to be considered an instance of that class.

Datatypes are a way of assigning type to the value that is captured within an instance.

Annotations are the method used for documenting ontology components; they allow for textual descriptions for designers, users, and maintainers of an ontology to understand the purpose and use of each component.

OWL/RDF is an example of a language that an ontology can be expressed in [24]. XML is the typical format used to define OWL/RDF, which is a language that defines standard relations, classes, and the triple format of subject, predicate, object (i.e. class, relation, instance/data). All OWL class definitions are a derivation of owl:Thing, meaning owl:Thing is the base level class. OWL/RDF is a standard language and is the language used by DCO and its derivatives.

Foundational or upper level ontologies can be defined as ontologies that seek to provide definitions and terms that are general to all domains [35] [19].

A mid level ontology seeks to provide a bridge between an upper level ontology and a domain level ontology by providing terms that will be common to several domain level ontologies or areas of a domain level ontology [22].

Domain level ontologies are ontologies that seek to capture a shared conceptualization of a particular domain. These ontologies contain domain specific terms and may only be linked to a specific application [43].

1.3 Research Question and Hypotheses

The main research question of this thesis is: How does one model the domain of data collection using an ontology while maintaining a level of domain agnosticism such that the ontology can be reused for any domain? Stated another way: Is it possible to construct an ontology that models the data collection domain such that it can be reused in other domains? Based on our research question we develop hypotheses for ontologies that are created using the proposed solution. Specific terminology is fully defined in Section 2.1. Each of our hypotheses has a null hypothesis for the case where our hypothesis is not satisfied.

1.3.1 Domain Overlap

• Hypothesis (H1): There exists overlap with domain specific data collection terms and data collection terms within the DCO.

• Null Hypothesis: There is no overlap between domain specific data collection terms and the terms in the DCO.

1.3.2 Term Specificity

• Hypothesis (H2): If and where overlap exists the DCO includes higher level terms than that of the domain specific data collection ontology (i.e. terms introduced are subclassing the DCO terms).

• Null Hypothesis: The overlapping terms are less generic or at the same level as in the DCO (i.e. terms introduced are parents to the DCO terms).

1.3.3 Ontology Coverage

• Hypothesis (H3): There are no terms defined outside of the DCO hierarchy (i.e. there are no terms defined at the owl:Thing level).

• Null Hypothesis: There are terms that are outside of the DCO hierarchy (i.e. there are terms defined at the owl:Thing level).

1.4 Thesis Statement

Based on our research question we define a thesis statement: A mid level ontology design can be used to model the domain of data collection in a domain agnostic manner. This is based on the notion that mid level designs are meant to be extended and that, by using an existing upper level design, we give developers a familiar starting point. This statement will be proven or disproven (i) by outlining our design, (ii) by establishing an evaluation methodology, and (iii) through experimentation and the evaluation of our hypotheses.

Chapter 2

Literature Review and Design Analysis

Capturing data is a common ontological purpose where an ontology's terms, descriptions, and relations capture the universal aspects of a particular dataset and the instances serve as specific examples of those universals. The result is an ontology that categorizes instances through its understanding of universals. However, ontologies focused on capturing a specific domain define terms only applicable to that domain, ignoring hierarchies and higher level terms that are reusable amongst other domains. This is true for the data collection aspects as well [45]. The commonality of such designs in relation to our problem directed research toward generic designs. With a generic design, domain ontologies are able to reuse existing work in the design of their ontology, with the lower level components being the domain terms. These lower level terms will describe data collection in language that coincides with that particular domain while reusing higher level terms to provide an overall hierarchy. Based on this definition we can reword our problem as the construction of a generic data collection ontology that defines terms and hierarchies at a high level to enable reuse among other ontologies that collect data. This definition will help direct the rest of this chapter's examination of existing work in the field.

Our definition of ontology further alludes to a preference towards generic design. Ontology is defined as a shared conceptualization that should seek to define terms in their most formal regard and should not use terms that are specific to particular areas wherever possible [31]. In other words, ontology developers should strive to produce solutions that enable reuse among other ontologies. To do this, high level terms and sub-classing to create hierarchies are preferred over single, more specific terms [17] [31]. Therefore, our definition of ontology and the goal of a data collection ontology are the components that guided research into existing designs.

Research in the area of a generic ontology for data collection cannot be found in published work at the time of this writing. For this reason the problem has been broken down into subproblems that are general to ontology construction. The first issue is that of reuse; if we think of the problem in a global context, reuse is one of the most significant problems involved in developing a solution. This is evident since our solution is based around other ontologies reusing the terms and relations in the data collection ontology. Additionally, from a second perspective, we may be interested in reusing other ontology components.

The second major area is how ontologies are defined in terms of a hierarchy, and what characteristics define that category. Categorization is important because it will determine the ontological research area our problem fits in. It is expected that users will be able to reuse or extend the solution’s components in some aspect. In terms of definition we consider the categories to be based on the generality of the ontology’s world view.

In this chapter we open by defining important terms for ontologies for those not familiar with the field as well as defining terms that may be known but have overloaded definitions. We then move on to discuss different types of ontologies and the level of concern they have for reuse to determine if and where existing designs or ontologies can be reused by a generic data collection ontology. We then focus on ontology categorization in the context of where our problem is best tackled keeping in mind our definition of ontology and our high level view of data collection. Finally, we summarize by drawing conclusions on existing designs and what needs to be done in terms of defining generic and reusable ontologies.

2.1 Terminology Defined

In this section we will define major terminology we will use in this chapter as well as future chapters so readers have an understanding of ontology terms as well as how we will refer to overloaded terms.

A URI or Uniform Resource Identifier is a string, generally in the form of a URL [25], that in terms of an ontology defines the method of identifying classes, relations, datatypes, annotations, and instances. These URIs then allow elements to be linked via relations as well as shared among other ontologies. Finally, URIs can also be used external to ontologies to identify other resources, such as a class in a programming language that can then be linked into an ontology using its URI.

Namespace Through the use of URIs, namespaces can be defined by taking the base form of a URI. For example, in the URI http://schema.org/person, http://schema.org would be the base form of the URI. Using the base form, ontologies can define overloaded terms so that two ontologies may each define Person but with a different definition. This provides the ability to include an alternate definition from another ontology. An example would be dco:person and schema:person, where dco and schema are prefixes for specific URIs that are completed by the term after the colon.
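As a brief illustration in Turtle syntax (the dco prefix and its IRI are hypothetical stand-ins), both ontologies can declare a person class without clashing because each prefixed name expands to a full URI under its own base form:

    @prefix dco:    <http://example.org/dco#> .
    @prefix schema: <http://schema.org/> .
    @prefix owl:    <http://www.w3.org/2002/07/owl#> .

    # Same local name, different namespaces, therefore different terms.
    dco:person    a owl:Class .   # expands to <http://example.org/dco#person>
    schema:person a owl:Class .   # expands to <http://schema.org/person>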

Relations in terms of an ontology will be described as either an Object relation or a Data relation. Relations allow ontology elements to be linked, providing a human understandable way to describe classes. Additionally, relations are used to impose restrictions and requirements for classes that allow the reasoner to determine consistency. In terms of language, relations tend to be in the form of a verb (i.e. has part, branches to) where classes are nouns. An example usage of a relation would be Vehicle has part Engine. An Object relation is used to link instances to classes or instances to instances. Data relations link data elements such as strings, numbers, and boolean values to instances. An example of an object relation was seen in the Vehicle example above while an example of a data relation would be Engine displaces 5.7, where 5.7 is the float value that is captured. All relations will be italicized in this document for easy reference.
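The Vehicle example can be written out as follows; this is a sketch in Turtle syntax using hypothetical names rather than relations actually defined by the DCO.

    @prefix :    <http://example.org/vehicles#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    :Vehicle a owl:Class .
    :Engine  a owl:Class .

    :hasPart   a owl:ObjectProperty .    # object relation: individual to individual
    :displaces a owl:DatatypeProperty .  # data relation: individual to literal value

    # Vehicle has part Engine; Engine displaces 5.7.
    :myVehicle a :Vehicle ;
        :hasPart :myEngine .
    :myEngine a :Engine ;
        :displaces "5.7"^^xsd:float .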

Classes are used to define high level terms or elements that define the domain the ontology seeks to model; another common term is to call them Concepts. Classes should represent a general description. Classes should not contain any data that is specific to an example of that term or object. As an example, we would not include that a Person has blue eyes since that is not representative of all people. Additionally, classes typically have relations between them to link data or instances, and this allows the term or object to be placed within the world the ontology is modelling (i.e. what it is related to), and the other classes that are a part of or contain that class. Classes are typically expressed as nouns such as Vehicle, Person, Word, etc.

Instances are concrete examples of one or more classes. Instances must have all relations defined at the class level to be considered an instance of that class. Establishing class type is reasoned through the use of restrictions. For example, we may define a class of Person which defines the attributes common to all persons (such as having eyes) whereas an instance would define a person with green eyes, which is not something common to all Persons.
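A sketch of the Person example in Turtle syntax (all names hypothetical): the restriction is stated once at the class level, while the instance carries only the facts particular to it.

    @prefix :     <http://example.org/people#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    :Eye       a owl:Class .
    :hasPart   a owl:ObjectProperty .
    :hasColour a owl:DatatypeProperty .

    # Class level: every Person has some Eye (common to all members).
    :Person a owl:Class ;
        rdfs:subClassOf [ a owl:Restriction ;
                          owl:onProperty :hasPart ;
                          owl:someValuesFrom :Eye ] .

    # Instance level: facts true only of this individual.
    :alice a :Person ;
        :hasPart :alicesEyes .
    :alicesEyes a :Eye ;
        :hasColour "green" .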

Datatypes are a way of assigning type to the value that is contained within a string. The reason for this is that ontologies are stored in text form and all data relations are captured as strings with datatypes appended to the end. Datatype definitions allow types such as float, int, string, date, etc. to be interpreted by a program. An example of a datatype would be "12"^^xsd:int, where the xsd:int tells an application that the string preceding it is to be interpreted as an integer type. RDF, OWL, and XSD are XML schemas that allow users to import and use default types commonly used across ontologies and understood by applications such as Protege [46]. These default types allow for interoperation between ontologies and other software programs that can interpret the types and constraints imposed by these schemas.
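A few typed literals in Turtle syntax (the relation names are hypothetical):

    @prefix :    <http://example.org/readings#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # The lexical form is always a string; the appended datatype tells an
    # application how to interpret it.
    :sample1 :count       "12"^^xsd:int ;
             :temperature "21.4"^^xsd:float ;
             :takenOn     "2017-11-01"^^xsd:date ;
             :label       "morning sample" .   # untyped literal, read as a string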

Annotations are the method used for documenting ontology components. Annotations allow for textual descriptions of ontology components so that designers, users, and maintainers of an ontology can understand the purpose and use of each component and why it exists. Annotations are in the form of a relation type which links a string of some format.

OWL/RDF is an example of a language that an ontology can be expressed in [24]. XML is the typical format that is used to define OWL/RDF, which is a language that defines standard relations, classes, and the triple format of subject, predicate, object (i.e. class, relation, (instance/data)). RDF defines base relations and data types while OWL imposes base classes and relations that enable reasoning capability. For example, all OWL class definitions are a derivation of owl:Thing meaning owl:Thing is the base level class. OWL/RDF is a standard language and is the language used by DCO and its derivatives.
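A minimal OWL/RDF document in Turtle syntax makes the triple pattern and the implicit owl:Thing root visible (the class names are hypothetical):

    @prefix :     <http://example.org/dco-sketch#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    <http://example.org/dco-sketch> a owl:Ontology .

    # Each statement is a triple: subject, predicate, object.
    :DataCollection a owl:Class .
    :Survey a owl:Class ;
        rdfs:subClassOf :DataCollection .
    # Both classes are implicitly subclasses of owl:Thing, the base level class.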

Competency Questions are commonly used [15, 26, 29] to define the scope and goals an ontology sets out to accomplish. Competency questions can change over time as ontology purpose is realized and iterated upon. Competency questions do not define a single ontology but rather impose constraints on what the ontology can be; anyone could develop their own ontology against a set of competency questions. Competency questions allow both users and developers to establish an ontology's functionality and ideals about how to accomplish a particular task. Competency questions are typically written in an informal fashion but they must be able to be represented or answered using the ontology's axioms and definitions [26]. An example competency question comes from a project that seeks to design an ontology that models the Brazilian Family Health Program [28]: "What is the epidemiological profile of patients in a given region, i.e. age range, gender and ICD (International Classification of Diseases)?" [28]. This question seeks to establish that the ontology can answer a high level question that is pertinent to the domain it models. In this case, since we are tracking people's health, the ontology must contain information regarding its patients, which the competency question above seeks to establish.

2.2 Classification Research

Ontological research has become widespread in the design of information systems. Recently the desire for ontologies to span and integrate different views of a domain and even across domains has come to fruition [36]. The development of these ontologies provides the opportunity for systems to integrate and become interoperable allowing for information sharing [32] [36]. In this case an ontology acts as a bridge between systems unifying information [32] and allowing systems to communicate through the ontology using their standard language and message passing techniques. Unifying data allows the key components of one or more domains to be captured and shared among ontologies that further define a particular domain. The ability for an ontology to capture a particular domain is related to its viewpoint of the world. Each ontology imposes a particular view that defines its ability to share information; this viewpoint is therefore what we are concerned with.

Furthermore, each ontology exhibits its own degree of formality through how it views the world and its domain [32]. These views can allow us to classify ontologies by the generality of their view. This world view becomes important when we look at defining an ontology as generic where we would say its definition allows it to span across a wide variety of domains and thus integrate systems with different domains and views of the world. Ontologies can be broadly classified based on their formalism and world view allowing us to narrow down suitable designs for our purposes.

2.2.1 Classifying Ontologies

In this section we seek the broad classifications that are given to ontologies based on their perspective of defining the world. These categories may be further broken down but for our purposes of searching for generality at a high level it will be suitable.

Domain level ontologies are ontologies that seek to capture a shared conceptualization of a particular domain. These ontologies contain domain specific terms and may only be linked to a specific application [43]. Domain ontologies are important in that they describe

the type of data to capture but they assume a particular domain to capture data from. A domain level ontology, however, could represent the end product for a system reusing an ontology.

A local or application ontology is a specialization of the domain ontology and represents data from the viewpoint of one user or developer [43]. This differs from a domain ontology that seeks to capture the shared view of a particular domain that may vary depending on particular users [43]. An application ontology can therefore have very specific term definitions within the ontology; in other words, the ontology's correctness is based upon application features or requirements. That requirement differs from a domain ontology where the correctness is based on whether it captures all views and ideas within a particular domain. Again, an application ontology may be the result of using a generic data collection ontology in a larger system but does not provide reusable terms for the base of an ontology.

A core ontology is linked to a particular domain but has the advantage of providing several viewpoints relating to different user groups [43]. Core ontologies are often the result of several domain level ontologies mapped together [43]. Core level ontologies represent a higher level of term generality as they seek to span and provide definitions for a wider domain or domains. From the problem perspective, core level ontologies are certainly closer but still maintain the requirement of domain specific content within them; thus, they cannot generically be applied to any domain.

Foundational or upper level ontologies seek to provide definitions and terms that are general to all domains [35] [19]. The first point is that they serve as a building block for future ontologies by enabling reuse, since they define common terms that will be contained by domain level ontologies [40]. The goal of an upper level ontology is to avoid the redefinition of common terms to allow for easier and consistent reuse of defined terms. In other words, they provide a single agreed upon definition of terms [36] [43]. More important, however, is the fact that they are designed to support all domains, which differs from core or domain ontologies that only define terms for their particular domain, likely choosing specific (overloaded) definitions over general definitions [43].

Upper level ontologies are also commonly used for other tasks including merging domain level ontologies [36] where their global terminology can be used to match differing terms between source ontologies. By doing so it allows the translation and merging of two or more domain ontologies to create a core ontology, or to link separate but related domains [36]. This is common to work in systems that are built to interoperate with others and is one of the key usages for upper level ontologies [36].

A mid level ontology seeks to provide a bridge between an upper ontology and a domain level ontology by providing terms that will be common to several domain level ontologies or areas of a domain level ontology [22]. Mid level ontologies serve a similar purpose to the upper level ontology by preventing term redefinition and providing consistent relationships but at a more specific level. This has several advantages in addition to avoiding redefinition. Firstly, it provides a common understanding between derived ontologies through similar terms, structures and relations. Secondly, it provides a more streamlined starting place for those new to the construction of ontologies by providing terms more closely related to their domain than that of upper level ontologies. In terms of an ontology category hierarchy the mid level ontology falls in the middle with domain level ontologies extending mid level ontologies and mid level ontologies extending upper level ontologies.

To summarize the classifications, we can order ontology classifications by viewpoint as in Figure 2.1, which allows one to see that upper level ontologies are a general source of reuse for all ontologies. They contain terms at such a high level that every ontology could start by subclassing those terms to create their ontology. However, at the other end of the spectrum, domain and application level ontologies create a world view that only extends to whatever domain or view of a domain is necessary for an application. These are the ontologies that we expect to derive from a generic data collection ontology, with the benefit being that the ontology will implicitly have a higher level view because it will subclass terms within an existing reused hierarchy.

Going back to our definition of ontology (see Chapter 2) we can see that upper level ontologies satisfy the condition of avoiding the definition of specific terms where general terms can be used. Additionally, they fit well with the ability to reuse and bridge other ontologies, making upper level ontologies the only type that satisfies the requirement for reuse; they will therefore work well as a base for generic data collection.

Figure 2.1: The Ontology Hierarchy demonstrates how term specificity and structural design affect use case. Ordered from most to least general, the hierarchy contains:

• Upper Level Ontologies: general terms meant for all ontologies to be based on; they enforce structure at a high level.

• Mid Level Ontologies: bridge the gap between highly generic upper level terms and the terms of a particular domain.

• Core Ontologies: define multiple viewpoints or multiple domains.

• Domain Level Ontologies: define the view a particular domain has of the world.

• Application Ontologies: local or application ontologies define a view specific to a particular application.

A substantial amount of work has been done in the area of upper level ontologies to produce a solution that could become the basis for all ontologies. These developments have prompted authors such as Mascardi et al. to perform a comparison of 7 upper level ontologies [35] in an effort to analyze the characteristics amongst them. This number is substantial when one considers that one of the goals of upper level ontologies is to unify all ontologies with a common core [32], [18]. This demonstrates that despite the work that has been put into developing upper level ontologies we have not reached an agreed upon design that satisfies the needs of everyone, and that their development is ongoing research [32].

One might then wonder why the work put into the development of upper level ontologies has not resulted in a common ontology that is shared among all domain level ontologies. One particular reason for this is language implementation. Languages implemented by computer scientists are based on set theory, which captures abstract content well but does not capture concrete objects and their relationships well enough to be completely generic [27]. More recently the author of the General Formal Ontology (GFO) stated that we may not be able to meet such a lofty goal at all [32]. However, for the problem's purposes this is still acceptable since one must work with what is available. With that in mind we will focus on available upper level ontologies to seek a design that best meets the requirements of our problem. All of the ontologies are sourced from Mascardi et al. [35] due to it capturing recent and active implementations.

To describe such needs we will define criteria that will help us to draw conclusions based on upper level ontology design, purpose, and applications. In developing these criteria we seek the ontology that has the smallest delta from the ideal upper level ontology, based on how we have defined an upper level ontology, our definition of ontology, and our problem. The first criterion is based on the number of terms and relations in the ontology, where we prefer few of each for two main reasons. Firstly, upper level ontologies are meant to be derived into a domain level ontology and thus will have more terms and relations added over time, and large ontologies introduce performance penalties, potentially resulting in an ontology that is intractable for a reasoner [34]. Secondly, in terms of understandability, the fewer terms a person must know to use an ontology the easier it is to get started. It will also reduce reliance on documentation and expert knowledge, making it easier to design and organize derived ontologies. Furthermore, large ontologies may deter usage of the ontology altogether.

The second criterion we care about is usage and popularity. Popularity of an upper level ontology is important when considering its purpose of unifying ontologies [32]. We want to look at what people are using to see what is working and how many domains are being captured by the upper level ontology. If only one domain is using a particular ontology it is possible that it has not met the needs of others. Additionally, greater popularity increases the likelihood that ontology developers will have experience with the ontology.

Finally, we move on to a more formal requirement for upper level ontologies, used for the purpose of ensuring that the base is kept generic, again to satisfy our definition. Thus we say an upper level ontology must be free from any domain specific terms or relations. We are not interested in ontologies that take the role of defining thousands of terms to satisfy a large number of domains, since it is unlikely such an ontology could realistically satisfy each domain.

2.2.2 Upper Level Ontologies

Based on our criteria we will examine several upper level ontologies to see how they suit the desired characteristics defined above. When choosing ontologies we sought only ontologies in active development and only open source ontologies; we left out other major ones that appear to be no longer active. This is because the development of upper level ontologies is considered a long term research effort that requires continued progress [35]. The selected ontologies include the Basic Formal Ontology (BFO), the General Formal Ontology (GFO), a Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), and the Suggested Upper Merged Ontology (SUMO). We will start by introducing each ontology, its purpose, design decisions and use cases. We will then move on to a comparison and the application of our defined criteria.

BFO is an upper level ontology with development starting in 1998 [35]. It currently consists of 35 classes in version 2 [7]. BFO is commonly applied in the biology domain but does exist in a number of other domains and is used in over 150 ontologies as of this writing [12]. BFO itself contains no domain specific content and focuses on describing objects through time and space which is common to all physical objects. It considers both abstract and concrete terms and seeks to define terms based on their lifespan as either occurrent or continuant, with occurrent defining objects that exist during a period of time while continuants exist throughout time [12].

GFO started development during the definition of the GOL language [27, 35] and continued with a more recent paper by Herre [32]. GFO separates its entities based on whether they will be instantiable or non-instantiable. In addition, GFO focuses on class axioms that dictate where in the ontology derived entities will go, with GFO supporting 212 logical axioms [5] to BFO's 52 [7]. GFO contains 45 classes in version 1.0 [5]. Like BFO, GFO defines both time and space and allows objects to have different relationships depending on being abstract or concrete [5].

SUMO started development prior to 2001 with a paper by Niles and Pease [39]. SUMO in its base ontological form currently consists of 4558 classes and is based around the use of modules. With all of its domain extensions SUMO currently consists of approximately 25000 classes [11]. SUMO takes the approach of combining a number of domain specific ontologies to provide an upper level ontology capable of supporting any domain [39] [40]. Unlike GFO and BFO, SUMO defines its terms using concrete terms and does not make use of abstract definitions of time and space to group terms as abstract or concrete.

DOLCE started development in 2002 as part of the WonderWeb project but has since separated from the project [35]. DOLCE divides terms into abstract items and entities [9] while defining time and space for terms. DOLCE Lite consists of 37 classes and 349 logical axioms [9]. Like SUMO, DOLCE is divided into sub modules that allow for additional terms [35].

Comparison of Upper Level Ontologies

Each of the discussed upper level ontologies has its own merits which will be examined in this section. Once the merits are discussed we will look at how the ontologies fit with our definition of an upper level ontology.

BFO and GFO are generally very similar to each other in their size and definition type. They both focus on defining abstract and concrete (instance) type objects as well as the world they live in. The GFO's concept of Abstract roughly matches that of the BFO's generically dependent continuant but without the restriction of time [5, 7]. BFO takes the approach of dividing based on time whereas GFO divides on instantiation. It appears that due to similar high level content both ontologies could serve as high level designs.

SUMO and DOLCE are similar in that they focus on more common sense type terms without the abstract terms of GFO and BFO. Additionally, they both use the concept of modules, which enables them to be expanded with greater domain content. This provides the benefit of quickly pulling in several domain modules. These domain modules can serve the purpose of defining a core level ontology. The disadvantage of both SUMO and DOLCE is that they require significant size to include domain level content since they take a less abstract approach to the grouping of terms [35].

Table 2.1: The evaluated upper level ontology implementations summarized.

Name: BFO. Year created: 1998. Classes: 35. Axioms: 52.
Purpose: Designed to provide needed definitions for all ontologies. BFO works based on the idea of dividing elements by their existence in timespace. BFO does not contain domain specific content but is most often implemented in the biology domain.

Name: GFO. Year created: 2010. Classes: 45. Axioms: 212.
Purpose: Designed to provide needed definitions for all ontologies. GFO works based on dividing elements by whether they are instantiated or not. GFO does not contain domain specific content but is currently lacking implementations using it.

Name: SUMO. Year created: 2001. Classes: 4558. Axioms: N/A.
Purpose: Designed around the idea of separation into modules based on ontology purpose and therefore does contain domain specific content. SUMO is a merged ontology, the result of merging non-commercial freely available terms to form an upper level ontology.

Name: DOLCE Lite. Year created: 2002. Classes: 37. Axioms: 349.
Purpose: Designed for linguistics, DOLCE organizes by dividing terms into abstract items and entities while also defining time and space. Similar to SUMO, DOLCE also provides several sub modules and, based on its design for linguistics, does contain domain specific content.

Moving on to our specified criteria, we will examine the ontologies to see which best suit our purposes. One of the first requirements we determined was that the size of the ontology was significant since it would be built upon. In this case both DOLCE and SUMO present the issue of not using abstract terms for their definitions. The base size of DOLCE is only 37 classes but modules quickly add up to over 100 terms [35]. SUMO starts out significantly larger at 4558 classes [11] and rises to 25000 with all modules added [11]. Compared to GFO's 45 and BFO's 35, it is easy to see that they are at least an order of magnitude larger.

The next criterion requires no domain specific content. DOLCE includes a mathematical set definition in the ontology [9] which creates ties to the math domain. A mathematical set is unlikely to be used in an ontology that describes business practices or the parts of a body, so in these cases such a term provides no benefit and is out of place in such an ontology. SUMO has many domain specific terms and is based around domain specific modules to build up its capabilities as an upper level ontology [40].

We start to see a trend while looking at SUMO and DOLCE: because they are more interested in concrete rather than abstract categories, they accumulate a large number of terms and begin to rely on domain specific definitions to develop their ontologies. For our purposes this is undesirable since we are seeking to build an ontology that is agnostic to any domain. DOLCE and SUMO violate this constraint in their design.

BFO and GFO do not define any domain specific content, focusing instead on time and space, which are required by all domains and all concrete entities we might want to define in the real world [7] [5]. For our purposes these ontologies work well since they are compact and provide only abstract, domain independent definitions, which ensures that we start with only the essentials.

The last measure to look at is popularity and usage, and for this we move to resources on the web to see how many ontologies cite themselves as using each example here. BioPortal cites no known projects using GFO based on their database [5], while it cites 22 usages for BFO [2]. The BFO web page cites well over 150 ontologies or projects using BFO [12]. Importantly, these projects stem from more than just the biological field. The GFO site only presents GFO-BIO and biology ontologies as users of GFO [6]. Therefore, based on what has been located on the published websites, it appears BFO is more popular and diverse in its implementations. In terms of SUMO, their website notes 7 projects using SUMO [10]. The number of users of DOLCE could not be found as a direct number, but based on papers and projects it appears to be relatively popular based on a web search.

Based on the criteria above BFO fares best, which is why we will focus on discussing BFO and why it best fits our needs. The Basic Formal Ontology is in its second major version; we will therefore focus on that version in discussion of terms and structure, although the first version is quite similar [1].

2.3 Problem Placement of Generic Data Collection

Upper level ontologies provide a starting point for the creation of an ontology but do not give direction about the specificity of ontological terms (see Section 2.2.1). We propose a mid level ontology as the design target for the problem and in this section discuss why that choice best reflects the problem and our definition of ontology, and works with the chosen upper level ontology. We will start by examining how the problem fits with this design as well as discussing downsides to the design.

2.3.1 Mid Level Ontologies as Placement

When looking for the class of ontology that best represents our problem we must assume that it satisfies the following restrictions: support BFO as a base ontology; enable reuse to develop further ontologies; and impose no highly specific domain dependent terms. This allows the re-examination of the classification hierarchy in Figure 2.1 to determine that mid level ontologies best serve our purpose. In this section we will examine how and why a mid level ontology will best serve our purposes.

Mid level ontologies seek to define a domain at a very high level and span multiple ontologies. They are therefore generally independent and define terms at a high level to avoid conflict with ontologies that will extend them. This fits well with our problem since it is expected that the solution will form the basis of a domain ontology but not be exhaustive in term definition. Second, in terms of our definition, we seek to define terms as generically as possible while also avoiding the redesign of existing ontologies and the redefinition of terms. Mid level ontologies allow reuse by having the upper level ontology define their basic structure.

Another important part of mid level ontology compatibility is exposed when looking at the source ontology. It was noted that SUMO and DOLCE already have modules that are used for domain specific ontologies. In a way these modules are somewhat like mid level ontologies in that they fill a gap, but they differ in that they are at the domain level in many cases. BFO, however, does support domain ontologies through its existing ontological framework and the repositories that are built around it. The OBO Foundry provides the framework and existing ontologies that have been developed using BFO and demonstrates existing mid level ontologies that are active, see Figure 2.3. This demonstrates merit to the proposed pattern as it provides concrete examples functioning with the Basic Formal Ontology [48]. Furthermore, the Basic Formal Ontology does not have a derived ontology that exists for our particular problem (data collection), demonstrating a gap in existing mid level ontologies where our proposed solution could fit. The design of BFO has taken mid level ontologies into consideration, with working examples of mid level ontologies and domain level ontologies utilizing those mid level ontologies [48].

Figure 2.2: An example of how domain level ontologies can utilize one or more mid level ontologies to extend capability while reusing existing terms. The figure shows three ontological levels: at the upper level, the Basic Formal Ontology (BFO); at the mid level, the Data Collection Ontology (DCO) alongside other mid level ontologies; and at the domain level, domain ontologies based on DCO, based on DCO together with other mid level ontologies, or not based on DCO at all.

2.4 Conclusion

In this chapter we have examined ontology classifications and determined that upper level ontologies provide domain independent designs and focus on describing the world at the highest level; based on our definition of ontology they are valuable in terms of reuse. We therefore began looking for particular designs since it was noted that there are several implementations. It was determined that due to its simplicity, popularity, flexibility, and focus on high level terms, the BFO was the best choice of the examined upper level ontologies for our problem. Finally, we looked at BFO in terms of a mid level extension and found that several implementations currently exist, demonstrating that the pattern has merit and that domain level ontologies already exist that make use of these designs.

It should be noted that there are implications of these designs, starting with the fact that upper level ontologies are incomplete at best and inaccurate at worst, with current implementations requiring versioning [1] that alters the design. This means that any mid level ontology could also be invalidated by a major design change. The second major issue is the size of the ontology: while size was one of our comparison criteria when choosing an upper level ontology, the additional levels still introduce more terms than ontologies that focus exclusively on domain level terms. We feel these issues are worthwhile trade offs in exchange for reusability, more standardized design among ontologies, and easier to understand terminology, with domain specific terms subclassed under more generic parent terms.

Figure 2.3: The BFO Hierarchy demonstrates the use of different ontological levels (as defined within the Ontology Hierarchy, see Figure 2.1) by BFO developers, making it suitable for mid level construction. (The figure shows BFO at the upper level; the Information Artifact Ontology (IAO), the Spatial Ontology (BSPO), and the Ontology for Biomedical Investigation (OBI) at the mid level; and domain ontologies beneath them, including the Infectious Disease Ontology (IDO), Anatomy Ontology, Environmental Ontology, Cellular Component Ontology, Cell Ontology, Phenotypic Quality Ontology (PQO), Biological Process Ontology, Subcellular Anatomy Ontology, Sequence Ontology, Molecular Function Ontology, and Protein Ontology.)

Chapter 3

Ontology Design

This chapter is dedicated to an overview of the Data Collection Ontology (DCO): its components, relations, and the design choices that make it suitable for data collection. The DCO is designed as a mid level ontology that extends the Basic Formal Ontology (BFO) to organize and provide placement for data collection terms, offering domain independent definitions as a starting place for domain data collection developers. Because the DCO is a mid level ontology in its early design, it is by no means finished and is expected to change over time; like an upper level ontology, it may be found to be incorrect or lacking and will need to be updated.

We start by examining the BFO more in depth and then establish how the DCO is expected to be used by domain ontology developers. We then move on to discussing the components and relations of the ontology to understand why particular components exist and how they contribute to the intended use of the ontology.

3.1 BFO Discussion

The Basic Formal Ontology provides several key advantages, starting with its focus on organizing terms according to their mode of existence in the world. BFO bases its concepts around objects that exist in time and space and those that do not, see Figure 3.1. BFO defines Occurrents as terms that exist within time and space, in that they occur in a particular time period and/or take up space in the world. The second type is Continuants, which do not exist in time and space. This applies to any domain, as BFO focuses on the world at a very high level and takes into consideration both ideas and objects, allowing the designer to separate the two.

Table 3.1: The top level classes of the Basic Formal Ontology (BFO) that divide its major elements.

Occurrent - Entities that exist in some form of time and space, i.e. an object that lives and dies and consumes some physical space.
Continuant - Entities that exist outside of time and space, i.e. a concept or idea in someone's mind.

BFO also provides a relatively small number of terms, which is important from the developer's perspective: a large number of terms requires greater understanding to know where derived terms should go or whether they already exist. Furthermore, considering that the ontology will be further derived to create a domain ontology, a smaller starting size helps developers keep the result small relative to other domain level ontologies. Another, perhaps more important, benefit of small size concerns reasoning performance, which becomes a greater factor as an ontology grows because high term and instance counts prove intractable for current reasoners [34]; this would remove capability from the proposed system and from derived ontologies.

3.2 Design Intentions

The design embodies a philosophy about how data collection should be performed, and does so at a high level, allowing more specific work flows to be integrated through defining data collection processes.

Figure 3.1: The Basic Formal Ontology's class structure in its OWL implementation. All terms under owl:Thing are defined in BFO as an OWL Class.

This view is based on first describing what you are collecting; these are Subjects that represent a timeless view of your object. The DCO uses the BFO's independent continuants to define Subjects that describe objects as they are in concept, not as instances that exist in time and space. Within DCO, captured data are represented as instances and have the type of the Subject they are a data point of. DCO also includes processes to capture how data is collected and what stages it goes through, supporting uses such as surveys or cyclic forms of collection. Additionally, DCO places stress on types and units through the definition of Datums that capture both measures and units of measure, ensuring all values are labelled appropriately. The final portion of the ontology is Classifiers, which are defined as entities in the BFO structure: since their function is to classify any type, it was felt they should not be restricted to time or space.

Classifiers are hierarchies of terms on which one defines equivalence relations specifying what constitutes membership in a particular class. Classifiers are designed around suspicions or anecdotal estimates of the range one expects data to fall into. They are populated with instances, which exist as individuals of any type in the ontology. These instances are then grouped by the reasoner and can be queried to determine whether they are of the expected type when entered into the ontology; in other words, Classifiers allow validation of the estimates or anecdotal data that one has. Classifiers provide an additional advantage concerning data validity in that they are non-destructive, whereas traditional approaches may place strict boundaries on collected data, removing instances that do not fit. Classifiers allow invalid data to be filtered but not permanently removed if an error in classification is detected while the datum itself is valid. This supposes a dynamic aspect of the ontology: over time it will be shaped by the instances it collects, and therefore its definitions will be challenged.
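To make this concrete, below is a minimal sketch of the pattern using Owlready2, the OWL library used later in this thesis for ontology generation [8]. The ontology IRI, the Vehicle and Classifier classes, the has_curb_weight property, and the weight bounds are all illustrative assumptions rather than actual DCO identifiers.

    from owlready2 import (get_ontology, Thing, DataProperty,
                           ConstrainedDatatype, sync_reasoner)

    onto = get_ontology("http://example.org/dco-demo.owl")

    with onto:
        class Vehicle(Thing): pass           # stands in for a DCO Subject
        class Classifier(Thing): pass        # stands in for the classifier root
        class has_curb_weight(DataProperty): # hypothetical measurement link
            range = [float]

        class CompactVehicle(Classifier):
            # Equivalence relation: any Vehicle whose curb weight falls in
            # the estimated range is reasoned to be a CompactVehicle.
            equivalent_to = [Vehicle & has_curb_weight.some(
                ConstrainedDatatype(float, min_inclusive=1000.0,
                                    max_exclusive=1400.0))]

    corolla = onto.Vehicle("corolla")
    corolla.has_curb_weight = [1310.0]

    sync_reasoner()        # runs the bundled HermiT reasoner (requires Java)
    print(corolla.is_a)    # now also includes CompactVehicle

If the reasoner later groups an instance differently than expected, either the instance or the classifier's bounds are revisited, matching the non-destructive behaviour described above.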

Figure 3.2: An example of classifiers that demonstrates the use of equivalency relations defined on the Classifier classes to reason the type of instances. This type can then be compared to the has expected type relation to determine consistency. (The figure shows Compact, Midsize, and Large Vehicle classifiers, with incoming data instances such as a Chevrolet Camaro, Ford Edge, Ford F-150, Chevrolet Impala, and Lexus ES350 grouped by reasoned relationships.)

3.2.1 Classifiers Examined

Classifiers are an important concept, so we examine them in further detail to clarify their purpose and use cases. Classifiers, in conjunction with the has expected type relation, allow two types of inconsistency to be detected in the ontology. Classifiers are defined as classes using OWL equivalence relations to restrict instances (illustrated in Figure 3.2). The first type we consider is a datum inconsistency, which occurs when the ontology has not classified your datum into the category you expected because it does not belong in that category, and the ontology has marked it as such; in other words, something about that datum is invalid. The second type is a world view inconsistency, which occurs when there is a mismatch but you know your datum is accurate; in this case the ontology must be updated, since its world view is inconsistent with actual collected data. In both cases the error is caught by finding instances whose has expected type relations do not match the reasoner assigned type or the rdf:type. We will now go over an example of where this can be integrated into an external system.

Suppose we are creating an ontology to classify cats based on their age, weight, and breed. We have some estimates of these ranges for each breed and create Classifiers based on them. However, since they are estimates, they may not be accurate. In this case the ontology may be used as a front end for vet clinic software, presenting an error if a cat's weight falls outside of a range based on its age and breed using the Classifiers. Since we know the Classifiers may not be accurate, users can override them and have the system alter the ontology to adjust ranges. With such a system, Classifiers allow both error detection (via mismatched classified values) and a dynamic ontology that re-captures Classifier ranges to stay consistent with its domain.

A second case using the same example would be creating a dataset of only valid data and having an external system adjust the ontology's classifications each time an inconsistency arises. In other words, the ontology learns the correct boundaries for given classifications.

Classifiers are designed to be used as part of a larger system to gain understanding of the data being captured. Considering the examples above, they can act as an error detection system while also letting the system automatically update the ontology when its world view is inconsistent. Furthermore, one could determine the distribution of a Classifier category by importing its instances, using has expected type to link to a data type in the external system, and running a statistics package on the datums. Therefore, Classifiers used along with has expected type provide the means to link ontology instances to an external system, making them a key component of the ontology.
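As a sketch of that linkage, the check below renders has expected type as a simple string-valued data property holding the expected classifier IRI (one possible reading of the DCO relation, not its verbatim definition) and reports every individual whose reasoned types omit it.

    from owlready2 import get_ontology, Thing, DataProperty

    onto = get_ontology("http://example.org/dco-demo.owl")

    with onto:
        class has_expected_type(DataProperty):  # hypothetical rendering
            range = [str]                       # stores a classifier IRI

    def inconsistent_instances(ontology):
        """Individuals whose reasoned types (after sync_reasoner()) do not
        include the classifier named by their has_expected_type value."""
        mismatched = []
        for ind in ontology.individuals():
            expected = set(ind.has_expected_type)
            reasoned = {c.iri for c in ind.is_a if hasattr(c, "iri")}
            if expected and expected - reasoned:
                mismatched.append((ind, expected - reasoned))
        return mismatched

Whether each hit is a datum inconsistency or a world view inconsistency is then a decision for the surrounding system, as in the vet clinic example.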

3.3 Competency Questions Defined

Here we will define the competency questions (see Section 2.1) used when designing the ontology; their creation was based on the design principles and goals. Each question belongs to one of four categories based on the type of question. These categories are defined as: Capability, a goal the ontology must accomplish or support in its design; Reasoning, a goal that centres around the ontology's ability to reason in some aspect; Counting, a query that can return the result of some aggregation of instances; and Selection, a query that returns instances based on some condition. The competency questions are defined in Table 3.2, and a sketch of answering one such Selection query follows the table.

Table 3.2: DCO Competency Questions to define the uses expected for the DCO implementation.

1. Capability - Can construct a process based on blocked and unblocked flows, allowing support for concurrency in processes?
2. Capability - Has the ability to link complex classes to a similar type in an external system?
3. Capability - Has the ability to apply universal time of any format across the ontology?
4. Capability - Has the ability to assign qualities to data types?
5. Capability - Has the ability to assign units of measure to captured data?
6. Reasoning - Has the ability to re-classify existing data?
7. Reasoning - Has the ability to assign expected types to individuals, allowing automatic classification using the reasoner?
8. Counting - What is the amount of captured aggregates?
9. Selection - Has the ability to query based on data type and by data structure?
10. Selection - Has the ability to query based on time?
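As an illustration, the Selection questions reduce to simple filters once instances are loaded. The sketch below answers question 10 under the assumption that time is stored as epoch style integers on a hypothetical has_time_value property.

    from owlready2 import get_ontology, Thing, DataProperty

    onto = get_ontology("http://example.org/dco-demo.owl")

    with onto:
        class has_time_value(DataProperty):
            range = [int]          # e.g. seconds since an agreed epoch

    def captured_between(ontology, start, end):
        """Competency question 10: select individuals whose time value
        falls within [start, end]."""
        return [ind for ind in ontology.individuals()
                if any(start <= t <= end for t in ind.has_time_value)]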

3.4 Ontology Components

With the high level view of DCO established, we can break it down into its main components: Subjects, Processes, Data Qualities, Classifiers, and Meta Data. In this section each of these terms is defined with examples and use cases. We can then move on to using the components with the defined object and data properties to form a working example of the DCO. The working example seeks to establish what a result of the DCO will look like, as its components are declared at a high level, like other mid level ontologies.

3.4.1 Classes

Here we will cover the major classes in the ontology, though not exhaustively with all subclassed types; for a visual representation of the classes in DCO, see Figure 3.6.

Subjects represent what data is being collected from or about. Subjects can be either physical objects or concepts, meaning types can be either material or immaterial. Subjects are designated as Continuant, meaning they represent subjects at a universal level, leaving out anomalies of particular individuals. For example, if we are surveying people then the Subject may be Person, which defines a person at a universal level. When instances are entered into the ontology they may have a relationship with the Person Subject, i.e. part to Person, but are themselves Occurrent and do exist in space and time. The person instance is where specific attributes are captured, such as height, weight, or name.

Processes fall under the BFO's process definition with some extensions provided by DCO for convenience. DCO's processes allow one to support both state driven and independent processes. State driven processes require one process block to finish before another can continue, while independent processes can have any number of process blocks running concurrently. An example of the process types can be seen in Figure 3.3, and a short sketch of modelling process blocks follows below.
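A minimal sketch of the state driven case, with ProcessBlock, branches_to, and has_sequence as our hypothetical renderings of the DCO terms:

    from owlready2 import get_ontology, Thing, ObjectProperty, DataProperty

    onto = get_ontology("http://example.org/dco-process-demo.owl")

    with onto:
        class ProcessBlock(Thing): pass
        class branches_to(ObjectProperty):   # ordering between blocks
            domain = [ProcessBlock]
            range = [ProcessBlock]
        class has_sequence(DataProperty):
            range = [int]

    intake = onto.ProcessBlock("intake")
    intake.has_sequence = [1]
    measure = onto.ProcessBlock("measure")
    measure.has_sequence = [2]
    intake.branches_to = [measure]   # state driven: measure waits for intake

An independent process would simply omit the blocking branches_to links between blocks that may run concurrently.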

Classifiers are where equivalence relations are defined to classify instances in your ontology; they define the range data is expected to fall into to form a particular category or type. One may think of a Classifier as having the ontology assign a type to an individual based on its understanding of the world. Classifiers can be thought of as the dynamic component of the ontology, as they are designed to change if data proves them invalid; alternatively, individuals may change if they are proven invalid based on the ontology's view of the world. For an example of how classifiers are represented, see Figure 3.4.

Meta Data are descriptors that define a data point or complex structure that one expects an individual to contain. Meta data describes the types and units, allowing for data in multiple formats and multiple units while linking to the same individual type.

Figure 3.3: DCO Process Types. State driven processes are blocking and require one process block to finish before the next starts; this is the most common form of process, where block N depends on block N-1. Independent processes act like several sequential processes running in parallel, allowing process blocks to overlap; any number can overlap as there are no direct restrictions within the DCO.

An example case would be a study conducted across North America, where Canada predominantly uses the metric system while the United States uses the Imperial system. Data could be captured for the same study using different Meta Data classes to describe the units, while declaring Classifiers that capture both formats.
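A sketch of that pattern follows, with illustrative class and property names standing in for the DCO datum and unit terms:

    from owlready2 import get_ontology, Thing, ObjectProperty, DataProperty

    onto = get_ontology("http://example.org/dco-units-demo.owl")

    with onto:
        class MeasurementUnit(Thing): pass
        class MassDatum(Thing): pass
        class has_measurement_value(DataProperty):
            range = [float]
        class has_measurement_unit(ObjectProperty):
            domain = [MassDatum]
            range = [MeasurementUnit]

    kilograms = onto.MeasurementUnit("kilograms")
    pounds = onto.MeasurementUnit("pounds")

    # The same study records the same quantity in two unit systems.
    w_ca = onto.MassDatum("weight_ca")
    w_ca.has_measurement_value = [70.5]
    w_ca.has_measurement_unit = [kilograms]   # Canadian record, metric

    w_us = onto.MassDatum("weight_us")
    w_us.has_measurement_value = [155.4]
    w_us.has_measurement_unit = [pounds]      # American record, imperial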

Data Qualities exist to define restrictions and set theory properties on instances that can be used as part of classifiers to group instances or as a part of a larger system. Examples of Data Qualities include boundedness, cardinality, and equality.

3.4.2 Relations

In this section we will examine the relations defined in the DCO, and go over how they are used and the reason for their existence. The DCO defines several object and data type relations that are meant to be extended and added to, therefore these relations are by no means exhaustive but should provide good coverage for most ontologies. All data relations can be seen in Figure 3.9.

Figure 3.4: DCO Classifier Type. Classifiers utilize the has expected type relation, compared against the type assigned by the OWL reasoner through equivalency relations on the classifier, to check consistency. (The figure shows classifiers grouping instances based on equivalency relations using a reasoner, where has expected type stores the classifier under which one expects an instance to be grouped; a discrepancy between the reasoned type, i.e. the ontology's view of the world, and the expected view highlights an error on either side.)

Object Relations

In this section we discuss the top level object relations, which link individuals to other individuals (see Table 3.3) and are expected to be extended when deriving the DCO in a domain specific implementation. When viewing the tables, a dash (-) reflects a relation that is at the top level.

Table 3.3: DCO Object Relations Summarized.

has part (inverse: part to) - Allows individuals to be composed of other individuals. This is important where data is captured on different parts of a larger item or data is aggregated into a larger sum. Composition should not be thought of only in terms of physical objects having parts.
has measure (inverse: measure to) - Measurements are considered any numerical value one captures and links to an individual. Note that this is an object property, so it forces one to link to some descriptor for the value.
has measurement datum (subclass of has measure; inverse: measurement datum to) - This will be one of the most common properties, as it links measurement datums to individuals so data is annotated with units.
has measurement unit (inverse: measurement unit to) - This provides a link for unit definitions to measurement datums.
has time stamp (inverse: time stamp to) - Links a time value to some measurement datum that contains some time unit, allowing a universal way to save time in an ontology.

Table 3.4: Object Relations Continued.

contains process (child of: has part; inverse: process to) - Links a process to an object; for example, some subject may go through some process that captures data. The data collection may itself be a process and have a relation to a process.
has quality (child of: -; inverse: quality of) - Allows individuals to possess particular qualities or require particular qualities on data being classified.
has object control (child of: -; inverse: object control to) - Used for objects that act as a control; for example, in a process something may be a terminator.
branches to (child of: has object control; inverse: branch of) - Supposes that an object in a process will branch to another object when it has completed, allowing for order to be captured.

Data Relations

In this section key data relations are defined and discussed in Table 3.5. These data properties are designed to cover the basic relations necessary to interact with the defined objects. It is expected that most derivations of DCO will extend these properties with domain specific terms to establish the language used in that particular domain. When viewing the tables, a dash (-) reflects a relation that is at the top level.

Table 3.5: DCO Data Properties Summarized.

has expected property - Denotes what property this value is intended to represent. This is designed primarily for external use, where a value may link with a variable.
has expected type - Denotes what type we expect an instance to be. This is intended to be used in conjunction with Classifiers, allowing ontology verification based on expected types. It is additionally intended to link to external systems where we want an instance to link to a particular type.
has control - Represents data values that act as controls, such as booleans that alter the flow of a process.
can repeat (subclass of has control) - Denotes whether a particular entity, such as a process block, can repeat. Some processes may be cyclical.
has sequence - Denotes a sequence value that may be used to order process blocks or other entities.
has value - The base compositor for values; allows an instance to be composed of particular values.
has coordinate value (subclass of has value) - Used for denoting the location of instances.
has maximum (subclass of has value) - Represents a maximum expected value for an instance to have; good for creating ranges.
has minimum (subclass of has value) - Represents a minimum expected value for an instance to have; good for creating ranges.
has time value (subclass of has value) - Links time values to instances. Note that the format is independent and can be any type based on the ontology design.
has percentage (subclass of has value) - Commonly values are captured as percentages, reflected here.
has measurement value (subclass of has value) - Used to link measurement values to measurement datums.

3.5 Working Example

As an example of how the design works we will construct a very basic ontology around collecting vehicle performance data with a goal of comparing consistency of output figures against other instances of the same type (see Figure 3.5). We will refer to this as the Vehicle Output Ontology. The first Subject of our collection will be Vehicle which is the most generic object. The Vehicle Subject will describe what a vehicle is composed of from the performance perspective as this is the view our ontology has of the world. For example, every Vehicle has an Engine and a Transmission so we will define those as other Subjects since we are interested in these components as they alter a vehicle’s performance substantially. Our last Subjects will be the Make and Model since we need to compare like for like vehicles and therefore need to know who manufactured them.

Now we define the relations between our Subjects. Vehicles are made up of an Engine and a Transmission, so we can use composition to define a Vehicle as having those parts; DCO defines the part of relation, which allows us to produce a composite relationship. Additionally, Models are produced by some Make, and we can consider them part of what the company produces. For our example we use has part to link Vehicle to the Engine and Transmission, subclass part to with an example of relation linking Vehicle to Model, and add produces under has part to state that a manufacturer produces Vehicles and Models.

Moving on to the data we would like to capture, we define measurement datums that capture key performance points for a Vehicle. In this simple example we would like to capture the power and torque that the vehicle produces, so we define some common units. Power is commonly measured in horsepower and kilowatts, while torque is commonly measured in foot pounds and newton metres. These are defined as instances under the Power Unit and Torque Unit measurement units respectively. Finally, we create datums for Power that require a numerical value and some Power Unit, as well as a Torque Datum that requires a numerical value and some Torque Unit. With these measures defined, we say a subclass of Vehicle requires at least one of each measure using the has measurement datum relation.
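A hedged Owlready2 construction of this fragment follows; every identifier is our python-safe stand-in for the terms named above, not the DCO's verbatim vocabulary, and only the power side is shown for brevity.

    from owlready2 import get_ontology, Thing, ObjectProperty

    onto = get_ontology("http://example.org/vehicle-output-demo.owl")

    with onto:
        class Vehicle(Thing): pass       # Subjects
        class Engine(Thing): pass
        class PowerUnit(Thing): pass     # measurement unit class
        class PowerDatum(Thing): pass    # measurement datum class
        class has_part(ObjectProperty): pass
        class has_measurement_datum(ObjectProperty): pass
        class has_unit(ObjectProperty):
            domain = [PowerDatum]
            range = [PowerUnit]

        # Units are instances, as in Figure 3.5.
        horsepower = PowerUnit("horsepower")
        kilowatts = PowerUnit("kilowatts")

        # Restrictions: a Vehicle has an Engine part and at least one
        # power measurement datum.
        Vehicle.is_a.append(has_part.some(Engine))
        Vehicle.is_a.append(has_measurement_datum.min(1, PowerDatum))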

The design can be partially illustrated as seen in Figure 3.5 where datums and units are defined as well as Subjects linking to their respective datums. This is the general structure expected for data that is to be collected on subjects.

Figure 3.5: A partial model of the Vehicle Performance Ontology showing how DCO datums are used and linked to Subjects to specify units on captured instances.


Now, since we are capturing data on vehicle performance, we may define Classifiers based on estimations of what we expect. For this example, let us say we are verifying that Vehicles are within their rated power measurements, so we define Classifiers around manufacturer provided power and torque ratings for a particular vehicle and apply some expected variance to create boundaries. These Classifiers will use range values around the Power and Torque measurement Datums we just defined. This allows us to create Classifiers around a particular Model, using values for the Make and Model as well as ranges for Power and Torque for a particular Engine, to group our Vehicles. Most importantly, when populating the ontology we must use the has expected type relation and link to the corresponding Make and Model Classifier to allow the data and the ontology to be validated. This is done by using the reasoner to add types to our added instances and then querying for the set of instances that are not the same type as the stated has expected type URI value. In other words, this reveals either that there is an error with the Vehicle the data was captured on, or that the ontology has an inaccurate view of what values are valid for that Vehicle. In creating classifiers it is useful to assign a name that relates to what you are capturing, so that has expected type values can be application generated or human generated very easily. For example, in our case we might name a classifier DodgeRam57Auto to denote that we expect this classifier to group all Dodge Ram pickups with the 5.7 litre engine and automatic transmission, making it easy to generate the URI from the data we use to populate an instance.
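The naming convention is easy to mechanize; a small illustration (the function, field values, and IRI base are hypothetical):

    BASE_IRI = "http://example.org/vehicle-output-demo.owl#"

    def classifier_name(make, model, displacement, transmission):
        """Derive a classifier name from the fields used to populate an
        instance, e.g. ('Dodge', 'Ram', 5.7, 'Auto') -> 'DodgeRam57Auto'."""
        return "{}{}{}{}".format(make, model,
                                 str(displacement).replace(".", ""),
                                 transmission)

    # The value stored under has expected type for a matching instance:
    expected_iri = BASE_IRI + classifier_name("Dodge", "Ram", 5.7, "Auto")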

3.6 Design Summary

The design of the DCO is based around an ontology validating and being validated by its own instances through the use of classifiers. It does this by using the has expected type relation to link to the URI of the Classifier class the subject is expected to align with and then using a reasoner to find all instances whose type does not match that of the has expected type relation.

Another key point of the DCO design is its emphasis on the generality of Subjects, which are described by the attributes and measurements that define them within the ontology's view. Instances can then be examples of Subjects using the has subject type relation. This keeps definitions at a high level for the purpose of reusability, where well described Subjects may be linked into other ontologies.

Figure 3.6: The top level class structure of the DCO within the BFO. Classifiers exist at the entity level to support reasoning with any type.

owl:Thing
  bfo:entity
    dco:classifier
    bfo:continuant
    bfo:occurrent

Figure 3.7: The branch of BFO Continuant descendants defined in the DCO. Continuants represent entities that remain the same throughout time.

bfo:continuant
  bfo:generically dependent continuant
    dco:meta data
      dco:measurement datum
        dco:complex measurement datum
        dco:scalar measurement datum
          dco:length measurement datum
          dco:time measurement datum
      dco:measurement unit label
        dco:length unit
        dco:time unit
  bfo:independent continuant
    dco:subject
    bfo:immaterial entity
      bfo:continuant fiat boundary
        bfo:one-dimensional continuant fiat boundary
        bfo:two-dimensional continuant fiat boundary
        bfo:zero-dimensional continuant fiat boundary
      bfo:site
      bfo:spatial region
        bfo:one-dimensional spatial region
        bfo:two-dimensional spatial region
        bfo:three-dimensional spatial region
        bfo:zero-dimensional spatial region
    bfo:material entity
      bfo:fiat object part
      bfo:object
      bfo:object aggregate
  bfo:specifically dependent continuant
    bfo:quality
      dco:relational quality
      dco:datatype property
        dco:boundedness
          dco:bounded
          dco:numerically bounded
          dco:unbounded
        dco:cardinality
          dco:countable
          dco:finite
          dco:uncountable
        dco:equality
          dco:equal
          dco:inequal
        dco:exactness
          dco:exact
          dco:approximate
        dco:numericalness
          dco:non numerical
          dco:numerical
    dco:realizable entity
      dco:disposition
      dco:function
      bfo:role
        dco:data role

Figure 3.8: The branch of BFO Occurrent descendants defined in the DCO. Occurrents represent entities that exist within a period of time.

bfo:occurrent
  bfo:process
    bfo:process profile
    bfo:history
    dco:independent process
    dco:dependent process
      dco:exclusive state driven process
        dco:basic state
        dco:probabilistic state
  bfo:process boundary
    dco:process part
  bfo:spatiotemporal region
  bfo:temporal region
    bfo:one-dimensional temporal region
    bfo:zero-dimensional temporal region

Figure 3.9: The list of DCO relations and their structures broken down by type. Relations are used to link DCO classes and instances.

(a) DCO Object Relations

owl:topObjectProperty
  dco:has data entity
    dco:has measure
      dco:has measurement datum
    dco:has measurement unit
    dco:has time stamp
  dco:has object control
    dco:branches to
  dco:has part
    dco:contains process
  dco:has subject type
  dco:has quality
  dco:location of
  dco:realizes

(b) DCO Data Relations

owl:topDataProperty
  dco:has expected property
  dco:has expected type
  dco:has control
    dco:can repeat
  dco:has sequence
  dco:has value
    dco:has coordinate value
    dco:has measurement value
    dco:has time value
    dco:has maximum
    dco:has minimum
    dco:has percentage

Chapter 4

Evaluation Methodology

Evaluating the Data Collection Ontology requires examining our problem to establish the appropriate criteria and methods. The goal of the DCO is to support domain level ontologies by providing terms and hierarchies that they can extend and reuse in their data collection components. In Chapter 2 we established this would be done by developing a mid level ontology that extends the Basic Formal Ontology and is extended directly by domain level ontologies.

Based on the design of the DCO there are several areas we can use to evaluate the ontology. Firstly, we are concerned with the general design of the DCO in terms of the components it provides: a data collection ontology must contain terms relevant to data collection while avoiding the declaration of terms that are too specific. Secondly, there is concern with documentation and general usability; this is important because users are expected to interact directly with the ontology components. Finally, we are concerned with the criteria used in Chapter 2, since those were the criteria applied when examining the reuse of source ontologies, and we must hold our solution to the same standard.

With our criteria in mind, we begin the chapter by outlining our hypotheses for a derived ontology constructed with the DCO. This sets up our two experiments, which produce derived ontologies using the DCO. The first is based on the Vehicle Ontology described in Chapter 3, but uses data collected by the United States Environmental Protection Agency on new vehicle fuel economy. The second is based on comparing an existing ontology, the Survey Ontology [30], to a version derived using DCO. Each of these designs is outlined in this chapter to describe its components, areas of reuse, and design philosophy, as well as the evaluation techniques used. Creating ontologies using the Data Collection Ontology is important because it evaluates usability, ensures that our ontology criteria are met, and allows us to apply additional ontology evaluation techniques.

The last topic of this chapter is the discussion of traditional evaluation methods, of which we use two: the FOCA method [14], a framework for ontology evaluation, and the criteria defined in Chapter 2, applied to the derived ontologies to ensure they still reasonably meet the standards we outlined in our search for upper level ontologies.

4.1 Research Hypotheses

In this section we further discuss the hypotheses we defined in Chapter 1 that will be used to evaluate the merit of the Data Collection Ontology. Specifically, the hypotheses will be tested against ontologies constructed using the Data Collection Ontology as a base. For convenience the hypotheses are reiterated below.

4.1.1 Domain Overlap

• Hypothesis (H1): There exists overlap between domain specific data collection terms and the data collection terms within the DCO.

• Null Hypothesis: There is no overlap between domain specific data collection terms and the terms in the DCO.

48 4.1.2 Term Specificity

• Hypothesis (H2): If and where overlap exists the DCO includes higher level terms than that of the domain specific data collection ontology (i.e. terms introduced are subclassing the DCO terms).

• Null Hypothesis: The overlapping terms are less generic than, or at the same level as, those in the DCO (i.e. terms introduced are parents to the DCO terms).

4.1.3 Ontology Coverage

• Hypothesis (H3): There are no terms defined outside of the DCO hierarchy (i.e. there are no terms defined at the owl:Thing level).

• Null Hypothesis: There are terms that subclass owl:Thing and are therefore outside of the DCO hierarchy.

Each of these hypotheses evaluates a part of our problem. The first concerns problem coverage: since the goal of the DCO is to enable data collection ontologies to reuse its components, the terms defined must cover data collection regardless of domain. The second ensures that when terms are reused they are subclassed, rather than becoming parents of defined terms because the DCO terms are too specific. Lastly, the third hypothesis concerns coverage of the hierarchy: when using the BFO as a base and defining terms in the DCO, we must not force developers to declare items outside of the DCO or BFO terms. In other words, nothing should be defined at the entity level or above.

4.2 Experiment 1: The EPA Fuel Economy Ontology

The first experiment takes the idea of the Vehicle Output Ontology and uses available data from the Environmental Protection Agency (EPA) to create the EPA Fuel Economy Ontology. Instead of modelling vehicle output, however, we use similar components to capture vehicle fuel consumption. The EPA releases fuel consumption ratings for new vehicles from 1984 to the present. In addition to fuel economy, the data includes engine information, transmission information, and vehicle size classifications. The size classifications are based on rules of interior and cargo capacity, and these rules serve as an example of how classifiers should work in the DCO. The EPA labels each vehicle with a size classification: Minicompact, Subcompact, Compact, Midsize, or Large. These values are used as a has expected type URI, and the classifiers contain the rules for each size class. It is unlikely the EPA has made an error with their data, but it provides a nice example of where classifiers could be used. The structure of the Fuel Economy Ontology can be seen in Appendix A.1.

The dataset can be found in [13]. It starts from 1984 and runs until the present day, with updates occurring often as the EPA tests vehicles; the ratings of historic vehicles are also updated periodically. The CSV format was used, with each row representing a test or data point, and was parsed into OWL using Python 3 and the Owlready2 API to generate OWL out of Python objects [8] (a sketch of this step follows the field list). From the dataset the following fields were used: city08, highway08, comb08, displ, engId, make, model, year, and VClass [4]. We define the fields below:

• city08 - The city Miles per Gallon (MPG) a vehicle achieves

• highway08 - The highway MPG a vehicle achieves

• comb08 - The combined MPG, calculated based on 55% city driving

• displ - The Vehicle’s engine displacement

• engId - The ID that identifies a particular engine

• make - The make of the vehicle

• model - The model of the vehicle

• year - The vehicle’s model year

• VClass - The vehicle's size class (Minicompact, Subcompact, Compact, Midsize, Large)
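The parsing step can be pictured as below. This is a hedged sketch rather than the thesis's actual script: the ontology IRI, class and property names, and the vehicles.csv filename are illustrative, while the field names match the EPA dataset.

    import csv
    from owlready2 import get_ontology, Thing, DataProperty

    onto = get_ontology("http://example.org/epa-fe-demo.owl")

    with onto:
        class Vehicle(Thing): pass
        class achieves_fuel_economy(DataProperty):
            range = [float]
        class has_expected_type(DataProperty):
            range = [str]

    with open("vehicles.csv", newline="") as f:
        for row in csv.DictReader(f):
            name = "{}_{}_{}".format(row["make"], row["model"], row["year"])
            v = onto.Vehicle(name.replace(" ", "_"))
            v.achieves_fuel_economy = [float(row["city08"]),
                                       float(row["highway08"]),
                                       float(row["comb08"])]
            # The EPA size label becomes the expected classifier IRI.
            v.has_expected_type = [onto.base_iri +
                                   row["VClass"].replace(" ", "")]

    onto.save("epa_fuel_economy.owl")   # serialize the populated ontology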

Vehicles were selected from a reduced number of years, and we chose long standing models such as the Honda Accord, Toyota Camry, and Chevrolet Malibu to allow trends to be seen and to reduce the ontology size. This reduction was primarily to lower the reasoning time of the classifiers, allowing for greater experimentation.

In the next sections the design and components of the ontology will be described along with relations used and how the DCO allows for reuse of terms and relations as well as how it provided the overall structure for the design of the EPA Fuel Economy Ontology.

4.2.1 Classes

The ontology defines Vehicle as the main subject, since a particular vehicle is what data is collected from. Vehicle is defined as a universal type, with each EPA test providing a specific example of a Vehicle. The ontology also models some of the data captured by the EPA, such as information about the engine, which is one of the biggest contributors to fuel consumption and is what the EPA has the user select before presenting ratings. Finally, we define Make and Model as subjects: Make because makes must maintain fuel economy averages, and Model because the data spans from 1984 to the present, so while models keep the same name they can vary significantly between generations, making it important to separate them.

For the meta data components, fuel economy datums are defined as well as volume and fuel consumption units that allow us to capture consumption, engine displacement, interior volume (for size classification), and fuel tank volume in the relevant units. Volume is particularly important since engine displacement is measured in litres today but was previously measured in cubic inches which is still the case with older EPA data, while interior volume is measured in cubic feet.

Figure 4.1: EPA Ontology Classifiers that are used to group vehicles based on their combined passenger and cargo capacity (Vehicle Size: Minicompact, Subcompact, Compact, Midsize, Large).

This ontology does not involve processes, since we are using already captured data. One can imagine there is some process the EPA uses to capture this data, and while it may be worth modelling, it is out of scope for our purposes; it remains supported by the DCO through the process hierarchy.

Classifiers were defined in the EPA ontology to group vehicles based on their size; in this case the EPA has labelled each vehicle, so we use that label as the has expected type URI to create groups for each vehicle size as defined by the EPA. The classifiers are based on the criteria for each size class, allowing the ontology to validate the data. Additional classifiers around fuel economy, based on estimates of what one may expect a vehicle to achieve, were created as further examples. All classes in the Fuel Economy Ontology can be seen in Figure A.1.

Table 4.1: Subjects defined in the EPA Fuel Economy Ontology. These subjects are based on the aspects that the EPA captures on the vehicles tested.

Vehicle - Captures a particular vehicle and its relevant attributes.
Manufacturer - Links vehicles and models to the manufacturer that produced them.
Model - Links vehicles to the particular model they describe.
Engine - Links an engine to a particular vehicle, as many vehicles offer a choice of engine.

Table 4.2: EPA Ontology Relations. These relations are defined to match the terminology used by the EPA for the values captured.

achieves fuel economy - Captures a fuel economy value.
has displacement - Captures engine displacement.
has engine code - Captures the unique code for a particular engine.

4.2.2 Relations

Several domain specific relations were added so the ontology makes sense from a vehicle's perspective and the terms used for composition are human readable. Each of these relations is placed under one of the existing DCO relation hierarchies, meaning there are no new top level relations. These relations were largely for the purpose of composition, and therefore many of them subclass the part of relation.

When linking meta data attributes, many of the original relations were used since no more specific terms were needed. We found that has meta property was sufficient, since the meta data types were labelled with domain specific names. However, more specific data relations were created to label the values being collected, such as fuel economy and various volumes.

Table 4.3: Object Relations for the Vehicle Class.

has engine (subclass of: part of) - Links an engine individual to a vehicle.
manufactured by (subclass of: part to) - Links a vehicle to a manufacturer.
model of (subclass of: part to) - Links a vehicle to a model.

54 Figure 4.2: Vehicle Ontology Structure. This defines the structure of the ontology using DCO Subjects, Measurement Units, and Datums to represent the domain specific EPA Fuel Economy content.

4.3 Experiment 2: A Case Study of the Survey Ontology

The second experiment involves the working version of the Survey Ontology [30] [38]. It involves reconstructing that ontology within the DCO, as well as taking the premise of a survey ontology and constructing one from scratch as a means of comparison. We acknowledge that the Survey Ontology is not complete, but it is recent and provides a good example use case for the DCO [38]; we take its recency and incompleteness into consideration in our full evaluation. The structure of the Survey Ontology can be seen in Appendix A.2.

The goal of the Survey Ontology is to develop a standard representation of a survey's structure and to allow for communication over the Semantic Web, with the hope of enabling reuse of survey data for other purposes [30]. This purpose overlaps well with our problem and the design intentions of the DCO, but at a more domain specific level, since it focuses on surveys: a form of data collection, but not data collection as a whole. The Survey Ontology is also recent and still under construction, making it a reasonable point of comparison as it is in similar infancy to the DCO.

The three ontologies to be compared are defined below, with names used throughout the remaining chapters. The rest of the chapter is dedicated to the complete design and terms of each ontology.

• Survey Ontology - The original, unaltered Survey Ontology as of this writing [38]. This is the main point of comparison, as it establishes the premise, design, and use cases of a survey ontology.

• Integrated Survey Ontology - The first of the DCO based Survey Ontologies, which integrates all the terms into the DCO, reusing equivalent DCO classes and relations where applicable. The Integrated Survey Ontology seeks to establish that it is possible to model any data collection ontology within the DCO.

• DCO Survey Ontology - The Survey Ontology designed and developed from the DCO; it does not reuse all classes and terms, instead taking the premise of what a Survey Ontology should be, based on [30], and implementing the design from the DCO philosophy of data collection. This is the ontology one would develop given the DCO and the requirements of a Survey Ontology.

4.3.1 The Survey Ontology Premise

Figure 4.3: A Survey Question with Example Answer Formats demonstrating how answer formats can be linked to questions along with responses.

(The figure shows the question subject Age asked two ways, What is your age? and When were you born?, with answer possibilities in several formats: integer years or months, age ranges such as 0-1 through 50+, birth decades, life stages from Baby to Senior, and exact dates. For future use of the first question with captured data, one also needs the date a respondent took the survey; otherwise the data is out of date.)

The Survey Ontology is focused on creating a generic and reusable framework for producing surveys and capturing survey answers. The work builds on earlier work that used XML to create a general format for questions, answers, and survey logic [30]. The main issue noted is that one often wants to compare questions, as well as responses to questions, across surveys; more generally, analysis is often overlooked in existing implementations [30]. The Survey Ontology focuses on creating a representation for the Semantic Web that allows questions, answers, and responses to be linked and reused.

The Survey Ontology contains a number of classes, starting with Survey. The Survey class defines a recursive structure of Survey Part instances that allows surveys to be sectioned and broken down into smaller chunks. Next it defines Question, which defines the basic format including text, an identifier, and a sequence; the Question type is then broken down into several types based on use case. Survey Response and Survey Answer are similar to Question in that they define required components, focusing on establishing links between users' responses, the Question they relate to, and start and end times, so surveyors can track when each component was used [30]. The last key component is the Subject, which links questions and responses to a general subject. This serves two purposes: first, it allows more semantic understanding of what a Question or Response seeks to capture, and second, it allows the use of more general relations, since a relation does not need to describe the Subject it collects data on [30]. An example would be surveytaker hasHadHeartAttack yes versus surveytaker hasAilment HeartAttack, with the subject Heart Attack assigned to the question being asked, i.e. surveyquestion hasSubject HeartAttack.

The structure of the ontology makes clear, through its restrictions and relations, that Answers, Questions, Response Formats, and Surveys are linked, so one can examine each component individually while maintaining connections to related components and moving between related classes. In addition, the Survey Ontology imposes some default values and ranges for question subtypes to enable faster construction [30], and it provides the ability to produce some standard survey types.

A specific example of a survey question can be seen in Figure 4.3, where we present questions for the subject of age. There are multiple ways of asking about age, and depending on how we ask, the information required to analyze and make inferences changes. The first question we pose is What is your age?, for which one would also need to know when the respondent answered in order for age to be determined during later analysis. The second question we pose is When were you born?, from which age can be calculated based on the current date. Both questions, however, have the same subject of age, and capturing that is one important component of the Survey Ontology. The second area where an ontology is important is the different formats in which respondents can provide their responses; in other words, they are guided by the answer format, of which we provide three examples per question. For comparative and analytical purposes the format used is important, as not all responses can be converted between formats: if age is provided in ranges it cannot be converted into specific integer values, although the reverse conversion is possible. A survey ontology provides benefit because, by linking formats to responses, questions, and subjects, one is able to make decisions about how to analyze the results.

This generic structure for capturing survey data and survey components is an example of where the DCO could be used to speed up ontology development and provide structure and components. This is the motivation for demonstrating the design twice: first integrated into the DCO, and second as a DCO variant built with the same goal in mind.

4.3.2 Integrated Survey Ontology

The Integrated Survey Ontology was developed using a translation approach: it took classes and relations from the Survey Ontology and placed them into their respective positions in the DCO. For each of the major subjects we discuss their positions in the ontology, along with major relations.

We note that all components carried over when we derived the Survey Ontology from the DCO, but several advantages were gained in doing so. First is the inclusion of units and types for measurements: the Survey Ontology does not define units for captured data, resulting in unit-less instances within the ontology, whereas the DCO version utilizes the DCO datums to capture the unit along with the measure. We also note that, through the BFO, the DCO provides generic time and does not force the use of xsd:dateTime, which enforces a particular format using the Gregorian calendar and therefore does not allow freedom for system specific datetime capture [3]. The DCO allows users to use different calendars or simply use epoch time measures denoting seconds from a particular date (the DCO default). These are specific areas where improvement is seen when developing with the DCO versus the original Survey Ontology.
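A small sketch of the time handling difference, with Response and has_time_value as hypothetical stand-ins for the actual terms:

    import time
    from owlready2 import get_ontology, Thing, DataProperty

    onto = get_ontology("http://example.org/dco-survey-demo.owl")

    with onto:
        class Response(Thing): pass
        class has_time_value(DataProperty):
            range = [int]   # epoch style: seconds from a chosen date

    r = onto.Response("response_001")
    r.has_time_value = [int(time.time())]  # no Gregorian format imposed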

DCO Survey Class Integration

For integration of classes, each class in the Survey Ontology was examined to determine its place in the DCO based on purpose and functionality. Some pieces were slightly re-engineered, with most falling into place with no change. All classes and their placement within the DCO can be seen visually in Appendix A.3.

The main Survey class can be seen as composed of parts and taking some place in time. Since it also requires one item to finish after another, it is placed under process -> dependent process -> exclusive state driven process -> basic state -> Survey. This means the relations normally used for time are inherited through the process, which is said to be part of some one dimensional time region and to contain at least one part. The DCO therefore allowed reuse of process properties, creating a place for a Survey class; a minimal sketch of this placement follows.
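Rendered as a class chain (the python-safe names below are our stand-ins for the DCO term names, not verbatim identifiers):

    from owlready2 import get_ontology, Thing

    onto = get_ontology("http://example.org/integrated-survey-demo.owl")

    with onto:
        class Process(Thing): pass
        class DependentProcess(Process): pass
        class ExclusiveStateDrivenProcess(DependentProcess): pass
        class BasicState(ExclusiveStateDrivenProcess): pass
        class Survey(BasicState): pass  # inherits the process time/part links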

The Survey Part and its subclasses were placed under Process Part, a logical place for these pieces as they are merely the components of the Survey. The Survey object then links to these parts as in the original Survey Ontology.

It should be noted that both the Survey and Survey Part classes were defined as Occurrents, meaning they take place in time and space; we can justify this by saying that a Survey is conducted in some place during some period of time. Each Survey is composed of parts (Survey Parts) which take place at some time and place, making them obvious candidates for Occurrent typing.

When looking at Question and Question Answer, we do not consider these as Occurrents but instead as Subjects that represent a general conceptual view of the things we are collecting data on. For example, there may be one Question that is asked several ways, and we want to compare the types of Responses we get back. Similarly, Answers have different formats, which is often itself the subject of study for survey conductors. It is for this reason that we made the Question and Question Answer types Subjects. Likewise, Disorders, Medications, and Population were considered Subjects, as they are types we are collecting information about.

Statements were considered Classifiers: given the definition of a Statement and its purpose of tying together Agents and Verbs to understand context and time values, Statements act to classify Survey components based on particular criteria.

DCO Survey Relation Integration

Relations for the survey ontology are largely composition type relations, meaning they were defined under has part, with particular components moving under has meta property and has data entity. The full list of relations and their locations can be seen in Table 4.4.

The data relations were largely translated under has value, reusing the has time value, has expected property, has expected type, and has control properties. The has expected property and has expected type relations were used for the ontology's ability to link to external types. has time value was used in place of separate start and end time properties; the DCO uses a single property, since time is linear and can be sorted to determine start and end, and, following DCO standards, the format of time is not strictly enforced. All migrated relations can be seen in Table 4.5, with a visual representation in Figure A.13.

Table 4.4: DCO Survey Object Relation locations within the DCO. Each relation is presented with its parent in the DCO.

contingentOn - has part
contingentPart - has part
hasAgent - has part
hasDative - has part
hasDemographic - has part
hasExitResponse - has part
hasFacitive - has part
hasInstrument - has part
hasProvider - has part
hasQuestion - has part
hasQuestionComponent - has part
hasQuestionType - has part
hasRespondent - has part
hasResponse - has part
hasStatement - has part
hasSurvey - has part
hasSymptom - has part
hasTerminationResponse - has part
hasTreatment - has part
hasVerb - has part
hasFocus - has data entity
hasSubjectProperty - has data entity
hasTopic - has data entity
subjectIDQuestion - has meta property

Table 4.5: DCO Survey Data Relation locations within the DCO. Each relation is presented with its parent in the DCO.

forProperty - has value
hasBaseURI - has value
hasName - has value
hasProperty - has value
hasPurpose - has value
hasQuestionText - has value
hasRepeatCount - has value
hasResponseString - has value
hasSpecification - has value
hasSubject - has value
hasVerb - has value
maxRepeat - has value
referenceNumber - has value
responseData - has value
targetObject - has value
targetProperty - has value
targetValue - has value

4.3.3 DCO Survey Ontology Variant

The DCO Survey Ontology focuses on the premise of the Survey Ontology (see Section 4.3.1) but does not attempt to integrate all of the Survey Ontology's components. With that in mind, the main focus was to develop a Survey Ontology with the DCO components in mind and to address the shortcomings of the Survey Ontology's design. The DCO Survey Ontology thinks about surveys in a non-traditional sense, considering a survey through its components and leaving the process by which it is conducted up to the domain. By that we mean a survey may be conducted with a person as a respondent, it may be conducted on someone's behalf, or it may be conducted on a subject that cannot answer questions directly, with responses coming from studied data; for example, we may survey natural resources and provide answers to questions based on findings or results of tests. This assumption is applied throughout the ontology, which makes the structure of classes seem sparse in comparison to the original Survey Ontology; the reasons for the sparse structure are noted in the differences below. The full structure of the DCO Survey Ontology can be seen in Appendix A.4.

Answer presents the first notable change, as it stores the possible formats or options a user has when answering a particular question, as seen in Figure 4.3. The Survey Ontology instead uses Questions to establish these formats and to set ranges and limits for particular response values. The DCO Survey Ontology uses Answer to provide basic types but does not impose any restrictions, instead allowing Classifiers to establish the range respondent values should fall into. This was chosen because we feel that specifying ranges inside a high level ontology is presumptive of the domain that will use it. One specific example is AgeQT, which specifies a range of 1-120 [30]; this would pose an issue for a survey on trees, whose ages can reach into the thousands of years. Another example is YearQT, where particular values are provided [30]. The DCO Survey provides similar types but does not include restrictions, encouraging subclassing or creating instances in domain ontologies to specify the options a respondent has when answering a Question. For capturing ranges one should use Classifiers to determine whether the ontology has an accurate world view of the survey subject.

This difference comes down to the structure of the DCO, where Subjects reflect a high level view of what one is capturing; because a Question, an Answer, and the subject of the Survey are really themselves subjects, they are defined as Subject classes in the DCO. The DCO Survey Ontology focuses on including only the main components and leaves ranges to Classifiers. In a similar sense, concrete examples are left to be created as instances. Through using the DCO's structure we avoid specific ranges being ingrained inside classes designed for reuse, while still allowing ontology elements to be grouped or classified with domain specific Classifiers.
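To make the Classifier idea concrete, the following is a minimal sketch using Owlready2 [8]. All class and relation names are illustrative stand-ins rather than the DCO's actual IRIs, and it assumes a Java runtime is available for the bundled HermiT reasoner.

```python
# A minimal sketch of range-free Answer types plus Classifiers, using Owlready2 [8].
# All names (Response, has_value, HumanAgeResponse, ...) are illustrative, not DCO IRIs.
from owlready2 import *

onto = get_ontology("http://example.org/dco-survey-sketch.owl")

with onto:
    class Response(Thing): pass                      # stand-in for a DCO Subject

    class has_value(Response >> int, FunctionalProperty): pass

    # Classifiers: equivalent-class definitions carry the domain-chosen range,
    # so the base Answer/Response types stay unrestricted and reusable.
    class HumanAgeResponse(Response):
        equivalent_to = [Response & has_value.some(
            ConstrainedDatatype(int, min_inclusive=1, max_inclusive=120))]

    class TreeAgeResponse(Response):
        equivalent_to = [Response & has_value.some(
            ConstrainedDatatype(int, min_inclusive=1, max_inclusive=5000))]

r = Response("resp1")
r.has_value = 42

with onto:
    sync_reasoner()   # HermiT (requires Java); 42 fits both ranges, so resp1
                      # should be inferred under both age classifiers
print(r.is_a)
```

The point of the design choice is that removing a domain's Classifier never destroys the underlying data; only the categorization changes.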

The second major difference is the focus on units that comes from the DCO base. The Survey Ontology provides ranges, which can be numerical or instance based, but it does not provide any method for entering or associating units with captured data. As an example of why this is a problem, consider Figure 4.3, where an age can be given in months or years expressed as an integer. Unless units are specified, one could completely miss that a Response refers to a baby that is 16 months old by assuming the number is in years. Units in the DCO are propagated through so that captured data uses the Measurement Datum classes for respondent data. This ensures that each number, string, or complex type has the appropriate unit as part of the Response.
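A sketch of this unit-bearing pattern follows, again with illustrative names rather than the DCO's exact terms; the point is simply that the unit individual travels with the value.

```python
# A minimal sketch of a unit-bearing response: the value is a measurement datum
# linked to a unit label, so "16" cannot silently be read as years.
# Names are illustrative stand-ins for the DCO's Measurement Datum classes.
from owlready2 import *

onto = get_ontology("http://example.org/dco-units-sketch.owl")

with onto:
    class MeasurementUnitLabel(Thing): pass
    class TimeUnit(MeasurementUnitLabel): pass
    class MeasurementDatum(Thing): pass
    class has_measurement_value(MeasurementDatum >> float, FunctionalProperty): pass
    class has_measurement_unit(MeasurementDatum >> MeasurementUnitLabel,
                               FunctionalProperty): pass

month = TimeUnit("month")
age_response = MeasurementDatum("age_response_1")
age_response.has_measurement_value = 16.0
age_response.has_measurement_unit = month   # the unit is part of the Response
```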

To ensure the DCO Survey Ontology remains generic, it does not include any classes outside those mentioned above; in particular, there are no Person definitions or other definitions that may not apply to surveys in every domain.

Relations

Many of the existing relations in the DCO were taken advantage of for the DCO Survey Ontology's purposes: has part is used for composition, and has answer, has response, and has question were created to link our major components. As with all DCO relations we have inverse relations. In addition we defined the data properties has name, has purpose, and has question text to label our objects. Where the Survey Ontology used generic response data properties, we use our has value to enable generic responses of any type, since we do not impose string, numerical, or multi-selection response value types. Therefore, using the DCO only a minimal number of relations needed to be introduced. A summary of all additions and their usages can be seen in Table 4.6. A visual representation of the relations can be seen in Figure A.19.

Table 4.6: DCO Survey relations added to the DCO along with the DCO parent under which each is defined.

Relation | Parent | Description
has response | has part | Links response objects to Questions or classifiers.
has question | has part | Links question objects to Survey Parts; can also be used for classifiers looking for particular questions.
has answer | has part | Links answer formats to questions, i.e. one question may have multiple formats that one can use when presenting the question to a user.
has name | has value | Used to provide human readable names to instances in the ontology; an example would be a person taking a survey or a title for the Survey.
has purpose | has value | Used for the Survey class and possibly a Question instance, allowing the survey facilitator or respondent to understand why the question is being asked.
has question text | has value | Used to store the actual question string for a question instance.
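As an illustration of how these relations compose, here is a hedged sketch instantiating a tiny survey. Only the relation names follow Table 4.6; the class names and ontology IRI are hypothetical placeholders.

```python
# A sketch tying the Table 4.6 relations together: a Survey has Questions, a
# Question offers Answer formats, and Responses link back to their Question.
from owlready2 import *

onto = get_ontology("http://example.org/dco-survey-relations.owl")

with onto:
    class Survey(Thing): pass
    class Question(Thing): pass
    class Answer(Thing): pass
    class Response(Thing): pass

    class has_question(Survey >> Question): pass
    class has_answer(Question >> Answer): pass
    class has_response(Question >> Response): pass
    class has_question_text(Question >> str): pass
    class has_name(DataProperty):
        range = [str]

q = Question("q1")
q.has_question_text = ["What is your age?"]
q.has_answer = [Answer("integer_format")]   # one of possibly several formats
q.has_response = [Response("resp_q1_p1")]

s = Survey("census_2017")
s.has_name = ["Example Census"]
s.has_question = [q]
```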

4.4 Evaluation using Traditional Techniques

Techniques for validation vary depending on the design and use case of the ontology. One family is data driven approaches [21], which utilize domain data as a corpus and evaluate ontology content based on the number of keywords and terms from the corpus that are matched in the source ontology [21]. The data driven approach was designed to improve upon another evaluation technique that involves the use of experts [20, 33] to evaluate the structure, content, and completeness of an ontology [21]. The disadvantages of using experts include issues of bias, as well as the difficulty of choosing experts relevant for a particular ontology [21, 20, 33]. The above methods are not suitable for evaluating the DCO, as they rely on a specific domain from which to extract data or recruit experts [21, 20, 33] to evaluate domain content, which the DCO, by its generic nature, does not define. Because the DCO's generic design contains no domain specific data, there is no suitable corpus or expert(s) that could evaluate the ontology completely.

Other methods focus on the structural and inferencing capabilities of ontologies [41, 44] and are more suitable for the DCO since they are not domain specific. Given the variety of such evaluation methods, the DCO is evaluated using the FOCA method [14], which combines ideas from several existing methods to derive evaluative measures for ontology design. The FOCA method introduces several measures as well as a framework for evaluating them, making it possible for an ontology developer of any experience level to evaluate their design. Secondly, it allows ontologies to be compared regardless of their level or purpose, making it suitable for evaluating the DCO against domain level ontologies, and for comparing DCO derived ontologies with domain level ontologies.

4.4.1 The FOCA Methodology

FOCA has several parts to its methodology: determining the type of ontology, a questionnaire to evaluate components, a framework to follow based on ontology type, and finally a statistical model that calculates the quality of ontologies. Most of these components are fairly common in the Applied Ontology community, but previous methods lack the questionnaire and framework, which is why focus will be placed on these components [14]. The FOCA method breaks its questionnaire down into roles that separate measures based on Ontological Commitments, Intelligent Reasoning, Efficient Computation, Human Expression, and Substitution [14]. Each of these goals evaluates a particular part of the ontology and has questions associated with it, which will be defined. Additionally, it groups questions based on the following metrics: Completeness, Adaptability, Conciseness, Consistency, Computational Efficiency, and Clarity [14]. These goals and metrics are found throughout the Applied Ontology community [14], which is why we feel this method is a strong indicator of an ontology's quality.

To ensure readers unfamiliar with the method understand what it evaluates, each question will be discussed along with how it will be evaluated. Additionally, there are a few cases where questions were altered from the original evaluation; this will be explained on a question by question basis. FOCA divides questions into major goals, which is how the questions are grouped below.

FOCA Goals and Questions Defined

Goal 1 centres around the ontology design, ensuring that one has competency questions defined and answered, and that the ontology has some form of reuse. See Table 4.7 for full descriptions.

Table 4.7: FOCA Goal 1 [14] defined along with a description used to evaluate each question.

Question ID | Question | Description/Evaluation Criteria
Q1 | Were the competency questions defined? | If the ontology does not have competency questions defined, assign 0. If they do exist there are three sub-questions: Does the document define the objective of the ontology? Does the documentation define stakeholders? Does the document define use cases? Each sub-question receives a grade of one of 0, 25, 50, 75, 100. The overall grade is the mean of the three sub-questions.
Q2 | Were the competency questions answered? | This grade is 0 if competency questions were not defined. Otherwise determine if the ontology has satisfied the competencies. Grades: 0, 25, 50, 75, 100.
Q3 | Did the ontology reuse other ontologies? | If the ontology reuses other ontology(s) assign 100, 0 otherwise.

Goal 2 centres around the ontology's design and structure. Specifically, it ensures that the ontology's terms meet the level of genericity expected given how generic the design is intended to be (domain or upper level ontology). For example, it ensures that an ontology claiming to be upper level includes an inheritance structure before it defines domain specific terms. Goal 2 also checks the domain of the ontology, ensuring that it does not define terms outside the domain it claims to represent. See Table 4.8 for full descriptions.

Table 4.8: FOCA Goal 2 [14] defined along with a description used to evaluate each question.

Question ID | Question | Description/Evaluation Criteria
Q4 | Did the ontology impose a minimal ontological commitment? | Answer this if it is type 2 (domain ontology). Ensure that the ontology does not define high level abstractions and content that is not specific to a domain; i.e. a Facebook ontology does not need to define a computer system. Grades are 0, 25, 50, 75, 100.
Q5 | Did the ontology impose a maximal ontological commitment? | Answer only if the ontology is type 1 (high level ontology). Ensure the ontology defines high level abstractions such that domain level elements have more general parents. Grades are 0, 25, 50, 75, 100.
Q6 | Are the ontology properties coherent with the domain? | Determine if the ontology contains elements that are not coherent with the domain. For example, a car ontology should not contain lion. Grades are 0, 25, 50, 75, 100.

Goal 3 centres around looking through the ontology and determining if it has contradictions or invalid reuse of terms. Contradictions are issues where the properties on relations (functional, transitive, reflexive, etc.) are not applicable to the term in the ontology. Redundant axioms are cases where reuse should not occur because a term with the same name has two different meanings. See Table 4.9 for full definitions.

Table 4.9: FOCA Goal 3 [14] defined along with a description used to evaluate each question.

Question ID | Question | Description/Evaluation Criteria
Q7 | Are there contradictory axioms? | Check if the classes and relations contradict the domain; for example, if hasSocialSecurityNumber is not functional this would be a problem because a person can only have one. Based on the number of contradictions give a grade of 0, 25, 50, or 75, or 100 if there are none.
Q8 | Are there redundant axioms? | Determine if there are classes or relations that model the same thing with the same meaning (i.e. using mouse for both computer hardware and the animal). If there are many redundancies grade 0, if there are some assign one of 25, 50, 75, and 100 if there are no redundancies.

Goal 4 centres around reasoning and reasoner performance, and this is where our first modification is made. Question 10 (Q10) [14] is based on the speed of ontological reasoning: the reasoner stopping is a grade of 0, any delay a grade of 25, 50, or 75, and running quickly a grade of 100 (see Table 4.10). This does not allow for easy comparison, as "quick" is relative. Furthermore, it does not consider the number of ontology components (classes, relations, properties, etc.) or the expected number of individuals, which is important where equivalence relations are used. For speed our grade will be boolean: 0 for reasoner failure, or 100 for successful reasoning. We will also note the time, since its acceptable magnitude depends on the specific needs of an individual.
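A sketch of how such a boolean grade plus timing could be recorded follows, using Owlready2 [8]; the ontology path is a hypothetical placeholder.

```python
# A sketch of our redefined Q10: grade 0 on reasoner failure, 100 on success,
# with the wall-clock time reported separately rather than graded.
# The file path is hypothetical; sync_reasoner() requires a Java runtime.
import time
from owlready2 import *

onto = get_ontology("file://dco.owl").load()

start = time.perf_counter()
try:
    with onto:
        sync_reasoner()                      # HermiT by default
    q10_grade = 100
except OwlReadyInconsistentOntologyError:    # failure to reason scores 0
    q10_grade = 0

elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"Q10 grade: {q10_grade}, reasoning time: {elapsed_ms:.0f} ms")
```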

Table 4.10: FOCA Goal 4 [14] defined along with a description used to evaluate each question.

Question ID | Question | Description/Evaluation Criteria
Q9 | Does the reasoner bring modelling errors? | Check if the reasoner returns errors. If there are many errors or the reasoner stops assign 0, if there are some errors assign 25, 50, or 75, and 100 if there are no errors.
Q10 | Does the reasoner perform quickly? | Determine if the reasoner runs quickly. If the reasoner stops assign 0, if there is any delay assign one of 25, 50, 75, or 100 if it runs quickly.

Goal 5 (see Table 4.11) centres around documentation that is internal to the ontology, as well as ensuring the modelled ontology matches what is described in the design documentation. For Question 12 (Q12) [14] we modify the criteria. The original criteria are based on definitions and descriptions in the ontology, with scores based on the language used and deductions made for using a language other than English. For the evaluations in this work, full score is assigned as long as all terms are defined in English, and no points are deducted for ontologies that additionally include definitions in other languages, even if those definitions do not cover all terms. The explanation in [14] gives the reader the impression that one should deduct for using other languages; this will not be the case in our evaluation.

Table 4.11: FOCA Goal 5 [14] defined along with a description used to evaluate each question.

Question ID | Question | Description/Evaluation Criteria
Q11 | Is the documentation consistent with the modelling? | Determine if there are definitions in the ontology. If there are none assign 0. Check that each class and relation has a definition and that it is to the same detail as the document. Secondly, determine if the documentation explains each term and justifies it. For each sub-question assign 25, 50, 75, or 100. Calculate the mean of the two sub-questions.
Q12 | Were the concepts well written? | Determine if the classes and relations are written in an understandable and correct form (according to English or another language). If the ontology is difficult to understand or full of poorly written terms assign 0. If there is a mix of languages, assign one of 25, 50, 75. If the ontology is well written and one language is used assign 100.
Q13 | Are there annotations in the ontology bringing the concepts definitions? | Check the existing annotations, these being the definitions of the modelled concepts. If there are no annotations assign 0. If there are some annotations assign 25, 50, or 75. If all classes have annotations, assign 100.
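With all five goals defined, the bookkeeping behind Steps 1 and 2 amounts to averaging question grades per goal. The following is a minimal sketch of that aggregation; the goal groupings mirror Tables 4.7 through 4.11, and N/A questions (such as Q4 for a Type 1 ontology) are simply excluded.

```python
# A sketch of the Step 1-2 bookkeeping: each question gets a grade from
# {0, 25, 50, 75, 100} (sub-question grades are averaged first), and each
# goal's score is the mean of its graded questions; None marks N/A.
from statistics import mean

GOALS = {
    "Goal 1": ["Q1", "Q2", "Q3"],
    "Goal 2": ["Q4", "Q5", "Q6"],
    "Goal 3": ["Q7", "Q8"],
    "Goal 4": ["Q9", "Q10"],
    "Goal 5": ["Q11", "Q12", "Q13"],
}

def goal_scores(grades):
    scores = {}
    for goal, questions in GOALS.items():
        graded = [grades[q] for q in questions if grades.get(q) is not None]
        scores[goal] = mean(graded)
    return scores

# Example: a Type 1 ontology where Q4 is not applicable.
print(goal_scores({"Q1": 100, "Q2": 100, "Q3": 100,
                   "Q4": None, "Q5": 75, "Q6": 100,
                   "Q7": 100, "Q8": 100, "Q9": 100, "Q10": 100,
                   "Q11": 87.5, "Q12": 100, "Q13": 100}))
```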

Removal of Step 3: Quality Verification

We noted earlier that Step 3, which uses equations to calculate an overall grade, would not be conducted in our evaluation. Step 3 is presented here and its removal justified.

Figure 4.4: The total quality estimator of FOCA.

$$\hat{\mu}_i = \frac{\exp\{-0.44 + 0.03(Cov_S \times Sb)_i + 0.02(Cov_C \times Co)_i + 0.01(Cov_R \times Re)_i + 0.02(Cov_{Cp} \times Cp)_i - 0.66\,LExp_i - 25(0.1 \times NL)_i\}}{1 + \exp\{-0.44 + 0.03(Cov_S \times Sb)_i + 0.02(Cov_C \times Co)_i + 0.01(Cov_R \times Re)_i + 0.02(Cov_{Cp} \times Cp)_i - 0.66\,LExp_i - 25(0.1 \times NL)_i\}}$$

The Quality Verification step is based on beta regression models from another author [14], which allow ontology evaluators to integrate the Step 2 scores into an overall grade. See Figure 4.4 for the full equation; a computational sketch follows the variable definitions below.

• CovS is the mean of grades from Goal 1

• CovC is the mean of grades from Goal 2

• CovR is the mean of grades from Goal 3

• CovCp is the mean of grades from Goal 4

• LExp is the variable for evaluator experience with 1 being very experienced and 0 otherwise.

• Nl is 1 if a goal was not possible to evaluate (if any question of a goal could not be evaluated).

• Sb = 1, Co = 1, Re = 1, Cp = 1 because all goals are considered; a variable is 0 if its goal is not considered.
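For concreteness, here is a direct transcription of the estimator into code. This is only a sketch of the published formula [14], included to make the equation's mechanics explicit; as argued next, we do not use Step 3 in our evaluation.

```python
# A transcription of the Figure 4.4 estimator with the published coefficients [14].
# Inputs are the goal means (0-100), LExp in {0, 1}, and NL in {0, 1}.
import math

def foca_total_quality(cov_s, cov_c, cov_r, cov_cp, l_exp, nl,
                       sb=1, co=1, re=1, cp=1):
    eta = (-0.44
           + 0.03 * cov_s * sb + 0.02 * cov_c * co
           + 0.01 * cov_r * re + 0.02 * cov_cp * cp
           - 0.66 * l_exp - 25 * (0.1 * nl))
    return math.exp(eta) / (1.0 + math.exp(eta))   # logistic link, score in (0, 1)
```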

The equation itself is problematic for a few reasons. Firstly, the coefficient values were calculated using only 6 evaluators of varying experience levels [14], which we feel is too few; we cannot be sure these coefficients accurately scale the scores based on their importance. Secondly, the ontology or ontologies evaluated were not noted, so we cannot be sure if or how the scaling of goal scores is affected by ontology type. The third issue is the use of the scores: Question 3 (Q3) is a boolean value whereas the other question scores in Goal 1 are not, yet all are averaged together, and boolean values cannot meaningfully be averaged. Additionally, some questions are themselves averages of sub-questions but are amalgamated into one value, which places different weighting on certain questions. This makes certain aspects worth more than others, which may not be the case depending on what is important to the evaluator. The general form of the estimator, with its coefficients left unestimated, is shown in Figure 4.5.

Figure 4.5: The total quality estimator of FOCA (general form).

$$\hat{\mu}_i = \frac{\exp\{\beta_1 + \beta_2(Cov_S \times Sb)_i + \beta_3(Cov_C \times Co)_i + \beta_4(Cov_R \times Re)_i + \beta_5(Cov_{Cp} \times Cp)_i - \beta_6\,LExp_i - \beta_7\,NL_i\}}{1 + \exp\{\beta_1 + \beta_2(Cov_S \times Sb)_i + \beta_3(Cov_C \times Co)_i + \beta_4(Cov_R \times Re)_i + \beta_5(Cov_{Cp} \times Cp)_i - \beta_6\,LExp_i - \beta_7\,NL_i\}}$$

One will also notice that Goal 5 does not have a variable in the equation and as a result is never included in the calculation, which raises the question of how its averaged score affects the outcome. Furthermore, while the score ranges from 0 to 1, there are no guidelines on what makes a good score, whether ontology type or other aspects have an effect, or whether there are optimal subranges for differing ontologies.

Finally, bringing an ontology's evaluation down to a single score value may not accurately depict the ontology's potential, depending on its use-case.

FOCA Usage

Our approach with this method is to use the ontology types, questions, and questionnaire framework to rate each component. We then present the results for each question and section, along with written justifications for why each score was chosen based on the framework criteria. The justifications are important because many of the questions are not boolean and take scores from 0, 25, 50, 75, or 100. We also evaluate all sub-questions even when the overall rating will be zero due to a parent question's pre-condition; this provides the reader with a greater understanding of what each ontology covers. In other words, our evaluation consists of the first two steps of the FOCA evaluation with justifications, leaving out Step 3, the quality verification or statistical measures. The full steps, and those we conduct, are illustrated in Figure 4.6.

Figure 4.6: The FOCA Methodology Steps [14] for evaluating an ontology. [Diagram: Step 1, Ontology Type Verification (Type 1 or Type 2); Step 2, Questions Verification (each Question(Q, N) graded against its Criterion (M) within its Evaluated Role (N), up to Question(Qmax, N)); Step 3, Quality Verification (Total Quality, Partial Quality, or Not Evaluated).]

While we feel Step 3 does not provide merit, as discussed, the prior two steps do provide useful insight into how an ontology is constructed and documented, so we feel there is strong merit in the earlier parts of the framework. That merit, however, is lost when the evaluation is reduced to a single number.

Chapter 5

Results

In our Evaluation Methodology chapter we introduced three DCO based ontologies in addition to the Survey Ontology, which serves as a base comparison point. Our results focus on a comparison of the Survey Ontology variants against the original design and on an evaluation of the DCO itself. Not included in the results is the EPA Fuel Economy Ontology, since it was designed for illustrative purposes; specifically, it demonstrates the use of classifiers and how retroactive data collection is performed. Additionally, unlike the Survey Ontology variants, it would provide no point of comparison against other ontologies, as it is free standing and not based on any existing design.

This chapter is divided into two major sections. The first section consists of the DCO results, which are broken down into the Competency Question answers using the components and design of the DCO; the FOCA evaluation is then used to evaluate the DCO as a mid level ontology. Absent are the results for our Research Hypotheses, as they are conducted only on derived ontologies. The second section is the greater focus of this chapter, as it compares the Survey Ontology variants to show the increasing advantage of utilizing the DCO as a base, and then as a base combined with its philosophy. We then move on to evaluating our Research Hypotheses and discuss the impact of using the DCO on our ontology selection criteria.

5.1 Data Collection Ontology Evaluation

The results for the Data Collection Ontology consist of evaluating the Competency Questions we outlined for the DCO prior to designing it. We then apply the FOCA methodology to determine the DCO's success as a mid level ontology.

5.1.1 Competency Question Evaluation

In Section 3.3 we defined competency questions for the ontology covering the problems it should solve as a mid level ontology. In this section we evaluate these questions in terms of the components defined in the DCO. These questions, we note, are part of the evaluation for the base ontology only and will not be applied to derived designs. All answers and justifications can be found in Tables 5.1 and 5.2.

Table 5.1: Competency Question justifications to assess whether the ontology implementation meets the requirements of each question.

ID | Question | Justification
1 | Can one construct a process based on differing control flow? | Yes; through the use of either state driven processes or independent processes, DCO supports processes that run concurrently or follow a blocking process. Using control relations one can link instances to particular process states.
2 | Has support for the construction of complex structures composed of several types, allowing existing structures to be mimicked? | Yes; using datums one can mimic existing classes and objects that one may have defined in another location. The has expected type relation allows the linking of these complex structures to external types.
3 | Has the ability to apply universal time across the ontology? | Utilizing BFO's zero dimensional and one dimensional time spaces along with the has time relation, one can use any time measure or space required.
4 | Has the ability to assign expected types to individuals, allowing automatic classification using the reasoner? | Classifiers provide a place for one to define equivalency relations based on groups to categorize, thus enabling non-destructive categorization in the ontology.
5 | Has the ability to assign qualities to data types? | DCO defines several qualities under the quality class in BFO, allowing one to relate qualities to their particular types.

Table 5.2: Competency Question Justifications Continued

ID | Question | Justification
6 | Has the ability to assign units of measure to data captured? | Datums and data units enforce the capture of units along with values collected.
7 | Has the ability to re-classify existing data? | Using classifiers one can easily run the reasoner to re-classify and organize data based on current ontological understanding.
8 | What is the amount of captured aggregates? | Through the use of subjects one can link entities based on an aggregate and perform analytics on larger aggregates rather than single measures.
9 | Has the ability to query based on data type and by data structure? | The has expected type relation allows querying based on type and structure external or internal to the ontology. Furthermore, through the use of datums one can group structures into broad categories.
10 | Has the ability to query based on time? | Using the time relations one can query based on time using any particular method of capturing time.

5.1.2 FOCA Evaluation

In this section we conduct an evaluation of the DCO using the FOCA methodology, going section by section and providing scores and justifications for each score. In the evaluation it was determined that the DCO would be considered a Type 1 ontology [14], since it defines its elements at a generic level in trying to establish the domain of data collection. As a result, Q5 was answered instead of Q4 for Goal 2.

The DCO generally fared very well in the evaluation, with an average score of 100 for Goal 1, an average score of 87.5 for Goal 2, an average score of 100 for Goal 3, an average score of 100 for Goal 4, and an average score of 95.8 for Goal 5. Overall the ontology does well, but it loses marks mostly due to its design: given its size and age, we cannot say it completely covers the domain. It also loses some points in documentation, but that decision was made to keep the internal comments relatively short compared with the descriptions in this document. The scores for each question can be seen in Table 5.3.

Table 5.3: Question Scores for the DCO Ontology

Question | Score
Q1 | 100
Q2 | 100
Q3 | 100
Q4 | N/A
Q5 | 75
Q6 | 100
Q7 | 100
Q8 | 100
Q9 | 100
Q10 | 100
Q11 | 87.5
Q12 | 100
Q13 | 100
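As a check on the arithmetic, the averaged goal scores quoted above follow directly from the question grades in Table 5.3:

$$\text{Goal 2} = \frac{Q5 + Q6}{2} = \frac{75 + 100}{2} = 87.5, \qquad \text{Goal 5} = \frac{Q11 + Q12 + Q13}{3} = \frac{87.5 + 100 + 100}{3} \approx 95.8$$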

The DCO scores well mostly because it is structured around reuse and provides documentation both internally and externally via this document. It falls short because, as a mid level ontology, we cannot say that it covers its domain perfectly. We acknowledge that, being a mid level ontology, the DCO may be missing terms, or it may define certain terms that will later be found to be unnecessary; terms may therefore be added to or removed from the DCO over time. The other area where it loses points is that the internal ontology documentation is not identical to this document. Specifically, we did not include the same level of detail for terms such as classifiers and meta data classes, for the sake of brevity. Scores and justifications are summarized in Tables 5.4, 5.5, 5.6, 5.7, and 5.8.

Table 5.4: DCO FOCA Goal 1 Justifications

Question | Score | Justification
Were the competency questions defined? | 100 | Yes, the DCO defines competency questions in the ontology along with this document. There are 10 competency questions in total. In addition, the DCO defines its purpose as an annotation in the ontology so users can decide if it fits their needs. Based on this the DCO scores 100 in all sub-questions, and therefore 100 total.
Were the competency questions answered? | 100 | Yes, the DCO has answers to its competency questions in the main description along with in this document.
Did the ontology reuse other ontologies? | 100 | Yes, the DCO reuses BFO as an upper level ontology.

Table 5.5: DCO FOCA Goal 2 Justifications

Question | Score | Justification
Did the ontology impose a minimal ontological commitment? | - | DCO would be considered a Type 1 ontology, therefore this is not evaluated.
Did the ontology impose a maximal ontological commitment? | 75 | Maximal commitment is met with high level term definitions creating hierarchies of which survey specific terms are children. Due to the DCO's infancy and potential changes to BFO we will not assign full score, since it is possible additional terms will need to be added to completely satisfy the data collection domain.
Are the ontology properties coherent with the domain? | 100 | The ontology is coherent with its generic purpose for data collection and does not define domain specific terms, as per its purpose.

Table 5.6: DCO FOCA Goal 3 Justifications

Question | Score | Justification
Are there contradictory axioms? | 100 | There are no contradictory axioms, as checked by HermiT, Pellet, and FaCT++.
Are there redundant axioms? | 100 | No, there are no redundant axioms. DCO has been reviewed several times to ensure there are no redundancies in its design.

Table 5.7: DCO FOCA Goal 4 Justifications

Question | Score | Justification
Did the reasoner bring modelling errors? | 100 | No errors were detected by HermiT, Pellet, or FaCT++.
Did the reasoner perform quickly? | 100 | Yes, the reasoner completed in an average of 267 ms and never stopped, which meets our redefined criteria.

Table 5.8: DCO FOCA Goal 5 Justifications

Question | Score | Justification
Is the documentation consistent with modelling? | 87.5 | Modelling is consistent with the documentation presented in this paper. All relations are covered in the tables and all classes are described in the design section (score: 100). The definitions in the ontology are, however, shorter than those in this document (score: 75).
Were the concepts well written? | 100 | There are no written errors and the documentation is sufficient to develop solutions using the ontology.
Are there annotations in the ontology that show definitions of the concepts? | 100 | Yes, all classes and relations are documented with annotations in the ontology.

5.2 Comparing the Survey Ontologies

Our comparison of the Survey Ontology variants is from the structural perspective, since documentation is reused from the original Survey Ontology, which is still a work in progress. Documentation scores can easily be changed by the Survey Ontology developers by releasing an additional document with competency questions and updating the ontology file. Furthermore, it is the structural and reuse differences that form the advantage of the DCO. These differences are reflected in the scoring, with improvement seen from the Survey Ontology to the Integrated Survey Ontology, and further improvement from the Integrated Survey Ontology to the DCO Survey Ontology. The results can be seen in Tables 5.9, 5.10, 5.11, 5.12, and 5.13.

The major advantages are seen in the reuse question (Q3) and again in Q5, where the DCO's reuse avoids redefining common terms and relations while providing a structure in which domain specific terms have parents. Furthermore, there are no terms in the DCO based survey ontologies that have owl:Thing as a parent: classes always have some parent, either through the BFO or through the DCO's own terms. This creates familiar hierarchies, allowing developers less familiar with the domain to understand where terms exist in the larger world view.
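This claim is mechanically checkable. The following sketch lists any class asserted directly under owl:Thing; the file path is a hypothetical placeholder standing in for a DCO based ontology, against which the check should print nothing.

```python
# A sketch of the structural check behind this claim: flag any class whose only
# named parent is owl:Thing. The ontology path is a hypothetical placeholder.
from owlready2 import *

onto = get_ontology("file://dco_survey_ontology.owl").load()

for cls in onto.classes():
    named_parents = [p for p in cls.is_a
                     if isinstance(p, ThingClass) and p is not Thing]
    if not named_parents:
        print(cls.name, "hangs directly off owl:Thing")
```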

The FOCA scores for documentation and competency questions are not an area where the DCO provides benefit, but it did avoid issues regarding duplicate object and data properties, since existing terms were reused instead of redefined, helping to avoid potential error.

With success demonstrated using the traditional FOCA method, we now consider the structural numbers of each ontology and contrast them with our criteria for choosing an ontology, stated in Section 2.2.1. Doing so allows us to determine what impact the DCO has on ontology selection. The criteria we defined in Section 2.2.1 are restated below.

The first criterion is based on the number of terms and relations in the ontology, where we prefer fewer of each for two main reasons. Firstly, upper level ontologies are meant to be derived into domain level ontologies and thus will have more terms and relations added over time, and large ontologies introduce performance penalties, potentially resulting in an ontology that is intractable for a reasoner [34]. Secondly, in terms of understandability, the fewer terms a person must know to use an ontology, the easier it is to get started. This also reduces reliance on documentation and expert knowledge, making it easier to design and organize derived ontologies. Furthermore, large ontologies may deter usage of the ontology altogether.

The second criterion we care about is usage and popularity. Popularity of an upper level ontology is important when considering its purpose of unifying ontologies [32]. We want to look at what people are using to see what is working and how many domains are being captured by the upper level ontology; if only one domain is using a particular ontology, it is possible that it has not met the needs of others. Additionally, greater popularity increases the likelihood that ontology developers will have experience with the ontology.

Finally, we use a more formal requirement for upper level ontologies, for the purpose of ensuring that the base is kept generic: an upper level ontology must be free from any domain specific terms or relations. We are not interested in ontologies that define thousands of terms to satisfy a large number of domains, since it is unlikely such an ontology could realistically satisfy each domain.

Starting with the first criterion, we compared ontology sizes, summarized in Table 5.14. As one would expect, the class count is larger in both DCO based Survey Ontology variants; however, in both derived forms the number of relations was dramatically reduced. When classes and relations are considered together, the size difference is insignificant. This means neither implementation is too large by our standards: when the world of ontologies is considered as a whole, ontologies with thousands of classes are not uncommon, making a size difference of roughly 70 terms relatively small and unlikely to deter people.

When looking at relations one can see the dramatic effect reuse makes. For object relations, the Derived Survey Ontology added only 25 of the 41 relations defined in the Survey Ontology; the other relations had equivalents that already existed in the DCO. Similarly, the Derived Survey Ontology adds only 21 of the 26 data properties, again using equivalents declared in the DCO. Therefore, in terms of growth, the DCO maintains a small number of relations through reuse of major data collection requirements such as time, units, and data structures.
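Assuming the Derived Survey Ontology counts here correspond to the Integrated Survey Ontology column of Table 5.14, the arithmetic of this reuse is:

$$\underbrace{50}_{\text{object relations}} = \underbrace{25}_{\text{DCO base}} + \underbrace{25}_{\text{added}}, \qquad \underbrace{33}_{\text{data relations}} = \underbrace{12}_{\text{DCO base}} + \underbrace{21}_{\text{added}}$$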

Our second criterion involves looking at domain specific content. We note that the Survey Ontology includes domain specific content as of this writing, but we acknowledge that it is not tied into the structure of the ontology and could be removed without major refactoring. Therefore, we will not dive deeply into this criterion, as it is not something the DCO will affect.

Lastly is usage and popularity. Both ontologies are in their infancy; however, the DCO's base provides familiarity, since any OBO user or developer is already familiar with our design to an extent. An OBO user would already know the basic design, classes, and how the ontology is structured, since it is based on the BFO. To use the DCO they would only need to learn about Subjects, Classifiers, and Datums. We argue this provides significant benefit in terms of usage and popularity, since the core of the ontology is well established. The Survey Ontology does not have this advantage: it uses no base ontology and defines terms exclusively in its own hierarchy, meaning developers would need to learn the entire structure before use.

5.2.1 FOCA Evaluation Table Notation Defined

Here are some brief definitions of table components so one can more easily navigate the results. Major differences between versions are bolded to show changes across the variants. All justifications are presented in the same order as the score columns, meaning the first justification for a question refers to the Survey Ontology while the last refers to the DCO Survey Ontology. Any dash present means refer to the row above for explanation. See below for more terminology.

• "-" - Refer to the line above; for justifications it means the same justification applies.

• SO - Survey Ontology Score.

• ISO - Integrated Survey Ontology Score.

• DSO - DCO Survey Ontology Score.

• Q#-# - A number following a dash after a Question ID refers to the sub-question number, i.e. Q1-2 refers to the second sub-question of Question 1.

• N/A - Question was not scored.

Table 5.9: Goal 1 Questions and Justifications for Survey Ontologies

QID | SO | ISO | DSO | Justification
Q1 | 0 | 0 | 0 | No competency questions are defined in the Survey Ontology.
Q1 Subquestions
Q1-1 | 100 | 100 | 100 | The document does contain the ontology objective, which is to "represent the logic of the Survey, including contingent questions and repeated sections" [14].
Q1-2 | 100 | 100 | 100 | Yes, the scenarios are cases when one would normally use an XML or other survey representation.
Q1-3 | 100 | 100 | 100 | Yes, the stakeholders are defined throughout the document, making it obvious who would be interested in such an ontology.
Q2 | 0 | 0 | 0 | There were no competency questions in the ontology or in the document.
Q3 | 0 | 100 | 100 | SO: No, the ontology does not reuse other ontologies; it does, however, reuse the schema:Person class. ISO/DSO: Yes, the DCO and the BFO are reused.

Results for Q1 and Q2 presented no major differences because all documentation was based on the Survey Ontology, so any issues regarding competency questions were carried through. The major difference is seen with Q3, where the DCO based variants improve in score due to their reuse of a higher level ontology, the DCO, which in turn reuses the BFO.

Table 5.10: Goal 2 Questions and Justifications for Survey Ontologies

Question | SO | ISO | DSO | Justification
Q4 | N/A | N/A | N/A | Not applicable; all ontologies are Type 1.
Q5 | 25 | 75 | 75 | SO: The Survey Ontology provides very little in the way of abstraction by going from owl:Thing to SurveyThing. It does not define the time or space a survey is represented in. Additionally, the ontology incorporates terms such as schema:Person as owl:Thing, as well as Population and Interval. ISO: The Integrated Survey Ontology does define hierarchies for each of its objects, with Surveys represented as Processes, SurveyParts as Process Parts, and all objects organized according to their place in time. No objects are derived directly from owl:Thing. DSO: Similar to the Integrated Survey Ontology, the DCO Survey Ontology also defines a hierarchy for survey terms.
Q6 | 0 | 0 | 100 | SO/ISO: The ontology does include elements that are not coherent with the domain, such as Disorder and Medication, which are not elements of all surveys. DSO: The DCO Survey Ontology only includes relevant terms for the survey domain.

Goal 2 presents some notable changes, beginning with Q5, where improvements are seen because the DCO includes hierarchies for terms, which FOCA states is important for non-application, or Type 1, ontologies [14]. Since the goal of each of these ontologies is to establish a high level domain that is not application specific, they should define terms in some form of hierarchy to place them within the larger world. This is where major advantages start to be seen with the DCO: it allows people new to the survey domain to place particular terms, and those familiar with BFO will have an even easier time with the basic structure.

Table 5.11: Goal 3 Questions and Justifications for Survey Ontologies

Question | SO | ISO | DSO | Justification
Q7 | 100 | 100 | 100 | There are no contradictory axioms.
Q8 | 100 | 100 | 100 | There are no redundant terms in the ontology.

There are no differences for axioms or redundancies.

Table 5.12: Goal 4 Questions and Justifications for Survey Ontologies

Question | SO | ISO | DSO | Justification
Q9 | 75 | 100 | 100 | SO: There is one error, a redeclaration of targetProperty as both an object property and a data property. ISO/DSO: There are no errors.
Q10 | 100 | 100 | 100 | Yes, the reasoner completed, and did so in an average of just under 500 ms based on our testing.

Goal 4 presented only one improvement over the Survey Ontology, which defines a relation used as both an object relation and a data relation. Due to equivalences in relation purpose, these relations were not migrated to the DCO variants, which benefited the scoring as the error was no longer present.

Table 5.13: Goal 5 Questions and Justifications for Survey Ontologies

Question | SO | ISO | DSO | Justification
Q11 | 50 | 50 | 50 | The documentation explains each term very well, along with examples of how it will work, which results in a score of 100. However, the definitions in the document are not the same as those in the ontology, so we award a score of 50.
Q11 Subquestions
Q11-1 | 0 | 0 | 0 | No, the definitions in the ontology are not the same; some detail is missing and not all have annotations in the modelling.
Q11-2 | 100 | 100 | 100 | Yes, the terms and the design of the ontology are well documented and explained.
Q12 | 100 | 100 | 100 | The annotations are well written and easy to understand, with no errors.
Q13 | 50 | 50 | 100 | SO: Not all elements were annotated. Many of the object and data relations were not annotated. Classes that were not annotated include: Feature, Interval, SurveyThing, Mental Disorder, Physical Disorder, Medication, ConfidenceQT, ExcuseQT, FrequencyQT, satisfactionQT, YearQT, YesNoQT, TextQEQ, Experience Statement, Intrinsic Statement, and Non Repeated Survey Part. ISO: All elements were transferred over along with their annotations, so the elements missing annotations were included as well. DSO: All elements were annotated with descriptions and purposes.

Goal 5 is based on components, where again we see little difference, the only one being that there is no domain specific content in the DCO Survey Ontology.

5.2.2 Evaluating Ontology Hypotheses

Looking at the hypotheses defined in Section 4.1, we noted these hypotheses were for derived ontologies and not the DCO itself. In this section we will examine these hypotheses using the Survey Ontology DCO implementations.

The first hypothesis states that we expect overlap in terms, meaning the DCO should contain terms and relations that can be used by the Survey Ontology variants. Overlap is found through the integration of the Survey as a Process, which directly uses the control flow for repeating and branching. Similarly, Questions, Answers, and Person are considered Subjects, as they are what is studied in surveys. Relations for time were used directly in place of those defined in the Survey Ontology itself, demonstrating direct overlap. Therefore we can say that our hypothesis is true: there is domain overlap with the DCO components.

The second hypothesis states that we expect terms to be at a lower level. We note that none of the terms placed into the DCO derived Survey Ontology were placed at a level in the hierarchy that was above any existing DCO term meaning there were no terms that were of greater generality than those defined in the DCO. Therefore our hypothesis is true.

Lastly, we are concerned with coverage, meaning we do not want terms defined outside of the DCO's hierarchy. In the construction of the Survey Ontology derivatives, all terms fit within the DCO hierarchy, meaning nothing was subclassed directly under owl:Thing. Therefore this hypothesis was true.

5.2.3 Survey Ontology Evaluation Conclusions

The Survey Ontology implementations meet our hypotheses. Therefore, with the greater scores in the FOCA evaluation and no major detriment to our criteria, we can determine that the DCO provided notable improvements: through reuse, especially of relations, which we were able to reduce, and through the hierarchies and organizational benefits provided to the classes of the Survey Ontology. Within the DCO we used Subjects and Processes to describe Survey Ontology terms so they could be understood and motivated at a higher level, and as a result we reused the definitions and relations imposed on those classes to reduce the amount that needed to be created.

Table 5.14: Survey Ontology Sizes Compared

Measure | Survey Ontology | Integrated Survey Ontology | DCO Survey Ontology | DCO Base
Class Count | 37 | 102 | 107 | 73
Object Relation Count | 41 | 50 | 34 | 25
Data Relation Count | 26 | 33 | 16 | 12

5.3 Conclusion

There are a few areas of emphasis from this chapter. Firstly, the competency questions for the DCO are answered in the implementation, which we believe will satisfy domain level ontologies that collect data in some form. Secondly, the FOCA scores for the DCO in its base form are 100, the highest value, in most sections, which means it evaluates well both structurally and in terms of documentation.

More important is that, through the comparison of the Survey Ontology variants, we have shown steady improvement in FOCA scores with each DCO version, with the DCO Survey Ontology performing best by using the DCO philosophy of data collection, which focuses on high level definitions and reuse of terms through its subjects. Additionally, through the use of classifiers, collected data can be sorted and classified. The DCO advantage was further demonstrated when looking at the ontology selection criteria we defined, where we considered size, popularity, and term generality. The DCO based variants did not significantly increase size while providing greater familiarity to OBO ontology developers.

As with the FOCA evaluation the ontology selection criteria also demonstrated increas- ing value with the DCO Survey Ontology due to its smaller size and higher level term definitions.

The results demonstrate the strong merit of the DCO's design, both in its ability to accommodate existing designs and in its philosophy of high level definitions and classifiers that organize terms based on their specificity.

Chapter 6

Conclusions

The research question of this thesis is: How does one model the domain of data collection using an ontology while maintaining a level of domain agnosticism such that the ontology can be reused regardless of domain? Stated another way: Is it possible to construct an ontology that models the data collection domain such that it can be reused in other domains? The research in this thesis has been directed at answering this question, through the analysis of existing research into reusable ontology design. Upper level ontology and reuse research was consulted to establish a base point for our design, as well as to establish patterns that allow an ontology to be extended. In this chapter we reflect on the proposed design of the Data Collection Ontology (DCO), a mid level data collection ontology.

The design of the DCO is premised on the notion that, by being a mid level ontology, we base our work on an existing upper level design that provides familiarity while allowing a domain level ontology to extend and add terms that reflect the language used in that domain. The DCO is based on the Basic Formal Ontology (BFO), which opens up the opportunity for OBO ontology developers to introduce a data collection component into their ontologies. This provides usability benefits for the many developers who are already familiar with the BFO design and structure. Furthermore, the BFO is well documented, and many existing ontologies allow newcomers to validate their design ideas.

The DCO also follows the BFO and other high level designs in establishing a philosophy that focuses on high level descriptions of terms, which in the case of the DCO means data collection. This focus on the high level also creates a structure that can accommodate existing designs while providing its own philosophy for how terms should be organized under its hierarchy. This philosophy provides three main use cases for the DCO, outlined below, which allow its use at different stages of data collection.

1. Live Collection describes an ontology that will collect data over time, done in real time with a system feeding the DCO based ontology data as it receives it. The Survey Ontology variants are an example of live collection.

2. Retroactive Collection describes an ontology that collects and organizes data that has already been collected and exists in a system but there is a desire to reuse and analyze this data. In this case the data is loaded into the DCO and organized using classifier classes to establish equivalence relations that allow for categorization of data items. The Fuel Economy Ontology is an example of retroactive data collection.

3. System Integration allows for classes and instances to be transferred and modelled in external systems. One of the major goals for the DCO was to enable system integration which is achieved through relations to link types.

The evaluation of the ontology through the comparison of survey ontologies revealed that while its accommodation provides benefits through the creation of hierarchies and the reuse of existing terms, greater benefit was seen through using the design philosophy. These benefits were reflected using the FOCA methodology for evaluation, as well as through our own ontology selection criteria. This merit comes from the hierarchies of higher level terms the DCO affords to domain level ontologies, as well as the reuse of existing terms that allows a reduction in term definitions for the domain ontology.

6.1 Contributions

A few major contributions came out of the development of the DCO in addition to its ability to model data collection. These are discussed below.

• The first major contribution is the Classifier class which is the result of meeting the motivation for reasoning. Classifiers are a powerful mechanism that enables use of a reasoner to place instances within an ontology. While Classifiers apply to the data collection domain they can also apply to other ontologies and domains as a way of creating a dynamic ontology design that enables an ontology’s world view to be affected by its instances and have the ontology’s world view affect instances. Such a design paves the way to dynamic ontologies that can operate and be affected by external systems.

• The second major contribution is the DCO Survey Ontology which can serve as the starting point for survey creators to integrate their designs into a high level ontology design that enables sharing survey results and survey components.

• Finally reuse is a major contribution of this research. DCO provides the ability to share and reuse components through existing BFO based ontologies as well as in the future with other DCO based ontologies. This means users will establish a design that allows other ontology developers to share domain related components by using the same structures.

6.2 Future Work and Limitations

The design of the DCO is not without limitations. Firstly, it must be acknowledged that the DCO is by no means complete and is likely to change and evolve over time, with terms and relations potentially added and removed to better accommodate data collection ontologies that extend the DCO. Like an upper level ontology, a mid level ontology can never be said to be fully complete and should be expected to change over time. Furthermore, greater evaluation of the ontology's design outside of the author's own work should be conducted.

With the limitations in mind, the proposal of this thesis also presents several areas for future work. The first and most obvious is to get the DCO into the hands of ontology developers. Its mid level terms and the world view captured will be best validated by reviewing several ontologies developed using the DCO. This will allow us to note missing terms as well as unused terms, so that future versions can have better coverage while maintaining a small footprint.

The second major area of future work would be a complementary or integrated Data Analysis Ontology (DAO) that includes terms for analyzing collected data. Classifiers may be used as the basis for this ontology. Such a design should focus on greater integration with external systems and may include statistical tools and methodologies. There is potential for such a design to be integrated directly with the DCO through extension, or to be a separate but complementary design that works in tandem with the DCO.

The last area of future work would be the extension or redevelopment of the DCO Survey Ontology, taking the premise defined here and some existing components to build a true and complete version of the DCO Survey Ontology. Such a design would allow users both to extend the ontology for their use and to use the ontology as-is by creating instances for their Questions, Answers, and Survey Parts. The enhanced DCO Survey Ontology should allow for the quick development of surveys and serve as a standard ontology for sharing Questions, Answers, and Responses among studies and for use in analysis.

Bibliography

[1] Basic formal ontology. http://www.obofoundry.org/ontology/bfo.html. (Accessed on 03/26/2017).

[2] Basic formal ontology - summary — NCBO BioPortal. https://bioportal.bioontology.org/ontologies/BFO. (Accessed on 02/13/2017).

[3] Datetime - schema.org. http://schema.org/DateTime. (Accessed on 07/10/2017).

[4] Fuel economy web services. http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle. (Accessed on 07/11/2017).

[5] General formal ontology - summary — NCBO BioPortal. https://bioportal.bioontology.org/ontologies/GFO. (Accessed on 02/12/2017).

[6] General formal ontology (GFO). http://www.onto-med.de/ontologies/gfo/index.jsp. (Accessed on 02/13/2017).

[7] Github - bfo-ontology/bfo. https://github.com/BFO-ontology/BFO. (Accessed on 02/12/2017).

[8] jibalamy / owlready2 bitbucket. https://bitbucket.org/jibalamy/owlready2. (Accessed on 06/24/2017).

[9] Laboratory for applied ontology - DOLCE. http://www.loa.istc.cnr.it/old/DOLCE.html. (Accessed on 02/13/2017).

[10] Ontology portal - projects. http://www.adampease.org/OP/Projects.html. (Accessed on 07/10/2017).

[11] sumo/Merge.kif at master, ontologyportal/sumo, GitHub. https://github.com/ontologyportal/sumo/blob/master/Merge.kif. (Accessed on 02/12/2017).

[12] Basic formal ontology (bfo) — users. http://ifomis.uni-saarland.de/bfo/users, 12 2016. (Accessed on 02/12/2017).

[13] Download fuel economy data. http://www.fueleconomy.gov/feg/download.shtml, May 2017. (Accessed on 05/26/2017).

[14] Judson Bandeira, Ig Ibert Bittencourt, Patricia Espinheira, and Seiji Isotani. FOCA: A methodology for ontology evaluation. arXiv preprint arXiv:1612.03353, 2016.

[15] Julita Bermejo. A simplified guide to create an ontology. Madrid University, 2007.

[16] Claudio Bettini, Oliver Brdiczka, Karen Henricksen, Jadwiga Indulska, Daniela Nicklas, Anand Ranganathan, and Daniele Riboni. A survey of context modelling and reasoning techniques. Pervasive and Mobile Computing, 6(2):161–180, 2010. Context Modelling, Reasoning and Management.

[17] Thomas Bittner and Barry Smith. Normalizing medical ontologies using basic formal ontology. In Proceedings of GMDS 2004. 2004.

[18] Thomas Bittner and Barry Smith. Normalizing medical ontologies using basic formal ontology. 2004.

[19] Stefano Borgo. How formal ontology can help civil engineers. In Ontologies for Urban Development, pages 37–45. Springer, 2007.

[20] Janez Brank, Marko Grobelnik, and Dunja Mladenić. A survey of ontology evaluation techniques. 2005.

[21] Christopher Brewster, Harith Alani, Srinandan Dasmahapatra, and Yorick Wilks. Data driven ontology evaluation. 2004.

[22] Werner Ceusters and Barry Smith. Aboutness: Towards foundations for the information artifact ontology. pages 1–5, 2015.

[23] B. Chandrasekaran and Todd R. Johnson. Generic tasks and task structures: History, critique and new directions. Second Generation Expert Systems, pages 232–272, 1993.

[24] World Wide Web Consortium. OWL Web Ontology Language overview. https://www.w3.org/TR/owl-features/, 2 2004. (Accessed on 09/01/2017).

[25] World Wide Web Consortium. URI - W3C wiki. https://www.w3.org/wiki/URI, 2 2005. (Accessed on 09/01/2017).

[26] Oscar Corcho, Mariano Fernández-López, and Asunción Gómez-Pérez. Methodologies, tools and languages for building ontologies. Where is their meeting point? Data & Knowledge Engineering, 46(1):41–64, 2003.

[27] Wolfgang Degen, Barbara Heller, Heinrich Herre, and Barry Smith. GOL: A general ontological language. In Formal Ontology and Information Systems. Citeseer, 2001.

[28] Paulo C Barbosa Fernandes, Renata SS Guizzardi, and Giancarlo Guizzardi. Using goal modeling to capture competency questions in ontology-based systems. Journal of Information and Data Management, 2(3):527, 2011.

[29] Mariano Fernández-López. Overview of methodologies for building ontologies. 1999.

[30] M. Katsumi and M.S. Fox. An ontology for surveys. Proceedings of the Association for Survey Computing, 7, 2016.

[31] Thomas R. Gruber. Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum.-Comput. Stud., 43(5-6):907–928, December 1995.

[32] Heinrich Herre. General Formal Ontology (GFO): A Foundational Ontology for Conceptual Modelling, pages 297–345. Springer Netherlands, Dordrecht, 2010.

[33] Hlomani Hlomani and Deborah Stacey. Approaches, methods, metrics, measures, and subjectivity in ontology evaluation: A survey. Semantic Web Journal, pages 1–5, 2014.

[34] Ian Horrocks. Description logics in ontology applications. In International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, pages 2–13. Springer, 2005.

[35] Viviana Mascardi, Valentina Cordì, and Paolo Rosso. A comparison of upper ontologies. In WOA, volume 2007, pages 55–64, 2007.

[36] Viviana Mascardi, Angela Locoro, and Paolo Rosso. Automatic ontology matching via upper ontologies: A systematic evaluation. IEEE Transactions on Knowledge and Data Engineering, 22(5):609, 2010.

[37] Jonathan M. Mortensen, Matthew Horridge, Mark A. Musen, and Natalya F. Noy. Modest use of ontology design patterns in a repository of biomedical ontologies. In Proceedings of the 3rd International Conference on Ontology Patterns - Volume 929, WOP'12, pages 37–48, Aachen, Germany, 2012. CEUR-WS.org.

[38] M.S. Fox. Enterprise Integration Laboratory (EIL). http://www.eil.utoronto.ca/, 2016. (Accessed on 04/23/2017).

[39] Ian Niles and Adam Pease. Origins of the ieee standard upper ontology. In Working notes of the IJCAI-2001 workshop on the IEEE standard upper ontology, pages 37–42. Citeseer, 2001.

[40] Ian Niles and Adam Pease. Towards a standard upper ontology. In Proceedings of the international conference on Formal Ontology in Information Systems-Volume 2001, pages 2–9. ACM, 2001.

[41] Leo Obrst, Werner Ceusters, Inderjeet Mani, Steve Ray, and Barry Smith. The evaluation of ontologies. Semantic web, pages 139–158, 2007.

[42] Christopher Ochs, Yehoshua Perl, James Geller, Sivaram Arabandi, Tania Tudorache, and Mark A. Musen. An empirical analysis of ontology reuse in bioportal. Journal of Biomedical Informatics, 71:165–177, July 2017.

[43] Catherine Roussey, François Pinet, Myoung Ah Kang, and Oscar Corcho. An introduction to ontologies and ontology engineering. In Ontologies in Urban Development Projects, pages 9–38. Springer, 2011.

[44] Dietmar Seipel, Joachim Baumeister, and Klaus Prätor. Declarative evaluation of ontologies with rules. WLP 2014, page 47, 2014.

[45] Elena Simperl. Reusing ontologies on the semantic web: A feasibility study. Data & Knowledge Engineering, 68(10):905–925, 2009.

[46] Stanford University. Protégé. https://protege.stanford.edu/, 2016. (Accessed on 09/01/2017).

[47] Denny Vrandečić. Ontology Evaluation, pages 293–313. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[48] OBO Technical WG. The OBO Foundry. http://www.obofoundry.org/, 2017. (Accessed on 03/26/2017).

Appendix A

Diagrams

A.1 Fuel Economy Ontology Structure

Figure A.1: Fuel Economy Ontology Base

owl:Thing
    bfo:entity
        dco:classifier
        bfo:continuant
        bfo:occurrent

Figure A.2: Fuel Economy Ontology Continuant Tree Structure from BFO.

bfo:continuant
    bfo:generically dependent continuant
        dco:meta data
        dco:measurement datum
            dco:complex measurement datum
            dco:scalar measurement datum
                dco:length measurement datum
                dco:time measurement datum
        dco:measurement unit label
            dco:length unit
            dco:time unit
    bfo:independent continuant
        dco:subject
            feo:Engine
            feo:Manufacturer
            feo:Model
            feo:Vehicle
        bfo:immaterial entity
            bfo:continuant fiat boundary
                bfo:one-dimensional continuant fiat boundary
                bfo:two-dimensional continuant fiat boundary
                bfo:zero-dimensional continuant fiat boundary
            bfo:site
            bfo:spatial region
                bfo:one-dimensional spatial region
                bfo:two-dimensional spatial region
                bfo:three-dimensional spatial region
                bfo:zero-dimensional spatial region
        bfo:material entity
            bfo:fiat object part
            bfo:object
            bfo:object aggregate
    bfo:specifically dependent continuant
        bfo:quality
            dco:relational quality
            dco:datatype property
                dco:boundedness
                    dco:bounded
                    dco:numerically bounded
                    dco:unbounded
                dco:cardinality
                    dco:countable
                    dco:finite
                    dco:uncountable
                dco:equality
                    dco:equal
                    dco:inequal
                dco:exactness
                    dco:exact
                    dco:approximate
                dco:numericalness
                    dco:non numerical
                    dco:numerical
        bfo:realizable entity
            bfo:disposition
            bfo:function
            bfo:role
                dco:data role
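Read as OWL, each parent-child edge in Figure A.2 is an rdfs:subClassOf axiom. The following Turtle fragment is a minimal sketch of the pattern by which the FEO domain classes attach beneath dco:subject; the dco: and feo: namespace IRIs are illustrative placeholders, not the ontologies' published IRIs.

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix dco:  <http://example.org/dco#> .  # placeholder IRI
    @prefix feo:  <http://example.org/feo#> .  # placeholder IRI

    # FEO domain classes plugged in under the generic DCO subject class
    feo:Vehicle      a owl:Class ; rdfs:subClassOf dco:subject .
    feo:Engine       a owl:Class ; rdfs:subClassOf dco:subject .
    feo:Manufacturer a owl:Class ; rdfs:subClassOf dco:subject .
    feo:Model        a owl:Class ; rdfs:subClassOf dco:subject .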

Figure A.3: The Fuel Economy Ontology's class structure starting with the Occurrent class.

bfo:occurrent
    bfo:process
        bfo:process profile
        bfo:history
        dco:independent process
        dco:dependent process
            dco:exclusive state driven process
                dco:basic state
                dco:probabalistic state
    bfo:process boundary
        dco:process part
    bfo:spatiotemporal region
    bfo:temporal region
        bfo:one-dimensional temporal region
        bfo:zero-dimensional temporal region

Figure A.4: Fuel Economy Ontology Data Relations

owl:topDataProperty
    dco:has expected property
    dco:has expected type
    dco:has control
    dco:can repeat
    dco:has sequence
    dco:has value
    dco:has coordinate value
    dco:has measurement value
    feo:achieves fuel economy
    feo:has displacement
    dco:has time value
    feo:has name
    feo:has engine code
    dco:has maximum
    dco:has minimum
    dco:has percentage

Figure A.5: Fuel Economy Ontology Object Relations

owl:topObjectProperty
    dco:has data entity
    dco:has measure
    dco:has measurement datum
    dco:has measurement unit
    dco:has time stamp
    dco:has object control
    dco:branches to
    dco:has part
    dco:contains process
    dco:has subject type
    dco:has quality
    dco:location of
    dco:realizes

A.2 Survey Ontology Structure

Figure A.6: The original Survey Ontology Class Structure

owl:Thing
    survey:Feature
    survey:Interval
    survey:Population
    schema:Person
    survey:SurveyThing
        survey:Disorder
            survey:Mental_Disorder
            survey:Physical_Disorder
        survey:Agent_Property
        survey:Question_Answer
        survey:Question
            survey:Close_Ended_Question
                survey:ConfidenceQT
                survey:Contingent_Question
                survey:ExcuseQT
                survey:FrequencyQT
                survey:SatisfactionQT
                survey:YearQT
                survey:YesNoQT
            survey:Non_Contingent_Question
                survey:Comment_Question
            survey:Open_Ended_Question
                survey:Number_QEQ
                survey:Text_QEQ
        survey:Statement
            survey:Experience Statement
            survey:Intrinsic Statement
            survey:Perform Statement
            survey:Possess Statement
        survey:Survey
            survey:Survey_Response
        survey:Survey_Part
            survey:Contingent_Survey_Part
            survey:Non_Repeated_Survey_Part
            survey:Repeated_Survey_Part

Figure A.7: The original Survey Ontology Data Relations

owl:topDataProperty
    survey:SurveyDataProperty
        survey:canRepeat
        survey:endTime
        survey:forProperty
        survey:hasBaseURI
        survey:hasIdentifier
        survey:hasMax
        survey:hasMin
        survey:hasName
        survey:hasProperty
        survey:hasPurpose
        survey:hasQuestionText
        survey:hasRepeatPartCount
        survey:hasResponseString
        survey:hasSequence
        survey:hasSpecification
        survey:hasSubject
        survey:hasValue
        survey:hasVerb
        survey:maxRepeat
        survey:referenceNumber
        survey:responseData
        survey:startTime
        survey:targetObject
        survey:targetProperty
        survey:targetValue

Figure A.8: The original Survey Ontology Object Relations

owl:topObjectProperty
    survey:defined_by
    survey:for_time_interval
    survey:located_in
    survey:SurveyObjectProperty
        survey:contingentOn
        survey:contingentPart
        survey:contingentResponse
        survey:dataCollector
        survey:exitResponse
        survey:focusOfSurvey
        survey:forDiagnosis
        survey:hasAgent
        survey:hasAgentProperty
        survey:hasDative
        survey:hasDemographic
        survey:hasDiagnosis
        survey:hasFactitive
        survey:hasInstrument
        survey:hasLocation
        survey:hasObject
        survey:hasProvider
        survey:hasQuestion
        survey:hasQuestionComponent
        survey:hasQuestionType
        survey:hasRespondent
        survey:hasResponse
        survey:hasStatement
        survey:hasSymptom
        survey:hasTopic
        survey:hasTreatment
        survey:nextQuestion
        survey:repeatsFrom
        survey:subjectIDQuestion
        survey:subjectProperty
        survey:surveyAnswer
        survey:surveyFocus
        survey:targetClass
        survey:terminationResponse

A.3 Integrated Survey Ontology Class Structure

Figure A.9: Integrated Survey Ontology Base

owl:Thing
    bfo:entity
        dco:classifier
        bfo:continuant
        bfo:occurrent

Figure A.10: The Integrated Survey Ontology class structure starting from the Classifier class.

dco:classifier
    survey:Statement
        survey:Experience Statement
        survey:Intrinsic Statement
        survey:Perform Statement
        survey:Possess Statement

Figure A.11: Integrated Survey Ontology Continuant

bfo:continuant
    bfo:generically dependent continuant
        dco:meta data
        dco:measurement datum
            dco:complex measurement datum
            dco:scalar measurement datum
                dco:length measurement datum
                dco:time measurement datum
        dco:measurement unit label
            dco:length unit
            dco:time unit
    bfo:independent continuant
        dco:subject
            survey:Feature
            survey:Interval
            survey:Population
            survey:Disorder
                survey:Mental_Disorder
                survey:Physical_Disorder
            survey:Question_Answer
            survey:Question
                survey:Close_Ended_Question
                    survey:ConfidenceQT
                    survey:Contingent_Question
                    survey:ExcuseQT
                    survey:FrequencyQT
                    survey:SatisfactionQT
                    survey:YearQT
                    survey:YesNoQT
                survey:Non_Contingent_Question
                    survey:Comment_Question
                survey:Open_Ended_Question
                    survey:Number_QEQ
                    survey:Text_QEQ
        bfo:immaterial entity
            bfo:continuant fiat boundary
                bfo:one-dimensional continuant fiat boundary
                bfo:two-dimensional continuant fiat boundary
                bfo:zero-dimensional continuant fiat boundary
            bfo:site
            bfo:spatial region
                bfo:one-dimensional spatial region
                bfo:two-dimensional spatial region
                bfo:three-dimensional spatial region
                bfo:zero-dimensional spatial region
        bfo:material entity
            bfo:fiat object part
            bfo:object
            bfo:object aggregate
    bfo:specifically dependent continuant
        bfo:quality
            dco:relational quality
            dco:datatype property
                dco:boundedness
                    dco:bounded
                    dco:numerically bounded
                    dco:unbounded
                dco:cardinality
                    dco:countable
                    dco:finite
                    dco:uncountable
                dco:equality
                    dco:equal
                    dco:inequal
                dco:exactness
                    dco:exact
                    dco:approximate
                dco:numericalness
                    dco:non numerical
                    dco:numerical
        bfo:realizable entity
            bfo:disposition
            bfo:function
            bfo:role
                dco:data role

Figure A.12: Integrated Survey Ontology Occurrent

bfo:occurrent
    bfo:process
        bfo:process profile
        bfo:history
        dco:independent process
        dco:dependent process
            dco:exclusive state driven process
                dco:basic state
                    survey:Survey
                dco:probabalistic state
    bfo:process boundary
        dco:process part
            survey:Survey_Part
                survey:Contingent_Survey_Part
                survey:Non_Repeated_Survey_Part
                survey:Repeated_Survey_Part
    bfo:spatiotemporal region
    bfo:temporal region
        bfo:one-dimensional temporal region
        bfo:zero-dimensional temporal region

Figure A.13: Integrated Survey Ontology Data Relations

owl:topDataProperty
    dco:has expected property
    dco:has expected type
    dco:has control
    dco:can repeat
    dco:has sequence
    dco:has value
    dco:has coordinate value
    dco:has measurement value
    dco:has time value
    dco:has maximum
    dco:has minimum
    dco:has percentage
    survey:SurveyDataProperty
        survey:canRepeat
        survey:endTime
        survey:forProperty
        survey:hasBaseURI
        survey:hasIdentifier
        survey:hasMax
        survey:hasMin
        survey:hasName
        survey:hasProperty
        survey:hasPurpose
        survey:hasQuestionText
        survey:hasRepeatPartCount
        survey:hasResponseString
        survey:hasSequence
        survey:hasSpecification
        survey:hasSubject
        survey:hasValue
        survey:hasVerb
        survey:maxRepeat
        survey:referenceNumber
        survey:responseData
        survey:startTime
        survey:targetObject
        survey:targetProperty
        survey:targetValue

Figure A.14: Integrated Survey Ontology Object Relations

owl:topObjectProperty
    dco:has data entity
        survey:hasFocus
        survey:hasSubjectProperty
        survey:hasTopic
    dco:has measure
    dco:has measurement datum
    dco:has measurement unit
    dco:has time stamp
    dco:has object control
    dco:branches to
    dco:has part
    dco:contains process
    dco:has subject type
    survey:SurveyParts
        survey:contingentOn
        survey:contingentPart
        survey:hasAgent
        survey:hasDatitive
        survey:hasDemographic
        survey:hasExitResponse
        survey:hasFacitive
        survey:hasInstrument
        survey:hasProvider
        survey:hasQuestion
        survey:hasQuestionComponent
        survey:hasQuestionType
        survey:hasRespondent
        survey:hasResponse
        survey:hasStatement
        survey:hasSurvey
        survey:hasSymptom
        survey:hasTerminationResponse
        survey:hasTreatment
        survey:hasVerb
    dco:has quality
    dco:location of
    dco:realizes
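The same integration pattern applies to relations: the original survey relations are retained but re-rooted beneath DCO relations, which in OWL is an rdfs:subPropertyOf assertion rather than a class axiom. A minimal sketch of one edge from Figure A.14 follows; the namespace IRIs are placeholders, and the underscore form dco:has_data_entity is only an illustrative rendering of the figure's label "has data entity", not a confirmed IRI.

    @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:    <http://www.w3.org/2002/07/owl#> .
    @prefix dco:    <http://example.org/dco#> .     # placeholder IRI
    @prefix survey: <http://example.org/survey#> .  # placeholder IRI

    # survey relation re-rooted under the generic DCO data-entity relation
    survey:hasTopic a owl:ObjectProperty ;
        rdfs:subPropertyOf dco:has_data_entity .    # labelled "has data entity" in Figure A.14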

A.4 DCO Survey Ontology Class Structure

Figure A.15: DCO Survey Ontology Base

owl:Thing
    bfo:entity
        dco:classifier
        bfo:continuant
        bfo:occurrent

Figure A.16: DCO Survey Ontology Classifier Class Structure

dco:classifier
    survey:Response
        survey:CloseEndedResponse
            survey:ConfidenceResponse
            survey:ExcuseResponse
            survey:FrequencyResponse
            survey:SatisfactionResponse
            survey:YearReponse
            survey:YesNoReponse
        survey:NonContingentResponse
            survey:CommentReponse
        survey:OpenEndedResponse
            survey:NumberResponse
            survey:TextResponse
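Since Figure A.15 places dco:classifier alongside bfo:continuant and bfo:occurrent, the response hierarchy of Figure A.16 reduces to ordinary subclass axioms beneath dco:classifier. A minimal sketch under the same placeholder-namespace assumption, with class names copied as the figure prints them (including survey:YesNoReponse):

    @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:    <http://www.w3.org/2002/07/owl#> .
    @prefix dco:    <http://example.org/dco#> .     # placeholder IRI
    @prefix survey: <http://example.org/survey#> .  # placeholder IRI

    # response types modelled as classifiers rather than continuants or occurrents
    survey:Response           a owl:Class ; rdfs:subClassOf dco:classifier .
    survey:CloseEndedResponse a owl:Class ; rdfs:subClassOf survey:Response .
    survey:YesNoReponse       a owl:Class ; rdfs:subClassOf survey:CloseEndedResponse .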

Figure A.17: DCO Survey Ontology Continuant Class Structure

bfo:continuant
    bfo:generically dependent continuant
        dco:meta data
        dco:measurement datum
            dco:complex measurement datum
            dco:scalar measurement datum
                dco:length measurement datum
                dco:time measurement datum
        dco:measurement unit label
            dco:length unit
            dco:time unit
    bfo:independent continuant
        dco:subject
            survey:Question
            survey:QuestionAnswer
                survey:CloseEndedAnswers
                    survey:ConfidenceAnswers
                    survey:ExcuseAnswers
                    survey:FrequencyAnswers
                    survey:SatisfactionAnswers
                    survey:YearAnswers
                    survey:YesNoAnswers
                survey:NonContingentAnswer
                survey:OpenEndedAnswer
                    survey:NumberAnswer
                    survey:TextAnswer
            schema:Person
        bfo:immaterial entity
            bfo:continuant fiat boundary
                bfo:one-dimensional continuant fiat boundary
                bfo:two-dimensional continuant fiat boundary
                bfo:zero-dimensional continuant fiat boundary
            bfo:site
            bfo:spatial region
                bfo:one-dimensional spatial region
                bfo:two-dimensional spatial region
                bfo:three-dimensional spatial region
                bfo:zero-dimensional spatial region
        bfo:material entity
            bfo:fiat object part
            bfo:object
            bfo:object aggregate
    bfo:specifically dependent continuant
        bfo:quality
            dco:relational quality
            dco:datatype property
                dco:boundedness
                    dco:bounded
                    dco:numerically bounded
                    dco:unbounded
                dco:cardinality
                    dco:countable
                    dco:finite
                    dco:uncountable
                dco:equality
                    dco:equal
                    dco:inequal
                dco:exactness
                    dco:exact
                    dco:approximate
                dco:numericalness
                    dco:non numerical
                    dco:numerical
        bfo:realizable entity
            bfo:disposition
            bfo:function
            bfo:role
                dco:data role

Figure A.18: DCO Survey Ontology Occurrent Class Structure

bfo:occurrent
    bfo:process
        bfo:process profile
        bfo:history
        dco:independent process
        dco:dependent process
            dco:exclusive state driven process
                dco:basic state
                    survey:survey
                dco:probabalistic state
    bfo:process boundary
        dco:process part
    bfo:spatiotemporal region
    bfo:temporal region
        bfo:one-dimensional temporal region
        bfo:zero-dimensional temporal region

Figure A.19: Recreated Survey Ontology Data Relations

owl:topDataProperty
    dco:has expected property
    dco:has expected type
    dco:has control
    dco:can repeat
    dco:has sequence
    dco:has value
    dco:has coordinate value
    dco:has measurement value
    dco:has time value
    dco:has maximum
    dco:has minimum
    dco:has percentage
    survey:survey properties
        survey:has name
        survey:has purpose
        survey:has question

Figure A.20: Recreated Survey Ontology Object Relations

owl:topObjectProperty
    dco:has data entity
        survey:has survey data entity
    dco:has measure
    dco:has measurement datum
    dco:has measurement unit
    dco:has time stamp
    dco:has object control
    dco:branches to
    dco:has part
    dco:contains process
    dco:has subject type
    survey:has survey part
        survey:has answer
        survey:has question
        survey:has response
    survey:has quality
    survey:location of
    survey:realizes
