University of South Australia Phenotype Capture in Genetic Variant Databases Peng Chen 2011

Phenotype Capture in Genetic Variant Databases

Peng Chen

[email protected]

Master of Science

(Computer and Information Science)

School of Computer and Information Science

University of South Australia

Supervisor: Dr Jan Stanek

Research Fields: Health Informatics, Health Information System

November 2011


Table of Contents

Abstract

Background: Biomedical research information systems in laboratories mainly store genotype data from genetic testing and store very little phenotype (clinical) data. In contrast, electronic health record (EHR) systems in clinics and hospitals store plenty of clinical data about patients. It is a challenge in health informatics to integrate phenotype data from heterogeneous EHR systems and link them to genotype data from biomedical research for genotype-phenotype correlation discovery. The complexity of health data and the lack of standards for storing and integrating health data have become barriers to achieving ubiquitous health information computing and data linkage. In our research, we explored the characteristics of phenotype data stored in biomedical research settings. We also studied the suitability of using openEHR archetypes to capture phenotype data and eventually to integrate all health data from heterogeneous health information systems.

Methods: A criteria form was designed to review phenotype data in genetic variant databases along 8 dimensions (storage, terminology, coding standard, granularity, curation, multiple phenotypes, case level and database). We also reviewed openEHR archetypes to understand their features and capabilities. A proposed phenotype capture model guided our experiment on capturing phenotype data with openEHR archetypes.

Results: We reviewed 1224 genetic variant databases and 283 existing openEHR archetypes. We found that 98% of phenotype data were low-granularity data, recorded either as abbreviations or as separate terms. Less than 5% of phenotype data were coded in formal terminologies. All phenotype data were in free text. The multilingual translation mechanism and the term binding mechanism of the openEHR standard were able to resolve semantic interoperability between heterogeneous health information systems. The openEHR archetypes were successfully applied to capture the phenotype data. A conceptual patient-centric EHR data warehouse schema was proposed for data integration and data linkage.

Conclusion: openEHR is potentially a suitable standard for capturing phenotype data in genetic variant databases, and even for performing data integration and data linkage for health data, provided that a full set of mature openEHR archetypes is available. Knowledge governance is the key to creating a full set of mature openEHR archetypes which can map health data of heterogeneous EHR systems and achieve ubiquitous health computing. We need international cooperation on managing and enhancing archetypes, and we also need international agreement on choosing and enhancing terminologies and coding systems for resolving semantic conflicts.

Key words: Health informatics, health information system, openEHR, archetypes, semantic interoperability, data integration, data linkage

Acknowledgement Dr Jan Stanek is a knowledgeable researcher in health informatics and a program director at the University of South Australia. He supervised and mentored me during my study of this health informatics research topic. The phenotype capture review and the openEHR archetype review were guided by him. He explained the topic and its context to me and inspired me to find methods to answer the research question. He brought me into this research field in the area of health informatics. My appreciation always goes to Dr Stanek for giving me his precious time and valuable guidance.

List of figures

List of Tables

Chapter 1 Introduction

1.1 Motivation

Phenotype data have been stored in heterogeneous genetic variant databases (mutation databases) for individual research purposes (Mitropoulou et al. 2010). Different researchers may use different terms to record phenotype data. Likewise, different clinicians may use their own medical terms to record the same phenotype data (clinical data) of a patient. Genomics has been moving from genetic testing research to clinical practice, and it has been a challenge to apply phenotype data (e.g. diagnoses) from research settings to clinical settings (Plon et al. 2008; Louie et al. 2007; Schulze & McMahon 2004). Clinicians want to make use of phenotype data collected in biomedical research to support clinical diagnoses (Schulze & McMahon 2004). Therefore, we need consensus, in terms of terminologies and standards, on how phenotype data residing in different health settings are recorded (Plon et al. 2008; Schulze & McMahon 2004). However, a universal standard for recording and storing phenotype data has not yet been established, owing to the heterogeneous nature of phenotype data and the lack of a standardised nomenclature (terminology) (Mitropoulou et al. 2010; Schofield et al. 2010; Louie et al. 2007; Kuntzer et al. 2010; Soussi et al. 2006).

1.2 The problem

The Human Genome Variation Society (HGVS) has made progress on standardising the nomenclature of genetic variants, but standardised phenotype nomenclatures are still limited (Horaitis & Cotton 2004). Large volumes of molecular genetic data (genotype data) from laboratories have accumulated in databases over the years. These genotype data are becoming useful for clinical practice when correlated with phenotype data (Patrinos & Brookes 2005). The absence of a universal standard for phenotypes makes it difficult to integrate phenotype data from diverse database systems for genotype-phenotype association studies (Louie et al. 2007; Soussi et al. 2006; Martin-Sanchez et al. 2004).

In health informatics, electronic health record (EHR) technologies are currently used in clinical settings (Garde et al. 2007a; Garde et al. 2007b; Liu et al. 2010). Current EHR systems are developed using diverse standards and technologies to store complex clinical data (Eichelberg et al. 2005). The openEHR standard is one of the EHR standards; it uses the concept of archetypes and a two-level software engineering model to capture clinical content (Beale & Heard 2008). It aims at standardising EHR systems and achieving semantic interoperability.

There are a few reviews of how genotype data are stored in mutation databases (Claustres et al. 2002; Mitropoulou et al. 2010; George et al. 2008; Kuntzer et al. 2010), but we could not find research on how phenotype data are stored in mutation databases. We want to explore the characteristics of phenotype data in terms of storage, coding, terminologies, granularity, etc. Given the outcomes of our review of phenotype data, we examine the suitability of the openEHR standard for capturing the phenotype data of genetic variant databases, since openEHR archetypes have been applied to store clinical content in large health information systems (Chen et al. 2009; Spath & Grimson 2011).

1.3 Research question

Research keywords Health informatics, Health information system, openEHR, archetypes, data integration, genotype-phenotype association, data linkage

Research question Can an existing standard for electronic health records (openEHR) be used to capture and store clinical data/phenotype data?

Research steps To answer the research question, we first review the status quo of phenotype data, and then we explore the openEHR standard as a vehicle for capturing phenotype data.

Phenotype review: how are phenotype data stored in mutation databases? Explore the characteristics of phenotype data, in terms of storage, terminology, coding, granularity, etc.

• Study previous work, namely Claustres et al. (2002) and Mitropoulou et al. (2010), to learn what criteria they used and what their reviews focused on.

• Collect mutation databases: we draw our mutation database sources from the HGVS website (http://www.hgvs.org/dblist/glsdb.html), which has a current list of LSDBs (as of May 2011), and from Kuntzer et al. (2010).

• Design a criteria form.

• Use the criteria form to review the mutation databases, and enter the review results into an Excel spreadsheet. (If present, ‘Y’; if not present, ‘N’.)

• Analyse the results and draw conclusions on the characteristics of the phenotype data.

Phenotype capture: we test whether the openEHR standard is suitable for storing phenotype data by using openEHR archetypes to capture the phenotype data.

• Propose a workflow towards data integration and business intelligence.

• Propose a model for the phenotype capture experiment.

• Study openEHR archetypes and create a mapping of clinical concepts from phenotype data.

• Choose archetypes to capture phenotype data.

• Justify the result of our experiment and draw a conclusion.

Hypotheses Hypothesis one: Most of the phenotype data stored in genetic variant databases are not coded, not stored in fine details, and not stored in a consistent manner.

Hypothesis two: As an EHR standard, openEHR is suitable to record and store phenotype data.

1.4 Potential contribution

• Our research evaluates the suitability of openEHR as a standard for capturing phenotype data. To some extent, it is also a step towards data integration and business intelligence in health informatics.

• Our research can fill a gap in the current literature on phenotype reviews in genetic variant databases. Our phenotype capture experiment is one small step towards developing semantically interoperable health information systems.

Chapter 2 Literature review

2.1 Mutation database review

One important task in our research is to explore the status quo of how phenotype data are stored in mutation databases. First, we explore how other researchers have conducted their reviews of mutation databases. One early comprehensive mutation database review was done by Claustres et al. (2002). They reviewed the contents of 94 Locus-Specific Databases (LSDBs) using 80 content criteria (see Appendix A). They examined the contents from 9 aspects: general presentation of LSDBs; data collection and submission; information on disease, gene and protein; links to associated data and external sites; mutation database structure; mutation table content; allele nomenclature; search capabilities; and years of updates. They presented their results in bar charts. They reviewed only one LSDB for each website instead of all LSDBs on each website, as they believed all LSDBs on a website were documented in the same way.

Claustres et al. (2002) had noticed nomenclature standards for recording data, but they only examined genotype nomenclature (e.g. allele nomenclature) and used only two criteria for the nomenclature review. Another review, by Mitropoulou et al. (2010), was based on the work of Claustres et al. (2002). Mitropoulou et al. (2010) used a criteria form (see Appendix B) similar to the one used by Claustres et al. (2002) and repeated most of its criteria in order to make a comparison with the previous work. They improved on the previous work by reviewing more LSDBs and by looking into the granularity of phenotype data. They found that 8% of LSDBs had detailed phenotypic descriptions.

Claustres et al. (2002) and Mitropoulou et al. (2010) examined whether the databases had followed Human Genome Variation Society (HGVS) nomenclature standard for recording genotype data but they had not examined the nomenclature standard for phenotype data. Claustres et al. (2002) had one criterion for phenotype data which is to check whether databases store phenotypes or not while Mitropoulou et al. (2010) had two criteria to measure whether databases record summary or detailed phenotype data. Both works have no focus on reviewing phenotype data.

Some reviews done by other researchers are general and summarised reviews on genotypes and phenotypes from different aspects, like mutation types, numbers of mutations and currency of the contents, etc. (Kuntzer et al. 2010, Groth & Weiss 2006, George et al. 2008). Some reviews are related to multi-species databases while some are not related to human mutation databases (Groth & Weiss 2006). George et al. (2008) reviewed general mutation databases (central databases) and made comparisons between them in terms of their content, currency and completeness. They only focused on mutations but not on phenotype data. Soussi et al. (2006) reviewed mutation databases for one particular mutation p53 and they had no focus on phenotype data either. These researchers have not demonstrated a criteria form as a tool for their reviews and they reviewed selective mutation databases. Therefore, their reviews are not as systematic as those using a criteria form to review a wide range of current mutation databases (Claustres et al. 2002, Mitropoulou et al. 2010). However, these researchers have mentioned it is important to have a consistent and universal nomenclature for standardisation, data integration and classification.

2.2 The importance of universal phenotype nomenclature

Health informatics is now moving from genomics study to genotype-phenotype association study. Genotype-phenotype association study can support the clinical prediction of diseases based on the correlation between genotypes and phenotypes. To carry out genotype-phenotype association studies, we need to integrate phenotype data before linking them to their corresponding genotype data (genetic information). Phenotype data are very complex (Thorisson, Muilu & Brookes 2009), heterogeneous and diverse (Kahraman et al. 2005). Louie et al. (2007) pointed out that the purpose of integrating diverse large datasets was to create new knowledge. An international nomenclature is required to integrate phenotype data from different data sources. Schofield et al. (2010) and Kola et al. (2010) argued that different classification systems were using local terms to describe diseases.

Plon et al. (2008) reviewed several clinical classification systems which were developed to predict cancer susceptibility and then proposed their own classification system. All these classification systems, including the one they proposed, used different terminologies to interpret the meanings of genetic testing results, which are phenotypes. Having different classification systems for the same purpose confuses clinicians when they make diagnostic decisions. It is also difficult for clinicians to align their own clinical terms with the ones used in these classification systems in order to interpret laboratory results clearly and consistently for patients.

Although Plon et al. (2008) improved a classification system to minimise the risk of misinterpreting genetic testing results, a classification result still could not be exchanged with other clinicians who used different sets of terms to describe the same diagnosis. We need a standardised nomenclature to share phenotype data internationally (Groth & Weiss 2006) across mutation databases, laboratories, clinics and hospitals.

Once a standardised nomenclature for phenotype data is ready, we can use uniform terms to build a phenotype ontology and then build a phenotype data model which anchors on the ontology (Patrinos & Brookes 2005; Kola et al. 2010), and then use the model to integrate phenotype data from different data sources. We continue to map genetic variants to corresponding phenotypes for discovering genotype-phenotype correlations. We believe these ideas eventually will contribute to create a better decision support system for clinical diagnoses of diseases and cancers, and also will help provide better health care services.

Schofield et al. (2010) stated that existing term frameworks encompassed the Unified Medical Language System (UMLS), Medical Subject Headings (MeSH), the International Classification of Diseases (ICDx, e.g. ICD-9/10) and the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT). They argued that phenotype data were generally described in natural language and in free text, and were recorded either in local terms or in the existing term frameworks. Besides ICDx and SNOMED-CT, Beale & Heard (2006) mentioned another two terminologies: Logical Observation Identifiers Names and Codes (LOINC) and the International Classification of Primary Care (ICPC).

2.3 The openEHR standard

openEHR is an EHR standard whose architecture provides two-level modelling, which separates data from their semantic meanings. The first level is a reference model. The reference model defines data types, data structures and healthcare-specific components, which are to be consumed by archetypes. The second level is an archetype model. Archetypes are a set of classes which can model and represent medical concepts such as blood pressure. An archetype semantically defines constraints on the reference model, and data have to conform to the constraints defined in an archetype (Beale & Heard 2008; Eichelberg et al. 2005). An archetype can be specialised, that is, created based on another archetype (Eichelberg et al. 2005). A template is constructed from a group of archetypes and represents a medical report. The language used to express archetypes is the Archetype Definition Language (ADL) (Beale & Heard 2008). The openEHR reference model essentially contains constraints and rules, while the archetype model uses these constraints and rules to define particular clinical concepts.
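As a rough illustration of the two-level idea, the sketch below keeps a generic reference-model element separate from an archetype that constrains it. The class names (`Element`, `QuantityConstraint`, `Archetype`), the element names and the numeric ranges are hypothetical; they are not taken from the openEHR specifications or from any published archetype.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Reference-model level: a generic name/value/units holder."""
    name: str
    value: float
    units: str


@dataclass
class QuantityConstraint:
    """Archetype level: the allowed units and value range for one element."""
    units: str
    minimum: float
    maximum: float

    def accepts(self, element: Element) -> bool:
        return (element.units == self.units
                and self.minimum <= element.value <= self.maximum)


@dataclass
class Archetype:
    """A clinical concept expressed as constraints on generic elements."""
    concept: str
    constraints: dict = field(default_factory=dict)

    def validate(self, elements):
        """Return a list of problems; an empty list means the data conform."""
        problems = []
        seen = set()
        for element in elements:
            constraint = self.constraints.get(element.name)
            if constraint is None:
                problems.append(f"unexpected element: {element.name}")
            elif not constraint.accepts(element):
                problems.append(f"out of constraint: {element.name}")
            seen.add(element.name)
        for name in self.constraints:
            if name not in seen:
                problems.append(f"missing element: {name}")
        return problems


# Hypothetical blood pressure archetype; the ranges are illustrative only.
blood_pressure = Archetype(
    concept="blood pressure",
    constraints={
        "systolic": QuantityConstraint("mm[Hg]", 0, 1000),
        "diastolic": QuantityConstraint("mm[Hg]", 0, 1000),
    },
)

reading = [Element("systolic", 120, "mm[Hg]"), Element("diastolic", 80, "mm[Hg]")]
bad_reading = [Element("systolic", 120, "kPa")]
print(blood_pressure.validate(reading))
print(blood_pressure.validate(bad_reading))
```

The point of the separation is that a new clinical concept only requires a new `Archetype` instance; the generic `Element` class, which stands in for the reference model, stays unchanged.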

The openEHR archetypes are independent of the data and the data model in an EHR system (Eichelberg et al. 2005). We can customise an archetype for a particular clinical concept without changing the underlying relational tables (see Figure 1). The openEHR framework already provides a library of archetypes that express generic clinical concepts, which saves the effort of modelling these clinical concepts from scratch (Garde et al. 2007b). If an existing archetype is not able to capture a clinical concept, we need to modify the original archetype or create a new archetype to suit that specific concept.

Figure 1: An Example Association between Archetypes and the Information System Data Structure (Eichelberg et al. 2005)

Garde et al. (2007b) successfully transformed a German paediatric oncology clinical dataset with 260 clinical items into 46 openEHR archetypes. It appears to us that openEHR archetypes can reduce the size of a dataset and eliminate data redundancy. It also appears to us that openEHR is feasible for capturing clinical data, and potentially for capturing the phenotype data of mutation databases. Their hypothesis was ‘expressing clinical data sets as openEHR clinical content models (archetypes) is feasible and useful’, and as a result they presented an approach for transforming a clinical dataset into archetypes. This gives us ideas of how to use archetypes to capture phenotype data. During the transformation process, they identified issues with the clinical data and addressed them by using openEHR archetypes. They also pointed out two limitations of openEHR with respect to achieving ubiquitous computing: insufficient support for hierarchical archetypes and challenges in binding user-nominated terminologies. However, they believed these limitations could be solved by improving openEHR.

Our hypothesis is similar to the one proposed by Garde et al. (2007b) with regard to the feasibility of using openEHR to capture clinical data. Garde et al. (2007b) tested their hypothesis on well-prepared and selected clinical datasets which were based on the German Basic Dataset for Paediatric Oncology. Our data, in contrast, come from various genetic variant databases.

2.4 Ontology

Robinson & Mundlos (2010) argued that the development of human phenotype ontology was an effort to provide standardised vocabulary to describe phenotypes unambiguously. The use of ontology enables computational analysis on semantic similarity between phenotype abnormalities and on genotype-phenotype association. We can develop data mining algorithms based on the ontology and apply them to create new knowledge from the vast amount of phenotype data.

Schofield et al. (2010) argued that a big challenge in genome study was coding phenotype data. The openEHR standard has a coding system to define clinical concepts (Beale & Heard 2008). Therefore, openEHR can potentially be used to address this challenge. We can use openEHR to code the entities within the ontology and then map other coding systems to openEHR codes. In this way, it is possible to reconcile semantic conflicts between different proprietary systems.
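The idea of mapping other coding systems to a shared set of codes can be sketched as a lookup that uses the shared code as a pivot. Everything in the sketch below is an assumption made for illustration: the system names, the local codes and the `at...` codes (which only mimic the look of archetype-internal codes) are hypothetical and do not come from any real archetype or terminology.

```python
# Hypothetical shared codes bound to hypothetical codes in two local systems.
TERM_BINDINGS = {
    "at0001": {"lab_system": "DX-17", "clinic_system": "C0042"},
    "at0002": {"lab_system": "DX-23", "clinic_system": "C0108"},
}


def build_reverse_index(bindings):
    """Map (system, local code) pairs back to the shared code."""
    reverse = {}
    for shared_code, per_system in bindings.items():
        for system, local_code in per_system.items():
            reverse[(system, local_code)] = shared_code
    return reverse


def translate(code, source, target, bindings=TERM_BINDINGS):
    """Translate a local code from one system to another via the shared code.

    Returns None when either side has no binding, so that unmapped terms
    are surfaced instead of being silently guessed.
    """
    shared_code = build_reverse_index(bindings).get((source, code))
    if shared_code is None:
        return None
    return bindings[shared_code].get(target)


print(translate("DX-17", "lab_system", "clinic_system"))
print(translate("DX-99", "lab_system", "clinic_system"))
```

With a pivot, each system needs only one mapping to the shared codes rather than a separate mapping to every other system.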

Figure 2 and Figure 3 are examples of human phenotype ontology:

Figure 2: ‘Diagnostic-investigation’ and ‘Specialty-care-visits’ mapped into the Reference Ontology (Eccher et al. 2006)

Figure 3: The human phenotype ontology (Robinson & Mundlos 2010)

As these figures show, an ontology is a tree-like structure. In terms of programming, we can use a tree data structure to store an ontology. We use an ontology to capture phenotype data because it provides explicit paths along which the relations between different phenotypes can be searched computationally.
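A minimal sketch of such a tree is given here, assuming that every term has exactly one parent; an ontology in which a term may have several parents would need a graph rather than a tree. The class name, the function names and the term labels are hypothetical and are not actual ontology entries.

```python
class OntologyTerm:
    """One node of a tree-shaped ontology (single parent per term)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path_to_root(self):
        """Return the terms from this node up to the root, inclusive."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path


def common_ancestor(a, b):
    """Return the deepest term shared by the root paths of a and b."""
    ancestors_of_a = {id(term) for term in a.path_to_root()}
    for term in b.path_to_root():
        if id(term) in ancestors_of_a:
            return term
    return None


# Hypothetical labels used only to show the structure.
root = OntologyTerm("phenotypic abnormality")
heart = OntologyTerm("abnormality of the heart", root)
rhythm = OntologyTerm("abnormal heart rhythm", heart)
valve = OntologyTerm("abnormal heart valve", heart)
eye = OntologyTerm("abnormality of the eye", root)

print([t.label for t in rhythm.path_to_root()])
print(common_ancestor(rhythm, valve).label)
print(common_ancestor(rhythm, eye).label)
```

The depth of the common ancestor gives a crude indication of how closely two phenotypes are related: the two heart terms meet at the heart node, while a heart term and an eye term meet only at the root.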

2.5 The openEHR archetype conversion of EHR data model

There have been experiments by other researchers on converting regional or proprietary system models into openEHR archetypes. Spath and Grimson (2011) converted a database schema of a biobank information management system (BIMS) into openEHR archetypes. Chen et al. (2009) converted a regional EHR system in Sweden into openEHR archetypes. We look at these two experiments in more detail in terms of conversion methods, results and chosen archetypes.

In general, most of the data stored in biomedical information systems are gene-related biomedical data from lab testing results. These systems store very little clinical data or phenotype description. In contrast, most of the data stored in EHR systems are patient health records which are rich clinical observation data, such as medical history, lifestyle data or family history (Chen et al. 2009). These EHR systems store very little biomedical data. Genotype data and phenotype data are stored separately in different data models. In order to establish correlations between molecular genetic data and clinical data we need a way to combine these two kinds of data. Eventually medical intelligence can be discovered from the correlations (Spath & Grimson 2011).

Spath and Grimson (2011)

Spath and Grimson (2011) mapped the columns of tables in a Biobank Information Management System (BIMS) to corresponding archetype entry items based on the meaning and the context of each column. They first analysed what kinds of data were stored in the database schema and what data types were used to store the data. Then they grouped the columns into distinct concepts, with each archetype representing one distinct concept. While grouping the columns, they merged columns that were repeated across tables into a single concept. As a result, the number of columns to be captured in archetypes became smaller. The next step was to use archetypes to capture and represent these concepts. They first searched for existing archetypes that could capture these concepts without any change. If the original archetypes were not suitable, they modified them or created a new one.
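The grouping step can be pictured with a small sketch. It is not the procedure of Spath and Grimson (2011), who assigned columns to concepts manually; the table names, column names and concept names below are hypothetical and serve only to show why repeated columns shrink the number of items to capture.

```python
from collections import defaultdict

# Hypothetical (table, column) -> concept assignments.
COLUMN_TO_CONCEPT = {
    ("Biopsy", "patient_id"): "patient identification",
    ("Biopsy", "biopsy_date"): "specimen collection",
    ("Biopsy", "site"): "specimen collection",
    ("Sample", "patient_id"): "patient identification",
    ("Sample", "storage_temp"): "specimen storage",
}


def group_by_concept(column_to_concept):
    """Collect the columns that were assigned to each concept."""
    groups = defaultdict(list)
    for column, concept in column_to_concept.items():
        groups[concept].append(column)
    return dict(groups)


def distinct_items(column_to_concept):
    """Count (concept, column name) pairs, ignoring the table name.

    A column repeated in several tables under the same concept counts once.
    """
    return len({(concept, name)
                for (_table, name), concept in column_to_concept.items()})


groups = group_by_concept(COLUMN_TO_CONCEPT)
print(len(COLUMN_TO_CONCEPT), "columns ->", distinct_items(COLUMN_TO_CONCEPT), "items")
for concept, columns in groups.items():
    print(concept, columns)
```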

Figures 4, 5, 6 and 7 (Spath & Grimson 2011) below illustrate a conceptual database schema of the BIMS. Figure 4 is the overview of the conceptual schema. Figures 5, 6 and 7 present the details of three separate parts of the schema.

Figure 4: the overview of BIMS conceptual schema (Spath & Grimson 2011)

Figure 5: Part 1 of BIMS conceptual schema (Spath & Grimson 2011)

Figure 6: Part 2 of BIMS conceptual schema (Spath & Grimson 2011)

Figure 7: Part 3 of BIMS conceptual model (Spath & Grimson 2011)

Figure 8 shows the concepts which have been identified and grouped from columns in the conceptual model. This diagram can be viewed as the ontology of the system which can be represented by archetypes. Figure 9 shows how the columns in the Biopsy table have been mapped to concepts.

Figure 8: BIMS fields regrouped into non-overlapping concepts (Spath & Grimson 2011)

Figure 9: Detail of how the fields in ‘Biopsy’ table were mapped to the concepts (Spath & Grimson 2011)

Spath and Grimson (2011) proposed 43 archetypes to cover all columns and concepts in a biobank database, and 63% of the chosen archetypes were used without changes. They created five new archetypes and modified some archetypes through the openEHR slot mechanism. Their experiment showed that it is feasible to use openEHR archetypes to capture data in a BIMS. Their experiment also exposed some issues in using openEHR archetypes to convert a biobank database. First, the openEHR archetype approach is still immature: there is a lack of modelling guidelines or best practice for applying the reference model. Second, without domain knowledge governance, overlapping archetypes will be created for the same concept, which merely shifts the semantic interoperability problem to the archetype level. Third, data produced in a clinical context and in a research context have different granularity levels, and the data models in a BIMS and openEHR archetypes are not always at the same granularity level. These differences cause loss of information at the archetype level, since archetypes are designed to capture clinical data. Fourth, some deduced information from a BIMS cannot be mapped to an archetype explicitly. Fifth, there are no semantic mappings between the terms used in the BIMS and the ones used in archetypes, and there is a strong need to bind external terminologies to resolve the meanings of terms. Sixth, most data in a BIMS are stored as unstructured free text, while archetypes provide structures to store data. Therefore, it is difficult to distribute these unstructured free-text data into the appropriate entry items in archetypes.

To summarise the mapping issues in Spath and Grimson (2011), there are two major challenges in converting a BIMS to openEHR archetypes. First, openEHR archetypes conform to the recording requirements of clinical practice, which are not fully compatible with the recording requirements of a BIMS. Second, current EHR and BIMS databases do not follow any recording standards and are built in a proprietary fashion. It is difficult to apply immature archetypes to integrate heterogeneous information systems without a common recording standard, and solutions are needed for semantic conflicts at the data representation level and the data value level. Spath and Grimson (2011) argued that their research was the first attempt to apply an archetype approach to represent the information in BIMS databases in a biomedical research context. For combining clinical information with biomedical research information to find data linkages between them, archetypes are not yet mature enough, as they so far cover only clinical data and not medical research information. The maturity and quality of openEHR archetypes need to improve with regard to capturing both clinical and biomedical information and solving semantic mapping issues.

Chen et al. (2009)

The difference between the research of Chen et al. (2009) and that of Spath and Grimson (2011) is that Chen et al. applied openEHR archetypes to convert a regional EHR system (COSMIC) instead of a BIMS. As EHR systems store clinical records and openEHR archetypes have been designed to capture clinical records, this conversion should be easier than the conversion between archetypes and a BIMS in Spath and Grimson (2011). The results of Chen et al. (2009) showed that openEHR archetypes preserved nearly all structural and semantic definitions of the original EHR model. Another difference between the two studies is that Chen et al. (2009) achieved bi-directional conversion between the regional EHR templates and openEHR archetypes: they automatically converted 15 selected archetypes into EHR templates and 86 EHR templates into archetypes. We can learn from the method and the results of Chen et al. (2009) with regard to converting regional EHR templates to openEHR archetypes.

Chen et al. (2009) first analysed a regional EHR template model and compared it to the openEHR reference model and archetype model. Then they created a semantic mapping between the EHR templates and openEHR with a set of conversion validation rules. Structural constraints, data value constraints and terminology binding between the regional EHR template model and the archetype model were the three major considerations during the semantic mapping process. Structural constraints match the structures of EHR templates and openEHR, which organise data entries according to clinical requirements. Data value constraints make data values conform to specific data types in the reference model in order to achieve data quality. Terminology binding supports data sharing and reuse between systems.

In terms of the mapping issues encountered in the research of Chen et al., most of the existing EHR templates had a larger clinical scope than archetypes, because the scope of archetypes was meant to be smaller to encourage reuse. In terms of mapping structural constraints, they looked through the openEHR reference model and archetype model and mapped the templates to the openEHR RM EVALUATION class. They identified 15 archetypes (Table 1) to be converted into COSMIC EHR template formats.

Table 1: Selected Archetypes for Conversion

In terms of mappings of data value constraints, Chen et al. (2009) found openEHR reference model and archetype model were more extensive and more detailed than the tool types (data types) and data validation rules in the EHR template model. Table 2 displays the mappings of data types between openEHR data types and the COSMIC EHR data types. This mapping can give us ideas of how to map openEHR data types to the data types of other information systems.

Table 2: Mapping of the Data Types

The constraint definition in archetypes can define a term at the point of data entry. The C_CODED_PHRASE constraint in openEHR can specify an external terminology. By combining several C_CODED_PHRASE constraints, openEHR can specify multiple external terminologies. The term binding mechanism in openEHR is sufficient to express the semantics in COSMIC EHR templates by binding the entry items of archetypes to external terminologies.

Chapter 3 Methodology

3.1 Phenotype review methodology

Our review focuses only on analysing human phenotype data. It is inspired by the two reviews by Claustres et al. (2002) and Mitropoulou et al. (2010). We use a similar method, a criteria form, to perform our review. We have collected a list of mutation databases from the HGVS website (http://www.hgvs.org/dblist/glsdb.html) and from Kuntzer et al. (2010). Kuntzer et al. (2010) classified the mutation databases into two categories: cancer variation databases and disease variation databases. Some of their links duplicate those on the HGVS website. We eliminate these duplicates and merge the two sources into one list.
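The merging step can be sketched as follows. This is only an illustration of one way to detect duplicate links, not a record of how the list was actually compiled; the comparison rule in `normalise` is an assumption, and the addresses are placeholders rather than real mutation databases.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to a comparison key (host plus path).

    Assumption: the scheme, the letter case of the host, a leading "www."
    and a trailing slash do not distinguish two database entries.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_lists(*lists):
    """Merge several URL lists, keeping the first occurrence of each entry."""
    seen, merged = set(), []
    for urls in lists:
        for url in urls:
            key = normalise(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


# Placeholder addresses standing in for the two sources.
hgvs_list = ["http://www.example.org/lsdb/geneA", "http://example.org/lsdb/geneB/"]
literature_list = ["http://example.org/lsdb/geneA/", "http://example.net/cancer/geneC"]

print(merge_lists(hgvs_list, literature_list))
```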

We have 22 content criteria to examine phenotype data of mutation databases. We want to review phenotypes from 8 dimensions. These dimensions include storage, terminology, coding, granularity, curation, multiple phenotypes, case level and database. Each dimension has its own separate criteria to measure and evaluate phenotype data.

We record the absence or presence of 17 content criteria. If a criterion is present, it is marked as ‘y’ (‘yes, it is present’). If a criterion is absent, it is marked as ‘n’ (‘no, it is not present’). After finishing the review, we convert ‘y’ to ‘1’ and ‘n’ to ‘0’ for statistical analysis. The remaining five content criteria are ‘recognised terminology’, ‘recognised coding standard’, ‘granularity level’, ‘database family’ and ‘platform’. The ‘recognised terminology’ and ‘recognised coding standard’ criteria record the terminology and coding standard respectively. The ‘granularity level’ criterion records the phenotype granularity.
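The encoding of marks and the calculation of a presence rate can be sketched as follows. The record layout and criterion keys are hypothetical and chosen only for illustration; they are not the actual review spreadsheet.

```python
from typing import Dict, List

# Hypothetical review records: one dict per database, keyed by criterion name.
reviews: List[Dict[str, str]] = [
    {"collect_phenotypes": "y", "formal_terminology": "n", "curated": "y"},
    {"collect_phenotypes": "y", "formal_terminology": "y", "curated": "n"},
    {"collect_phenotypes": "n", "formal_terminology": "n", "curated": "n"},
]


def encode(mark: str) -> int:
    """Map the review mark 'y' to 1 and 'n' to 0."""
    if mark not in ("y", "n"):
        raise ValueError(f"unexpected mark: {mark!r}")
    return 1 if mark == "y" else 0


def presence_rate(records: List[Dict[str, str]], criterion: str) -> float:
    """Percentage of records in which the criterion is present."""
    values = [encode(record[criterion]) for record in records]
    return 100.0 * sum(values) / len(values)


rate = presence_rate(reviews, "collect_phenotypes")
```

Once the marks are encoded as 0 and 1, the rate of a criterion is simply the mean of its column multiplied by 100.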

These criteria were selected based on our interest in standardising phenotype data. They allow an objective evaluation of how phenotype data are currently stored in mutation databases. The criteria are presented in Table 3.

Table 3: Criteria form for phenotype review in mutation databases

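| Dimension | Criterion |
|---|---|
| 1. Storage | Collect phenotypes |
| 1. Storage | Internal storage |
| 1. Storage | Proprietary external storage |
| 1. Storage | Foreign external storage |
| 2. Terminology | Formal terminology |
| 2. Terminology | Proprietary terms (mapped to a recognised terminology) |
| 2. Terminology | External terminology used directly |
| 2. Terminology | Recognised terminology |
| 3. Coding standard | Formal coding standard |
| 3. Coding standard | Proprietary codes (mapped to a recognised coding standard) |
| 3. Coding standard | External coding standard used directly |
| 3. Coding standard | Recognised coding standard |
| 4. Granularity | Overall granularity level |
| 4. Granularity | Partial fine-grained phenotypes |
| 5. Curation | Curated |
| 6. Multiple phenotypes | Single phenotype |
| 6. Multiple phenotypes | Multiple phenotypes |
| 7. Case level | Variant-level phenotypes |
| 7. Case level | Case-level phenotypes |
| 8. Database | Database family |
| 8. Database | Platform |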

1. Storage

Collect phenotypes: we first evaluate whether a mutation database collects phenotypes. If it does not, the subsequent criteria in the form are not evaluated.

Internal storage: evaluates whether phenotype data are stored within the mutation database system itself.

Proprietary external storage: evaluates whether phenotype data are stored in a separate proprietary external system rather than within the mutation database system (‘separate’ means the phenotype data do not reside in the same system as the genotype data).

Foreign external storage: evaluates whether phenotype data are stored in a separate foreign external storage.

2. Terminology

Formal terminology: evaluates whether a recognised formal terminology is used to define phenotype data. A database may use proprietary terms that are mapped to a recognised terminology, or it may use an external terminology directly. If phenotype data are not described with any formal terminology, the three subsequent criteria are not evaluated.

3. Coding standard

Formal coding standard: evaluates whether a recognised coding standard is used to code phenotype data. A code carries no meaning of its own; it is only a numeric identifier for a piece of data. A database may use proprietary codes that are mapped to a recognised coding standard, or it may use an external coding standard directly. If phenotype data are not coded with any formal coding standard, the three subsequent criteria are not evaluated.

4. Granularity

Overall granularity level: evaluates the overall level of detail of phenotype data. We record a granularity level as either ‘low’ or ‘high’. If most phenotype data are described by disease names or their abbreviations, such as ‘MRX’ (Mental Retardation), or by a simple description of a disease such as ‘Early-onset severe retinal dystrophy’, we consider the overall granularity level to be low. If phenotype data describe the symptoms of a disease and also include some quantitative or qualitative descriptive data and other patient-related information, we consider the overall granularity level to be high. In our research, we can only judge the granularity level qualitatively.

Here is an example of fine-grained phenotype data (from http://bioinf.uta.fi/C1QAbase/index.php):

“Symptoms Bacterial infections

Symptoms Pneumonia

Symptoms Otitis

Symptoms Kidney abnormalities

Symptoms Severe renal malfunction

Symptoms Other clinical features: At the age of 5.5 years patient

Symptoms presented with erythematous and crusted, desquamative skin

Symptoms lesions on his face and extremities. Patient had also oral

Symptoms aphtous lesions. Patient developed glomerulonephritis

Symptoms three years later and after 9 months he died of renal

Symptoms failure.”

Partial fine-grained phenotypes: some mutation databases have an overall low granularity of phenotype data but also store some high-granularity phenotypes. We use this criterion to mark such databases and set its value to ‘y’. For mutation databases that already have an overall high granularity of phenotype data, we set this criterion to ‘n’.

5. Curation

Curated: we consider phenotype data to be curated if a mutation database has a curator. If the curator position is vacant, we consider the database not curated.

6. Multiple phenotypes

Single phenotype: evaluates whether a mutation database collects only one kind of phenotype. For example, all genetic variants relate to one disease, such as mental retardation.

Multiple phenotypes: evaluates whether a mutation database collects more than one phenotype.

7. Case level

Phenotype data can be collected on a patient case basis, that is, a mutation database collects phenotype data for one particular variant for each individual patient. We call these case-level phenotypes.

Alternatively, phenotype data can be generalised from all individual cases into a summarised phenotype description for one particular variant. We call these variant-level phenotypes.

8. Database

Database family: we want to find out whether specialised database software is used to collect phenotype data.

Platform: we want to know where phenotypes are stored: in a web page table, in free text only, or in a relational database such as MySQL.

Except for overall granularity level, as long as a criterion is present in a data entry, we mark the criterion as ‘y’. For overall granularity level, we evaluate the level based on the proportion of each granularity level in a mutation database. For example, if 70% of the data entries in a mutation database record low-granularity phenotype data and 30% record high-granularity phenotype data, then we say the granularity level of the mutation database is low, and vice versa.
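The proportion rule can be written as a small function. This is a sketch under stated assumptions: the review judged granularity qualitatively, so the entry counts here are hypothetical, and the 50% threshold and the handling of an exact tie are choices made for illustration.

```python
def overall_granularity(low_entries: int, high_entries: int) -> str:
    """Return 'low' when low-granularity entries form the majority, else 'high'.

    Assumption: an exact 50/50 split is reported as 'high'; the text does not
    define this case.
    """
    total = low_entries + high_entries
    if total == 0:
        raise ValueError("no phenotype entries to assess")
    return "low" if low_entries / total > 0.5 else "high"


# The 70% / 30% example from the text, expressed with hypothetical counts.
example = overall_granularity(low_entries=70, high_entries=30)
```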

3.2 openEHR phenotype capture methodology

We propose a workflow (Figure 10) to show the essential steps towards data integration and business intelligence. Then we propose a model (Figure 11) to show how to capture phenotype data by using openEHR. We want to use the workflow and the model to explain the methodology and the ideas of our phenotype capture experiment. The ideas of both the data integration workflow and the phenotype capture model are inspired by Kola et al. (2010) and Beale & Heard (2008).

Data integration workflow:

Semantically interchangeable data are essential for integrating data from various mutation databases. Because different terminologies are used to suit different research purposes, the first step of data integration is terminology unification (Kola et al. 2010). We believe nomenclature standardisation is an important task in the process of data integration for mutation databases.

Ontology is at the centre of the proposed data integration workflow and the ontology connects standardised nomenclature with subsequent tasks (Kola et al. 2010), i.e. data modelling and data mining. We use standardised nomenclature to uniquely describe each entity in the ontology. In our research, ontology presents phenotype entities and the relationships between them. We use ontology as a specification to create proper data models and data mining algorithms. Data models are used to collect and integrate data while data mining techniques are applied to discover knowledge.

Our proposed workflow presents how to reconcile, store and integrate heterogeneous data from genetic variant databases. The final stage of the workflow is business intelligence: data mining techniques can be applied to the integrated data to discover knowledge. In health informatics, this helps to uncover genotype-phenotype associations.

This workflow is a proposal for achieving data interoperability. Structural heterogeneity and semantic heterogeneity of data are the obstacles to data interoperability at the information level (Wache et al. 2001). Standardised nomenclature can address the semantic heterogeneity of data, while a universal data model for storing data can address structural heterogeneity.

A data warehouse approach to data integration requires that rapid changes to the data schema be limited (Louie et al. 2007). Standardised nomenclatures and data schemas therefore prepare the ground for a data warehouse approach. A data warehouse can serve as a comprehensive, central genotype-phenotype data centre that makes data available for genotype-phenotype correlation analysis (Robinson & Mundlos 2010).

The workflow only provides a guideline for realising business intelligence in health informatics. In this research, our objective is not to verify the whole workflow but only a part of it, by examining the suitability of openEHR as a standard for capturing phenotype data.

openEHR phenotype capture model:

Figure 10: Data integration workflow

Figure 11: openEHR phenotype capture model

Archetypes present clinical concepts without any modification of underlying data model and its data (Eichelberg et al. 2005) when both archetypes and the underlying data models are developed based on the same ontology as described in Figure 10. Since openEHR has been applied to develop healthcare systems (Chen et al. 2009, Garde 2007b) and openEHR archetypes can capture clinical contents (Beale & Heard 2008), we believe openEHR archetypes can also be applied to capture phenotype data of genetic variant databases.

We analyse our phenotype review result and understand the characteristics of phenotype data. Then, we map phenotype data to a phenotype ontology. We want to see whether openEHR archetypes can express the ontology and capture the phenotype concepts within the ontology. Meanwhile, we establish a phenotype data model based on the ontology.

The openEHR two-level model separates the invariable data structure (reference model) from variable clinical domain knowledge (archetypes and templates) (Roman, Calvillo & Roa 2009). Archetypes structure specific domain content by building on a reference model. In single-level modelling, by contrast, domain knowledge is built into the data model structure, so the two-level approach adapts more readily to change. Terminology on its own cannot express clinical domain concepts, only facts about the real world; it needs the semantic structures provided by archetypes and templates to model meaningful clinical concepts (Beale & Heard 2008). Two-level modelling also changes the traditional software development process. The reference model stores the data, while the way the data are assembled to represent clinical concepts is left to users and clinicians: it is the clinician’s job to give semantic meaning to the data by building archetypes with terminology (Beale & Heard 2008).
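The separation of the two levels can be illustrated with a toy sketch. The class names `Quantity` and `QuantityConstraint` are hypothetical and are not openEHR classes, and the units and range are illustrative values only. The point is that the generic structure stays fixed while the domain knowledge lives in a constraint object that can be changed without touching it.

```python
from dataclasses import dataclass


@dataclass
class Quantity:
    """Reference-model level: a generic, stable data structure."""
    magnitude: float
    units: str


@dataclass
class QuantityConstraint:
    """Archetype level: domain knowledge expressed as constraints on Quantity."""
    code: str        # local concept code, e.g. an 'at' code
    units: str       # the only units accepted for this concept
    minimum: float
    maximum: float

    def accepts(self, value: Quantity) -> bool:
        """Check a generic Quantity against this domain-specific constraint."""
        return (value.units == self.units
                and self.minimum <= value.magnitude <= self.maximum)


# Illustrative constraint for a systolic pressure reading (values assumed).
systolic = QuantityConstraint(code="at0004", units="mm[Hg]",
                              minimum=0.0, maximum=1000.0)
reading_ok = systolic.accepts(Quantity(magnitude=120.0, units="mm[Hg]"))
```

Adding a new clinical concept in this sketch means defining another constraint object, not another storage class, which is the adaptability argument made above.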

3.3 Expected outcomes

The expected outcomes of the phenotype review are:

• most phenotype data are not coded,

• most phenotype data are not described in any formal terminology,

• most phenotype data are stored in coarse detail.

The expected outcome of phenotype capture is:

• openEHR is a viable and promising tool to model and capture phenotype data.

Chapter 4 Phenotype review

4.1 Review result

We collected 1184 database links from the HGVS website (http://www.hgvs.org/dblist/glsdb.html), and 48 cancer variation databases and 166 disease variation databases from Kuntzer et al. (2010). After excluding overlapping database links, we arrived at 1224 links for our review. Of these, 978 databases collect phenotype data, and all of them store the phenotype data internally. We interpret the phenotype review results against our criteria based on these 978 genetic variant databases.

We summarised review results and examined 22 criteria. We calculated the rate of each criterion for comparison in order to find out dominant features for each dimension. Statistics of our review results are displayed in a table format. Based on these statistics, we learned characteristics of phenotype data in 8 dimensions respectively, which include storage, terminology, coding standard, granularity, curation, multiple phenotypes, case level and database.

Storage

Our statistics indicate that 80% of these genetic variant databases collected phenotype data (Table 4). With our expertise, we assessed all accessible databases and we could not discover any external storage for phenotype data. As a result, all of these databases store their phenotype data internally.

Table 4: ‘Storage’ review result

| Criteria | Number | Pct (%) |
|---|---|---|
| Collect phenotypes | 978 | 80 |
| Internal storage | 978 | 100 |
| Proprietary external storage | 0 | 0 |
| Foreign external storage | 0 | 0 |

* Only the percentage of the criterion ‘collect phenotypes’ is based on the total number of genetic variant databases (1224). The percentages of all other criteria in the criteria form are based on the number of databases which collect phenotypes (978), except for the ‘partial fine-grained phenotypes’ criterion.
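The different denominators can be made explicit with a short calculation. The counts are those reported in this chapter; the helper name `pct` is introduced here purely for illustration.

```python
TOTAL_DATABASES = 1224      # all reviewed database links
COLLECT_PHENOTYPES = 978    # databases that collect phenotype data
LOW_GRANULARITY = 959       # databases with overall low granularity


def pct(count: int, base: int, digits: int = 0) -> float:
    """Percentage of `count` relative to `base`, rounded to `digits` places."""
    return round(100.0 * count / base, digits)


collect_pct = pct(COLLECT_PHENOTYPES, TOTAL_DATABASES)   # base: all links
terminology_pct = pct(40, COLLECT_PHENOTYPES, 1)          # base: 978
partial_fine_pct = pct(53, LOW_GRANULARITY, 1)            # base: 959
```

Only ‘collect phenotypes’ uses 1224 as its base; the ‘partial fine-grained phenotypes’ criterion uses the 959 low-granularity databases, and every other criterion uses 978.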

Terminology

Only 40 (4.1%) of the 978 genetic variant databases use formal terminologies to describe phenotype data, and all of them use external terminologies directly (Table 5 and Table 6). We could find only two formal terminologies, OMIM and SNOMED CT. However, not all terms are described with formal terminologies, as in some databases only some of the terms are recorded in a terminology.

Table 5: ‘Terminology’ review result - A

Total of databases using terminology: 40

| Terminology | Number |
|---|---|
| OMIM | 29 |
| OMIM (part of the terms) | 4 |
| SNOMED CT | 1 |
| SNOMED CT (part of the terms) | 6 |

Table 6: ‘Terminology’ review result - B

| Criteria | Number | Pct (%) |
|---|---|---|
| Formal terminology | 40 | 4.1 |
| Proprietary terms (mapped to a recognised terminology) | 0 | 0 |
| External terminology used directly | 40 | 4.1 |
| Recognised terminology | OMIM, SNOMED CT | |

Coding standard

A code represents the concept ID of a corresponding term. Although we found that 40 databases used formal terminologies to describe phenotype data, only 30 of them had formal coding, and all of these are coded with OMIM codes (Table 7). Thus 30 (3.1%) of the 978 genetic variant databases use a formal coding standard, and they use the external coding standard directly. The only recognised coding standard is OMIM, while the databases using SNOMED CT terminology (10 databases) have no coding for SNOMED CT terms.

Table 7: ‘Coding standard’ result

| Criteria | Number | Pct (%) |
|---|---|---|
| Formal coding standard | 30 | 3.1 |
| Proprietary codes (mapped to a recognised coding standard) | 0 | 0 |
| External coding standard used directly | 30 | 3.1 |
| Recognised coding standard | OMIM | |

Granularity

Most phenotype data have low granularity. Our results show that 959 (98%) of the 978 databases collect low-granularity phenotype data overall (Table 8). However, we found some partially fine-grained phenotype data within these 959 databases: 53 (5.5%) of them store some fine-grained phenotype data. Most phenotype data are recorded as either abbreviations or single terms, such as ‘MRX’, ‘HCM’, or ‘Pneumocystis carinii pneumonia; Diarrhea’. Examples are given in the ‘Phenotype samples’ section. We could hardly find fine-grained and well-structured phenotype data in these genetic variant databases, and phenotype data recorded as abbreviations are hard to understand.

Table 8: ‘Granularity’ review result

| Criteria | Number | Pct (%) |
|---|---|---|
| Overall granularity level: low granularity | 959 | 98 |
| Overall granularity level: high granularity | 19 | 2 |
| Partial fine-grained phenotypes (only for those databases storing overall low-granularity-level phenotypes) | 53 | 5.5 |

* The percentage of ‘partial fine-grained phenotypes’ is based on the 959 databases which collect low-granularity phenotype data.

Curation

We found that 604 (62%) of the 978 databases were curated, and we could find the curators’ names and contact information from these database links (Table 9).

Table 9: ‘Curation’ review result

| Criteria | Number | Pct (%) |
|---|---|---|
| Curated | 604 | 62 |

Multiple phenotypes

Our results show no large difference between single-phenotype collection and multiple-phenotype collection in these genetic variant databases (Table 10). 534 (54.6%) databases collect single phenotype data while 444 (45.5%) databases collect multiple phenotype data.

Table 10: ‘Multiple phenotype’ review result

| Criteria | Number | Pct (%) |
|---|---|---|
| Single phenotype | 534 | 54.6 |
| Multiple phenotypes | 444 | 45.5 |

Case level

Our result indicates that 221 (22.6%) databases store phenotype data at variant level and 757 (77.4%) databases store phenotype data at individual case level (Table 11). This result implies that it is potentially possible to link individual health records to biomedical records.

Table 11: ‘Case level’ review result

| Criteria | Number | Pct (%) |
|---|---|---|
| Variant-level phenotypes | 221 | 22.6 |
| Case-level phenotypes | 757 | 77.4 |

Database

We reviewed whether a database used a particular kind of software (database family) that provides a generic data model to store data, and we also reviewed what kind of platform was used to store data. Our results show that LOVD (Fokkema, Dunnen & Taschner 2005) and UMD (Beroud et al. 2005) are the main software used to collect phenotype data (Table 12). LOVD dominates, being used by 614 databases, compared with UMD, which is used by only 13 (Table 13). LOVD uses a MySQL database to store data, and most genetic variant databases (617) are MySQL databases. In contrast, UMD uses 4D SQL DB. As for other platforms, 342 databases store their data directly in an HTML web page format (209 in tables, 132 in free text and 1 in a bar chart), 4 databases store data in a PDF file and 2 databases store data in an Excel spreadsheet.

Table 12: ‘Database’ review result - A

| Database family | Number | Platform |
|---|---|---|
| LOVD | 614 | MySQL |
| UMD | 13 | 4D SQL DB |

63% of databases are LOVD.

Table 13: ‘Database’ review result – B

| Platform | Number |
|---|---|
| MySQL DB | 617 |
| Web page table form | 209 |
| Web page free text | 132 |
| 4D SQL DB | 13 |
| PDF table form | 4 |
| Excel table form | 2 |
| Web page bar chart | 1 |
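As a consistency check, the platform counts in Table 13 should partition the 978 databases that collect phenotypes. The short sketch below sums the reported counts and recomputes the LOVD share; the variable names are illustrative.

```python
# Platform counts as reported in Table 13.
platform_counts = {
    "MySQL DB": 617,
    "Web page table form": 209,
    "Web page free text": 132,
    "4D SQL DB": 13,
    "PDF table form": 4,
    "Excel table form": 2,
    "Web page bar chart": 1,
}

total = sum(platform_counts.values())

# Web page formats taken together (tables, free text and bar chart).
web_page_total = sum(count for name, count in platform_counts.items()
                     if name.startswith("Web page"))

# Share of databases built on LOVD (614 databases, Table 12), in percent.
lovd_share = round(100 * 614 / total)
```

The counts add up to 978, the web page formats to 342, and the LOVD share rounds to 63%, which agrees with the figures quoted in the text.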

Phenotype samples

Example 1:

Samples: ‘MRX’, ‘ARRP’, ‘AMD’, ‘arCRD’, ‘CIPA or HSN IV (H406Y + G613V are polymorphisms)’, ‘Type I, type II, non syndromic recessive’

Comment: these samples are abbreviations of phenotype data. They are coarse data and not descriptive. These abbreviations are confusing and ambiguous.

Example 2:

Samples: ‘hyperornithinemia-hyperammonemia-homocitrullinuria syndrome (OMIM238970)’,

‘Limb Girdle muscular dystrophy’,

‘Failure to thrive; Pneumocystis carinii pneumonia; Diarrhea; Marked lymphopenia’,

‘mild; arched eyebrows, synophrys, long eyelashes, myopia, thin lips, small hands, small feet, proximally set thumbs, clinodactyly 5th finger, restriction elbow movements, hirsutism’

Comment: these samples are separate terms. Most of these terms are only used to name certain clinical diagnoses without any further details. The granularity of these data is still low.

Example 3:

Sample 1:

“

Diagnosis Wiskott Aldrich syndrome

Symptoms Platelets

Symptoms At date of diagnosis: Count: 28,000/µL

Treatment Bone marrow transplatation: Yes

Treatment Donor: mismatched family donor

”

Sample 2:

“

Diagnosis Barth syndrome/isolated left ventricular noncompaction

Symptoms Movement developmental delay, growth retardation, typical

Symptoms myopathic face, decreased myodynamia and deep tendon

Symptoms reflex, neutropenia and elevated urinary 3-methylglutaconic

Symptoms acid were identified leading to the diagnosis of BTHS with

Symptoms LVNC, the echocardiogram demonstrated myocardial

Symptoms insufficiency and a noncompacted myocardium.

”

Sample 3:

“

Symptoms Other bacterial infections:

Symptoms Pseudomonas aeruginosa;

Symptoms Escherichia coli;

Symptoms Stenotrophomonas maltophilia;

Symptoms Other; Enterobacter cloacae

Symptoms Other symptoms: perirectal abscess and failure of the

Symptoms umbilical stump to involute, recurrent perirectal

Symptoms abscesses, an infected urachal cyst, a failure to heal

Symptoms surgical wounds, and the absence of pus in infected areas,

Symptoms leucocytosis, neutrophilia, hypochromic anemia

Treatment Bone marrow transplantation: Yes

Treatment Donor: matched sibling

Treatment Outcome: alive and well

Comment D57N mutant behaves in a dominant-negative fashion at the

Comment cellular level

”

Comment: these samples have a structure (diagnosis, symptoms, treatment and comment) to organise phenotype data and they are considered as fine-grained data.

4.2 Discussion

In our phenotype review, we did not consider overlaps between genes in different genetic variant databases as previous studies did (Claustres et al. 2002; Mitropoulou et al. 2010). Our review is phenotype-oriented regardless of gene overlaps. We are concerned with how phenotype data are stored and presented in heterogeneous genetic variant databases. We also did not eliminate links which were not relational databases, such as PDF files, because we intended to provide a bigger picture of how phenotype data are stored. We chose mutation databases rather than clinical information systems for analysing phenotype data because mutation databases can store both genotype data and phenotype data, so associations between them can be discovered easily, although these associations are not strong.

Compared with the research done by Claustres et al. (2002) almost a decade ago, when around 40% of databases collected phenotype data, around 80% of databases collected phenotype data by the time we conducted our research. Mitropoulou et al. (2010) contended that the data content in 98.1% of LSDBs was submitted via contacting curators. In contrast, our review shows that fewer databases are curated.

Our results show that more than half of genetic variants are related to one single phenotype. This is consistent with the phenotype overlap between different genetic variants reported for LSDBs (Coppieters et al. 2010). Such overlap introduces ambiguity when performing data linkage between genotype and phenotype, because one single phenotype could be linked to more than one genetic variant.

63% of these genetic variant databases use LOVD software (Fokkema, Dunnen & Taschner 2005) and this implies that we can convert 63% of these databases into one standardised data model by only mapping the LOVD software structure to a standardised data model. More importantly, regardless of platforms all of these phenotype data are unstructured and in free text except that one platform is a graphical bar chart. Even though these data are stored in relational databases such as MySQL, only one column which is named ‘phenotype’ has been designed to store free-text content.

Claustres et al. (2002) and Mitropoulou et al. (2010) both used similar criteria to review genetic variant databases. Their review criteria are broad and do not focus on phenotype data. In comparison, our review criteria are designed specifically for reviewing phenotype data in depth, in order to understand the characteristics of phenotype data stored within existing genetic variant databases. We also selected a broader set of genetic variant databases for our review. Our research adds a phenotype data review to the existing literature reviewing genetic variant databases.

In their analysis, Mitropoulou et al. (2010) argued that one deficiency of LSDBs was the lack of a phenotype description for each genetic variant. This is similar to our finding that phenotype data have low granularity. They also argued that, if this deficiency were addressed, phenotype data in LSDBs could be linked to other clinical associations to provide extended clinical services for patients as well as for researchers. For security and privacy reasons, all public LSDBs have concealed patients’ identities in all records. Security, privacy and confidentiality have been concerns and obstacles for EHR data linkage studies in terms of collecting and using data from different data sources (Alvarez et al. 2011, Jutte, Roos & Brownell 2011).

We have successfully mapped the status quo of phenotype data recorded in existing genetic variant databases and noted some significant features of these data. 80% of these genetic variant databases collect phenotype data, less than 5% use formal terminologies and coding, and 98% of phenotype data overall are low-granularity data. In summary, most of these phenotype data are not coded with formal terminologies; they are coarse data, i.e. either abbreviations or separate terms, and they are unstructured free text. It will therefore be difficult to group these phenotype data into individual clinical concepts.

Chapter 5 openEHR archetypes review

5.1 Review criteria

We explore the status quo of openEHR archetypes in terms of semantic interoperability before searching for suitable archetypes to capture phenotype data. We want to know whether openEHR is able to achieve multilingual translations between different languages and to interchange information between heterogeneous information systems. The openEHR archetype has its own internal coding system and a mechanism to bind external terms. It also has a mechanism to translate internal terms between different languages.

Table 14: Archetype review criteria

| Criteria | Meaning |
|---|---|
| Number of terms | The number of terms used to describe an archetype |
| Number of term bindings | The number of terms bound to other external terms |
| Coding system | The kind of coding system (or terminology) used for the term binding (such as SNOMED-CT) |
| Has term binding | A flag to indicate whether an archetype has term binding |
| Has multilingual translations | A flag to indicate whether an archetype has implemented multilingual translations |
| Languages | The languages an archetype has implemented for multilingual translations |
| Compile failure | The number of archetypes which cannot be opened and cannot be compiled to the reference model |

We reviewed 283 archetypes which were exported from the openEHR clinical knowledge manager (http://www.openehr.org/knowledge/). A set of archetype review criteria (Table 14) was designed to examine the suitability of openEHR for achieving internal and external semantic interoperability, by observing its support for internal multilingual translations and external term binding.

5.2 The result

Table 15: Archetype review result

| Criteria | Result |
|---|---|
| Number of terms | 7361 |
| Number of term bindings | 94 |
| Coding system | SNOMED-CT, LOINC |
| Has term binding | 7 (2.5% of archetypes) |
| Has multilingual translations | 83 (29.3% of archetypes) |
| Languages | English, German, Arabic, Portuguese, Japanese, Russian, Dutch, Chinese, Spanish, Farsi |
| Compile failure | 14 |

Our results (Table 15) indicate that only 7 (2.5%) of the 283 archetypes have term bindings and 83 (29.3%) archetypes have their terms translated into more than one language (including English). The 283 archetypes contain 7361 terms in total, but only 94 terms are linked to external terminology systems, such as SNOMED-CT and LOINC. We found that 5 archetypes use SNOMED-CT while 2 archetypes use LOINC. We did not find a case in which both SNOMED-CT and LOINC were used in one archetype. Ten languages are used to define terms in archetypes: English, German, Arabic, Portuguese, Japanese, Russian, Dutch, Chinese, Spanish and Farsi. We also found that 14 archetypes failed to compile against the openEHR reference model.
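The shares can be recomputed directly from the reported counts. The sketch below uses only the numbers given in Table 15; the helper name `share` is introduced for illustration. Computed this way, the 7 archetypes with term bindings amount to roughly 2.5% of the 283 archetypes.

```python
ARCHETYPES = 283   # archetypes reviewed
TERMS = 7361       # terms across all reviewed archetypes


def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


binding_share = share(7, ARCHETYPES)        # archetypes with term bindings
translation_share = share(83, ARCHETYPES)   # archetypes with translations
bound_term_share = share(94, TERMS)         # terms bound to external systems
```

Only about 1.3% of all terms are bound to an external terminology, which puts a number on how little of the term binding mechanism is used in practice.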

Two points need to be made regarding term binding and multilingual translations. First, none of the 7 archetypes with term bindings has bound all of its terms to a coding system; only a small part of the terms are bound, and the rest still use the openEHR internal coding system. Second, not all terms in the 83 archetypes with multilingual translations have been translated into other languages; some terms remain untranslated. German, Arabic and Portuguese are the languages most commonly used in translating archetypes. Based on our observation, although openEHR archetypes provide mechanisms for term binding and multilingual translation which could potentially deliver universal semantic consistency and interoperability between health information systems, these mechanisms have not been fully implemented in all archetypes, and a lot of work and cooperation is needed to fill this gap.

5.3 A blood pressure archetype example

A blood pressure archetype is chosen to demonstrate how openEHR implements its multilingual translation and term binding mechanisms. We look at these two mechanisms in both ADL and XML form. By comparing it with another archetype which captures heart rate data, we explain why the openEHR internal ‘at****’-style coding system cannot achieve semantic interoperability without binding to a universal coding system.

Multilingual translations mechanism

The openEHR first defines different languages used for translations under the ‘language’ heading and then it records translated terms under the ‘ontology’ heading. The original language to define terms in openEHR archetypes is English. We illustrate ADL and XML examples for multilingual translations implemented in the blood pressure archetype as follows.

ADL display under the ‘language’ heading:

    archetype (adl_version=1.4)
        openEHR-EHR-OBSERVATION.blood_pressure.v1

    concept
        [at0000] -- Blood Pressure

    language
        original_language = <[ISO_639-1::en]>
        translations = <
            ["zh-cn"] = <
                language = <[ISO_639-1::zh-cn]>
                author = <
                    ["organisation"] = <"Ocean Informatics">
                    ["name"] = <"Chunlan Ma">
                >
            >
            ["de"] = <
                language = <[ISO_639-1::de]>
                author = <
                    ["organisation"] = <"Ocean Informatics, University of Heidelberg">
                    ["name"] = <"Sebastian Garde, Jasmin Buck">
                >
            >
            ["fa"] = <
                language = <[ISO_639-1::fa]>
                author = <
                    ["organisation"] = <"Ocean Informatics">
                    ["name"] = <"Shahla Foozonkhah">
                    ["email"] = <"[email protected]">
                    …

…

In this example, the ADL declares English as the original language of the archetype and lists translations into Chinese, German and Farsi. The other languages are omitted from this example.
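To show how the ‘language’ heading can be read programmatically, the following sketch pulls the language codes out of an abridged copy of the section with regular expressions. It is not an ADL parser: the patterns are an assumption that holds only for text laid out like this excerpt, and the function names are illustrative.

```python
import re
from typing import List, Optional

# Abridged copy of the 'language' section (author blocks removed).
ADL_LANGUAGE_SECTION = """
language
    original_language = <[ISO_639-1::en]>
    translations = <
        ["zh-cn"] = <
            language = <[ISO_639-1::zh-cn]>
        >
        ["de"] = <
            language = <[ISO_639-1::de]>
        >
        ["fa"] = <
            language = <[ISO_639-1::fa]>
        >
    >
"""

# Matches a coded ISO 639-1 value such as <[ISO_639-1::de]>.
CODE = r"<\[ISO_639-1::([A-Za-z-]+)\]>"


def original_language(adl: str) -> Optional[str]:
    """Return the code declared as original_language, if any."""
    match = re.search(r"original_language\s*=\s*" + CODE, adl)
    return match.group(1) if match else None


def translation_languages(adl: str) -> List[str]:
    """Return language codes of translations (excludes original_language)."""
    return re.findall(r"(?<![A-Za-z_])language\s*=\s*" + CODE, adl)


original = original_language(ADL_LANGUAGE_SECTION)
translations = translation_languages(ADL_LANGUAGE_SECTION)
```

The negative lookbehind keeps the `original_language` line from being counted as a translation, so `translations` holds only the three translated languages.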

ADL display under the ‘ontology’ heading:

    ontology
        terminologies_available = <"SNOMED-CT", ...>
        term_definitions = <
            …
            ["zh-cn"] = <
                items = <
                    ["at0000"] = <
                        text = <"*Blood Pressure(en)">
                        description = <"*The local measurement of arterial blood pressure which is a surrogate for arterial pressure in the systemic circulation. Most commonly, use of the term 'blood pressure' refers to measurement of brachial artery pressure in the upper arm.(en)">
                    >
                    ["at0001"] = <
                        text = <"*history(en)">
                        description = <"*history Structural node(en)">
                    >
                    ["at0003"] = <
                        text = <"血压">
                        description = <"*@ internal @(en)">
                    >
                    ["at0004"] = <
                        text = <"收缩压">
                        description = <"一个血液循环周期中，系统性动脉血压高峰值。收缩期血压">
                    …
            ["de"] = <
                items = <
                    ["at0000"] = <
                        text = <"Blutdruck">
                        description = <"Die lokale Messung des arteriellen Blutdrucks als Surrogat für den arteriellen Druck in der systemischen Zirkulation. Häufig wird der Ausdruck 'Blutdruck' zur Bezeichung der Messung des brachialen Ateriendrucks im Oberarm verwendet.">
                    >
                    ["at0001"] = <
                        text = <"Historie">
                        description = <"Historie">
                    >
                    ["at0003"] = <
                        text = <"Blutdruck">
                        description = <"*@ internal @(en)">
                    >
                    ["at0004"] = <
                        text = <"Systolisch">
                        description = <"Der höchste arterielle Blutdruck eines Zyklus - gemessen in der systolischen oder Kontraktionsphase des Herzens.">
                    …
            ["en"] = <
                items = <
                    ["at0000"] = <
                        text = <"Blood Pressure">
                        description = <"The local measurement of arterial blood pressure which is a surrogate for arterial. pressure in the systemic circulation. Most commonly, use of the term 'blood pressure' refers to measurement of brachial artery pressure in the upper arm.">
                    >
                    ["at0001"] = <
                        text = <"history">
                        description = <"history Structural node">
                    >
                    ["at0003"] = <
                        text = <"blood pressure">
                        description = <"@ internal @">
                    >
                    ["at0004"] = <
                        text = <"Systolic">
                        description = <"Peak systemic arterial blood pressure - measured in systolic or contraction phase of the heart cycle.">

From the examples above we can see that the Chinese term ‘收缩压’ and the German term ‘Systolisch’ share the concept ID ‘at0004’, which also identifies the English term ‘Systolic’. openEHR uses an internal concept ID to link terms that are recorded in different languages but have the same semantic meaning.
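The lookup behind this mechanism can be sketched in Python. This is only an illustration, not openEHR tooling: the dictionary copies a few entries from the excerpt above, and `term_text` and `is_untranslated` are hypothetical helper names. The sketch also assumes that entries of the form `*...(en)` are placeholders for text that has not yet been translated, which is how the zh-cn excerpt appears to use them.

```python
# A few term definitions copied from the blood pressure excerpt,
# keyed by language and then by the archetype's internal at-code.
TERM_DEFINITIONS = {
    "en": {"at0000": "Blood Pressure", "at0004": "Systolic"},
    "de": {"at0000": "Blutdruck", "at0004": "Systolisch"},
    "zh-cn": {"at0000": "*Blood Pressure(en)", "at0004": "收缩压"},
}


def is_untranslated(text):
    """Assumption: '*...(en)' marks an entry still awaiting translation."""
    return text.startswith("*") and text.endswith("(en)")


def term_text(code, language, original="en"):
    """Return the term for an at-code, falling back to the original language."""
    text = TERM_DEFINITIONS.get(language, {}).get(code)
    if text is None or is_untranslated(text):
        return TERM_DEFINITIONS[original][code]
    return text


print(term_text("at0004", "de"))     # German term for at0004
print(term_text("at0004", "zh-cn"))  # Chinese term for at0004
print(term_text("at0000", "zh-cn"))  # placeholder, so the English text is used
```

The same at-code is the key in every language table, which is what lets one internal concept ID carry several translations.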

XML display under the ‘language’ heading:

ISO_639-1 de Sebastian Garde, Jasmin Buck Ocean Informatics, University of Heidelberg

ISO_639-1 zh-cn Chunlan Ma Ocean Informatics

ISO_639-1 fa Shahla Foozonkhah [email protected] Ocean Informatics

XML display under the ‘ontology’ heading:

Blood Pressure The local measurement of arterial blood pressure which is a surrogate for arterial. pressure in the systemic circulation. Most commonly, use of the term 'blood pressure' refers to measurement of brachial artery pressure in the upper arm. history history Structural node Tree @ internal @ Systolic Peak systemic arterial blood pressure - measured in systolic or contraction phase of the heart cycle. Diastolic Minimum systemic arterial blood pressure - measured in the diastolic or relaxation phase of the heart cycle. …

Blutdruck Die lokale Messung des arteriellen Blutdrucks als Surrogat für den arteriellen Druck in der systemischen Zirkulation. Häufig wird der Ausdruck 'Blutdruck' zur Bezeichung der Messung des brachialen Ateriendrucks im Oberarm verwendet. Historie Historie Blutdruck *@ internal @(en) Systolisch Der höchste arterielle Blutdruck eines Zyklus - gemessen in der systolischen oder Kontraktionsphase des Herzens. Diastolisch Der minimale systemische arterielle Blutdruck eines Zyklus - gemessen in der diastolischen oder Entspannungsphase des Herzens.

…

The XML structure stores the same data as the ADL structure, but XML is one of the most common formats for storing, transferring and sharing information among applications and systems. ADL is designed to represent clinical data in a structured way, but it needs to be serialised in a message transmission format, which can be XML.

Based on our archetype review, the openEHR internal ‘at****’-style coding system identifies a term uniquely only within a single archetype; other archetypes can use exactly the same codes to represent different clinical concepts. We compared the heart rate archetype ‘openEHR-EHR-OBSERVATION.heart_rate.v1.adl’ with the blood pressure archetype ‘openEHR-EHR-OBSERVATION.blood_pressure.v1.adl’ to show that the openEHR internal coding system cannot uniquely identify a term across all archetypes.

The comparison of coding between heart rate and blood pressure archetypes:

We can see that the blood pressure and heart rate archetypes share the same set of codes, except that the heart rate archetype does not have ‘at0002’, yet the same codes represent two completely different sets of terms. For example, ‘at0004’ means ‘Rate’ in the heart rate archetype but ‘Systolic’ in the blood pressure archetype; ‘at0005’ means ‘Rhythm pattern’ in the heart rate archetype but ‘Diastolic’ in the blood pressure archetype.

The openEHR internal coding system can therefore cause codes to be misinterpreted during information exchange between two systems. For instance, when a system receives the code ‘at0004’, it has no way of knowing that the code stands for ‘Systolic’; it could equally mean ‘Rate’ or something else. In addition, archetypes can be modified to accommodate special information requirements. Even when two systems employ the same archetype, one may define ‘at0004’ as ‘Systolic’ while the other defines it as ‘Diastolic’. Furthermore, when openEHR is used to migrate from a legacy system to a new health information system, the legacy system may describe ‘Systolic’ with other terms, say ‘Contraction phase of the heart cycle’, instead of the term ‘Systolic’ used in the openEHR blood pressure archetype. To apply openEHR as a standard for capturing clinical data and for unifying health information systems, it is essential to have a standardised coding system that maps terms with the same semantic meaning to a unique concept ID. The openEHR standard can achieve multilingual translations with its internal coding system, but an external standardised coding system is also needed to bind internal openEHR terms semantically to unique codes.
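The ambiguity can be made concrete with a short Python sketch. It is illustrative only: the two dictionaries hold the four codes quoted in the comparison, and `meanings` and `resolve` are hypothetical helper names rather than part of any openEHR library.

```python
# Internal codes taken from the two archetypes compared in this section.
ARCHETYPE_TERMS = {
    "openEHR-EHR-OBSERVATION.blood_pressure.v1": {
        "at0004": "Systolic",
        "at0005": "Diastolic",
    },
    "openEHR-EHR-OBSERVATION.heart_rate.v1": {
        "at0004": "Rate",
        "at0005": "Rhythm pattern",
    },
}


def meanings(code):
    """All terms that a bare at-code can stand for across archetypes."""
    return {terms[code] for terms in ARCHETYPE_TERMS.values() if code in terms}


def resolve(archetype_id, code):
    """An at-code is unambiguous only together with its archetype id."""
    return ARCHETYPE_TERMS[archetype_id][code]


print(meanings("at0004"))  # two different terms for one bare code
print(resolve("openEHR-EHR-OBSERVATION.heart_rate.v1", "at0004"))
```

Qualifying the code with the archetype identifier removes the collision only when both systems use the same unmodified archetype, which is why an external binding is still required.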

Term binding mechanism

We continue studying the blood pressure archetype example in terms of openEHR term binding mechanism. We demonstrate both ADL and XML displays for this mechanism. There is a term binding section in ADL to define the kinds of terminologies used for term bindings and to list the mappings between openEHR internal codes and a corresponding external terminology.

ADL display for term binding:

term_bindings = <
    ["SNOMED-CT"] = <
        items = <
            ["at0000"] = <[SNOMED-CT(2003)::163020007]>
            ["at0004"] = <[SNOMED-CT(2003)::163030003]>
            ["at0005"] = <[SNOMED-CT(2003)::163031004]>
            ["at0013"] = <[SNOMED-CT(2003)::246153002]>
        >
    >

SNOMED-CT is the terminology used for term bindings in the blood pressure archetype, but only four internal terms are mapped to it. The remaining 52 terms have no term bindings, so they cannot take part in semantic interoperation between two different health systems. The term binding mechanism links an internal code to a unique code in a specified version of a terminology. For instance, ‘at0004’ links to code ‘163030003’ in the 2003 version of SNOMED-CT.
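A binding reference of this form can be split into its parts with a small Python sketch. The regular expression and the helper names `parse_binding` and `coverage` are assumptions made for illustration; they are not taken from the openEHR specification. The coverage figure uses the counts reported above (4 bound terms and 52 unbound terms).

```python
import re

# Bindings copied from the blood pressure excerpt.
BINDINGS = {
    "at0000": "[SNOMED-CT(2003)::163020007]",
    "at0004": "[SNOMED-CT(2003)::163030003]",
    "at0005": "[SNOMED-CT(2003)::163031004]",
    "at0013": "[SNOMED-CT(2003)::246153002]",
}

# Terminology name, optional version in parentheses, then the code string.
BINDING_PATTERN = re.compile(r"^\[([^:()\]]+)(?:\(([^)]+)\))?::([^\]]+)\]$")


def parse_binding(reference):
    """Split '[SNOMED-CT(2003)::163030003]' into (terminology, version, code)."""
    match = BINDING_PATTERN.match(reference)
    if match is None:
        raise ValueError(f"not a terminology reference: {reference!r}")
    return match.group(1), match.group(2), match.group(3)


def coverage(bound, unbound):
    """Share of internal terms that carry an external binding."""
    return bound / (bound + unbound)


print(parse_binding(BINDINGS["at0004"]))
print(f"{coverage(len(BINDINGS), 52):.1%}")  # 4 of 56 terms are bound
```

With only about 7% of the terms bound, most of the archetype's content remains meaningful only inside the archetype itself.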

XML display for term binding:

SNOMED-CT(2003) 246153002

SNOMED-CT(2003) 163030003 SNOMED-CT(2003) 163031004 SNOMED-CT(2003) 163020007

The term binding mechanism allows openEHR archetypes to reconcile semantic issues between systems. It acts like a middleware to unify all sorts of health information systems by establishing a standardised semantic interoperable model. It can potentially be a standard to integrate all health systems with a single ontology for exploring health intelligence.

5.4 Discussion

Our review of openEHR archetypes indicates that they provide a multilingual translations mechanism and a term binding mechanism for achieving semantic interoperability. With the multilingual translations mechanism, the semantics of a health information system can be understood by people in different countries and languages. This is achieved by translating the openEHR internal codes within an archetype into multiple languages. With the term binding mechanism, the semantics of a health information system can be understood by another information system for the purposes of information exchange and integration. This is achieved by binding the openEHR internal coding system to a set of external standardised coding systems. Therefore, regardless of which terms an information system uses and in which languages, as long as the terms in two information systems are bound to the same set of external coding systems, terms in one system can be understood by the other. In general terms, the multilingual translations mechanism provides an interface for humans to understand a health information system, while the term binding mechanism provides an interface for two information systems (machines) to understand each other.

However, our archetype review shows that none of these archetypes has fully implemented both the multilingual translations mechanism and the term binding mechanism, although both mechanisms are available in openEHR. Only 7 of the 283 archetypes (about 2.5%) use both mechanisms, and another 76 (about 27%) have multilingual translations but no term bindings. The terms in these 76 archetypes have been translated into only a limited number of languages (commonly German and Arabic), and some of the translations are incomplete.
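The machine-to-machine side of this argument can be sketched as follows. The sketch is hypothetical throughout: `SYSTEM_B` and its local key `sys_bp` are invented for illustration, and the only values taken from this chapter are the at-code ‘at0004’, the two term labels and the SNOMED-CT code bound to ‘at0004’ in the blood pressure archetype.

```python
# Each system maps its own local key to (local term, external binding).
# The external binding is a (terminology, code) pair.
SYSTEM_A = {
    "at0004": ("Systolic", ("SNOMED-CT", "163030003")),
}
SYSTEM_B = {  # hypothetical legacy system with its own wording and key
    "sys_bp": ("Contraction phase of the heart cycle", ("SNOMED-CT", "163030003")),
}


def external_code(system, local_key):
    """Return the external binding of a local term, or None if unbound."""
    entry = system.get(local_key)
    return entry[1] if entry else None


def same_concept(system_x, key_x, system_y, key_y):
    """Two local terms match only if both are bound to the same external code."""
    code_x = external_code(system_x, key_x)
    code_y = external_code(system_y, key_y)
    return code_x is not None and code_x == code_y


print(same_concept(SYSTEM_A, "at0004", SYSTEM_B, "sys_bp"))
```

The local keys and labels differ, yet the comparison succeeds because it is made on the shared external code rather than on the local wording.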

Additionally, the openEHR standard can manage different versions of archetypes for the same clinical concept (Beale & Heard 2008), so an archetype can be modified when the content of a concept changes. Given its multilingual translations, term binding and version control mechanisms, openEHR has the potential to become a standard for solving the semantic interoperability problem (Lee et al. 2009) in EHR systems, but it is still far from ready because multilingual translations and term bindings are incomplete in existing archetypes. Existing archetype-based EHR systems have been implemented by using existing archetypes directly, modifying them or creating new ones, but there is little control or governance over how these archetypes are used and managed. Even though there is an online clinical knowledge manager for openEHR archetypes, there is still little guidance on managing archetypes, and it is unclear whether archetypes have been used correctly when developing an EHR system. Without universal knowledge governance of openEHR archetypes, uncontrolled use of archetypes will become a serious problem before the archetypes can be perfected for unifying all EHR systems. There is a strong need for international cooperation and knowledge governance to develop and improve openEHR archetypes, both for building semantically interoperable and integrated EHR systems and for reconciling semantic conflicts between heterogeneous EHR systems.

Chapter 6 openEHR phenotype capture experiment

6.1 Phenotype capture experiment result

We proposed the ‘openEHR phenotype capture model’ in Chapter 3 as a methodology for capturing phenotype data. Given that 98% of phenotype data in genetic variant databases have very low granularity, it is not possible to create a phenotype data model from a collection of abbreviations and separate terms. We therefore chose a relatively structured and simple phenotype sample to demonstrate our experiment, and identified openEHR archetypes to represent the clinical concepts captured from the sample.

It is important to note that capturing the phenotype data of genetic variant databases contributes to discovering linkage between clinical data (phenotype data) and individual genetic material. This supports personalised medicine (Martin-Sanchez et al. 2004) and biomedical knowledge discovery.

Most phenotype data are recorded as unstructured free text in a single column of a table. Because archetypes express detailed and rich clinical content, there is very little that openEHR archetypes can capture from phenotype data that consist only of abbreviations or separate simple terms.

Our experiment was inspired by the research of Spath and Grimson (2011). We did not try to convert a proprietary BIMS database into archetypes as Spath and Grimson did. The similarity between our research and theirs is that our phenotype data also come from a biomedical research setting and are also recorded in free text. Spath and Grimson (2011) discussed at length how to group clinical concepts from unstructured free-text clinical data. We followed their ideas to identify clinical concepts in our phenotype data sample. Our proposed method for capturing phenotype data (Figure 11. openEHR phenotype capture model) was examined in this experiment.

We chose a simple but meaningful structured phenotype sample from Chapter 4 to demonstrate the experiment. In our review, this sample is regarded as fine-grained phenotype data compared with other coarse-grained phenotype data (e.g. abbreviations, simple terms).

Diagnosis | Wiskott Aldrich syndrome
Symptoms | Platelets
Symptoms | At date of diagnosis: Count: 28,000/µL
Treatment | Bone marrow transplantation: Yes
Treatment | Donor: mismatched family donor

Referring to our proposed phenotype capture model, the first step is to map a clinical data set to a phenotype ontology. This step is similar to the process, described by Spath & Grimson (2011), of analysing the semantic meaning of information and grouping the information into distinct clinical concepts. Because of the characteristics of phenotype data in genetic variant databases (incomplete, sparse, coarse-grained, etc.), we were not able to create a full phenotype ontology. In our sample, the phenotype data are recorded in three categories: diagnosis, symptoms and treatment. The diagnosis is Wiskott Aldrich syndrome; the symptom is a platelet count of 28,000/µL; the treatment is bone marrow transplantation with the comment “mismatched family donor”. Imagining the patient at the centre of an EHR system, surrounded by these three concepts, we created a mapping of the concepts as follows (Figure 12).

Figure 12: The mapping of concepts

The next step in our phenotype capture model is to express this phenotype ontology using openEHR archetypes. We searched the archetype repository in the openEHR clinical knowledge manager for archetypes suitable for capturing this piece of phenotype data. We found that the archetypes in the ENTRY class, which has five sub-classes (OBSERVATION, EVALUATION, INSTRUCTION, ACTION and ADMIN_ENTRY), were suitable for expressing phenotype data. We then analysed which of these sub-classes could semantically cover the three concepts in the mapping. In terms of semantic similarity, it was not difficult to see that diagnosis is close to evaluation, symptom is close to observation and treatment is close to action.

We then examined each entry item in the archetypes of each sub-class (EVALUATION, OBSERVATION and ACTION) to find out which archetypes had suitable items for recording the clinical content in our example. As a result, three openEHR archetypes (Table 16) were considered to have suitable entry items for recording the clinical content in the example.

Table 16: openEHR archetypes for capturing phenotype data

NO. | Archetypes | Entry items
1 | openEHR-EHR-EVALUATION.problem-diagnosis.v1.adl | Diagnosis
2 | openEHR-EHR-OBSERVATION.lab_test-full_blood_count.v1.adl | Platelet count
3 | openEHR-EHR-ACTION.procedure.v1.adl | Procedure, Comments
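The selection in Table 16 can be read as a routing table from sample category to ENTRY sub-class and archetype. The following Python sketch expresses that reading; `CAPTURE_PLAN`, `SAMPLE` and `route` are hypothetical names introduced here, and the data are the sample rows and archetype names from this chapter.

```python
# Category -> (ENTRY sub-class, archetype, entry items), following Table 16.
CAPTURE_PLAN = {
    "Diagnosis": ("EVALUATION",
                  "openEHR-EHR-EVALUATION.problem-diagnosis.v1",
                  ["Diagnosis"]),
    "Symptoms": ("OBSERVATION",
                 "openEHR-EHR-OBSERVATION.lab_test-full_blood_count.v1",
                 ["Platelet count"]),
    "Treatment": ("ACTION",
                  "openEHR-EHR-ACTION.procedure.v1",
                  ["Procedure", "Comments"]),
}

# The value-bearing rows of the phenotype sample.
SAMPLE = [
    ("Diagnosis", "Wiskott Aldrich syndrome"),
    ("Symptoms", "At date of diagnosis: Count: 28,000/µL"),
    ("Treatment", "Bone marrow transplantation: Yes"),
    ("Treatment", "Donor: mismatched family donor"),
]


def route(sample):
    """Group each free-text row under the archetype chosen for its category."""
    routed = {}
    for category, text in sample:
        _, archetype, _ = CAPTURE_PLAN[category]
        routed.setdefault(archetype, []).append(text)
    return routed


for archetype, rows in route(SAMPLE).items():
    print(archetype, rows)
```

Both treatment rows land in the procedure archetype, which matches the two entry items (‘Procedure’ and ‘Comments’) listed for it.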

We display all entry items available in these three archetypes and specify constraints in each archetype for capturing specific phenotype data as well as defining terminology binding with external terminologies.

openEHR-EHR-EVALUATION.problem-diagnosis.v1.adl

Table 17: Entry items in openEHR-EHR-EVALUATION.problem-diagnosis archetype

NO. | Entry Item | Data Type | Description
1 | Diagnosis | Text | The index diagnosis
2 | Status | Text | The status of the diagnosis
3 | Date of initial onset | Date Time | The date that the problem began causing symptoms or signs
4 | Age at initial onset | Duration | The age of the at the onset of the problem
5 | Severity | Ordinal | The severity of the index problem
6 | Clinical description | Text | Description of the clinical aspects of the problem
7 | Date clinically recognised | Date Time | Date the problem was recognised by clinicians
8 | Age when clinically recognised | Duration | The age when this problem was clinically recognised
 | Location (cluster) | Cluster | The age when this problem was clinically recognised
9 | Body site | Text | The body site affected
10 | Location description | Text | A free text description of the location - may be in addition to a coded body site
 | Aetiology (cluster) | Cluster | Agents or Factors known to have been of aetiological significance
11 | Agent | Text | Microbial or other agent known to have caused this problem
12 | Complication of | URI | A problem or link to a problem or injury described elsewhere in the EHR
13 | Description | Text | Description of aetiological process
 | Occurrences or exacerbations (cluster) | Cluster | Grouping of information about individual occurrences or exacerbations
14 | Frequency of recurrence | Quantity | The frequency of individual occurrences of the problem
15 | Date of last occurrence | Date Time | The date of the last occurrence or exacerbation
 | Occurrence/exacerbation (sub-cluster) | Cluster | Information about one occurrence or exacerbation
16 | Clinical description | Text | A description of the exacerbation or occurrence
17 | Outcome | Text | Outcome of the occurrence or exacerbation
18 | Date of onset of occurrence | Date Time | Date of onset of occurrence or exacerbation
19 | Date of resolution of occurrence | Date Time | Date of the resolution of the occurrence
20 | Number of occurrences | Count | Number of times this problem has occurred or been apparent
 | Related problems (cluster) | Cluster | Complications that are attributed to this problem
 | Related problem (cluster) | Cluster | A group of characteristics of the problem complicating the index condition in this archetype
21 | Related problem | URI | Details of the problem as text or coded text or URI
22 | Clinical description | Text | Description of the complicating problem
23 | Date of resolution | Date Time | The date that the problem resolved or went into remission
24 | Age at resolution | Duration | The age of the at the resolution of the problem
 | Diagnostic criteria (cluster) | Cluster | The criteria on which the diagnosis is based
25 | Criterion | Text | A basis for the diagnosis

From the descriptions of these entry items (Table 17), it is clear that ‘Diagnosis’ can be used to record the data ‘Diagnosis Wiskott Aldrich syndrome’. We defined the constraint of the ‘Diagnosis’ entry item in the archetype and specified it as ‘Wiskott Aldrich syndrome’. Any term that ‘is_a’ diagnosis can be entered in the constraint, and ‘Wiskott Aldrich syndrome’ is a diagnosis. We then searched for the corresponding SNOMED-CT term and code for ‘Wiskott Aldrich syndrome’ on the website ‘http://www.semfinder-snomed.ch/snomed-ct-en’. The search returned the term ‘Wiskott-Aldrich syndrome’ and the code ‘36070007’. The SNOMED-CT term and code were recorded in the archetype as the external term binding for the captured information ‘Diagnosis Wiskott Aldrich syndrome’. After we imported the SNOMED-CT terminology into the archetype and set the constraint of the entry item ‘Diagnosis’ to ‘Wiskott Aldrich syndrome’ (Figure 13), we set ‘36070007’ as the code of ‘Diagnosis’ (Figure 14).

Figure 13: constraint and terminology setting for diagnosis

Figure 14: terminology binding for diagnosis

Here is a piece of ADL code that reflects the setting as Figure 13:

definition
    EVALUATION[at0000.1] matches { -- Diagnosis
        data matches {
            ITEM_TREE[at0001] matches { -- structure
                items cardinality matches {1..*; ordered} matches {
                    ELEMENT[at0002.1] matches { -- Diagnosis
                        value matches {
                            DV_CODED_TEXT matches {
                                defining_code matches {[ac0.1]} -- Wiskott Aldrich syndrome
                            }
                        }
                    }

DV_CODED_TEXT is a data type in openEHR’s reference model and any content in this data type will be coded with external terminologies (e.g. SNOMED-CT) (Beale et al. 2008). This data type represents a semantic constraint for an entry item in an archetype, as referred to in our ‘openEHR phenotype capture model’. Next, we look at how term binding has been implemented in ADL.

Here is a piece of ADL code that reflects the setting as Figure 14:

term_bindings = <
    ["SNOMED-CT"] = <
        items = <
            ["at0002.1"] = <[SNOMED-CT::36070007]>
        >
    >
>
constraint_bindings = <
    ["SNOMED-CT"] = <
        items = <
            ["ac0.1"] =
        >
    >
>

Here is another piece of ADL code which realises internal multilingual translations between English and German for the term ‘Wiskott Aldrich syndrome’:

constraint_definitions = <
    ["de"] = <
        items = <
            ["ac0.1"] = <
                text = <"Wiskott-Aldrich-Syndrom">
                description = <"Wiskott-Aldrich-Syndrom">
            >
            ["ac0000"] = <
                text = <"Beliebiger Ausdruck, der eine Körperstelle beschreibt">
                description = <"Eine anatomische Struktur mit Vermerken">
            >
        >
    >
    ….
    ….
    ["en"] = <
        items = <
            ["ac0.1"] = <
                text = <"Wiskott Aldrich syndrome">
                description = <"Wiskott-Aldrich syndrome">
            >
            ["ac0000"] = <
                text = <"Any term that describes a body site">
                description = <"An anatomical structure with qualifiers">
            >
        >
    >
>

These three pieces of ADL code can be interpreted as follows: in this archetype, ‘Diagnosis’ is specified as ‘Wiskott Aldrich syndrome’, the term is bound to the SNOMED-CT terminology with the code ‘36070007’, and the term ‘Wiskott Aldrich syndrome’ has been translated into German as ‘Wiskott-Aldrich-Syndrom’.

openEHR-EHR-OBSERVATION.lab_test-full_blood_count.v1.adl

Table 18: Entry items in openEHR-EHR-OBSERVATION.lab_test-full_blood_count archetype

NO. | Entry Item | Data Type | Description
1 | Test name | Text | Specific identifier for this lab test. e.g. Full blood count , blood glucose, urine microbiology. May equate to the result name for a single value result. Commonly a coded term e.g from LOINC or SNOMED-CT.
2 | Diagnostic service | Text | The type of high-level diagnostic service e.g. biochemistry, haematology.
3 | Test status | Text | The status of the lab test as a whole.
4 | Specimen detail [Cluster] | Slot | Details of the specimen being reported where all individual results are derived from the same specimen
5 | Haemoglobin | Quantity | The mass concentration of haemoglobin
6 | Red cell count (RCC) | Quantity | The number of red blood cells per litre
7 | Packed cell volume (PCV) (Haematocrit) | Proportion | The proportion of the volume of blood taken up by red blood cells
8 | Mean cell haemaglobin concentration (MCHC) | Quantity | The average haemaglobin concentration in the red blood cells
9 | Mean cell volume (MCV) | Quantity | The average volume of the red blood cells (PCV/RCC)
10 | Mean cell haemaglobin (MCH) | Quantity | The average haemaglobin content of red blood cells
11 | Red cell distribution width (RDW) | Proportion | The variation in red cell volume
12 | Erythrocyte sedimentation rate (ESR) | Quantity | The velocity of sedimentation of red cells in the first hour
13 | Mean platelet volume (MPV) | Quantity | The average platelet volume
14 | Platelet distribution width | Proportion | The variation in platelet volume
15 | Platelet count | Quantity | The number of platelets per litre
16 | Plateletcrit | Proportion | The proportion of the volume of blood taken up by platelets
17 | White cell count | Quantity | The number of white cells per litre
 | White cell differential (Cluster) | Cluster | The differential count of white cells or leukocytes
18 | Neutrophils | Quantity | The number of neutrophils per litre
19 | Lymphocytes | Quantity | The number of lymphocytes per litre
20 | Basophils | Quantity | The number of basophils per litre
21 | Monocytes | Quantity | The number of monocytes per litre
22 | Eosinophils | Quantity | The number of eosinophils per litre
23 | Microscopic features | Text | The features of the blood film on microscopy
24 | Result | Any | The result of the test
25 | Per-result annotation [Cluster] | Slot | Slot to allow an annotation to be added to a particular test result at run-time.
26 | Overall interpretation | Text | An overall interpretative comment on this test.
27 | Multimedia representation | Multi-Media | Representations of the whole test in mutlimedia e.g image, audio, video

From the list of entry items (Table 18), ‘Platelet count’ was chosen as a suitable entry item to record this piece of information in our example:

“

Symptoms Platelets Symptoms At date of diagnosis: Count: 28,000/µL ”

The symptom of ‘Wiskott Aldrich syndrome’ recorded here is a low platelet count, with a result of 28,000/µL. In this archetype, the data type of the item ‘Platelet count’ is quantity, which records numeric data. We set the property of the quantity to ‘Volume’ and the unit to ‘µL’. We set both the minimum and the maximum value to ‘28000’, which means the count is exactly 28000 (Figure 15).

Figure 15: capturing platelet count for symptom

Figure 16: terminology binding for platelet count

The terminology binding for ‘Platelet count’ is associated directly with the entry item, whereas the terminology binding for ‘Diagnosis’ is associated with the constraint. We found an exactly matching term in SNOMED-CT for ‘Platelet count’.

Here is a piece of ADL code that reflects the setting in Figure 15:

definition
    OBSERVATION[at0000.1] matches { -- Full blood count
        data matches {
            HISTORY[at0001] matches { -- Event Series
                events cardinality matches {1..*; unordered} matches {
                    EVENT[at0002] occurrences matches {0..1} matches { -- Any event
                        data matches {
                            ITEM_TREE[at0003] matches { -- Tree
                                items cardinality matches {0..*; unordered} matches {
                                    …
                                    …
                                    ELEMENT[at0078.12] occurrences matches {0..1} matches { -- Platelet count
                                        value matches {
                                            C_DV_QUANTITY <
                                                property = <[openehr::129]>
                                                list = <
                                                    ["1"] = <
                                                        units = <"µL">
                                                        magnitude = <|28000.0|>
                                                    >
                                                >
                                            >
                                        }
                                    }

C_DV_QUANTITY is a data type in openEHR reference model which can be used to store the platelet count (Beale et al. 2008). The term ‘Volume’ is associated with an openEHR internal code ‘129’.
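The effect of setting the minimum and maximum to the same value can be sketched in Python. `QuantityConstraint` is a hypothetical stand-in for one entry of the C_DV_QUANTITY list, not an openEHR class. The sketch also includes a unit conversion, since Table 18 describes ‘Platelet count’ as the number of platelets per litre whereas the sample reports a count per microlitre; the conversion factor of 1,000,000 µL per litre is standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantityConstraint:
    """Hypothetical stand-in for one C_DV_QUANTITY list entry."""
    units: str
    minimum: float
    maximum: float

    def accepts(self, magnitude, units):
        """True if the value uses the expected unit and lies in the interval."""
        return units == self.units and self.minimum <= magnitude <= self.maximum


# |28000.0| in the ADL is a single-point interval: minimum equals maximum.
PLATELET_COUNT = QuantityConstraint(units="µL", minimum=28000.0, maximum=28000.0)

MICROLITRES_PER_LITRE = 1_000_000


def per_microlitre_to_per_litre(count_per_ul):
    """Convert a count per microlitre to a count per litre."""
    return count_per_ul * MICROLITRES_PER_LITRE


print(PLATELET_COUNT.accepts(28000.0, "µL"))   # the recorded value
print(PLATELET_COUNT.accepts(150000.0, "µL"))  # any other value is rejected
print(per_microlitre_to_per_litre(28000.0))    # count per litre
```

A single-point interval turns the archetype into a record of one specific result, so any other magnitude would fail validation against it.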

Here is a piece of ADL code that reflects the term binding as Figure 16:

term_bindings = <
    ["SNOMED-CT"] = <
        items = <
            ["at0078.12"] = <[SNOMED-CT::61928009]>
        >
    >
>

The original archetype has not implemented multilingual translations. However, we can enable this by adding ADL code of other non-English languages into the archetype.

openEHR-EHR-ACTION.procedure.v1.adl

Table 19: Entry items in openEHR-EHR-ACTION.procedure archetype

NO. | Entry Item | Data Type | Description
1 | Procedure | Text | The name of the procedure
2 | Reason/s for procedure | Text | The reason or indication for the procedure
3 | Method/Technique | Text | Identification of specific method or technique used for procedure
4 | Description | Text | Narrative description about the procedure carried out
5 | Procedure Details [Cluster] | Slot | Detailed structure describing the procedure carried out, including preparation and details about the method and equipment/devices used
6 | Anatomical site details [Cluster] | Slot | Details about the anatomical site of procedure
 | Additional tasks | Cluster | Record information about unplanned or unexpected activities that needed to be done during the procedure. Record the name of the task and a description within this archetype, but detail should be recorded in specific linked INSTRUCTION or ACTION archetypes.
7 | Task | Text | Name of additional task performed during the procedure.
8 | Task description | Text | Description of additional task performed during the procedure.
9 | Record of additional task | URI | Link to a detailed record of the additional task.
10 | Outcome | Text | Outcome of procedure performed.
11 | Procedure unsuccessful | Boolean | Was the procedure ultimately unsuccessful? True if unsuccessful.
12 | Failed attempts | Any | The number of failed attempts to perform the procedure.
13 | Unplanned event | Text | An unplanned event prior to or related to the procedure, which may affect its execution e.g patient self-removed cannula.
14 | Complication | Text | Details about any complication arising from the procedure.
15 | Emergency | Boolean | Was this procedure performed as an emergency? True if Yes.
16 | Comments | Text | Comments about the procedure.
17 | Multimedia | Multi-Media | Multimedia representation of the procedure, including images.

There were two pieces of information for treatment left to be captured in the example:

“

Treatment Bone marrow transplantation: Yes Treatment Donor: mismatched family donor

”

We first recorded the name of the treatment, which is a procedure. Then we recorded the additional information related to the procedure. For the first piece of information, we chose the entry item ‘Procedure’ from the list above (Table 19). As with ‘Diagnosis’, we defined the constraint for ‘Procedure’ as ‘Bone marrow transplantation’ in the archetype (Figure 17) and bound the constraint to the SNOMED-CT term ‘Transplantation of bone marrow’ (Figure 18).

For the second piece of information, no existing entry item in the list can store the individual terms or phrases into which this information could be broken down. We therefore had to record it as a whole in one entry item, at lower granularity. We found two possible entry items for capturing it: ‘Unplanned event’ and ‘Comments’. Given the lack of guidelines for archetypes and the lack of best practice in using them (Spath & Grimson 2011), we judged the entry item ‘Comments’ to be the more appropriate.

We also faced another issue: no external terminology binding could be found for the phrase ‘mismatched family donor’. We searched for it in several terminology search engines, including SNOMED-CT (http://www.semfinder-snomed.ch/snomed-ct-en), LOINC (http://loinc.org/), ICD10 (http://www.icd10data.com/) and MeSH (http://www.nlm.nih.gov/mesh/MBrowser.html), but none of them returned a result. We also rearranged the order of the words and substituted words with similar meanings, but still no result was returned. Although single terms such as ‘family’ and ‘donor’ were found in SNOMED-CT, no entry item in this archetype could record these single terms separately while preserving the semantic meaning of the whole phrase.

Finally, we created an openEHR internal code and term to record the second piece of information, ‘Donor: mismatched family donor’ (Figure 19). As shown in Figure 19, we created the internal term ‘mismatched family donor’ with the description ‘the family donor is not matched’, and the archetype editor automatically generated the internal code ‘at0059’. However, we were unable to bind the internal code and term to an external terminology because none of the terminology search engines returned a result.

Here is a piece of ADL code that reflects the internal coding as Figure 19:

ELEMENT[at0005] occurrences matches {0..2} matches { -- Comments
    value matches {
        DV_CODED_TEXT matches {
            defining_code matches {[local::at0059]} -- mismatched family donor
        }
    }
}

When we created our own internal code, DV_CODED_TEXT was selected as the data type for a coded text. The code ‘at0059’ was denoted as a local code.
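The decision followed here, namely to prefer an external binding and to fall back to a local code when none is found, can be sketched in Python. The lookup table is a hypothetical in-memory stand-in for the terminology search engines and contains only the two SNOMED-CT codes reported in this chapter. `code_term` and its `next_local_number` parameter are likewise invented for illustration; in the experiment the archetype editor generated the local code automatically.

```python
# Hypothetical stand-in for an external terminology search.
EXTERNAL_TERMS = {
    "wiskott-aldrich syndrome": ("SNOMED-CT", "36070007"),
    "platelet count": ("SNOMED-CT", "61928009"),
}


def code_term(term, next_local_number):
    """Prefer an external binding; otherwise mint a local at-code."""
    binding = EXTERNAL_TERMS.get(term.lower())
    if binding is not None:
        return binding
    # No external match: the term can only be identified inside the archetype.
    return ("local", f"at{next_local_number:04d}")


print(code_term("Platelet count", 59))
print(code_term("mismatched family donor", 59))
```

A term that ends up with a `local` code is usable within the archetype but cannot take part in exchange with systems that do not share the same archetype.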

Based on the experiment of using this archetype to record the treatment information in the example, we formed two suggestions. The first concerns openEHR archetypes. They need to be better able to structure and record atomic data from free text, which would increase the chance of finding correlations between genetic variant data and clinical data. The second concerns terminology bindings. Terminology systems such as SNOMED-CT, LOINC and ICD10 need to expand their scope to cover more terms for term bindings. One terminology may not be able to cover all terms or atomic semantic meanings in any free text, but together these terminology systems could potentially bind all of them, as long as they are well categorised to avoid overlap in binding the same terms.

Figure 17: capturing the first piece of information for the treatment

Figure 18: terminology binding for the procedure

Figure 19: openEHR internal coding for comments

6.2 An attempt at modelling a conceptual patient-centric EHR data warehouse schema

The last step in our proposed ‘openEHR phenotype capture model’ is to use the ontology to map a phenotype data model which can be expressed by the openEHR reference model. As we continued our study of phenotype data, clinical content, the openEHR archetype model and reference model, and EHR systems, our ‘openEHR phenotype capture model’ ceased to be only a means of capturing phenotype data of genetic variant databases. It has become a generic ontology-archetype-based data modelling method (Figure 20). The advantage of this method is that the ontology can reflect and capture real-world objects and their relationships, while the archetype can express business logic and separate it from the underlying data structures. With this method, we can create an EHR system which stays close to reality and is flexible and scalable.

Figure 20: A generic ontology-archetype-based data modelling method (evolved from ‘openEHR phenotype capture model’)

We propose a conceptual patient-centric EHR data warehouse schema (Figure 21) as the result of the generic ontology-archetype-based data modelling method. Given the concerns of EHR integration, health business intelligence and genotype-phenotype correlation study (Martin-Sanchez et al. 2004), it is necessary to propose such a conceptual patient-centric EHR data warehouse schema, which can potentially guide us towards these goals.

The schema was inspired by Beale and Heard’s ‘Clinical Investigator Record (CIR) Ontology’ (Beale & Heard 2007), openEHR archetypes, data warehousing methods and our phenotype capture experiments. Based on Beale and Heard’s ‘Clinical Investigator Record (CIR) Ontology’, we divided the schema into five clinical knowledge domains (observation, evaluation, instruction, action and administration), supplemented by another knowledge domain, bio-medical research, which contributes to discovering genotype-phenotype correlations. The entities in the schema and the relations between them are based on the archetypes in the entry class in openEHR. The entities are derived from archetypes or are a direct mapping from archetypes, as archetypes present clinical concepts. We can study the entry items of archetypes (Table 17, 18, 19) to learn what kinds of data are needed for each knowledge domain. Then we can decompose the archetypes and restructure them into entities in a data warehouse.

In the schema, fact tables (in orange) represent core concepts in each knowledge domain; for example, ‘Lab_Test_Fact’ is a core observation concept. Each fact table records data produced by a health care activity. We have two kinds of dimension tables: one kind (in grey) is tightly linked to a single fact table; the other kind (in green), named conformed dimensions, is shared by all fact tables. For instance, the two dimension tables ‘Patient_Dim’ and ‘Date_Time_Dim’ are linked to all fact tables. This schema is a final data model to integrate various heterogeneous EHR systems to which archetype support has been added or which have been converted to archetypes.
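The distinction between the two kinds of dimension tables can be sketched with Python's built-in sqlite3 module. This is only an illustration under our own assumptions, not an implementation of the schema in Figure 21: the table ‘Lab_Test_Type_Dim’, all columns other than ‘Patient_ID’ and ‘Date_Time’, and all inserted values are hypothetical placeholders.

```python
import sqlite3


def build_star_schema():
    """Create two fact tables that share the conformed dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Conformed dimensions: shared by every fact table.
        CREATE TABLE Patient_Dim (Patient_ID INTEGER PRIMARY KEY, Patient_Label TEXT);
        CREATE TABLE Date_Time_Dim (Date_Time TEXT PRIMARY KEY);

        -- Dimension tightly linked to one fact table (hypothetical name).
        CREATE TABLE Lab_Test_Type_Dim (Lab_Test_Type_ID INTEGER PRIMARY KEY, Test_Name TEXT);

        CREATE TABLE Lab_Test_Fact (
            Lab_Test_ID INTEGER PRIMARY KEY,
            Patient_ID INTEGER REFERENCES Patient_Dim(Patient_ID),
            Date_Time TEXT REFERENCES Date_Time_Dim(Date_Time),
            Lab_Test_Type_ID INTEGER REFERENCES Lab_Test_Type_Dim(Lab_Test_Type_ID)
        );
        CREATE TABLE Imaging_Fact (
            Imaging_ID INTEGER PRIMARY KEY,
            Patient_ID INTEGER REFERENCES Patient_Dim(Patient_ID),
            Date_Time TEXT REFERENCES Date_Time_Dim(Date_Time)
        );

        -- Placeholder rows, not real clinical data.
        INSERT INTO Patient_Dim VALUES (1, 'patient-1'), (2, 'patient-2');
        INSERT INTO Date_Time_Dim VALUES ('2011-11-01T09:00'), ('2011-11-02T10:00');
        INSERT INTO Lab_Test_Type_Dim VALUES (1, 'test-type-A');
        INSERT INTO Lab_Test_Fact VALUES (1, 1, '2011-11-01T09:00', 1);
        INSERT INTO Imaging_Fact VALUES (1, 1, '2011-11-02T10:00');
    """)
    return conn


def facts_for_patient(conn, patient_id):
    """List every fact row of one patient, reached through Patient_Dim."""
    return conn.execute("""
        SELECT 'Lab_Test_Fact' AS Source, f.Date_Time, p.Patient_Label
        FROM Lab_Test_Fact AS f
        JOIN Patient_Dim AS p ON p.Patient_ID = f.Patient_ID
        WHERE f.Patient_ID = ?
        UNION ALL
        SELECT 'Imaging_Fact', f.Date_Time, p.Patient_Label
        FROM Imaging_Fact AS f
        JOIN Patient_Dim AS p ON p.Patient_ID = f.Patient_ID
        WHERE f.Patient_ID = ?
        ORDER BY 2, 1
    """, (patient_id, patient_id)).fetchall()


conn = build_star_schema()
rows = facts_for_patient(conn, 1)
print(rows)
```

Because both fact tables reference the same ‘Patient_Dim’, activities from different knowledge domains can be collected for one patient with a single query, which is the purpose of a conformed dimension.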

We created a central table (in maroon), ‘EHR_Linkage’, to connect the fact tables of each knowledge domain and some essential dimension tables. This table is very important in the schema because it represents our idea of forming the linkage between a patient’s clinical records from the five clinical knowledge domains and the patient’s gene-related data from the bio-medical research domain. To keep the schema patient-centric, each record in ‘EHR_Linkage’ represents a patient’s electronic health record which was created at a particular time in one or more health care activities from any of the six knowledge domains. This linkage table also links to administrative dimensions such as ‘Hospital_Dim’, ‘General_Practitioner_Dim’, ‘Specialist_Dim’ and ‘Health_Insurer_Dim’ to provide information about who was involved in an EHR and where the EHR occurred. Therefore, ‘EHR_Linkage’ is a large central table recording and tracking all EHRs for all patients.

To link all activities in the six knowledge domains, we use two keys, ‘Patient_ID’ and ‘Date_Time’, to form connections between all activities within an EHR. Each fact table and the ‘EHR_Linkage’ table has its own key to uniquely identify each record, and also carries the two keys ‘Patient_ID’ and ‘Date_Time’ to form linkages between entities. Given a ‘Patient_ID’ and a ‘Date_Time’, we can create a snapshot of the schema. The snapshot represents one patient’s EHR information at a particular time, obtained by joining the ‘EHR_Linkage’ table to the fact tables and administrative dimensions. With a ‘Patient_ID’ and a ‘Date_Time’, we first look for an EHR in the ‘EHR_Linkage’ table, then use that record to link tables in the six knowledge domains to discover who was involved in the EHR, when and where it occurred, and how it links to the patient’s genetic materials.
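The snapshot lookup can be sketched as follows with Python's built-in sqlite3 module. It is a minimal illustration under our own assumptions rather than an implementation of Figure 21: the table name ‘Bio_Medical_Research_Fact’ as spelled here, the function `snapshot`, every column other than ‘Patient_ID’ and ‘Date_Time’, and all values are hypothetical placeholders. The sketch also assumes at most one row per fact table for a given key pair; with several rows the joins would multiply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Hospital_Dim (Hospital_ID INTEGER PRIMARY KEY, Hospital_Name TEXT);
    CREATE TABLE EHR_Linkage (
        EHR_ID INTEGER PRIMARY KEY,
        Patient_ID INTEGER,
        Date_Time TEXT,
        Hospital_ID INTEGER REFERENCES Hospital_Dim(Hospital_ID)
    );
    CREATE TABLE Lab_Test_Fact (
        Lab_Test_ID INTEGER PRIMARY KEY,
        Patient_ID INTEGER,
        Date_Time TEXT,
        Result TEXT
    );
    CREATE TABLE Bio_Medical_Research_Fact (
        Research_ID INTEGER PRIMARY KEY,
        Patient_ID INTEGER,
        Date_Time TEXT,
        Variant TEXT
    );

    -- Placeholder rows, not real clinical or genetic data.
    INSERT INTO Hospital_Dim VALUES (1, 'hospital-A');
    INSERT INTO EHR_Linkage VALUES (1, 1, '2011-11-01T09:00', 1);
    INSERT INTO Lab_Test_Fact VALUES (1, 1, '2011-11-01T09:00', 'result-A');
    INSERT INTO Bio_Medical_Research_Fact VALUES (1, 1, '2011-11-01T09:00', 'variant-X');
""")


def snapshot(conn, patient_id, date_time):
    """Join EHR_Linkage to the other tables on Patient_ID and Date_Time."""
    return conn.execute("""
        SELECT l.EHR_ID, h.Hospital_Name, t.Result, b.Variant
        FROM EHR_Linkage AS l
        LEFT JOIN Hospital_Dim AS h
               ON h.Hospital_ID = l.Hospital_ID
        LEFT JOIN Lab_Test_Fact AS t
               ON t.Patient_ID = l.Patient_ID AND t.Date_Time = l.Date_Time
        LEFT JOIN Bio_Medical_Research_Fact AS b
               ON b.Patient_ID = l.Patient_ID AND b.Date_Time = l.Date_Time
        WHERE l.Patient_ID = ? AND l.Date_Time = ?
    """, (patient_id, date_time)).fetchall()


print(snapshot(conn, 1, "2011-11-01T09:00"))
```

The left joins keep the EHR record in the snapshot even when a knowledge domain has no activity for that patient and time, so a missing lab test or missing genetic data appears as an empty value instead of removing the record.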

Figure 21: A conceptual patient-centric EHR data warehouse schema

6.3 Discussion

Our experiment results indicate that openEHR is a potentially suitable standard to store and represent phenotype data, but that it is not yet mature enough to achieve semantic interoperability. We need international cooperation and agreement on clinical domain knowledge governance for openEHR archetypes.

Following our proposed openEHR phenotype capture model, we intended to create a phenotype ontology from the phenotype data of genetic variant databases. However, the phenotype data review showed that 98% of the phenotype data were coarse-grained, so we could not form a phenotype ontology from them. The experiment in Spath & Grimson (2011) showed that representing data in BIMS with archetypes was promising, and that it was feasible to reuse the vast amount of clinical data from biomedical research despite the different data recording requirements of clinical-practice-oriented archetypes and biomedical-research-oriented BIMS. We should be aware that a radical switch to archetype-based systems can disrupt currently running EHR installations. A gradual, incremental process for adding archetype support to existing EHR systems should be advocated (Chen et al. 2009).

We have successfully applied the openEHR archetype ontology to capture the clinical concepts in the selected phenotype sample data and have mapped these concepts into three knowledge domains of openEHR archetypes. We have recorded all information of the sample by using three archetypes. Instead of creating a phenotype data model for integrating heterogeneous genetic variant databases, we proposed a conceptual patient-centric EHR data warehouse schema for data integration. The schema covers five clinical knowledge domains (observation, evaluation, instruction, action and administration) and bio-medical research. Our research results have shown that archetypes can represent rich clinical content and that phenotype data can potentially be mapped into entry items in archetypes so that they are stored in a structured way.

Relational databases are flat, so it is hard for them to fully and correctly capture multilevel clinical content. An openEHR archetype can provide a multilevel structure to represent sophisticated health data as well as its metadata. Instead of putting effort into building all sorts of clinical data models for different EHR systems, we can channel this effort into creating a complete set of openEHR archetypes as an EHR modelling standard to capture all genotype and phenotype data.

We suggest improving existing archetypes to cover as much clinical content as possible and creating new archetypes to capture more clinical concepts. As a result of such an endeavour, we can have a full set of well-defined archetypes that serves as a mapping from the phenotype data and genotype data of all heterogeneous health information systems into one single unified data repository. Furthermore, within an archetype mapping, we can more easily find correlations between data and apply data mining techniques for knowledge discovery, based on well-structured and path-based archetypes.

We found an inconsistency in capturing phenotype data with openEHR archetypes in our experiment. Sometimes we have to record a piece of data as a constraint of an entry item in an archetype, since that piece of data is considered a value of the constraint, while at other times we can directly record a piece of data as a value of an entry item. Therefore, two pieces of data having the same granularity may be recorded at different granularity levels within archetypes. We suggest that a guideline for handling granularity differences between archetype entry items and source data would help to resolve this inconsistency. We should also note that the granularity of source data is not always clear.

Compared with the research by Spath & Grimson (2011) and Chen et al. (2009), the difference in our research is that we did not convert genetic variant databases into openEHR archetypes but mapped the phenotype data of these databases to archetypes. Finally, based on the openEHR archetype ontology, we mapped archetypes into a conceptual patient-centric EHR data warehouse schema. The schema is still far from complete, but it shows our idea that we can potentially map archetypes into different entities so as to form a data model for data integration. The schema also shows the idea of a central linkage entity that forms the connection between phenotype data and genotype data.

For ease of explanation, we have assumed that the entities in the schema are created in a relational database management system (RDBMS). However, an RDBMS will not be sufficient for representing all health data, due to the complexity of health data. For instance, ‘Imaging_Fact’ may well contain objects like pictures and videos; ‘Bio_Medical_Reseach_Fact’ may require a tree-like data structure to record some genotype data. An object-oriented database management system (OODBMS), or a combination of RDBMS and OODBMS, can be a better solution for this schema. The schema is implementation-neutral and is only a representation of an ontology-archetype-based idea of EHR integration and genotype-phenotype correlation. For the sake of explaining the idea of EHR linkage, we mainly listed linkage keys in each entity and omitted all other attributes. The entities and the attributes within them are incomplete and not all correct, and the schema is still far from an actual implementation. Further work is needed to complete the schema and elaborate the remaining details, and we will also consider a possible implementation of the schema to test the ideas within it.

Phenotype data or clinical data have been recorded at different granularity levels, in various terms for the same meaning, in various languages and in various data models. openEHR can become a universal standard to structure these data, to unify data which have the same semantic meaning through term binding, and to describe data in different human languages. The openEHR standard can help establish health information systems that store and represent clinical data in a form that is understandable and interoperable between machines as well as between human beings.

Chapter 7 Conclusion

Phenotype data in genetic variant databases have been stored as free text and in an unstructured way. No formal coding systems or terminologies have been applied to most phenotype data. Most phenotype data are stored in the form of either abbreviations or separate terms. They are low-granularity data with coarse details and are not able to fully describe clinical concepts. With formal coding systems and terminologies, we can standardise the terms that describe phenotype data and reconcile semantic conflicts between the phenotype data of heterogeneous health information systems.

The openEHR standard is potentially suitable for capturing phenotype data. It can even be potentially suitable for integrating heterogeneous health information systems if we have a full set of archetypes to capture all health concepts and completely implement the multilingual translations mechanism and the term binding mechanism for openEHR archetypes. The openEHR archetypes can provide a well-defined structure to capture the rich content of a clinical concept. The multilingual translations mechanism and the term binding mechanism are two strong indications that openEHR archetypes can achieve semantic interoperability between heterogeneous information systems. The archetype structure can ensure syntactic consistency for storing the phenotype data of heterogeneous systems.

Although openEHR archetypes are still not mature enough to fully capture all clinical data, the two-level modelling method (archetype model and reference model) and the archetype ontology are well designed and are stable technologies. In health informatics research, various information system data models are developed to serve particular research purposes, but at the same time they introduce challenges in linking and combining all the data stored in these heterogeneous data models. We believe that the openEHR standard already has the features to become a suitable standard for modelling clinical data. Instead of pursuing new technologies and standards from scratch to store phenotype data/clinical data, we should endeavour to complete a full set of mature and stable openEHR archetypes via good knowledge governance. Since the openEHR standard already has a full set of technologies, what remains is to complete the content of each technology in an archetype (e.g. multilingual translations and term binding) and to add new archetypes for capturing other clinical concepts and genetic information. In terms of knowledge governance, we need international cooperation on managing and enhancing archetypes, and we also need international agreement on choosing and enhancing terminologies and coding systems for resolving semantic conflicts. When we have a full set of archetypes ready, we can create mappings between archetypes and the underlying data (genotype data and phenotype data) in different health information systems. Well-defined archetypes are our tool to achieve data integration, data linkage and phenotype-genotype correlation.

Based on the archetype-ontology method, we can build health information systems, or any information system, in a cognitive way. The archetype-ontology method is a top-down approach to data capture and data integration. The openEHR archetypes provide a formalism to standardise the structure of underlying health data so as to reconcile syntactic conflicts between the heterogeneous data models of health information systems. Terminology bindings, on the other hand, help uniquely identify the same concept when it is recorded in different terms in different health information systems; in this way we link all these terms to the same concept to reconcile semantic conflicts. The openEHR archetypes together with terminology bindings can form an ontology of health data. We then extract entities from archetype mappings and form relations between these entities so as to build a data warehouse model for data integration.

To some extent, information systems are essentially for communication, and humans use terms to communicate information. We build an information system starting from learning and formalising these terms. We group these terms into concepts and map these concepts into archetypes. With the archetype-ontology approach, as long as we can develop a full set of archetypes for a specific knowledge domain (e.g. health, finance or education), terminology bindings allow us to break down data structure conflicts and semantic conflicts between heterogeneous information systems and to create robust, scalable, semantically interoperable and integrated information systems, towards the goal of achieving ubiquitous information computing.

References

Alvarez, LG, Aylin, P, Tian, J, King, C, Catchpole, M, Hassall, S, Whittaker-Axon, K & Holmes, A 2011, 'Data linkage between existing healthcare databases to support hospital epidemiology', Journal of Hospital Infection, vol. 79, no. 3, pp. 231-235.

Beroud, C, Hamroun, D, Collod-Beroud, G, Boileau, C, Soussi, T & Claustres, M 2005, 'UMD (Universal Mutation Database): 2005 update', Human Mutation, vol. 26, no. 3, pp. 184-191.

Beale, T & Heard, S (eds) 2008, openEHR Architecture: Architecture Overview, Ocean Informatics Australia.

Beale, T, Heard, S, Kalra, D & Lloyd, D (eds) 2008, The openEHR Reference Model: Data Types Information Model, Ocean Informatics Australia.

Beale, T & Heard, S 2007, 'An Ontology-based Model of Clinical Information', in Proceedings of the 12th World Congress on Health (Medical) Informatics: Building Sustainable Health Systems, MEDINFO, Brisbane, pp. 760-764.

Chen, R, Klein, GO, Sundvall, E, Karlsson, D & Ahlfeldt, H 2009, 'Archetype-based conversion of EHR content models: pilot experience with a regional EHR system', BMC Medical Informatics and Decision Making, vol. 9, p. 33.

Coppieters, F, Lefever, S, Leroy, BP & De Baere, E 2010, 'CEP290, a Gene with Many Faces: Mutation Overview and Presentation of CEP290base', Human Mutation, vol. 31, pp. 1097-1108.

Claustres, M, Horaitis, O, Vanevski, M & Cotton, RGH 2002, 'Time for a Unified System of Mutation Description and Reporting: A Review of Locus-Specific Mutation Databases', Genome Research, vol. 12, no. 5, May 1, 2002, pp. 680-688.

Eichelberg, M, Aden, T, Riesmeier, J, Dogac, A & Laleci, GB 2005, 'A survey and analysis of Electronic Healthcare Record standards', ACM Computing Surveys, vol. 37, no. 4, Dec, pp. 277-315.

Eccher, C, Purin, B, Pisanelli, DM, Battaglia, M, Apolloni, I & Forti, S 2006, 'Ontologies supporting continuity of care: The case of heart failure', Computers in Biology and Medicine, vol. 36, no. 7-8, pp. 789-801.

Fokkema, IFAC, Dunnen, JT & Taschner, PEM 2005, 'LOVD: Easy creation of a locus-specific sequence variation database using an “LSDB-in-a-box” approach', Human Mutation, vol. 26, no. 2, pp. 63-68.

Garde, S, Knaup, P, Hovenga, EJS & Heard, S 2007a, 'Towards semantic interoperability for electronic health records - Domain knowledge governance for openEHR archetypes', Methods of Information in Medicine, vol. 46, no. 3, pp. 332-343.

Garde, S, Hovenga, E, Buck, J & Knaup, P 2007b, 'Expressing clinical data sets with openEHR archetypes: A solid basis for ubiquitous computing', International Journal of Medical Informatics, vol. 76, Dec, pp. S334-S341.

George, RA, Smith, TD, Callaghan, S, Hardman, L, Pierides, C, Horaitis, O, Wouters, MA & Cotton, RGH 2008, 'General mutation databases: analysis and review', Journal of Medical Genetics, vol. 45, no. 2, February 1, 2008, pp. 65-70.

Groth, P & Weiss, B 2006, 'Phenotype Data: A Neglected Resource in Biomedical Research?', Current Bioinformatics, vol. 1, pp. 347-358.

Horaitis, O & Cotton, RGH 2004, 'The challenge of documenting mutation across the genome: The human genome variation society approach', Human Mutation, vol. 23, no. 5, pp. 447-452.

Jutte, DP, Roos, LL & Brownell, MD 2011, 'Administrative Record Linkage as a Tool for Public Health Research', Annual Review of Public Health, vol. 32, pp. 91-108.

Kuntzer, J, Eggle, D, Klostermann, S & Burtscher, H 2010, 'Human variation databases', Database, vol. 2010, July 17, 2010.

Kola, J, Harris, J, Lawrie, S, Rector, A, Goble, C & Martone, M 2010, 'Towards an ontology for psychosis', Cognitive Systems Research, vol. 11, no. 1, pp. 42-52.

Kahraman, A, Avramov, A, Nashev, LG, Popov, D, Ternes, R, Pohlenz, H-D & Weiss, B 2005, 'PhenomicDB: a multi-species genotype/phenotype database for comparative phenomics', Bioinformatics, vol. 21, no. 3, February 1, 2005, pp. 418-420.

Lee, CY, Ibrahim, H, Othman, M & Yaakob, R 2009, 'Reconciling semantic conflicts in electronic patient data exchange', Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services, Kuala Lumpur, Malaysia, pp. 390-394.

Louie, B, Mork, P, Martin-Sanchez, F, Halevy, A & Tarczy-Hornoch, P 2007, 'Data integration and genomic medicine', Journal of Biomedical Informatics, vol. 40, no. 1, pp. 5-16.

Liu, D, Wang, X, Pan, F, Yang, P, Xu, Y, Tang, X, Hu, J & Rao, K 2010, 'Harmonization of health data at national level: A pilot study in China', International Journal of Medical Informatics, vol. 79, no. 6, pp. 450-458.

Mitropoulou, C, Webb, AJ, Mitropoulos, K, Brookes, AJ & Patrinos, GP 2010, 'Locus-specific database domain and data content analysis: evolution and content maturation toward clinical use', Human Mutation, vol. 31, no. 10, pp. 1109-1116.

Martin-Sanchez, F, Iakovidis, I, Norager, S, Maojo, V, de Groen, P, Van der Lei, J, Jones, T, Abraham-Fuchs, K, Apweiler, R, Babic, A, Baud, R, Breton, V, Cinquin, P, Doupi, P, Dugas, M, Eils, R, Engelbrecht, R, Ghazal, P, Jehenson, P, Kulikowski, C, Lampe, K, De Moor, G, Orphanoudakis, S, Rossing, N, Sarachan, B, Sousa, A, Spekowius, G, Thireos, G, Zahlmann, G, Zvarova, J, Hermosilla, I & Vicente, FJ 2004, 'Synergy between medical informatics and bioinformatics: facilitating genomic medicine for future health care', Journal of Biomedical Informatics, vol. 37, no. 1, pp. 30-42.

Plon, SE, Eccles, DM, Easton, D, Foulkes, WD, Genuardi, M, Greenblatt, MS, Hogervorst, FBL, Hoogerbrugge, N, Spurdle, AB & Tavtigian, SV 2008, 'Sequence Variant Classification and Reporting: Recommendations for Improving the Interpretation of Cancer Susceptibility Genetic Test Results', Human Mutation, vol. 29, no. 11, pp. 1282-1291.

Patrinos, GP & Brookes, AJ 2005, 'DNA, diseases and databases: disastrously deficient', Trends in Genetics, vol. 21, no. 6, Jun, pp. 333-338.

Robinson, PN & Mundlos, S 2010, 'The Human Phenotype Ontology', Clinical Genetics, vol. 77, no. 6, pp. 525-534.

Roman, I, Calvillo, J & Roa, LM 2009, 'Personalizing Care: Integration of Hospital and Homecare', in Yogesan, K et al. (eds), Handbook of Digital Homecare, Springer Berlin Heidelberg, pp. 33-52.

Spath, MB & Grimson, J 2011, 'Applying the archetype approach to the database of a biobank information management system', International Journal of Medical Informatics, vol. 80, no. 3, pp. 205-226.

Schofield, PN, Gkoutos, GV, Gruenberger, M, Sundberg, JP & Hancock, JM 2010, 'Phenotype ontologies for mouse and man: bridging the semantic gap', Disease Models & Mechanisms, vol. 3, no. 5-6, May/June 2010, pp. 281-289.

Soussi, T, Ishioka, C, Claustres, M & Beroud, C 2006, 'Locus-specific mutation databases: pitfalls and good practice based on the p53 experience', Nature Reviews Cancer, vol. 6, no. 1, pp. 83-90.

Schulze, TG & McMahon, FJ 2004, 'Defining the Phenotype in Human Genetic Studies: Forward Genetics and Reverse Phenotyping', Human Heredity, vol. 58, no. 3-4, pp. 131-138.

Thorisson, GA, Muilu, J & Brookes, AJ 2009, 'Genotype–phenotype databases: challenges and solutions for the post-genomic era', Nature Reviews Genetics, vol. 10, no. 1, pp. 9-18.

Wache, H, Vogele, T, Visser, U, Stuckenschmidt, H, Schuster, G, Neumann, H & Hubner, S 2001, 'Ontology-based integration of information - a survey of existing approaches', in IJCAI-01 workshop: ontologies and information sharing, pp. 108-117.

Appendix

Appendix A: Summary of raw data for LSDBs (Claustres et al. 2002)

Appendix B: Database contents and data content criteria upon which the entire LSDB domain analysis has been performed