The Integrated Data Repository: A Non-Traditional Data Warehousing Approach for the Translational Investigator

Marco J. Casale3, MS, Russ J. Cucina1, MD, MS, Mark G. Weiner2, MD, David A. Krusch, MD3, Prakash Lakshminarayanan1, MBA, Aaron Abend, Bill Grant, Rob Wynden1, BSCS

1University of California, San Francisco, CA; 2University of Pennsylvania, Philadelphia, PA; 3University of Rochester, Rochester, NY

Abstract

An integrated data repository (IDR) containing aggregations of clinical, biomedical, economic, administrative, and public health data is a key component of an overall translational research infrastructure. But most available data repositories are designed using standard data warehouse architecture with a predefined data model, which does not facilitate many types of health research. In response to these shortcomings we have designed a schema and associated components which will facilitate the creation of an IDR by directly addressing the urgent need for terminology and ontology mapping in biomedical and translational sciences and give biomedical researchers the required tools to streamline and optimize their research. The proposed system will dramatically lower the barrier to IDR development at biomedical research institutions to support biomedical and translational research, and will furthermore promote inter-institute data sharing and research collaboration.

Introduction

An integrated data repository (IDR) containing aggregations of clinical, biomedical, economic, administrative, and public health data is a key component of an overall translational research infrastructure. Such a repository can provide a rich platform for a wide variety of biomedical research efforts. Examples might include correlative studies seeking to link clinical observations with molecular data, data mining to discover unexpected relationships, and support for clinical trial development through hypothesis testing, cohort scanning and recruitment. Significant challenges to the successful construction of a repository exist, and they include, among others, the ability to gain regular access to source clinical systems and the preservation of semantics across systems during the aggregation process.

Most repositories are designed using standard data warehouse architecture, with a predefined data model incorporated into the database schema. The traditional approach to data warehouse construction is to heavily reorganize and frequently to modify source data in an attempt to represent that information within a single database schema. This information technology perspective on data warehouse design is not well suited for the construction of data warehouses to support translational biomedical science. The purpose of this paper is to discuss the components which would facilitate the creation of an IDR by directly addressing the need for terminology and ontology mapping in biomedical and translational sciences, and the novel approach to data warehousing design.

Background

The IDR database design is a departure from the traditional data warehouse design proposed by Inmon. The data warehouse architecture model shown in Figure 1 depicts the process of transforming operational data into information for the purpose of generating knowledge. The diagram displays data flowing from left to right in accordance with the corporate information factory (CIF) approach (Inmon et al). According to Inmon, the data enters the CIF as raw data collected by operational applications. The data is transformed through extract, transform and load processes and stored in either the data warehouse or the operational data store (ODS).

"A star schema is a simple database design (particularly suited to ad-hoc queries) in which dimensional data (describing how data are commonly aggregated) are separated from fact or event data (describing individual business transactions)." (Hoffer, Prescott & McFadden)
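As a minimal sketch of the star schema pattern described in the quotation above (the table and column names here are hypothetical and are not part of the design proposed in this paper), dimension tables hold the descriptive attributes used for aggregation, while the fact table records the individual events:

-- Dimension tables: descriptive attributes commonly used for grouping and filtering.
CREATE TABLE patient_dim (
    patient_num  INTEGER PRIMARY KEY,
    sex_cd       VARCHAR(10),
    birth_date   DATE
);

CREATE TABLE provider_dim (
    provider_id   VARCHAR(20) PRIMARY KEY,
    provider_name VARCHAR(100)
);

-- Fact (event) table: one row per encounter, with foreign keys to the
-- dimensions and the numeric measures that ad-hoc queries aggregate.
CREATE TABLE encounter_fact (
    patient_num   INTEGER REFERENCES patient_dim (patient_num),
    provider_id   VARCHAR(20) REFERENCES provider_dim (provider_id),
    start_date    DATE,
    charge_amount NUMERIC(12,2)
);

-- A typical ad-hoc query aggregates facts grouped by dimension attributes.
SELECT p.sex_cd, SUM(f.charge_amount) AS total_charges
FROM encounter_fact f
JOIN patient_dim p ON p.patient_num = f.patient_num
GROUP BY p.sex_cd;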

Figure 1. Inmon's Corporate Information Factory Data Warehousing Model

"Often up to 80 percent of the work in building a data warehouse is devoted to the extraction, transformation, and load (ETL) process: locating the data; writing programs to extract, filter, and cleanse the data; transforming it into a common encoding scheme; and loading it into the data warehouse." (Hobbs, Hillson & Lawande)

The data model is typically designed based on the structure of the source data. Figure 2 depicts a source system clinical data model.

Figure 2. University of Rochester simplified clinical data repository OLTP data model

Furthermore, after the data has been prepared, it is loaded into the de-normalized schema of the data warehouse or data marts and resides there at a fine grain of detail. The logical design of a data warehouse is usually composed of the star schema.

However, an IDR has to incorporate heterogeneous data. A common data modeling design approach in biomedical informatics is the entity-attribute-value (EAV) model. An EAV design conceptually involves a table with three columns: one for entity/object identification (ID), one for the attribute/parameter (or an attribute ID pointing to an attribute descriptions table), and one for the value of the attribute. The table has one row for each attribute-value pair.

The IDR is built on the combination of both the star schema data model and the EAV approach. An example of this approach can be seen in the Partners Healthcare i2b2 NIH-supported data warehouse design.

Figure 3. Partners Healthcare i2b2 star schema EAV data model
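The sketch below illustrates, in simplified form, how a star schema fact table can be given an EAV shape in the spirit of the i2b2 design shown in Figure 3; the tables are hypothetical stand-ins and omit most of the columns of the actual i2b2 schema.

-- Attribute descriptions live in a dimension table rather than as fact columns.
CREATE TABLE concept_dimension (
    concept_cd   VARCHAR(50) PRIMARY KEY,
    concept_path VARCHAR(700),
    name_char    VARCHAR(255)
);

INSERT INTO concept_dimension (concept_cd, concept_path, name_char)
VALUES ('LOINC:2160-0', '\Labs\Chemistry\Creatinine\', 'Serum creatinine');

-- EAV-style fact table: the entity (patient and encounter), the attribute
-- (concept code), and generic value columns; one row per attribute-value pair.
CREATE TABLE observation_fact (
    encounter_num INTEGER,
    patient_num   INTEGER,
    concept_cd    VARCHAR(50) REFERENCES concept_dimension (concept_cd),
    start_date    DATE,
    tval_char     VARCHAR(255),
    nval_num      NUMERIC
);

-- A lab result is stored as an ordinary row, not as a dedicated column.
INSERT INTO observation_fact (encounter_num, patient_num, concept_cd, start_date, nval_num)
VALUES (1001, 42, 'LOINC:2160-0', DATE '2008-03-01', 2.4);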

EAV design is potentially attractive for databases with complex and constantly changing schemas that reflect rapidly evolving knowledge in scientific domains. Here, when a conventional design is used, the tables (and the code that manipulates them, plus the user interface) need continuous redesign with each schema revision.
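To make this concrete, here is a hedged sketch, reusing the hypothetical tables above, of what accommodating a newly required observation type looks like under each approach: the conventional design needs a schema change, while the EAV design needs only new rows.

-- Conventional design: a new observation type forces DDL (and code) changes.
ALTER TABLE encounter_fact ADD COLUMN hemoglobin_a1c NUMERIC;

-- EAV design: the same addition is ordinary data, loaded with no schema change.
INSERT INTO concept_dimension (concept_cd, concept_path, name_char)
VALUES ('LOINC:4548-4', '\Labs\Chemistry\HbA1c\', 'Hemoglobin A1c');

INSERT INTO observation_fact (encounter_num, patient_num, concept_cd, start_date, nval_num)
VALUES (1002, 42, 'LOINC:4548-4', DATE '2008-03-02', 7.1);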

Federated vs. Centralized Approach

The debate regarding a federated data warehouse design versus an 'all-in-one' centralized data warehouse design has been a key point of contention between the two main data warehousing factions, led by Bill Inmon and Ralph Kimball. The following example outlines the differences between the two approaches and justifies why the centralized approach is the favorable choice in biomedical informatics.

Many institutions have electronic clinical data that is decentralized across different departments. That infrastructure can be used to create an integrated data set with information that spans the source systems. However, the decentralization creates the need for redundant steps and lengthy iterative processes to refine the information, and it requires that more people have access to protected health information in order to satisfy the research information needs. To illustrate these issues, the following describes the workflow needed to define a cohort of people known to have cardiovascular disease and laboratory evidence of significant renal disease, defined by an elevated serum creatinine.

In the decentralized system, where should the investigator start? He can begin by going to the billing system that stores diagnoses and get a list of PHI (Protected Health Information) for people with a history of a heart attack. Then he can transport that list of identifiers to the people who work in the laboratory and request the serum creatinine levels on that set of patients, and then limit the list to those who have an elevation. The lab will have to validate the patient matches generated by the billing system by comparing PHI, a step redundant with the billing system. Furthermore, many of the subjects associated with heart attack may not have the elevated creatinine, so, in retrospect, the PHI of these people should not have been available to the people running the query in the lab. Perhaps the cohort that was generated was not as large as expected, and the investigator decides to expand the cohort to those patients with a diagnosis of peripheral vascular disease and stroke. He then has to iterate back to the billing system to draw additional people with these additional diagnoses, and then bring the new list of patient identifiers to the lab to explore their creatinine levels.

The centralized warehouse as proposed will conduct the matching of patient identifiers behind the scenes. The information system will conduct the matching of patients across the different components of the database, so that identifiers do not have to be manually transported and manipulated by the distinct database managers at each location. The result is that a centralized warehouse is radically more secure than a decentralized warehouse due to the reduced exposure of PHI. Further, if the query produces results that are not satisfactory, the cycle of re-querying the data with new criteria will be faster and user controlled.

Discussion

There are several challenges posed by IDR projects geared toward biomedical research which do not apply to the construction of most commercial warehouse implementations: 1) integrity of source data - a clear requirement in the construction of an IDR is that source data may never be altered, nor may their interpretation be altered. Records may be updated, but strict version control is required to enable reconstruction of the data that was available at a given point in time. Regulatory requirements and researchers demand clear visibility into the source data in its native format to verify that it has not been altered; 2) high variability in source schema designs - IDRs import data from a very large set of unique software environments, from multiple institutions, each with its own unique schema; 3) limited resources for the data governance of standardization - widespread agreement on the interpretation, mapping and standardization of source data that has been encoded using many different ontologies over a long period of time may be infeasible. In some cases the owners of the data may not even be available to work on data standardization projects, particularly in the case of historical data; 4) limited availability of software engineering staff with specialized skill sets - interpretation of source data during the data import process requires a large and highly skilled technical staff with domain expertise, talent often not available or available only at considerable expense; and 5) valid yet contradictory representations of data - there are valid, yet contradictory interpretations of source data depending on the domain of discourse of the researcher. Examples related to the inconsistency of the researchers' domain of discourse include: two organizations may interpret the same privacy code differently, researchers within the same specialty may not use the same ontology, or clinical and research databases may encode race and ethnicity in differing ways.

We have developed an alternative approach that incorporates the use of expert systems technologies to provide researchers with data models based on their own preferences, including the ability to select a preferred coding/terminology standard if so desired. We believe that such an approach will be more consistent with typical research methodologies, and that it will allow investigators to handle the raw data of the repository with the degrees of freedom to which they are accustomed.

An ontology mapping component provides a streamlined data acquisition and identification process by delivering data to researchers in a just-in-time fashion, instead of requiring that all data be transmitted to the IDR via a single common format and without the requirement that all data be stored within a single centralized database schema. It provides a knowledge management system and ontology mapping tools that enable less technical users to translate the complex array of data fields needed to fulfill data requests, and it facilitates inter-institutional data sharing by translating data definitions among one or more site-specific terminologies or ontologies and shareable aggregated data sets.

To support the translation of data, we are developing an approach called Inference Based Ontology Mapping, in which the source data must be translated into the ontology that the biomedical researcher requires for a particular domain of expertise. The IDR will use a rules-based system to perform this mapping of the source data format to the researcher's ontology of choice.
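As a hedged sketch of what such a rules-based translation could look like at the SQL level (reusing the hypothetical observation_fact table sketched earlier; the actual Ontology Mapper uses instance-map XML files and the map tables described in the next section), a mapping table carries rules from a site-specific source encoding to the researcher's target encoding, and the translation is applied on request without altering the source records:

-- Hypothetical mapping rules: site-specific source codes mapped to a target ontology.
CREATE TABLE code_map (
    source_encoding VARCHAR(50),
    source_code     VARCHAR(50),
    target_encoding VARCHAR(50),
    target_code     VARCHAR(50)
);

INSERT INTO code_map VALUES ('LOCAL_LAB', 'CREAT_SER', 'LOINC', '2160-0');

-- Deliver facts in the researcher's encoding of choice; the stored source rows stay untouched.
SELECT f.patient_num,
       m.target_code AS concept_cd,
       f.nval_num,
       f.start_date
FROM observation_fact f
JOIN code_map m
  ON  m.source_code     = f.concept_cd
  AND m.target_encoding = 'LOINC';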

Ontology Mapper Data Model - Extending i2b2

We propose an ontology mapping software service that runs inside of an IDR. This service will provide the capability to map data encoded with different ontologies into a format appropriate for a single area of specialty, without preempting further mapping of that same data for other purposes. This approach would represent a fundamental shift both in the representation of data within the IDR and in how resources are allocated for servicing translational biomedical informatics environments. Instead of relying on an inflexible, pre-specified data governance and data model, the proposed architecture shifts resources to handling user requests for data access via dynamically constructed views of data (Figure 4). Therefore, data interpretation happens as a result of an investigator's specific request and only as required.

Figure 4. Complex data governance (top) can be exchanged for rules encoding (bottom)

To support the alternate data governance model shown in Figure 4, a further evolution of the database model provided by i2b2 was required. In order to support this refinement of the data management process, the ability to map data (instance mapping) after ETL must be added to the data warehousing model. The proposed Ontology Mapper functionality, which enables the creation of maps for the transformation of data from source encoding to target encoding, required several changes to the original i2b2 schema.

In order to maintain data integrity, the design approach was to minimize alteration of the core i2b2 star schema design. Minimal changes were made to the observation_fact table to lessen any impact on the primary key. These changes support the relationship between the core fact and the instance map that was used in support of that fact. In general, the primary key is necessary for providing a unique instance of an observation. In addition, all fields in a record should have significance.

For instance, if an aggregation of records is generated across a particular value, the primary key must continue to uniquely identify the instance of the aggregated record. Aggregation requirements posed a unique challenge related to the partial existence of facts. A separate fact table was designed to accommodate aggregate facts. The development of aggregate fact tables is a common practice in data warehousing, typically accomplished through a materialized view against the detailed fact table. Aggregation also provides a valuable data set for clinical researchers who are interested only in de-identified data sets or in patient cohort identification.
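As an illustration (a sketch only, not the map_aggr_fact table itself), an aggregate fact table of this kind could be maintained as a materialized view over the detailed fact table, yielding a de-identified data set suitable for cohort counts:

-- De-identified aggregate facts: distinct patient counts per concept per month.
CREATE MATERIALIZED VIEW concept_month_aggr AS
SELECT concept_cd,
       EXTRACT(YEAR FROM start_date)  AS obs_year,
       EXTRACT(MONTH FROM start_date) AS obs_month,
       COUNT(DISTINCT patient_num)    AS patient_count
FROM observation_fact
GROUP BY concept_cd,
         EXTRACT(YEAR FROM start_date),
         EXTRACT(MONTH FROM start_date);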

Translation - the translation of data from its source ontology into the ontology required by the researcher will not be completed during the extract, transform and load (ETL) phase; the ontology mapping will be completed after the source data has already been imported into the IDR.

Furthermore, the reduction of alterations to the star schema will help to reduce changes incurred by subsequent releases of the i2b2 software as each database upgrade is applied. In addition to the map_aggr_fact table previously mentioned, the following minor modifications will also need to be applied:

1) Encoding_Dimension: a new table created for the purpose of storing the various encodings used in the Mapper system.

2) Map_Dimension: a new table created to store information on the instance-mapper XML files uploaded to the system.

3) Map_Data_Fact: a new table designed to store the records created as a result of Ontology Mapper execution. All the transformed records in target encodings will be housed in this table.

4) Changes to the i2b2 Observation_Fact table: a) the addition of the CONCEPT_PATH column, which is referenced from the CONCEPT_DIMENSION table and stores the concept path pertaining to the concept code used in this table. This is required because the original i2b2 design supports only the CONCEPT_CD (which can be the same for multiple concept paths), and resolving the concept path from the code alone is difficult; b) the addition of the ENCODING_CD column, which is referenced from the ENCODING_DIMENSION table and denotes the encoding type with which this record is encoded; and c) the addition of the OBSERVATION_FACT_ID column, which stores a running sequence number for uniquely identifying each record in the table.
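A sketch of what these modifications might look like as DDL follows; the data types, lengths, and constraints are assumptions for illustration and are not taken from the actual Ontology Mapper or i2b2 distribution.

-- 1) New dimension listing the supported encodings.
CREATE TABLE encoding_dimension (
    encoding_cd      VARCHAR(50) PRIMARY KEY,
    encoding_name    VARCHAR(255),
    encoding_version VARCHAR(50)
);

-- 2) New dimension describing each uploaded instance-mapper XML file.
CREATE TABLE map_dimension (
    map_id          INTEGER PRIMARY KEY,
    map_name        VARCHAR(255),
    map_path        VARCHAR(700),
    target_encoding VARCHAR(50) REFERENCES encoding_dimension (encoding_cd)
);

-- 3) New fact table holding the records produced by Ontology Mapper execution.
CREATE TABLE map_data_fact (
    observation_fact_id INTEGER,
    map_id              INTEGER REFERENCES map_dimension (map_id),
    concept_cd          VARCHAR(50),
    tval_char           VARCHAR(255),
    nval_num            NUMERIC
);

-- 4) Additions to the existing observation_fact table.
ALTER TABLE observation_fact ADD COLUMN concept_path VARCHAR(700);
ALTER TABLE observation_fact ADD COLUMN encoding_cd VARCHAR(50);
ALTER TABLE observation_fact ADD COLUMN observation_fact_id INTEGER;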

Figure 5 depicts the Ontology Mapper data model.

Figure 5. Ontology Mapper Star Schema – Extension of i2b2 Star Schema. The core i2b2 tables (OBSERVATION_FACT, CONCEPT_DIMENSION, PATIENT_DIMENSION, PROVIDER_DIMENSION, VISIT_DIMENSION) are joined to the new ENCODING_DIMENSION, MAP_DIMENSION, MAP_DATA_FACT and MAP_AGGR_FACT tables.

Ontology Mapper Database Design: Generating Data Marts

This final set of modifications made to i2b2 was required in order to provide compatibility with the existing i2b2 application server code. Additionally, these modifications offer a fundamental improvement in i2b2 security, storage requirements and performance.

In the current i2b2 workbench software the concept dimension is used to populate the list of concept paths displayed on the left of the screen. The user generates a query via a drag-and-drop of these concept paths into the query boxes at the top right. The query results are populated from the observation fact table using the box on the lower right.

Figure 6. i2b2 Workbench and related tables

The SQL query which is generated by this UI is stored in the request table. The actual data returned is stored within the response table as an XML BLOB. Using an i2b2 feature called Export Data, the user is then granted access to the data contained within that response data BLOB.

This architecture has several disadvantages:

1) Data stored within the observation fact table is replicated into the response data BLOB. For users who require a snapshot of data, this format may be acceptable. However, database BLOB fields have size limitations, and we expect researcher queries to require access to very large sets of IDR data. Replication of such large datasets has distinct disadvantages regarding the efficient usage of storage space.

Also, not all users want to access information as snapshots. The Ontology Mapper will continuously instance-map incoming data as it is presented to the system during regular ETL processing, and some users will request access to this newly instance-mapped data as soon as it becomes available. An algorithm which is based on

2) This architecture requires the maintenance of a duplicate security paradigm outside of the host database environment. The BLOBs which are created and exported to the researcher must be permissioned for usage solely by the researcher's study staff under an IRB (Institutional Review Board) protocol. Since this mechanism does not use native database security for data delivery, that security mechanism must be replicated within the application server layer.

Security models which are dependent on application server security are inherently less secure than models which leverage native database security. In models which leverage database security, if the application server layer is hacked the would-be hacker would still not obtain access to the data in question.

To address these concerns, and to provide an easier method of integrating the Ontology Mapper into the i2b2 framework, we have made the following additional modifications to the i2b2 schema.

Figure 10. Extensions to i2b2 to support the generation of Study Specific Views

Please note that in this model, instead of providing access to end users via the XML BLOB of response data (referred to as the i2b2 CRC, or Clinical Research Chart), the user is instead granted access automatically to a database view. That database view can then either be manipulated directly or be materialized into a data mart when necessary.

Also, the original workbench software would be modified so that the data shown in the lower right portion of the screen is derived from an Observation Fact View, which is comprised of data from both the Observation Fact table and the Mapped Data Fact table, or from both the Observation Fact table and the Mapped_Aggr_Fact table.

The Observation Fact View is created with a statement of the following form:

CREATE TABLE OBSERVATION_FACT_VIEW AS
SELECT OBS.*
FROM OBS_FACT OBS, MAPPED_DATA_FACT
WHERE OBS.OBSERVATION_FACT_ID = MAPPED_DATA_FACT.OBSERVATION_FACT_ID;

This is a common Create Table As Select (CTAS) statement, which is used to generate derived tables from base tables.
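The alternative mentioned above, in which the view draws on the aggregate facts rather than the instance-mapped facts, could take a similar hypothetical form; the join column below is an assumption based on the Figure 5 schema, in which MAP_AGGR_FACT references CONCEPT_PATH rather than OBSERVATION_FACT_ID.

-- Hypothetical companion statement for the aggregate case.
CREATE TABLE OBSERVATION_AGGR_VIEW AS
SELECT OBS.*
FROM OBS_FACT OBS, MAP_AGGR_FACT
WHERE OBS.CONCEPT_PATH = MAP_AGGR_FACT.CONCEPT_PATH;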

Conclusion

Our proposed design is intended to greatly facilitate biomedical research by minimizing the initial investment that is typically required to resolve semantic incongruities that arise when merging data from disparate sources. Through the use of a rules-based system, the translation of data into the domain of a specific researcher can be accomplished more quickly and efficiently than with a traditional data warehouse design. The proposed system will dramatically lower the barrier to IDR development at biomedical research institutions to support biomedical and translational research, and promote inter-institute data sharing and research collaboration.

References

1. Noy NF, Crubézy M, Fergerson RW, Knublauch H, Samson WT, Vendetti J, Musen M. Protégé-2000: an open-source ontology-development and knowledge acquisition environment. Proc. AMIA Symp. 2003; 953.

2. Brinkley JF, Suciu D, Detwiler LT, Gennari JH, Rosse C. A framework for using reference ontologies as a foundation for the semantic web. Proc. AMIA Symp. 2006; 96-100.

3. Gennari JH, Musen MA, Fergerson RW, Grosso WE, Crubézy M, Eriksson H, Noy NF, Tu SW. The evolution of Protégé: an environment for knowledge-based systems development. International Journal of Human-Computer Studies 2003; 58(1):89-123.

4. Advani A, Tu S, O’Connor M, Coleman R, Goldstein MK, Musen M. Integrating a modern knowledge-based system architecture with a Legacy VA database: The ATHENA and EON projects at Stanford. Proc. AMIA Symp. 1999; 653-7.

5. Hobbs L, Hillson S, Lawande S. Oracle9iR2 Data Warehousing. Burlington: Digital Press; 2003. p. 6.

6. Inmon WH, Imhoff C, Sousa R. Introducing the Corporate Information Factory, 2nd ed. New York: John Wiley & Sons; 2001.

7. Hoffer JA, Prescott MB, McFadden FR. Modern Database Management. Upper Saddle River: Prentice Hall; 2002. p. 421.

8. Nadkarni PM, et al. Organization of heterogeneous scientific data using the EAV/CR representation. Journal of the American Medical Informatics Association 1999.
