DE-IDENTIFIED MULTIDIMENSIONAL MEDICAL RECORDS FOR DISEASE POPULATION DEMOGRAPHICS AND IMAGE PROCESSING TOOLS DEVELOPMENT

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Barbaros Selnur Erdal, D.D.S., M.S.

*****

Electrical and Computer Engineering

The Ohio State University

2011

Dissertation Committee:

Prof. Bradley D. Clymer, Adviser
Prof. Elliott D. Crouser
Prof. Umit V. Catalyurek
Prof. Kun Huang

© Copyright by

Barbaros Selnur Erdal

2011

ABSTRACT

Recently, the National Institutes of Health (NIH) has outlined its scientific priorities in a strategic plan, the "NIH Roadmap for Medical Research". In direct alignment with these priorities, many academic and research-oriented medical institutions across the United States conduct numerous clinical and translational research studies on an ongoing basis. From a personalized health care and translational research perspective, efforts of this nature quite often span multiple departments or even institutions. We consider these activities as a knowledge and information flow which takes place around multidimensional, heterogeneous clinical and research data that are collected from disparate sources.

The primary objective of the research and development described in this thesis is to provide an integrative platform where multidimensional data from multiple disparate sources can be easily accessed, visualized, and analyzed. We believe that the ability to execute such truly integrative queries, visualizations, and analyses across multiple data types is critical to the ability to execute highly effective clinical and translational research. Therefore, to address the preceding gap in knowledge, we introduce a model computational framework that is intended to support the integrative query, visualization, and analysis of structured data, narrative text, and image data sets in support of translational research activities. The introduced framework also aims to address the challenges posed by regulatory compliance, patient privacy/confidentiality concerns, and the need to facilitate multicenter research paradigms.

Dedicated to Sevinc and Deniz

ACKNOWLEDGMENTS

I would like to thank several people who believed in me and supported me throughout my doctoral studies and particularly during my dissertation. First and foremost, I would like to express my sincere thanks to my mentors: my advisor Dr. Bradley Clymer and Dr. Elliott Crouser. They never quit believing in me and supported me, even when nobody thought the things I wanted to accomplish were possible. Because of their guidance, I am a better researcher and professional. I would also like to thank my other committee members: Dr. Umit Catalyurek for his continuous support and Dr. Kun Huang for his constructive feedback. Also, earnest thanks and appreciation to Dr. Philip Payne, Dr. Nathan Hall, and Dr. Michael Knopp for their support and collaboration in many research projects.

Special thanks to Dr. Hakan Ferhatosmanoglu, Dr. Han-Wei Shen, and Dr. Umit Catalyurek for their support and encouragement in the early stages of my career. If it weren't for their support, this degree wouldn't have been possible. Great appreciation is also dedicated to my supervisors from the James Cancer Hospital, Dr. Miguel Villalona and Dr. Gregory Otterson, for bringing me into the OSU Medical Center.

Also thanks to fellow colleagues Dr. Mehmet Kale, Dr. Lee Cooper, Dr. Olcay Sertel, and Brian Myers for being great friends and supporters; to my supervisor, mentor, and great friend Dr. Felix Liu, for being there at every stage of this journey; and to all my fellow water polo teammates, for giving me a mental escape and the occasional battle scars.

Last but not least, heartfelt thanks go to my mom, Selma; my sister Funda; my aunt Rengin; my uncle Engin; my stepdad Mustafa; my in-laws Tumer and Memnune; my brother-in-law Kurtulus; and my best friend Akif for helping me get to this point and providing me a great deal of emotional support. Finally, heartfelt love and appreciation is offered to my wife Sevinc and my son Deniz, whose loving support and encouragement carried me through the tough times and reminded me of the priorities in life.

Thank you all!

VITA

August 01, 1972 ...... Born - Istanbul, Turkey

1991-1997 ...... D.D.S., Dentistry, Ege University

1997-1998 ...... Dentist (Private Practice), Karsiyaka, Izmir, Turkey

2000-2001 ...... Research Assistant, James Cancer Hospital / Division of Hematology and Oncology, The Ohio State University Medical Center

2001-2005 ...... M.S., , The Ohio State University

2005-2009 ...... Senior Systems Consultant, The Ohio State University Medical Center

2006-2009 ...... M.S., Electrical & Computer Engineering, The Ohio State University

2009-present ...... Ph.D.c., Electrical & Computer Engineering, The Ohio State University

2009-present ...... Biomedical Informatics Consultant, The Ohio State University Medical Center

PUBLICATIONS

Research Publications

Peer Reviewed Journal Publications

Nadella P, Shapiro C, Otterson GA, Hauger M, Erdal S, Kraut E, Clinton S, Shah M, Stanek M, Monk P and Villalona-Calero MA, Pharmacobiologically Based Scheduling of Capecitabine and Docetaxel Results in Antitumor Activity in Resistant Human Malignancies, Journal of Clinical Oncology, vol. 20(11), pp. 2616–2623, 2002.

Altiparmak F, Ferhatosmanoglu H, Erdal S and Trost DC, Information Mining over Heterogeneous and High Dimensional Time Series Data in Clinical Trials Databases, IEEE Transactions on Information Technology in Biomedicine, vol. 10(2), pp. 254–263, 2006.

Erdal S, Catalyurek UV, Payne P, Saltz J, Kamal J and Gurcan MN, A Knowledge-Anchored Integrative Image Search and Retrieval System, J Digit Imaging, vol. 22(2), pp. 166–182, 2009.

Stawicki SP, Schuster D, Liu J, Kamal J, Erdal S, Gerlach AT, Whitmill ML, Lindsey DE, Thomas YM, Murphy C, Steinberg SM and Cook CH, Introducing the glucogram: Description of a novel technique to quantify clinical significance of acute hyperglycemic events, OPUS 12 Scientist, vol. 3(1) pp. 1–5, 2009.

Stawicki SPA, Schuster D, Liu J, Kamal J, Erdal S, Gerlach AT, Whitmill ML, Lindsey DE, Murphy C, Steinberg SM and Cook CH, The glucogram: A new quantitative tool for glycemic analysis in the surgical intensive care unit, Int. J. Crit. Ill. & Inj. Sci, vol. 1(1) pp. 5–12, 2011.

Erdal BS, Liu J, Ding J, Chen J, Marsh CB, Kamal J and Clymer BD, A Database De-identification Framework to Enable Direct Queries on Medical Data for Secondary Use, Methods Inform Med, 2011.

Peer Reviewed Conference/Symposia Publications and Demonstrations

Erdal S, Ozturk O, Armbruster A, Ferhatosmanoglu F and Ray WC, A Time Series Analysis of Microarray Data, Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'04), p. 366, 2004.

Erdal S, Catalyurek UV, Kamal J, Saltz J and Gurcan MN, Information Warehouse Application of caGrid: A Prototype Implementation, In caBIG, Cancer Biomedical Informatics Grid, 2007 Annual Meeting, Washington, D.C., 2007.

Erdal S, Catalyurek UV, Saltz J, Kamal J and Gurcan MN, Flexible Patient Information Search and Retrieval Framework: Pilot Implementation, In Proceedings - SPIE, The International Society for Optical Engineering, vol. 6516, p. 6516OI

Erdal S, Catalyurek UV, Saltz J, Kamal J and Gurcan MN, Integrating a PACS System to caGrid: A De-identification and Integration Framework, In SIIM, Society of Imaging Informatics in Medicine, Providence, RI, 2007.

Altiparmak F, Ozturk O, Erdal S, Ferhatosmanoglu H and Trost DC, Combining Mining Results from Multiple Sources in Clinical Trials and Microarray Applications, The MMIS workshop at the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Jose, CA, 2007.

Altiparmak F, Erdal S, Ozturk O and Ferhatosmanoglu H, A Multi-Metric Similarity Based Analysis of Microarray Data, The IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 317–324, Fremont, CA, 2007.

Liu J, Erdal S, Silvey SA, Ding J, Marsh CB and Kamal J, Toward a Fully De-identified Biomedical Information Warehouse, In AMIA, American Medical Informatics Association Annual Proceedings, pp. 370–374, 2009.

Erdal S, Clymer BD, Liu J, Kamal J, Knopp MV and Hall N, Integrative Searches on HIS, PACS and RIS with Utilization of an Information Warehouse, In SIIM, Society of Imaging Informatics in Medicine, Washington, D.C., 2011.

Peer Reviewed Conference/Symposia Abstracts

Altiparmak F, Ozturk O, Erdal S, Ferhatosmanoglu H and Trost DC, Information Mining over Heterogeneous Microarray and Clinical Data, LSS Computational Systems Bioinformatics Conference, 2006.

Erdal S and Kamal J, An Indexing Scheme for Medical Free Text Searches: A Prototype, AMIA, American Medical Informatics Association Annual Proceedings, p. 918, 2006.

Ding J, Erdal S, Dhaval R and Kamal J, Augmenting Oracle Text with the UMLS for Enhanced Searching of Free-Text Medical Reports, AMIA, American Medical Informatics Association Annual Proceedings, p. 940, 2007.

Erdal S, Ding J, Osborn C, Mekhjian H and Kamal J, ICD9 Code Assistant: A Prototype, AMIA, American Medical Informatics Association Annual Proceedings, p. 950, 2007.

Ding J, Erdal S, Borlawsky T, Liu J, Golden-Kreutz D, Kamal J and Payne PRO, The Design of a Pre-Encounter Clinical Trial Screening Tool: ASAP, AMIA, American Medical Informatics Association Annual Proceedings, p. 631, 2008.

Rogers P, Erdal S, Santangelo J, Liu J, Schuster S and Kamal J, Use of Synthesized Data to Support Complex Ad-hoc Queries in an Enterprise Information Warehouse: A Diabetes Use Case, AMIA, American Medical Informatics Association Annual Proceedings, p. 1029, 2008.

Santangelo J, Erdal S, Wellington L, Mekhjian H and Kamal J, Identification of potential surgical site infections leveraging an enterprise clinical information warehouse, AMIA, American Medical Informatics Association Annual Proceedings, p. 941, 2008.

Erdal S, Rogers P, Santangelo J, Buskirk J, Ostrander M, Liu J and Kamal J, Data delivery workflow in an academic information warehouse, AMIA, American Medical Informatics Association Annual Proceedings, p. 942, 2008.

Kamal J, Silvey SA, Buskirk J, Dhaval R, Erdal S, Ding J, Ostrander M, Borlawsky T, Smaltz DH and Payne PRO, Innovative applications of an enterprise-wide information warehouse, AMIA, American Medical Informatics Association Annual Proceedings, p. 1134, 2008.

Erdal S, Liu J, Silvey SA, Kamal J and Hall N, Delivering Images and Searching PACS Metadata: The Data Warehouse Approach, AMIA, American Medical Informatics Association Annual Proceedings, p. 837, 2009.

Waugh C, Erdal S, Deiter RJ, Buskirk J, Dyta R, Liu J, Marsh C, Wood K, Mastronarde J and Kamal J, Pulmonary Portal: Study Management through Web Portals, AMIA, American Medical Informatics Association Annual Proceedings, p. 1074, 2009.

Erdal S, Clymer BD, Liu J, Kamal J, Knopp MV and Hall N, X-Ray and CT Dose Calculations: Operations and Safety, AMIA, American Medical Informatics Association Annual Proceedings, p. 1032, 2010.

Erdal S, Clymer BD, Liu J, Kamal J, Knopp MV and Hall N, Integrative searches on PACS metadata with utilization of an Information Warehouse, J Nucl Med, 51 (Supplement 2), p. 1336, 2010.

Liu J, Chen J, Ding J, Erdal S, Kellough D, Huebner K, Shapiro C, Ramirez M-T and Kamal J, Integration of Tissue Microarray Data in an Information Warehouse to Enable Breast Cancer Research, AMIA, American Medical Informatics Association Annual Proceedings, p. 1149, 2010.

Kahmann S, Erdal S, Liu J, Kamal J and Clymer B, Generalizable Session Dependent De-identification Methods, AMIA, American Medical Informatics Association Annual Proceedings, p. 1751, 2011.

Liu J, Erdal S and Kamal J., A Step against Re-identification: Keep Low Volume Query Result inside Your Data Source, AMIA, American Medical Informatics Association Annual Proceedings, p. 1866, 2011.

Erdal S, Liu J, Craig K, Kamal J and Clymer BD, Proxy PACS Servers for Image Delivery through an Information Warehouse, AMIA, American Medical Informatics Association Annual Proceedings, p. 1752, 2011.

Erdal S, Clymer BD and Crouser E, Increasing Prevalence of Sarcoidosis in a Midwest USA Metropolitan Health System, Am J Respir Crit Care Med, vol. 183, p. A5636, 2011.

Erdal S, Crouser E and Clymer BD, Quantitative Computerized Two-Point Correlation Analysis of Lung CT Scans Correlates with Pulmonary Function in Pulmonary Sarcoidosis, Am J Respir Crit Care Med, vol. 183, p. A5644, 2011.

FIELDS OF STUDY

Major Field: Electrical & Computer Engineering

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iii

Acknowledgments ...... iv

Vita ...... vi

List of Tables ...... xiv

List of Figures ...... xv

Chapters:

1. Introduction ...... 1

1.1 Introduction ...... 1
1.2 Statement of Problem ...... 3
1.3 Organization of the Thesis ...... 5

2. A Knowledge-Anchored Integrative Image Search and Retrieval System . . . . 7

2.1 Summary ...... 8
2.2 Introduction ...... 8
2.3 Background and Significance ...... 10
2.3.1 Information Needs in the Clinical and Translational Research Domains ...... 11
2.3.2 Grid-computing Electronic Data Interchange Platforms ...... 13
2.3.3 Knowledge-anchored Information Retrieval ...... 14
2.3.4 Image Retrieval Tools ...... 17
2.3.5 OSUMC Information Warehouse ...... 18

2.4 Methods ...... 19
2.4.1 Motivating Use Case ...... 21
2.4.2 Three-Tiered Software Framework ...... 25
2.5 Results ...... 34
2.6 Discussion ...... 36
2.7 Conclusion ...... 38

3. A Database De-identification Framework to Enable Direct Queries on Medical Data for Secondary Use ...... 41

3.1 Summary ...... 41
3.2 Introduction ...... 42
3.3 Background ...... 45
3.3.1 The OSUMC Honest Broker Protocol ...... 48
3.4 Design Objectives ...... 49
3.5 System Description ...... 49
3.5.1 Operations on the source system ...... 51
3.5.2 Operations on the destination system ...... 53
3.5.3 System Validation ...... 57
3.5.4 Test Environment and Setup ...... 59
3.6 Test Queries and Results ...... 60
3.6.1 Generalizability ...... 62
3.6.2 Reliability ...... 62
3.7 Discussion ...... 64
3.8 Conclusion ...... 69

4. Unexpectedly High Prevalence of Sarcoidosis in a Representative U.S. Metropolitan Population ...... 73

4.1 Summary ...... 73
4.2 Introduction ...... 74
4.3 Methods ...... 75
4.4 Results ...... 76
4.4.1 Demographics of the regional population ...... 76
4.4.2 Comparison of sarcoidosis prevalence to that of other rare lung diseases ...... 77
4.4.3 Changes in prevalence of sarcoidosis over time in our health care system ...... 78
4.4.4 Estimate of sarcoidosis prevalence in Columbus, Ohio ...... 78
4.5 Discussion ...... 79

5. Computer Analysis of Chest CT for Sarcoidosis ...... 84

5.1 Summary ...... 84
5.2 Introduction ...... 85
5.3 Methods ...... 87
5.3.1 Sarcoidosis patient population ...... 87
5.3.2 CT Image Analysis ...... 88
5.3.3 Statistical methods ...... 89
5.4 Results ...... 89
5.4.1 Two-point correlation analysis of CT images reduces background signal from normal lung structures ...... 89
5.4.2 LTS strongly correlates with forced vital capacity (FVC) and total lung capacity (TLC) ...... 90
5.4.3 Correlations between LTS and pulmonary function remain significant after reducing image intensity precision ...... 90
5.5 Discussion ...... 91

6. Conclusions ...... 95

Bibliography ...... 99

LIST OF TABLES

Table Page

2.1 Combining ICD9 Codes for "Malignant Neoplasm of Trachea, Bronchus, and Lung" with Radiology Reports that Contain Concept Corresponding to "Lung Nodules" along with Pathology Reports that Contain Concept Corresponding to "Carcinoma." ...... 39

2.2 Combining ICD9 Codes for "Coagulation Defects" with Radiology Reports that Contain Concept Corresponding to "Pulmonary Embolism." ...... 40

3.1 Tables de-identified during testing ...... 59

3.2 Queries with different forms and parameters were used during performance evaluations. Each query is repeated in a non-sequential order with different parameters...... 71

3.3 Statistics on pseudorandom number generation. The minimum pass value for a given test is 29 for sequences of size 32 bits; the minimum pass rate for the random excursion (variant) test is approximately 2 for a sample size of 3. The number of sequences used was 86400. We had a 100% pass rate in all tests. ...... 72

4.1 Demographics of the Columbus, Ohio patient population compared to the U.S. in 2010. ...... 77

4.2 Sarcoidosis Prevalence Compared to Other Rare Lung Diseases...... 77

LIST OF FIGURES

Figure Page

1.1 Some of the data sources and analysis tools that support clinical and translational research at OSUMC. ...... 2

2.1 Illustration of translational research information flow model...... 11

2.2 Conceptual model for our software framework and model system implementation. ...... 20

2.3 The multitier implementation of our framework provides a means to implement a Grid-based or web-based electronic data interchange platform within a service-oriented architecture. End-user access to privileged data is managed in one or more ways: 1) a fully privileged user within the institutional firewall (left hand side) can access the data in any way preferred; 2) an external user (located outside the firewall, right hand side) is subject to additional restriction in order to access data; and 3) the multitier service-oriented approach allows for the easy deployment of custom services. ...... 22

2.4 By first querying for available meta-data in one or more relational databases, users are able to identify patients of interest; later corresponding images can be retrieved from a PACS...... 23

2.5 The interface provides interactive assistance to users in order to map keywords used during query formulation to appropriate diagnosis codes (ICD9-CM). Once a query is executed, users may browse the result sets using a hierarchical "drill down" model. ...... 24

2.6 The Ontology Tools Package (OTP) allows users to interact with a local UMLS knowledge collection instance, diagnostic databases, and text report databases. Users can query OTP with a keyword or textual phrase, which is then mapped to one or more concept codes derived from text mining approaches informed by the UMLS and using MMTx. Once users finalize a set of targeted search concepts, text reports that contain those concepts are queried and returned. Users can also retrieve ICD9-CM codes that correspond to their query keywords, and use those codes to query tables containing structured data. ...... 29

3.1 Data Request Process. Top: normal process with IRB review. Bottom: HBP process...... 43

3.2 Schematic overview of clinical data from identifiable to de-identified repositories. ...... 44

3.3 De-identification process across databases; on the left, operations on the source database system; on the right, operations on the destination database system. ...... 50

3.4 Overview of identifier de-identification process...... 55

3.5 Average query execution times. De-identified (red) vs. Limited (blue). These queries need to execute in under one minute in our source identifiable environment. Detailed descriptions of the queries, along with their syntax, are provided in the Appendix. ...... 61

4.1 Sarcoidosis age distribution, 1995-2010. The average age of the sarcoidosis patients was 48 yrs, with the vast majority falling in the 3rd (light blue), 4th (dark blue), and 5th (red) decades. ...... 79

4.2 Changes in gender distribution between 1995 and 2010. Females are represented in blue and males in red. ...... 80

4.3 Changes in race distribution between 1995 and 2010; overall race distribution. African American (blue) and white (red) races represented the majority of sarcoidosis patients, reflecting regional demographics. ...... 81

4.4 Sarcoidosis prevalence vs. lung cancer prevalence over time. Note that these trend lines were derived from patients residing in zip codes within Franklin County (Columbus), OH, to reduce bias relating to changing referral patterns. The results here are based upon actual US Census Data from 1990, 2000 and 2010 as well as the census estimates for intervening years. ...... 82

5.1 Schematic representation of our two-point correlation function based approach (LTS). (a) LTS accepts chest CT studies in DICOM format as input; (b) using Hounsfield units, the lungs are segmented, and the segmented lung volume is raster scanned; (c) during this scan, each pixel is compared to its neighbors at various distances within a threshold; (d) this process is repeated for the entire lung segment; (e) the mismatches for each pixel are summed at the individual pixel level and later added to the overall volume of mismatches. ...... 87

5.2 Two-point correlation analysis of CT images highlights diseased lung. The image on the left is produced by filtering out the tissues other than the lungs. The image on the right is a "result" image which shows the calculated features for every pixel from the source image. Areas marked by "1" (green arrows) highlight example ROIs with enhanced discrimination between normal and abnormal lung tissue. Areas marked by "2" (red arrows) highlight example ROIs with enhanced discrimination between "normal" blood vessels and adjacent "diseased" lung. ...... 90

5.3 Correlation between CT image score and lung function. The score calculated from the CT images reflected the amount of irregularity within the lungs (that is, the percentage of irregular (textured) lung within the overall lung volume for the given CT image). Higher CT image scores correlated well with lower percent-predicted FVC (r = −0.92) and percent-predicted TLC (r = −0.88). LTS did not correlate well with percent-predicted DLCO (r = −0.15). ...... 91

CHAPTER 1

INTRODUCTION

1.1 Introduction

Recently, the National Institutes of Health (NIH) has outlined its scientific priorities in a strategic plan, the "NIH Roadmap for Medical Research" [1]. From a personalized health care and translational research perspective, many of these priorities are in direct alignment with The Ohio State University Medical Center's (OSUMC) own mission areas: patient care, research, and education. At OSUMC, numerous clinical and translational research studies are conducted on an ongoing basis, and these efforts span multiple departments and institutions. We consider these clinical and translational research activities as a knowledge and information flow which takes place around multidimensional, heterogeneous clinical and research data that are collected from disparate sources.

Figure 1.1 illustrates a conceptual model of clinical and translational research information flow, available data sources, and some of the available analysis tools for researchers at OSUMC (who are supported by an academic information warehouse). To better serve the clinical and translational research community, our broader aims include: 1) providing integrative tools that comply with regulatory policies; 2) enabling networking as well as collaborative studies among researchers and institutions; and 3) providing new computer-assisted diagnosis (CAD) platforms that enable correlative analysis of multidimensional, heterogeneous data in a clinical and translational research setting.

Figure 1.1: Some of the data sources and analysis tools that support clinical and translational research at OSUMC.

Within academic medical centers such as OSUMC, large amounts of multidimensional, heterogeneous data are collected electronically on an ongoing basis. These data include clinical parameters derived during the patient care process, as well as financial and operational information. Clinical data can take many forms, including but not limited to: (1) structured, codified data elements; (2) semi-structured or unstructured narrative text; and (3) multimodal images [2]. While such data are readily available to clinical providers and administrators, accessing the same data for research or business intelligence purposes is often a challenge, usually because of regulatory compliance requirements and concerns over patient privacy and confidentiality. Furthermore, data stored in operational systems are not necessarily structured in a manner that supports integrative longitudinal or class-based query and analysis, a requirement that often exists in the research context [3]. Therefore, even when regulatory and patient privacy and confidentiality concerns are adequately addressed, it often remains difficult to query and access such data for research and business intelligence. One way institutions have tried to address this problem is by extracting clinical, operational, and financial data from source systems and subsequently storing this information in a centralized data warehouse that uses a data model optimized for longitudinal and/or class-based queries [4, 5]. However, imaging data sets are not usually physically stored in such warehouses because of concerns over data storage capacity and are instead commonly stored and managed using a separate Picture Archiving and Communication System (PACS). For purposes of clinical [6], translational, and educational research, the integration and retrieval of image data along with structured data and narrative text is highly desirable [7, 8].
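To make the distinction concrete, the following sketch shows the kind of longitudinal, class-based query that such a warehouse data model is optimized for. The schema, table names, and sample rows are hypothetical simplifications for illustration only, not the actual OSUMC warehouse design.

```python
import sqlite3

# Hypothetical, simplified warehouse schema (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients    (pid INTEGER PRIMARY KEY, birth_year INTEGER);
CREATE TABLE diagnoses   (pid INTEGER, icd9 TEXT, dx_date TEXT);
CREATE TABLE lab_results (pid INTEGER, test TEXT, value REAL, lab_date TEXT);
""")
conn.executemany("INSERT INTO patients VALUES (?, ?)",
                 [(1, 1960), (2, 1975)])
conn.executemany("INSERT INTO diagnoses VALUES (?, ?, ?)",
                 [(1, "135", "2009-03-01"),     # ICD9 135: sarcoidosis
                  (2, "162.9", "2010-06-15")])  # ICD9 162.9: lung cancer
conn.executemany("INSERT INTO lab_results VALUES (?, ?, ?, ?)",
                 [(1, "FVC", 3.1, "2009-04-01"),
                  (1, "FVC", 2.8, "2010-04-01")])

# Class-based selection (all sarcoidosis patients) combined with a
# longitudinal view of each selected patient's pulmonary function tests.
rows = conn.execute("""
SELECT d.pid, l.lab_date, l.value
FROM diagnoses d JOIN lab_results l ON d.pid = l.pid
WHERE d.icd9 = '135' AND l.test = 'FVC'
ORDER BY d.pid, l.lab_date
""").fetchall()
print(rows)  # -> [(1, '2009-04-01', 3.1), (1, '2010-04-01', 2.8)]
```

In a transactional source system the same question would typically require joining across many encounter-oriented tables; the warehouse model collapses it into a single patient-centric join, which is what makes self-service research queries tractable.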

1.2 Statement of Problem

The primary objective of the research and development described in this thesis is to provide an integrative platform where multidimensional data from multiple disparate sources can be easily accessed, visualized, and analyzed. We believe that the ability to execute such truly integrative queries, and to enable visualizations and analyses across multiple data types, is critical to the ability to conduct highly effective clinical and translational research. Therefore, to address the preceding gap in knowledge, we introduce a model computational framework that is intended to support the integrative query, visualization, and analysis of structured data, narrative text, and image data sets in support of translational research activities. The introduced framework also aims to address the challenges posed by regulatory compliance, patient privacy/confidentiality concerns, and the need to facilitate multicenter research paradigms. The model we introduce is motivated by two types of translational research-oriented end users, specifically:

(1) clinical researchers who need to track their patients through a unified and simplified interface which presents all types of data (narrative, structured, and image) originating from disparate source systems; and

(2) researchers who are seeking correlations between phenotypic and genotypic parameters of a given patient dataset from a personalized health care perspective.

In either scenario, such integration requires the ability to locate patients' data within multiple source systems (e.g., for patients who are enrolled on a given clinical trial for sarcoidosis, bring all pulmonary function test (PFT) results and chest computed tomography (CT) images where the primary care physician's dictated notes mention "shortness of breath" within the text). This can be achieved by utilizing some combination of structured data and narrative text, and then subsequently querying a PACS system to identify and obtain image sets for such patients that are potentially generated by one or more modalities (e.g., computed tomography (CT), magnetic resonance imaging (MRI), etc.) and that correspond to the desired anatomic location(s).
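A minimal sketch of this two-stage workflow follows. The record structures, field names, and in-memory "PACS index" are hypothetical stand-ins for the actual warehouse and PACS interfaces, chosen only to show the shape of the query.

```python
# Hypothetical records standing in for warehouse and PACS metadata.
trial_enrollment = {101, 102}  # patient IDs enrolled on the sarcoidosis trial
notes = [
    {"pid": 101, "text": "Patient reports shortness of breath on exertion."},
    {"pid": 102, "text": "No respiratory complaints today."},
]
pacs_index = [
    {"pid": 101, "accession": "A-001", "modality": "CT",  "body_part": "CHEST"},
    {"pid": 101, "accession": "A-002", "modality": "MRI", "body_part": "HEAD"},
    {"pid": 103, "accession": "A-003", "modality": "CT",  "body_part": "CHEST"},
]

def find_study_accessions(phrase, modality, body_part):
    # Stage 1: combine structured data (trial enrollment) with a
    # narrative-text filter to identify the patients of interest.
    pids = {n["pid"] for n in notes
            if n["pid"] in trial_enrollment
            and phrase.lower() in n["text"].lower()}
    # Stage 2: query the PACS metadata for matching image studies.
    return [s["accession"] for s in pacs_index
            if s["pid"] in pids
            and s["modality"] == modality
            and s["body_part"] == body_part]

print(find_study_accessions("shortness of breath", "CT", "CHEST"))  # -> ['A-001']
```

In the real system the second stage would be expressed as a DICOM query against the PACS rather than a list comprehension, but the control flow, structured-plus-text filtering first, image retrieval second, is the same.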

Based on our preceding strategic goals and broader aims, our specific aims in this thesis are designed to address the following specific obstacles, as well as the gaps in knowledge where innovation is needed:

1) Develop integrative interfaces that enable correlative analysis between phenotypic data (e.g., radiographic images, clinical disease manifestations) and laboratory data (e.g., pathology), thereby improving clinical and translational research discovery throughput.

2) Provide HIPAA and local IRB compliance at the data sources, so that researchers can serve themselves while performing cost-effective and timely research.

3) Evaluate the usability of aims 1 and 2, and provide applications around motivating use cases.

4) Demonstrate integrative image search capabilities to support development of image analysis techniques around motivating use cases.

1.3 Organization of the Thesis

For the remainder of this thesis, we frame our discussions around the motivating use case of pulmonary diseases, which are treated at the OSUMC Davis Heart and Lung Research Institute (DHLRI), where pulmonary medicine researchers work closely with pulmonary patients. From a personalized health care perspective, research studies are conducted and data are gathered at both the genotypic and phenotypic levels. Institutional data gathered on pulmonary patients include microarray, tissue, clinical notes such as radiology and pathology reports, PFTs, as well as image data. At the phenotypic level, the combination of CT imaging studies and PFTs is the gold standard for diagnosing and staging diseases such as sarcoidosis; and the integration with collected genotypic-level data (i.e., gene expression data extracted from pathology slides) is, in many cases, a necessity for precise classification and a better understanding of the disease's diagnostic and prognostic characteristics. Based on our preceding specific aims and our motivating use case, we have defined the following specific aims:

Aim 1: To provide an integrative, open source framework where data from basic and clinical research data sources can be accessed and visualized from a unified interface. We address this aim in Chapter 2.

Aim 2: To provide HIPAA and local IRB compliance at the data sources, so that researchers can serve themselves while performing cost-effective and timely research through secondary use of clinical data. We address this aim in Chapter 3.

Aim 3: To demonstrate applications of aims 1 and 2 around the motivating case of pulmonary diseases. We address this aim in Chapter 4.

Aim 4: To further demonstrate integrative image search capabilities to support development of image analysis techniques for the motivating case of pulmonary diseases. We address this aim in Chapter 5.

CHAPTER 2

A KNOWLEDGE-ANCHORED INTEGRATIVE IMAGE SEARCH AND RETRIEVAL SYSTEM

In this chapter, we introduce a framework that is the first of its kind. This data integration methodology has significant impact on research applications and beyond. Being able to query and retrieve clinical images and related metadata from PACS in bulk allows for responding to crucial queries related to research, operations, quality, and safety, as well as business intelligence. In addition, it provides timely and cost-effective access to clinical data for secondary use. This framework enables otherwise unanswerable queries for many different types of users, such as researchers, business administrators, system administrators, management engineers, medical students, and clinicians. The main use cases picked as examples in this chapter are deliberately simplified for readability; in real-world scenarios, much more complex cases can be addressed with this methodology. For example, information related to patients' scheduling (operational, business intelligence), medications, lab values, allergies (clinical), and payor (for clinical trials, clinical research) can easily be requested together as part of a query that corresponds to subjects on cancer clinical trials. Or, scheduling information along with radiation strength, duration, and time stamps for each image series can easily be requested for a study involving quality assessment of PET/CT scans and patient safety. Whether it is for research, education, business intelligence, operations, or safety, gathering data sets and the statistical information derived from them is crucial for institutions.

2.1 Summary

Clinical data that may be used in a secondary capacity to support research activities are regularly stored in three significantly different formats: 1) structured, codified data elements; 2) semi-structured or unstructured narrative text; and 3) multimodal images. In this chapter we describe the design and testing of a computational system that is intended to support the ontology-anchored query and integration of such data types from multiple source systems. Additional features of the described system include: 1) the use of Grid services-based electronic data interchange models in order to enable the use of our system in multisite settings; and 2) the use of a software framework intended to address both potential security and patient confidentiality concerns that arise when transmitting or otherwise manipulating potentially privileged personal health information. We frame our discussion within the specific experimental context of the concept-oriented query and integration of correlated structured data, narrative text, and images for cancer research.

2.2 Introduction

Within academic medical centers, large volumes of multidimensional, heterogeneous data are collected on an ongoing basis. These data include patient clinical, financial, and operational information. Clinical data can take many forms, including but not limited to: 1) structured data; 2) narrative text; and 3) images [2]. While such data are readily available to clinical providers and administrators, retrieving the same data for secondary uses is often a challenge, generally due to concerns over patient privacy and confidentiality. Moreover, data in operational systems are not necessarily organized in a way that supports integrative longitudinal or class-based query and analysis. Hence, even when patient privacy and confidentiality concerns are sufficiently addressed, it often remains challenging to query and access such data for secondary use. One technique institutions have used to address this problem is extracting clinical, operational, and financial data from source systems and storing them in an enterprise data warehouse whose data model is optimized for longitudinal and/or class-based queries [4, 5]. Imaging data sets are not usually stored in such warehouses due to concerns over storage capacity, and are commonly stored and managed using a PACS. Yet, for purposes of clinical [6], translational, and educational research [7, 9, 10], the retrieval of image data integrated with structured and text data is highly desirable [7, 8].

There are several examples in the current literature of integrative and ontology-anchored image search or query tools [3, 11–13]; however, to the best of our knowledge, none of these tools has been shown to also support the simultaneous query and subsequent integration of image data sets with structured data and narrative text. We believe that enabling the execution of such truly integrative, ontology-anchored queries across multiple data types is critical to the ability to execute highly effective clinical and translational research. Therefore, in order to address the preceding gap in knowledge, we have formulated a model computational system that supports the integrative query of structured data, narrative text, and image data sets in support of research activities. This system has also been designed to address the challenges posed by regulatory compliance, patient privacy/confidentiality concerns, and the need to facilitate multicenter research paradigms. The model we describe is motivated by two types of research-oriented end-users, specifically: 1) clinical researchers who need to perform queries such as “find all patients with brain CT and MRI images who have a prior medical history of congestive heart failure, high blood pressure, and diabetes”; and 2) imaging informatics researchers who need to obtain image sets defined by both a specific anatomic location and phenotypic parameters in order to evaluate techniques such as computer-aided diagnosis algorithms [14]. In either scenario, such a query requires the ability to locate patients with the specified phenotypic parameters, utilizing some combination of structured data and narrative text, and then subsequently to query a PACS to identify and obtain image sets for those patients that are potentially generated by one or more modalities (e.g., CT, MRI) and that correspond to the desired anatomic location(s).

Given the preceding motivation and use cases, in the following sections of this chapter we: 1) provide pertinent and contributing background material concerning: a) information needs in the clinical and translational research domains; b) Grid-computing electronic data interchange platforms; c) ontology-anchored information retrieval techniques; d) image retrieval tools; e) applicable regulatory compliance and patient privacy/confidentiality concerns that must be addressed when using clinical data for research purposes; and f) the specific experimental context for our model formulation, the Ohio State University Medical Center (OSUMC) Information Warehouse (IW); 2) describe the methods and system design approaches used for our model formulation process; 3) report upon initial feasibility evaluation results relating to the described model; and 4) describe the implications, limitations, and next steps for our work.

2.3 Background and Significance

In this section, we summarize several areas that contribute to or otherwise inform the model formulation work reported in later sections of the chapter.

Figure 2.1: Illustration of the translational research information-flow model.

2.3.1 Information Needs in the Clinical and Translational Research Domains

The relationship between the information needs of clinical and translational researchers and currently available information technology (IT) can be understood by conceptualizing the translational research process (which subsumes clinical research) as a sequential information-flow model, as illustrated in Figure 2.1 (adapted from [15]).

At each stage in this model, a combination of dual-purpose and research-specific IT systems may be utilized. Examples of such systems that can support translational research include: 1) literature search tools such as PubMed and OVID [16]; 2) protocol authoring [17] and data mining tools [18]; 3) simulation and visualization tools [19]; 4) research-specific web portals [20]; 5) electronic data collection or capture tools [21]; 6) participant screening tools [22]; 7) electronic health records (EHRs) [23]; 8) computerized physician order entry (CPOE) systems [24]; 9) decision support systems [25]; and 10) picture archiving and communication systems (PACS) [26, 27].

Numerous reports have described increased translational capacity, improved data quality, and decreased clinical trial protocol deviations resulting from the use of such IT systems [19, 21].

In addition, the use of IT [28] in studies involving multiple, geographically distributed research sites has been shown to have demonstrable benefits in terms of increased efficiency and decreased resource requirements [24, 29]. However, despite the promise that the integration of research-specific and dual-purpose IT systems holds for the translational research enterprise, institutional informatics infrastructures providing such integration are generally absent, an outcome largely attributed to socio-technical factors [30]. Given such problematic adoption of IT in the translational research domain, it is critical for new systems and technology frameworks to provide significant value to investigators and research staff in order to overcome potential barriers to acceptance.

A particular concern in the context of translational research and the utilization of information technology to support such activities is the need to maintain proper security and confidentiality of privileged or protected health information (PHI). In many cases, providing for such security, confidentiality, and ethical conduct of research requires the provision of partially or completely de-identified data sets (including imaging data) to researchers.

In general, two overriding frameworks exist that dictate such needs: Institutional Review Boards (IRBs) and the Health Insurance Portability and Accountability Act (HIPAA), as summarized briefly below:

1) Institutional Review Boards (IRBs) are federally mandated oversight bodies responsible for monitoring, approving, and ensuring the regulatory compliance of human-subjects research (of note, Institutional Animal Care and Use Committees, IACUCs, also exist to oversee animal research).

2) The Health Insurance Portability and Accountability Act (HIPAA) includes compulsory security standards pertaining to the protection and use of a well-defined collection of privileged data elements known as protected health information (PHI) [31, 32]. HIPAA guidelines mandate the removal of 18 specified identifiers from PHI prior to its research use. Significant challenges exist when attempting to transact such PHI to support research operations, especially when employing technologies such as the Grid-based electronic data interchange models in the literature [33–37]. These challenges include ensuring that appropriate access controls are maintained throughout a distributed architecture, certifying that consumers of such data have a valid and documented purpose for accessing PHI, and maintaining the confidentiality of such data while in transit or stored throughout a potentially heterogeneous computing environment. An example of a HIPAA-compliant research data repository that contains both image data and phenotypic data is the RIDER (Reference Image Database to Evaluate Response) archive of CT scans for lung cancer patients [10].

2.3.2 Grid-computing Electronic Data Interchange Platforms

We define a computational Grid, per the convention used by Foster and Kesselman, as “a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities concerned, above all, with large-scale pooling of resources, whether compute cycles, data, sensors, or people” [38]. Grid computing has become the new means for creating distributed infrastructures and virtual organizations for multi-institutional research and enterprise applications. The use of Grid computing has evolved from being a platform targeted at large-scale computing applications to a new architectural paradigm for sharing information, data, and software as well as computational and storage resources. A vast array of middleware systems and toolkits has been developed to support the implementation of Grid computing infrastructures. These include middleware and tools for service deployment and remote service invocation, security, resource monitoring and scheduling, high-speed data transfer, metadata and replica management, component-based application composition, and workflow management. Central to the ability to develop such Grid computing middleware and toolkits is the existence of widely accepted standards. The Grid services framework builds on and extends Web Services for scientific applications. It defines mechanisms for additional features such as stateful services, service notification, and management of service/resource lifetime. A primary example of a Grid computing initiative focused on the biomedical domain is the National Cancer Institute’s (NCI) Cancer Biomedical Informatics Grid (caBIG, https://cabig.nci.nih.gov/) program, which uses a Grid computing infrastructure, named caGrid, to provide a common, extensible, collaborative platform for data interchange and analysis between cancer researchers and institutions.

2.3.3 Knowledge-anchored Information Retrieval

Ontologies and terminologies (also described as vocabularies) are a type of conceptual knowledge collection, comprising definitions of both atomic units of knowledge (e.g., facts) and the network of hierarchical and/or semantic relationships between those atoms [39]. Ontologies can be either formal or semi-formal, which corresponds to their level of ontological commitment: the degree to which the output of a computational agent’s reasoning based upon the ontology is consistent with the ontology’s definition and structure. Controlled terminologies or vocabularies that do not satisfy the preceding definition of what constitutes an ontology, but that do contain definitions of concepts and hierarchical relationships between such definitions, are another example of a conceptual knowledge collection. Instances of ontologies in the biomedical domain include SNOMED-CT and the NCI Thesaurus, while commonly used controlled terminologies include ICD9-CM and CPT. A frequently employed resource when reasoning over the content of such ontologies and terminologies and their potential interrelationships is the Unified Medical Language System (UMLS), a metathesaurus of over 100 biomedical ontologies and vocabularies incorporating in excess of 1 million concepts and 4 million synonyms, which is maintained by the National Institutes of Health (NIH) National Library of Medicine (NLM) [40–42]. The primary benefit of using the UMLS is the ability to identify a concept of interest and subsequently discover synonymous definitions for that concept in multiple source ontologies or terminologies. In addition, it is also possible to reason upon all possible hierarchical and semantic relationships between concepts in the UMLS, based upon the relational structures subsumed from the included source ontologies and terminologies. Such conceptual knowledge collections are highly useful when performing information retrieval tasks, as they allow for object-oriented, semantic, or class-based definitions or expansions of queries. For example, if one were to pose a query of the following type (represented in pseudo-code): “SELECT ALL WHERE PROCEDURE = ’MRI’ AND ANATOMIC LOCATION = ’lower extremity’”, it would be difficult to identify and return such patients, since it is highly unlikely that such imaging procedures would be codified and stored as having an anatomic location of “lower extremity.” Instead, it is likely that such procedures would be codified as having an anatomic location such as “foot”, “ankle”, “lower leg”, or “knee.” However, if the concept “lower extremity” is contained in an ontology or terminology, and is related (hierarchically and/or semantically) to concepts including the preceding specific anatomic locations (e.g., “foot”, etc.), then by reasoning upon the contents of that knowledge collection, the earlier query could be expanded to the type “SELECT ALL WHERE PROCEDURE = ’MRI’ AND ANATOMIC LOCATION IN (’foot’, ’ankle’, ’lower leg’, ’knee’).”
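The query expansion described above can be sketched in a few lines. The concept hierarchy below is a toy stand-in for a UMLS-style knowledge collection; apart from the anatomic terms taken from the example, the names and the flat-dictionary representation are hypothetical illustrations, not the actual system.

```python
# Illustrative sketch of ontology-anchored query expansion. HIERARCHY is a
# hypothetical stand-in for a UMLS-style knowledge collection: parent concept
# mapped to its narrower (child) concepts.
HIERARCHY = {
    "lower extremity": ["foot", "ankle", "lower leg", "knee"],
}

def expand_concept(concept, hierarchy):
    """Return the concept plus all of its descendants in the hierarchy."""
    expanded = [concept]
    for child in hierarchy.get(concept, []):
        expanded.extend(expand_concept(child, hierarchy))
    return expanded

def build_query(procedure, anatomic_concept, hierarchy):
    """Rewrite a single-concept predicate as an expanded IN-list (pseudo-SQL)."""
    terms = expand_concept(anatomic_concept, hierarchy)
    in_list = ", ".join(f"'{t}'" for t in terms)
    return (f"SELECT ALL WHERE PROCEDURE = '{procedure}' "
            f"AND ANATOMIC_LOCATION IN ({in_list})")

print(build_query("MRI", "lower extremity", HIERARCHY))
```

In a production setting the hierarchy traversal would instead walk the parent-child relations of the UMLS, but the expansion step itself has exactly this shape.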

In the context of our model system formulation, there is a particular focus on the use of controlled terminologies, specifically ICD9-CM (the Clinical Modification of the International Classification of Diseases, Ninth Revision). Within the United States, ICD9-CM is the most commonly used medical coding system for procedures and diagnoses. Such codes are manually assigned to patient records by medical and billing experts based on numerous information sources, including: 1) clinician-provided progress and/or diagnostic reports (usually consisting of free text); 2) quantitative findings such as laboratory data; and 3) prior medical history (in both coded and non-coded forms) [43, 44]. A patient encounter (which could encompass either an outpatient visit or a multiday inpatient admission) can be characterized by multiple diagnosis and procedure codes, usually summarized by what is known as a primary code, which represents the motivation for that encounter (e.g., the diagnosis or procedure which incurred the encounter). ICD9-CM codes are hierarchically organized. For example, the codes between 160 and 165 correspond to malignant neoplasms of respiratory and intrathoracic organs, with associated subdiagnoses of such neoplasms being indicated using decimalized variants of the codes, such as code 162.2, which specifically defines primary disease in the upper lung lobes.
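The hierarchical organization of ICD9-CM makes class-based selection straightforward: every code in a category block shares a numeric stem, so decimalized subdiagnoses can be captured with a simple range test. A minimal sketch, with fabricated patient records:

```python
# Class-based retrieval over hierarchical ICD9-CM codes: select every record
# whose code falls in the 160-165 block (malignant neoplasms of respiratory
# and intrathoracic organs), including decimalized subdiagnoses such as 162.2.
def in_icd9_range(code, low, high):
    """True if the integer category part of an ICD9-CM code lies in [low, high]."""
    try:
        category = int(code.split(".")[0])
    except ValueError:
        return False  # non-numeric stems (e.g., V/E codes) fall outside this block
    return low <= category <= high

# Fabricated (patient, diagnosis code) pairs for illustration only.
records = [
    ("patient-a", "162.2"),   # within the 160-165 block
    ("patient-b", "428.0"),   # congestive heart failure: excluded
    ("patient-c", "165.9"),   # within the 160-165 block
]

matches = [pid for pid, code in records if in_icd9_range(code, 160, 165)]
print(matches)  # → ['patient-a', 'patient-c']
```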

In addition to the use of ICD9-CM, our model system formulation also employs an advanced type of knowledge-anchored information retrieval known as text mining. At the most basic level, text mining involves the use of a computational agent, usually informed by one or more conceptual knowledge collections, in order to parse (i.e., decompose text into constituent components at one or more levels of granularity, ranging from paragraphs to words) and tag (i.e., apply a codified concept identifier to a parsed component that is representative of its lexical and/or semantic meaning) narrative free text. Such parsed and tagged text can then be queried as structured data. An example of the current state of the art in biomedical text mining tools is the open-source MetaMap Transfer (MMTx) application provided by the NLM, which uses the contents of the UMLS as a knowledge source to inform the parsing and tagging of biomedical narrative text (codifying such text in terms of UMLS concepts) [45–48]. In addition to open-source applications, numerous commercial database vendors such as IBM, Microsoft, and Oracle provide free-text indexing and mining capabilities within the scope of their database management systems [49–51]. However, these capabilities are usually limited to keyword-based searches. One exception is the free-text search functionality provided by Oracle, which does have limited thesaurus-based capabilities. Yet, in the case of the thesaurus-based text mining functionality provided by Oracle, there is a noticeable lack of available biomedical thesauri that can be utilized for such purposes. Additional examples of commercial text mining tools include: 1) IBM’s UIMA [52], which employs an ontology-anchored approach to concept tagging; and 2) Vivisimo [53], which supports concept-oriented search of tagged narrative text, where such tagging can be informed by the use of commonly available ontologies or terminologies.
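A parse-and-tag pipeline of the kind described above can be caricatured as follows. The concept dictionary and the identifiers in it are invented for illustration; they stand in for lookups against the UMLS, and this sketch is in no way a substitute for MMTx itself.

```python
# Minimal parse-and-tag sketch in the spirit of MMTx-style text mining.
# CONCEPTS maps dictionary phrases to invented concept identifiers; a real
# pipeline would resolve phrases against the UMLS instead.
CONCEPTS = {
    "chronic obstructive lung disease": "CONCEPT-0001",  # hypothetical ID
    "lung nodule": "CONCEPT-0002",                       # hypothetical ID
}

def parse_and_tag(text, concepts):
    """Scan normalized narrative text and tag any dictionary phrase it contains."""
    normalized = text.lower()
    return [(phrase, concept_id)
            for phrase, concept_id in concepts.items()
            if phrase in normalized]

report = "Patient has chronic obstructive lung disease."
print(parse_and_tag(report, CONCEPTS))
# → [('chronic obstructive lung disease', 'CONCEPT-0001')]
```

Once tagged in this manner, the (phrase, concept) pairs can be stored in a relational table and queried exactly like any other structured data element.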

2.3.4 Image Retrieval Tools

In the case of medical images, the most commonly used storage repositories are Picture Archiving and Communication Systems (PACS). Most PACS support the Digital Imaging and Communications in Medicine (DICOM) standard [54] (currently version 3). In PACS that are compliant with the DICOM standard, medical images are stored and retrieved using patient metadata (e.g., descriptors such as the medical record number, MRN). The primary focus of image retrieval functionality within modern PACS is to support conventional clinical operations and not necessarily research requirements. An example use case in terms of clinical operations would be a radiologist querying the system for a patient’s latest MRI or CT scan using the patient’s MRN, and then reviewing the imagery for the related visit.
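The clinical retrieval use case just described amounts to a metadata query: given an MRN and a modality, find the most recent matching study. A toy in-memory sketch follows; the study list stands in for the DICOM metadata a real PACS would expose, and the field names are illustrative rather than actual DICOM attribute names.

```python
# Toy sketch of MRN-based study retrieval. The in-memory list of study
# records is a stand-in for metadata that a DICOM-compliant PACS would
# expose; all values are fabricated.
from datetime import date

studies = [
    {"mrn": "123456", "modality": "CT",  "date": date(2010, 3, 1)},
    {"mrn": "123456", "modality": "MRI", "date": date(2010, 9, 15)},
    {"mrn": "123456", "modality": "MRI", "date": date(2009, 1, 7)},
    {"mrn": "654321", "modality": "MRI", "date": date(2010, 5, 2)},
]

def latest_study(mrn, modality, studies):
    """Return the most recent study for a patient and modality, or None."""
    hits = [s for s in studies if s["mrn"] == mrn and s["modality"] == modality]
    return max(hits, key=lambda s: s["date"]) if hits else None

print(latest_study("123456", "MRI", studies))
```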

Recently, there have been efforts to improve the image retrieval process in order to support image-related research [36, 37, 55–57], as well as to enable better integration of imaging data with Electronic Health Records (EHRs) [58–60]. However, the preceding efforts are still relatively immature, and wide-scale adoption of such tools and processes is limited.

Research uses of most clinical data, including imaging data, present additional challenges beyond those associated with clinical uses, in particular because of the frequent need to de-identify such data sets. In order to de-identify imaging data sets for research purposes, the commonly employed practice is to generate a de-identified duplicate of the desired DICOM data as found in the source, production PACS, and then store the duplicate in a research-specific PACS instance [10, 57, 61]. While this approach is feasible for the retrieval and use of well-defined cases for a given experimental context, it does not readily support integration with, and subsequent retrieval and de-identification of, related imaging and phenotypic data.
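The duplicate-and-de-identify workflow can be sketched as below. The tag names are a small illustrative subset of the HIPAA identifiers (not the complete list), and the record is a plain dictionary rather than a real DICOM data set; a production tool would operate on actual DICOM attributes.

```python
# Sketch of the duplicate-and-de-identify practice: copy a record's metadata,
# blank direct identifiers, and assign a research pseudonym, leaving the
# source object untouched. Tag names are an illustrative subset only.
import copy

IDENTIFYING_TAGS = {"PatientName", "PatientID", "PatientBirthDate", "PatientAddress"}

def deidentify(dicom_metadata, pseudonym):
    """Return a de-identified duplicate; the source record is not modified."""
    duplicate = copy.deepcopy(dicom_metadata)
    for tag in IDENTIFYING_TAGS:
        if tag in duplicate:
            duplicate[tag] = ""
    duplicate["PatientID"] = pseudonym  # research-specific surrogate identifier
    return duplicate

source = {"PatientName": "Doe^Jane", "PatientID": "123456",
          "PatientBirthDate": "19501231", "Modality": "CT"}
research_copy = deidentify(source, "SUBJ-0001")
print(research_copy)
```

Note that the duplicate retains non-identifying attributes (here, the modality), which is what allows the research copy to remain useful for integrative queries.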

2.3.5 OSUMC Information Warehouse

The Information Warehouse (IW) at The Ohio State University Medical Center (OSUMC) is a comprehensive repository integrating data from over 80 clinical, operational, and research systems throughout the institution. The IW serves a broad variety of customers in all mission areas at OSUMC, including: 1) clinical operations; 2) administration; 3) education; and 4) research. Content in the IW includes structured medical and financial data, clinical free-text reports, tissue and genomic data, and limited numbers of medical images. Data stored in the IW can be queried from and presented to end-users in identifiable, partially de-identified, or completely de-identified forms, depending on applicable IRB and institutional policies and requirements. As described earlier, the IW does include medical free text from sources such as radiology and pathology reports, as well as structured and codified data elements such as age, sex, and diagnosis. However, imaging data (e.g., PET, CT, MRI) are stored in a separate picture archiving and communication system (PACS), with only limited physically duplicated image data within the IW. This physical separation of most image data makes the integrative query of multiple data types throughout OSUMC, including image data, extremely challenging. It should be noted that this problem is not unique to this IW; it is quite common in other IWs.

2.4 Methods

As stated at the outset of this chapter, we report upon the development and implementation of a model system intended to enable the integrative, knowledge-anchored query of multiple information types, including structured, narrative text, and image data, in support of research requirements. Our model system is based upon a framework [62–64] that is intended to provide a convenient interactive environment for such integrative data retrieval tasks (Figure 2.2).

Figure 2.2: Conceptual model of our software framework and model system implementation.

The resulting system that was implemented based upon this framework allows end-users to retrieve images based on characteristics defined in correlative text and structured data elements stored within the OSUMC IW. Such retrieval and presentation of image data sets using phenotypic context derived from narrative text and structured data involves the handling of all data types related to the image, including the image itself as well as associated heterogeneous, multidimensional textual and structured data. In order to enable such a data handling process, our framework leverages knowledge-anchored information retrieval techniques, specifically combining the UMLS Metathesaurus (UMLS MT) [65, 66] knowledge collection with text mining platforms incumbent to the Oracle database management system employed by the IW. The use of the UMLS MT allows the expansion of simple keyword-based queries into concept-oriented queries that can then be applied to the contents of clinical narrative text and structured data elements such as diagnosis codes. Due to the research orientation of our framework, we have also incorporated mechanisms to partially or completely de-identify the PHI contained in any returned data set. This provides compliance with the applicable regulatory requirements, including HIPAA. Finally, in order to allow for the elegant evolution of our framework in light of constantly emerging technology platforms and standards, as well as the need to enable research activities that span institutional and geographic boundaries, our framework incorporates a multitiered service-oriented architecture (SOA, Figure 2.3).

A primary benefit of this multitiered SOA is the ability to utilize emergent, research-oriented electronic data interchange platforms, such as the previously introduced caGrid middleware [67].

2.4.1 Motivating Use Case

For the remainder of this chapter, we will frame our discussion using the following motivating use case: An investigator is interested in lung cancer patients. In addition to images, the investigator would like to access findings in both radiology and pathology reports, as well as diagnosis codes, for each patient from whom images are obtained. The specific findings of interest in the radiology reports should mention lung nodules, while the pathology reports should mention adenocarcinoma.

The interaction between our system (Figure 2.4) and the researcher (i.e., end-user) incorporates the following components:

Query Construction

The construction of a query intended to identify and retrieve patients that meet the given criteria can be decomposed into two parts: 1) a structured data search query; and 2) a free-text search query.

Figure 2.3: The multitier implementation of our framework provides a means to implement a Grid-based or web-based electronic data interchange platform within a service-oriented architecture. End-user access to privileged data is managed in one or more ways: 1) a fully privileged user within the institutional firewall (left-hand side) can access the data in any way preferred; 2) an external user (located outside the firewall, right-hand side) is subject to additional restrictions in order to access data; and 3) the multitier service-oriented approach allows for the easy deployment of custom services.

Figure 2.4: By first querying for available metadata in one or more relational databases, users are able to identify patients of interest; corresponding images can later be retrieved from a PACS.

Structured data search query: In this case, the user is looking for ICD9-CM codes that correspond to different types of lung cancer. At this stage, if the user already knows the codes for lung cancer (e.g., 162, 162.2, etc.), he or she can provide those codes. For users not familiar with the ICD9-CM coding system, a code lookup facility is provided, informed by the UMLS knowledge collection. This lookup system supports keyword-based searches. Once ICD9-CM codes are entered or selected, construction of the structured query is complete.
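The code lookup facility described above can be illustrated as a keyword search over a term index. The index entries below are fabricated examples; the actual system derives such mappings from the UMLS knowledge collection rather than a hand-written table.

```python
# Sketch of a keyword-based ICD9-CM code lookup. CODE_INDEX is a fabricated
# miniature term index standing in for mappings derived from the UMLS.
CODE_INDEX = {
    "162":   "malignant neoplasm of trachea, bronchus, and lung",
    "428.0": "congestive heart failure",
    "401":   "essential hypertension",
}

def lookup_codes(keyword, index):
    """Return sorted (code, description) pairs whose description mentions the keyword."""
    keyword = keyword.lower()
    return sorted((code, desc) for code, desc in index.items() if keyword in desc)

print(lookup_codes("lung", CODE_INDEX))
```

The user would then pick from the returned candidates, completing the structured half of the query without needing prior knowledge of the coding system.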

Free-text search query: The free-text query in this instance requires the assessment of two different report types, radiology and pathology, using the initial keywords “lung nodules” and “carcinoma”, respectively. The system end-user is provided the ability to expand these keywords using the contents of the UMLS knowledge collection. Once the end-user has identified the report types, entered the search keywords, and expanded the search scope per the contents of the UMLS, the construction of the free-text search query is complete.

During the construction of both query types (free-text and structured), the presentation tier solicits end-user data entry as appropriate and passes that data to a Grid service in the application tier. This Grid service then interacts with a local UMLS instance that is available via the data sources tier. Asynchronous calls in the presentation tier (e.g., the user interface) allow these transactions to occur in parallel, so that the user can construct both queries simultaneously.

Query Execution

Once the user is satisfied with the queries as constructed, he or she can execute them to identify matching patients or cases. At this point in the system workflow, the presentation tier passes the query to the application tier, which in turn interacts with the data sources tier. All queries are executed in parallel: queries on diagnosis codes for lung cancer, free-text queries for lung nodules (radiology reports), and free-text queries for adenocarcinoma (pathology reports) all return results simultaneously. Once all available results have been returned, the presentation tier joins them and presents them to the user.
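The parallel execution and join step might be sketched as follows, with stub functions standing in for the three sub-queries against the data sources tier; the patient identifiers and result sets are fabricated, and the intersection shown is one plausible join semantics (all criteria must match).

```python
# Sketch of parallel sub-query execution with an intersection join. The three
# stubs stand in for calls into the data sources tier; their results are
# fabricated patient-identifier sets.
from concurrent.futures import ThreadPoolExecutor

def diagnosis_query():
    return {"p1", "p2", "p3"}   # patients with lung-cancer ICD9-CM codes

def radiology_text_query():
    return {"p2", "p3", "p4"}   # radiology reports mentioning lung nodules

def pathology_text_query():
    return {"p3", "p4", "p5"}   # pathology reports mentioning adenocarcinoma

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(q) for q in
               (diagnosis_query, radiology_text_query, pathology_text_query)]
    results = [f.result() for f in futures]

# Join: keep only patients satisfying all three criteria.
matching_patients = set.intersection(*results)
print(matching_patients)  # → {'p3'}
```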

Figure 2.5: The interface provides interactive assistance to users in order to map keywords used during query formulation to appropriate diagnosis codes (ICD9-CM). Once a query is executed, users may browse the result sets using a hierarchical “drill down” model.

Data Browsing and Retrieval

Upon completion of the preceding phase of the system workflow, end-users may browse the returned patients or cases. Once a patient or case is selected, the presentation tier calls the Grid services in the application tier in order to retrieve the patient- or case-specific data. Instead of presenting abstract results for such a query, the radiology report containing the concept “lung nodule”, the pathology report containing the concept “adenocarcinoma”, any structured, codified data pertaining to the “diagnosed with lung cancer” characteristic, and finally the corresponding images are presented collectively for further evaluation and review (Figure 2.5).

2.4.2 Three-Tiered Software Framework

There are several complementary goals within this framework. The realization of all these goals requires the use of components that satisfy both infrastructure constraints (e.g., available platforms and prevailing data interchange standards) and institutional or regulatory requirements as they pertain to the use of PHI for research purposes. The tiered architecture is one of the common techniques used today for the separation of presentation, application, and data [68, 69] in web-based applications. We utilize this approach within our framework to achieve flexibility and efficiency in terms of development, deployment, and management. The framework used in our model system (Figure 2.3) consists specifically of the following three tiers:

1) Presentation Tier: provides end-user interaction capabilities based upon the features of the framework via web interfaces;

2) Application Tier: utilizes standards-compliant Grid services to interact with and apply logic to multiple, heterogeneous data sources;

3) Data Sources Tier: supports access to data sources such as relational database management systems and PACS.

In the following subsections, detailed descriptions of the design approaches and functionality pertaining to each tier are provided.

Presentation Tier

As stated previously, the primary goal of this framework and the resulting system is to provide simple yet flexible ways of identifying and retrieving images from a PACS that are characterized by one or more heterogeneous sources of phenotypic data. Simplicity is delivered through a unified user interface that handles all available data types, while flexibility is demonstrated by the expandability of queries. Our user interface gives the end-user the flexibility to execute such queries by first allowing searches on existing sources of phenotypic data such as free-text reports and diagnosis codes (or any other patient-related data). Once the patients are identified according to all available metadata, the PACS is then queried and the associated images are retrieved (Figure 2.5). Since the number of retrieved images depends on the number of identified patients, proper utilization of existing free text (radiology and pathology reports) and structured data (diagnosis codes) in the search criteria is crucial. This framework brings together the knowledge-anchored free-text capabilities provided by the UMLS knowledge collection and the text indexing and keyword search capabilities of Oracle. The utilization of the UMLS and Oracle’s text indexing allows our framework to provide interactive query expansion capabilities on free-text documents, along with assistance on diagnosis code selection, through an interactive web-based user interface. Selected images are retrieved using PixelMed, an open-source Java-based toolkit [70]. The web interface component of the presentation tier enables users to construct and execute queries based upon both free-text (e.g., radiology and pathology) and structured data elements (e.g., diagnosis codes) in an interactive fashion. Once patients are identified based upon the specified query parameters, their images along with radiology reports, pathology reports, and diagnosis codes (ICD9-CM) are presented to the users. During the preceding query construction and result set presentation process, end-users interact with and are assisted by the presentation layer in the following ways:

UMLS-based text query expansion: Within the IW, free-text reports are stored in relational databases (Oracle, version 10gR2), where they are indexed for fast text searches.

However, these indexes only allow keyword-based searches. As introduced earlier, we have adopted a combination of the UMLS Metathesaurus (MT) and MetaMap (MMTx) in order to convert users' keyword queries into conceptual queries. The UMLS MT allows us to present alternatives to or expansions of given query criteria. For example, the user might enter the search term "melanocarcinoma" when searching the contents of pathology reports, but would receive few results due to the infrequent use of that specific term in such text (in the test dataset to be described later in the evaluation section of this chapter, such a search returns no text reports). However, if this keyword were expanded to include the synonym "melanoma" per the contents of the UMLS MT, the user would likely retrieve many more reports (again in our test dataset, such a search returns 2706 text reports with the specified term). The MMTx API allows us to parse clinically relevant terms and phrases that can then be expanded and used for query operations when the initial user-entered search criteria consist of larger text constructs (e.g., a sentence or paragraph). For example, the concept of "chronic obstructive lung disease" can be extracted from the sentence "Patient has chronic obstructive lung disease." and can subsequently be mapped to a corresponding UMLS concept.
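The expansion behavior described above can be illustrated with a small, self-contained sketch. The synonym table and sample reports below are toy stand-ins: the real system resolves synonyms through the UMLS Metathesaurus and MMTx, and searches an Oracle text index rather than scanning strings in memory.

```python
# Toy sketch of UMLS-style query expansion over free-text reports.
# SYNONYMS is a hypothetical stand-in for the UMLS Metathesaurus.
SYNONYMS = {
    "melanocarcinoma": ["melanoma"],
    "lung": ["thoracic", "chest"],
    "nodule": ["lesion"],
}

def expand(term):
    """Return the user's term plus any Metathesaurus-style synonyms."""
    return [term] + SYNONYMS.get(term.lower(), [])

def search_reports(reports, term, use_expansion=True):
    """Return reports containing the term or, optionally, its expansions."""
    keywords = expand(term) if use_expansion else [term]
    return [r for r in reports if any(k in r.lower() for k in keywords)]

reports = [
    "Pathology consistent with melanoma, left shoulder.",
    "No evidence of melanoma recurrence.",
    "Benign nevus; no malignancy identified.",
]

# Without expansion the rare term matches nothing; with expansion the
# synonym "melanoma" recovers the relevant reports.
print(len(search_reports(reports, "melanocarcinoma", use_expansion=False)))  # 0
print(len(search_reports(reports, "melanocarcinoma", use_expansion=True)))   # 2
```

As in the "melanocarcinoma" example above, the end user reviews and approves the suggested expansions before the search is executed.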

UMLS-based diagnosis code selection: When users are unsure which ICD9-CM code to choose when querying structured data elements that are encoded using that terminology, context-specific assistance that enables those users to select appropriate code(s) is provided.

Specifically, during our UMLS MT installation, controlled vocabularies and dictionaries related to ICD9-CM codes are included. This allows us to provide hierarchical searches on ICD9-CM codes initiated by user-supplied text. For example, during the construction of a query, a user may not know the proper codes for a targeted type of lung cancer. In this case the user may use the assistance mechanism to navigate the ICD9-CM hierarchy, traversing through the concepts "neoplasm" and "lung cancer" in order to visualize and select the appropriate specific lung cancer codes.
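The hierarchy-assisted selection can be sketched as a walk over a nested code tree. The fragment of the ICD9-CM hierarchy below is hypothetical and heavily abridged; in the deployed system the hierarchy is materialized from the local UMLS MT tables and served interactively.

```python
# Minimal sketch of hierarchy-assisted ICD9-CM code selection.
# ICD9_TREE is an abridged, illustrative fragment of the hierarchy.
ICD9_TREE = {
    "neoplasm": {
        "lung cancer": {
            "162.3": "Malignant neoplasm of upper lobe, bronchus or lung",
            "162.5": "Malignant neoplasm of lower lobe, bronchus or lung",
        },
    },
}

def navigate(tree, path):
    """Walk the concept hierarchy along `path` and return the subtree."""
    node = tree
    for concept in path:
        node = node[concept]
    return node

# A user unsure of the proper code drills down through broader concepts
# and is shown the specific codes with their descriptions.
codes = navigate(ICD9_TREE, ["neoplasm", "lung cancer"])
for code, description in sorted(codes.items()):
    print(code, "-", description)
```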

Image handling and display: After constructing and executing queries, users may browse the result set using a hierarchical "drill-down" model. For example, the interface allows end users to drill down through multiple layers of granularity in a CT study, beginning with a series and ending at a single image. In addition, the presentation layer allows users to compare images within a series. Corresponding radiology or pathology reports are also displayed when retrieving and presenting image data (Figure 2.5).

The system's presentation layer is implemented using JavaServer Pages (JSP) running on Apache Tomcat. In addition, asynchronous calls to the application tier are used in order to provide a more dynamic human-computer interaction experience.

Application Tier

To support relational database connectivity sufficient to access both textual and structured data elements, and to support the use of the UMLS knowledge collection as described earlier, an Ontology Tools Package (OTP) was created within the application tier. In addition, to enable the query, retrieval, and manipulation of images from our PACS system, we utilized the functionality provided by the PixelMed open source JAVA toolkit. These two packages form the core of our application tier, and are designed and implemented to manage interaction between the presentation and data source tiers. Furthermore, these packages are wrapped as caGrid services, as described in the following sections.

Figure 2.6: The Ontology Tools Package (OTP) allows users to interact with a local UMLS knowledge collection instance, diagnostic databases, and text report databases. Users can query OTP with a keyword or textual phrase, which is then mapped to one or more concept codes derived from text mining approaches informed by the UMLS and using MMTx. Once users finalize a set of targeted search concepts, text reports that contain those concepts are queried and returned. Users can also retrieve ICD9-CM codes that correspond to their query keywords and use those codes to query tables containing structured data.

Ontology Tools Package (OTP): The OTP is a collection of functions from several existing APIs, including both the MMTx API from the National Library of Medicine and Oracle's JDBC drivers. Additionally, OTP allows for queries on diagnostic data tables to be executed and linked with UMLS MT expanded conceptual search queries on radiology and pathology reports. For performance considerations, local copies of controlled vocabularies and dictionaries from the UMLS MT are maintained and utilized by the OTP (Figure 2.6). When a user interacts with the relational databases targeted by our system, OTP's functions are utilized during the query construction, execution, and retrieval of the resulting datasets, as described below:

Query construction with OTP: During the construction of text queries, OTP takes free-text entered by end users, expands that free-text, and returns synonyms and other semantically relevant concepts through a process of reasoning upon the previously described local UMLS tables. However, the final selection of such search terms in order to define a conceptual text query is based upon end-user evaluation of such suggested expansions. Similarly, for the construction of diagnosis code-based queries, user-supplied free-text entries are expanded with the help of the UMLS. Instead of keywords, ICD9-CM codes and their descriptions are returned to the user. Again, end users decide which of the returned diagnosis codes are to be included with their query.

Query execution with OTP: During the query execution process, OTP executes the generated query against text and structured data tables, links the datasets, and returns identifiers for further retrieval of other correlated data such as radiology images contained in a PACS. This data linkage process is largely made possible through the use of some combination of the three common identifiers used to store and identify PHI at OSUMC (and which are generally analogous to those used in most common clinical information systems): Medical Record Number (MRN, used to identify an individual patient), Encounter Number (ENC, used to identify a patient-specific encounter from which an item of data is derived), and Accession Number (ACC#, used to identify the specific component or activity associated with an encounter from which an item of data is derived). When a user-constructed query is executed, OTP returns only identifiers or de-identified pointers to the data, depending on the access privileges assigned to a user for a specific project. It is important to note that at this stage, no data other than such identifiers are returned by OTP (e.g., no textual, structured, or image data are returned).
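The identifier-only linkage step can be illustrated with a minimal relational sketch. Here sqlite3 stands in for Oracle, the table and column names are hypothetical, and, mirroring the behavior described above, only identifiers (accession numbers) come back from the query.

```python
# Sketch of OTP-style query execution: link text hits and diagnosis hits
# on shared identifiers and return ONLY identifiers, never the data.
# sqlite3 stands in for Oracle; the schema below is illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE report_hits (mrn TEXT, enc TEXT, acc TEXT);
    CREATE TABLE dx_hits     (mrn TEXT, enc TEXT, icd9 TEXT);
    INSERT INTO report_hits VALUES ('P1','E1','A1'), ('P2','E2','A2');
    INSERT INTO dx_hits     VALUES ('P1','E1','162.3'), ('P3','E3','162.5');
""")

def execute_query(conn):
    """Return accession numbers for encounters matching BOTH criteria."""
    rows = conn.execute("""
        SELECT r.acc
          FROM report_hits r
          JOIN dx_hits d ON d.mrn = r.mrn AND d.enc = r.enc
    """).fetchall()
    return [acc for (acc,) in rows]

# Only patient P1 satisfies both the text and diagnosis criteria, so a
# single accession number is returned for later PACS retrieval.
print(execute_query(db))  # ['A1']
```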

Data retrieval with OTP: Once the end user selects a patient or group of patients for which to obtain further data, OTP manages the retrieval of all nonimaging data (with image data retrieval being managed using PixelMed). All the data tables that need to be accessed with this framework are indexed using the three previously defined identifiers: MRN, ENC, and ACC#. Therefore, on-demand access to any of the text or structured data elements does not present significant performance implications.

Use of PixelMed: PixelMed provides open source JAVA libraries for reading, writing, manipulating, and communicating DICOM objects. In addition, PixelMed provides a simple PACS and WADO (Web Access to DICOM Objects) server implementation. In our framework and model system, PixelMed is used for handling all operations related to DICOM objects, as follows:

DICOM query construction and execution: Once a user identifies which patient cases he or she wants to review, it is then necessary to query for and retrieve corresponding images from the targeted PACS. Our PACS system (AGFA, Impax 5.2) can be queried by Accession Number (ACC#). Here, the accession numbers retrieved from the OTP are formed into DICOM queries, which are subsequently executed through the use of standard DICOM messaging. Each ACC# is mapped to a DICOM Study Object within the PACS, supporting the retrieval of: 1) the entire imaging study; 2) any series contained in the study; or 3) any single image from any series.
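Conceptually, each accession number is placed into a query identifier at the desired query/retrieve level. The sketch below uses plain dictionaries keyed by standard DICOM attribute names purely for illustration; the actual DICOM messaging is handled by PixelMed and is not shown here.

```python
# Illustrative sketch of forming an accession number into a
# C-FIND-style query identifier. QueryRetrieveLevel, AccessionNumber,
# and StudyInstanceUID are standard DICOM attribute keywords; the
# dictionary representation itself is a simplification.
def build_query(accession_number, level="STUDY"):
    """Build a key/value query identifier at the given retrieve level."""
    if level not in ("STUDY", "SERIES", "IMAGE"):
        raise ValueError("unknown query/retrieve level: %s" % level)
    return {
        "QueryRetrieveLevel": level,
        "AccessionNumber": accession_number,
        # Empty values act as "return keys": the PACS fills them in.
        "StudyInstanceUID": "",
    }

query = build_query("A1")
print(query["QueryRetrieveLevel"], query["AccessionNumber"])
```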

DICOM object manipulations: When images are retrieved for display via the previously described web interface, they must be converted to a web-compatible format such as JPEG or Bitmap (BMP). In addition, images may need to be de-identified based upon a user's and/or project's privileges. PixelMed provides functionality such as the conversion and de-identification of DICOM objects (based on DICOM 3.0 Standards).

Grid enablement: Grid-enabling our model system allows us to support research collaborations involving institutionally and geographically disparate participants, a scenario that has motivated many recent infrastructural research and development programs, such as caBIG. In this project, we specifically have Grid-enabled our system using the caBIG-developed caGrid middleware. In order to implement this type of functionality, a wrapper application was created that supports Grid-compliant electronic data interchange with both OTP and PixelMed (thus making them available as caGrid analytical services). Since Grid services operate in a manner similar to web services or applications, implementing our Grid-based system within the institutional firewall did not require any specialized network configurations other than the placement of the presentation and application tiers within a demilitarized zone (DMZ) of our network, which for security purposes incorporates an access control list to restrict connectivity to known hosts.

Data Sources Tier

Several key requirements influenced the design of the data sources tier, specifically the need to 1) ensure HIPAA-compliant de-identification of the data when necessary; 2) generate proxy data sources for efficient access; and 3) manage access privileges at multiple levels of granularity, from projects to individuals. These requirements are described in detail in the following sections.

De-identification: The need to de-identify data can be broadly separated into requirements that pertain to structured data, textual data, and image data:

Structured Data: Since structured data reside in relational databases, de-identification is straightforward compared to other data types. This type of de-identification was achieved by removing and replacing patient-unique identifiers (using surrogate identifiers which we refer to as de-identifiers).
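One way to realize such surrogate identifiers is a keyed one-way mapping, sketched below. The HMAC construction and the secret key are illustrative assumptions, not the exact scheme used in the IW; the point is that the mapping is consistent (preserving linkage across tables) yet not reversible without the key.

```python
# Illustrative sketch of replacing patient-unique identifiers with
# surrogate "de-identifiers". A keyed HMAC is ONE possible way to
# obtain consistent surrogates; the key below is a placeholder and
# would be held by the institution, never stored with the data.
import hashlib
import hmac

SECRET_KEY = b"institution-held secret"  # hypothetical placeholder

def de_identifier(mrn):
    """Map an MRN to a stable surrogate identifier that cannot be
    reversed without the secret key."""
    digest = hmac.new(SECRET_KEY, mrn.encode(), hashlib.sha256).hexdigest()
    return "DID-" + digest[:12]

# The same MRN always maps to the same surrogate, so linkage between
# data elements survives even though the original identifier is gone.
print(de_identifier("12345678") == de_identifier("12345678"))  # True
print(de_identifier("12345678") != de_identifier("87654321"))  # True
```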

Text Data: Within the OSUMC IW, all text reports are preprocessed in order to generate distinct, de-identified versions of the original source text, which were used in this project.

Image Data: PixelMed allows DICOM objects to be separated into metadata and image components, and provides total programmatic control over the metadata section through its JAVA-based API. Hence, by rewriting and manipulating the metadata of the DICOM objects, images are de-identified or anonymized according to the HIPAA guidelines.
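The metadata rewriting step can be sketched with the DICOM header modeled as a plain dictionary (PixelMed operates on actual DICOM attribute lists). The attribute names are standard DICOM keywords, but the removal list below is an abridged illustration, not the complete confidentiality profile.

```python
# Sketch of de-identifying a DICOM object by rewriting its metadata.
# The removal list is illustrative; a HIPAA-compliant deployment
# covers the full set of identifying attributes.
IDENTIFYING_ATTRIBUTES = {"PatientName", "PatientBirthDate", "PatientAddress"}

def deidentify_header(header, surrogate_id):
    """Return a copy of the header with identifying attributes removed
    and the PatientID replaced by a surrogate de-identifier."""
    clean = {k: v for k, v in header.items() if k not in IDENTIFYING_ATTRIBUTES}
    clean["PatientID"] = surrogate_id
    return clean

header = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345678",
    "PatientBirthDate": "19700101",
    "Modality": "CT",
    "StudyDescription": "CHEST CT",
}
clean = deidentify_header(header, "DID-000000000001")
print(sorted(clean))  # ['Modality', 'PatientID', 'StudyDescription']
```

Using the same surrogate here as in the structured data tables is what keeps the de-identified images linkable to the rest of the de-identified phenotypic data.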

Proxy data sources: In order to prevent computational performance issues pertaining to the use of operational data repositories, we implemented proxy relational databases and a proxy PACS as part of our data sources tier. Research images and data are moved to these proxy resources in a time-sensitive, batch operation. In order to support regulatory compliance, research datasets are also de-identified during this batch operation. The two specific components of our proxy data source are:

Proxy Relational Database: Within our framework, this is a physically and logically distinct Oracle instance. Oracle enables tables and views based on remote data sources; however, in our case this proxy database only holds de-identified summary tables and materialized views that are based on our operational databases. User accounts created for accessing this proxy database have view-only privileges for that data.

Proxy PACS: The proxy PACS is an instance of PixelMed's PACS server. Images that have been approved for research use are moved to this PACS in de-identified form. The use of this proxy PACS allows for more frequent and large-scale queries to be executed than would be possible if searching the operational PACS used for clinical care purposes, owing to performance degradation concerns in such an alternative scenario.

2.5 Results

In order to evaluate the feasibility and performance of our model system implementation, we applied the previously introduced motivating use case to 30 months of patient data stored in the OSUMC IW, which comprised 1,373,194 radiology reports and 304,212 pathology reports that corresponded to 4,753,985 patient encounters. Our example searches focused on two main ICD9 code categories: lung cancer (162: Malignant neoplasm of trachea, bronchus, and lung) and coagulation defects (286: Coagulation defects). While our example searches involve both radiology and pathology reports for the first category (lung cancer), searches for the second category (coagulation defects) were applied to radiology reports only.

Lung cancer searches: Under this category, 7 different ICD9-CM codes were used for query construction: Trachea (162.0), Main Bronchus (162.2), Upper Lobe (162.3), Middle Lobe (162.4), Lower Lobe (162.5), Other Parts of Bronchus or Lung (162.8), and Bronchus and Lung, Unspecified (162.9). For our search on radiology reports, the concept "lung nodule" was expanded to include "thoracic" and "chest" for "lung", and "lesion" for "nodule". For our search on pathology reports, the keyword "carcinoma" was expanded by the keyword "neoplasm". Table 2.1 depicts the effects of the thesaurus-based conceptual expansions for each lung cancer ICD9 code. In some cases, the use of knowledge-anchored query expansion captures more relevant data than would be possible in queries without such expansion.

Coagulation defects searches: Under this category, 9 different ICD9-CM codes were used for query construction: "Congenital factor VIII disorder" (286.0), "Congenital factor IX disorder" (286.1), "Congenital factor XI deficiency" (286.2), "Congenital deficiency of other clotting factors" (286.3), "von Willebrand's disease" (286.4), "Hemorrhagic disorder due to intrinsic circulating anticoagulants" (286.5), "Defibrination syndrome" (286.6), "Acquired coagulation factor deficiency" (286.7), and "Other and unspecified coagulation defects" (286.9). For our search on radiology reports, we expanded the keyword "pulmonary embolism" to include "lung" and "chest", along with "PE", a synonym for pulmonary embolism. Table 2.2 depicts the effects of the thesaurus-based conceptual expansions for each coagulation defects ICD9 code.

In our "lung cancer" cases, query expansion returned 43% more reports on average than without such expansion when applied to radiology reports (Table 2.1). However, in the context of pathology reports, this increase in returned reports was only 4%. In "coagulation defects" cases, similar query expansion yielded 8% additional reports compared to a query without such expansion (Table 2.2). Considering our earlier example of the keyword expansion of "melanocarcinoma" with "melanoma", where we demonstrated a much greater (0 to 2706) expansion, these results demonstrate the impact of the initial query keywords on overall retrieval performance.

The execution times associated with our model system were evaluated in two stages: 1) the first stage focused on the average time required to identify and retrieve images, which was found to range between 2-10 seconds; 2) the second stage focused on the average time required to retrieve structured and free-text data corresponding to a given image or image series, which was found to range between 4-10 seconds. These measurements were derived from 30 repeated measurements of the elapsed time required to retrieve and view the first 8 images of a CT or MRI series for a given patient, and then to retrieve on average 3 related text reports and structured data elements for that study, via the previously introduced web interface.

2.6 Discussion

The retrieval of image data in support of research requirements is usually more meaningful when patient-derived phenotypic context data accompanies such images. Such phenotypic data can be derived from multiple sources, including free-text data such as radiology or pathology reports and structured data such as diagnosis codes. As we have demonstrated, by applying integrative knowledge-anchored strategies, conceptual searches spanning all of the preceding data types are possible and in some cases can generate larger amounts of data meeting the criteria used to define a motivating use case and its associated data query requirements than is otherwise possible. We have also demonstrated that in those instances where the de-identification of imaging and corresponding phenotypic data is needed in order to satisfy regulatory and patient confidentiality requirements, such de-identification can be performed in such a manner that overall context and the ability to recreate the linkage between data elements is maintained. While such an approach by necessity introduces additional technical challenges to the proper de-identification of data, the programmatic utilization of the open-source PixelMed PACS API within our framework allows us to replace identifiers within image metadata with de-identifiers which are consistent with other de-identified phenotypic data, thus demonstrating one possible solution to such challenges.

Interoperability is a major requirement for sharing data and collaborative work in a research environment, especially when that environment spans institutionally and geographically distributed investigators and research participants. By adopting the caGrid middleware in our framework and model system, we are able to facilitate such interoperability on both the syntactic and semantic levels, thus addressing the prior requirement. In particular, the multitier architecture we have adopted simplifies deployment of new technology such as caGrid within our framework. This multitiered framework has additional benefits, including more efficient control of and access to multiple underlying data sources as well as the ability to mitigate potential performance concerns in operational systems through the utilization of appropriate proxy data sources.

There are several limitations of our framework and model system which should be noted, including: 1) our relatively simple approach to text mining does not exploit more advanced semantic interpretation of clinical narrative text, nor does it allow for the detection of and reasoning upon negation within that text, which could affect the recall and precision of information retrieval tasks (however, an analysis of the recall and precision of the text mining process was infeasible during this study due to its limited scope); 2) the query expansion techniques employed, by virtue of the previously noted simple approaches to text mining, do not lend themselves to fully assisting end users in identifying optimal descriptors or codes to be used during the query formulation process, and thus the efficacy of our queries is in part reliant on the domain expertise and heuristic knowledge of our end-users; 3) the text data de-identification scheme employed relies on pre-existing de-identification processes external to our framework and model system; and 4) our evaluation of the described software framework and model system is limited to a single instance and basic scope, owing to the preliminary status of our work as a model formulation effort.

2.7 Conclusion

We have described a model and associated software framework with a promising and unique combination of components capable of providing translational research users with an integrative query and information retrieval tool that spans multiple, critical biomedical information sources, including structured data, narrative text, and images. Furthermore, the inclusion of de-identification mechanisms as standards-compliant electronic data interchange modalities within our system has significant potential to address inherent challenges to the conduct of multicenter or cross-disciplinary translational research in the modern regulatory environment. Our future plans for this project include the continued evaluation of the framework, with specific emphasis on the types of novel hypotheses that can be addressed using such a knowledge-anchored, integrative query platform, as well as its applicability to other usage scenarios. We fully anticipate that our system, with its focus on satisfying a critical translational research information need, will continue to develop into an operational platform for use by researchers at OSUMC that will also be extensible to the broader informatics and research communities.

ICD9 Code   L+N/C   L+N/-   N/C   N/-   L/C   L/-   -/C   -/-
162             1       1     1     1     0     0     0     0
162.2          28      28    26    26    25    24    22    22
162.3          83      80    70    67    68    66    56    54
162.4           6       6     5     5     5     5     5     5
162.5          39      37    35    33    31    29    26    24
162.8          10      10     9     9     7     7     6     6
162.9          29      26    22    20    25    22    18    16

Table 2.1: Combining ICD9 Codes for "Malignant Neoplasm of Trachea, Bronchus, and Lung" with Radiology Reports that Contain the Concept Corresponding to "Lung Nodules" along with Pathology Reports that Contain the Concept Corresponding to "Carcinoma." Each cell gives the number of reports returned. Column labels combine the radiology-report expansion state (L = "lung" expanded, N = "nodule" expanded) with the pathology-report expansion state (C = "carcinoma" expanded); "-" denotes no expansion.

ICD9 Code   L+C/PE   L+C/-   C/PE   C/-   L/PE   L/-   -/PE   -/-
286              9       9      9     9      9     9      9     9
286.1            2       2      2     2      2     2      2     2
286.3          164     163    164   163    164   163    164   163
286.4            7       7      7     7      7     7      7     7
286.5            2       2      2     2      2     2      2     2
286.6           39      39     38    38     39    39     38    38
286.7           43      42     43    42     34    33     34    33
286.9          371     317    367   313    353   291    339   277

Table 2.2: Combining ICD9 Codes for "Coagulation Defects" with Radiology Reports that Contain the Concept Corresponding to "Pulmonary Embolism." Each cell gives the number of reports returned. Column labels combine the expansion state of "lung" (L) and "chest" (C) with the expansion state of "pulmonary embolism" (PE); "-" denotes no expansion.

CHAPTER 3

A DATABASE DE-IDENTIFICATION FRAMEWORK TO ENABLE DIRECT QUERIES ON MEDICAL DATA FOR SECONDARY USE

3.1 Summary

In this chapter, we introduce a novel data de-identification methodology which directly addresses HIPAA and local IRB concerns and gives direct access to much-needed clinical data for secondary use. The data are always guaranteed to be consistently and reliably de-identified. This is the first reported de-identification scheme that ensures a zero-knowledge protocol at the database level, even at the session level. From a clinical and translational researcher's perspective, this means timely access to data. A potential researcher who is preparing a grant submission that is due in a week may not be able to afford waiting for a data analyst to test a last-minute hypothesis which relies on retrospective analysis. In this case, timely access to data may differentiate between the researcher who gets the data and the grant, and the one who does not.

To give a more concrete real-world example that also elaborates on the integrative search capabilities introduced in Chapter 2, let us assume we have a hypothetical investigator who is applying for a multisite grant to investigate interstitial pulmonary diseases, and he or she needs to find the number of accessible tissue samples for sarcoidosis patients from which RNA (for gene expression studies) can be extracted. Using a de-identified data warehouse, without the need for an IRB approval, the researcher can get the most accurate numbers on accessible tissue samples in a timely fashion. On the contrary, if the researcher had directly asked the same question of tissue banking services, even after spending time on an IRB approval, the researcher might potentially be misled. For one, the tissue banking services would not know the final clinical diagnosis (sarcoidosis) given the pathological diagnosis of non-caseating granulomas (which lead to sarcoidosis), because within the clinical workflow the clinical diagnosis is assigned after the pathological report. In this scenario, using a de-identified data warehouse that links all the data elements that are needed, the researcher may be able to find the closest number of samples. For OSUMC, if we look at the samples collected within the past 10 years, this number is around 400 (the researcher should apply for the grant; there are enough samples to work with). Without the integrative, de-identified solution, the researcher would not be able to identify more than 15 samples (the subset of non-caseating granuloma samples that had the prior diagnosis of sarcoidosis at OSUMC), and therefore the researcher's chances of applying for the multisite grant and getting a good score would be very slim. In summary, being able to execute timely, consistent, secure, and reliable queries (eliminating the need for an IRB) on clinical data for secondary use is crucial for institutions.

3.2 Introduction

Enabling clinical and translational research is dependent on the availability of electronic medical record (EMR) systems as data sources [71, 72]. The use of EMR data for research is governed by federal privacy regulations such as HIPAA [73] and the Common Rule [74]. In order to use identifiable data in research, an Institutional Review Board (IRB) approval must be obtained. Even though the IRB procedures can be extensive enough to impede opportune research and discoveries, these rules are in place to ensure complete safeguarding of Protected Health Information (PHI).

Generally, when PHI for subjects within a dataset is removed, EMR data can be treated as nonhuman data. To address the concern of timeliness in research activities, the IRB has granted the Information Warehouse [75] (IW) the "Honest Broker" status (the Honest Broker Protocol [76], or HBP) that allows IW analysts to provide de-identified clinical data for research (Figure 3.1).

Figure 3.1: Data Request Process. Top: normal process with IRB review (protocol formulation, review, and revision before data access). Bottom: HBP process (data request under a data use agreement).

Researchers only need to sign a data use agreement before gaining access to de-identified data. While the HBP greatly accelerated research access to data, IW analysts are overwhelmed with new data requests that need to be manually de-identified and delivered. This manual process also carries the risk of accidental PHI exposure. These limitations and/or problems can be remedied by the creation of a de-identified IW (DIW) [77] in conjunction with query tools that allow researchers to build complex queries without specific knowledge of any query languages.

Figure 3.2: Schematic overview of clinical data flowing from identifiable to de-identified repositories (IW to DIW, via de-identification to a limited dataset and then a de-identified dataset).

In this chapter we report on our de-identification processes (Figure 3.2) that meet HIPAA regulations as well as more restrictive rules set forth by the OSU IRB. Following Figure 3.2, these processes can be explained as follows. Beginning with identifiable data inside the IW repository, all data go through a set of de-identification algorithms (step 1) and the resulting data are stored in a separate database as a limited dataset with dates unchanged in the DIW (step 2). These data can become available to researchers when a limited dataset is requested (step 3). Alternatively, all de-identified surrogate identifiers (DSI) and dates are further de-identified using a 2-stage scheme, both in the database and in each user session, leading to a completely de-identified dataset (step 4). These processes are generalizable: they can be used to export data into destination databases (or warehouses) while satisfying destination security requirements through different parameter sets; they can also be used to publish data from existing systems without exposing source identifiers. In this work, we describe processes that de-identify data in order to enable the use of such data in research-related endeavors. These processes operate at the database system level, and the idea is to make the databases themselves protect the data from both internal and external threats while still enabling incremental data updates.

3.3 Background

Preserving patient privacy and confidentiality while sharing [78] and enabling the secondary use [79, 80] of medical data brings forward the problem of de-identification [79, 80]. This has been the focus of much research in recent years [81–87]. As Narayanan and Shmatikov [88] recognize, unnecessary de-identification may destroy the utility of datasets. However, as noted by Cavoukian and El Emam [89], this does not mean we should stop de-identifying; rather, we should also incorporate careful risk assessments [86, 89, 90].

Many efforts have been made to de-identify data in conformance with HIPAA and IRB regulations, covering both structured [81, 85, 91, 92] and unstructured [93–95] clinical data. Methods have been described using a hash function to build a de-identified bio-repository [91]. Other methods have been reported on either de-identification schemes [87] or attempts to guarantee anonymity [96, 97]. De-identification of medical free text has been studied with great progress in past years [98]. Here, we would like to note that we consider de-identification of MRI and CT images a structured-data de-identification problem, which can be solved by proper handling of DICOM headers (see Chapter 2), and we consider removal of information from the pixels of images (e.g., removal of patient names from ultrasound images) an optical character recognition (OCR) problem, which is outside the scope of this thesis.

De-identification methods for structured data can be divided into two categories: (1) heuristic methods, and (2) statistical methods. When heuristic methods are employed, usually all PHI-containing field values are removed or replaced in a procedural manner. Well-known examples in the literature include the works of Roden et al. [91] and Pulley et al. [99], both of which utilize a one-way hash algorithm (SHA-512) to de-identify their data; in addition, patients in their health system are presented with opt-out forms so they can be excluded from the repositories if they choose. There are also cryptographic approaches in the literature that demonstrate returning aggregate results without disclosing any patient information [100, 101] and performing secure join queries on encrypted datasets originating from multiple institutions [102]. Statistical methods as reported in the studies of Sweeney [97, 103, 104] form the basis of many other works in this category [96]. In k-anonymity [97], Sweeney describes an algorithm that protects the anonymity of an individual in a dataset by ensuring the existence of at least k records with the same characteristics (e.g., at least four other records have the same values) belonging to other individuals in the same dataset. Also in this category, many differential-privacy-based methods [105] are defined in order to govern the inclusion or removal of database fields or items in datasets. Generally, the probability of re-identification versus the effect of added noise on the statistical meaning derived from the datasets is calculated and acted upon.
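The k-anonymity property described above can be checked mechanically: every combination of quasi-identifier values must occur in at least k records. The following is a minimal illustrative sketch (not Sweeney's algorithm itself); the field names and example data are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier value combination occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical records with 3-digit ZIP prefix and birth year as quasi-identifiers.
data = [
    {"zip3": "432", "birth_year": 1960, "dx": "173"},
    {"zip3": "432", "birth_year": 1960, "dx": "496"},
    {"zip3": "430", "birth_year": 1955, "dx": "716.91"},
]
print(is_k_anonymous(data, ["zip3", "birth_year"], 2))  # False: the last record is unique
```

In practice such a check is only the verification step; achieving k-anonymity requires generalization or suppression of the quasi-identifier values.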

De-identification of unstructured medical text is an active research area, and much advancement has been made. The research community has even organized challenges, such as the i2b2 challenge, where de-identification and concept extraction from medical text were evaluated using precision, recall, and balanced F-measure (2 × Precision × Recall / (Precision + Recall)) against a community-prepared and annotated gold-standard medical text corpus [98]. Promising methods include systems combining dictionaries with pattern matching [106] as well as statistical systems [107]. In the i2b2 challenge, the best-performing system was the work of Wellner et al., who combined two toolkits for named entity recognition (Carafe and LingPipe) [107] and achieved precision, recall, and F-measure all greater than 96%. Later, using the same i2b2 corpus, Uzuner et al. produced higher precision (99%), recall (97%), and F-measure (98%); they used an open-source Support Vector Machine (SVM) library called LIBSVM in their implementation [108].
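The balanced F-measure used in these evaluations is the harmonic mean of precision and recall; a one-line sketch shows how the reported figures relate:

```python
def f_measure(precision, recall):
    """Balanced F-measure: 2 * P * R / (P + R), the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision 0.99 and recall 0.97 yield an F-measure of about 0.98,
# consistent with the figures reported for the LIBSVM-based system.
print(round(f_measure(0.99, 0.97), 2))  # 0.98
```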

In a recent review by El Emam et al. [86], the risk of re-identification is shown to be quite low for a properly de-identified dataset. Empirical tests conducted by statistical experts assembled at the U.S. Department of Health and Human Services' Office of the National Coordinator for Health Information Technology have shown that the risk of re-identification is as low as 0.013 percent when HIPAA safe harbor methods are used. Their tests were conducted under realistic conditions using a 15,000-patient set [109]. It is also worth noting the potential for leaks or attacks caused by rogue employees (inside jobs) [89].

In cryptography, a protocol by which one party interactively answers another's question without revealing the underlying hidden secret is called a zero-knowledge protocol [110]. To give a medical-dataset-related example, one way to approach a zero-knowledge protocol could be to add a random number to patient record identifiers while answering a question. Here, we would produce a new set of unique random codes for each patient for the given question, and this approach could potentially be used as a building block for developing HIPAA- and Common Rule-compliant query methods [111].
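A minimal sketch of this idea follows: each question is answered with freshly generated surrogate codes, so answers to different questions cannot be linked through patient identifiers. The function name and identifiers are hypothetical.

```python
import secrets

def answer_question(patient_ids):
    """Issue a fresh random surrogate code per patient, valid for this one question only."""
    return {pid: secrets.token_hex(8) for pid in patient_ids}

ids = ["900000001", "900000002"]
codes_q1 = answer_question(ids)   # codes used when answering question 1
codes_q2 = answer_question(ids)   # unrelated codes for question 2
# The same patient receives unrelated codes across questions (with overwhelming
# probability), so result sets from different questions cannot be joined on identity.
```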

3.3.1 The OSUMC Honest Broker Protocol

The IW's "Honest Broker" status as a provider of de-identified clinical data for research purposes was approved as an annually reviewed procedural protocol by The Ohio State University IRB in April 2006.

IW data analysts prepare the de-identified data by removing PHI and deliver the dataset as a bundle of coherently linked files to the researcher. In a de-identified dataset, dates are recorded as time intervals from a patient's first visit. Alternatively, actual dates, ZIP codes, and ages over 89 can be included in a limited dataset [88, 112]. To safeguard against accidental identification, resulting de-identified or limited datasets with fewer than 25 records are not delivered; this number is determined by the OSU IRB protocol (and can be adjusted for other institutions). Under certain circumstances, a limited dataset is preferred over a de-identified dataset because of the inclusion of temporal elements in studies. Also, queries based on previous results are not allowed, to prevent users from performing longitudinal studies using either a limited or a de-identified dataset. DSI are changed between each query.

This approach, while allowing retrospective analysis, inhibits forward longitudinal study of particular subjects. Furthermore, should an inadvertent identification occur due to an investigator's prior knowledge of a particularly unique data pattern, the Data Use Agreement stipulates that the investigator must immediately seek IRB oversight.

3.4 Design Objectives

Our first objective is to provide a framework simple enough that it can be easily modified in response to potential policy changes from our IRB or HIPAA, or to structural changes in the source database(s). Moreover, there are other honest broker protocols in the literature, such as the University of Michigan [113] and the University of Pittsburgh [114] honest broker systems. The design of our framework should be generic enough that it can, as a whole or in part, be adopted or adapted in these systems based on specific institutional needs.

To support both research and education, it is desirable to have a completely de-identified database (DIW) that is similar in data structure to the identifiable version (IDB). This DIW must be HIPAA compliant in terms of the masking or removal of PHI, and should ideally be capable of defending against re-identification attacks, both internal and external. Structural similarity between the IDB and DIW helps ensure that applications previously developed for the IDB remain applicable to the DIW with minimal modification.

3.5 System Description

In general, our de-identification framework is constructed as a methodology for extracting, transforming, and loading (ETL) data from multiple sources into a comprehensive de-identified database (or data warehouse). Quite often, sources of data within large medical institutions reside across multiple systems; thus, even when they are collected in a single repository (e.g., a data warehouse), their origins are usually federated. This means that potentially multiple record identifiers could have been used for a given patient. Hence, while removing PHI from records, special attention is needed to maintain record consistency across systems.

Figure 3.3: De-identification process across databases; on the left, operations on the source database system; on the right, operations on the destination database system.

Following our system architecture depicted in Figure 3.3, we describe how we create a de-identified version of our source database system. In this model, data flow is one way only, and the objective is to sever all possible linkages from destination to source in terms of patient identifiers, while the source can still safely update the destination. Our description follows the direction of data travel through the de-identification pipeline: we first explain the operations taking place in the source database system, followed by the operations on the destination database system.

3.5.1 Operations on the source system

As shown on the left side of Figure 3.3, these operations take place in a "Source Encryption Account" (SEA), an account that has read privileges (SQL SELECT) on all database schemas (excluding the SYSTEM account) chosen to be de-identified.

The SEA can be further safeguarded by using data encryption functions within the database, if available. No database account (except the SYSTEM account) can see the contents of the SEA. Using its stored procedures, the SEA can execute functions that hash given strings, generate random numbers of variable sizes, or write data to master tables. The hashing and random number generator functions mentioned here are implemented as Java stored procedures using open-source functions of the java.security package (java.security.MessageDigest and java.security.SecureRandom). These are cryptographically strong pseudo-random numbers that are minimally compliant with the statistical random number generator tests [115] specified in FIPS 140-2, Security Requirements for Cryptographic Modules, section 4.9.1 [116].

First, looking at the tables to be de-identified, the SEA creates master tables for each unique identifier type, such as Medical Record Number (MRN), Encounter Number (ENC), and Accession Number (ACC#). These master tables hold all unique identifiers for all patients whose data are going to be de-identified. Then, based on these Master tables, Mapper tables are generated using the SHA-2 hashing-based algorithm implementations (SHA-256, SHA-512) available as part of the stored procedures in the SEA. SHA-2-based algorithms were chosen over SHA-1 because of potential mathematical weaknesses of SHA-1 indicated by experts [117]. During our tests, where we repeatedly performed entire Mapper table builds, we used SHA-256 instead of SHA-512 for faster table generation. Using a secure random number implementation from its stored procedures, the SEA then assigns a unique random number to each hashed value (f(x), Figure 3.3). This process results in the generation and storage of a hashed and random value pair (e.g., hMRN, rMRN) for each unique identifier (e.g., MRN). The use of a hashing algorithm on all IDs in the source system is a necessary step to ensure that there is no accidental mapping exposure to database users who have access privileges to the source system. We chose not to use the hashed string itself, as reported by Roden et al. [91], for two reasons: (1) a hashed string is considered a directly derived ID, and thus its use is questionable under HIPAA; and (2) numeric IDs perform better than long strings in database queries [77].

Using the Mapper tables and its stored procedures, the SEA takes each record from the tables to be de-identified, first re-runs its hash function on the identifiers to be replaced (e.g., computes hMRN from the MRN) in order to find the random value to be used (e.g., rMRN), and then replaces the identifier with the precalculated random value. This record, which now carries a DSI in the form of a real identifier, is written to the corresponding staging table in the destination database's encryption account. Only the Mapper tables, which hold the hashed value and random number pairs, are kept at the SEA. While the hashed values scramble the data, the replacement random number used in the destination system ensures compatibility with the source system. The Master tables are only used as temporary in-memory data structures and are discarded after each data transfer. This guards against accidental exposure of the linkage between random ID and real ID pairs, and prevents internal users from manually linking data between IDs used in the source and de-identified data.
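The production implementation uses Java stored procedures inside the database; the hash-then-randomize scheme itself can be sketched as follows. Function and field names are hypothetical; the one-to-one surrogate assignment and the re-hash lookup mirror the steps just described.

```python
import hashlib
import secrets

def build_mapper(identifiers, digits=10):
    """Mapper table sketch: SHA-256 hash of each identifier -> unique random surrogate."""
    mapper, used = {}, set()
    low = 10 ** (digits - 1)
    for ident in identifiers:
        h = hashlib.sha256(ident.encode()).hexdigest()   # e.g. hMRN
        r = low + secrets.randbelow(9 * low)             # e.g. rMRN
        while r in used:                                 # enforce one-to-one mapping
            r = low + secrets.randbelow(9 * low)
        used.add(r)
        mapper[h] = r
    return mapper

def replace_identifier(record, mapper, id_field="mrn"):
    """Re-run the hash on the identifier, look up its surrogate, and substitute it."""
    h = hashlib.sha256(record[id_field].encode()).hexdigest()
    out = dict(record)
    out[id_field] = mapper[h]                            # DSI in the shape of a real ID
    return out

mapper = build_mapper(["900000001", "900000002"])
row = replace_identifier({"mrn": "900000001", "dx": "173"}, mapper)
```

Because the surrogate is found by re-hashing the incoming identifier, repeated loads of the same patient map to the same surrogate, which is what keeps records consistent across federated source systems.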

3.5.2 Operations on the destination system

As shown on the right side of Figure 3.3, operations here take place in a "Destination Encryption Account" (DEA). Evaluating the tables populated by the SEA, the DEA creates its own Mapper tables using its own stored procedures, analogous to the SEA's. During the creation of these new Mapper tables, each unique identifier receives multiple additional unique random numbers. Hence, each unique random number originally created by the SEA (e.g., rMRN) gets multiple columns of unique random numbers (CURN) at the DEA. For example, an MRN Mapper table would have the rMRN as its key, and with n random MRN columns it would have columns uMRN1, uMRN2, ..., uMRNn. In addition, the DEA creates a random date offset (rDOS) value for each random MRN (uDOS1, uDOS2, ..., uDOSn) and adds these to the Mapper table as columns. As explained below in the creation of Limited Datasets and De-identified Datasets, Mapper tables other than the MRN Mapper (Encounter Mapper, Accession Mapper, etc.) do not have random date offset columns. This process allows the production of independent random series to be used against potential attackers who may try brute-force attacks (by creating multiple sessions) to expose the static random numbers used by the SEA. The number of CURNs is pre-determined prior to setup (we are currently working on a dynamic version as well); during our tests we used a minimum of 2. CURNs are shuffled and refreshed after 40,000 login sessions or within 24 hours (the SEA initiates an update), whichever comes first. The additional layer of CURNs adds less than 10% overhead on query execution times (when a query directly involves an identifier-related field). In Oracle-based systems we eliminate this overhead by utilizing bitmap join indexes [118].
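The DEA-side Mapper structure can be sketched as follows, assuming n = 2 alternate surrogate columns as in our tests; the bounded date-offset range used here is an arbitrary illustrative choice, not the production parameter.

```python
import secrets

def build_dea_mapper(r_mrns, n=2, max_offset_days=30):
    """For each source surrogate rMRN, keep n alternate surrogates (uMRN1..uMRNn)
    and n random date offsets in days (uDOS1..uDOSn)."""
    return {
        r: {
            "uMRN": [1_000_000_000 + secrets.randbelow(9_000_000_000) for _ in range(n)],
            "uDOS": [secrets.randbelow(2 * max_offset_days + 1) - max_offset_days
                     for _ in range(n)],
        }
        for r in r_mrns
    }

def refresh_curns(dea_mapper):
    """Periodic shuffle/refresh: regenerate all alternate surrogate columns."""
    n = len(next(iter(dea_mapper.values()))["uMRN"])
    return build_dea_mapper(dea_mapper.keys(), n=n)

dea = build_dea_mapper([6708389166, 5273519526])
```

Regenerating the CURN columns wholesale, as refresh_curns does, is what defeats an attacker who opens many sessions hoping to observe a stable surrogate.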

The destination system uses two more database accounts to expose data to its users: (1) a Limited Account (LA) enabling access to Limited Datasets, and (2) a De-identified Account (DA) enabling access to De-identified Datasets. Neither of these accounts holds any physical tables; all database objects in these accounts are views based on tables and views in the DEA. The DEA controls how frequently the views in these two accounts are refreshed.

Creation of Limited Datasets: Using its stored procedures, the DEA creates views by joining the tables populated by the SEA with the Mapper tables. In these views each unique identifier (e.g., rMRN) is replaced by another (e.g., uMRNn). These views keep the original date values. From these views the DEA creates the limited-dataset views in the LA; view and column names, however, are assigned according to the naming conventions of the source system, which ensures compatibility and simplicity of adaptation for tools and users familiar with the source system. The DEA obtains an additional random number from the session variables for each user login. Session-dependent random variables are preferably picked from higher-cardinality numbers. For example, if we were to map a 9-digit MRN to a 10-digit MRN, the random session addition would be a value between 1,000,000,000 and 9,000,000,001. This addition is introduced to prevent users from comparing results across different studies. It also ensures that a user gets a different identifier for a given patient every time the user logs in to the system (at the system level, the database also limits how long a session can last).
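The 9-to-10-digit session mapping just described can be sketched with a simple additive scheme. This is illustrative only: the actual per-session function is hidden inside the DEA's stored procedures, and addition is used here merely to show the digit-length and cardinality arithmetic.

```python
import secrets

SESSION_LOW, SESSION_HIGH = 1_000_000_000, 9_000_000_000

def new_session_key():
    """Per-login random addition drawn from the high-cardinality range above."""
    return SESSION_LOW + secrets.randbelow(SESSION_HIGH - SESSION_LOW + 1)

def user_visible_mrn(u_mrn, session_key):
    """Map a 9-digit surrogate to a 10-digit, session-specific identifier."""
    return u_mrn + session_key

u_mrn = 670838916                                  # hypothetical 9-digit surrogate
mrn_session1 = user_visible_mrn(u_mrn, new_session_key())
mrn_session2 = user_visible_mrn(u_mrn, new_session_key())
# The same surrogate is seen as a different 10-digit identifier in each session.
```

Since any 9-digit surrogate plus a key in this range yields a value between 1,100,000,000 and 9,999,999,999, the result is always 10 digits, and results from two sessions cannot be compared on identifiers.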

Creation of De-identified Datasets: Figure 3.4 provides a walkthrough of the de-identification process. In this example, an MRN, 900000001, is hashed into a long string (H_MRN in the H_R_MAPPING table) using SHA-256. This hashed string is assigned a random number, 6708389166. Patient data for 900000001 are replaced by 6708389166 as a surrogate identifier and saved in the de-identified IW database (LIM_ENC_DX, the blue box). By using the linkage in USER_MRN, we further de-identify the unique surrogate identifier to a value that is changed periodically (U_MRN). This changing U_MRN is further de-identified by a hidden algorithm using a user-login-session-dependent random key to produce the user-viewable identifier in ENC_DX of the de-identified database (the green box). Once a user has obtained a dataset and has logged off, the de-identified MRN, 4760383073, cannot be linked back to any intermediate or original identifiers (U_MRN, R_MRN, and MRN); even data analysts do not have the capability of linking the result set back to any identifiable information. Also shown in this example is the date shifting result. Subject-dependent date shift values of +4 and -5 days are applied to MRN 900000001 and 900000002, respectively. A session-dependent shift of +2 days is further applied to all date results in this example session's query. This produces different date shifts among different subjects in the same query, as well as different date shifts between the same query run in different user sessions.

Figure 3.4: Overview of identifier de-identification process.

These datasets are created in a fashion similar to the Limited Datasets, with access through the DA (based on privileges, a user can access either the LA or the DA, but not both). The difference is that the date columns are treated as identifiers as well; hence, records are padded with unique random date offset values (for the DA) for each MRN, as well as with random dynamic session variables. As a result, the time intervals between the dates for any specific patient are kept, but the original dates are no longer available. Beginning from the source system, the whole process can be described as a one-way, one-to-one mapping: (source) MRN + hashing + (source and destination) random number + (destination) random number + random session variable.
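The date handling can be sketched as follows, reproducing the worked example of Figure 3.4 (+4-day subject offset and +2-day session offset for MRN 900000001). The key property is that intervals between a subject's dates are preserved while the calendar dates themselves are obscured.

```python
from datetime import date, timedelta

def shift_dates(dates, subject_offset_days, session_offset_days):
    """Apply a per-subject plus a per-session day offset; intervals are preserved."""
    delta = timedelta(days=subject_offset_days + session_offset_days)
    return [d + delta for d in dates]

visits = [date(1996, 1, 4), date(1996, 1, 5)]
shifted = shift_dates(visits, subject_offset_days=4, session_offset_days=2)
print(shifted)  # [datetime.date(1996, 1, 10), datetime.date(1996, 1, 11)]
```

A different subject in the same query receives a different subject offset (e.g., -5 days in the figure), so dates cannot be compared across subjects, while the session offset makes repeated runs of the same query disagree with each other.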

As one might expect, adding additional views and functions over existing table structures brings performance overhead. However, this overhead can be minimized or eliminated using methods such as placing additional indexes on source tables or pinning frequently used lookup tables in shared memory, resulting in dramatic improvement in query execution times and overall performance. Being a read-only system (end users are not allowed to alter or write new data), our de-identified instance was fine-tuned for fast query response times rather than for real-time data updates. Our current configuration is designed for updating once daily.

3.5.3 System Validation

We performed our system validation by following the trail of the data movement and tested the necessary functionalities along the way. Some of these functionalities are in place to fulfill HIPAA requirements; others fulfill our internal IRB requirements. While our focus here remains the validation of Limited and De-identified Dataset generation, we cover validation of all main system functionalities.

ZIP code roll-ups: Our current HBP requires that we follow the HIPAA Privacy Rule [3]. Hence, we retain ZIP codes with the same 3 initial digits only when those areas contain more than 20,000 people. The scripts related to this functionality are executed in the SEA before the data are passed on to the DEA. We tested this feature by declaring different under-populated ZIP codes, and we verified the results by counting the merged population changes and the disappearance of the under-populated ZIP codes in the result set.
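The roll-up rule can be sketched as a lookup against a table of ZIP3 populations; the population figures below are hypothetical, and collapsing under-populated areas to "000" follows the HIPAA safe-harbor convention.

```python
def roll_up_zip(zip_code, zip3_population, threshold=20_000):
    """Truncate to the first 3 digits; collapse under-populated ZIP3 areas to '000'."""
    zip3 = zip_code[:3]
    return zip3 if zip3_population.get(zip3, 0) > threshold else "000"

populations = {"432": 1_200_000, "045": 15_000}   # hypothetical ZIP3 populations
print(roll_up_zip("43210", populations))  # 432
print(roll_up_zip("04501", populations))  # 000
```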

Handling Patients over Age 89: This feature only applies to de-identified datasets. Within our framework, when patients reach age 90, updates to age-related fields in their records are suspended. The SEA continues to process these records and keeps them in a separate table. Updates to the DEA for these patients occur only after the patients are deceased. Until that time, their age is kept at the fixed value of 89. We verified this feature by querying the datasets to be passed from the SEA and found no patients over the age of 89. We also verified that, when the suspended dataset (patients 90 and older) was merged back, we could retrieve the original dataset.

Limited and De-identified Dataset Generation: In theory, each source identifier should be replaced by a new identifier every single time. One-to-one mapping is a rule that we enforce on all tables and views handling transactions related to identifier de-identification. For completeness, we still performed test queries across the system. For example, one of the queries we executed counted the maximum number of patients and encounters seen among different diagnosis codes, grouped by month and ordered by maximum encounters. We executed similar queries in the source system and compared the results with the limited dataset. Then, we executed similar queries without date constraints and compared the results from the source database with results from the limited and de-identified datasets, verifying that the total counts matched in all cases. Finally, running the same query under multiple accounts and sessions, we verified that during concurrent sessions by multiple users (10 during our tests) or consecutive sessions by the same user (4 during our tests), no identical identifiers are retrieved.

We are aware that having a parameterized, function-based date obfuscation function enables the de-identified dataset to produce results similar to the source dataset (which has real dates) when small random shifts are used during date field generation. The statistical variations and associated re-identification risk factors introduced by small random time shifts are currently outside the scope of this thesis and will be further evaluated and reported later.

Query rejection: As an optional, parameterized functionality, query rejection can be turned on to safeguard against at least two scenarios, in order to minimize the possibility of subject re-identification: (1) the number of records returned is below a preset threshold; and (2) the number of unique subjects represented in the returned records is below a preset threshold. Our Honest Broker status through the OSU IRB mandates this parameter to be set at 25 or higher, but it can be set to any number through a user-definable parameter. To test this feature, we executed queries that would return results below the preset parameter. We verified that when queries return fewer than 25 distinct subjects and/or records, a message is presented to the user instead of the result set, while normal queries behave seamlessly for the user. The query rejection functionality is implemented as an optional in-database module using Fine Grained Auditing [119].
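The rejection rule itself amounts to a simple guard over the result set. In this sketch the threshold and subject field are parameters (25 being our IRB-mandated minimum); the production version enforces the rule inside the database rather than in application code.

```python
def release_result(rows, subject_field="mrn", threshold=25):
    """Return rows only if both record count and distinct-subject count meet the
    threshold; otherwise return a rejection message instead of any data."""
    if len(rows) < threshold or len({r[subject_field] for r in rows}) < threshold:
        return "Query rejected: result set is below the minimum size set by protocol."
    return rows

small = [{"mrn": i} for i in range(10)]
large = [{"mrn": i} for i in range(30)]
print(release_result(small))          # rejection message
print(len(release_result(large)))     # 30
```

Note that the distinct-subject check matters independently of the record count: 30 records drawn from 10 patients must still be rejected.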

3.5.4 Test Environment and Setup

For our tests we transferred 4 tables from our source clinical database to our de-identified instance, and then generated limited and de-identified dataset versions of each table at the destination (Table 3.1).

Tables          Columns   PHI   Dates   Records      Description
ENC_CLIN        45        2     4       5,697,118    OSUMC clinical encounters
ENC_DX          15        2     2       20,175,845   One or more diagnoses for each encounter
ENC_ICD9_PROC   14        2     2       1,899,487    One or more procedures for each patient
PATIENT         16        1     3       1,005,657    Patient demographics

Table 3.1: Tables de-identified during testing

Using these tables and the Mapper tables, we created 2 views for each table: one view for limited dataset access, and another for de-identified dataset access. The following is an example SQL code snippet for these types of views:

CREATE OR REPLACE VIEW PATIENT_DEIDENTIFIED
    (MRN, GENDER, RACE, DOB, ...)
AS SELECT b.MRN_R + DEIDVW.GET_MRN() MRN,
    a.GENDER,
    a.RACE,
    ...
FROM PATIENT.PATIENT_LTD a, ENCRYPT.MRM_MAPPER b
WHERE b.MRN_RANDOM = a.MRN;

As mentioned earlier, users get their random numbers for each new session by calling stored procedures (DEIDVW) of the DEA. Since these views are function based, there are no additional maintenance requirements.

All tests were performed on a Sun Fire V445 server (4x 1.6 GHz UltraSPARC IIIi CPUs, 32 GB of memory) running Solaris 10 and Oracle Relational Database Management System (RDBMS) 11gR1, with tablespace compression turned on. All necessary indexes were created on all patient identifier fields (rMRN, rENC, etc.), date fields, and code lookup fields, with table statistics gathered using all available data prior to test query executions. Frequently used code lookup tables (Mapper tables) and their indexes were pinned into memory for fast access. We allowed each connection to utilize only a single CPU.

3.6 Test Queries and Results

During our query performance evaluation we executed 8 representative queries (see Table 3.2 in the appendix for query details). We executed each query 20 times with 4 different parameter sets, on 2 views per source table (one representing a limited dataset and one a de-identified dataset), and we measured query execution times (1,280 total executions) by CPU time (1.4–44 seconds) as reported by Oracle's SQL Trace Facility and TKPROF utility. Our average CPU time was 15 seconds. These are considered reasonable execution times for the given types of queries in our environment.

Figure 3.5: Average query execution times, de-identified (red) vs. limited (blue). These queries are required to execute in under one minute in our source identifiable environment. Detailed descriptions of the queries, along with their syntax, are provided in the Appendix.

Overall, during our tests, views for limited datasets performed better than their de-identified counterparts (Figure 3.5). This is because we perform fewer lookups during the creation of limited datasets (since original dates are used). In fact, since the de-identified views are generated by more complex functions, slowdowns in query times are expected (e.g., Query 3 and Query 5 in Table 3.2) whenever join conditions on date fields are present. However, the whole system being read-only allows us to optimize for fast reads (rather than reads and writes); hence, we stay within reasonable query response times. Also, through our experiments we observed that as queries get more complex (e.g., Query 6 is a more complex query than Query 4), query execution times are not impacted as much as one might expect. Although this may seem counterintuitive, our query evaluations revealed that the additional query complexity enabled this performance upturn through increased filtering.

3.6.1 Generalizability

As mentioned earlier, our framework is a de-identification framework that is, in general, a HIPAA- and local-IRB-compliant ETL methodology. Our initial local instances were built using PL/SQL and Java stored procedures for data transfers. However, different enterprise-level production environments may have their own requirements for software deployment. Therefore, to test generalizability, we ran all our scripts through two other industry-accepted ETL tools: IBM DataStage (version 8.5) [120] and Oracle Warehouse Builder (11g) [121]. Both tools allow execution of local scripts in databases; hence, we were able to transfer data from a SEA to a DEA in both cases. In addition, we created an example i2b2 database instance to demonstrate the DEA's session-based capabilities [122], and another prototype system that demonstrates generation of de-identified image datasets, which can also be viewed on mobile devices (e.g., iPad, iPhone, or Google Android-based devices) [123].

3.6.2 Reliability

In addition to measuring how queries performed in our environment, we further evaluated the reliability of our de-identification framework.

Internal Consistency: As a continuation of our development and potential problem discovery efforts, we created artificial datasets with one non-PHI field injected with a unique identifier (IUI) marking each record. We created 10 artificial tables holding 20 million such records. We then ran these tables through our de-identification framework for 100 consecutive runs and, using a third independent database (independent from both the source and destination databases), we compared the resulting de-identified sets from the 100 independent sessions. As expected, our IUIs matched 100% of the time, while PHI fields either mismatched 100% of the time or contained no data, depending on the previously defined HIPAA requirements (and stricter OSU IRB requirements when necessary, based on the number of subjects returned).
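The consistency check above can be sketched in a few lines. This is a toy stand-in, not the framework's actual PL/SQL implementation: the field names (`iui`, `mrn`, `name`) and the hash construction are illustrative assumptions. The injected unique identifier passes through untouched; the record identifier is replaced with a session-dependent pseudo-value; direct PHI is dropped.

```python
import hashlib

def deidentify(record, session_seed):
    """Toy de-identification step (illustrative only): keep the IUI,
    replace the record identifier with a session-dependent pseudo-value,
    and drop direct PHI outright."""
    pseudo = hashlib.sha256(f"{session_seed}:{record['mrn']}".encode()).hexdigest()
    return {"iui": record["iui"], "mrn": pseudo, "name": None}

def check_runs(records, n_runs=3):
    """Mimics the consistency check described above: across independent
    sessions, IUIs must match 100% of the time, while PHI-derived fields
    must never match."""
    runs = [[deidentify(r, seed) for r in records] for seed in range(n_runs)]
    for a, b in zip(runs, runs[1:]):
        assert all(x["iui"] == y["iui"] for x, y in zip(a, b))  # 100% match
        assert all(x["mrn"] != y["mrn"] for x, y in zip(a, b))  # 100% mismatch
    return True

records = [{"iui": i, "mrn": f"M{i:04d}", "name": "n"} for i in range(10)]
check_runs(records)
```

In the actual evaluation the comparison ran over 10 tables, 20 million records, and 100 sessions; the logic of the check is the same.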

Random number generation during sessions (session security): As can be seen in Figures 3.3 and 3.4 in our METHODS section, final pseudo-identifier generation for a given session relies on random numbers generated for use during that session. Therefore, in order to hide the underlying identifier or pointer, our random number generation process has to be rigorously tested, because if the DSIs are exposed and not changed, they can potentially be used as the "new" patient identifiers. For that purpose we followed the National Institute of Standards and Technology (NIST) guidelines for statistically testing random and pseudorandom number generators (PRNG) [124]. NIST statistical evaluations mainly measure algorithmic predictability through multiple tests [125]; a PRNG method passing multiple tests indicates its strength. While NIST's testing package includes 15 tests, some of these tests supersede others; for example, Maurer's "Universal Statistical" Test [126] supersedes the Monobits Test [125, 127]. However, the former is intended for longer bit sequences; therefore, we did not employ Maurer's Test during our evaluation but instead used the Monobits Test (our test numbers were 32 bits long). We looked at 86400 pseudorandom numbers (PRN) generated by our session login script, representing one PRN generation for each second of a 24-hour period. These numbers were then converted into 32-bit binary sequences and given as input to the NIST package (NIST recommends a minimum of 1000 PRNs for reliable results; we used 86400). Table 3.3 shows our test results.

In Table 3.3 we provide a condensed version of our results. The results report includes the proportion of passing sequences, where each pass or fail is assigned based on the internal P-value for each test. During NIST statistical testing, a sequence passes a statistical test whenever the P-value ≥ α and fails otherwise; for further details please refer to the NIST documentation [125]. During our tests, some of the statistical tests were applied multiple times according to the default settings of the NIST testing package. For example, the Non-Overlapping Template Matching Test was repeated 149 times; we had 104 results with a passing proportion of 32/32 (42 with 31/32, etc.). Our PRNG methodology produced satisfactory random numbers on every applicable NIST statistical test.
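As a concrete illustration of the kind of per-sequence decision the NIST suite makes, the Frequency (Monobits) Test used in our evaluation can be sketched as follows. This follows the definition in NIST SP 800-22 (the function name and α = 0.01 default are our choices, not part of the standard's API):

```python
import math

def monobits_test(bits, alpha=0.01):
    """NIST Frequency (Monobits) Test: checks whether the numbers of ones
    and zeros in a bit sequence are approximately equal, as expected for a
    random sequence. Returns (P-value, pass/fail at significance alpha)."""
    n = len(bits)
    s = sum(2 * b - 1 for b in bits)           # map {0,1} -> {-1,+1} and sum
    s_obs = abs(s) / math.sqrt(n)
    p_value = math.erfc(s_obs / math.sqrt(2))  # complementary error function
    return p_value, p_value >= alpha

# A perfectly balanced 32-bit sequence passes...
p_ok, passed = monobits_test([0, 1] * 16)
# ...while a constant 32-bit sequence fails.
p_bad, failed = monobits_test([1] * 32)
```

In our evaluation, each of the 86400 session-generated 32-bit sequences was subjected to this test (among others), and a proportion of passing sequences was reported, as in Table 3.3.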

3.7 Discussion

It is worthwhile for readers to compare the necessity of the steps in our framework against existing systems. Our design objectives were to form a core framework that is flexible enough to aid operational needs and enable future modifications. Within our framework, the source data can be in any form or structure in a database, and after de-identification that form and structure can be maintained (with the exception of free-text reports). This approach decouples query tools from operational security needs by relying on database security: it takes the maintenance and responsibility of HIPAA compliance away from query tools and pushes these responsibilities down to the database level. This means most of the open source or commercial tools used for looking at the data (e.g., for training and research purposes) can remain operational with minimal or no changes when pointed to a de-identified instance. This is crucial, especially for open source tools, because with enough time and effort their behavior can be altered or modified, and any security provided at the tool level can potentially be bypassed. The individual components of our framework are based on well-established and well-known methods such as hashing and random number generation, which makes it easily adoptable and adaptable [122, 123]. The cryptographic-level information security methods used in our framework are empirically measurable, and they achieve a 100 percent pass rate on well-established national statistical standards [124, 125].

The unique combination in which the individual methods are applied makes this framework compliant with HIPAA, the Common Rule, and local IRB requirements.

On the source system, the use of a hash function, followed by another random number before the data are sent from the SEA to the DEA, prevents hash keys from being exposed to potential outside attackers, while the hash keys themselves provide a defense layer against potential internal attackers. If we were to rely on hash keys only, an outside attacker could potentially expose all patient identifiers by knowing only a subset of identifiers (the attacker would then have both the input and the output for a given subset, so the method could potentially be broken). While there are examples in the literature using the one-way-hash method [21, 29], in those examples every patient included in the datasets has an opt-out signature on file. We are trying to avoid the need for collection of opt-out forms (in our source EHR systems we have more than 2.6 million patients, with admissions dating back to 1985), and this can be avoided by satisfying HIPAA and the Common Rule. We would also like to note that under HIPAA, use of a one-way hash as a substitute for de-identification is explicitly forbidden. In practice, this means access to data for research use cannot be granted without IRB approval when using a one-way-hash-only approach.
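The layering described above can be sketched minimally as follows. This is an illustration of the idea, not the framework's actual construction: the function name, SHA-256, and the 16-byte salt size are our assumptions. The point is that mixing a fresh random value into the one-way hash before data leave the SEA means that knowing a subset of (identifier, output) pairs no longer lets an attacker enumerate the rest.

```python
import hashlib
import secrets

def pseudo_identifier(patient_id: str, session_salt: bytes) -> str:
    """Sketch of hash-plus-random layering: the one-way hash alone defends
    against casual internal inspection, while the per-session random salt
    keeps the hash keys themselves from being exposed externally."""
    return hashlib.sha256(session_salt + patient_id.encode()).hexdigest()

salt_a = secrets.token_bytes(16)  # fresh random value per session/transfer
salt_b = secrets.token_bytes(16)
first = pseudo_identifier("123456789", salt_a)
second = pseudo_identifier("123456789", salt_b)
# The same patient maps to different pseudo-identifiers across sessions,
# so an exposed value cannot serve as a stable "new" patient identifier.
```

Within one session the mapping is stable (so joins still work), but across sessions it changes, which is exactly the property the text argues a bare one-way hash lacks.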

On the destination system, the use of session-based random number generation methods enables implementation of a zero-knowledge protocol for the medical datasets recycled for secondary use. These methods are in place to prevent patient identifiers from being traced back to record identifiers. Here, we would like to note that a simple records-based system cannot be considered exempt from HIPAA: in practice, de-identifiers such as static random numbers that remain unchanged across queries cannot be used by themselves in order to receive exemption. The added session-dependent identifier generation and query rejection methods minimize the risk of re-identification. Our current (pseudo)random number generator provides mapping by increasing the cardinality of the data: when a 9-digit random identifier is mapped to a 10-digit one, the chance of two users getting the same number in concurrent sessions, or of the same user getting the same number in consecutive sessions, is on the order of 1 in 10^10 within our scheme. Considering this is the third place where random switching takes place, even with a brute-force attack the attacker can only get back to the previous source number given by the DEA, which is replaced within 24 hours. In this scheme, even if the physical hardware is stolen or lost, there would not be any identifier that can be mapped back to the source data, since the numbers from the SEA are nowhere in the destination system.
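The cardinality-increasing mapping can be illustrated with a short sketch (the function name is hypothetical, and the production generator is the PL/SQL routine evaluated in Section 3.6.2, not this Python code):

```python
import secrets

def session_identifier() -> str:
    """Illustrative sketch: a 9-digit internal identifier space is replaced
    by 10-digit random identifiers drawn fresh each session. With 10**10
    possible values, the chance that two sessions draw the same number is
    on the order of 1 in 10**10, as noted in the text."""
    return f"{secrets.randbelow(10**10):010d}"

sid = session_identifier()  # a fresh 10-digit string each session
```

Because the identifier is regenerated per session, it cannot accumulate meaning across queries the way a static random number would.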

Unlike k-anonymity and differential privacy based methods, our query restriction methods do not alter query result sets beyond HIPAA requirements. For example, while all patients over 89 may be marked as 89 regardless of their age, no female patient would be marked male (or vice versa) in order to prevent re-identification. There is growing concern that tampering with medical data in order to prevent re-identification may diminish its value and render the data unusable [89, 96], and following HIPAA and the Common Rule has already been shown to be effective [89, 109]. If and when a query produces a narrow patient population (an increase in the risk of re-identification), rather than returning an altered dataset our query restriction methods simply produce a message telling the investigator to seek further IRB guidance in order to access the dataset requested by the query. The number of patients available in our QA system was around 1 million, and this number will exceed 2.6 million in the production version. The minimum number of distinct records suggested by our IRB for our environment was 25 (10^-6 of the current population), and we predict that as the patient population grows there will be fewer queries requiring specific IRB approval. However, we would like to point out that this is a number each institution should evaluate based on its local environment (the level of risk each institution is willing to take may differ). Nevertheless, if an institution chooses to employ further restrictive methods, our framework does not prohibit such modifications.
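The reject-rather-than-perturb policy can be sketched in a few lines (names and return shape are our illustrative choices; the production implementation lives in the database layer, not in application code):

```python
# Threshold suggested by the local IRB in the text; treat as configurable.
MIN_DISTINCT_PATIENTS = 25

def release(rows, distinct_patient_count):
    """Sketch of the query-restriction policy: instead of perturbing the
    result set (as k-anonymity or differential privacy would), a query
    whose patient population is too narrow is rejected outright, with a
    message directing the investigator to the IRB."""
    if distinct_patient_count < MIN_DISTINCT_PATIENTS:
        return None, ("Result population too narrow; please seek further "
                      "IRB guidance to access this dataset.")
    return rows, None

rows, message = release([("dx", "250.00")], distinct_patient_count=12)
# rows is None here; only the guidance message is returned
```

Note that the data themselves are never altered: queries either return exactly the HIPAA-compliant result set or nothing at all.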

We acknowledge there are limitations to our de-identification framework. Despite removing all HIPAA-required PHI and taking additional preventative measures, there is still a chance of re-identification. As indicated by many researchers in the literature [82, 90, 128], there is always a possibility of re-identification by a primary caregiver, relative, or other party who can identify a patient by recognizing his or her unique combination of diagnostic codes, test results, lab values, etc. Currently, we address this issue through our data use agreements, which are part of our HBP.

While our framework fully supports the inclusion of scrubbed text documents in the de-identification process, we did not include de-identified text documents in our institutional implementation for two reasons. First, analysis of IW customer data requests indicated that more than 95% of queries were code or value based; hence, while addressing our immediate needs we did not focus on text de-identification. Second, we believe that 100% removal of identifiers from free-text reports cannot be guaranteed using automated methods at this point. Even though great progress has been made in medical free-text de-identification in recent years, the inclusion of text documents still carries a higher risk than we are willing to take at this time. Our current repositories include more than 8 million text reports. The best system we have seen in the literature has 99% precision [108]; even if we had adopted such an implementation, with the assumption that it could de-identify any type of text document, we would still carry the risk of exposing more than 80,000 patient identifiers. Currently, for our environment, we are evaluating the inclusion of text documents as a query-only system (no document retrieval allowed) where searches are allowed only through controlled vocabularies [129-131]. In recent years great progress has been made in information extraction from textual documents [132-136], and we believe a combination of these new techniques could potentially speed up lengthy chart reviews. However, if the benefits outweigh the risks for one's purpose, our modular framework enables integration of de-identified text reports. There are mature tools in the literature that can be employed for such purposes [98, 137]. We recommend use of these methods before the data are moved (in the SEA) into a de-identified instance (DEA).

Currently, our framework is being tested for quality and assurance by developers, business analysts, and a select number of expert users (who can write and execute their own SQL queries without assistance). Our analysts report that, when used with queries where de-identification is a requirement, our framework enables time savings of 30 minutes to an hour, depending on the complexity of the dataset, compared to the routine editing performed by analysts. In addition, potential errors that would result in additional laborious work are eliminated. Even though a general end-user savings or usability report is not available at this point, based on the responses from our expert users we predict users will appreciate the time savings, since it is quite possible for an end user to wait 10 days to 3 weeks to receive manually prepared datasets for their queries from an analyst.

3.8 Conclusion

The creation of a De-identified Information Warehouse (DIW) is a continuing effort at the OSUMC IW to better support research activities using clinical data. The ultimate goal is to enable a direct connection between a researcher and the data. The IW's HBP has shortened the time it takes for researchers to access the data they need: compared to 10 data requests in 2003, the IW received 256 and 311 research data requests in 2009 and 2010, respectively.

With the HBP, OSUMC researchers can gain access to limited and de-identified data much faster than before. However, IW analysts can only process so many data requests at any given time. The new bottleneck facing researchers is the lengthening data request queue. As a solution, DIW can easily be coupled with commercial database query tools such as Oracle Answers [138]; in-house developed query tools such as uQuery [139]; or open source development efforts such as i2b2 [122, 140, 141], caGrid [142] etc. This can improve the efficiency of data requests by letting researchers perform their own queries on their time, with greater interactivity and increased flexibility.

Our framework successfully de-identifies and removes all HIPAA-mandated PHI from structured data elements, and provides conformance with the local IRB, while maintaining data integrity. The framework guarantees that new identifiers are generated for each session. This makes it a suitable tool for aiding the core de-identification operations that need to take place at medical data warehouses in order to support non-human-subject research using clinical data. Data sharing between multiple institutions, on the other hand, has other challenges associated with data standardization, network management, and further issues beyond the scope of this work. Once these challenges are properly addressed, this framework can be used as part of the solution for multi-institutional data sharing.

Query 1: Number of diagnoses with diabetes.
SELECT count(1) FROM ENC_DX_LTD WHERE dx_cd like '250%';

Query 2: Number of patients who had a primary diagnosis of "Ulcer of Other Part of Foot" and a non-primary diagnosis of diabetes.
SELECT count(distinct a.mrn) FROM ENC_DX_LTD a, ENC_CLIN_LTD b WHERE a.mrn = b.mrn AND a.encounter_no = b.encounter_no AND b.prim_dsch_dx_cd = '707.15' AND a.dx_cd like '250%' AND a.dx_rank <> 1;

Query 3: Number of diagnoses with diabetes where the patient stayed more than 30 days.
SELECT count(MRN) FROM ENC_DX_LTD b WHERE b.dx_cd like '250%' AND b.DSCH_DT - b.adm_dt > 30;

Query 4: Number of patients diagnosed with CHF who recorded a diagnosis of "Unspecified Chest Pain" where length of stay was greater than 30 days.
SELECT count(1) FROM ENC_CLIN_LTD a WHERE a.prim_dsch_dx_cd = '428.0' AND a.mrn IN (SELECT mrn FROM ENC_DX_LTD b WHERE b.dx_cd = '786.50' AND b.DSCH_DT - b.adm_dt > 30);

Query 5: Patients who had catheterization and transthoracic echo procedures within given date ranges.
SELECT count(distinct b.MRN) FROM ENC_ICD9_PROC_LTD b, ENC_ICD9_PROC_LTD a WHERE a.MRN = b.MRN AND a.ICD9_DT between to_date('1/1/2008','MM/DD/YYYY') and to_date('12/31/2009','MM/DD/YYYY') AND a.ICD9_CD in ('37.21', '37.22', '37.23') AND b.ICD9_DT between to_date('1/1/2008','MM/DD/YYYY') and to_date('12/31/2009','MM/DD/YYYY') AND b.ICD9_CD = '88.72';

Query 6: Patients with a history of CHF who also had catheterization or transthoracic echo procedures within a given date range.
SELECT count(1) FROM ENC_CLIN_LTD a WHERE a.prim_dsch_dx_cd like '428%' AND a.mrn IN (SELECT b.MRN FROM ENC_ICD9_PROC_LTD b WHERE b.ICD9_DT between to_date('1/1/2008','MM/DD/YYYY') and to_date('12/31/2009','MM/DD/YYYY') AND b.ICD9_CD IN ('37.21', '37.22', '37.23', '88.72'));

Query 7: Number of patients who had a primary diagnosis of "Ulcer of Other Part of Foot" and a non-primary diagnosis of diabetes, who also had a length of stay of more than 10 days.
SELECT count(distinct a.mrn) FROM ENC_DX_LTD a, ENC_CLIN_LTD b WHERE a.mrn = b.mrn AND a.encounter_no = b.encounter_no AND b.prim_dsch_dx_cd = '707.15' AND a.dx_cd like '250%' AND a.dx_rank <> 1 AND b.DSCH_DT - b.adm_dt > 10;

Query 8: Number of patients who had a primary diagnosis of "Ulcer of Other Part of Foot" and a non-primary diagnosis of diabetes, whose length of stay was greater than 10 days and who had any history of catheterization or a transthoracic echo procedure.
SELECT count(distinct a.mrn) FROM ENC_DX_LTD a, ENC_CLIN_LTD b WHERE a.mrn = b.mrn AND a.encounter_no = b.encounter_no AND b.prim_dsch_dx_cd = '707.15' AND a.dx_cd like '250%' AND a.dx_rank <> 1 AND b.DSCH_DT - b.adm_dt > 10 AND a.mrn IN (SELECT c.MRN FROM ENC_ICD9_PROC_LTD c WHERE c.ICD9_CD in ('37.21', '37.22', '37.23', '88.72'));

Table 3.2: Queries with different forms and parameters were used during performance evaluations. Each query is repeated in a non-sequential order with different parameters.

Test                                          Proportion (examples for this score)
Frequency (Monobits) Test                     31/32
Frequency Test within a Block                 31/32
Cumulative Sums (Cusum) Test                  31/32 (2)
Runs Test                                     32/32
Test for the Longest Run of Ones in a Block   32/32
Binary Matrix Rank Test                       31/32
Discrete Fourier Transform (Spectral) Test    31/32
Non-Overlapping Template Matching Test        32/32 (104), 31/32 (42), 30/32 (2), 29/32 (1)
Approximate Entropy Test                      32/32
Random Excursions Test                        3/3 (7), 2/3 (1)
Random Excursions Variant Test                3/3 (18)
Serial Test                                   32/32 (2)
Linear Complexity Test                        30/32

Table 3.3: Statistics on pseudorandom number generation. The minimum pass value for a given test is 29 for sequences of size 32 bits; the minimum pass rate for the Random Excursions (Variant) Test is approximately 2 for a sample size of 3. The number of sequences used was 86400. We had a 100% pass rate in all tests.

CHAPTER 4

UNEXPECTEDLY HIGH PREVALENCE OF SARCOIDOSIS IN A REPRESENTATIVE U.S. METROPOLITAN POPULATION

4.1 Summary

As one demonstration of the use of de-identified data, we report on changes in the patient population for sarcoidosis at OSUMC. The prevalence of sarcoidosis in the United States is unknown, with estimates ranging widely from 1-40 per 100,000. We sought to determine the prevalence of sarcoidosis in our health system and to further establish whether the prevalence was changing over time. We interrogated the electronic medical records of all patients treated in our health system from 1995-2010 (1.48 million patients) using the common ICD9 code for sarcoidosis (135) and lung cancer (162), and several other lung diseases characterized, like sarcoidosis, as "rare lung diseases." The patient demographic information (race, gender, age) was further analyzed using association rule mining algorithms to identify signature data patterns. The prevalence of sarcoidosis in our health system increased steadily from 164/100,000 in 1995 to 307/100,000 in 2010, and this trend could not be ascribed simply to changes in patient demographics or patient referral patterns. We further estimate that the prevalence of sarcoidosis exceeds 48 per 100,000 in Franklin County, Ohio, the demographic profile of which is nearly identical to that of the U.S. Sarcoidosis prevalence increased over time relative to lung cancer, a benchmark disease with stable disease prevalence, and exceeded that of other rare lung diseases. We postulate that the observed 2-fold increase in sarcoidosis disease prevalence in our health system is primarily related to improved detection and diagnostic approaches, and we conclude that the actual prevalence of sarcoidosis in central Ohio greatly exceeds current U.S. estimates.

4.2 Introduction

Sarcoidosis is a chronic systemic disease that commonly affects the lungs and afflicts adults in the prime of their lives. Based upon the ACCESS trial, the largest epidemiological study of sarcoidosis to date, the annual incidence of the disease varies from 5/100,000 in whites to 39/100,000 in African Americans, with an overall prevalence estimated to be less than 40/100,000 in the USA [143]. As such, sarcoidosis is characterized as a "rare lung disease" (i.e., fewer than 200,000 cases) in the USA. However, firm data relating to the true prevalence of sarcoidosis are lacking, and it is unclear whether the prevalence of the disease is changing over time.

Several features of sarcoidosis tend to obscure the diagnosis, leading to an under-appreciation of the potential impact of the disease on the health care system and society as a whole. Sarcoidosis frequently presents with non-specific complaints, ranging from fatigue and depression, to "asthma symptoms" (wheezing, cough), to arthritis and muscle pain or weakness. As such, sarcoidosis can mimic other diseases, leading to misdiagnosis and inappropriate treatments. When these non-specific disease manifestations prevail, the underlying diagnosis of sarcoidosis may be overlooked. In support of this concept, a review of 9324 forensic autopsy cases in Cleveland, Ohio indicated that the actual prevalence of sarcoidosis exceeds 300/100,000, an order of magnitude higher than suspected based upon death certificate reporting [144]. If the prevalence of sarcoidosis approaches this forensic figure, the disease would have to be reclassified as a common lung disease.

Several recent developments could contribute to an apparent and/or actual increase in the prevalence of sarcoidosis over the past 15 years. There is reason to believe that the detection of sarcoidosis has improved as clinical standards have shifted towards the more extensive use of high-resolution imaging techniques (e.g., CT scanning) and more effective sampling techniques (e.g., endobronchial ultrasound-guided biopsy). And to the extent that exposure to various environmental antigens promotes sarcoidosis [145], it follows that acute or chronic exposures to inhaled [146] and perhaps ingested antigens [147] could contribute to more cases of sarcoidosis. The objective of this study was to determine if the prevalence of sarcoidosis is changing over time in our health system and in our community.

4.3 Methods

The Ohio State University Medical Center’s institutional Information Warehouse (IW) was interrogated from 1995-2010 by means of Structured Query Language (SQL) queries.

The census data for Franklin County were obtained from the US Census Bureau's online resources (http://quickfacts.census.gov/qfd/states/39/39049.html), contained the actual census values from the years 2000 and 2010 as well as the estimates for the years between them, and were uploaded into the IW in the form of database tables. The patient demographic information (race, gender, age) was gathered for patient groups using specific ICD9 diagnosis codes for sarcoidosis (135), idiopathic pulmonary fibrosis (IPF; 516.3), hypersensitivity pneumonitis (HSP; 495), alpha-1 antitrypsin deficiency (AAT; 273.4), and lung cancer (162); demographics on all patients within our system were collected as well. In this regard, The Ohio State University Medical Center is a regional referral center for all of these rare lung diseases. For the purposes of these analyses, each patient was counted once for a given year. We then applied data mining and statistical analysis techniques in order to better understand and verify the significance of our results. For data mining we utilized the well-known Apriori algorithm, which is designed for mining association rules in large databases [148]. Briefly, the Apriori algorithm operates on large transactional databases to identify frequently co-occurring items or events. A simple example is items frequently co-occurring in market baskets during purchases from a grocery store (i.e., "60% of the time that bread is sold so are pretzels and that 70% of the time jelly is also sold" [149]). The results of such an analysis can be used to identify or understand customer profiles; here we use it to understand the demographic profiles of given sets of sarcoidosis patients. We utilized Weka (version 3.7.2), an open source data mining software package [150] (http://www.cs.waikato.ac.nz/ml/weka/), for our data mining and related visualizations. For statistical analysis and verification we used logistic regression; our analyses were performed using SAS (version 9.2). Before the demographic data were analyzed, as a prerequisite for the Apriori algorithm and the logistic regression methods, we turned our numeric age values (at the time of discharge) into categorical sets by binning them into decades (30s, 40s, 50s, etc.).
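To make the Apriori idea concrete, a minimal frequent-itemset miner is sketched below. This is an illustration of the algorithm's candidate-generation and support-pruning steps, not the Weka implementation we actually used, and the demographic "transactions" are invented for the example, not actual patient records.

```python
def apriori(transactions, min_support):
    """Minimal sketch of the Apriori frequent-itemset algorithm [148]:
    candidate itemsets of size k+1 are built (join step) from frequent
    itemsets of size k, then pruned by minimum support."""
    n = len(transactions)
    singles = {frozenset([item]) for t in transactions for item in t}
    freq = {}
    current = {s for s in singles
               if sum(s <= t for t in transactions) / n >= min_support}
    k = 1
    while current:
        for s in current:
            freq[s] = sum(s <= t for t in transactions) / n
        # join step: combine frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        k += 1
    return freq

# Illustrative demographic "transactions": (race, gender, age decade)
records = [
    frozenset({"black", "female", "40s"}),
    frozenset({"black", "female", "50s"}),
    frozenset({"white", "female", "40s"}),
    frozenset({"black", "male", "40s"}),
]
frequent = apriori(records, min_support=0.5)
```

With a 50% support threshold, itemsets such as {black, female} survive (they co-occur in 2 of the 4 records), which is the kind of signature demographic pattern the analysis in this chapter looks for.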

4.4 Results

4.4.1 Demographics of the regional population

The demographic profile of Columbus, Ohio closely approximates that of the USA (Table 4.1; and 2010 census results: http://quickfacts.census.gov/qfd/states/00000.html).

                     Race                              Gender          Age
                     African American  Other  White    Female  Male    Median
OSU Health System    14%               12%    74%      52%     48%     49.1
US Census            13%               15%    73%      51%     49%     36.8

Table 4.1: Demographics of Columbus, Ohio patient population compared to U.S in 2010.

4.4.2 Comparison of sarcoidosis prevalence to that of other rare lung diseases

Ohio State University Medical Center (OSUMC) is a regional referral center for all types of lung disease, including sub-specialty clinics for rare lung diseases (IPF, HSP, AAT, sarcoidosis) as well as lung cancer. A total of 1.48 million patients were encountered at OSUMC from 1995-2010 and were included in the analysis. The prevalence of sarcoidosis, and the number of patient encounters with the health system associated with the diagnosis of sarcoidosis, was significantly higher than for all other rare lung diseases (Table 4.2).

Disease       Sarcoidosis     HP              IPF             AAT             All Patients
ICD9          135             495             516.3           273.4           -
Patients      3758            528             2297            179             1.4 Million
Age (avg)     49              53              61              52              46
Sex           58% F / 42% M   51% F / 49% M   51% F / 49% M   45% F / 55% M   54% F / 46% M
Race (W/B/O)  57% / 40% / 3%  83% / 13% / 4%  81% / 15% / 4%  93% / 5% / 2%   74% / 14% / 12%

Table 4.2: Sarcoidosis Prevalence Compared to Other Rare Lung Diseases.

4.4.3 Changes in prevalence of Sarcoidosis over time in our health care system

In keeping with previous reports [143, 151], sarcoidosis was most common in the third through fifth decades (Figure 4.1), with a strong female gender bias (Figure 4.2). Despite regional census data indicating that the ratio of African Americans to whites is 1:5, the absolute number of African Americans presenting with sarcoidosis slightly exceeded that of whites (Figure 4.3). Over the 15-year period analyzed in this study there was no significant change in the age distribution, whereas there was a significant decrease over this time in the proportion of females, from 65 to 56% (p < 0.0001), and an increase in African American race, from 43 to 51% (p < 0.0001), in the patient sample (Figures 4.1-4.3). The overall prevalence of sarcoidosis in our patient population (approximately 95% were local residents, and this did not change over time) increased steadily from 164/100,000 in 2000 to 330/100,000 in 2010, whereas the prevalence of lung cancer and the population of Franklin County were unchanged over time (Figure 4.4). This increase in sarcoidosis prevalence over time remained highly significant after adjusting for changes in regional referral patterns, based upon zip code location of residence.

4.4.4 Estimate of sarcoidosis prevalence in Columbus, Ohio

Limiting the analysis of patient data to residency zip codes within Columbus, OH, and very conservatively assuming that all sarcoidosis cases in this referral base were cared for at our institution, the minimum estimated prevalence of sarcoidosis in Columbus, OH was 48/100,000 in 2010. The actual market share of our health care system for the greater Columbus area ranged from 23% in 1995 to 26% in 2010. Thus, the true prevalence of sarcoidosis in this region may exceed 200/100,000 in 2010.
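The market-share adjustment behind that last figure is a one-line calculation; the sketch below uses 24% as a representative value within the reported 23-26% range (our choice for illustration):

```python
# Back-of-envelope check of the market-share adjustment described above.
observed_prevalence = 48.0  # per 100,000, counting only cases seen at OSUMC
market_share = 0.24         # representative OSUMC share of greater Columbus
implied_prevalence = observed_prevalence / market_share
# -> approximately 200 per 100,000 if OSUMC sees ~24% of regional cases
```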

[Figure 4.1 appears here: stacked bar chart of sarcoidosis patient counts by age decade (20s-80s) for each year, 1995-2010.]

Figure 4.1: Sarcoidosis age distribution, 1995-2010. The average age of the sarcoidosis patients was 48 yrs, with the vast majority falling in the 3rd (light blue), 4th (dark blue), and 5th (red) decades.

4.5 Discussion

The actual health care burden associated with sarcoidosis in the U.S. is unknown; however, this study indicates that the prevalence of sarcoidosis in our patient population is nearly an order of magnitude greater than previously estimated [143] and has increased nearly three-fold within the past 15 years. This dramatic increase in disease prevalence over time is not fully explained by changes in the distribution of the patient population with respect to race or gender. Moreover, the prevalence of sarcoidosis was shown to increase over time relative to lung cancer, an index disease for which the national prevalence trends have remained relatively stable over the past 15 years [152]. Using the most conservative analytical approach, the prevalence of sarcoidosis in Columbus, Ohio, a population

[Figure 4.2 appears here: bar chart of sarcoidosis patient counts by gender for each year, 1995-2010.]

Figure 4.2: Changes in gender distribution over the years 1995 to 2010. Females are represented in blue and males in red.

closely resembling the demographic profile of the U.S., was at least 48/100,000 in 2010, which is over 3 times greater than the estimates of sarcoidosis prevalence in the U.S. reported in the Orphanet Report Series for Rare Diseases (www.orpha.net/orphacom/cahiers/docs/GB/Prevalence_of_rare_diseases_by_alphabetical_list.pdf), a resource used by the NIH's Office of Rare Lung Diseases Research to track disease prevalence.

Previous studies have indicated that sarcoidosis may be much more common than previously predicted. Interestingly, a large autopsy series conducted in Cleveland, Ohio indicated that the true prevalence of sarcoidosis was nearly identical to that reported in our patient population (approximately 300/100,000) [144]. If these results were to reflect national trends, they would correspond to over 900,000 sarcoidosis cases in the U.S., placing it well outside the range of rare lung diseases. Using an extremely conservative approach, including the

[Figure 4.3 appears here: bar chart of sarcoidosis patient counts by race for each year, 1995-2010.]

Figure 4.3: Changes in race distribution over the years 1995 to 2010; overall race distribution. African American (blue) and white (red) race represented the majority of sarcoidosis patients, reflecting regional demographics.

assumption that all regional sarcoidosis patients are cared for at The Ohio State University Medical Center, the prevalence of sarcoidosis was at least 48/100,000 in the greater Columbus, Ohio population in 2010. To the extent that our patient population is representative of trends in the U.S., the burden of this disease on our health care system and society as a whole (e.g., disability, missed work, reduced quality of life) may be much higher than anticipated. The high prevalence of sarcoidosis observed in our patient population is in keeping with European studies indicating that sarcoidosis is much more common than other interstitial lung diseases [153]. Accepting that the annual incidence of sarcoidosis among European and African Americans is estimated to be 3-10/100,000 and 35-80/100,000, respectively [154], and considering that most patients with sarcoidosis are diagnosed in early adulthood and have near-normal life expectancy [155, 156], it follows that the lifetime risk

[Figure 4.4 appears here: prevalence trend lines for the Franklin County population, lung cancer, and sarcoidosis from Jan-95 to Jan-10, with fitted trend lines (R² = 0.9115 and R² = 1).]

Figure 4.4: Sarcoidosis prevalence vs. lung cancer prevalence over time. Note that these trend lines were derived from patients residing in zip codes within Franklin County (Columbus), OH, to reduce bias relating to changing referral patterns. The results here are based upon actual US Census Data from 1990, 2000, and 2010 as well as the census estimates for intervening years.

of developing sarcoidosis in the U.S. would exceed the annual incidence by at least an order of magnitude. Our findings support this simple mathematical model.
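The arithmetic behind this simple model can be checked directly. A minimal sketch, assuming the 2010 US Census population total (~308.7 million) and, for the steady-state step, an illustrative disease duration of 40 years; neither figure is stated in the text:

```python
# Back-of-envelope check of the prevalence arithmetic above.
US_POPULATION_2010 = 308_745_538          # 2010 US Census total (assumed here)
AUTOPSY_PREVALENCE = 300 / 100_000        # ~300/100,000 from the autopsy series [144]

national_cases = AUTOPSY_PREVALENCE * US_POPULATION_2010
print(f"Implied national case count: {national_cases:,.0f}")  # over 900,000

# Steady-state prevalence ~= annual incidence x mean disease duration.
# With diagnosis in early adulthood and near-normal life expectancy,
# a duration of several decades is plausible (assumed: 40 years).
incidence_low, incidence_high = 3 / 100_000, 10 / 100_000  # European Americans [154]
duration_years = 40
lifetime_low = incidence_low * duration_years
lifetime_high = incidence_high * duration_years
print(f"Implied lifetime risk: {lifetime_low * 100_000:.0f}-"
      f"{lifetime_high * 100_000:.0f} per 100,000")
# 120-400 per 100,000: at least an order of magnitude above the annual incidence.
```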

There are a number of limitations of this study that could explain the apparent rapid increase in disease prevalence in our health system within the past 15 years. Potential sources of bias include the availability of more sensitive diagnostic screening techniques (e.g., CT scans), improved survival, greater recognition of sarcoidosis among medical professionals, and changing referral patterns to our institution. Alternatively, environmental (e.g., antigen exposure) or host factors (e.g., stress, vitamin D, smoking) could be contributing to the recent increase in sarcoidosis prevalence. Despite all of these potential confounders, it is undeniable that the regional prevalence of sarcoidosis is at least 48/100,000, much higher than current estimates (see above). Further studies are needed to clarify the actual burden of disease and to identify the variables contributing to the apparent increase in disease prevalence.

In summary, this study demonstrates an unexpectedly high prevalence of sarcoidosis in a large Midwest U.S. community, the demographic profile of which closely matches that of the U.S. The apparent prevalence has been rising rapidly over the past 15 years. We speculate that improved screening (e.g., CT scans) and diagnostic techniques, together with increased disease awareness, have improved our ability to detect disease. This would explain why our results closely approximate those reported by Reid in a large autopsy series conducted in a nearby community (Cleveland, OH) [144]. However, it remains possible that the regional prevalence of sarcoidosis has increased due to changes in host or environmental factors. Uncertainties relating to disease prevalence, together with recent data showing that sarcoidosis-related mortality has been increasing over the past several decades in the U.S. [157], emphasize the need to determine the true prevalence of sarcoidosis.

CHAPTER 5

COMPUTER ANALYSIS OF CHEST CT FOR SARCOIDOSIS

As one demonstration of integrative image search capabilities to support the development of image analysis techniques, we report on a new image analysis technique for evaluating chest CT images, which correlates with pulmonary function tests in pulmonary sarcoidosis patients. This novel two-point correlation CT image analysis tool provides an accurate, quantitative approach for measuring lung disease severity in patients with pulmonary sarcoidosis. No such image analysis tool is currently used in clinical practice. The computerized CT image analysis tool presented here may be useful for objectively assessing disease progression and response to treatment in patients with pulmonary sarcoidosis. Moreover, this tool could be applied to clinical research designed to test the effects of treatments on patients presenting with pulmonary sarcoidosis.

5.1 Summary

Chest CT scans are commonly used in the clinical setting to assess lung disease severity in patients presenting with pulmonary sarcoidosis. Despite their ability to reliably detect subtle changes in lung disease, the utility of chest CT for guiding therapy is limited by the fact that image interpretation by radiologists is qualitative and highly variable. Thus, we sought to create a computerized CT image analysis tool that would provide quantitative and clinically relevant information. Based upon preliminary trials, we established that a two-point correlation analysis approach was able to reduce the background signal attendant to normal lung structures, such as blood vessels, airways and lymphatics. This approach was applied to multiple lung fields to generate an overall lung texture score (LTS), representing the quantity of diseased lung parenchyma. Using de-identified lung CT and PFT data from The Ohio State University Medical Center's Information Warehouse, we analyzed 35 consecutive CT scans for which simultaneous matching PFTs were available to determine if the LTS correlated with standard PFT results. We found very strong inverse correlations between LTS and both FVC and TLC (Pearson's correlation coefficients of −0.92 and −0.88, respectively), whereas the correlation with DLCO was much weaker. The image analysis protocol was conducted quickly (< 1 minute per study) on a standard laptop computer running the publicly available NIH ImageJ Toolkit. Thus, the two-point correlation image analysis tool is highly practical and appears to reliably assess lung disease severity. We predict that this tool will be useful for clinical and research applications.

5.2 Introduction

Pulmonary function testing is currently the standard tool for objectively assessing lung disease severity in pulmonary sarcoidosis [158] and other interstitial lung diseases [159–163]. High resolution chest CT imaging is highly reproducible and is very sensitive for detecting lung pathology [164], unlike pulmonary function tests, which are subject to error relating to the patient, including effort and ethnic variability influencing reference standards, and technical variables, such as equipment, laboratory methodology, and different interpretive algorithms [165]. However, the current approach to interpretation of CT scan results, which depends upon radiologists, is poorly standardized and qualitative, both for clinical and research applications in patients with interstitial lung disease [166]. Thus, there is a need for a more objective and quantifiable approach to chest CT image analysis.

Previous attempts to quantify the severity of interstitial lung disease, particularly fibrotic lung disease, based upon computer-aided CT scan image analysis, were of limited utility. In general, previous studies have reported modest correlations between radiographic features (e.g., lung attenuation) and pulmonary function parameters [167–169], leading the authors of these studies to conclude that computer CT image analysis is "relatively insensitive to textural changes, such as ground glass abnormality, reticular abnormality" [167]. Thus, to account for textural characteristics of the lung relating specifically to disease, while minimizing the "noise" relating to normal features, including blood vessels, airways and other contiguous structures, we developed a two-point correlation function based approach to quantify the severity of diseased lung using conventional resolution CT images. To this end, we developed a proprietary plug-in program which can be operated on the publicly available NIH ImageJ Toolkit (http://rsbweb.nih.gov/ij/index.html). In our sarcoidosis patient population, the computerized image analysis tool is shown to be highly efficient (less than 1 minute per CT scan), to correlate strongly with PFT parameters, particularly FVC and TLC, and to run on an ordinary laptop computer. This novel two-point correlation function based image analysis tool is very practical for clinical and research applications relating to sarcoidosis and potentially other interstitial lung diseases.


Figure 5.1: Schematic representation of our two-point correlation function based approach (LTS). (a) LTS accepts chest CT studies in DICOM format as input; (b) using Hounsfield units, the lungs are segmented, and the segmented lung volume is then raster scanned; (c) during this scan each pixel is compared to its neighbors at various distances within a threshold; (d) this process is repeated for the entire lung segment; (e) the mismatches for each pixel are summed at the individual pixel level and then integrated throughout the volume to give the total volume of mismatches.

5.3 Methods

5.3.1 Sarcoidosis patient population

The Ohio State University Medical Center's clinical Information Warehouse serves as an honest broker for de-identified patient data, including image files and pulmonary function test results. Using "sarcoidosis" and ICD-9 code 135 as search terms, we analyzed 35 consecutive chest CT studies derived from 28 patients with an established diagnosis of sarcoidosis which had simultaneous (within 2 months) pulmonary function test results available.

5.3.2 CT Image Analysis

The lungs were segmented using Hounsfield units, and a two-point correlation function based (TPCF-based) approach (Figure 5.1) was employed to quantify the proportion of diseased lung, using a proprietary plug-in program which can be operated on the publicly available NIH ImageJ Toolkit (http://rsbweb.nih.gov/ij/index.html). The lung texture score (LTS) derived from these analyses represents the percentage of the lung parenchyma exhibiting non-uniform texture, as described in Figure 5.1. Briefly, our TPCF-based approach, LTS, is adapted from the field of materials science [170]. In materials science, while calculating the TPCF, samples from a given section of a material (metal, alloy, etc.) are compared to samples from another section in question, and the TPCF measures the correlation between the two sample populations. During our LTS calculations (TPCF-based approach), rather than measuring the correlation between the sections (pixels in our images), we measure the mismatches for a given CT study. Hence, following Figure 5.1, the LTS calculation can be explained as follows:

Step 1: For a given chest CT (Figure 5.1a), the lungs are segmented using the Hounsfield units (e.g., between −1000 and −200) (Figure 5.1b); during this segmentation operation, a ratio value between the segmented lung tissue volume and the overall CT volume (LR: Lung Ratio) is calculated as well. We then convert the values in this volume from 16 bits (65,536 gray levels) to 8 bits (256 gray levels) and enhance their contrast using histogram equalization.
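The Step 1 intensity pipeline can be sketched as follows. This is a minimal stand-in for the proprietary plug-in: the function name and the simple min-max rescaling are our own assumptions, not the plug-in's actual implementation.

```python
import numpy as np

def to_8bit_equalized(vol16):
    """Rescale a 16-bit CT volume to 8 bits, then equalize its histogram."""
    v = vol16.astype(np.float64)
    span = max(v.max() - v.min(), 1.0)
    v8 = ((v - v.min()) / span * 255.0).astype(np.uint8)

    # Classic histogram equalization: map each of the 256 gray levels
    # through the cumulative distribution function (CDF).
    hist = np.bincount(v8.ravel(), minlength=256)
    cdf = hist.cumsum() / v8.size
    lut = np.round(cdf * 255.0).astype(np.uint8)   # lookup table
    return lut[v8]

rng = np.random.default_rng(0)
vol = rng.integers(0, 65536, size=(16, 16, 16))
out = to_8bit_equalized(vol)
```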

Step 2: Within this given lung volume the pixel values are converted from 8 bits to 4 bits (16 gray levels). For each pixel (Figure 5.1c) in this volume (Figure 5.1d), comparisons are made with the surrounding neighbors. Here, if we were to compare pixels 2 steps away in every direction (X: sagittal, Y: transversal and Z: axial), we would be making 125 comparisons (5×5×5) per pixel. Comparisons between pixels can be made at any distance (e.g., a 1-pixel distance would give 27 comparisons (3×3×3), a 3-pixel distance 343 comparisons (7×7×7), etc.). Using this logic, mismatches for each pixel are thresholded (e.g., if there is more than a 75% mismatch with the surrounding pixels, count the pixel as a mismatch). Then, the mismatches (Figure 5.1e) are integrated over the given lung volume, and this integrated value is turned into a ratio of mismatches for the entire volume (MR: Mismatch Ratio). The final value of LTS is calculated by dividing the Mismatch Ratio by the Lung Ratio (LTS = MR/LR).
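Putting Steps 1 and 2 together, the LTS computation can be sketched in a few lines of NumPy. This is a simplified reimplementation for illustration only, not the proprietary ImageJ plug-in: the quantization collapses the 16-bit → 8-bit → 4-bit pipeline into a single min-max mapping to 16 gray levels, the neighborhood count includes the center voxel (matching the 125-comparison count in the text), and all parameter names are our own.

```python
import numpy as np

def lung_texture_score(ct_hu, lung_lo=-1000, lung_hi=-200,
                       dist=2, mismatch_frac=0.75, levels=16):
    """Sketch of LTS = MR / LR over a 3-D volume of Hounsfield units."""
    lung = (ct_hu >= lung_lo) & (ct_hu <= lung_hi)
    lr = lung.mean()                        # LR: lung voxels / all voxels
    if lr == 0.0:
        return 0.0

    # Quantize the whole volume to 16 gray levels (simplified stand-in
    # for the 16-bit -> 8-bit -> equalized -> 4-bit pipeline).
    lo, hi = float(ct_hu.min()), float(ct_hu.max())
    q = ((ct_hu - lo) / max(hi - lo, 1.0) * (levels - 1)).astype(np.int16)

    # Compare each voxel with every neighbor in a (2*dist+1)^3 cube:
    # 125 comparisons per voxel for dist=2, as in the text.
    qp = np.pad(q, dist, mode="edge")
    nz, ny, nx = q.shape
    diff = np.zeros(q.shape, dtype=np.int32)
    n = 0
    for dz in range(-dist, dist + 1):
        for dy in range(-dist, dist + 1):
            for dx in range(-dist, dist + 1):
                shifted = qp[dist + dz:dist + dz + nz,
                             dist + dy:dist + dy + ny,
                             dist + dx:dist + dx + nx]
                diff += (shifted != q)
                n += 1

    # A lung voxel counts as a mismatch when more than 75% of its
    # neighborhood disagrees with it.
    mismatch = (diff / n > mismatch_frac) & lung
    mr = mismatch.mean()                    # MR: mismatch voxels / all voxels
    return mr / lr                          # LTS = MR / LR
```

On a synthetic volume with a uniform "lung" region the score stays near zero; filling the same region with random attenuation values drives it up, which is the intended behavior of the mismatch counting.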

5.3.3 Statistical methods

We used Pearson's approach to measure the correlation between the current gold-standard PFT results associated with each CT scan and the calculated LTS.
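For reference, Pearson's r for paired observations can be computed directly. The values below are hypothetical stand-ins for the (LTS, percent-predicted FVC) pairs; the study data are not reproduced here.

```python
import numpy as np

# Hypothetical (LTS, percent-predicted FVC) pairs -- illustrative only.
lts = np.array([12.0, 25.0, 31.0, 44.0, 58.0])
fvc = np.array([98.0, 81.0, 74.0, 60.0, 42.0])

r = np.corrcoef(lts, fvc)[0, 1]   # Pearson's correlation coefficient
print(f"r = {r:.2f}")             # prints r = -1.00 for these nearly linear pairs
```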

5.4 Results

5.4.1 Two-point correlation analysis of CT images reduces background signal from normal lung structures

The image analysis tool effectively eliminates the background noise represented by normal contiguous lung structures (e.g., blood vessels, airways) such that the signals from alveolar and interstitial compartments are highlighted (Figure 5.2).


Figure 5.2: Two-point correlation analysis of CT images highlights diseased lung. The image on the left is produced by filtering out the tissues other than the lungs. The image on the right is a "result" image which shows the calculated features for every pixel from the source image. Areas marked by "1" (green arrows) highlight example ROIs with enhanced discrimination between normal and abnormal lung tissue. Areas marked by "2" (red arrows) highlight example ROIs with enhanced discrimination between "normal" blood vessels and adjacent "diseased" lung.

5.4.2 LTS strongly correlates with forced vital capacity (FVC) and total lung capacity (TLC)

Higher LTS is shown to correlate strongly (inversely) with percent-predicted FVC and TLC, whereas the correlation with DLCO was relatively weak (Figure 5.3).

5.4.3 Correlations between LTS and pulmonary function remain significant after reducing image intensity precision

16-bit images are converted to 4-bit images during our calculations; this implies that lower dose CT images (less radiation exposure, but lower signal-to-noise ratio) may be suitable for LTS analyses.

Figure 5.3: Correlation between CT image score and lung function. The score calculated from the CT images reflected the amount of irregularity within the lungs (that is, the percentage of irregular (textured) lung within the overall lung volume for the given CT image). Higher CT image scores correlated well with lower percent-predicted FVC (r = −0.92) and percent-predicted TLC (r = −0.88). LTS did not correlate well with percent-predicted DLCO (r = −0.15).

5.5 Discussion

Despite ready access to highly automated and reproducible high quality images available from lung CT scans [171], CT image analysis is currently of limited utility for assessing the severity of pulmonary sarcoidosis due to the subjective and non-quantifiable nature of the radiologist's interpretation of interstitial lung disease severity. As such, subtle changes in CT image characteristics are typically ignored in the context of clinical sarcoidosis research. The current study indicates that a novel two-point correlation analysis of CT images strongly correlates with pulmonary function parameters (FVC, TLC) representing the current standard for estimating lung disease severity in patients with pulmonary sarcoidosis [172, 173].

Currently available clinical end-points for sarcoidosis research all have their limitations. For instance, patients often report a paradoxical worsening of their quality-of-life despite treatments that effectively prevent lung disease progression. On the other hand, patients can remain asymptomatic with "normal" lung function despite obvious progression of pulmonary sarcoidosis based upon radiographic criteria. A more objective measure of the overall burden of pulmonary disease is needed to guide clinical decision-making and research. By eliminating the large background signal generated by normal lung structures (blood vessels, lymphatics, airways), the computerized two-point correlation analysis described herein assesses changes in radiographic texture relating to interstitial and alveolar processes. Thus, this quantitative approach is expected to objectively and more sensitively detect changes in the overall burden of parenchymal lung disease.

Another potential advantage of the computerized two-point correlation analysis approach is the potential to reduce the intensity of radiation exposure required to generate useful CT images. In this regard, high-resolution CT scans with high signal-to-noise ratio (SNR) are generally regarded as superior to conventional CT images for detecting changes in lung disease severity in patients with interstitial lung diseases (ILDs), such as sarcoidosis. However, there is growing concern relating to the long-term risks attendant to the higher levels of radiation exposure when using serial CT scans for the assessment of disease progression [174]. Computer-aided CT image analysis was recently shown to detect subtle changes in lung texture attendant to early asymptomatic ILD using low-dose CT studies [175]. While the retrospective nature of the current study does not allow us to determine the minimal radiation dose required to maintain a strong correlation with standardized pulmonary function results, we were able to show strong correlations between the LTS and both FVC and TLC while using 4-bit precision. Being able to work with 16 gray levels suggests that LTS is robust with respect to SNR and should be able to work with low-dose scans. We plan to investigate this in a later study.

In addition to the study design limitations mentioned above, a number of other questions are raised by these compelling results. Our analysis was confined to a single institution using similar CT scanners and protocols; there was no attempt to correlate image analysis with a particular sarcoidosis disease phenotype (e.g., nodular, ground glass, fibrotic); and it is unclear if this approach is also effective for detecting changes in the severity of other ILDs. To further enhance the performance of this image analysis tool, future studies should start with the raw image data to optimize the protocol for the purpose of minimizing radiation exposure while further improving the detection of changes in lung texture relating to ILD, and perhaps obstructive lung disease. With respect to the latter, our unpublished data suggest that lower LTS scores correspond to emphysema severity. Finally, we speculate that the relatively poor correlation between LTS and DLCO in sarcoidosis patients relates to the upper-lobe predominance of the disease, which tends to preserve DLCO compared to interstitial lung diseases affecting the lower lobes (e.g., IPF). As such, we expect improved correlation with DLCO could be achieved by weighting the image analysis based upon the severity of apical versus basilar lung disease.

In summary, the novel two-point CT image analysis approach described herein is shown to strongly correlate with the severity of pulmonary sarcoidosis based upon standard PFT criteria, and these correlations were obtained using 16 gray levels. This implies that the computer image analysis approach could reduce the risks of radiation exposure while providing a more objective assessment of disease progression for clinical and research applications. Another benefit of this approach is its demonstrated efficiency in terms of computer power, expense and time. Specifically, the analysis can be conducted within one minute on a conventional laptop computer using NIH's open source ImageJ Toolkit along with our ImageJ Java software plug-in. Additional studies are needed to determine if the two-point correlation CT image analysis approach is effective for other interstitial and obstructive lung diseases.

CHAPTER 6

CONCLUSIONS

The primary objective of the research and development described in this thesis has been to provide an integrative platform where multidimensional data from multiple disparate sources can be easily accessed, visualized, and analyzed. We believe that the ability to execute such truly integrative queries, visualizations and analyses across multiple data types is critical to executing highly effective clinical and translational research. Therefore, to address the existing gap in knowledge, we have introduced a model computational framework that supports the integrative query, visualization and analysis of structured data, narrative text, and image data sets in support of translational research activities. The introduced framework also addresses the challenges posed by regulatory compliance, patient privacy/confidentiality concerns, and the need to facilitate multicenter research paradigms.

In Chapter 2, we have described a model and associated software framework with a unique combination of components that are capable of providing translational research users with an integrative query and information retrieval tool that spans multiple, critical biomedical information sources including structured data, narrative text, and images. While there are several examples in the current literature of integrative and ontology-anchored image search or query tools, to the best of our knowledge, our framework is the first to also support the simultaneous query and subsequent integration of image data sets with structured data and narrative text. This data integration methodology has significant impact on research applications and beyond. Being able to query and retrieve clinical images and related metadata from PACS in bulk allows for responding to very crucial queries related to research, operations, quality, safety as well as business intelligence. In addition, it provides timely and cost-effective access to clinical data for secondary use. Use of this framework enables otherwise unanswerable queries for many different types of users such as researchers, business administrators, system administrators, management engineers, medical students and clinicians. Our future plans for this project include the continued evaluation of the framework, with specific emphasis on the types of novel hypotheses that can be addressed using such a knowledge-anchored, integrative query platform, as well as its applicability to other usage scenarios. We fully anticipate that our system, with its focus on satisfying a critical translational research information need, will continue to develop into an operational platform for use by researchers at OSUMC that will also be extensible to the broader informatics and research communities.

Chapter 3 further advances our framework from Chapter 2. The creation of a De-identified Information Warehouse (DIW) is a continuing effort at the OSUMC IW to better support research activities using clinical data. The ultimate goal is to enable a direct connection between a researcher and data. The IW's HBP has shortened the time it takes for researchers to access the data they need. With the HBP, OSUMC researchers can gain access to limited and de-identified data much faster than before. However, IW analysts can only process so many data requests at any given time. The new bottleneck facing researchers is the lengthening data request queue. As a solution, the DIW can easily be coupled with commercial database query tools, in-house developed query tools, or open source development efforts such as i2b2, caGrid, etc. This can improve the efficiency of data requests by letting researchers perform their own queries on their own time, with greater interactivity and increased flexibility.

Our framework successfully de-identifies and removes all HIPAA-mandated PHI from structured data elements, and provides conformance with the local IRB, while maintaining data integrity. There are no other de-identification frameworks in the literature which guarantee a zero-knowledge protocol, in which new identifiers are generated even for each session. This makes our framework a suitable tool for aiding core de-identification operations, which need to take place at medical data warehouses in order to support non-human-subject research using clinical data. From a clinical and translational researcher's perspective, this means timely access to data. A potential researcher who is preparing a grant submission that is due in a week may not be able to afford waiting for a data analyst to test a last-minute hypothesis which relies on retrospective analysis. In this case, timely access to data may differentiate between the researcher who gets the data and the grant, and the one who does not.

As a use case scenario utilizing the infrastructures and information acquisition techniques developed in Chapters 2 and 3 for secondary use of medical data, in Chapter 4 we have demonstrated prevalence calculations on retrospective data. In summary, this study validates an unexpectedly high prevalence of sarcoidosis in a large Midwest U.S. community, the demographic profile of which closely matches that of the U.S. The apparent prevalence has been rapidly increasing over the past 15 years. This information would otherwise remain anecdotal, but fortunately, the previously developed novel infrastructures and techniques (Chapters 2 and 3) enable timely quantification of suspicions arising from OSUMC researchers' clinical encounters.

In Chapter 5, further utilizing the infrastructures and information acquisition techniques developed in Chapters 2 and 3, we have demonstrated the development of image processing tools using de-identified medical data. In summary, the novel two-point CT image analysis approach described herein is shown to strongly correlate with the severity of pulmonary sarcoidosis based upon standard PFT criteria, and these correlations were obtained using 16 gray levels. This implies that the computer image analysis approach could reduce the risks of radiation exposure while providing a more objective assessment of disease progression for clinical and research applications. Another benefit of this approach is its demonstrated efficiency in terms of computer power, expense and time. Specifically, the analysis can be conducted within one minute on a conventional laptop computer using NIH's open source ImageJ Toolkit along with our ImageJ Java software plug-in. As future work, we plan to conduct additional studies to determine if the two-point correlation CT image analysis approach is effective for other interstitial and obstructive lung diseases.

In summary, the work in this thesis consists of novel techniques that facilitate high-throughput clinical and translational research. We further demonstrate these capabilities by introducing novel discoveries and clinical findings obtained using the frameworks themselves. The author believes that the infrastructural advances introduced around the novel frameworks and information retrieval techniques presented in this thesis will be valuable to many researchers who strive for data. After all, timely access to data is a critical component of timely and efficient clinical and translational research.

BIBLIOGRAPHY

[1] NIH. NIH Roadmap for Medical Research. http://nihroadmap.nih.gov/, 2009.

[2] Cimino J.J. From data to knowledge through concept-oriented terminologies: experience with the medical entities dictionary. J Am Med Inform Assoc., 7(3):288–297, 2000.

[3] Sujansky W. Heterogeneous database integration in biomedicine. J Biomed Inform., 34(4):285–298, 2001.

[4] Kamal J., Rogers P., Saltz J., and Mekhjian H. Information warehouse as a tool to analyze computerized physician order entry order set utilization: Opportunities for improvement. In AMIA Annu Symp Proc., pages 336–340, 2003.

[5] Prather J.C., Lobach D.F., Goodwin L.K., Hales J.W., Hage M.L., and Hammond W.E. Medical data mining: knowledge discovery in a clinical data warehouse. In Proc AMIA Annu Fall Symp, pages 101–105, 1997.

[6] Brown M.S., McNitt-Gray M.F., Pais R., Shah S.K., Qing P., Da Costa I., Aberle D.R., and Goldin J.G. Cad in clinical trials: current role and architectural requirements. Comput Med Imaging Graph., 31(4-5):332–337, 2007.

[7] Sigal R. Pacs as an e-academic tool. In International Congress Series 1281 (CARS 2005): Computer Assisted Radiology and Surgery, pages 900–904, 2005.

[8] Boochever S.S. His/ris/pacs integration: getting to the gold standard. Radiol Manage, 26:16–24, 2004.

[9] Kamauu A.W., DuVall S.L., Robison R.J., Liimatta A.P., Wiggins R.H., and Avrin D.E. Informatics in radiology (inforad): vendor-neutral case input into a server-based digital teaching file system. Radiographics, 26(6):1877–1885, 2007.

[10] NCIA: Reference Image Database to Evaluate Response (RIDER). http://ncia.nci.nih.gov/ncia/collections., 2007.

[11] Guarino N. and Poli R., editors. Formal Ontology in Conceptual Analysis and Knowledge Representation, chapter Toward principles for the design of ontologies used for knowledge sharing. Kluwer, Norwell, 1993.

[12] Bruce G.B. and Joseph P. Ontology-guided knowledge discovery in databases. In Proceedings of the International Conference on Knowledge Capture, 2001.

[13] Smith B. and Kumar A. On controlled vocabularies in bioinformatics: a case study in the gene ontology. Biosilico: Drug Discovery Today, 2(1):246–252, 2004.

[14] Gurcan M.N., Sahiner B., Petrick N., Chan H.P., Kazerooni E.A., Cascade P.N., and Hadjiiski L. Lung nodule detection on thoracic computed tomography images: preliminary evaluation of a computer-aided diagnosis system. Med Phys., 29(11):2552–2558, 2002.

[15] Payne P.R., Johnson S.B., Starren J.B., Tilson H.H., and Dowdy D. Breaking the translational barriers: the value of integrating biomedical informatics and translational research. J Investig Med., 53(4):192–200, 2005.

[16] Ebbert J.O., Dupras D.M., and Erwin P.J. Searching the medical literature using pubmed: a tutorial. Mayo Clin Proc., 78(1):87–91, 2003.

[17] Olson G.M. et al., editors. Proceedings of SAICSIT, South Africa, chapter Collaboratories to support distributed science: the example of international HIV/AIDS research. ACM Press, 2002.

[18] Butler D. Data, data, everywhere. Nature, 414(6866):840–841, 2001.

[19] Marks R.G., Conlon M., and Ruberg S.J. Paradigm shifts in clinical trials enabled by information technology. Stat Med., 20(17-18):2683–2696, 2001.

[20] Payne P.R., Greaves A.W., and Kipps T.J. Crc clinical trials management system (ctms): an integrated information management solution for collaborative clinical research. In AMIA Annu Symp Proc, page 967, 2003.

[21] Kuchenbecker J., Dick H.B., Schmitz K., and Behrens-Baumann W. Use of internet technologies for data acquisition in large clinical trials. Telemed J E Health., 7(1):73–76, 2001.

[22] Marks L. and Power E. Using technology to address recruitment issues in the clinical trial process. Trends Biotechnol., 20(3):105–109, 2002.

[23] Bates D.W., Ebell M., Gotlieb E., Zapp J., and Mullins H.C. A proposal for electronic medical records in u.s. primary care. J Am Med Inform Assoc., 10(1):1–10, 2003.

[24] Sung N.S. et al. Central challenges facing the national clinical research enterprise. JAMA, 289(10):1278–1287, 2003.

[25] Bates D.W. et al. Effect of computerized physician order entry and a team intervention on prevention of serious medication errors. JAMA, 280(15):1311–1316, 1998.

[26] Huang H. et al. Picture archiving and communication systems (PACS) in medicine. Springer, NY, 1991.

[27] Duerinckx A.J. and Pisa E.J. Filmless picture archiving and communication system (pacs) in diagnostic radiology. In Proc SPIE, pages 9–18, 1982.

[28] Gurcan M.N., Sharma A., Kurc T., Oster S., Langella S., Hastings S., Siddiqui K.M., Siegel E.L., and Saltz J. Gridimage: a novel use of grid computing to support interactive human and computer-assisted detection decision support. J Digit Imaging, 20:160–171, 2007.

[29] Craver J.M. and Gold R.S. Research collaboratories: their potential for health behavior researchers. Am J Health Behav., 26(6):504–509, 2002.

[30] Kukafka R., Johnson S.B., Linfante A., and Allegrante J.P. Grounding a new information technology implementation framework in behavioral science: a systematic analysis of the literature on it use. J Biomed Inform., 36(3):218–227, 2003.

[31] Johnson M.S., Gonzales M.N., and Bizila S. Responsible conduct of radiology research part v. the health insurance portability and accountability act and research. Radiology., 237(3):757–764, 2005.

[32] Liu B.J., Zhou Z., and Huang H.K. A hipaa-compliant architecture for securing clinical images. J Digit Imaging., 19(2):172–180, 2006.

[33] Amendolia S.R., Estrella F., Hassan W., Hauer T., Manset D., McClatchey R., Rogulin D., and Solomonides T. Mammogrid: a service oriented architecture based medical grid application. In 3rd International Conference on Grid and Cooperative Computing, 2004.

[34] Blanquer I., Hernandez V., Mas F., and Segrelles D. A middleware grid for storing, retrieving and processing dicom medical images. In Workshop on Distributed Databases and Processing in Medical Image Computing (DIDAMIC), 2004.

[35] Espert I.B., Garcaa V.H., and Quilis J.D. An ogsa middleware for managing medical images using ontologies. J Clin Monit Comput., 19(4-5):295–305, 2005.

[36] Montagnat J., Duque H., Pierson J.M., Breton V., Brunie L., and Magnin I.E. Medical image content-based queries using the grid. In HealthGrid, 2003.

[37] Power D., Politou E., Slaymaker M., Harris S., and Simpson A. A relational approach to the capture of dicom files for grid-enabled medical imaging databases. In ACM symposium on applied computing., 2004.

[38] Foster I. and Kesselman C. The Grid 2: blueprint for a new computing infrastructure. 2nd Ed. Morgan Kaufmann, 2003.

[39] Payne P.R., Mendonca E.A., Johnson S.B., and Starren J.B. Conceptual knowledge acquisition in biomedicine: a methodological review. J Biomed Inform., 40:582–602, 2007.

[40] NLM. Unified Medical Language System. http://www.nlm.nih.gov/research/umls/meta2.html., 2007.

[41] Bodenreider O. Using umls semantics for classification purposes. In Proc AMIA Symp., pages 86–90, 2004.

[42] Campbell K.E., Oliver D.E., Spackman K.A., and Shortliffe E.H. Representing thoughts, words, and things in the umls. J Am Med Inform Assoc., 5(5):421–431, 1998.

[43] Thomas B.J., Ouellette H., Halpern E.F., and Rosenthal D.I. Automated computer-assisted categorization of radiology reports. Am J Roentgenol., 184(2):687–690, 2005.

[44] Tsui F-C., Wagner M.M., Dato V., and Chang C.H. Value of icd-9-coded chief complaints for detection of epidemics. J Am Med Inform Assoc., 9:41–47, 2002.

[45] Friedman C., Shagina L., Lussier Y., and Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc., 11(5):392–402, 2004.

[46] Srinivasan S., Rindflesch T.C., Hole W.T., Aronson A.R., and Mork J.G. Finding umls metathesaurus concepts in medline. In American Medical Informatics Association Annual Symposium., pages 727–731, 2002.

[47] Taira R.K., Soderland S.G., and Jakobovits R.M. Automatic structuring of radiology free-text reports. Radiographics, 21:237–245, 2001.

[48] Zou Q., Chu W.W., Morioka C., Leazer G.H., and Kangarloo H. Indexfinder: a method of extracting key concepts from clinical texts for indexing. In American Medical Informatics Association Annual Symposium, pages 763–767, 2003.

[49] Alonso O. et al. Oracle text white paper. available at http://www.oracle.com/technology/products/text/index.html, 2006.

[50] International Business Machines Corporation. Db2 text extender. available at ftp://ftp.software.ibm.com/software/data/db2/extenders/text/db2tewkspecsheet.pdf., 2002.

[51] Microsoft Corporation. Sql server 2000 full-text search deployment white paper. available at http://www.support.microsoft.com/kb/323739, 2004.

[52] Ferrucci D. and Lally A. Building an example application with the unstructured information management architecture. IBM Syst J., 43(3):455–475, 2004.

[53] Baecker R., Small I., and Mander R. Bringing icons to life. In Proceedings of the SIGCHI conference on human factors in computing systems: reaching through technology, pages 1–6, 1991.

[54] NEMA: Digital imaging and communications in medicine. http://www.medical.nema.org, 2007.

[55] Armato III S.G., McLennan G., McNitt-Gray M.F., Meyer C.R., Yankelevitz D., Aberle D.R., Henschke C.I., Hoffman E.A., Kazerooni E.A., MacMahon H., Reeves A.P., Croft B.Y., and Clarke L.P. Lung image database consortium: developing a resource for the medical imaging research community. Radiology, 232:739–748, 2004.

[56] Sigal R. Pacs as an e-academic tool. In CARS 2005: computer assisted radiology and surgery., 2005.

[57] Toms A.P., Kasmai B., Williams S., and Wilson P. Building an anonymized catalogued radiology museum in pacs: a feasibility study. Br J Radiol., 79:661–671, 2006.

[58] Cohen S., Gilboa F., and Uri S. Pacs and electronic health records. In SPIE, 2002.

[59] Lehmann T., Wein B., and Greenspan H. Integration of content-based image retrieval to picture archiving and communication systems. In Medical Informatics Europe Conference, 2003.

[60] Traina A., Rosa N.A., and Traina C. Integrating images to patient electronic medical records through content-based retrieval techniques. In 16th IEEE Symposium on Computer-Based Medical Systems., 2003.

[61] Leoni L., Manca S., Giachetti A., and Zanetti G. A virtual data grid architecture for medical data using srb. In EuroPACS-MIR, 2004.

[62] Erdal S., Catalyurek U.V., Saltz J., Kamal J., and Gurcan M.N. Flexible patient information search and retrieval framework: pilot implementation. In Proceedings of the SPIE Medical Imaging., 2007.

[63] Erdal S., Catalyurek U.V., Saltz J., Kamal J., and Gurcan M.N. Information warehouse application of cagrid: a prototype implementation. In caBIG 2007 Annual Meeting., 2007.

[64] Erdal S., Catalyurek U.V., Payne P.R.O., Saltz J., Kamal J., and Gurcan M.N. Integrating a pacs system to grid: a de-identification and integration framework. In Annual Meeting of the Society for Imaging Informatics in Medicine (SIIM), 2007.

[65] Lindberg C. The unified medical language system (umls) of the national library of medicine. J Am Med Rec Assoc, 61(5):40–42, 1990.

[66] Lindberg D.A., Humphreys B.L., and McCray A.T. The unified medical language system. Methods Inf Med., 32(4):281–291, 1993.

[67] Cancer Biomedical Informatics Grid (caBIG). https://cabig.nci.nih.gov/workspaces/architecture/cagrid, 2006.

[68] Eckerson W.W. Three tier client/server architecture: achieving scalability, performance, and efficiency in client server applications. Open Inf Syst., 10:1–12, 1995.

[69] Gallaugher J. and Ramanathan S. Choosing a client/server architecture. a comparison of two-tier and three-tier systems. Inf Syst Manage Mag, 13(2):7–13, 1996.

[70] Clunie D.A. Dicom structured reporting. PixelMed Publishing, 2000.

[71] Powell J. and Buchan I. Electronic health records should support clinical research. J Med Internet Res., Jan-Mar;7(1):e4, 2005.

[72] Weiner M. and Embi P. Toward reuse of clinical data for research and quality improvement: the end of the beginning? Ann Intern Med, 151:359–360, 2009.

[73] U.S. Dept. of Health and Human Services. Standards for privacy of individually identifiable health information, final rule., 2002.

[74] U.S. Dept. of Health and Human Services. Federal policy for the protection of human subjects (the common rule), 1991.

[75] Kamal J., Silvey S.A., Buskirk J., Dhaval R., Erdal S., Ding J., Ostrander M., Borlawsky T., Smaltz D.H., and Payne P.R. Innovative applications of an enterprise-wide information warehouse. In AMIA Annu Symp Proc., page 1134, 2008 Nov 6.

[76] Silvey S.A., Schulte J., Smaltz D.H., and Kamal J. Honest broker protocol streamlines research access to data while safeguarding patient privacy. In AMIA Annu Symp Proc., page 1133, 2008 Nov 6.

[77] Liu J., Erdal S., Silvey S.A., Ding J., Marsh C.B., and Kamal J. Toward a fully de-identified biomedical information warehouse. In AMIA Annu Symp Proc., pages 370–374, 2009 Nov 14.

[78] Boussi Rahmouni H., Solomonides T., Casassa Mont M., Shiu S., and Rahmouni M. A model-driven privacy compliance decision support for medical data sharing in europe. Methods Inf Med., 50(4):326–36. Epub 2011 Jul 26., 2011 Aug 15.

[79] Holzer K. and Gall W. Utilizing ihe-based electronic health record systems for secondary use. Methods Inf Med., 50(4):319–25. Epub 2011 Mar 21., 2011 Aug 15.

[80] Safran C., Bloomrosen M., Hammond W.E., Labkoff S., Markel-Fox S., Tang P.C., Detmer D.E., and Expert Panel. Toward a national framework for the secondary use of health data: an american medical informatics association white paper. J Am Med Inform Assoc., 14(1):1–9. Epub 2006 Oct 31., 2007 Jan-Feb.

[81] Wylie J.E. and Mineau G.P. Biomedical databases: protecting privacy and promoting research. Trends Biotechnol., 21(3):113–116, 2003, March.

[82] Loukides G., Gkoulalas-Divanis A., and Malin B. Anonymization of electronic medical records for validating genome-wide association studies. Proc Natl Acad Sci., 107(17):7898–7903, 2010, April.

[83] Claerhout B. and DeMoor G.J. Privacy protection for clinical and genomic data. the use of privacy-enhancing techniques in medicine. Int J Med Inform., 74(2-4):257–265, 2005 Mar.

[84] Cooper T. and Collman J., editors. Advances in Medical Informatics: Knowledge Management and Data Mining in Biomedicine, chapter Managing Information Security and Privacy in Healthcare Data Mining, pages 95–137. Springer, 2005.

[85] de Moor G.J., Claerhout B., and de Meyer F. Privacy enhancing technologies: the key to secure communication and management of clinical and genomic data. Methods Inform Med., 42:148–153, 2003.

[86] El Emam K., Jabbouri S., Sams S., Drouet Y., and Power M. Evaluating common de-identification heuristics for personal health information. J Med Internet Res., Oct-Dec;8(4):e28, 2006.

[87] Kohane I.S., Dong H., and Szolovits P. Health information identification and de-identification toolkit. In Proc AMIA Symp., pages 356–360, 1998.

[88] Narayanan A. and Shmatikov V. Privacy and security: myths and fallacies of "personally identifiable information". Communications of the ACM., 53(6):24–26, 2010.

[89] Cavoukian A. and El Emam K. Dispelling the myths surrounding de-identification: Anonymization remains a strong tool for protecting privacy. Technical report, Discussion Papers, Information and Privacy Commissioner of Ontario., June 2011.

[90] El Emam K., Dankar F.K., Vaillancourt R., Roffey T., and Lysyk M. Evaluating the risk of re-identification of patients from hospital prescription records. The Canadian Journal of Hospital Pharmacy., 62(4):307–319, 2009.

[91] Roden D.M., Pulley J.M., Basford M.A., Bernard G.R., Clayton E.W., Balser J.R., and Masys D.R. Development of a large-scale de-identified dna biobank to enable personalized medicine. Clinical Pharmacology and Therapeutics., 84(3):362–369, 2008.

[92] Lyman J.A., Scully K., and Harrison Jr J.H. The development of health care data warehouses to support data mining. Clinics in Laboratory Medicine., 28(1):55–71, 2008.

[93] Berman J.J. Concept-match medical data scrubbing. how pathology text can be used in research. Archives of Pathology and Laboratory Medicine., 127(6):680–686, 2003.

[94] Gardner J. and Xiong L. Hide: An integrated system for health information de-identification. In 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS), pages 254–259, June 2008.

[95] Gupta D., Saul M., and Gilbertson J. Evaluation of a de-identification (de-id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol., 121:176–186, 2004.

[96] El Emam K. and Dankar F. K. Protecting privacy using k-anonymity. J Am Med Inform Assoc., 15(5):627–637, 2008 Sept.-Oct.

[97] Sweeney L. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems., 10(5):557–570, 2002 October.

[98] Meystre S.M., Friedlin F.J., South B.R., Shen S., and Samore M.H. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol., 10:70, 2010 Aug.

[99] Pulley J., Clayton E., Bernard G.R., Roden D.M., and Masys D.R. Principles of human subjects protections applied in an opt-out, de-identified biobank. Clin Transl Sci., 3(1):42–48, 2010 Feb.

[100] Kantarcioglu M., Jiang W., Liu Y., and Malin B. A cryptographic approach to securely share and query genomic sequences. IEEE Trans Inf Technol Biomed., 12(5):606–617, 2008 Sept.

[101] Hacigumus H., Iyer B., Li C., and Mehrotra S. Executing sql over encrypted data in the database-service-provider model. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data., 2002.

[102] Kantarcioglu M., Jiang W., and Malin B. A privacy-preserving framework for inte- grating person-specific databases, privacy in statistical databases. LNCS, 5262:298– 314, 2008.

[103] Sweeney L. Guaranteeing anonymity when sharing medical data, the datafly system. In Proc AMIA Annu Fall Symp., pages 51–55, 1997.

[104] Sweeney L. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems., 10(5):571–588, 2002 Oct.

[105] Dwork C. Differential privacy: a survey of results. In Proceedings of the 5th international conference on Theory and applications of models of computation, TAMC'08., pages 1–19, 2008.

[106] Neamatullah I., Douglass M.M., Lehman L.H., Reisner A., Villarroel M., Long W.J., Szolovits P., Moody G.B., Mark R.G., and Clifford G.D. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak., 8:32, 2008 Jul 24.

[107] Wellner B., Huyck M., Mardis S., Aberdeen J., Morgan A., Peshkin L., Yeh A., Hitzeman J., and Hirschman L. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc., 14(5):564–573, 2007 Sept.-Oct.

[108] Uzuner O., Sibanda T.C., Luo Y., and Szolovits P. A de-identifier for medical discharge summaries. Artif Intell Med., 42(1):13–35, 2008 January.

[109] Lafky D. The safe harbor method of de-identification: an empirical test. Department of Health and Human Services presentation, http://www.ehcca.com/presentations/hipaawest4/lafky_2.pdf., Oct. 8, 2009.

[110] Goldwasser S., Micali S., and Rackoff C. The knowledge complexity of interactive proof systems. SIAM Journal on Computing, 18(1):186–208, 1989 February.

[111] Berman J.J. Health and human services workshop on the hipaa privacy rules de-identification standard, http://www.hhshipaaprivacy.com, March 8-9, 2010.

[112] U.S. Dept. of Health and Human Services, Office for Human Research Protections (OHRP). Guidance on research involving coded private information or biological specimens., October 2008.

[113] Boyd A.D., Saxman P.R., Hunscher D.A., Smith K.A., Morris T.D., Kaston M., Bayoff F., Rogers B., Hayes P., Rajeev N., Kline-Rogers E., Eagle K., Clauw D., Greden J.F., Green L.A., and Athey B.D. The university of michigan honest broker: a web-based service for clinical and translational research and practice. J Am Med Inform Assoc., 16(6):784–791, 2009 Nov.-Dec.

[114] Dhir R., Patel A.A., Winters S., Bisceglia M., Swanson D., Aamodt R., and Becich M.J. A multidisciplinary approach to honest broker services for tissue banks and clinical data: a pragmatic and practical model. Cancer, 113(7):1705–1715, 2008 Oct.

[115] Oracle Corporation. Java 2 platform standard edition version 1.4.2. http://download.oracle.com/javase/1.4.2/docs/api/java/security/securerandom.html, Date accessed: April 2011.

[116] National Institute of Standards and Technology (NIST). Computer security division, computer security resource center, http://csrc.nist.gov/groups/stm/index.html, Date accessed: April 2011.

[117] Schneier B. Sha-1 broken. http://www.schneier.com/blog/archives/2005/02/sha1_broken.html, February 15, 2005.

[118] Oracle Corporation. Oracle database data warehousing guide, 11g release 1 (11.1), chapter 6, indexes. http://download.oracle.com/docs/cd/b28359_01/server.111/b28313/indexes.htm, September 2011.

[119] Oracle Corporation. Fine grained auditing. http://www.oracle.com/technetwork/database/security/index-083815.html, July 2010.

[120] International Business Machines Corporation. Ibm infosphere datastage. http://www-01.ibm.com/software/data/infosphere/datastage/requirements.html#ibm%20infosphere%20datastage85, Date accessed: September 2011.

[121] Oracle Corporation. Oracle warehouse builder. http://www.oracle.com/technetwork/developer-tools/warehouse/overview/introduction/index.html, Date accessed: September 2011.

[122] Kahmann S., Erdal B.S., Liu J., Kamal J., and Clymer B.D. Generalizable session dependent de-identification methods. In AMIA 2011 Annual Symposium, October 2011.

[123] Erdal B.S., Liu J., Key C.B., Kamal J., and Clymer B.D. Proxy pacs servers for image delivery through an information warehouse. In AMIA 2011 Annual Symposium, October 2011.

[124] National Institute of Standards and Technology (NIST). Computer security division, computer security resource center. random number generation. http://csrc.nist.gov/groups/st/toolkit/rng/index.html, Date accessed: September 2011.

[125] National Institute of Standards and Technology (NIST). Computer security division, computer security resource center. a statistical test suite for the validation of random number generators and pseudo random number generators for cryptographic applications. http://csrc.nist.gov/groups/st/toolkit/rng/documentation_software.html, Date accessed: September 2011.

[126] Maurer U.M. A universal statistical test for random bit generators. Journal of Cryptology., 5(2):89–105, 1992.

[127] Chung K.L. Elementary Probability Theory with Stochastic Processes. Springer-Verlag, New York, 1979.

[128] Malin B. Secure construction of k-unlinkable patient records from distributed providers. Artificial Intelligence in Medicine., 48(1):29–41, 2010.

[129] NLM. Unified medical language system. http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html, Date accessed: April 2011.

[130] Bodenreider O. Using umls semantics for classification purposes. In AMIA Annu Symp Proc., pages 86–90, 2000.

[131] Campbell K.E., Oliver D.E., Spackman K.A., and Shortliffe E.H. Representing thoughts, words, and things in the umls. J Am Med Inform Assoc., 5(5):421–431, 1998 Sept.-Oct.

[132] Meystre S.M., Savova G.K., Kipper-Schuler K.C., and Hurdle J.F. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform., pages 128–144, 2008.

[133] Uzuner O., Goldstein I., Luo Y., and Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc., 15(1):14–24, 2008.

[134] Uzuner O. Recognizing obesity and co-morbidities in sparse data. J Am Med Inform Assoc., 16(4):561–570, 2009.

[135] Suzuki T., Yokoi H., Fujita S., and Takabayashi K. Automatic dpc code selection from electronic medical records: text mining trial of discharge summary. Methods Inf Med., 47(6):541–548, 2008.

[136] Murff H.J., FitzHenry F., Matheny M.E., Gentry N., Kotter K.L., Crimin K., Dittus R.S., Rosen A.K., Elkin P.L., Brown S.H., and Speroff T. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA., 306(8):848–855, 2011.

[137] Uzuner O., Luo Y., and Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc., 14(5):550–563, 2007 Sept.-Oct.

[138] Oracle Corporation. Oracle business intelligence enterprise edition plus. http://www.oracle.com/technetwork/middleware/bi-enterprise-edition/overview/index.html, Date accessed: April 2011.

[139] Ding J., Liu J., and Kamal J. uquery: hipaa-compliant web query tool for retrieving patient clinical data from a data warehouse. In AMIA Annu Symp Proc., page 821, Nov 2009.

[140] Murphy S.N., Mendis M.E., Berkowitz D.A., Kohane I., and Chueh H. Integration of clinical and genetic data in the i2b2 architecture. In AMIA Annu Symp Proc., page 1040, 2006.

[141] Murphy S.N., Mendis M., Hackett K., Kuttan R., Pan W., and Phillips L., et al. Architecture of the open-source clinical research chart from informatics for integrating biology and the bedside. In AMIA Annu Symp Proc., pages 548–552, 2007.

[142] Saltz J., Oster S., Hastings S., Langella S., Kurc T., Sanchez W., Kher M., Manisundaram A., Shanbhag K., and Covitz P. cagrid: Design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics., 22(15):1910–1916, 2006.

[143] ATS Board of Directors and ERS Executive Committee. Statement on sarcoidosis. joint statement of the american thoracic society (ats), the european respiratory society (ers) and the world association of sarcoidosis and other granulomatous disorders (wasog) adopted by the ats board of directors and by the ers executive committee, february 1999. Am J Respir Crit Care Med, 160:736–55, 1999.

[144] Reid J.D. Sarcoidosis in coroners' autopsies: a critical evaluation of diagnosis and prevalence from cuyahoga county, ohio. Sarcoidosis Vasc Diffuse Lung Dis., 15:4–51, 1998.

[145] Iannuzzi M.C., Rybicki B.A., and Teirstein A.S. Sarcoidosis. N Engl J Med., 357:2153–2165, 2007.

[146] Crowley L.E., Herbert R., Moline J.M., Wallenstein S., Shukla G., Schechter C., Skloot G.S., Udasin I., Luft B.J., Harrison D., Shapiro M., Wong K., Sacks H.S., Landrigan P.J., and Teirstein A.S. Sarcoid-like granulomatous pulmonary disease in world trade center disaster responders. Am J Ind Med., 54:175–84, 2011.

[147] Sola R., Boj M., Hernandez-Flix S., and Camprubi M. Silica in oral drugs as a possible sarcoidosis-inducing antigen. Lancet., 373:1943–1944, 2009.

[148] Agrawal R. and Srikant R. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, pages 487–499, September 1994.

[149] Dunham M.H. Data Mining Introductory and Advanced Topics. Pearson Education, Inc., 2003.

[150] Witten I.H., Frank E., and Hall M.A. Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann, 2011.

[151] Milman N. and Selroos O. Pulmonary sarcoidosis in nordic countries 1950-1982. epidemiology and clinical picture. Sarcoidosis., 7:50–7, 1990.

[152] Edwards B.K., Ward E., Kohler B.A., Eheman C., Zauber A.G., Anderson R.N., Jemal A., Schymura M.J., Lansdorp-Vogelaar I., Seeff L.C., van Ballegooijen M., Goede S.L., and Ries L.A.G. Annual report to the nation on the status of cancer, 1975-2006, featuring colorectal cancer trends and impact of interventions (risk factors, screening, and treatment) to reduce future rates. Cancer., 116:544–73, 2010.

[153] Thomeer M.J., Costabel U., Rizzato G., Poletti V., and Demedts M. Comparison of registries of interstitial lung diseases in three european countries. Eur Respir J Suppl., 32:114s–118s, 2001.

[154] Rybicki B.A. and Iannuzzi M.C. Epidemiology of sarcoidosis: recent advances and future prospects. Semin Respir Crit Care Med., 28:22–35, 2007.

[155] Huang C.T., Heurich A.E., Sutton A.L., and Lyons H.A. Mortality in sarcoidosis. a changing pattern of the causes of death. Eur J Respir Dis., 62:231–8, 1981.

[156] Richie R.C. Sarcoidosis: a review. J Insur Med., 37:283–94, 2005.

[157] Swigris J.J., Olson A.L., Huie T.J., Fernandez-Perez E.R., Solomon J., Sprunger D., and Brown K.K. Sarcoidosis-related mortality in the united states from 1988 to 2007. Am J Respir Crit Care Med., 183(11):1524–30, 2011.

[158] Keir G. and Wells A.U. Assessing pulmonary disease and response to therapy: which test? Semin Respir Crit Care Med., 31:409–18, 2010.

[159] Goh N.S., Desai S.R., Veeraraghavan S., Hansell D.M., Copley S.J., Maher T.M., Corte T.J., Sander C.R., Ratoff J., Devaraj A., Bozovic G., Denton C.P., Black C.M., du Bois R.M., and Wells A.U. Interstitial lung disease in systemic sclerosis: a simple staging system. Am J Respir Crit Care Med., 177:1248–54, 2008.

[160] Psathakis K., Mermigkis D., Papatheodorou G., Loukides S., Panagou P., Polychronopoulos V., Siafakas N.M., and Bouros D. Exhaled markers of oxidative stress in idiopathic pulmonary fibrosis. Eur J Clin Invest., 36:362–7, 2006.

[161] Wells A.U. Pulmonary function tests in connective tissue disease. Semin Respir Crit Care Med., 28:379–88, 2007.

[162] Thomeer M., Grutters J.C., Wuyts W.A., Willems S., and Demedts M.G. Clinical use of biomarkers of survival in pulmonary fibrosis. Respir Res., 11:89, 2010.

[163] Martinez F.J. and Flaherty K. Pulmonary function testing in idiopathic interstitial pneumonias. Proc Am Thorac Soc., 3:315–21, 2006.

[164] Sundaram B., Chughtai A.R., and Kazerooni E.A. Multidetector high-resolution computed tomography of the lungs: protocols and applications. J Thorac Imaging., 25:125–41, 2010.

[165] Berry C.E. and Wise R.A. Interpretation of pulmonary function test: issues and controversies. Clin Rev Allergy Immunol., 37:173–180, 2009.

[166] Al-Khawari H., Athyal R.P., Al-Saeed O., Sada P.N., Al-Muthairi S., and Al-Awadhi A. Inter- and intraobserver variation between radiologists in the detection of abnormal parenchymal changes on high-resolution computed tomography. Ann Saudi Med., 30:129–33, 2010.

[167] Best A.C., Lynch A.M., Bozic C.M., Miller D., Grunwald G.K., and Lynch D.A. Quantitative ct indexes in idiopathic pulmonary fibrosis: relationship with physiologic impairment. Radiology, 228:407–14, 2003.

[168] Hartley P.G., Galvin J.R., Hunninghake G.W., Merchant J.A., Yagla S.J., Speakman S.B., and Schwartz D.A. High-resolution ct-derived measures of lung density are valid indexes of interstitial lung disease. J Appl Physiol., 76:271–77, 1994.

[169] Rienmüller R.K., Behr J., Kalender W.A., Schätzl M., Altmann I., Merin M., and Beinert T. Standardized quantitative high resolution ct in lung diseases. J Comput Assist Tomogr., 15:742–9, 1991.

[170] Jiao Y., Stillinger F.H., and Torquato S. Modeling heterogeneous materials via two-point correlation functions: basic principles. Phys. Rev. E, 76(3):031110, Sep 2007.

[171] Shaker S.B., Dirksen A., Laursen L.C., Maltbaek N., Christensen L., Sander U., Seersholm N., Skovgaard L.T., Nielsen L., and Kok-Jensen A. Short-term reproducibility of computed tomography-based lung density measurements in alpha-1 antitrypsin deficiency and smokers with emphysema. Acta Radiol., 45:424–30, 2004.

[172] Baughman R.P., Drent M., Kavuru M., Judson M.A., Costabel U., du Bois R., Albera C., Brutsche M., Davis G., Donohue J.F., Müller-Quernheim J., Schlenker-Herceg R., Flavin S., Lo K.H., Oemar B., and Barnathan E.S. Infliximab therapy in patients with chronic sarcoidosis and pulmonary involvement. Am J Respir Crit Care Med., 174:795–802, 2006.

[173] Wasfi Y.S., Rose C.S., Murphy J.R., Silveira L.J., Grutters J.C., Inoue Y., Judson M.A., and Maier L.A. A new tool to assess sarcoidosis severity. Chest., 129:1234–45, 2006.

[174] Mayo J.R. Ct evaluation of diffuse infiltrative lung disease: dose considerations and optimal technique. J Thorac Imaging., 24:252–9, 2009.

[175] Park S.C., Tan J., Wang X., Lederman D., Leader J.K., Kim S.H., and Zheng B. Computer-aided detection of early interstitial lung diseases using low-dose ct images. Phys Med Biol., 56:1139–53, 2011.
