96 © 2017 IMIA and Schattauer GmbH

Grappling with the Future Use of Big Data for Translational Medicine and Clinical Care S. Murphy1, 2, 4, V. Castro2, K. Mandl3, 4 1 Massachusetts General Hospital Laboratory of Computer Science, Boston, MA, USA 2 Partners Healthcare Research Information Science and Computing, Somerville, MA, USA 3 Boston Children’s Hospital Computational Program, Boston, MA, USA 4 Harvard Medical School Department of Biomedical Informatics, Boston, MA, USA

Summary is surprisingly disjointed in its ability to Introduction handle imaging, waveform, genomic, and Objectives: Although patients may have a wealth of imaging, If we are to infuse science-based clinical continuous cardiac and cephalic monitoring. genomic, monitoring, and personal device data, it has yet to be practice into healthcare, we will need Much of these data fall under the label of fully integrated into clinical care. better technology platforms than the cur- “Big Data” [5]. Methods: We identify three reasons for the lack of integration. rent Electronic Medical Record Systems One defining aspect of “Big Data” is that The first is that “Big Data” is poorly managed by most Electronic (EMRS), which are not built to support the data tends to be organized in a funda- Medical Record Systems (EMRS). The data is mostly available scientific innovation. In different countries, mentally different way that cannot fit easily on “cloud-native” platforms that are outside the scope of most

the functionality and overall goals of EMRS and directly into hierarchical and relational EMRs, and even checking if such data is available on a patient can vary, but in the United States (US) the databases. Rather, much of the data arrives often must be done outside the EMRS. The second reason is primary purpose of EMRS is to bill for the as (sometimes very large) data objects. In S3 that extracting features from the Big Data that are relevant to patient visit [1]. The Epic EMRS, dominant object storing systems (such as https://aws. healthcare often requires complex machine learning algorithms, in many US Academic Healthcare Centers amazon.com/s3), there is often only minimal such as determining if a genomic variant is protein-altering. The and known for its strength of efficient pa- metadata regarding what information exists third reason is that applications that present Big Data need to be tient billing [2], is mostly closed to outside in the data. These data “objects” have their modified constantly to reflect the current state of knowledge, such software additions as part of the quest to own internal structure and often require a as instructing when to order a new set of genomic tests. In some develop the most efficient system possible. great deal of computation to extract health- cases, applications need to be updated nightly. Other EMRS in the US are perhaps best care-relevant features. Support exists for Results: A new architecture for EMRS is evolving which could summarized as vehicles for inter-clinician many tools within the cloud-native platform unite Big Data, machine learning, and clinical care through a communication [3]. This communication is that enable the analysis of the healthcare microservice-based architecture which can host applications fo- mostly facilitated through the exchange of “Big Data” such as Hadoop, CouchDB, cused on quite specific aspects of clinical care, such as managing narrative text, and therefore the information MongoDB, Matlab, R, SAS, and Docker, cancer immunotherapy. which is not coded for billing purposes is left but the paradigm of computing on data ob- Conclusion: Informatics innovation, medical research, and in the unstructured notes exchanged between jects rather than examining the data through clinical care go hand in hand as we look to infuse science-based clinicians [4]. This situation has greatly Structured Query Languages requires new practice into healthcare. Innovative methods will lead to a new reduced the appeal of using Big Data for skills and infrastructures which are differ- ecosystem of applications (Apps) interacting with healthcare Healthcare in the US. We will largely focus ent from the database-oriented approaches providers to fulfill a promise that is still to be determined. on the loss of this appeal in this report. currently employed in most EMRS. Keywords At odds with what is presented above In addition to barriers regarding training as an efficient Healthcare Enterprise In- people how to use the new tools, the tools Big Data, healthcare; decision support systems, clinical; genomics formation Technology (IT) management to process Big Data and extract health- system is data entering the system through care-relevant features must often reside on Yearb Med Inform 2017: 96-102 channels outside the Healthcare Enterprise, “Cloud-native” Platforms [6]. This is be- http://dx.doi.org/10.15265/IY-2017-020 data such as personal health data, patient cause they need massive parallel processing Published online August 18, 2017 surveys, patient self-reported outcomes, to scale their file processing environments. and patient-initiated genomic testing. Even Software services and tools on cloud-native inside the Healthcare Enterprise, the bill- platforms enable scalable provisioning of ing and communication-oriented EMRS compute resources, interoperability between

IMIA Yearbook of Medical Informatics 2017 97 Grappling with the Future Use of Big Data for Translational Medicine and Clinical Care

digital objects, indexing to promote discov- cephalogram waveform data from seizure feeds from the entire population over the past erability of digital objects, sharing of digital monitoring is commonly put into files in spe- year where people are identified in a coded objects between providers and processes, ac- cialized systems. The metadata of the files form. This presents a challenge to placing cess to and deployment of scientific analysis are not mapped to standard annotations, or boundaries on what data is shared, especial- tools and pipeline workflows, and connec- annotations of any kind, that are recognized ly regarding consented and unconsented tivity with other repositories, registries, and at the enterprise level. The system itself may patients. Generally, one must obtain patient resources. The idea of a Big Data Commons use S3 storage, but the internal structures consent to associate the internal data with has been advanced at the National Institute of the data are completely unrecognized specific patients, although in general, Big of Health (NIH) as “The NIH Commons” by a traditional healthcare enterprise query Data sets that are shared as “de-identified” (https://datascience.nih.gov/commons) system. Without a query language to make pose a significant risk for re-identification illustrating many of these differences with a semantic link between systems, much of due to the incredibly detailed data which traditional computing environments. this data is left on the healthcare analytics’ may be available per patient [8]. This makes Understanding the use of Big Data within cutting room floor. Big Data a challenge for integration into the the policy of healthcare remains a work-in- A third integration hurdle is the necessi- healthcare enterprise. Most EMRS are not progress. The cloud-native platform does ty to adopt new policies that deal with the prepared for the challenges of untangling not need to reside in the public cloud, and sharing of Big Data, which must address at the patient attributions in the data, nor similar computational infrastructure can be least four issues: 1) what data to focus upon placing boundaries around the granular data established in private and public clouds such sharing; 2) how to place boundaries on the elements inside the Big Data. that parts of the computing infrastructure can way data is shared; 3) how to maintain the Although the cost of keeping Big move from private to public clouds as need- cost of sharing data; and 4) how to provide Data in the public cloud is comparable to ed, especially when considering concerns adequate motivation for stakeholders to the cost of local S3 storage, the cost of of data privacy. support the sharing of the data. transferring data can be very high (http:// To focus on what data to share, it is im- www.networkworld.com/article/3164444/ portant to understand the use cases behind cloud-computing/how-to-calculate-the-

selecting the data to be shared. For example, true-cost-of-migrating-to-the-cloud.html). Organizing an Approach to the initial focus in a US healthcare data This reflects the general principle that network named PCORNet was to include costs are minimized by operating on the Big Data in Healthcare standard coded healthcare data only, but data locally, and extracting the results of Big Data faces numerous integration hurdles, when a needs’ assessment was later under- an operation, rather than transferring the both with the EMRS data and with the other taken among the network sites, most needs entire Big Data set to a remote site for an Big Data. This is due to several factors, in- were found to be for more specific data operation to be performed. For example, cluding the relatively unstructured approach such as cancer registry data, rare laboratory although millions of seconds of waveform to data organization in the cloud-native en- tests, and personal health data [7]. This data may be available, only particular vironment. The approach of S3 file storage emphasized that, although “classic” data in episodes of cardiac arrhythmia or brain is to allow each file to be described with the EMRS may be easier to handle than Big epileptiform discharges may be desired. metadata that helps organize the file system, Data, use cases are guiding the need to in- Supplying metadata can direct analysis pro- but the internal data structures of the files clude Big Data, and because this integration grams operating at the location of the data cannot be directly queried. Various sets will take a significant investment, it will be to return the most useful data if this func- of files commonly have different internal important to properly assess which Big Data tion is built into the local processing units. structures, defined by a schema that exists should be prioritized. Therefore, an architecture that minimizes in a separate file with a schema definition Most high priority data will need to be data transfer when using Big Data adopts a language. In turn, the files can be read and understood at the patient level, that is, data set of micro-services that allow operations their content manipulated by the robust, will need to be attributed to specific pa- on the data in-place [9]. One must consider parallel processing computational environ- tients. However, one of the most significant that, otherwise, many micro-transactions on ment that exists in cloud-native platforms. challenges for metadata descriptions of Big very different kinds of Big Data would need However, this paradigm is not like the way Data is that Big Data often arrives in large to be managed, requiring a sophisticated data is queried in EMRS database relational “blocks” without such attribution. For exam- system for accepting the updating transac- structures, and this means data integration ple, environmental data from area sensors tions, reconciliation, and failure recovery will need an innovative approach. give detailed data regarding the attributes if one were to attempt to aggregate the Big Another integration hurdle for Big Data attached to each sensor node, but there is no Data centrally. is that the sources are often in data silos that direct link to a patient’s exposure. Even data Providing the motivation for stakeholders are not directly semantically interoperable, attributed to people is in large blocks, like to open their Big Data repositories touches making queries across the systems nearly the registry of data from people who were many disciplines, including the cost of impossible. For example, patient electroen- participants in a study, or a file of Twitter preparation, legal ownership, and attribution.

IMIA Yearbook of Medical Informatics 2017 98 Murphy et al.

Sharing Big Data can make data provenance be returned from Big Data repositories. It platforms is through an ontology-driven and credit attribution to the creators of the is less desirable to unnecessarily obtain an approach whereby individual ontology items data difficult to manage, and several propos- entire data set in these cases, which is often specify where data resides throughout the als have been made regarding attribution for the case with Big Data file objects, exposing system. Once the ontology is used to locate “block” data sharing, including the FAIR to unnecessary scrutiny what can often be the specified data in the system, services then (Findability, Accessibility, Interoperability, an entire block of patients. Additionally, a direct queries against views of database fact and Reusability) paradigm [10]. The FAIR micro-service layer allows a policy to be put and dimension tables [11, 12] derived from paradigm has been formulated for web into a place that reflects the sensitivity and the Big Data S3 Repositories. The services services to automatically find and use Big ownership of the data. The local control of combine the results into a single data matrix Data. An example of a Big Data index that the data in the Big Data repositories allows (table) that appears as though it came from a implements FAIR principles is the NIH updates to be managed locally and exposure traditional relational database, but represents Big Data bioCADDIE Application (https:// of new types of data can occur naturally with the results from Big Data Repositories. This biocaddie.org/). data policies that are controlled at each site. is achieved by incorporating patient features An Enterprise Big Data Commons could A formulation of an Enterprise Big Data into a common tabular format regardless of be used to integrate the Big Data for sec- Integration was undertaken in the PIC-SURE the original source. The tabular format can ondary use into translational medicine and (Patient-centered Information Commons: be used for further analysis. clinical care (see Figure 1). An Enterprise Standardized Unification of Research El- Two innovations have allowed this inte- Big Data Commons is created in a hospital ements) Big Data to Knowledge (BD2K) gration of Big Data and the satisfaction of to accomplish two essential functions. The Center of Excellence, sponsored by the many of the requirements described above. first one is to employ an indexing strategy National Institutes of Health, where a dis- The first is a web service layer that allows a that allows data to be searched, preferably tributed query system ties together several system such as Informatics for Integrating in the location it exists rather than copied different platforms commonly used for pa- Biology and the Bedside (i2b2) to act as a to a new location. The second function is tient-oriented research into a single system supervisory layer for querying the Big Data to make the data available through web which can be queried to return integrated if the Big Data can be transformed so that at services that allow the data on a specific patient counts and data (http://pic-sure.org). least some of its features can be placed into patient to be found, and for specific data to The method of unification for the Big Data a patient-centric observation-fact table [12].

Fig. 1 shows the essential parts of integrating Big Data into Clinical Care. Starting at the left, the figure illustrates independent Big Data repositories of Imaging, Genomics, and Person Health data. Each repository exists as a compendium of files, often in an S3 file repository. The files are formulated with software into a standardized index of observations and facts, such as that which exists in an i2b2 Star Schema. It is the standardized index which is then queried through the EMRS Clinical System and integrated into queries against a similar standardized index created from the EMRS data. This query system for Big Data allows a full set of patient features from the integrated systems to be obtained and subsequently creates and optimized further features through machine learning. As described in the text, this machine learning takes advantage of the Big Data to overcome biases and errors in the clinical EMRS data. Accurate patient phenotypes are integrated into “Apps” which are able to rapidly adapt to the new features available in the Big Data, presenting a complete picture of the patient for decision support and clinical care.

IMIA Yearbook of Medical Informatics 2017 99 Grappling with the Future Use of Big Data for Translational Medicine and Clinical Care

The fact table in each Big Data repository biases, but this is different from laboratory notypic information to quantify subtle dif- serves to associate many of the features of sampling biases. Other common sources of ference in disease populations. Linking Big the data with a specific patient, and soft- secondary healthcare data, such as the coded Data EMRS datasets to biological samples ware can distribute queries to the different data obtained from billing, are also inherent- and genomic information in Biobanks is patient-centric data repositories and achieve ly biased due, this time, to oversampling of an emerging approach for the exploration a complex level of inter-system integration. the sickest patients for concurrent diseases of genetic causes of both common and rare The second innovation is an ontology [16]. The billing diagnoses are also biased diseases [19]. Relying solely on ICD-9/10 specification for every feature that exists in from the coding practices within the business codes for defining disease cohorts provides one of the Big Data repositories throughout of healthcare, which optimizes the codes that inadequate specificity for genetic studies the enterprise. The ontologies can be used guarantee the greatest reimbursement [17]. [20]. Machine learning approaches have to formulate patient-centric queries such as For advanced analytics to be performed upon been utilized to develop high-throughput previously described [12]. The ontologies of these data, one needs a strategy to eliminate phenotyping algorithms for characterizing the features are published and reconciled at the inaccuracies found in commonly record- patients with treatment-resistant depres- a central site so that the terms can be used in ed healthcare features. sion, rheumatoid arthritis, and cerebral a query. Likewise, the patients are published Harnessing the power of Big Data and aneurysms, among many others [21-23]. to a central honest broker and de-duplicated integrating it with the data from the EMRS Training and validation of algorithms is into a set of matched identifiers across the requires methods and tools to identify im- done through clinician chart review of the repositories. The honest broker publishes a portant features that can be transformed into medical record as well as with in-person table of matches and match probabilities, knowledge to inform clinical decision-mak- clinical trials [24]. which does not need to contain explicit ing. The EMRS features heterogeneous data Many important features in the EMRS are HIPAA identifiers if the sites can exchange sources combined for billing and clinical “locked” into narrative text [25)]. Natural coded patient identifiers. Only the honest care. Because the data are not collected Language Processing (NLP) tools must be broker controls the matching algorithm and primarily for science-based clinical care, used to leverage the information stored in the Protected Health Information used for the veracity and reliability of the data are clinical notes and reports, and patient-related the matching. Once all of this is in place, a inadequate for analysis in its raw form. On documents, such as blogs and social media. query can be formulated at the central site the other hand, the scale and diversity of the Researchers utilize the UMLS Metathe- and relevant sub-queries distributed to the data provide opportunities for developing saurus to map terms that are semantically Big Data repositories. Each system then runs machine learning algorithms that combine equivalent and use public reference sources the query against its fact table(s), returning multiple data types to increase reliability. such as Wikipedia and Medscape to identify the set of patient features or patients match- Below, we present three main use cases for important text features associated with a ing the features as specified by the ontology identifying important features in Big Data disease [26]. terms in the query. sources that can power clinical decision Methods for matching controls within support tools and help clinicians make sense hospital-based populations are also im- of the deluge of data to help care for patients. proving classification accuracy among comparison cohorts [16]. The portability of Harnessing the Power of these machine learning algorithms across populations and EMRS is an open research Big Data to Find Important Phenotyping question with considerable implications The diversity of Big Data provides an as EMRS are joined together via national Patient Features opportunity to characterize complex clinical research networks [27, 28]. The unification of patient data from re- phenotypes used for discovering genetic A collinear application of phenotyping is positories of genomic data, imaging data, associations with disease. A decade ago, cohort identification for clinical trial recruit- waveform data, and patient reported data genome-wide association (GWAS) studies ment. This use case focuses more on the sen- can provide value much greater than any focused on a binary phenotype (presence sitivity of machine learning algorithms since single source of data because some of the of absence of a disease) for identifying clinical trials often require large numbers of inherent biases in each source of data can be pathogenic variants. However, scientists patients screened to meet enrollment targets. overcome when these same biases are absent have subsequently discovered that most Methods have been developed to match the in alternative sources [13-15]. For example, diseases are not caused by a single gene EMRS data to trial inclusion and exclusion the lack of time dependence for germline (Mendelian) but are polygenic with many criteria, and real-time alerting of eligible genomic data allows it to overcome the variants conferring small amounts of risk. subjects within the EMRS interface [29, 30]. sampling bias present in most clinical lab- Complex genetic traits thus require more International efforts are underway aiming oratory data (when tests are taken at a time power to detect smaller effects [18]. This to make clinical trial participants more rep- when patients are believed to be sickest). Of power is achieved through both larger resentative of the general population using course, genomic data comes with its own cohorts and richer and more reliable phe- networks of EMRS for recruitment [31, 32].

IMIA Yearbook of Medical Informatics 2017 100 Murphy et al.

Hypothesis Generation ing improved outcomes through Big Data ingful Use Stage 3 and the 21st Century Cures analytics and clinical decision support after Act require that certified health information Big Data techniques provide unique opportu- surgical interventions and the early detection technologies have Application Programming nities to conduct the unsupervised analysis of of sepsis [45, 46]. Interfaces (API) to bring the full power of the data to identify novel features correlated The increasingly complex health care the Web to patient interactions, including to diseases and outcomes. These approaches environment presents clinicians with an external services and data. can help elucidate common biological path- amount of data that in some cases exceeds In healthcare, APIs are opening the clin- ways among diseases. For example, clusters a person’s ability to organize and make ical encounter to third-party IT innovation of comorbidities have been used to identify sense of the data. These growing demands and redefining clinical interoperability in population subtypes of autism, schizophre- beg for clinical decision support tools that terms of substitutability -- apps that can be nia, and inflammatory bowel disease using help clinicians make sense of all the data. added to or deleted from EMRS as easily EMRS data [33, 34]. Unsupervised methods These support tools will rely on aggregated as from a smartphone [48]. In 2009, Mandl have been developed to identify new indi- data from heterogeneous data sources in a and Kohane proposed that the EMRS should cations for existing medications – so called common information model, which can be be re-imagined as “iPhone-like” platforms drug repurposing [35] – as well as to detect used to train and deploy machine learning supporting a selection of ‘substitutable’ new adverse drug interactions that may put algorithms to identify patient cohorts and modular third-party applications (Apps). patients at risk [36]. In the area of health ser- predict future outcomes. The next section Substitutable apps connected to the EMRS vices utilization, unsupervised approaches describes a new breed of applications in bring the full power of the Web to patient can help detect patterns in the delivery of healthcare, the focused “App” which by us- interactions, including external services care that can inform future machine learning ing innovative visualization techniques can and data [49]. Subsequently, the SMART algorithm development [37]. Unsupervised present this information in a clinically-rel- Health IT project was funded by the Of- approaches have been criticized as difficult evant way and unlock the value of EMRS fice of the National Coordinator of Health to reproduce and filled with spurious asso- utilization of Big Data. Information Technology (ONC) under the ciations [38]. Indeed, these results must be Strategic Health IT Advanced Research considered as only the first steps in identi- Projects (SHARP) program as a part of the fying an important question or hypothesis, HITECH Act. which can then be confirmed with controlled Technically, SMART leverages Health observational and interventional studies. Apps to Connect the Level Seven’s (HL7) Fast Health Interop- Healthcare Big Data erability Resources (FHIR) to help stan- Commons to Clinical Care dardize communication between apps and Prediction – Clinical Decision the EMRS, several health care vocabularies To complete the infusion of research and to ensure common nomenclature (includ- Support innovation into healthcare outside the con- ing RxNorm for medications, LOINC for Accurate classification of disease cohorts is a straints of the EMRS, a new ecosystem of laboratory test, SNOMED CT for clinical precursor for identifying predictive features “Apps” is being developed that may provide terminology), and a universal web standard in Big Data that could lead to interventions a way to bridge Big Data into the EMRS (OAuth 2) for authorizing the apps to access that prevent disease, hospital re-admissions, workflow and therefore allow its translation health data. Together, these technologies and adverse events associated with drugs into science-based clinical care. Historically, form a robust apps API. [39]. Naïve Bayes models have been used to the point of care has been a walled garden. By defining an API that consistently assign risk scores based on a combination of EMRS principally displays to clinicians presents well-specified data, SMART Health features for the prediction of future suicide the information they entered previously, IT (a) enables purchasers, users, and admin- attempts and domestic abuse in children but not the wide range of data and Web- istrators of platform-based systems to be [40, 41]. Other modelling approaches have based services that could and should drive able to install and subsequently substitute focused of quantifying the cognitive burden cost-efficient care and decision-making [47]. apps from different vendors without soft- of patients’ medications to predict future These barriers have limited the health care ware programming and (b) creates a broad falls and hospital re-admissions [42, 43]. encounter from taking advantage of features market for app developers across multiple In the US, the shift from volume to val- derived from Big Data sources. systems, including the EMRS, patient-facing ue-based payment models creates incentives But now there is an opportunity to expose apps, and health information exchanges. for providers to promote good outcomes. the point of care to third party apps that can Fostering third-party apps creates a market These improved outcomes include prevent- bridge clinical activities with Big Data. Apps where innovations compete with each other ing re-admissions and emergency room are a particularly good fit with Big Data for purchase and use by providers (and pa- visits and overall reduction in costs incurred because of the ability to compute upon the tients), thus reducing dependency on updates by high-risk patients [44]. Initial outcome variety of features found in Big Data in a and specific functions made by an EMRS studies have shown some success in achiev- widely distributable application. Both Mean- vendor [50]. Recently the ONC funded an

IMIA Yearbook of Medical Informatics 2017 101 Grappling with the Future Use of Big Data for Translational Medicine and Clinical Care

Apps Gallery [51] for end users—clinicians, structures. Integrating big data into clinical References researchers, and patients at home to locate systems requires web services organized 1. Haneuse S, Daniels M. A General Framework for apps that can be connected to the EMRS via around a common information model to Considering Selection Bias in EHR-Based Stud- SMART and FHIR. reach out to the Big Data Repositories as ies: What Data Are Observed and Why? EGEMS These apps not only enable local custom- they exist in cloud-native infrastructures. (Wash DC) 2016;4(1):1203. ization of the EMRS to clinical tasks, but Second, features that are important to 2. Johnson RJ. A Comprehensive Review of an System Soon to Assume also allow the connection between the EMRS healthcare often need to be computed from, Market Ascendancy: EPIC. J Healthc Commun and external data and services. To advance rather than being readily available from, the 2016;1(4). genomic medicine, traditional EMRS cannot Big Data. Finally, until the development 3. Ash JS, Bates DW. Factors and Forces Affect- incorporate, for example, whole exome se- of SMART App standards, there were few ing EHR System Adoption: Report of a 2004 quences into their data core. But SMART en- ways to present features of Big Data into the ACMI discussion. J Am Med Inform Assoc 2005;12(1):8-12. ables the EMRS to readily query an external patient-clinician encounter. 4. Johnson SB, Friedman C. Integrating Data From FHIR genomics server to provide decision What is essential in representing Big Natural Language Processing into a Clinical In- support to a treating clinician [52, 53]. An Data is a permissive data structure that can formation System. Proc AMIA Annu Fall Symp example is the Precision Cancer Medicine represent features, including computed 1996:537-41. (PCM) app, designed to present patients’ features, of the data which can dynamically 5. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. Big Data: Astronomical genomic test results to oncologists in real expand using an ontology driven-approach. or Genomical? PLoS Biol 2015;13(7):e1002195. time as a component of clinical practice, This model can be linked to indexing 6. Stine M. Migrating to Cloud-Native Application as well as provide links to external knowl- and data retrieval paradigms through web Architectures: OReily; 2015. edge bases. Because the app was developed services. With a dynamically expandable 7. Berchuk A, Kahn M, Rusincovitc S, Meeker D, against the SMART API, even though the ontology-driven data model, new types of Murphy S, Bhosale R, et al. Editor Assessment and Planning for the PCORnet Common Data Model. initial deployment was at Vanderbilt, it can features of Big Data can be accommodated. AMIA 2017 Joint Summits on Translational Sci- easily be deployed to other EMRS. Local groups can be involved in creating ence; 2017. An emerging feature of the SMART specialized data sets. Different standards 8. Cimino JJ. The False Security of Blind Dates: ecosystem is the adoption of a standardized, can be accommodated as agreed upon by Chrononymization’s Lack of Impact on Data service-oriented architecture (SOA) for specialized groups. Various levels of data Privacy of Laboratory Data. Appl Clin Inform 2012;3(4):392-403. clinical decision support (CDS)—SMART granularity can be accommodated, so that 9. Knowledge Representation for Health Care, HEC CDS Hooks [54]. This approach separates the data presentation can more precisely 2016 International Joint Workshop. New York, NY: CDS rules from the EMRS itself. Instead, the adhere to privacy needs to allow a more Springer Berlin Heidelberg; 2017. actionable knowledge is hosted on a rules en- targeted answer for what a clinician “needs 10. Wilkinson MD, Dumontier M, Aalbersberg IJ, to know” rather than receiving large data Appleton G, Axton M, Baak A, et al. The FAIR gine made accessible by a third party. EMRS Guiding Principles for Scientific Data Manage- vendors, including Cerner and Epic, are sets with multiple patients which is the ment and Stewardship. Sci Data 2016;3:160018. embedding hooks, or triggers in the EMRS, hallmark of most Big Data. The ability to 11. Kimball R. The Data Warehouse Toolkit : Practical to launch third-party decision services and lose the data chain of custody is minimized Techniques for Building Dimensional Data Ware- and attribution can be specifically assigned houses. New York: John Wiley & Sons; 1996. SMART apps after key events—for exam- 12. Murphy SN, Weber G, Mendis M, Gainer V, Chueh ple, medication prescribing. This approach to data holders. HC, Churchill S, et al. Serving the Enterprise and is highly promising for bringing predictive Apps that will run inside and outside the Beyond with Informatics for Integrating Biology analytics to the point of care. The computa- EMRS can provide support for complex de- and the Bedside (i2b2). J Am Med Inform Assoc tion of risk scores in real time (e.g. risk of cisions and workflows that involve genomics, 2010;17(2):124-30. 13. Kruse CS, Goswamy R, Raval Y, Marawi S. Chal- readmission after discharge for congestive imaging, and personal health repositories. lenges and Opportunities of Big Data in Health heart failure) or firing of standardized rules Accurate phenotyping can become a routine Care: A Systematic Review. JMIR Med Inform (e.g. immunization recommendations from part of clinical care. An infrastructure based 2016;4(4):e38. the Advisory Committee on Immunization on the SMART API will allow an ecosystem 14. Belle A, Thiagarajan R, Soroushmehr SM, Navidi F, Beard DA, Najarian K. Big Data Analytics in Practices) can occur at a single point of of apps to be shared across healthcare insti- Healthcare. Biomed Res Int 2015;2015:370194. control, but influence care everywhere. tutions using web service standards such as 15. Brookhart MA, Sturmer T, Glynn RJ, Rassen J, FHIR. This provides a vision for a new type Schneeweiss S. Confounding Control in Health- of EMRS, which is expected to play out over care Database Research: Challenges and Potential Approaches. Med Care 2010;48(6 Suppl):S114-20. the next five to ten years. 16. Castro VM, Apperson WK, Gainer VS, Anan- thakrishnan AN, Goodson AP, Wang TD, et al. Conclusion Acknowledgments Evaluation of Matched Control Algorithms in Infusing Big Data into the healthcare en- Thanks to Christopher Herrick for devel- EHR-based Phenotyping Studies: a Case Study of Inflammatory Bowel Disease Comorbidities. J counter has several barriers. First, the data opment of figure 1. Funding from NIH Biomed Inform 2014;52:105-11. are often positioned outside the EMRS with awards U54 HG007963 and RO1 HG009174 17. O’Malley KJ, Cook KF, Price MD, Wildes KR, their own distinct semantic and technical contributed to these concepts. Hurdle JF, Ashton CM. Measuring diagnoses:

IMIA Yearbook of Medical Informatics 2017 102 Murphy et al.

ICD Code Accuracy. Health Serv Res 2005;40(5 Seventh International Conference on Semantic Hospital Discharge Notes Is Associated with Read- Pt 2):1620-39. Computing (ICSC). IEEE: 2013. mission and Mortality Risk: An Electronic Health 18. Solovieff N, Cotsapas C, Lee PH, Purcell SM, 30. Ennis C, Snyder D, Ainsworth T, Stacy M, Sand- Record Study. PLoS One 2015;10(8):e0136341. Smoller JW. Pleiotropy in Complex Traits: erson I, editors. Utilization of the EPIC Electronic 43. Castro VM, McCoy TH, Cagan A, Rosenfield HR, Challenges and Strategies. Nat Rev Genet Health Record System for Clinical Trials Manage- Murphy SN, Churchill SE, et al. Stratification of 2013;14(7):483-95. ment at Duke University. 2014 IEEE International Risk for Hospital Admissions for Injury Related 19. Gainer VS, Cagan A, Castro VM, Duey S, Ghosh B, Conference on Healthcare Informatics (ICHI). to Fall: Cohort Study. BMJ 2014;349:g5863. Goodson AP, et al. The Biobank Portal for Partners IEEE: 2014. 44. Burwell SM. Setting Value-based Payment Goals-- : A Query Tool for Working 31. Geifman N, Butte AJ. Do Cancer Clinical Trial HHS Efforts to Improve U.S. Health Care. N Engl with Consented Biobank Samples, Genotypes, and Populations Truly Represent Cancer Patients? J Med 2015;372(10):897-9. Phenotypes Using i2b2. J Pers Med 2016;6(1). A Comparison of Open Clinical Trials to the 45. Evans RS, Benuzillo J, Horne BD, Lloyd JF, Brad- 20. Sinnott JA, Dai W, Liao KP, Shaw SY, Anan- Cancer Genome Atlas. Pac Symp Biocomput shaw A, Budge D, et al. Automated Identification thakrishnan AN, Gainer VS, et al. Improving the 2016;21:309-20. and Predictive Tools to Help Identify High-risk Power of Genetic Association Tests with Imperfect 32. Porter M, Ramaswamy B, Beisler K, Neki P, Single Heart Failure Patients: Pilot Evaluation. J Am Med Phenotype Derived from Electronic Medical Re- N, Thomas J, et al. A Comprehensive Program for Inform Assoc 2016;23(5):872-8. cords. Hum Genet 2014;133(11):1369-82. the Enhancement of Accrual to Clinical Trials. Ann 46. Erskine AR, Karunakaran B, Slotkin JR, Fein- 21. Perlis RH, Iosifescu DV, Castro VM, Murphy Surg Oncol 2016;23(7):2146-52. berg DT. Harvard Business Review [Internet] SN, Gainer VS, Minnier J, et al. Using Electronic 33. Doshi-Velez F, Ge Y, Kohane I. Comorbidity Clus- 2016. Available from: https://hbr.org/2016/12/ Medical Records to Enable Large-Scale Studies ters in Autism Spectrum Disorders: an Electronic how-geisinger-health-system-uses-big-data-to- in Psychiatry: Treatment Resistant Depression as Health Record Time-Series Analysis. save-lives. a Model. Psychol Med 2012;42(1):41-50. 2014;133(1):e54-63. 47. Weber GM, Mandl KD, Kohane IS. Finding the 22. Liao KP, Cai T, Savova GK, Murphy SN, Karlson 34. Ananthakrishnan AN, Gainer VS, Cai T, Perez Missing Link for Big Biomedical Data. JAMA EW, Ananthakrishnan AN, et al. Development of RG, Cheng SC, Savova G, et al. Similar Risk of 2014;311(24):2479-80. Phenotype Algorithms using Electronic Medical Depression and Anxiety Following Surgery or 48. Mandl KD, Mandel JC, Kohane IS. Driving Inno- Records and Incorporating Natural Language Hospitalization for Crohn’s Disease and Ulcerative vation in Health Systems through an Apps-Based Processing. BMJ 2015;350:h1885. Colitis. Am J Gastroenterol 2013;108(4):594-601. Information Economy. Cell Syst 2015;1(1):8-13. 23. Castro VM, Dligach D, Finan S, Yu S, Can 35. Dang T-T, Ouankhamchan P, Ho T-B, editors. De- 49. Mandl KD, Kohane IS. No Small Change for A, Abd-El-Barr M, et al. Large-scale Identi- tection of New Drug Indications from Electronic the Health Information Economy. N Engl J Med fication of Patients with Cerebral Aneurysms Medical Records. 2016 IEEE RIVF International 2009;360(13):1278-81. Using Natural Language Processing. Neurology Conference on Computing & Communication 50. Bloomfield RA, Jr., Polo-Wood F, Mandel JC, 2017;88(2):164-8. Technologies, Research, Innovation, and Vision Mandl KD. Opening the Duke Electronic Health

24. Castro VM, Minnier J, Murphy SN, Kohane for the Future (RIVF). IEEE: 2016. Record to Apps: Implementing SMART on FHIR. I, Churchill SE, Gainer V, et al. Validation of 36. Tatonetti NP, Denny JC, Murphy SN, Fernald GH, Int J Med Inform 2017;99:1-10. Electronic Health Record Phenotyping of Bipolar Krishnan G, Castro V, et al. Detecting Drug Inter- 51. SMART Apps Gallery 2017 [Available from: Disorder Cases and Controls. Am J Psychiatry actions from Adverse-Event Reports: Interaction https://gallery.smarthealthit.org/. 2015;172(4):363-72. Between Paroxetine and Pravastatin Increases 52. Warner JL, Rioth MJ, Mandl KD, Mandel JC, 25. Hripcsak G, Albers DJ. Next-Generation Pheno- Blood Glucose Levels. Clin Pharmacol Ther Kreda DA, Kohane IS, et al. SMART Precision typing of Electronic Health Records. J Am Med 2011;90(1):133-42. Cancer Medicine: a FHIR-based App to Provide Inform Assoc 2013;20(1):117-21. 37. Weber GM, Kohane IS. Extracting Physician Genomic Information at the Point of Care. J Am 26. Yu S, Liao KP, Shaw SY, Gainer VS, Churchill SE, Group Intelligence from Electronic Health Records Med Inform Assoc 2016;23(4):701-10. Szolovits P, et al. Toward High-Throughput Phe- to Support Evidence Based Medicine. PLoS One 53. Alterovitz G, Warner J, Zhang P, Chen Y, Ull- notyping: Unbiased Automated Feature Extraction 2013;8(5):e64933. man-Cullere M, Kreda D, et al. SMART on and Selection from Knowledge Sources. J Am Med 38. Open Science C. PSYCHOLOGY. Estimating the FHIR Genomics: Facilitating Standardized Inform Assoc 2015;22(5):993-1000. Reproducibility of Psychological Science. Science Clinico-genomic Apps. J Am Med Inform Assoc 27. Mandl KD, Kohane IS, McFadden D, Weber GM, 2015;349(6251):aac4716. 2015;22(6):1173-8. Natter M, Mandel J, et al. Scalable Collaborative 39. Walsh C, Hripcsak G. The Effects of Data Sources, 54. Berman JJ. Concept-match medical data scrubbing. Infrastructure for a Learning Healthcare System Cohort Selection, and Outcome Definition on a How Pathology Text can be used in Research. Arch (SCILHS): Architecture. J Am Med Inform Assoc Predictive Model of Risk of Thirty-day Hospital Pathol Lab Med 2003;127(6):680-6. 2014;21(4):615-20. Readmissions. J Biomed Inform. 2014;52:418-26. 28. Weber GM, Murphy SN, McMurry AJ, Macfad- 40. Reis BY, Kohane IS, Mandl KD. Longitudinal His- den D, Nigrin DJ, Churchill S, et al. The Shared tories as Predictors of Future Diagnoses of Domes- Health Research Information Network (SHRINE): tic Abuse: Modelling Study. BMJ 2009;339:b3677. Correspondence to: a Prototype Federated Query Tool for Clinical 41. Barak-Corren Y, Castro VM, Javitt S, Hoffnagle Shawn Murphy MD, Ph.D. Data Repositories. J Am Med Inform Assoc AG, Dai Y, Perlis RH, et al. Predicting Sui- HMS Professor of Neurology and Partners’ Healthcare 2009;16(5):624-30. cidal Behavior From Longitudinal Electronic Chief Research Information Officer 29. Goodwin T, Harabagiu SM, editors. Automatic Health Records. Am J Psychiatry 2016:appia- Laboratory of Computer Science Generation of a Qualified Medical Knowledge jp201616010077. 50 Staniford Street, 7th floor Graph and its Usage for Retrieving Patient Cohorts 42. McCoy TH, Castro VM, Cagan A, Roberson AM, Boston, MA 02114, USA from Electronic Medical Records. 2013 IEEE Kohane IS, Perlis RH. Sentiment Measured in E-mail: [email protected]

IMIA Yearbook of Medical Informatics 2017