AI-Based Integration of Imaging, Omics, and Clinical Data
Total Page:16
File Type:pdf, Size:1020Kb
European Journal of Nuclear Medicine and Molecular Imaging (2019) 46:2722–2730 https://doi.org/10.1007/s00259-019-04382-9 ORIGINAL ARTICLE Why imaging data alone is not enough: AI-based integration of imaging, omics, and clinical data Andreas Holzinger1 · Benjamin Haibe-Kains2,3 · Igor Jurisica3,4,5 Received: 12 May 2019 / Accepted: 28 May 2019 /Published online: 15 June 2019 © Springer-Verlag GmbH Germany, part of Springer Nature 2019 Abstract Artificial intelligence (AI) is currently regaining enormous interest due to the success of machine learning (ML), and in particular deep learning (DL). Image analysis, and thus radiomics, strongly benefits from this research. However, effectively and efficiently integrating diverse clinical, imaging, and molecular profile data is necessary to understand complex diseases, and to achieve accurate diagnosis in order to provide the best possible treatment. In addition to the need for sufficient computing resources, suitable algorithms, models, and data infrastructure, three important aspects are often neglected: (1) the need for multiple independent, sufficiently large and, above all, high-quality data sets; (2) the need for domain knowledge and ontologies; and (3) the requirement for multiple networks that provide relevant relationships among biological entities. While one will always get results out of high-dimensional data, all three aspects are essential to provide robust training and validation of ML models, to provide explainable hypotheses and results, and to achieve the necessary trust in AI and confidence for clinical applications. Keywords Precision medicine · Artificial intelligence · Machine learning · Decision support · Integrative computational biology · Network-based analysis · Radiomics Data explosion and the vital need extremely sensitive to poor data quality [2]. In the medi- for high-quality data cal domain, data quality (i.e., completeness, correctness) is further extended by confidentiality requirements, and the First and foremost, the biggest problem in the AI world is need for comprehensive annotation. It is amazing how bad the quality and sufficient volume of data. The most suc- the standard data sets in the medical domain are (noisy, cessful machine learning methods are data hungry [1], and sparse, wrong, biased, etc.), which was already described by Komaroff in 1979, when he stated that “the taking of a medi- This article is part of the Topical Collection on Advanced Image cal history, the performance of the physical examination, the Analyses (Radiomics and Artificial Intelligence) interpretation of laboratory tests, even the definition of dis- Igor Jurisica eases, are surprisingly inexact” [3]. Unfortunately, with the [email protected] introduction of sophisticated electronic patient record sys- tems this is often even worse today, as incorrect data escape routine checks and the errors are compounded across com- 1 Institute for Medical Informatics / Statistics, Medical University Graz, Auenbruggerplatz 2/V, putational workflows, frequently increasing clinical errors 8036 Graz, Austria [4]. Therefore, within a whole ML pipeline the aspect of 2 Princess Margaret Cancer Centre, University Health Network, data processing is of utmost importance. Consequently, for Toronto, ON, Canada AI applications to be successfully applied to the medical 3 Departments of Medical Biophysics and Computer Science, domain, an integrative machine learning approach might University of Toronto, Toronto, Canada be necessary [5], calling for the integration and fusion of 4 Krembil Research Institute, UHN, 60 Leonard Avenue, heterogeneous data sets, e.g., images, physiological data, 5KD-407, Toronto, Ontario, M5T 0S8, Canada text (non-standardized“unstructure” and structured patient 5 Institute of Neuroimmunology, Slovak Academy of Sciences, records), and diverse omics profiles (gene, microRNA, pro- Bratislava, Slovakia tein, metabolic, etc.) [6]. Eur J Nucl Med Mol Imaging (2019) 46:2722–2730 2723 Generally, in the health domain one can identify different to reproduce the results, for examples and a discussion areas of data volumes depending on the medical context in see [9]. which they have been created [7]: • Patient data from the electronic patient records (EPR), Integrative computational biology including clinical reports, mainly unstructured informa- tion in written clinical reports, but also lab test data, and Developing effective knowledge systems to support the all types of biomedical signals (ECG, EEG, EOG, etc). governance, processing, inference, analysis and interactive • Imaging data, e.g., from radiology, pathology (whole visualization of integrated omics data is critical to maximize slide images), dermatology (dermoscopy), but also the impact on translational research. The importance of sonography, etc. visualization is often underestimated, but it is the quality • Biomedical research data, including clinical trial data of the visualization that enables the experts end users to and all sort of omics data, particularly from next- understand the data in the context of their problems [10]. generation sequencing (NGS), etc. In addition to this Visualization enables more accurate and relevant modeling data, which is produced in a clinical environment, more of healthy and disease states, and in turn support precision and more data is being generated outside the clinical medicine [11]. sector, i.e., • Health business data, including management data, Integrating layers of omics data logistics, and accounting but also more and more prediction (e.g., resource planning); Understanding complex diseases (e.g., arthritis, brain • Private patient data, produced completely outside the disorders, cancer) requires computational analyses that clinical context, which is fostered by the possibilities integrate diverse layers of data—imaging data, omics of modern low-cost smartphones such as sport data, profiles, clinical data, and annotations. Such complex data wellness and data for ambient assisted living (sensor needs to be analyzed using scalable data mining, machine data) learning and statistical methods. In turn, comprehensive The US Department of Health and Human Services integration, further analysis and modeling requires detailed (HHS) created a taxonomy of health data with the following annotations and relationships among these entities (see seven dimensions: Fig. 1). Such integrated network-based analyses help create and validate explainable disease models, and enable 1. Demographics and socioeconomic data including age, improved treatments strategies and patient outcomes. gender, education, etc. Radiomics can aid this process (see Fig. 1), by pro- 2. Health status data, including disabilities, diagnoses, and viding additional features for the analysis, improved dis- symptoms ease characterization and patient stratification. Radiomics 3. Health resources data including the capacities of the enables more accurate disease classification than traditional health system, performance and operating data disease grading; for example, it significantly improves sen- 4. Health-care utilization, including data about treatment sitivity of identifying structural knee osteoarthritis [12]. and duration Imaging data can provide integrated in the ”front end” by 5. Health-care cost and expenditure data, including supporting cancer detection Cameron-2016, characterizing charges, insurance status, etc. micro-environment [13], tracking tumor biology [14], cell 6. Health-care outcomes of current and past prevention, proliferation, and blood vessel formation [15]. This can aid treatments, etc. in more precise and earlier tumor detection and sybtyping 7. Other data including omics data, but also environmental [16], more accurate staging, treatment planning, progno- exposure data such as environmental impacts, etc. sis [17], and non-invasive response monitoring [18–20]. Reproducibility is another recognized but still mostly From the machine learning perspective, this helps reducing ignored issue. Currently, there is a huge trend in the noise and heterogeneity. Radiomics can also be integrated opposite direction: most scientific contributions are judged in the ”back end” by helping quantifying disease dynamics according to their novelty—rarely for their reproducibility, and predicting response to treatment [21] through series of which would be the essential criteria of good science— “digital biopsies”. see the recent debate in Science [8]. A typical example is that the data are considered as inaccessible due to ethical From individual biomarkers to signatures restrictions. Data protection is of course an issue but there must be solutions to publish data along with the results One of the central tasks in precision medicine is the and the international scientific community should be invited identification of groups of markers for aiding diagnosis, 2724 Eur J Nucl Med Mol Imaging (2019) 46:2722–2730 Binding U - Uncharacterized Chlorprothixene Thioridazine Clozapine D - Genome Maintenance VIPR2 CHRM4 TMEM120A PLLP HTR4 CCR9 CCR8 CCR6 CCR4 CERS6 BDKRB2 NPSR1 PRNP FA2H CCR2 MGLL CHRM3 OXTR CCR1 MC4R RXFP4 WHSC2 TR2CYSL SYT1 AFRPT ATP2A2 FZD7 GPR37 GNRHR RHBDD2 ADRB2 DRD2 CALM3 ATP1B1 OAZ1 GPRC5B HTR2C PTGER4 F2RL1 LTB4R2 MID1IP1 HTR2B RXFP3 AGTR1 YIF1A CHRM2 OPRM1 CHRM5 BUD31 1SLC25A1 TTYH1 SLC3A2 GPR35 C4orf3 SYNGR2 DNAJC8 GPRIN2 YIF1B TMEM161A CCR5 HTR6 SMO ADORA2A ADORA1 HRH1 ADRA1A BDKRB1 GABBR1 OPRL1 MBP HNRNPK APLNR GPR77 A1BHSP GPR182 AAR1T TSHR Catalytic Activity HTR4 C - Cellular Fate and Organization