Imperial College of Science, Technology and Medicine Department of Computing

eTRIKS Analytical Environment: A Practical Platform for Medical Big Data Analysis

Axel Oehmichen

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London

December 2018

Abstract

Personalised medicine and translational research have become sciences driven by Big Data. Healthcare and medical research are generating more and more complex data, encompassing clinical investigations, ’omics, imaging, pharmacokinetics, Next Generation Sequencing and beyond. In addition to traditional collection methods, inexpensive and ubiquitous information-sensing IoT devices such as mobile devices, smart sensors, cameras or connected medical devices have created a deluge of data that research institutes and hospitals struggle to deal with. While the collection of data is greatly accelerating, improving patient care by developing personalised therapies and new drugs depends increasingly on an organization’s ability to rapidly and intelligently leverage complex molecular and clinical data from a variety of large-scale heterogeneous data sources. As a result, the analysis of these datasets has become increasingly computationally expensive and has laid bare the limitations of current systems. From the patient perspective, the advent of electronic medical records, coupled with so much personal data being collected, has raised concerns about privacy. Many countries have introduced laws to protect people’s privacy; however, many of these laws have proven to be less effective in practice. Therefore, along with the capacity to process this enormous amount of medical data, the addition of privacy preserving features to protect patients’ privacy has become a necessity.

In this thesis, our first contribution is the development of a new platform called the eTRIKS Analytical Environment (eAE), built to answer these needs of analysing and exploring massive amounts of medical data in a privacy preserving fashion, with the constraint of enabling the broadest audience, ranging from medical doctors to advanced coders, to easily and intuitively exploit this new resource. We will present the use of location data in the context of public health research, the work done in the context of data privacy for location data and the extension of the eAE to support privacy preserving analytics. Our second contribution is the implementation of new workflows for tranSMART that leverage the eAE and the support of novel life science approaches for feature extraction using deep learning models in the context of sleep research. Finally, we demonstrate the universality and extensibility of the architecture to other research domains by proposing a model aiming at the identification of relevant features for characterizing political deception on Twitter.

Copyright Declaration

The copyright of this thesis rests with the author. Unless otherwise indicated, its contents are licensed under a Creative Commons Attribution-NonCommercial 4.0 International Licence (CC BY-NC).

Under this licence, you may copy and redistribute the material in any medium or format. You may also create and distribute modified versions of the work. This is on the condition that: you credit the author and do not use it, or any derivative works, for a commercial purpose.

When reusing or sharing this work, ensure you make the licence terms clear to others by naming the licence and linking to the licence text. Where a work has been adapted, you should indicate that the work has been changed and describe those changes.

Please seek permission from the copyright holder for uses of this work that are not included in this licence or permitted under UK Copyright Law.

Acknowledgements

I would like to take this opportunity to express my thanks to all of those who have always been by my side and supported me through this adventure.

Firstly, I must thank my supervisor, Professor Yi-ke Guo, without whom none of this would have been possible. I am deeply grateful for his professional guidance and for sharing his wisdom.

I would like to give a special thanks to Florian Guitton who has been a close collaborator and friend from whom I have learned and shared so much.

I am thankful to Dr Heinis and Dr de Montjoye for their invaluable support and guidance.

My thanks also go to all my friends and colleagues at Imperial College London: Diana O’Malley, Kai Sun, Miguel Molina-Solana, Shubham Jain, Arnaud Tournier, Florimond Houssiau, Akara Supratak, Ioannis Pandis, Lei Nie, Hao Dong, Paul Agapow, Susan Mulcahy, Juan Gómez-Romero, Jean Grizet, Kevin Hua, Julio Amador Díaz López, Pierre Richemond, Ali Farzaneh, David Akroyd, Shicai Wang, Chao Wu, Bertan Kavuncu and Ibrahim Emam.

I would like to thank Cédric Wahl who encouraged me to follow this path.

I would like to express my gratitude to the eTRIKS and OPAL projects for supporting this work.

Finally, I would like to give my deepest thanks to Cécile and my mother for their constant support, patience, love and encouragement.

Dedication

To my mother

Vi Veri Veniversum Vivus Vici

Contents

Abstract i

Copyright Declaration iii

Acknowledgements v

1 Introduction 1

1.1 Motivation and objectives ...... 1

1.2 Contributions ...... 2

1.3 Impact and adoption of the research ...... 3

1.4 Thesis organisation ...... 4

1.5 Statement of Originality ...... 5

1.6 Publications ...... 5

2 Background 11

2.1 Towards large scale data analysis in Life Science ...... 11

2.1.1 A deluge of data ...... 12

2.1.2 Moving away from a pure symptom-based medicine ...... 13

2.1.3 Complexity of computing infrastructures in Life Science ...... 18


2.2 Scalability in distributed systems ...... 19

2.2.1 Introduction ...... 19

2.2.2 Scheduling and management scalability ...... 21

2.2.3 Storage scalability ...... 23

2.2.4 Computational scalability ...... 26

2.3 Architectures to support machine intelligence ...... 28

2.3.1 Machine Learning ...... 29

2.3.2 Deep Learning ...... 30

2.3.3 Hardware acceleration for AI research ...... 32

2.4 Compliance and security in distributed systems ...... 34

2.4.1 GDPR and privacy of patient data ...... 34

2.4.2 Security of the data ...... 37

2.4.3 Privacy of companies ...... 38

2.5 General-purpose analytical platforms for Life Science ...... 39

2.5.1 Introduction ...... 40

2.5.2 Existing architectures ...... 41

2.5.3 Conclusion ...... 45

3 eTRIKS Analytical Environment: Design Principles and Core Concepts 46

3.1 Introduction and users’ needs ...... 46

3.2 Existing knowledge management platforms and their limitations ...... 49

3.3 eTRIKS Analytical Environment ...... 52

3.3.1 Introduction ...... 52

3.3.2 General Environment ...... 53

3.3.3 Endpoints Layer ...... 54

3.3.4 Storage Layer ...... 56

3.3.5 Management Layer ...... 57

3.3.6 Computation Layer ...... 60

3.3.7 Interaction between Layers ...... 62

3.3.8 Security of the architecture ...... 63

4 Implementation of the eTRIKS Analytical Environment 65

4.1 Implementation ...... 65

4.1.1 General Environment ...... 65

4.1.2 Endpoints layer ...... 66

4.1.3 Storage Layer ...... 71

4.1.4 Management layer ...... 73

4.1.5 Computation Layer ...... 74

4.2 Benchmarking and Scalability ...... 75

4.2.1 Resource usage ...... 75

4.2.2 Scheduler ...... 76

4.2.3 Compute Scalability ...... 77

4.2.4 Storage Scalability ...... 79

4.2.5 Summary ...... 81

4.3 TensorDB: Database Infrastructure for Continuous Machine Learning ...... 82

4.3.1 Introduction ...... 83

4.3.2 Related work ...... 84

4.3.3 Architecture ...... 85

4.3.4 Application Evaluation ...... 89

4.3.5 Conclusion ...... 90

5 eTRIKS Analytical Environment with Privacy 91

5.1 Building Privacy capabilities ...... 91

5.1.1 Location data as a support for public health ...... 91

5.1.2 Attempts at sharing location data ...... 95

5.1.3 Sensitivity of location data ...... 96

5.2 Privacy preserving eTRIKS Analytical Environment ...... 98

5.2.1 New services and features ...... 98

5.2.2 Scalability of the platform ...... 104

5.2.3 Privacy of the platform ...... 108

5.2.4 Algorithms on the platform ...... 111

5.2.5 Privacy module for density ...... 113

5.2.6 Related work ...... 119

5.3 Discussion and future work ...... 121

6 Analytics Developed using the eTRIKS Analytical Environment 122

6.1 Analytics for tranSMART ...... 122

6.1.1 Iterative Model Generation and Cross-validation Pipeline ...... 123

6.1.2 General statistics ...... 126

6.1.3 Pathway Enrichment ...... 128

6.2 DeepSleepNet ...... 130

6.2.1 Introduction ...... 130

6.2.2 Tackle class imbalances ...... 132

6.2.3 Results ...... 132

6.3 Characterizing Political Deception On Twitter ...... 137

6.3.1 Background ...... 138

6.3.2 Data and Methodology ...... 141

6.3.3 Feature Selection ...... 147

6.3.4 Fake news classification ...... 158

6.3.5 Conclusion ...... 164

7 eTRIKS Analytical Environment supporting Open Science 166

7.1 Sustainability of the platform ...... 166

7.1.1 Hosting of the project and supporting the users ...... 166

7.1.2 Continuous integration and system deployment ...... 170

7.1.3 Agile methods ...... 170

7.2 Future of the platform ...... 172

7.2.1 Adopters ...... 172

7.2.2 Community building ...... 173

8 Conclusion 174

8.1 Summary of Thesis Achievements ...... 174

8.2 Future Work ...... 176

Bibliography 177

List of Tables

2.1 Feature comparisons between eTRIKS Analytical Environment, IBM Platform Conductor, Arvados, BOINC and Petuum...... 43

5.1 Structure of a Call Detail Record...... 102

5.2 Comparison of core operations for the four potential solutions considered for the database...... 105

6.1 Confusion matrix from Supratak et al. [SDWG17] obtained from the cross-validation on the F4-EOG channel from the MASS dataset ...... 136

6.2 Confusion matrix from Supratak et al. [SDWG17] obtained from the cross-validation on the Fpz-Cz channel from the Sleep-EDF dataset ...... 136

6.3 Contingency table reporting the differences and the similarities between the labelling performed by the two teams on the dataset used...... 143

6.4 Analysis of features coming from Twitter API. The results (p-value and t-stat) come from the Kolmogorov-Smirnov test [MJ51] on the distributions between the viral fake news and the other viral tweets. Rows are ordered by p-value. Variables above the line are those whose differences are considered statistically significant (p-value smaller than 0.01)...... 148

6.5 Features extracted for the text analysis. Again, rows are ordered by statistical significance; significant variables are above the line. It is interesting to see that those are mostly the ones associated with spelling used by bots (randomly generated to avoid collisions)...... 154

List of Figures

2.1 Conceptual representation of replication with eventual consistency...... 24

2.2 Conceptual representation of replication with strong consistency...... 25

2.3 42 Years of microprocessor trend [Kar18] ...... 27

2.4 An example of an artificial neural network with the mathematical model [Glo] of an artificial neuron [Sta]. An input signal x0 travels along the axons, which then interacts with dendrites of the other neuron and becomes w0x0 based on the synaptic strength. The synapse w0 between the axon and the dendrite controls the strength of influence of one neuron on another. The dendrites carry the weighted input signals to the cell body. The cell body accumulates the results, sums them, and then applies an element-wise function f to fire an output signal via an output axon...... 31

3.1 A schematic representation of the architecture of the eTRIKS Analytical Environment...... 53

4.1 A schematic representation of the eTRIKS Analytical Environment implementation...... 67


4.3 A schematic representation of the integration of Borderline with the eAE and the integration of tranSMART as part of the eDP. Users would access the platform through (Borderline web UI(s)) where they can select datasets via (Borderline data-source middleware(s)). Data is retrieved from a target such as (eDP External API) which in turn relies on (eDP File Parser(s)) and (eDP Query Executor(s)) to compile the selection. Once extracted, the data is pushed to (Swift Object Store cluster). In addition to selecting data, users might use (Borderline UI) to write custom analysis code and workflows. These are bundled with the data and made available to (Borderline Cloud Connector(s)). From there, it is sent to (eAE External API) via (eAE File Carrier(s)). It is then dispatched by (eAE Scheduler(s)) on the appropriate (eAE Compute Node(s)). Once the computation has finished, results come back to (Borderline Cloud Connector(s)) and are pushed to (Swift Object Store cluster) to be accessible to the users. Operational items such as service health, sessions and routes are stored in (MongoDB cluster)...... 69

4.4 Illustration of a simple submission of three jobs using the python eae package. . 72

4.5 Evolution of the usage percentage aggregated per two days across all the machines during three months. We observe that the usage of the compute resources is significantly improved (21% on average)...... 76

4.6 The performance of a single scheduler with respect to the submission size. Each point represents the average running time of 10 experiments along with the standard deviation...... 77

4.7 The performance of the Management Layer as the number of schedulers decreases. Each point is the average running time of 3 experiments along with the standard deviation...... 78

4.8 The scalability of the eTRIKS Analytical Environment with respect to the data size. Each point represents the average running time of 30 experiments, with the error bar representing the standard deviation...... 79

4.9 The compute scalability of the eTRIKS Analytical Environment with respect to the cluster size. Each point represents the average running time of 30 experiments, with the error bar representing the standard deviation...... 80

4.10 The storage upload scalability of the eTRIKS Analytical Environment with respect to the data size. Each point represents the average running time of 5 experiments, with the error bar representing the standard deviation...... 81

4.11 The storage download scalability of the eTRIKS Analytical Environment with respect to the data size. Each point represents the average running time of 5 experiments, with the error bar representing the standard deviation...... 81

4.12 TensorDB workflow: the database query mechanism connects all the components. The work is distributed across multiple machines...... 85

5.1 Sources and sinks of people and parasites from Wesolowski et al.’s [WET+12] study. Kernel density maps showing ranked sources (red) and sinks (blue) of human travel and total parasite movement in Kenya, where each settlement was designated as a relative source or sink based on yearly estimates. (A) Travel sources and sinks. (B) Parasite sources and sinks...... 93

5.2 A schematic representation of the architecture of the OPAL platform...... 99

5.3 A schematic representation of the flow of the data from the raw data to the platform’s output. 1. Pseudonymizing and ingesting data 2. Data fetching for compute and creation of user-specific CSVs 3. Executing Map function 4. Outputs aggregation and applying privacy mechanisms ...... 103

5.4 The insertion performance of Timescale with the amount of data inserted. We observe that the speed remains essentially stable as the amount of data being ingested increases...... 106

5.5 The data fetching performance of Timescale with the time interval of the query. We observe fetching overheads to be significant for smaller queries, which decrease as the query interval size increases...... 106

5.6 Measuring time taken for Compute and number of users for various interval range sizes with sampling parameters. Sampling Parameters used - Blue: 1%, Red: 10%, Brown: 100%...... 108

6.1 Iterative model generation and cross-validation pipeline ...... 124

6.2 Modelling of the pipeline for an unbiased approach to statistical testing of whole datasets...... 127

6.3 Illustration of a KEGG disease pathway with the differentially expressed genes associated with smoking...... 129

6.4 An overview architecture of DeepSleepNet from Supratak et al. [SDWG17] consisting of two main parts: representation learning and sequence residual learning. Each trainable layer is a layer containing parameters to be optimised during a training process. The specifications of the first convolutional layers of the two CNNs depend on the sampling rate (Fs) of the EEG data...... 131

6.5 Examples from Supratak et al. [SDWG17] of the hypnogram manually scored by a sleep expert (top) and the hypnogram automatically scored by DeepSleepNet (bottom) for Subject-1 from the MASS dataset...... 136

6.6 A schematic representation of the architecture of the proposed FakeNews Platform ...... 146

6.7 Density distribution of the decimal logarithm of the continuous variables from Table 6.4 that are statistically significant. From the image, we can see that viral tweets not containing fake news (in blue) tend to have peakier distributions. . . 149

6.8 Distribution of the four significant discrete variables (user.verified, num hashtags, num mentions and num media) from Table 6.4. The test for the proportion of verified accounts confirms an expected fact: the proportion of verified accounts is much lower for viral tweets containing fake news than for other viral tweets, suggesting that fake news tend to be created by more ‘anonymous’ people. Besides, tweets with fake news generally have more hashtags and media but fewer mentions...... 150

6.9 Most recurrent words in the tweets (single and bigram) ...... 150

6.10 Frequency of appearance of most used hashtags in tweets containing fake news (red) and not containing them...... 151

6.11 Correlation of the features related to the spreading of tweets. rt stands for retweet (e.g. rt timeto10, time to get to 10 retweets), and fav for favourite (e.g. fav timeto10, time to get to 10 favourites)...... 152

6.12 Distribution of the decimal logarithm of the time (in hours) to get to 1000 favourites (fav timeto1000) for both tweets containing fake news and tweets not containing them. The associated p-value is 9.60e-11, which confirms the significance of the propagation pattern...... 153

6.13 Comparison between the different core sentiments between tweets containing fake news and tweets not containing them...... 155

6.14 Evolution of the different core sentiments over the course of the four months, between tweets containing fake news and tweets not containing them...... 156

6.15 Difference of the evolution of the sentiment computed by word2vec between tweets containing fake news and other tweets. Each point represents a tweet in the timeline of our dataset and the probability of the tweet for being positive. The blue line represents the average probability per day...... 157

6.16 Most used emojis in the dataset ...... 158

6.17 Distribution of the decimal logarithm of the number of emojis and the sentiment score of tweets with emojis for the fake news and the other tweets ...... 159

6.18 AUC computed on all subsets of features for the different machine learning algorithms evaluated...... 161

6.19 Best performances for each subset of features, and for each metric of performance...... 161

6.20 Evaluation of the impact of hyperparameters on XGboost performance . . . . . 163

6.21 Most important variables for the best model for the AUC and the recall . . . . . 163

7.1 Illustration of the eAE’s Scheduling and Management service hosted on GitHub. The repository contains a README describing the main features of the service, the docker file to build the docker container, the YAML build file (.travis.yml) for Travis, the tests to automatically validate the build, and the code of the service. The issues tab contains all the issues reported by users and developers or future features related to the service. The pull requests tab contains all currently open pull requests to be merged into the development branch by the admins. The wiki tab contains all the necessary documentation of the service (design of the service, description of the API, comments, etc). The shields (build and dependencies) are dynamic, which allows anyone to check the current status of the project. . . . 168

7.2 Illustration of the create job query in Postman with the type of request (POST), the URL, a description of the request, the parameters in the body of the request and a Python version example of the request. We can also see an example response for the query with the associated code (200 in this instance)...... 169

Chapter 1

Introduction

1.1 Motivation and objectives

The analysis of very large and growing, multiscale, multimodal datasets (i.e., Big Data) has today become a major challenge to address in order to convert data into knowledge and achieve innovation. This is particularly so in biomedical research, where scientists are more and more confronted with Big Data challenges due to the rapid advances of high-throughput biotechnologies. The complexity, diversity, rich context and size of recent biomedical data, such as Next Generation Sequencing (NGS) data, ’omics, and imaging data, have shown the limitations of current systems. The collection, management, storage, and analysis of biomedical Big Data consequently mandate the development of new methodologies and technologies.

The problems of developing systems for analysing multi-modal medical data are, on the one hand, the massive amounts of data needed for analysis and the associated need of a scalable infrastructure and, on the other hand, the quickly changing needs of those analytics, i.e., the need for new and different algorithms and tools for data processing, integration and analytics.

We developed the eTRIKS Analytical Environment (eAE) in answer to these needs of analysing and exploring massive amounts of medical data. The eTRIKS Analytical Environment is a modular framework which enables the analysis of medical data at scale. Its modular architecture

allows for the quick addition or replacement of analytics tools and modules with little overhead, thereby ensuring support of users as the data analytics needs and tools evolve. We built the eTRIKS Analytical Environment on well-accepted technologies to ensure user adoption. As we will show in the evaluation section, the system scales very well in the number of users as well as in the amount of data analyzed. Several examples where the eAE has led to successful research will be presented as well to illustrate the scope of capabilities in the context of translational research.

To demonstrate the universality of the platform to other domains, we will present two successful projects that have been carried out using the eAE. The first project aimed at creating a privacy preserving analytical platform for terabytes of location data in the context of public health research and monitoring. This project has enabled us to demonstrate the extensibility and modularity of the eAE with the implementation of new modules to extend the eAE’s vanilla capabilities and meet the needs of the OPAL project. The second project aimed at identifying relevant features for assessing political deception on Twitter using statistical, machine learning and deep learning methods. This second project demonstrated the flexibility of the eAE in the context of data science research by providing a broad scope of analytical capabilities.

1.2 Contributions

This thesis presents a new architecture for high-performance distributed computation and concurrent multi-tenancy. This new architecture leverages new and old technologies which have matured in the last few years, such as in-memory computation and containerization technology. We will provide several examples where the architecture has been successfully applied, the new possibilities it has enabled and its extensibility to other domains. The main contribution is designing and successfully implementing an open source version of the architecture. More specifically, this thesis makes the following contributions:

1. The definition of a new architecture for high-performance analysis called the eTRIKS Analytical Environment (eAE).

2. The implementation of the architecture and its evaluation.

3. The extension of the eAE to support privacy compliant analytics and the development of a privacy preserving population density algorithm.

4. A framework, called TensorDB, that fuses database infrastructure and application software to streamline the development, training, evaluation and analysis of machine learning models. This work was later incorporated within TensorLayer [DSM+17] to support the training of Deep Learning and Reinforcement Learning models in a fashion seamless to the users.

5. The implementation of new workflows for tranSMART that leverage the eAE and the support of novel life science approaches for feature extraction using deep learning models.

6. The identification of relevant features for characterizing political deception on Twitter.

1.3 Impact and adoption of the research

The first adopters and strongest supporters have been the Data Science Institute and the eTRIKS project at Imperial College London. Thanks to that adoption, we have been able to support various internal analytical efforts (DeepSleep [SDWG17], FakeNews, Borderline [OGA+18]) in the context of the Innovative Medicines Initiative (IMI) and the European Union’s Horizon 2020 projects such as OncoTrack [GYVdS+19]. The ITMAT project, which is part of the National Institute for Health Research Imperial BRC, has also adopted the eAE as their analytical environment to propel their analyses.

Another major adopter of the eAE was International Business Machines Corporation (IBM). They added it to their portfolio of supported projects, and it is advertised to their clients as part of their large scale computing platforms and POWER architecture.

The most recent adopter has been the OPen Algorithm (OPAL) project as detailed in Chapter 5. They use the eAE platform in production in the context of on-premises pilot projects with

Orange-Sonatel in Senegal and Telefónica in Colombia. There are, at the time of writing, around a dozen people (from the Senegalese government, the United Nations, the Agence Nationale de Statistique et de la Démographie, and researchers from Orange and Telefónica, among others) actively using them in each country, and an additional five other actors are evaluating the adoption of the platform in their respective countries. The project and the platform were featured at the United Nations’ World Data Forum1, the McKinsey Global Institute2 and The Innovator3.

1.4 Thesis organisation

Chapter 2 introduces the background of this thesis, presenting some of the new challenges that Big Data in the context of translational medicine research introduces and thus the necessity of creating a new analytical platform to address those challenges.

Chapter 3 presents the eTRIKS Analytical Environment architecture design principles that we developed and a comparison with existing systems.

Chapter 4 describes the implementation of the proposed architecture, which has been open-sourced.

Chapter 5 introduces the public health benefits that can be reaped from location data and the privacy issues that the use of location data introduces. We then introduce the OPAL project and the extension of the eTRIKS Analytical Environment as the solution for their secure and privacy compliant platform for location data analytics.

Chapter 6 introduces the analytics that have been developed for tranSMART using the eAE as back-end and other life science projects and analyses that have been carried out using the eAE.

1 https://www.opalproject.org/newsfeed/2018/11/15/opal-session-and-demonstration-at-the-united-nations-world-data-forum
2 See notes in: https://www.mckinsey.com/featured-insights/artificial-intelligence/applying-artificial-intelligence-for-social-good
3 https://static1.squarespace.com/static/599ef170197aeac586fed53f/t/5c7c67dbe5e5f0bda52ac290/1551656928637/The+Innovator+February+2019.pdf

Chapter 7 presents how the eTRIKS Analytical Environment takes part in the effort towards supporting Open Science.

1.5 Statement of Originality

I declare that the content of the thesis is composed by myself, and the work it presents is my own. All use of the previously published work of others has been listed in the bibliography.

1.6 Publications

In relation to this thesis:

1. Characterizing Political Deception On Twitter: A case study on the 2016 US elections In: Submitted to IEEE Access. (Oehmichen et al.)
Political fake news has become a major challenge of our time, and its successful flagging a main source of concern for publishers, governments and social media. The approach we present in this work focuses on Twitter and aims at finding characteristic features (including temporal diffusion and NLP) that can help in the process of automating the identification of tweets containing fake news. In particular, we looked into a dataset of four months’ worth of tweets related to the 2016 US presidential election. Our results suggest that there are indeed some features (such as favourite and retweet counts, the distributions of followers, or the number of URLs in tweets) that can lead to successful identification of tweets containing fake news.

2. OPAL: High-performance platform for large scale privacy-preserving location data analytics In: Submitted to the 28th ACM International Conference on Information and Knowledge Management (CIKM 2019). (Oehmichen et al.)
Mobile phones and other ubiquitous technologies are generating vast amounts of high-resolution location data. This data has been shown to have a great potential for the

public good, e.g. to monitor human migration during crises or to predict the spread of epidemic diseases. Location data is, however, considered one of the most sensitive types of data, and a large body of research has shown the limits of traditional data anonymization methods for Big Data. Privacy concerns have so far strongly limited the use of location data collected by telcos, especially in developing countries. In this paper, we introduced OPAL (for OPen ALgorithms), an open-source, scalable, and privacy-preserving platform for location data. At its core, OPAL relies on an open algorithm to extract key aggregated statistics from location data for a wide range of potential use cases. We first discuss how we designed the OPAL platform, building a modular and resilient framework for efficient location analytics. We then describe the layered privacy mechanisms we put in place to protect privacy, giving formal verification for our population density algorithm. We finally evaluate the scalability and extensibility of the platform and discuss related work.

3. A multi-tenant computational platform for translational medicine In: 38th IEEE International Conference on Distributed Computing Systems (ICDCS 2018). (Oehmichen et al.)
In this paper, we presented the eTRIKS platform. It introduces three new components, namely the eTRIKS Analytical Environment (eAE), the Borderline project and the eTRIKS Data Platform (eDP). Each component was built to be part of a microservice architecture. In our implementation we assume the underlying operational database to be MongoDB and the functional object store to be Swift. Borderline and the eAE were both written in JavaScript assuming Node.js as runtime environment, while the eHS was written partly in C#, partly in JavaScript. The choice of a rather uniform Node.js-based stack enables a simplified maintenance of all lightweight microservices. Likewise, all communication between the components relies on a set of similarly designed HTTP APIs. Finally, each individual microservice is developed to work seamlessly in Docker containers, allowing flexible and efficient orchestration of deployment and easy scale up/down of services independently.

4. eTRIKS analytical environment: A modular high performance framework for medical data analysis In: Big Data 2017. (Oehmichen et al.)

In this paper, we presented the eTRIKS Analytical Environment. It introduced a component-based, distributed framework for distributed data exploration and high-performance computing. We designed the eTRIKS Analytical Environment to provide users with an analytics environment which (a) has frontends/endpoints which are user-friendly, extensible as well as easily integrated into tools, (b) is modular and finally (c) is also scalable. At the top, interacting with users, is the endpoints layer which essentially hosts the containers which either provide the UI to users or the interface to integrate it into third party, external tools. The endpoints layer also contains the infrastructure to run smaller computations locally. Interacting with the endpoints layer is the storage layer which caches analytics results to avoid recomputation of frequent analyses, thereby making analysis more efficient. If a computation still needs to be computed, the scheduling layer will take care of it and schedule it on the computation layer once computational resources become available. The computation layer provides the capability for the distributed computation of analyses and thus enables the scalability of the environment.

5. Characterizing Political Fake News in Twitter by its Meta-Data In: ArXiv. (Amador et al.)
In this paper, we presented a preliminary approach towards characterizing political deception on Twitter through the analysis of their meta-data. In particular, we focused on more than 1.5M tweets collected on the day of the election of Donald Trump as 45th president of the United States of America. We used the meta-data embedded within those tweets in order to look for differences between tweets containing fake news and tweets not containing them.

6. TensorDB: Database Infrastructure for Continuous Machine Learning In: 19th International Conference on Artificial Intelligence 2017 (ICAI’17). (Liu et al.)
In this paper, we introduced the TensorDB system, a framework that fuses database infrastructure and application software to streamline the development, training, evaluation and analysis of machine learning models. The design principle is to track the whole model building process with a database and to connect the different components through a database query mechanism. This design produces a highly flexible framework enabling each component to be updated.

The theoretical value is that it enables continuous machine learning. TensorDB is motivated by the production application of machine learning models, as a consolidation of many engineering practices, and serves as the foundation for high-level tools for machine learning applications.
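To make the design principle concrete, the following is a minimal sketch of database-tracked training runs, assuming a local MongoDB instance reachable with pymongo; the database and collection names, as well as the recorded fields, are illustrative and not the actual TensorDB schema.

```python
# Minimal sketch of database-tracked training runs: every run is stored as a
# document, so models, hyperparameters and metrics stay queryable over time.
# Assumes a MongoDB instance on localhost; names and values are placeholders.
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
runs = client["tensordb_demo"]["training_runs"]

# Record a finished training run.
runs.insert_one({
    "model": "sleep_stage_cnn",
    "hyperparameters": {"learning_rate": 1e-4, "batch_size": 32},
    "metrics": {"val_accuracy": 0.86, "val_f1": 0.79},
    "dataset_version": "v3",
    "finished_at": datetime.now(timezone.utc),
})

# Retrieve the best run so far for this model, enabling continuous comparison
# as new runs are appended.
best = runs.find_one({"model": "sleep_stage_cnn"},
                     sort=[("metrics.val_accuracy", DESCENDING)])
print(best["metrics"] if best else "no runs recorded yet")
```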

Other works:

1. A computational framework for complex disease stratification from multiple large-scale datasets In: BMC Systems Biology 2018. (De Meulder et al.)
In this paper, we presented a multilevel data integration for multi-’omics datasets on complex diseases. Indeed, multi-’omics datasets are becoming more readily available and there is a need to set standards and good practices for integrated analysis of biological, clinical and environmental data. In this paper, we present a framework to plan and generate single and multi-’omics signatures of disease states. The framework is divided into four major steps: dataset subsetting, feature filtering, ’omics based clustering and biomarker identification. We illustrate the usefulness of this framework by identifying potential patient clusters based on integrated multi-’omics signatures in a publicly available ovarian cystadenocarcinoma dataset. The analysis generated a higher number of stable and clinically relevant clusters than previously reported, and enabled the generation of predictive models of patient outcomes. This framework will help health researchers plan and perform multi-’omics Big Data analyses to generate hypotheses and make sense of their rich, diverse and ever-growing datasets, to enable implementation of translational P4 medicine.

2. Visualizing Large Knowledge Graphs: A Performance Analysis In: Future Generation Computer Systems (FGCS) 2018. (Gomez-Romero et al.)
In this paper, we discussed the increasing importance of Knowledge Graphs as a source of data and context information in Data Science. A first step in data analysis is data exploration, in which visualization plays a key role. Currently, Semantic Web technologies are prevalent for modelling and querying knowledge graphs; however, most visualization

approaches in this area tend to be overly simplified and targeted to small-sized representations. In this work, we analyzed the performance of Big Data technologies applied to large scale knowledge graph visualization. To do so, we have implemented a graph processing pipeline in the Apache Spark framework and carried out several experiments with real-world and synthetic graphs. From our benchmarks, we conclude that distributed implementations of the graph building, metric calculation and layout stages can efficiently manage very large graphs, even without applying partitioning or incremental processing strategies.

3. TensorLayer: A Versatile Library for Efficient Deep Learning Development In: ACM on Multimedia Conference 2017 (MM’17). (Dong et al.)
In this paper, we introduced a new versatile Python library that aims at helping researchers and engineers efficiently develop deep learning systems. Deep learning has enabled major advances in the fields of computer vision, natural language processing, and multimedia among many others. Developing a deep learning system is arduous and complex, as it involves constructing neural network architectures, managing training/trained models, tuning the optimization process, preprocessing and organizing data, etc. TensorLayer offers rich abstractions for neural networks, model and data management, and a parallel workflow mechanism. While boosting efficiency, TensorLayer maintains both performance and scalability. TensorLayer was released in September 2016 on GitHub, and has helped people from academia and industry develop real-world applications of deep learning.

4. Optimising parallel R correlation matrix calculations on gene expression data using MapReduce In: BMC Bioinformatics, vol. 15 (2014). (Wang et al.)
This paper evaluated the current parallel modes for correlation calculation methods and introduced an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. The performance has been studied using two gene expression benchmarks. In the micro-benchmark, the new implementation using MapReduce, based on the R package RHIPE, demonstrates a 3.26-5.83 fold increase compared to the default Snowfall and a 1.56-1.64 fold increase compared to the basic RHIPE in the Euclidean, Pearson and Spearman correlations. Though vanilla R and

the optimised Snowfall outperform the optimised RHIPE in the micro-benchmark, they do not scale well with the macro-benchmark. In the macro-benchmark the optimised RHIPE performs 2.03-16.56 times faster than vanilla R. Benefiting from the 3.30-5.13 times faster data preparation, the optimised RHIPE performs 1.22-1.71 times faster than the optimised Snowfall. Both the optimised RHIPE and the optimised Snowfall successfully perform the Kendall correlation with the TCGA dataset within 7 hours. Both of them run more than 30 times faster than the estimated vanilla R time.

5. DSIMBench: A benchmark for microarray data using R In: 40th International Conference on Very Large Databases (VLDB 14). (Wang et al.)
Parallel computing in R has been widely used to analyse microarray data. We have seen various applications using various data distribution and calculation approaches. Newer data storage systems, such as MySQL Cluster and HBase, have been proposed for R data storage, while the parallel computation frameworks, including MPI and MapReduce, have been applied to R computation. Thus, it is difficult to understand which analysis workflows and tool kits are suited to a specific environment. This paper proposes DSIMBench, a benchmark containing two classic microarray analysis functions with eight different parallel R workflows, and evaluates the benchmark in the IC Cloud test-bed platform.

Chapter 2

Background

In this chapter, we will start with a presentation of the transition of life science research towards large scale data analysis. Then, we will introduce the necessary technical background on machine learning including supervised learning, unsupervised learning, and deep learning and their associated hardware accelerators, which will be leveraged in this thesis. We will follow with a brief discussion on privacy and security related issues associated with personal data. Finally, we will review the existing general-purpose analytical platforms for Life Science and highlight the necessity to develop a new platform for the efficient management and analysis of large scale medical data in a privacy preserving fashion.

2.1 Towards large scale data analysis in Life Science

Many modern scientific applications directly or indirectly depend on distributed systems running in large-scale compute clusters situated in data centres. Historically, those distributed systems have relied on monolithic supercomputers to perform the required tasks. However, the declining cost of hardware has enabled companies and universities to purchase tremendous amounts of commodity and specialised hardware and servers, and has led to the advent of Cloud Computing.


2.1.1 A deluge of data

The technical capabilities for data collection, as well as the variety of the data collected, have increased exponentially in the last few years, introducing new, unprecedented challenges to information management. The number of available data sources exploded thanks to the development of the Web 3.0 and social networks, the ubiquity of mobile devices (mobile phones, wearables, fitness trackers, etc.) and sensor networks (CCTV, air quality, machine health, etc.). Advances in gathering scientific data also contributed considerably to this development.

Our ability to collect medical data in particular is increasing even faster, which leaves us with a humongous amount of data that can only be analyzed with difficulty. That explosive growth is not expected to slow down anytime soon for several reasons. Firstly, sequencing hardware has become so affordable that entire farms of devices are sequencing DNA faster and more accurately than ever. Indeed, to increase the coverage of DNA sequencing, only more hardware needs to be used. An example from genomics illustrates the growth very well: next-generation sequencing has led to a rise in the number of human genomes sequenced every year by a factor of 10 [Mar13, Cos14], which far outpaces the data analysis capabilities. Secondly, sequencing is only one example among a very large pool. Data today is recorded with more and more devices (e.g. MRI, X-ray or other medical devices and even activity trackers) and there are substantial efforts to move medical records to an electronic form to facilitate exchange and analysis. Thirdly, the instruments used are becoming increasingly precise, which translates into increasingly high resolutions and a massive growth of the data output [Met10]. The size of the data, however, is not the only challenge of working with medical data. The data produced is becoming more heterogeneous (originating from different devices, software, manufacturers, etc.), more complex (protein folding simulation, metabolomics, etc.) as well as dirtier (incomplete, incorrect, inaccurate, irrelevant records) at the same time, and is thus in dire need of reconciliation, integration, and cleansing. Efficient integration of the data is, nevertheless, pivotal as medical data is moving more and more to the centre of medical research to develop novel diagnostic methods and treatments [Glu05, ST12]. As solely symptom-based diagnostic methods are slowly showing their limitations, the shift to a more data-driven medicine makes the ability to efficiently extract, transform and analyze medical data key in the new era of data-centric medicine [OMBe12].

The resulting flood of data is difficult, if not impossible, to manage using traditional tools. On the one hand, the sheer size of the data requires massive storage arrays and massively parallel processing nodes using hundreds or thousands of computers. On the other hand, innovative methods for data analysis are necessary to extract useful information from the raw data. Big Data has four important features, the so-called four Vs [IBMc, Lan01]: Volume of data, Velocity of processing the data, Variety of data sources, and Veracity of the data quality. These four characteristics need to be addressed with specific theories and technologies. Some recent large-scale medical research initiatives like the Human Brain Project [MML+11], the BRAIN initiative [ILC13], the Precision Medicine initiative [Ash15] or the recent BD2K [Nat] Big Data in genomics initiative, have helped considerably by proposing harmonised data formats, thus addressing some of the Big Data challenges in medical research and healthcare. Notwithstanding, many challenges remain, among them producing scalable analytics and scalable platforms to propel them.

2.1.2 Moving away from a pure symptom-based medicine

In order to understand or discover biological processes, and thanks to advances in other fields such as physics and chemistry, medical research has increased the size and variety of the data collected. That abundance of data has driven medical research to leverage a vast wealth of analytics to explore and analyze the data.

Medical analytics background

Traditional approaches to diagnosing and treating diseases based on symptoms alone are no longer sufficient in the face of increasingly complex disease patterns. Entirely dissimilar diseases with substantially different underlying causes and interactions can exhibit similar symptoms. Those overlapping apparent similarities render their diagnosis and the proper treatment based solely on symptoms very challenging. To overcome the limitations of identification and thus treatment of disease based on symptoms alone, medical scientists are in the process of developing new methods to understand the precise interactions (pathways and regulatory networks), at different biochemical and biophysical levels, that lead from local malfunction to disease, or how psychological, nutritional, therapeutic, or other environmental factors modulate these interactions. Given the complexity of diseases, algorithms to process and analyze medical data efficiently and in a scalable fashion are a crucial component of these new methods: data of different sources and modalities (medical records, genetics, imaging, blood tests, etc.) need to be integrated and used to discover meaningful correlations within and between etiological, diagnostic, pathogenic, treatment, and prognostic parameters. By grouping patients with the same disease into constellations of parameterised biological, anatomical, physiological and clinical variables that define homogeneous populations, signatures of disease (shared attributes of patients suffering from the same disease) can be identified. Disease signatures or biomarkers are crucial to discriminate diseases for the purpose of developing more robust and reliable diagnostic methods and to better understand the disease itself (with the potential to develop new treatments).

On this path towards data-centric medical research and to move beyond the classic symptom-based diagnostics and treatments, medical research is facing major challenges. First, data heterogeneity and data quality are not directly related to the underlying technology. The protocols and acquisition processes used to capture that data directly impact the quality and dirtiness of the data captured. Detailing those processes and training medical professionals on the consequences of those choices is therefore key to ensure that the interpretation of the results is reasonable. Second, whilst the fundamental analytics algorithms may not be developed by medical professionals, it is important that life science researchers and doctors understand the underlying concepts to properly build models of diseases. Even if it is expected that some insight can be obtained through the analysis of data alone, fundamental domain knowledge will always be needed. Hence, it is of fundamental importance that the developed models are not only informed by data but also by an understanding of the domain, i.e., models that clearly understand and connect cause and effect. Furthermore, an understanding by medical researchers of the strengths and weaknesses of different analysis approaches will only strengthen the models, by enabling the design of combinations of approaches for improved analysis.

Causality

Epidemiological research usually seeks to identify whether any causal relationship exists between a risk factor and a disease. A traditional example that illustrates the association between risk factor and disease is the impact of smoking on lung cancer [WC54]. However, the story of the person who smoked throughout their life and never suffered from cancer shows that epidemiological problems are not straightforward. There is an association at work, but exposure is a necessary but not sufficient condition for disease. Another problem for establishing causality in epidemiology is the necessity to reject other possible explanations for the observed association. Confounding factors may arise as well, causing a spurious association between dependent and independent variables. For example, many people who smoke heavily have low intakes of vitamins [SHM90]. The Bradford Hill criteria [HIL65] have been used to strengthen the evidence of causality in these types of studies [ANe13]. Statistical methods, as well as sophisticated and specialised tools, have been developed in R, Stata or SAS to conduct such research and uncover new causal factors.
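To illustrate how a confounder can induce a spurious association, the short simulation below (synthetic data, not taken from the cited studies) generates an outcome that depends only on a confounder; the exposure appears correlated with the outcome until the confounder is regressed out.

```python
# Minimal sketch: a spurious exposure-outcome association induced by a confounder.
# All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
confounder = rng.normal(size=n)             # e.g. an unmeasured lifestyle factor
exposure = confounder + rng.normal(size=n)  # exposure driven by the confounder
outcome = confounder + rng.normal(size=n)   # outcome driven by the confounder only

# Crude association: clearly non-zero even though the exposure has no causal effect.
print(np.corrcoef(exposure, outcome)[0, 1])

# Adjust for the confounder by regressing it out of both variables.
resid_exposure = exposure - np.polyval(np.polyfit(confounder, exposure, 1), confounder)
resid_outcome = outcome - np.polyval(np.polyfit(confounder, outcome, 1), confounder)
print(np.corrcoef(resid_exposure, resid_outcome)[0, 1])  # close to zero after adjustment
```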

Testing

A statistical hypothesis is a hypothesis that is testable on the basis of observing a process that is modelled via a set of random variables [MA63]. Statistical analysis aims to provide statistical insights about a dataset for further research, without requiring any prior statistical knowledge, by performing multiple statistical tests on the given data set. Statistical hypothesis testing is a fundamental technique of both Bayesian and frequentist inference, although the two types of inference have noteworthy differences. Statistical hypothesis tests establish a procedure that controls the probability of incorrectly deciding that a default position (null hypothesis) is incorrect. The procedure is based on how plausible it would be for a set of observations to take place if the null hypothesis were true.
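As a concrete illustration of the procedure, the minimal sketch below runs a two-sample test on synthetic expression values for two patient groups and checks the resulting p-value against a 1% significance threshold; the data and threshold are illustrative only.

```python
# Minimal sketch: two-sample hypothesis test on synthetic expression values.
# The null hypothesis is that both groups share the same mean expression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=5.0, scale=1.0, size=40)  # synthetic control group
cases = rng.normal(loc=5.8, scale=1.0, size=40)    # synthetic disease group

t_stat, p_value = stats.ttest_ind(control, cases)

# Reject the null hypothesis at the 1% significance level.
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, reject H0: {p_value < 0.01}")
```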

Testing has been of crucial importance early on in biomedical research. Poor and complex signals, the curse of dimensionality, and the computational needs of Bayesian methods are only a few of the problems that researchers have faced. Many techniques have been successfully applied to overcome these problems. Principal component analysis (PCA) is a frequently used signal separation technique to discover potential subgroups of a dataset [JC16]. It uses an orthogonal transformation to convert observations of correlated variables into linearly uncorrelated ones (i.e. principal components); the number of principal components is less than or equal to the smaller of the number of original variables and the number of observations, thus effectively reducing the dimensionality.
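A minimal sketch of PCA applied to a synthetic patients-by-genes expression matrix is shown below; the matrix dimensions and the number of retained components are arbitrary choices for illustration.

```python
# Minimal sketch: dimensionality reduction of a (patients x genes) expression
# matrix with PCA. The matrix is random and only stands in for real data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
expression = rng.normal(size=(100, 5000))  # 100 patients, 5000 genes (synthetic)

# Standardise each gene, then project onto the first 10 principal components.
scaled = StandardScaler().fit_transform(expression)
pca = PCA(n_components=10)
components = pca.fit_transform(scaled)     # shape: (100, 10)

print(components.shape)
print(pca.explained_variance_ratio_)       # variance captured by each component
```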

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups.

Clustering analysis falls into two categories taking different kinds of input: feature-based clustering and similarity-based clustering. Feature-based clustering takes a feature matrix as the input and is applicable to raw, noisy datasets. Commonly, finite mixture models, such as the mixture of Gaussians model, and infinite mixture models, such as the Dirichlet process mixture model, are used [Bis06]. The basic idea of using mixture models is to first fit the mixture model to the data and then compute the posterior probability that a data point belongs to a cluster. The similarity-based clustering method, on the other hand, requires a distance matrix as the input and can accommodate domain-specific similarity measures.

In Bioinformatics, clustering methods can be used to group similar samples and also similar features. For example, a gene expression dataset collected from multiple patients can be represented by a matrix, in which rows represent genes and columns represent patients. The resulting matrix can well exceed terabytes in the context of a proteomics analysis. Clustering by columns (patients) can find groups of patients, resulting in a possible patient stratification or in the discovery of correlations between genes and conditions [KBCG03, OKHC14].
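The following sketch illustrates feature-based clustering of patients with a Gaussian mixture model on a synthetic expression matrix; with real data, the matrix would be the patients-by-genes table described above, and the number of clusters would have to be chosen by model selection rather than fixed as here.

```python
# Minimal sketch: feature-based clustering of patients with a Gaussian mixture
# model, after reducing a synthetic (patients x genes) matrix with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
expression = rng.normal(size=(200, 1000))  # 200 patients, 1000 genes (synthetic)

reduced = PCA(n_components=20).fit_transform(expression)

gmm = GaussianMixture(n_components=3, random_state=0).fit(reduced)
labels = gmm.predict(reduced)              # hard cluster assignment per patient
posteriors = gmm.predict_proba(reduced)    # posterior probability per cluster

print(np.bincount(labels))                 # patients per putative subgroup
print(posteriors[0])                       # cluster membership probabilities, patient 0
```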

Time-series

A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time and is thus a sequence of discrete-time data. Biological processes are often dynamic; thus, researchers must monitor their activity at multiple time points. The most abundant source of information regarding such dynamic activity is time-series gene expression data [BJGS12]. Not surprisingly, generating time-series expression data has become one of the most fundamental methods for querying biological processes that range from various responses during development to cyclic biological systems. Recent improvements in methods for measuring gene expression, such as high-throughput RNA sequencing, and the increased focus on clinical applications of genomics make expression studies more feasible and relevant.
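As a small illustration, the sketch below simulates an equally spaced expression time series with a 24-hour cycle and recovers the period from its autocorrelation; the signal is synthetic and only stands in for real time-series expression data.

```python
# Minimal sketch: an equally spaced expression time series with a 24-hour cycle,
# and its autocorrelation, which peaks again at the period of the cycle.
import numpy as np

rng = np.random.default_rng(4)
hours = np.arange(0, 96, 2)                # one sample every 2 hours for 4 days
expression = np.sin(2 * np.pi * hours / 24) + 0.3 * rng.normal(size=hours.size)

centred = expression - expression.mean()
acf = np.correlate(centred, centred, mode="full")[centred.size - 1:]
acf /= acf[0]

lag = np.argmax(acf[5:]) + 5               # skip the trivial peak at lag 0
print(f"dominant lag is about {hours[lag]} hours")  # expected to be close to 24
```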

Prediction

In statistics, prediction belongs to statistical inference. One specific approach to such inference is known as predictive inference, but prediction can be undertaken within any of the different approaches to statistical inference [Cox07]. As a matter of fact, one practical description of statistics is that it serves as a means of transferring knowledge about a sample of a population to the whole population, and to other associated populations, which is not necessarily the same as prediction over time. The process known as forecasting refers to the transfer of information across time (often to specific points in time). Prediction is usually performed on cross-sectional data, while forecasting often requires time series methods.

Statistical methods used for prediction include regression analysis such as linear regression and generalised linear models (logistic regression, Poisson regression, Probit regression, etc.). As for forecasting, vector autoregression models and autoregressive moving average models can be utilised. The Kaplan-Meier estimator has been extensively used for patient survival analysis. The estimate may be useful to examine recovery rates, the probability of death, and the effectiveness of treatment [KBe11].
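A minimal sketch of Kaplan-Meier estimation on synthetic follow-up data is given below; it assumes the lifelines Python package, and the durations and censoring indicators are simulated rather than drawn from any real cohort.

```python
# Minimal sketch: Kaplan-Meier survival curve estimation on synthetic follow-up
# data, using the lifelines package (an assumption; any survival library would do).
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(5)
durations = rng.exponential(scale=24.0, size=150)  # months of follow-up (synthetic)
events = rng.integers(0, 2, size=150)              # 1 = event observed, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)

# Estimated probability of survival over time.
print(kmf.survival_function_.head())
```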

Deep learning algorithms, in particular convolutional networks, have rapidly become a methodology of choice for analysing medical images and predicting patient trajectories [GPe16, EKe17].

2.1.3 Complexity of computing infrastructures in Life Science

With the increasingly swift growth of data in biology and life sciences in general, we are witnessing a major evolution in the way research is conducted. Until recently, most life science researchers would carry out their research on their local machine or a single dedicated bare metal server. Putting aside the poor usage of certain tools [ZEEO16], that approach has been successful as long as the size of the data was reasonable (less than a few hundred GB) and the complexity of the problem limited. However, the limits of that approach have become apparent as research is increasingly moving from hypothesis-driven studies to data-driven simulations of whole systems to unveil more complex mechanisms. Such approaches necessitate the use of large-scale computational resources and a large panel of computational tools (different languages and packages, different interoperability needs or specialised hardware) in order to run increasingly complex analyses and achieve better data assimilation. Synergies between life science and architecture researchers are fundamental in moving research forward as distribution over many heterogeneous machines is required either to keep up with a large number of user requests, to perform parallel processing in a timely manner, to support mixed workloads or to be able to tolerate faults without disruption to service.

This shift represents a fundamental challenge, as the use of large-scale computational and storage resources traditionally requires extensive knowledge in parallel data processing (distributed storage, communication, orchestration, etc.), and the allocation of a large pool of computing resources to a single user would result in a massive amount of compute time being wasted. In order to expose an accessible programming interface to non-expert application programmers, and to act as personalised and on-demand bioinformatics services, data processing frameworks hiding challenging aspects of distributed programming had to be developed. Examples of the details abstracted include fault tolerance, complex scheduling, data provenance, security/privacy, and message-based communication. In addition to those core needs, it is vital to make sure that the data is not lost during computation and that nobody else accesses the data by accident or malice. Finally, the shift towards an on-demand paradigm would prevent users from corrupting the installations inadvertently and would allow architecture researchers to develop new features, such as support for privacy methods or improved scaling, to be used seamlessly by the users.

2.2 Scalability in distributed systems

The software architecture describes, in a symbolic and schematic way, the different elements of one or more computer systems, their interactions, and their interrelations. Unlike the specifications produced by the functional analysis, the architecture model, produced during the design phase, does not describe what a computer system should achieve, but rather how it should be designed to meet the specifications.

2.2.1 Introduction

Scalability can be defined as the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth [Bon00]. A system is deemed scalable if it is capable of increasing its total output under an increased load when additional resources (software and hardware) are added. In the context of electronic systems, databases, routers, and networking, a system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.

An algorithm, architecture, program, networking protocol, or other system is said to scale if it is suitably efficient and practical when applied to large situations: a large number of participating nodes in the case of a distributed system, large input data sets, or many users. If the design or system fails or slows down to the point that it is not helpful to the task at hand when the input quantity increases, the system is deemed not scalable. Fundamentally, if there are a large number of causes (n) that affect scaling, then resource requirements (such as algorithmic time-complexity) must grow less than n² as n increases. If we take a search engine as an illustration, it will be said to be scalable not only for the number of users, but also for the number of objects it indexes. Scalability refers to the ability of a site to increase in size as demand warrants while delivering consistent performance [DRW07]. The scalability of a system can be measured according to various dimensions:

1. Administrative scalability: An increasing number of organizations or users can easily share a single distributed system.

2. Functional scalability: The addition of new functionality to enhance a system can be carried out with minimal effort.

3. Geographic scalability: The distribution of the system from a local area to a more distributed geographic pattern does not result in a loss in performance, usefulness, or usability.

4. Load scalability: The capacity of the system to easily expand and contract its resource pool to accommodate heavier or lighter loads or number of inputs.

5. Generation scalability: The capacity of a system to scale up by using heterogeneous components from different vendors [MMSW07] at any given time.

6. Business scalability: The capacity of a system to accept increased business volume without impacting the contribution margin (margin = revenue − variable costs).

Methods of adding more resources for a particular application fall into two broad categories: horizontal and vertical scaling [MMSW07].

• Horizontal scaling (scale-out): The ability to connect multiple hardware or software entities, such as servers, so that they work as a single logical unit (cluster). The addition (or removal) of a new node results in a proportionate increase (decrease) in the capacity of the cluster. The decreasing cost of hardware and its increasing performance have enabled “commodity” systems to be commissioned for tasks that once would have required supercomputers. System architects can configure hundreds of small computing

nodes in a cluster to obtain an aggregated computing power that is often greater than computers based on a single traditional processor. The advancement of high-performance interconnects such as InfiniBand, Gigabit Ethernet and Myrinet further encouraged this model. These new deployments have encouraged developers to create new classes of distributed software for the efficient management and maintenance of multiple nodes, as well as hardware such as shared data storage with much higher I/O performance.

• Vertical scaling (scale-up): The ability to resize the resources of a single node in a system, typically involving the addition of CPUs or memory to a single entity. In the context of vertical scaling of existing systems, virtualization technology makes it possible to provide more resources for the hosted set of operating system and application modules to share. Application scalability is the ability of an application to improve performance (number of clients it can serve, computation speed, etc.) on a scaled-up version of the system [ERAEB05].

The two models are not mutually exclusive and each one presents drawbacks and advantages. A larger number of nodes means increased management complexity, concurrency issues, as well as more complex programming models. The latency and network speed between the nodes of the mesh further complicate the development and deployment of distributed applications. Additionally, some applications do not lend themselves to a distributed computing model and thus do not benefit at all from horizontal scaling. In contrast, a single node will always be bound by the maximum amount of resources its motherboard can host, which de facto excludes very large-scale applications.

2.2.2 Scheduling and management scalability

In the field of massively parallel processing, job schedulers are the “operating systems” of modern Big Data architectures and supercomputing systems. Job schedulers are responsible for allocating the computing resources and administering the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers, and were designed and optimised to run massive, long-running computations over days or even weeks on homogeneous hardware for specific tasks [RBA+18]. One of the first tasks of those supercomputers was scientific computation, such as modelling or numerical prediction. Numerical weather prediction using supercomputers started in the 1960s with the CDC 6600 [IWCCtT04] and used mathematical models of the atmosphere and oceans to predict the weather based on current weather conditions.

In its simplest form, a scheduler is composed of three base elements:

1. A queue containing the tasks

2. A worker to execute the task

3. A master managing the queue and dispatching the tasks to the worker
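
The following is a minimal, single-process sketch of these three elements using only the Python standard library; the task names and the pull-based dispatch are illustrative assumptions, not a description of any particular production scheduler.

```python
import queue
import threading

task_queue = queue.Queue()            # 1. the queue containing the tasks

def worker(worker_id):                # 2. a worker executing tasks
    while True:
        task = task_queue.get()
        if task is None:              # sentinel: no more work for this worker
            task_queue.task_done()
            break
        print(f"worker {worker_id} running {task}")
        task_queue.task_done()

def master(tasks, n_workers=2):       # 3. a master managing the queue and dispatch
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:                # enqueue the workload
        task_queue.put(task)
    for _ in threads:                 # one sentinel per worker for a clean shutdown
        task_queue.put(None)
    task_queue.join()                 # wait until every task has been processed
    for t in threads:
        t.join()

# Hypothetical task names, for illustration only.
master(["align_genome", "compute_pca", "train_model"])
```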

Schedulers are often implemented in a way that puts the emphasis on keeping all computer resources busy (e.g. load balancing). That priority may give way to sharing system resources effectively among multiple users or to achieving a target quality of service defined by the operator of the system. The target quality of service can be defined according to different objectives: maximizing throughput (the total amount of work completed per time unit); minimizing wait time (time from work becoming enabled until the first point where it begins execution on resources); minimizing latency or response time (time from work started until it is finished in the case of batch activity [Fei15, LL73], or until the system responds and sends back the first output to the user in the case of interactive activity [SGG05]); or maximizing fairness (equal CPU time for each process, or a defined minimal amount of compute time according to the priority or workload of each process). In practice, these goals often conflict (e.g. throughput versus latency) and the scheduler will implement a suitable compromise in accordance with the guidelines provided to it. Depending upon the user’s needs and objectives, the preference can be measured by any one of the concerns mentioned above. In real-time environments, such as embedded systems, the scheduler also must ensure that processes meet their deadlines in order to guarantee the stability of the system.

However, as data centres and applications grow more heterogeneous and complex, allocating the proper resources to various applications increasingly depends on understanding the trade-offs between the different allocations. Complex scheduling strategies have to be put in place to enable mixed workloads to benefit from different types of resources (GPUs, TPUs, SSDs), different generations of hardware, different types of costing to users (shared vs sole allocation), different energy constraints or cross data centre computations (hybrid clouds). Recently, a new class of Big Data workloads consisting of many short computations taking seconds or minutes that process enormous quantities of data has emerged. Therefore, the efficiency of the job scheduler, for both supercomputers and Big Data systems, represents a fundamental limit on the efficiency of the system.

As detailed by Reuther et al. [RBA+18], a great deal of work has been done in the last 40 years to address those issues and two main families of schedulers emerged with the HPC family on one side and the Big Data family on the other. The HPC family can be further broken down into two sub-families: traditional and new HPC schedulers. The traditional HPC schedulers include PBS [Hen95], HTCondor [LLM88], OpenLava [JB16] and LSF [ZZWD93] among others. The new HPC schedulers include Cray ALPS [NP06] and Simple Linux Utility for Resource Management (Slurm) [YJG03]. The Big Data schedulers include Google Borg [BGO+16] and Omega [SKAEMW13] which gave way to the open-source scheduler Kubernetes [BGO+16], Apache Hadoop MapReduce [DQR12], Apache YARN [XNR14] and Apache Mesos [HKP+11]. Both families share common features but are optimised for the specific class of problems we have presented.

2.2.3 Storage scalability

Since the introduction of the first hard drive more than 60 years ago, storage technologies have made huge strides thanks to multiple groundbreaking technological shifts. Those shifts are responsible for the steady progression in terms of capacity, speed and implementation flexibility. The stability of the interfaces throughout those shifts has allowed continual advances in both storage devices and applications, without frequent changes to the standards [MMIGRGE03]. A key feature of the original progression was its vertical approach to scaling. Indeed, the first stage of evolution of storage technology was focused on the growth in storage capacity, speed, and bandwidth of storage on a single unit of computing at a time. In a second stage, the growth focused on pooling multiple storage nodes into unified logical storage entities, forming the first attempts at Storage Area Networks (SAN). But here again, logical storage took a very vertical approach to scalability, clustering pools of nodes into single file systems made available usually through the network in a Network Attached Storage (NAS) fashion.

The concept of Object storage, also known as object-based storage, only started to emerge in the early 2000s. Object storage manages data as objects, as opposed to traditional storage architectures like file systems which manage data as a file hierarchy, and block storage which manages data as blocks within sectors and tracks. Each object typically includes the data itself, some metadata, and a globally unique identifier. This conceptual change has transformed the evolution path from a vertical scale-up to a horizontal scale-out. Objects are storage containers with a file-like interface, effectively representing a convergence of the NAS and SAN architectures. Objects capture the benefits of both NAS (a high-level abstraction that enables cross-platform data sharing as well as policy-based security) and SAN (direct access and scalability of a switched fabric of devices) [MMIGRGE03].

Figure 2.1: Conceptual representation of replication with eventual consistency.

Scalability, for scale-out data storage, is defined as the maximum storage capacity which guarantees full data consistency. This implies that at any given moment there is only ever one valid version of stored data in the whole cluster, regardless of the number of redundant physical data copies. Clusters that provide “lazy” redundancy by updating copies in an asynchronous fashion are called “eventually consistent”, while clusters providing immediate redundancy by updating copies in a synchronous fashion are called “strongly consistent”. Figure 2.1 illustrates a scenario where new data is inserted into a cluster with eventual consistency. This type of scale-out design with eventual consistency is suitable when availability and responsiveness are more valuable than consistency, which is true for many websites and web caches (a small wait might be required to get the latest version).

Figure 2.2: Conceptual representation of replication with strong consistency.

Figure 2.2 illustrates a scenario where new data is inserted into a cluster with strong consistency. This type of scale-out design with strong consistency is suitable for all classical transaction-oriented applications. It implies that data viewed immediately after an update will be consistent for all observers of the entity [Gooa]. This characteristic has been a fundamental assumption for many developers working with databases, as it is part of the ACID (Atomicity, Consistency, Isolation, Durability) properties of database transactions. However, developers must compromise on the scalability and performance of their application to obtain those strong consistency guarantees. In other words, data has to be locked during the replication or update processes to ensure that no other processes are updating the same data.

In order to balance strong and eventual consistency, developers and researchers have brought forward several techniques and explored different trade-offs. For example, non-relational databases let developers choose an optimal balance between strong consistency and eventual consistency for each part of the application. The indexes could be subject to strong consistency while the referenced objects remain eventually consistent. This approach gives more time for the data entities to be replicated across the nodes of the cluster while keeping application performance high, effectively combining the benefits of both worlds. It should be noted, however, that a query against the indexes cannot exclude the possibility of an index not yet being consistent with the associated entities at the time of the query, which may result in an entity not being retrieved at all.
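
As a purely conceptual sketch (not modelled on any particular database), the difference between the two guarantees can be reduced to when a write acknowledges: a strongly consistent write returns only once every replica has been updated, while an eventually consistent write returns after the primary copy and replicates in the background. The in-memory "replicas" and the simulated lag below are illustrative assumptions.

```python
import threading
import time

replicas = [{}, {}, {}]   # three in-memory copies of the same key-value store

def strong_write(key, value):
    """Synchronous replication: the call returns only once every copy agrees."""
    for replica in replicas:
        replica[key] = value          # in a real system: wait for each acknowledgement

def eventual_write(key, value):
    """Asynchronous replication: acknowledge after the primary, replicate later."""
    replicas[0][key] = value          # primary updated immediately
    def propagate():
        time.sleep(0.1)               # simulated replication lag
        for replica in replicas[1:]:
            replica[key] = value
    threading.Thread(target=propagate, daemon=True).start()

eventual_write("patient:42", "new record")
print([r.get("patient:42") for r in replicas])   # a read may still see stale copies
time.sleep(0.2)
print([r.get("patient:42") for r in replicas])   # eventually all copies converge
```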

2.2.4 Computational scalability

CPU traditionally refers to a processor, more precisely to its processing unit (registers and combinational logic) and control unit, contained on a single chip. As Figure 2.3 shows, computer performance was tied to the clock frequency of the single-core processor until the early 2000s thanks to the doubling of the number of transistors in each chip. From 2005, the doubling of cores in processors every 18 months enabled manufacturers to continue to follow Moore’s Law to some extent for a short while. From 2010, the exponential increases in transistor resources have been wasted as they translate only into a linear gain in performance. On the other hand, the amount of data being produced and requiring processing is increasing at an exponential rate [YHG+16]. It is in this context that new processing architectures better suited to Big Data workloads started to emerge; architectures that offer a scalable, massively parallel, sea-of-cores approach.

The workloads that existing processors were designed and optimised over decades to handle are vastly different from Big Data and Machine Learning workloads.

Figure 2.3: 42 Years of microprocessor trend [Kar18]

Traditional software, such as operating systems or heavy-duty backends in banks, represents millions to hundreds of millions of lines of code, while Big Data/Machine Learning code size is closer to thousands of lines of code. This difference can be explained by a shift in paradigm from a usually linear execution (for traditional software on high-power single cores) to individually small computations replicated massively across many servers and executed in parallel. The MapReduce paradigm [DG08a] emerged as a popular solution for processing Big Data sets in a scalable fashion with a parallel, distributed algorithm on large many-core clusters. The Hadoop implementation, followed a few years later by Spark’s, has enabled researchers to run a massive number of small tasks efficiently across a vast number of cores thanks to clever scheduling, data locality (using HDFS) and the absence of concurrency to be handled by the user. Those newly found capabilities have enabled new discoveries in the medical field, both in healthcare and translational research [AA16, CCF+16, Tay10, MHBe10, QEG+10]. Yet, the MapReduce paradigm is a restricted programming framework. MapReduce programs must be written as acyclic data-flow programs, e.g., a stateless mapper followed by a stateless reducer. This paradigm makes repeated querying of datasets difficult and imposes limitations on iterative algorithms that revisit a single working set multiple times (which is the norm in deep learning and frequent in machine learning).
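
As a minimal sketch of this acyclic data-flow model, assuming a local Spark installation and an illustrative input path, the snippet below expresses a stateless map phase followed by a stateless, key-wise reduce phase using Spark's RDD API (a classic word-count shape).

```python
from pyspark.sql import SparkSession

# The application name and HDFS path below are placeholders for illustration.
spark = SparkSession.builder.appName("mapreduce-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/expression_counts.tsv")   # hypothetical input file

counts = (
    lines.flatMap(lambda line: line.split())     # map: emit one token per word
         .map(lambda token: (token, 1))          # map: key-value pairs
         .reduceByKey(lambda a, b: a + b)        # reduce: sum per key, in parallel
)

print(counts.take(10))
spark.stop()
```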

2.3 Architectures to support machine intelligence

Machine intelligence, more commonly called Artificial Intelligence (AI), is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. In computer science, AI research is defined as “the study and design of intelligent agents”, where an intelligent agent is any autonomous entity that perceives its environment and takes actions which maximise its chances of successfully achieving its goals [RN12, PMG98]. The learning of those agents (i.e. machine learning) is a fundamental concept of AI research. While definitions of machine learning (ML) are many and have changed considerably over time, a traditional established view was provided by Mitchell [Mit97]: “Machine Learning is the study of computer algorithms that improve automatically through experience.” Some of the traditional problems or goals of artificial intelligence research include natural language processing, reasoning, knowledge, learning, planning, perception, and the ability to manipulate and move objects [Lug09, PMG98]. Many approaches have been developed in the past few decades, including statistical methods, computational intelligence, and traditional symbolic AI. Artificial intelligence also draws heavily on other scientific fields like social sciences, philosophy, computer science, mathematics, linguistics, and neuroscience. In the twenty-first century, various artificial intelligence techniques have matured and found a new popularity thanks to the recent surge in computing power, large amounts of data, theoretical understanding and advancements in the field of Big Data. AI techniques and methods have become an everyday tool in the arsenal of researchers, helping them to overcome many challenges and solve various complex problems. The most common tools used in AI are mathematical optimization, artificial neural networks, and methods based on statistics, probability and machine learning.

2.3.1 Machine Learning

In the simplest setting, and following Mitchell’s definition, machine learning algorithms build a mathematical model of training data. The training data varies depending on which type of machine learning algorithm we plan on using. Machine learning is usually divided into three main types:

1. Supervised and semi-supervised learning

2. Unsupervised learning

3. Reinforcement learning

Supervised learning

Supervised learning is the machine learning task of learning a mapping from inputs x to outputs y, given a labelled set of input–output pairs D = {(x_i, y_i)}, i = 1, …, N, where D is the training set, N the number of training examples, and x_i/y_i the i-th input/output of the training set. In the simplest setting, we aim at learning a function g : X → Y, where X is the input space and Y is the output space.

Some of the most common algorithms used in supervised learning include: Support Vector Machines (SVM), regression (linear, logistic, etc.) and naive Bayes. Classification algorithms are used when the outputs are restricted to a given set of values. Regression algorithms attempt to model continuous outputs, i.e. outputs that may take any value within a range.
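
As a minimal illustration of this setting, the sketch below fits a logistic regression on a synthetic training set with scikit-learn; the features, labels and train/test split are purely illustrative stand-ins for real clinical inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic training set D = {(x_i, y_i)}: 200 examples, 5 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # inputs  x_i in X
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)       # outputs y_i in Y (binary labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # learn g : X -> Y from D
print("held-out accuracy:", model.score(X_test, y_test))
```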

Unsupervised learning

In the context of unsupervised learning, only the data themselves are given, without any associated labels. The goal is to discover underlying structures (like groupings or clusters of data points) in a dataset that has not been labelled, classified or categorised. This method is one of the preferred approaches for knowledge discovery. A central application of unsupervised learning is in the field of density estimation in statistics, e.g., building models of the form p(xi, θ).

In contrast to supervised learning, there is no single generic objective in unsupervised learning, as it varies depending on the task and the nature of the dataset. It is also more widely applicable than supervised learning since it does not require a human expert to manually label the data. Some of the most common algorithms used in unsupervised learning include: clustering (hierarchical clustering, k-means, etc.), discovering latent factors (Principal Component Analysis, the Expectation–Maximization algorithm, etc.), anomaly detection and neural networks.
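
A minimal clustering sketch with scikit-learn follows, using synthetic unlabelled data drawn from two latent groups; the data and the choice of k-means with k = 2 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled synthetic data: 300 samples drawn from two latent groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(150, 10)),
               rng.normal(3, 1, size=(150, 10))])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # group the samples
print("samples per discovered cluster:", np.bincount(labels))
```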

Reinforcement learning

In the context of reinforcement learning, we are concerned with how software agents learn to act and behave in an environment so as to maximise some notion of cumulative reward (which may include a punishment concept as well). The problems at the centre of reinforcement learning are related to the theory of optimal control, which is mostly concerned with the existence and characterization of optimal solutions and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment.

2.3.2 Deep Learning

Deep learning, also called deep structured learning or hierarchical learning, is a branch of machine learning that utilises multiple layers of linear and non-linear functions to transform inputs into representations that are useful for subsequent tasks such as classification and regression. As in the rest of the machine learning field, learning settings can be supervised, semi-supervised or unsupervised.

An artificial neural network is a network of simple entities called artificial neurons. Each neuron receives input signals, which change its internal state (activation) according to that input, and produces an output depending on the input and activation. These neurons are commonly organised in layers, as this allows us to efficiently calculate the activations of all neurons in each layer with a simple matrix multiplication. A neural network can be viewed as a directed acyclic graph describing how a sequence of layers is combined to form the network. Figure 2.4 illustrates an artificial neural network with a set of fully connected layers, where each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another. Artificial neurons were originally inspired by the working of a biophysical neuron with inputs and outputs, but do not aim at faithfully representing a biological neuron model.

Figure 2.4: An example of an artificial neural network with the mathematical model [Glo] of an artificial neuron [Sta]. An input signal x0 travels along the axons, which then interacts with dendrites of the other neuron and becomes w0x0 based on the synaptic strength. The synapses w0 between the axon and the dendrite control the strength of influence of one neuron to another. The dendrites carry the weighted input signals to the cell body. The cell body accumulates the results, sums them, and then applies an element-wise function f to fire an output signal via an output axon.
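
The neuron model described above amounts to very little code; the sketch below implements a single artificial neuron and a fully connected layer with NumPy, assuming a sigmoid as the activation f and arbitrary example weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """One artificial neuron: weight the inputs, sum them, apply f."""
    return sigmoid(np.dot(w, x) + b)

def dense_layer(x, W, b):
    """A fully connected layer is a matrix multiplication followed by f."""
    return sigmoid(W @ x + b)

x = np.array([0.5, -1.2, 3.0])          # input signals x_0, x_1, x_2
w = np.array([0.8, 0.1, -0.4])          # synaptic strengths w_0, w_1, w_2
print(neuron(x, w, b=0.1))

W = np.random.default_rng(0).normal(size=(4, 3))   # 4 neurons, 3 inputs each
print(dense_layer(x, W, b=np.zeros(4)))
```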

Most modern deep learning models are based on artificial neural networks such as deep neural networks (Convolutional Neural Networks or Recurrent Neural Networks) and deep belief networks. Those networks are generally interpreted from the perspective of the universal approximation theorem [BH00, Hor91, Csá01] or probabilistic inference [Mur12, Ben09, DY14]. The universal approximation theorem states that feed-forward neural networks with a single hidden layer of finite size can approximate continuous functions [Csá01, Has95]. The proof was published by George Cybenko for the sigmoid activation function [Cyb89] and generalised to feed-forward multi-layer architectures by Kurt Hornik [Hor91]. The probabilistic interpretation [Mur12] originated from machine learning research. A distinctive attribute of the probabilistic interpretation is inference [Hin09, Sch15], as well as the optimization concepts of training and testing, related to fitting and generalization. The probabilistic interpretation considers the activation non-linearity as a cumulative distribution function, which led to the introduction of dropout as a regularizer in neural networks [HSK+12].

In the field of medical research, deep learning has been successfully applied to many research problems and has become the methodology of choice in some instances. Among the most notable ones, research has explored the use of deep learning to predict bio-molecular targets [DJS14, ZTHZ17], gene ontology annotations and gene-function relationships [CSB14], sleep quality [SDWG17, SJFL+16], predictions of health complications from electronic health record data [SNK+17], and toxic effects of drugs and drug discovery [LBH15, GHS16, CEW+18]. New application domains such as protein folding are also being actively investigated, with the first results [WCZQ18, Dee18] showing promise.

2.3.3 Hardware acceleration for AI research

Hardware accelerators belong to the class of Application-Specific Integrated Circuits (ASIC). Their intended use is to execute some functions more efficiently than on a general-purpose CPU. The implementation of computing tasks directly in silicon, which allows for decreased latency and increased capacity and throughput, is known as hardware acceleration. Many hardware technologies can be used to accelerate machine learning algorithms, such as Graphics Processing Units (GPU), Tensor Processing Units (TPU) and Field-Programmable Gate Arrays (FPGA). Any operation on the data can be computed purely in software running on a generic CPU, in custom-made hardware, or in some mix of both. Hardware accelerators are crucial for artificial intelligence applications, especially computer vision, machine learning, and artificial neural networks. The potential applications for artificial intelligence accelerators include autonomous vehicles [NVI], robots [Mov], natural language processing [Qua] and health care, among many others.

Hardware acceleration is helpful for performance and power efficiency when the functions are fixed, so that updates are not needed as often as in software solutions. The emergence of reprogrammable logic devices, such as FPGAs, has diminished the restriction of hardware acceleration to fixed algorithms. Reprogrammable logic devices have allowed hardware acceleration to be applied to problem domains requiring modifications to algorithms and processing flows. Reprogrammability has also made it possible to improve the logic on existing devices remotely without changing the device altogether. The number of IoT devices is growing rapidly owing to the demands of practical applications (such as CCTV or medical monitoring), which makes constructing high-performance implementations of machine learning algorithms directly on the devices a significant challenge. In order to benefit from increased locality of data to execution context, thereby reducing computing latency between modules and functional units as well as network bandwidth consumption, register-transfer level (RTL) abstractions of hardware designs have been developed. Those abstractions have allowed emerging architectures such as in-memory computing, transport triggered architectures, and networks-on-chip to fully realise that potential.

Data centres have also benefited from those innovations as architectures evolved towards heterogeneous multi-cores composed of a mix of cores and accelerators; a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Unlike accelerators on IoT devices, accelerators in data centres can tackle large-scale problems as space and energy are no longer a constraint. Those new architectures aim at tackling classes of problems that used to be computationally prohibitive or impossible, such as ray tracing (RT cores), cryptography (Hardware Random Number Generators), multilinear algebra (TPU) and computer vision (Vision Processing Units). The synergy between the large data sets in the cloud and the numerous computers that power it has enabled a renaissance in machine learning [JBBe17]. In recent years, deep neural networks (DNNs) have led to breakthroughs such as reducing word error rates in speech recognition by 30% [Dea16], cutting the error rate in an image recognition competition from 26% to 3.5% [HZRS16, SLJ+15, KH12], and beating the best human players at Go [SHMe16]. Those breakthroughs were made possible by the broader development and adoption of hardware accelerators (TPUs in this instance) in AI workflows for training deep learning models. TPUs are very fast at performing dense vector and matrix computations, with gains ranging from 15x to 30x over contemporary GPUs and CPUs [Goob]. The energy efficiency achieved by TPUs is also much better than that of conventional chips, with improvements between 30x and 80x in TOPS/Watt.

However, the introduction of custom hardware into AI workflows complicates the design of analytical environments that efficiently leverage them.

2.4 Compliance and security in distributed systems

2.4.1 GDPR and privacy of patient data

Medical data is deeply personal, as it can reveal more about us than any other piece of information. Yet, in order to provide treatments to patients, it is vital that health professionals get access to this information. This conundrum of balancing the protection of personal information with using the data for the advancement of medical research has been understood by society and lawmakers alike. It is in this context that legislation protecting medical data has been passed and implemented in a growing number of countries.

The growing use of electronic health records (EHR), concomitant with ever more data being collected in general by public health systems in countries like France (Sécurité Sociale), the UK (NHS) and the US, has opened new possibilities for medical researchers. By mining these masses of data, researchers and doctors can improve diagnostic techniques as well as treatments by moving from symptom-based to evidence-based medicine grounded in data mining. The growing interest from medical research in making data available is met with new ethical guidelines as well as legislation [HA17]. Awareness of the legal and ethical context when working with patient data is of key importance in order to protect private patient information. The legal and ethical standards that researchers must follow typically depend on the institution or hospital where the data is collected and stored. This situation introduces a great variance between institutions and countries. Naturally, ethics boards might take very different initiatives depending on whether their objectives emphasise research or the security of the data. However, those guidelines are not a superset of the legal requirements: the legality of the initiative always comes before compliance with ethics.

The General Data Protection Regulation (GDPR) [TT16] of the European Union (EU) came into law on the 25th of May 2018. GDPR is now at the heart of EU privacy and data protection for all individuals within the European Union and the European Economic Area. The regulation aims to define the export of personal data outside the EU, to give individuals control over their personal data, and to simplify the regulatory environment by homogenizing the regulation across all EU countries. A central notion of the GDPR is the concept of informed consent of the patient, which specifies that no personal data may be used unless the researchers have received an unambiguous affirmation of consent from individual data subjects. Providentially, the final version of the regulation includes a special set of rules for research in recognition of the benefits that research offers to society. Those rules stipulate that data minimisation is a requirement, meaning data has to be de-identified to the extent that research objectives can be achieved [BBM17]. This rule gives some leeway for research as it does not enforce mandatory anonymisation of the data. Another rule, to facilitate the re-use of data for research, stipulates that the collected data can be freely shared between member states even if the data was collected for another purpose. The free movement of personal data (without the need for explicit consent) opens the way for masses of patient data collected in hospitals to be used for medical research.

The approach advocated for the de-identification of the data is to use the concept of pseudonymization as described in Article 4(3b) [Vol18]: “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual”. The International Standardization Organization (ISO) proposed a set of guidelines for re-identification risk assessment and implementation aspects of de-identification through ISO 25237:2017 [Int]. In practical terms, pseudonymization is the transposition of identifiers (like names and social security numbers) into a new domain space of equal or larger size. One common way to carry out those transpositions is with the use of hash functions (SHA-3, AES-128) and a long cryptographic salt per data set. An interesting aspect of GDPR is that it represents a minimum standard from which countries can deviate, leaving member states to work out the details for themselves. On one side, this facilitated its adoption by the member states as it adapts more easily to existing local rules and can be made more relevant to their own society and culture. On the other side, this flexibility has led member states to implement their own, sometimes contradictory (such as in France’s case [Tam18, 20117]), systems of safeguards and exemptions for subject data rights in the context of research. Those discrepancies have made the task of creating a cross-country data processing infrastructure for medical data considerably more arduous.
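
As a minimal, illustrative sketch of this kind of pseudonymization (not the eAE's actual implementation), the snippet below replaces a direct identifier with a salted SHA-3 digest; the identifier format and salt handling are assumptions made for the example.

```python
import hashlib
import secrets

# Per-dataset secret salt; under GDPR pseudonymisation it must be stored
# separately from the data and protected by organisational measures.
DATASET_SALT = secrets.token_bytes(32)

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier (name, social security number, ...)
    with a salted SHA-3 digest so records can still be linked consistently."""
    digest = hashlib.sha3_256(DATASET_SALT + identifier.encode("utf-8"))
    return digest.hexdigest()

# Fictitious identifier used purely for illustration.
print(pseudonymise("1985-07-12-DUPONT-JEAN"))
print(pseudonymise("1985-07-12-DUPONT-JEAN"))   # same input -> same pseudonym
```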

In the US, something similar has been brought forward with the Health Insurance Portability and Accountability Act Title II (HIPAA) [Rot13]. HIPAA, unlike GDPR, gives very well-defined requirements for national standards for electronic health care transactions and national identifiers for health insurers, healthcare providers and employers. HIPAA details specifically which identifiers need to be removed from a dataset for it to be considered de-identified and for the medical researcher to remain within the law. Namely, those 18 identifiers are: names, geographic subdivisions smaller than states, all elements of dates (except year) for dates directly related to an individual, telephone numbers, fax numbers, e-mail addresses, social security numbers, medical record numbers, health plan numbers, account numbers, serial numbers, driving license/license plate numbers, Internet protocol addresses, web Universal Resource Locators (URLs), full face photographs, biometric identifiers (finger and voice prints) and a limited number of similar unique identifiers of a subject [UC ]. The benefit of clearly defined rules for a legal, safe use of the data for research purposes facilitates the life of researchers. Besides, HIPAA’s provisions do not unduly limit research on the data but merely frame its use. Nevertheless, removing fields like the exact geographic location, or the day and month from dates specifically relating to people, may hinder research somewhat. One substantial drawback of HIPAA compared with GDPR is its lack of flexibility. Because the rules are so specifically tailored, any new identifier (such as Facebook or LinkedIn profiles) is automatically left out and a gap in data protection appears, requiring a new set of laws to plug the hole. With an average of 263.57 days for a bill to pass into law in the US [Car15], this leaves people without any privacy protection for a substantial amount of time.

2.4.2 Security of the data

Data security is defined as all the protection measures taken to prevent unauthorised access to computers, databases, storage devices and websites. Data security also protects data from corruption (bit rot, malicious alterations).

Bit rot, also known as data degradation, is the gradual corruption of computer data due to an accumulation of non-critical failures in a data storage device. The data degradation is the consequence of the gradual decay of storage media over the course of long periods of time. The causes may vary by medium:

• Solid-state media may experience data decay due to imperfect insulation causing the electrical charges, which are used to store the data, to leak away. Manufacturers mitigate bit rot in solid-state media through the extensive use of error-correcting codes (ECC).

• Magnetic media may experience data decay as bits lose their magnetic orientation. Periodic refreshing by rewriting the data can alleviate this problem. Magnetic tapes are currently the reference solution for long-term (>20 years) storage, in banks for example.

• Optical media may experience data decay from the chemical breakdown of the storage medium. Storing the optical media in dark, cool, low-humidity places can slow the decay.

Most disk, disk controller and higher-level systems are exposed to chances of unrecoverable failure. As the capacity of disks and the size of files grow, the likelihood of data decay and the risk of other forms of uncorrected and undetected data corruption increase. Redundant segregated copies, higher-level software systems (implementing integrity checking like checksums and self-repairing algorithms) or replicated data storage (such as object storage) may be employed to mitigate those risks.
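
As an illustration of checksum-based integrity checking, the following is a minimal sketch that assumes a checksum is recorded when a file is written and re-verified later; the file path and stored checksum in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path

def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, recorded_checksum: str) -> bool:
    """Detect silent corruption by comparing against a checksum recorded at write time."""
    return file_checksum(path) == recorded_checksum

# Hypothetical usage; the path and the stored checksum are placeholders.
# ok = verify(Path("/data/cohort_2018.vcf.gz"), recorded_checksum="ab12...")
```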

Data privacy can contribute greatly to data security, as described in Section 2.4.1.

Data encryption is the transformation of data into another form, or code, in such a way that only people with access to a secret decryption key or password can read it. Encryption does not prevent interference but denies the attacker immediate access to intelligible content. In an encryption scheme, the plaintext is encrypted using a cipher (such as AES), generating ciphertext that can be read only if decrypted. Depending on the constraints and requirements, the encryption can happen at rest for storage (through hardware or software encryption) or on the fly when transmitting to a recipient (similar to HTTP over TLS for web browsing).
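
As a minimal sketch of encryption at rest, assuming the third-party Python cryptography package is available, the snippet below encrypts a record with Fernet (an authenticated symmetric scheme built on AES) before it would be written to storage; the record content is illustrative.

```python
from cryptography.fernet import Fernet   # third-party package: cryptography

# Encryption at rest in its simplest form: an authenticated symmetric cipher
# protecting a record before it lands on disk.
key = Fernet.generate_key()              # the key must itself be stored securely
cipher = Fernet(key)

plaintext = b"patient_id=XYZ;diagnosis=..."         # illustrative record
ciphertext = cipher.encrypt(plaintext)              # what is actually written to storage

assert cipher.decrypt(ciphertext) == plaintext      # only key holders can read it back
```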

Network segmentation is the separation of a computer network into subnetworks, each being a network segment. The segmentation enables the segregation of different parts of the environment and the creation of data silos to act as an additional layer of protection for the data. One usual way to establish network segments is by setting up different virtual local area networks (VLANs). Those VLANs may overlap or be subordinated to one another (e.g. one VLAN may be able to access another VLAN but the inverse might not be true). This technique has been extensively used as a privacy mechanism for the use of location data in public health research.

2.4.3 Privacy of companies

The privacy of citizens has been highlighted extensively in the last two years with the introduction of GDPR and a large number of data breaches from companies such as AOL, Yahoo or Equifax. However, another, equally important, type of privacy that has been largely left out of the public eye is the privacy of companies. To the best of our knowledge, and after extensive research on the topic, there is no official or legal definition of the privacy of companies. In the context of this research, we define the Privacy of Companies as the state in which one company is not observed or disturbed by another company or state. For example, industrial espionage, understood as the covert and sometimes illegal practice of investigating competitors to gain a business advantage, falls under that umbrella. Yet, the spectrum covered by the privacy of companies is much broader and more nuanced than simple industrial espionage.

The publication by a company of apparently innocuous material could turn out to be either extremely damaging or have unintended consequences. One obvious example is the publication of “anonymized” data by the Massachusetts Group Insurance Commission in the mid-1990s. The re-identification in 1997 of Massachusetts Governor William Weld’s medical data from the insurance dataset (which had been stripped of direct identifiers) showed the limitations of de-identification methods. A less obvious example, in the context of the Privacy of Companies, is the extraction of derived knowledge or metaknowledge without direct access to the data. The automated computation and publication of population densities by telecommunication companies using Call Detail Records (CDRs) illustrates that risk. In appearance, the publication of population densities for cities or regions does not represent a direct risk for the company, as it does not reveal any trade secret or information about the company. However, by observing the deltas between the official numbers and the ones computed by the company, it is possible to infer the approximate market share of the telecommunication company for a given city or region. Competitors could, in turn, use that information to focus their targeted marketing efforts only on specific regions of interest. Those actions would probably end up hurting the bottom line of the company publishing the data and thus breach the privacy of the company.

2.5 General-purpose analytical platforms for Life Science

Many federated and distributed systems have been developed to analyse medical data. This section explores research work aimed at addressing the main challenges of the efficient management and flexible analysis of large-scale medical data in a privacy-preserving fashion.

2.5.1 Introduction

The term general-purpose analytical platform usually refers to systems that allow analysts to send many queries of different types, using a rich and flexible query language. Depending on the interoperability requirements, those platforms have taken different approaches, such as APIs or clients (Python, Java, etc.), to support a flexible query language. A larger number of entry points facilitates integration with other applications, thus encouraging adoption and expanding the pool of potential adopters. An important aspect of processing medical data is to keep track of the data used, how data products were derived, and what input data and software (and versions thereof) were involved in producing them. Capturing, storing and enabling querying of the data provenance or lineage is crucial to enable open and reproducible research. Many examples in the past, such as The Reproducibility Project: Cancer Biology initiative1, have shown that data alone is not enough to reproduce experiments, thus weakening the results and credibility of the published research.

Scientists and researchers take advantage of various tools and frameworks to analyse data, design algorithms and build computational applications. However, all those applications have similar requirements with respect to data manipulation, processing, and algorithm development. The first manipulations in most analytical pipelines are data transformations: dimensionality reduction, instance selection, and data cleaning. Dimensionality reduction aims to map a high-dimensional space onto a lower-dimensional one without significant loss of information. A variety of means exists to reduce dimensions in the context of large-scale data, a popular one being Principal Component Analysis (PCA) [Pea01]. Instance selection refers to techniques for selecting a data subset that resembles and represents the whole dataset. While dimensionality reduction deals with wide datasets, data reduction, more specifically instance selection, aims to reduce a dataset’s height. Lastly, data cleaning is another type of data manipulation; it refers to pre-processing such as noise and outlier removal. Thus, it addresses the challenges of dirty and noisy data. Those steps are crucial to ensure the robustness, flexibility, and adequacy of the inputs for the subsequent analyses.
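
As an illustration of these first manipulations, the sketch below chains simple data cleaning (median imputation), feature scaling and PCA-based dimensionality reduction with scikit-learn; the synthetic matrix and the missing-value rate are assumptions standing in for a real high-dimensional dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# Synthetic "wide" dataset: 100 samples, 1,000 noisy features with missing values.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1000))
X[rng.random(X.shape) < 0.01] = np.nan              # simulate missing measurements

pipeline = Pipeline([
    ("clean", SimpleImputer(strategy="median")),    # data cleaning
    ("scale", StandardScaler()),                    # put features on a common scale
    ("reduce", PCA(n_components=10)),               # dimensionality reduction
])

X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)                              # (100, 10)
```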

1 https://cos.io/rpcb/

The second set of manipulations covers data analysis and storage, where the term storage refers not only to physical storage on a permanent medium, but also to how data are represented in memory. In these stages, processes can be embedded to capture data provenance and therefore address the provenance challenge independently of the category of manipulations. Those large and sophisticated workflows, with dozens of compute- and data-intensive tasks spanning a plethora of heterogeneous data, rely on very different stacks and technologies and demand faster and more capable infrastructure. Adding more hardware of a different nature (CPU, ASIC, GPU, etc.) is not always possible or sustainable due to complexities, costs, and the risk of data and cluster sprawl. Building a single shared infrastructure for both types of workloads is both pertinent and beneficial.

As detailed in Section 2.4.1, multiple privacy laws and the increased sensitisation of public opinion to the risks caused by the use of sensitive information mandate the inclusion of strong data security capabilities in the design of any platform. Secure computation, audit, and privacy by design are some of the core features needed to increase overall security and comply with the latest legislation governing the use of that data. In addition, data provenance and privacy are cornerstones for enabling open science and reproducibility of data analysis with versioned scripts and tools while securing the highest degree of ethical research. Total transparency for scientists increases the credibility of their work among peers.

Therefore, any bioinformatics framework must support production pipelines made up of both parallel and serial steps, varied software and data types, complex dependencies and security constraints, fixed and user-defined parameters and outputs.

2.5.2 Existing architectures

Among the most prominent ones (excluding the eAE), we can cite IBM Platform Conductor [IBMb], Arvados [Cur], Berkeley Open Infrastructure for Network Computing (BOINC) [And04] and Petuum [XYH+15]. Those platforms share some common features (scalability, scheduling, etc.) as well as specificities linked to the fact that they originally aimed at answering different targeted needs for their users. Some other projects, such as Diffix [FPEM17], aimed at solving part of the problem by providing database querying capabilities that anonymise query results by adding noise tailored to the query as well as the underlying dataset. However, this solution would be inadequate in our context as it would not scale to large datasets (>1TB), would require preloading the cleaned-up data (thus supporting only limited data transformations) and would provide only basic statistical capabilities with regard to analytics. Table 2.1 summarizes the differences in the main features provided by comparable existing systems (proprietary or not): eAE, IBM Platform Conductor, Arvados, BOINC and Petuum.

From Table 2.1, we can see that the different platforms offer very different levels of support for analytics. Those differences originate from differences in the type of problems that they try to address. Some platforms, such as IBM Platform Conductor, the eAE and Arvados, aim at being universal, at least to some degree, by relying on commonly used languages for analytics such as Python and R. By contrast, BOINC and Petuum have opted to aim for higher performance. The flip side is that they inherently limit the scope of supported analytics. In the context of fast-paced research and rapidly evolving needs, the latter choice proves inadequate to support the needs of a broad scope of users.

BOINC was designed from the beginning to run on volunteer computing (using consumer devices) or grid computing indifferently. This requirement drove the development towards a highly decentralised architecture and a high capacity for staging and resuming tasks in the event that a node goes offline. It is that idea that drove part of the development of the multi-master scheduler of the eAE to support a decentralised architecture with heterogeneous compute nodes possibly coming online and offline at any given time. IBM and Petuum took the standard production approach of attempting to restart the nodes and rescheduling the failed jobs on other nodes in the meantime.

The different projects have taken very different approaches with respect to storage capabilities. BOINC has no storage capabilities per se: it transfers any required data with each job request, and the results are sent back to be stored as files. However, some data locality awareness has been put in place, resulting in jobs being preferentially sent to clients where the input files are already available. This approach is not adequate in the context of life science research, where data files may be terabytes large and whole files are transferred to clients even if they only need a sub-part of them. Arvados has taken a similar approach with content-addressable storage, which suffers from similar issues but has the added benefit of facilitating workflows and support for data provenance. The other projects have put in place more advanced storage capabilities with the support of SQL/NoSQL databases and content-addressable storage (Swift for the eAE and Spectrum Scale technology for IBM). Those storage layers enable more fine-grained data payloads for jobs, thus lowering compute time, wasting less bandwidth and delivering better overall performance at the cost of increased architecture complexity and more complex loading procedures. One interesting feature unique to IBM Spectrum Scale is a storage interface allowing the data to be stored in any of the underlying storage systems while remaining accessible across all of them thanks to internal replication capabilities. Even though it requires some manual setup and configuration, it greatly facilitates the setup of data lakes.

Table 2.1: Feature comparisons between the eTRIKS Analytical Environment, IBM Platform Conductor, Arvados, BOINC and Petuum. The compared feature groups are: analysis support (Spark, Python, R, C/C++, Fortran, Java, Go/Ruby/Perl, fixed set of ML pipelines); computation types (CPU, GPU); storage capabilities (SQL, NoSQL, content-addressable storage); monitoring and scheduling capabilities (job status, cluster status, complex batch processing, multi-master scheduling, workflow capabilities); data security capabilities (data provenance, extensive platform audit, privacy, secure computation/sandboxing, support of GDPR compliance); interoperability (REST API, distributed clients); and platform support (installation procedures, configuration documentation, support available, open source project). Each feature is marked per platform as fully or partially supported.

A crucial aspect of processing data in general, and medical data in particular, is to keep precise track of how data products were derived and what input data and software (and versions thereof) were involved in producing them. Capturing, storing and enabling querying of the data provenance or lineage is thus crucial, and it is well supported in the context of workflow systems. This is particularly the case in Arvados and IBM Platform Conductor, while provenance is absent from BOINC and Petuum. The eAE was designed to support data provenance through the extensive amount of metadata stored for each job (parameters, versioning of data and tools, etc.), but the absence of full-fledged workflow capabilities limits the lineage between jobs. This design was deliberate, to limit the capacity of an attacker to easily infer a user’s activity in the event of a breach. Full-fledged provenance using the eAE is achieved through the use of Borderline (see Section 4.1.2). Nonetheless, it would be possible to implement it fully in the eAE should provenance have to be handled by the eAE itself rather than by an external tool, as was decided in the context of this research.

Privacy concerns, now enshrined in law, are strongly limiting the use and sharing of data, even when it has been shown to have great potential for providing new insights (e.g. predicting the spread of epidemic disease) and value (e.g. improving public services and inter-business processes). Because this awareness is only recent, none of those platforms (apart from the eAE) included any privacy capability by design, even if they all include some security measures acting as a layer of protection. In order to unlock the potential of data while preserving privacy, we adopted a query-based paradigm for data release. Instead of publishing (de-identified) data, the eAE stores the data in a protected environment and allows analysts to send queries about the data. Since analyses are computed using fine-grained data, the eAE makes it possible to achieve better utility and stronger privacy compared to de-identification techniques. The eAE manages risks using a combination of server-side security (differential privacy, secure computation, etc.), authentication, audit and network security. However, those protections are only the core layer upon which adopters can rely and which they can further extend to meet the most stringent requirements. This work will be further described in Chapter 5.

Finally, we must point out that, whilst IBM Platform Conductor is a very strong candidate as a general-purpose analytical platform for Life Science, it is not an open source project, unlike the others. The cost of the license to operate the platform makes it less desirable in a research environment, where it might hinder independent reproducibility and increase the cost of doing public research.

2.5.3 Conclusion

The increasing integration of patient-specific data into clinical practice and research raises serious privacy concerns. Patient data, such as 'omics, clinical data, location data, iris scans and other biometric data, are very sensitive as they can (re-)identify people uniquely. In most instances, those data cannot be fully anonymised, since anonymising to the point where an individual cannot be re-identified is equivalent to destroying the utility of the data altogether. It is therefore critical to provide researchers with an efficient and scalable platform for privacy preserving life science analytics, capable of processing the humongous amount of medical data.

Chapter 3

eTRIKS Analytical Environment: Design Principles and Core Concepts

In the previous chapter, we highlighted the need for a shift in architecture paradigm driven by large scale analysis in life science. In this chapter, we introduce the eTRIKS Analytical Environment framework for the efficient management and analysis of large scale medical data, in particular the massive amounts of data produced by high-throughput technologies.

3.1 Introduction and users’ needs

Translational research is the interdisciplinary branch of the biomedical field whose goal is to combine disciplines, resources, expertise and techniques to promote enhancements in prevention, diagnosis and therapies [CMe15]. The eTRIKS Analytical Environment aims at catering for a very broad range of users, from biologists with limited notions of computing to computational biologists, with very different needs in terms of computation requirements and habits. Medical doctors with no programming capabilities can only do analyses in an interactive fashion using a user interface such as tranSMART or Borderline. Running analytical pipelines in an interactive, intuitive and easy fashion against large cohorts could greatly facilitate the input of their expert knowledge into the research.


R is a fundamental requirement for statisticians and bioinformaticians, who have an extensive range of solutions implemented for them to conduct their research. Bioconductor, Limma, WGCNA and QDNAseq are some of the most popular R packages used in the field of 'omics research. Many life science visualization packages have also been developed only in R. Thus, supporting R and those packages is mandatory to meet the day-to-day needs of those researchers.

A substantial part of the research being carried out in the field of crop and plant science is the systematic sequencing of genomes. The realignment needed after sequencing requires a humongous amount of computing power. Indeed, it is not rare for those genomes to be many times larger than the human one (e.g. the genome of a rare Japanese flower named Paris japonica is 50 times the size of the human genome [PFL10]), and there are an estimated 382,000 species of vascular plants currently known to science according to researchers at the Royal Botanic Gardens [RBG17]. The massively parallelisable nature of the alignment tasks makes Spark the perfect candidate to support those systematic needs.

The automated collection of biosignals (through IoT, connected devices, etc.) has become an important source of indicators for passive health monitoring and medical diagnosis. Clinical researchers use those time-series to discover meaningful features of the human functional state. Research in this field is focusing more and more on deep learning techniques using GPUs to automatically learn features from raw biosignals. Public health researchers are also extensive users of time-series to model and monitor disease outbreaks and propagation. The processing of those time-series requires flexible storage and efficient retrieval capabilities to enable monitoring in real time. The compatibility of the architecture with time-series databases (Timescale, InfluxDB, etc.) and streaming platforms (Apache Kafka, etc.) coupled with stream processing computation engines (Apache Spark or Storm) is essential in that regard.

Patient stratification is the division of the patient group into subgroups, also referred to as 'strata', by investigating distances between a variety of components of patient data. Each stratum represents a particular section of the patient population. Patient stratification is a crucial element in precision medicine and one of the most important logistical and statistical challenges when carrying out clinical trials. Computational biologists use clustering algorithms and machine learning techniques to group similar patients based on their similarities in various features, including 'omic and clinical profiles, and to identify subgroups of complex patients. R and Python have been widely used to carry out these analyses.

Consequently, Python was quickly added to the list of languages that need to be supported because of its extensive support for machine learning libraries (TensorFlow, scikit-learn, etc.), its support by Spark, its time-series capabilities and its higher utility value as a full-fledged programming language. The sheer size and variety of the data leveraged in translational research added another level of complexity. It mandated that the flow of the data between the user and the compute nodes be transparent, seamless and highly efficient to meet the high level of users' expectations.

Another aspect that became clear was the interest in multitenancy and collaboration between users. The goal of those capabilities was to enable researchers to easily share their results with each other in a standard and seamless fashion as well as to enhance the reproducibility and checking of the results. Still nowadays, most workflows are built in a downstream fashion, with researchers building on top of each other's work one after the other. Multitenancy opens the way to concurrent collaboration between different types of scientists in a fashion that would otherwise be difficult. This is very much in line with the open science [MNB+17] and good practices philosophy of the eTRIKS project and promotes the highest standards in science.

Finally, the eTRIKS Analytical Environment aims not only at supporting current needs but also at supporting emerging ones, and even offers the possibility of supporting needs that do not exist yet. We will also present how we address those needs thanks to the flexibility, modularity and extensibility of the architecture.

3.2 Existing knowledge management platforms and their limitations

Platforms allowing the management and exploration of clinical and 'omics data have been developed to address issues of Big Data. A review [CRe15] of seven publicly available solutions (BRISK [TTD11], caTRIP [MDe08], cBioPortal [CGe12], G-DOC [MGe11], iCOD [SMe10], iDASH [OMBe12] and tranSMART [SKKP10]) yielded consistently the same finding: they usually support only a limited set of data types and analyses, as they were designed to address extremely specific needs. On top of that, BRISK, caTRIP and iCOD have been deprecated and G-DOC has been replaced by a closed source version called G-DOC Plus. i2b2 [MMe06] is another translational research platform; however, it supports only a very limited set of 'omics data types and relies on dated technologies (Java 8 or older) and patterns (e.g. SQL star schema). A lot of work has been done towards improving the storage scalability of these platforms. For example, a plugin has been developed for tranSMART using HBase for storing and managing large scale microarray data [WPe14]. The results showed that, in general, the key-value implementation using HBase outperformed the relational model on both MySQL Cluster and MongoDB. However, one aspect that this plugin did not cover is the flexibility and extensibility of an HBase solution compared to a MongoDB one. For example, the use of newer or richer microarray data formats would render the plugin unusable. Furthermore, these improvements close only a minority of the identified gaps, and many of the platforms still rely on outdated paradigms and technologies.

A major limitation of all these platforms is their limited built-in capacity to run complex analytics on the selected data. Both tranSMART and cBioPortal provide built-in sets of analytics through R or Matlab. These analytics, which were once fit for purpose, are now showing their limitations both in terms of scalability and utility. The first version of tranSMART was implemented more than nine years ago. At that time, hardware resources were more limited than they are now, and technologies have evolved a lot since then. tranSMART's analytics were written in R and designed to process small to medium size data on a single node with a single core. Some improvements have been implemented since then, but most of the original design has been kept due to the absence of any serious alternative or requests from the users. This is the primary motivation driving us to develop a new framework for analytics, supporting rich and scalable analytical workflows by using tranSMART as the data warehouse.

Originally developed by a subsidiary of Johnson & Johnson, tranSMART is an open-source data and knowledge management platform that enables researchers to develop and refine research hypotheses by investigating correlations between different types of data and assessing their analytical results in the context of published literature and other work. It accommodates phenotypic data, such as demographics, clinical observations, clinical trial outcomes and adverse events, and high content biomarker data, such as gene expression, genotyping, metabolomics and proteomics data. The latest version of the tranSMART application (17.2) has been written using Grails 3.2.3 and follows the Model–View–Controller (MVC) pattern. A number of plugins have been implemented either to enrich the application with new functionalities or to interface with other software or components such as HiDome. tranSMART is originally provided with an “Advanced Workflows” module. The workflows (19 in version 17.2 of tranSMART), which are all written in R, were once fit for purpose but have become very limited in recent years. Attempts at replacing the default R with RevolutionR brought marginal improvements in performance, but those attempts were only temporary solutions and did not satisfy the needs of the projects. The use of Bioconductor, and various attempts to write libraries in different languages (CUDA, C, etc.) and then add a wrapper for R, have usually had limited impact because of poorly optimised code and/or designs, and frequently do not generalise at all. The size of the data to be processed is increasing, and the analyses, which rely more and more on machine learning techniques rather than basic statistical methods, are becoming more complex. In addition, the implementation of the “Advanced Workflows” module was far from optimal. It requires several reading and writing operations to disk for every single workflow run, all the visualizations were static images and there was no cache mechanism to store the computed results. Moreover, if the results were not saved or if the connection to the ongoing computation was lost (the session times out, browser refresh, internet issues, etc.), the whole computation had to be restarted while the former computation kept running on the server, needlessly consuming resources.

The development of the Galaxy [GNT+10] plugin for tranSMART was a first attempt at integrating an external framework for analytics to enhance the analytical capabilities already in tranSMART. Galaxy is open-source software written in Python that can be downloaded, installed and customised to address specific needs. Galaxy is part of the GMOD project [ODC+08], and the whole Galaxy framework has been developed to address a broad scope of needs. The project has a well-defined set of APIs, a community wiki, many high-quality extensions, an active community of developers and a rich set of tutorials. Despite the high level of sophistication, the platform was designed to be accessible to everyone, especially to biologists with no computer science background. That concept made it very popular among researchers. All those characteristics made Galaxy a very attractive solution to address the need for complex workflows in a flexible fashion. The plugin in tranSMART enabled the seamless transfer of data from tranSMART to Galaxy using the powerful cohort selection of tranSMART. The user could then trigger the associated workflows to process the data and visualise the results in Galaxy. The first limitation of that approach was the impossibility of pushing those results back to tranSMART, which forced the user to navigate between the two applications. The second and most important limitation was that, even though Galaxy is a very capable and flexible tool, it was never designed to scale to terabyte level analytics. Some workarounds were developed [Gal17] to work in traditional HPC environments, but only a few of those integrations were made public and, for the others, the integration was tightly bound to the specialised hardware used, which would bring no benefits to people who could not afford such expensive hardware.

The framework we propose, i.e. the eTRIKS Analytical Environment (eAE), on the other hand supports a wide scope of medical analysis tasks to be performed in a high-performance and scalable fashion. These tasks range from high-throughput sequencing data analysis, Genome-Wide Association Studies (GWAS), expression analysis of genomics, proteomics and metabolomics data and time series analysis to medical imaging analysis. This framework is designed to scale, building on a Big Data architecture with the capacity for terabyte (TB) level analysis [Dat14].

3.3 eTRIKS Analytical Environment

3.3.1 Introduction

Personalised medicine is quickly becoming a data-driven science. Improving patient care and developing personalised therapies and new drugs depend increasingly on an organization's ability to rapidly and intelligently leverage complex molecular and clinical data from a variety of internal, partner and public sources. As analyzing these large scale and complex datasets becomes increasingly computationally expensive, traditional analytical engines are struggling to provide a timely answer to the new sets of questions that medical scientists are asking.

Designing such a framework means developing for a moving target, as the very nature of data science requires an environment capable of adapting and evolving at the same pace as medicine does. The resulting framework must consequently be a scalable, on-demand, flexible and efficient solution resilient to failure. In response to this trend of analyzing increasing amounts of medical data in a fast changing environment, we developed the eTRIKS Analytical Environment (eAE), a scalable and modular framework for the efficient management and analysis of large scale medical data, in particular the massive amounts of data produced by high throughput technologies.

We designed the eTRIKS Analytical Environment as a modular and efficient framework allowing us to add new components (public or private) or quickly replace ageing ones with more efficient or proprietary ones. The eAE relies on mature open source technologies with strong supporting communities, such as tranSMART, OpenStack, Jupyter [KRKP+16] and Apache Spark, to provide efficiency and scalability.

The development of the eTRIKS Analytical Environment has the goal of enabling the scalable exploration of multi-modal medical data using a flexible and modular architecture. In the following, we discuss its architecture, the components we used and, finally, how the eTRIKS Analytical Environment fits into the eTRIKS environment.

3.3.2 General Environment

We designed the eAE with four layers: the Endpoints Layer, Storage Layer, Management Layer and Computation Layer. Those layers aim to provide users with an analytics environment which (a) has frontend/endpoints that are user-friendly, extensible and easily integrated into tools, (b) is modular and (c) is scalable. We accomplished this by designing a loosely coupled multi-layer architecture of components to provide as much flexibility as needed. The modularity of this framework enables adding new components (public or private) or replacing ageing ones with better performing or proprietary ones.

The operating system used on the physical and virtual machines as well as on the containers within this architecture is Ubuntu 16.04 LTS. This version provides the required stability throughout the life of the machines and the necessary support for a large spectrum of libraries and drivers. Other Linux distributions such as CentOS or Debian can also be used. With the release of version 18.04 LTS, and once the stability of this version is ensured, it is possible to upgrade to that version.

Figure 3.1: A schematic representation of the architecture of the eTRIKS Analytical Environment.

Figure 3.1 illustrates the architecture of the eTRIKS Analytical Environment. Each service is deployed in a Docker container and the services communicate with each other asynchronously through REST APIs. This architecture supports the possibility to deploy services multiple times across different host machines for scalability and resilience purposes. The platform is hosted behind a firewall and only the Interface service in the Endpoints Layer is exposed to the internet while all other services are interrelated only via an internal virtual private network.

At the top, interacting with users, is the Endpoints Layer which essentially hosts the containers which either provide the User Interface (UI) to users or the interface (REST API) to integrate it into third-party external tools or interact directly with the platform. The Endpoints Layer also contains the infrastructure to run smaller computations locally, user authentication, caching and auditing services. Interacting with the Endpoints Layer is the Storage Layer which stores analytics results and enables caches to be implemented for specific endpoints (see tranSMART example in Chapter 4) in order to avoid recomputation of frequent analysis, thereby making analyses more efficient. All the large scale data and meta-data associated with the platform (e.g. user details) are saved into the database via the Storage Layer. The Storage Layer supports all the other layers by providing replicated, distributed and scalable storage resources.

The Management Layer schedules the computation of the analyses on the compute nodes based on the availability of the nodes and the type of analyses requested by the platform users. The Computation Layer is responsible for the execution of the scheduled computations on the scale-out infrastructure, which can equally be a cluster, a cloud or any other specialised hardware. It also enforces the implemented privacy measures on each analysis. The final result is stored in the Storage Layer along with the analysis parameters and other details.

3.3.3 Endpoints Layer

The Endpoints Layer is the topmost layer and provides all the user entry points to the environment. The first aim of those endpoints is to cover the needs of a large spectrum of researchers, ranging from biologists with limited computing and technological knowledge to computational biologists who worry about which parameter optimisation will yield the best results. In order to achieve that goal, the eTRIKS Analytical Environment relies on three sets of tools: tranSMART, Borderline [OGA+18] and Jupyter. On the one hand, tranSMART focuses more on hypothesis generation through a rich UI to explore the curated datasets stored, plot the associated statistics and run a set of available workflows with their associated custom-made visualizations. Jupyter and Borderline, on the other hand, offer a larger set of possibilities, as users can write their own custom scripts and visualizations to harness the power of available libraries such as Matplotlib [Hun07] or Lightning [Lig16]. Furthermore, Jupyter and Borderline enable researchers to manage the data provenance and lineage of their data sets and results, thus facilitating the reproducibility of the experiments and data governance. The Endpoints Layer also provides the only public interface to run analyses over user provided datasets directly on the platform. That interface enables the extensibility of the eAE to any third-party application and ensures that only verified and valid requests are processed, that all requests are logged, and that the system remains responsive at all times. At the very minimum, the layer is composed of the Interface, Authentication and Logging services.

The Interface service provides the client APIs for queries and user management. All the requests to the eAE are made over HTTPS and are first verified by the Authentication service. The authentication method has evolved between different versions and will be detailed in Chapter 4. Upon successful authentication, the Interface service validates the request and the job request is created in the database for the Management Layer to schedule.

The Logging service logs the queries made to the platform. It records all queries, both valid and invalid, for audit purposes and traceability. The invalid queries must be easily accessible to administrators for periodic analyses to detect trends indicative of any possible attack on the system. Thus, the valid queries are stored in an append-only text file, making it harder to tamper with them without gaining physical access to the system, while the invalid queries are stored privately in the database.
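A minimal sketch of this split logging policy is shown below; the log file path, collection name and request fields are assumptions made for illustration, not the actual implementation.

import json
from datetime import datetime, timezone
from pymongo import MongoClient

invalid_log = MongoClient("mongodb://localhost:27017")["eae"]["invalid_requests"]

def log_request(user, query, valid):
    entry = {"user": user, "query": query,
             "timestamp": datetime.now(timezone.utc).isoformat()}
    if valid:
        # Valid queries are appended to a text file, harder to tamper with remotely.
        with open("/var/log/eae/audit.log", "a") as audit_file:
            audit_file.write(json.dumps(entry) + "\n")
    else:
        # Invalid queries are kept privately in the database for periodic review.
        invalid_log.insert_one(entry)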

3.3.4 Storage Layer

The Storage Layer manages the medical data and platform specific data. The first aim of the Storage Layer is to provide scalable, replicated and sharded storage capabilities to the platform. The scalability and sharding of the storage are fundamental to empower large scale analyses to run efficiently, while the replication guarantees the resilience of the platform against adverse events (e.g. a node failing, network disruptions, etc.). For the storage, we rely on three different tools: the NoSQL database MongoDB 3.6.0 in the eAE backend, OpenStack Swift [Opec] for large scale object storage in the eAE backend, and the SQL databases PostgreSQL [Pos14] in tranSMART and Timescale [Tim]. The variety of database models supported gives flexibility to the architecture and a solid foundation to support new services.

MongoDB, with its absence of schema, is a perfect contender for adapting to any kind of data and acting as a cache. The platform-specific data consists of the user data (e.g. username, creation date, etc.), analyses available on the platform, query parameters, answers to previously requested queries, the status of each micro-service including the available compute nodes, logged invalid requests, failed computations (output and error logs), and the status of the ongoing computations. Unlike medical data, metadata does not have a fixed structure, and a significant amount of metadata is generated by each service periodically and for each query throughout its life cycle. Another substantial advantage is MongoDB's native support for high throughput read operations, scaling and resilience through sharding. That feature is essential for the correctness of our Scheduling service as it relies on the consistency of the fetched metadata. MongoDB has a very powerful query language which enables selecting, filtering and projecting in a similar fashion to a SQL database. However, for the storage and management of single very large files (terabytes and above), which do not require a database engine, MongoDB's performance becomes a lot less attractive. Indeed, MongoDB stores objects in a binary format called BSON. The "binary JSON" or "BinData" datatype is used to represent arrays of bytes. However, MongoDB objects are generally limited to 16MB in size. In order to overcome that limitation, files are "chunked" into multiple objects that are less than 16MB each. This has the advantage of letting us efficiently retrieve a specific range of a given file, but it becomes extremely slow when retrieving whole files of very large size.
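MongoDB's standard answer to the 16MB limit is the GridFS convention, which performs exactly this kind of chunking; the short pymongo sketch below illustrates the idea (file names and connection string are placeholders).

import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["eae"]
fs = gridfs.GridFS(db)

# Store a large file: GridFS splits it into chunk documents below the 16MB cap.
with open("expression_matrix.csv", "rb") as f:
    file_id = fs.put(f, filename="expression_matrix.csv")

# Reading the whole file back streams every chunk, which is the slow path
# discussed above for very large files; ranged reads over chunks stay efficient.
data = fs.get(file_id).read()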

OpenStack Swift is a highly available, distributed, eventually consistent object/blob store. Swift uses an eventual consistency model to replicate data across the different nodes, in contrast to the strongly consistent model that block storage uses for applications with real-time data requirements and for databases. Eventually consistent object systems are intended to provide high availability and high scalability. They write data synchronously to multiple locations for durability, but when some nodes become unavailable due to a hardware failure, the replication is delayed. OpenStack Swift proxy servers ensure access to the most recent copy of the data, even if some parts of the cluster are inaccessible. Thus, Swift enables storing a very large amount of data efficiently, safely and cheaply.
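For illustration, a small hedged sketch of how a service could store and retrieve an object through Swift with the python-swiftclient library is given below; the endpoint, credentials, container and object names are placeholders, not the eAE configuration.

from swiftclient.client import Connection

# v1-style authentication against a Swift proxy (placeholders throughout).
conn = Connection(authurl="https://swift.example.org/auth/v1.0",
                  user="eae:analyst", key="secret")

conn.put_container("datasets")
with open("cohort.vcf.gz", "rb") as f:
    conn.put_object("datasets", "cohort.vcf.gz", contents=f)

# Reads go through the proxy servers, which serve the most recent available copy.
headers, body = conn.get_object("datasets", "cohort.vcf.gz")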

The open source version of tranSMART relies on PostgreSQL for historical reasons (it was migrated from Oracle). PostgreSQL suffers from a number of limitations when it comes to high availability and scaling. Solutions exist to tackle these limitations. PostgreSQL's default configuration is a very solid one, aimed at a best guess as to how an "average" database on standard hardware should be set up. By optimizing the configuration and distributing the queries to a replicated cluster of PostgreSQL instances using pgpool [PPS15] or pgclusterII for instance, we can improve scalability and performance. However, most of those optimisations or modules are difficult to install and manage and are not supported natively by PostgreSQL. This is why the use of MongoDB in place of PostgreSQL as the database for part of the data comes as a major speed-up for the analysis, by exporting the required data much faster. The emerging NewSQL technologies (e.g. CockroachDB, TimeScale, MyRocks) could be an interesting addition to the core of the Storage Layer for high availability and scalability reasons, as Chapter 5 will illustrate.

3.3.5 Management Layer

The Management Layer is the cornerstone of the eAE platform. It is responsible for monitoring jobs and compute nodes and for scheduling job requests for computation. It is, therefore, crucial for the Management Layer to have downtime as close as possible to zero. The Management Layer also needs to be extendable out of the box, i.e., it must be able to use new compute nodes as soon as they are up and available. For the Management Layer to achieve close to zero downtime, we avoid architectures prone to single points of failure. The Scheduling and Management service guarantees that the system performs efficiently by periodically purging the unresponsive compute nodes to ensure that queries are scheduled only on working nodes. To avoid being bottlenecked by unresponsive jobs or nodes, it decommissions unresponsive compute nodes and purges failed or stuck jobs. Decommissioning of nodes – locking the node's status to DEAD – is done by checking whether the last status update by the service exceeds the set threshold (1 hour by default). For purging unresponsive jobs, the jobs are fetched from the database and checked for their status and time of creation. They are purged (killed) if they have failed more than twice and are removed if they were created before a certain set time.
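The following sketch illustrates this housekeeping pass as it could be expressed against MongoDB; the collection and field names, the retention period and the update statements are assumptions for illustration only.

from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["eae"]
now = datetime.now(timezone.utc)

# Decommission compute nodes whose last status update is older than one hour.
db.nodes.update_many(
    {"lastStatusUpdate": {"$lt": now - timedelta(hours=1)}},
    {"$set": {"status": "DEAD"}})

# Purge (kill) jobs that have failed more than twice.
db.jobs.update_many(
    {"failures": {"$gt": 2}},
    {"$set": {"status": "KILLED"}})

# Remove jobs created before the retention cut-off (30 days is an assumption).
db.jobs.delete_many({"createdAt": {"$lt": now - timedelta(days=30)}})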

The Management Layer is composed of multiple Scheduling and Management services running independently in parallel; each service periodically fetches the jobs and compute nodes from the database for scheduling. This can be achieved through different means in terms of implementation, and a proposed solution will be presented in Chapter 4. Nonetheless, a key novelty we introduce as part of the Management Layer is a completely new scheduler designed to provide concurrent multi-master capabilities.

The core ideas behind the introduction of this novel scheduler are to support capabilities that would otherwise not be supported by third-party schedulers. Increasing scale and the need for swift responses to ever evolving requirements are challenging to meet with current monolithic cluster scheduler architectures. Monolithic architectures limit the rate at which new features can be deployed, increase maintenance downtime, decrease efficiency and utilisation, and will eventually limit cluster growth.

Utilisation and efficiency can be increased by running a mix of workloads on the same machines: CPU- and memory-intensive jobs, jobs requiring specialised hardware (TPU, GPU, etc.) available only on specific machines, small and large jobs, and a mix of batch and low-latency jobs. This consolidation reduces the amount of hardware required for a workload, but it makes the scheduling more complex as a wider range of requirements and policies have to be taken into account. The design of a multi-master scheduler with shared state enables the concurrent scheduling of jobs across different machines. This shared-state mechanism has proven successful at delivering performance in the design of Google's Omega scheduler [SKAEMW13], even in the eventuality where a single scheduling task may require up to several seconds to compute. We went further by taking a gang scheduling approach to ensure that the scheduling is never the scalability bottleneck. It allows jobs to be resumed in the eventuality of a scheduler becoming unavailable and provides the guarantees necessary for downtime as close as possible to zero. The multi-master design allows for the deployment of improved versions of the scheduler side by side with the old ones without any downtime or hindrance to the currently running scheduling. One final feature that a multi-master design enables is the self-invalidation of a member, acting as a watchdog against itself, in the event that it fails repeatedly at scheduling or managing the jobs.

The introduction of cluster computing systems, such as Spark, as a supported analysis environment introduces a new set of challenges that we aim at addressing in our design. Those cluster computational capabilities themselves rely on clusters of machines to operate, thus creating a new layer of complexity for scheduling and resource management. In order to maximise the usage of the machines, we developed a reservation mechanism executed before any Spark computation, which allows Spark clusters to be used for other computation types (such as Python or R) while idle. The scheduler locks all the compute nodes of the cluster by setting their status to RESERVED and, once all the nodes are reserved, the job can be submitted for execution to the compute node hosting the Spark master. This is mandatory to avoid collisions between a Spark job running on workers and other jobs. However, in order to avoid any starvation of the Spark clusters by long-running R/Python jobs, the scheduling priority of Spark jobs is set higher on those nodes. Other types of jobs get scheduled on the Spark clusters in an overflow fashion when thresholds on the number of waiting jobs and their pending duration are reached.
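A simplified sketch of the reservation step is given below; the status values IDLE and RESERVED come from the description above, while the collection layout, the roll-back strategy and the spark-master role field are assumptions.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["eae"]

def reserve_spark_cluster(cluster_id):
    """Lock every node of a Spark cluster before submitting a Spark job."""
    nodes = list(db.nodes.find({"cluster": cluster_id}))
    reserved = []
    for node in nodes:
        claimed = db.nodes.update_one(
            {"_id": node["_id"], "status": "IDLE"},
            {"$set": {"status": "RESERVED"}}).modified_count == 1
        if claimed:
            reserved.append(node["_id"])
        else:
            # Part of the cluster is busy: release what we took and retry later.
            db.nodes.update_many({"_id": {"$in": reserved}},
                                 {"$set": {"status": "IDLE"}})
            return None
    # All nodes reserved: submit to the node hosting the Spark master.
    return next((n["_id"] for n in nodes if n.get("role") == "spark-master"), None)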

Finally, support for provenance is another important aspect we wanted to address in our design. In addition to supporting the sharing of job states, the persistence of the jobs in the Storage Layer enables the traceability of the jobs, with an extensive amount of metadata history being stored with every job. This metadata includes whether the job failed to finish properly, on which executors it ran, the associated error logs (if any), the executor and language/package versions used for that execution, the user who submitted the job, and the start and end dates, as well as the version of the data, amongst others. The provenance can be seen as an audit trail, which is of critical importance in the context of security. Periodic analyses of the provenance data can be carried out to detect any potentially suspicious request or sequence of requests indicative of possible attacks on the system. The provenance data associated with every job is also vital to ensuring the efficient running of the clusters, as it provides insights into the health of all the services across all layers. The continuous processing of that data allows the Scheduling and Management services to oversee the invalidation of failing services and the notification of the system administrators, following a given set of policies.

3.3.6 Computation Layer

The Computation Layer provides execution capabilities to the eTRIKS Analytical Environment. While scheduling jobs is handled by the Management Layer, running them, generating results and ensuring the integrity of the results is the role of the Computation Layer. The first aim of the Computation Layer is to efficiently support a broad scope of analytical capabilities, ranging from simple statistics to compute heavy deep learning models. That scope also includes supporting heterogeneous hardware (CPU, GPU or ASIC), which can be used to further enhance the computing speed. The layer's modularity gives the means to support the addition of new compute resources on the fly, without downtime or loss of scalability, and the addition of new types of analytical capabilities. The Computation Layer also has a combination of security and privacy features which enables us to run privacy-preserving computations as described in Chapter 5. The support of sandboxing is a key requirement when designing a platform for privacy-preserving analytics. Sandboxing ensures the integrity and isolation of the computations against unauthorised accesses from rogue analysts scanning the memory or disk from collocated containers on the host machines. This prerequisite is the fundamental reason that led us to write a new Compute Service as part of the Computation Layer, as no suitable open-source third-party solution offered that possibility within the scope of analytical capabilities required.

In order to achieve the maximum compute efficiency (e.g. compute resources usage) on the platform, it is essential for the environment to support on-demand resource allocation (to support scaling out when more compute-heavy or multiple computations need to be executed). The first approach to solve that issue was a pure cloud computing one using OpenStack [SAE12] Mitaka. Indeed, OpenStack offers two particularly interesting components to address this on-demand requirement: the Heat project [Opea] and the Glance project [Opeb]. Both enable the creation of predefined templates or disk images to launch multiple composite cloud applications. Setting up the development environment, therefore, is seamless for the user, and enforcing version control of the software and libraries ensures the integrity of the environment. The eTRIKS Analytical Environment uses a private instance of OpenStack and in-house servers; however, there is no technical limitation to deploying it in public clouds such as AWS or Azure.

However, to further improve the architecture, we have replaced Heat and Glance with Docker containers and Kubernetes, which provide separate services that can be assembled to tailor a custom environment for the user and vastly improve performance for all the services. For instance, one user might require a specific set of languages (e.g. Python and R), while another user might need Spark support or GPU resources. Besides, we can support different versions of the software at the same time, hosted in different containers. The Spark computation clusters, HDFS, Swift and MongoDB are all installed in bare metal environments with RAID-10 storage to obtain the best possible performance and fault-tolerance. For security purposes, all those instances are password protected and run in HTTPS mode to ensure secure communication over the network between the different instances, while MongoDB connections are encrypted using TLS/SSL. The eTRIKS Analytical Environment currently uses Cloudera CDH 5.11.0 for the deployment of the Hadoop [DG08a] stack (including Spark [ZCF+10]). We also considered the MapR and Hortonworks deployment tools. Each one presents its own set of advantages and drawbacks. The reason for choosing Cloudera's over the others is that it is arguably the best in terms of management interface and the availability of supported software in its stack. If the user's requirements differ from those set here, however, e.g., if the user prefers to use Amazon's AWS or their own in-house components, some components would have to be adapted to adjust the interfaces of the eTRIKS Analytical Environment.

3.3.7 Interaction between Layers

Figure 3.1 illustrates the architecture of the eTRIKS Analytical Environment.

(1) Each user owns a Virtual Machine (VM) or a Docker container containing a version of the Jupyter server, a set of kernels (R, Python [RD10], Spark, etc.) and a minimal set of standard libraries (NumPy, SciPy, scikit-learn [PVe11], Bioconductor [GCe04], etc.) supporting them. This instance is one of the points of access to the eTRIKS Analytical Environment. The users can upload their data sets to the server and write their own scripts for analysis. Jupyter, through the selected kernel, sends the requested computations to the local engines, which in turn send the results back to Jupyter. If users require more compute power, they can remotely submit their script to the Interface service to be scheduled on a larger centralised cluster. When the required resources become available, the scheduler triggers the computation. The Spark clusters are Hadoop stack production clusters installed on bare metal servers for performance reasons. Each one runs CDH 5.11.0 with the full Hadoop stack. The GPU clusters rely on TensorFlow 1.0 for deep learning and Nvidia CUDA 9 (https://developer.nvidia.com/cuda-zone) otherwise. The R servers rely on Microsoft R Open [R C14], formerly known as Revolution R Open (RRO), which is the enhanced distribution of R from Microsoft Corporation. The results are sent back to Jupyter or MongoDB (depending on the user's choice). The user can explore the results using advanced visualizations (Lightning, etc.).

(2) The second native entry point to the eAE is through a tranSMART plugin specifically developed for this integration. The plugin manages and interfaces with the MongoDB cache. The plugin can submit a job to the Interface service using data stored either in MongoDB or in tranSMART. The results are sent to the MongoDB cache. The user can explore the results in tranSMART and compare them with previously run computations held in their personal cache history.

(3) The third native entry point to the eAE is through Borderline. Borderline is a user-facing set of services responsible for locating data, querying it across multiple heterogeneous sources, tracking its provenance as it travels through the platform and allowing users to maintain complete control over the process. It is the glue that puts together the eTRIKS Analytical Engine (eAE) and the eTRIKS Data Platform (eDP) (including tranSMART and eHS) and capitalises on efforts conducted over the past four years. Besides allowing seamless data flow and tracking between these components, it provides the user with an enriched interface capable of handling complex scenarios. It also provides a dynamic data query editor for the selection of patient subgroups from the entire corpus of data accessible by a given user (checking access according to a permission engine), as well as a code editor offering analysis code templates and the possibility of better control and customization over what the computational platform will execute.

3.3.8 Security of the architecture

The platform has been designed with security and privacy in mind. The eAE manages risks using a combination of server-side security, authentication, audit and network security. However, those protections are only the core layer upon which adopters can build and which they can further extend to meet the most stringent requirements. The extensibility and modularity of the security in the architecture will be further demonstrated in Chapter 5.

Server-side security: Many attacks on privacy and services employ a relatively large number of queries to circumvent protections (e.g. DDoS attacks, data leakage, etc.). To thwart brute-force attacks on the client API, we developed a query rate limitation mechanism. This ensures that any analyst can only submit a limited number of queries in a certain time period defined by the curator (e.g. 100 queries in 7 days). As detailed in Chapter 5, the architecture supports the secure execution of algorithms in sandboxed environments. This execution isolation comes at a cost in terms of performance, but prevents rogue algorithms from snooping illegally on other computations which might be running at the same time on the platform. The sandboxing relies on AppArmor ("Application Armor"), a Linux kernel security module. The module supplements the traditional Unix discretionary access control (DAC) model by providing mandatory access control (MAC).
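A minimal sketch of such a query rate limitation, using the example quota of 100 queries per 7 days, could look as follows; the storage layout and function names are assumptions, not the eAE implementation.

from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

query_log = MongoClient("mongodb://localhost:27017")["eae"]["query_log"]

def within_quota(user_id, limit=100, window_days=7):
    """Return True if the analyst may still submit queries in the current window."""
    since = datetime.now(timezone.utc) - timedelta(days=window_days)
    used = query_log.count_documents({"user": user_id,
                                      "timestamp": {"$gte": since}})
    return used < limit

def record_query(user_id):
    query_log.insert_one({"user": user_id,
                          "timestamp": datetime.now(timezone.utc)})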

Authentication: Access is provided only to authenticated users with the right authorization levels. Three levels of users are supported: super admin, admin and standard users. Admins can create, delete and check users through the API as well as monitor the status of the services. Super admins have the additional right to create new admins. In addition to those core levels, users are given additional rights levels for data and analysis control. For example, in the context of the population density algorithm (see Chapter 5.2.3), different users will be authorised different levels of regional access: one user might be authorised to access data at commune level while another only at regional level. The granularity can also be temporal in the context of mobility, where the accessible time frame can be larger for some users compared to others. Further restrictions can also be implemented, such as a maximum sampling size for a given analysis.

Audit: Auditing is an important part of the security of the platform as it enables system administrators, governance board members responsible for ethical oversight and data owners to review all previous queries and detect attempted attacks through the logging of illegal requests. The auditing also helps preserve the health of the clusters by providing the computation times of the queries and the cluster loads to the administrators. Those indicators can help them identify nodes that might be throttling or clusters which are over/underutilised. The administrators can then act on them by commissioning/decommissioning nodes and thus provide the best experience to users.

Network security: To prevent attacks where people intercept HTTP packets, all communication with the API and between the different services is done exclusively over HTTPS. Any non-HTTPS request is discarded and logged for auditing purposes. Furthermore, the connections of the services to MongoDB are encrypted using TLS/SSL. In order to shield the platform from external brute force attacks, the layers are deployed into two different VLANs. This siloing makes it possible to expose only the Interface service of the Endpoints Layer to clients' applications, while the data and compute services are safely hidden from the rest of the network.

Chapter 4

Implementation of the eTRIKS Analytical Environment

In this chapter, we present the base implementation of the architecture. We then present the work carried out to facilitate the training and development of deep learning models and how they were later integrated into the production deployment of the eTRIKS Analytical Environment.

4.1 Implementation

4.1.1 General Environment

All the microservices of the platform follow ECMAScript 8 and use ExpressJS and NodeJS [TV10] as the runtime environment. An agile development methodology (e.g., automated testing and building of the Docker containers) was employed for the development of the platform. The use of Docker containers has enabled the creation of endpoint containers tailored to the needs of the users, in order to use the hardware in an optimal fashion. One user might, for example, require a specific set of languages (e.g., Python and R), while another user may need Spark support or GPU resources. Besides, we can support different versions of the software at the same time, hosted in different containers.

That custom environment gives more flexibility to the user and vastly improves overall performance. To ensure that the platform up-time is close to 100%, Docker Compose (in auto-restart mode) is used for the deployment of the services. This allows for the deployment of the services multiple times, across different host machines, for scalability and resilience purposes. The only entry point to the platform, for submitting requests, is through the public APIs exposed by the Interface service in the Endpoints Layer. The platform is hosted behind a firewall and only the Interface service is exposed to the Internet. This limited exposure to the outside reduces the surface of attack available to malicious entities. All the services communicate with each other asynchronously through REST APIs via an internal virtual private network.

All services must self-register on startup with MongoDB, which acts as the service registry with the cooperation of the Management service. A service instance is in charge of registering itself with the service registry. On startup, the service instance registers itself (IP address and host) with the service registry and becomes available for discovery. The client must send a heartbeat periodically to renew its registration so that the registry knows it is still alive. If the last heartbeat exceeds a defined threshold, the service is unregistered. This design pattern allows for flexible deployment of the platform, makes it easy to add or remove services, and can run different versions concurrently without any downtime (although that would not be a recommended best practice).
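The sketch below illustrates this registration-and-heartbeat pattern; the real services are Node.js microservices, so this Python version, with assumed collection and field names, only conveys the logic.

import time
from datetime import datetime, timezone
from pymongo import MongoClient

registry = MongoClient("mongodb://localhost:27017")["eae"]["services"]

def register_and_heartbeat(name, ip, port, interval_seconds=30):
    """Upsert this instance's record and refresh it periodically."""
    while True:
        registry.update_one(
            {"name": name, "ip": ip, "port": port},
            {"$set": {"lastSeen": datetime.now(timezone.utc)}},
            upsert=True)
        # The registry unregisters any instance whose lastSeen exceeds its threshold.
        time.sleep(interval_seconds)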

4.1.2 Endpoints layer

The eAE exposes a REST API that enables users to:

• remotely submit Spark, Python, GPU as well as R jobs;
• integrate the analytics environment with tools such as tranSMART, Borderline and Jupyter;
• monitor the health of the clusters, the audit trail and the job history.

Borderline, like the eAE, is written in JavaScript with Node.js. The choice of a rather uniform Node.js-based stack enables simplified maintenance of all the lightweight microservices. Likewise, all communication between the components relies on a set of similarly designed HTTP APIs.

Figure 4.1: A schematic representation of the eTRIKS Analytical Environment implementation.

Plugin for tranSMART

As discussed before, the workflows originally developed in tranSMART were not intended for large scale computations. It is with the intention of closing this gap that a plugin for tranSMART 16.2 to interface with the eAE has been developed. The core features of this plugin are to transfer the data to the eAE and to implement a cache mechanism for the users to track the status of the computation and access their results at any time. This mechanism has been implemented using the MongoDB layer. The cache itself enables individual users to browse through their results whenever they want to. Each cache history is visible only to its user. However, if one user wants to run the same analysis as another user, the cache will retrieve the existing results instead and add the computation results to the user's own cache history. This allows for better use of computing resources. The plugin is also responsible for the visualizations associated with the eAE workflows developed for tranSMART (the workflows are detailed in Section 6.1). Those visualizations have been implemented using D3.js [BOH11]. In general, D3.js does not require any specific data format for visualization. Generic CSV tables can be used for all types of visualizations, and its core provides data loaders for this format, fitting perfectly with MongoDB.
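The cache lookup could be sketched as follows; the collection names, document fields and return values are assumptions intended only to illustrate the behaviour described above.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["eae"]

def get_or_submit(user, analysis, params):
    """Reuse a cached result when the same analysis was already computed."""
    cached = db.cache.find_one({"analysis": analysis, "params": params,
                                "status": "COMPLETED"})
    if cached is not None:
        # Attach the existing result to the requesting user's private history.
        db.history.insert_one({"user": user, "resultId": cached["_id"]})
        return cached["result"]
    # Otherwise queue a new job request for the Management Layer to schedule.
    job_id = db.jobs.insert_one({"analysis": analysis, "params": params,
                                 "user": user, "status": "QUEUED"}).inserted_id
    return {"status": "QUEUED", "jobID": str(job_id)}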

(a) Home page of the eAE plugin for tranSMART with the three algorithms developed. (b) Example of a pathway enrichment analysis run through the eAE: the table represents the top 5 pathways most correlated with the given list of genes and the image is the top pathway, retrieved dynamically from KEGG.

Figure 4.3: A schematic representation of the integration of Borderline with the eAE and the integration of tranSMART as part of the eDP. Users access the platform through (Borderline web UI(s)), where they can select datasets via (Borderline data-source middleware(s)). Data is retrieved from a target such as (eDP External API), which in turn relies on (eDP File Parser(s)) and (eDP Query Executor(s)) to compile the selection. Once extracted, the data is pushed to (Swift Object Store cluster). In addition to selecting data, users might use (Borderline UI) to write custom analysis code and workflows. These are bundled with the data and made available to (Borderline Cloud Connector(s)). From there, they are sent to (eAE External API) via (eAE File Carrier(s)). They are then dispatched by (eAE Scheduler(s)) onto the appropriate (eAE Compute Node(s)). Once the computation has finished, results come back to (Borderline Cloud Connector(s)) and are pushed to (Swift Object Store cluster) to be accessible to the users. Operational items such as service health, sessions and routes are stored in (MongoDB cluster).

Borderline

Borderline [OGA+18] is a user-facing set of services responsible for locating data, querying it across multiple heterogeneous sources, tracking its provenance as it travels through the platform and allowing users to maintain complete control over the process. It is the glue that puts together the eTRIKS Analytical Environment (eAE) and the eTRIKS Data Platform (eDP), and capitalises on efforts conducted over the past five years.

The eTRIKS Data Platform (eDP) is a heterogeneous environment within which multiple data warehousing software packages can be deployed.

In our lab we make use of the eTRIKS Harmonization Service (eHS, https://ehs.dsi.ic.ac.uk/), a home grown, CDISC [CDI04]/ISA [SRSe12] enabled software addressing the needs of data pipeline management between data collection and data analysis in translational medicine, as well as the tranSMART Platform.

This alternative of relying on Borderline to interact with the eAE, instead of connecting tranSMART directly to it, presents two main benefits. The first benefit is that, starting with version 17.1, tranSMART underwent a major overhaul resulting in a vastly improved API, improved external data query times and a complete decommissioning of the former UI. Thus, rewriting the plugin for that new version would have been moot, as most users would not have been able to use it from the UI. Secondly, Borderline offers users a far richer experience, so we bring more value to the users by leveraging tranSMART through Borderline. Indeed, besides allowing seamless data flow and tracking between these components, it provides the user with an interface capable of handling complex scenarios and a data provenance feature with the workflow management. Borderline leverages the eAE's own provenance data to offer the possibility of better control and customization over what the computational platform will execute and to facilitate reproducibility.

Because of the shared common technological stack and design principles, we have been able to develop a deeper integration between Borderline and the eAE. That integration has enabled a major data transfer speed-up between the applications and enhanced the interactivity of the user interface by bringing the compute capabilities closer to the user.

Jupyter

In order to achieve the best user experience, we held discussions with other researchers to define how best to support their needs. We first explored the implementation of a Jupyter plugin to connect to the platform. However, this solution would not fully achieve the desired user experience for the power users (our primary users) and would put more constraints than necessary on the architecture. For those reasons, we took a more programmatic approach and moved away from the UI.


The integration of Jupyter relies on a Python PIP package called eae (https://pypi.org/project/eae/). PIP is a package management system used to manage software packages written in Python. PIP manages full lists of packages and corresponding version numbers, which permits the efficient re-creation of an entire group of packages in a separate environment (e.g. another computer) or virtual environment. That embedded versioning allows different PIP versions to support different versions of the eAE API throughout its life cycle. Hence, it gives stability to the users while retaining possibilities for a safe evolution for the developers. In addition to facilitating the interaction between Jupyter and the eAE, that approach is agnostic to the environment considered. PIP packages can be installed in any Python environment, thus users can very well send jobs from their own machines or other hosting environments (e.g. Zeplin, https://zeplin.io/) regardless of where they are, and are not constrained to Jupyter. That flexibility, while being less user-friendly, is far more powerful for the power users, lower maintenance and more future-proof towards supporting new endpoints.

The package enables the users to interact with the eAE back-end seamlessly by handling all the API calls and the data transfer in the background. As illustrated by Figure 4.4, that approach allows the users to easily define a large number of tasks to be submitted to the back end in a programmatic and stable fashion (e.g. avoiding tasks being submitted twice by accident).

4.1.3 Storage Layer

The Storage Layer manages the medical data stored in tranSMART and platform specific data. We used MongoDB 3.4 to store the platform data and cache data, to support TensorDB (see Section 4.3) and to support new data types. The cluster was deployed in Docker containers in sharded mode, where each shard is an independent three-member replica set. That new deployment further enhances the performance of the MongoDB cluster while retaining a high level of flexibility, as it is integrated into the platform in the same fashion as the other services, making the deployment and maintenance processes transparent to the administrator.

In addition to storing an extended amount of platform specific data, the Storage Layer underwent some conceptual changes with the addition of OpenStack Swift. Swift implements object storage with interfaces for object-, block- and file-level storage. Swift allows us to assume highly available, distributed and more efficient storage to serve the data between the different services and compute nodes. Objects and files are written to multiple disks spread across multiple servers in the data center, with the OpenStack software responsible for seeing to integrity and data replication across the cluster. Storage clusters scale horizontally in a straightforward manner by adding new servers. Should a hard drive or a server fail, Swift replicates its content from other active nodes to new locations in the cluster. Thus, the Storage Layer has its own large scale storage capabilities to store any medical data before it is processed by the Computation Layer.


# We import the eAE package
from eAE import eAE

# We create the connection to the backend
eae = eAE.eAE("example", "password", "interface.eae.co.uk")

# We list the jobs with their associated parameters
parameters = ["first_analysis_type 0 1",
              "first_analysis_type 1 2",
              "second_analysis_type 0.3 delta"]

# We list the required files for the analysis to be sent to the back-end
data_files = ["job.py", "faust.txt"]

# We submit a job
answer = eae.submit_jobs("python2", "job.py", parameters, data_files)

# We check that the submission has been successful
print(answer[0])

"""
answer = { "status": "OK",
           "jobID": "5b080d28e9b47700118f0c99",
           "jobPosition": 1,
           "carriers": ["carrier:3000"]
         }
"""

# We download the results
result = eae.get_job_result('', answer[0]['jobID'])

# We have a look at the computed result
"""
Hello World !
first_analysis_type
The Project Gutenberg EBook of Faust, by Johann Wolfgang Von Goethe

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
"""

Figure 4.4: Illustration of a simple submission of three jobs using the python eae package.

Swift implements object storage with interfaces for object-, block- and file-level storage. Swift allows us to assume a highly available, distributed and more efficient storage to serve the data between the different services and compute nodes. Objects and files are written to multiple disks spread across multiple servers in the data center, with the OpenStack software responsible for ensuring integrity and data replication across the cluster. Storage clusters scale horizontally in a straightforward manner by adding new servers. Should a hard drive or a server fail, Swift replicates its content from other active nodes to new locations in the cluster. Thus, the Storage Layer has its own large-scale storage capabilities to store any medical data before it is processed by the Compute Layer.
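As an illustration of how services interact with this object store, the sketch below uses the python-swiftclient library to stage a job input file and retrieve it later; the endpoint, credentials and container name are assumptions made for the example rather than the platform's actual configuration.

from swiftclient.client import Connection

# Authenticate against the Swift endpoint (hypothetical Keystone v3 credentials)
conn = Connection(authurl="https://swift.eae.local:5000/v3",
                  user="eae", key="secret", auth_version="3",
                  os_options={"project_name": "eae"})

container = "eae-staging"
conn.put_container(container)

# Upload an input file for a job
with open("faust.txt", "rb") as f:
    conn.put_object(container, "jobs/example/faust.txt", contents=f)

# Later, a compute node downloads the object before executing the job
headers, body = conn.get_object(container, "jobs/example/faust.txt")
with open("faust_local.txt", "wb") as f:
    f.write(body)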

4.1.4 Management layer

For the Management Layer to achieve close to zero downtime, we avoid architectures prone to single points of failure. Unlike the master-slave architecture pursued by schedulers such as IBM's Load Sharing Facility (LSF) [IBMa] or OpenLava [JB16], the eAE offers a completely new scheduler built around concurrent multi-master capabilities.

The Management Layer is composed of multiple Scheduling and Management services running independently in parallel, and each service periodically fetches the jobs and compute nodes from the database for scheduling. As the schedulers run independently, a soft lock mechanism has been developed to manage the concurrent access to the jobs' and the compute nodes' records in the database. This prevents inconsistencies such as the same job getting scheduled multiple times or a single node receiving multiple jobs at the same time. The soft locks are set in the database by the Scheduler fetching the records. The Scheduler periodically fetches the unlocked IDLE compute nodes and QUEUED jobs and, for each fetched job J, it checks whether a compute node is available to execute it. On a successful check, it tries to set the soft lock on the job record and the selected compute node. If successful, a request is sent to execute the job. The locks are removed once the request is acknowledged by the compute node. This internal concurrency management and the loosely coupled design are the enablers for the multi-master scheduling capabilities of the Management Layer and the subsequent benefits described in Section 3.3.5.
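The following sketch illustrates the soft-lock idea using MongoDB's atomic find-and-modify operation, so that only one scheduler instance can claim a given QUEUED job; the collection layout and field names are assumptions made for the example.

import datetime
from pymongo import MongoClient, ReturnDocument

db = MongoClient("mongodb://mongodb.eae.local:27017").eae

def try_claim_job(scheduler_id):
    # Atomically set a soft lock on one unlocked QUEUED job, if any
    return db.jobs.find_one_and_update(
        {"status": "QUEUED", "lock": None},
        {"$set": {"lock": scheduler_id,
                  "lockedAt": datetime.datetime.utcnow()}},
        return_document=ReturnDocument.AFTER)

def release_job(job_id):
    # Remove the soft lock once the compute node has acknowledged the request
    db.jobs.update_one({"_id": job_id}, {"$set": {"lock": None}})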

The Scheduling and Management services guarantee that the system performs efficiently. To avoid being bottlenecked by unresponsive jobs or nodes, they decommission unresponsive compute nodes and purge failed or stuck jobs. Decommissioning a node (locking the status of the node to DEAD) is done by checking whether the time since the node's last status update exceeds the set tolerance (1 hour by default). For purging unresponsive jobs, the jobs are fetched from the database and checked for their status and time of creation. They are purged (killed) if they have failed more than twice and are removed if they were created before a certain time (2 days). The scheduling of jobs happens every second, purging once a day, and decommissioning of nodes every minute.

4.1.5 Computation Layer

The Computation Layer has seen the development of a dedicated Compute service. The Com- pute service fetches the required data and analysis algorithm from the Storage Layer, executes the functions and finally stores back the results into the Storage Layer to be collected at any point in time by the requesting user. The Compute service is also responsible for logging the execution time, exit code, standard and error outputs (if any) into the associated job record for audit and development purposes.

Each Compute service is associated with a list of the types of jobs it can execute, e.g., Python2, Python3, R, Spark. In order to maximise the usage of the services, the same physical containers run different types of computations (e.g. R and Python, or Python and GPU) and can be allocated to more than one cluster. However, the idea has been pushed a bit further, as Spark clusters can now be used to run other computations such as Python or R on individual nodes while they are not in use. In order to achieve that, we created a reservation mechanism to prepare the cluster for the computation. If a job is running on one of the nodes of the cluster, the scheduler waits for the end of the computation and then reserves the node. Once all nodes have been reserved, the Spark cluster can be used and the job scheduled. To avoid Spark jobs waiting for too long, we took an overflow approach: those nodes have a lower scheduling priority than the other nodes dedicated to that type of job.

The Compute service has also been designed in a fashion that facilitates the addition of new types of jobs (e.g. Fortran or C/C++) or features (e.g. sandboxing, as demonstrated in Section 5.2). The typical time needed to develop and test a new job type is less than a day, which should encourage developers to adopt the platform.

4.2 Benchmarking and Scalability

In this section, we demonstrate the improved resource usage, scalability and performance of the architecture towards supporting several bioinformatics analysis pipelines.

4.2.1 Resource usage

To evaluate the improved resource usage, we monitored the average CPU load across a set of twenty physical machines before and after deploying the eAE. In both cases, we recorded every fifteen minutes the load average over the last fifteen minutes provided by the host system in ‘/proc/loadavg’ for each individual machine. The recording of the loads for those experiments spanned three months in both instances. The set of machines was heterogeneous (different machines have different numbers of cores) so, in order to give a representative idea, we provide the usage percentage as the ratio between load and total capacity. For instance, a machine with 24 hyper-threaded cores with a recorded load of 12 gives a usage percentage of 50%.
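A minimal sketch of that computation is given below: the 15-minute load average reported in /proc/loadavg is divided by the number of logical cores of the machine.

import os

def usage_percentage():
    with open("/proc/loadavg") as f:
        # Fields are the 1-, 5- and 15-minute load averages, then process counts
        load_15min = float(f.read().split()[2])
    return 100.0 * load_15min / os.cpu_count()

# e.g. a machine with 24 hyper-threaded cores and a load of 12 reports 50%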

While Figure 4.5 clearly shows improved resource usage on average, it is interesting to note that the improvement varied substantially between the types of computations in the context of this experiment. In both cases, we had fourteen machines dedicated to R/Python and six to a single Spark cluster. The R/Python machines have seen their average loads increase, while the Spark cluster has seen a marginal decrease. However, this could be partially explained by the fact that there was only one Spark user, unlike the other machines which had multiple users.

Figure 4.5: Evolution of the usage percentage (average CPU usage, %) aggregated per two days across all the machines during three months, before and after the eAE deployment. We observe that, on average, the resource usage of the compute resources is significantly improved (21% on average).

4.2.2 Scheduler

We conducted several experiments to evaluate both the performance of the scheduler and the resilience of the Management Layer as we scale the system. For both experiments, we deployed a single instance of MongoDB with 1000 concurrent Compute services and submitted requests to schedule a sleep job of 1s. These jobs are scheduled, computed, and finally, the results are stored in MongoDB. The benchmarks were executed on a single machine with 256GB of disk (RAID 1 @ 7200RPM), 250GB of RAM (@ 1600 MHz), 48 cores (2x Intel Xeon, E5-2620 v2 @ 2.10GHz) and a network speed of 20Gb/s. The large number of Compute services ensures that the time measured truly evaluates the scheduler's performance and is not bottlenecked by the compute services.

To evaluate the scheduler's maximal capacity to scale as the number of concurrent jobs increases, we submitted N batches (N=1, 10, 50 and 100 in this experiment) of 100 jobs to a single scheduler. A scenario of more than 10k jobs is rather unlikely in a production environment as each job takes some time (in hours) to process. Figure 4.6 shows that submission and scheduling scale linearly with a large number of requests.

To evaluate the resilience of the Management Layer as we horizontally scale the system, we submitted a total of 10k jobs with a varying number of schedulers.

Figure 4.6: The performance of a single scheduler with respect to the submission size. Each point represents the average running time of 10 experiments along with the standard deviation.

This scenario also evaluates the performance of the Management Layer in the eventuality that one or more schedulers become unavailable. This simultaneous evaluation is possible thanks to its multi-master design, where each scheduler operates independently from the others. Five independent clients (with batches of size 100) were used to insert jobs in MongoDB with schedulers (M=5,...,1) running in the background. Each scheduler is self-orchestrated in an asynchronous fashion, i.e. each scheduler is responsible for periodically (every 100ms in this experiment) retrieving a list of jobs to be scheduled and scheduling them on available compute services. At any given time, the status of a job in MongoDB is the same for all schedulers, which removes the need for synchronization between schedulers. Figure 4.7 shows that, as the number of schedulers decreases, the drop in performance, or the increase in computation time, is not significant. This behavior re-asserts the resilience of our Management Layer, enabled by its multi-master design, by showing that the failure of multiple schedulers has a negligible effect on the overall performance of the system.

Figure 4.7: The performance of the Management Layer as the number of schedulers decreases. Each point is the average running time of 3 experiments along with the standard deviation.

4.2.3 Compute Scalability

To evaluate the compute performance and scalability of the platform, we performed benchmark scalability tests on data size, number of executors and number of users. The computation has been executed on a cluster of six machines: one driver and five executors. All nodes are identical and each one has 200GB of disk (Seagate, RAID 1 @ 7200RPM), 100GB of RAM (Micron, @ 1600 MHz), 24 cores (Intel Xeon, E5-2620 v2 @ 2.10GHz) and a network speed of 10Gb/s. The cluster itself relies on Cloudera CDH 5.11.0, which comes with Spark 1.6.0. The Spark configurations used in all experiments for the “spark-submit” are:

1. master = yarn-client
2. executor-cores = 16
3. driver-memory = 20GB
4. executor-memory = 20GB

The number of executors has been set to five for the first compute scalability experiment but has been set to different values in the second compute experiment. The analysis executed in the executor scalability experiment is a linear SVM, a method available in Spark's machine learning library.

The experiments are designed based on the iterative model generation and cross-validation pipeline mentioned in Section 6.1.1. In this experiment, we use mRNA data from a GEO dataset (GSE31773) [TWe12] as a base to synthesise the required data. The synthetic data has been generated by first averaging the original values and then adding randomly generated noise for every new vector until the desired data size is reached. The label for each vector is randomly chosen. Six randomised datasets have been generated independently, with sizes of 1MB, 10MB, 100MB, 1GB, 10GB and 100GB, respectively. The 1MB dataset has been used to determine Spark's initialisation time. This initialisation time has been subtracted from all other measurements in order to keep only the effective computation time of the model. The data has been placed in HDFS for Spark to use. From the results plotted in Figure 4.8, we can observe that the time required to process the data grows linearly with its size, which constitutes the first indicator of the scalability of the platform.
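A minimal sketch of this generation procedure is given below; the noise scale and output format are assumptions made for the example, as the exact values used in the experiment are not reproduced here.

import numpy as np

def synthesise(original, n_vectors, noise_scale=0.1, seed=0):
    # original: matrix of mRNA expression vectors (samples x features)
    rng = np.random.default_rng(seed)
    mean = original.mean(axis=0)                 # per-feature average
    noise = rng.normal(0.0, noise_scale, size=(n_vectors, mean.size))
    data = mean + noise                          # synthetic expression vectors
    labels = rng.integers(0, 2, size=n_vectors)  # randomly chosen labels
    return data, labels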

Figure 4.8: The scalability of the eTRIKS Analytical Environment with respect to the data size. Each point represents the average running time of 30 experiments, with the error bar representing the standard deviation.

In the second experiment, we reused the 100GB dataset generated beforehand but used a different number of executors every time to observe the impact on the computation time. We can see from Figure 4.9 a considerable decrease in the computation time when we increase the number of executors, which constitutes the second indicator of the scalability of the platform. The inflexion occurring between 3 and 5 executors also indicates that adding too many nodes to the computation may prove counter-productive, as the time gained for the computation is only small.

Figure 4.9: The compute scalability of the eTRIKS Analytical Environment with respect to the cluster size. Each point represents the average running time of 30 experiments, with the error bar representing the standard deviation.

4.2.4 Storage Scalability

To evaluate the storage performance and scalability of the platform, we performed benchmark scalability tests on data size. The Swift cluster has been deployed on three machines; all nodes are identical and each one has 3TB of disk from a disk array (Seagate, RAID 1 @ 7200RPM), 100GB of RAM (Micron, @ 1600 MHz), 24 cores (Intel Xeon, E5-2620 v2 @ 2.10GHz) and a network speed of 10Gb/s. The client uploading the data to the cluster was a single machine with 1TB of SSD disk (Samsung 970 PRO) for storage, 16GB of RAM (Corsair, @ 2133MHz), 8 cores (Intel i7-6700, @ 3.40 GHz) and a network speed of 1Gb/s.

To evaluate the Storage Layer's capacity to scale as the size of files grows, we uploaded and downloaded six files of varying sizes ranging from 0.1GB to 1TB. Figures 4.10 and 4.11 demonstrate that the upload and download capabilities of the Storage Layer scale linearly up to the terabyte level. It is to be noted, however, that those times could have been significantly better with a higher bandwidth on the client side.

We did not explore a varying number of replicas as only a three-replica cluster makes sense in a production environment. Indeed, a single replica would not provide any resilience to the platform and two replicas would not offer the possibility to continue operating the platform if the other replica were to fail at some point (the remaining replica would lock itself to avoid any data loss or corruption). Similarly, any deployment with more than three replicas would result in a huge amount of network IO between replicas and a lot of time and CPU power spent replicating the data without any overall performance improvement to the storage layer.


Figure 4.10: The storage upload scalability of the eTRIKS Analytical Environment with respect to the data size. Each point represents the average running time of 5 experiments, with the error bar representing the standard deviation.


Figure 4.11: The storage download scalability of the eTRIKS Analytical Environment with respect to the data size. Each point represents the average running time of 5 experiments, with the error bar representing the standard deviation.

4.2.5 Summary

The resulting system provides an efficient and effective solution for data exploration and high-performance bioinformatics analytics while strictly implementing the architecture defined in Chapter 3. The technological stack uses only what is necessary, which facilitates new developments, new features and the adoption of the platform. The services are all strictly independent, autonomous, even more resilient and streamlined, with best engineering practices and a strict coding style put in place. The endpoints can be fully standalone or opt for a deeper integration for optimal performance, as the integration with another eTRIKS project called Borderline has shown.

The platform has a high density of logs, which enables better monitoring and easier maintenance of the platform. In addition, no downtime is needed to add new services or restart part of the platform as they can self-register directly with the other services while the management layer monitors and manages the health of the clusters and services at any given time.

The new compute layer has made it possible to distribute the computation load more evenly across the different machines without any meaningful impact on the user. The load distribution is important as it spreads the wear evenly across machines and thus avoids premature failure of the hardware.

In conclusion, this implementation is an elegant, efficient and scalable solution for data exploration and high-performance bioinformatics analytics. This version is production ready and is the base upon which extensions can be built to further extend the capabilities of the platform, as we will demonstrate in Chapter 5.

4.3 TensorDB: Database Infrastructure for Continuous Machine Learning

This section introduces the TensorDB system, a framework that fuses database infrastructure and application software to streamline the development, training, evaluation and analysis of machine learning models. That work was carried out with the goal of further facilitating the research and development of new machine learning models as well as the reproducibility and traceability of that research.

4.3.1 Introduction

In nature, learning is adaptive and progressive; that faculty enables animals to change their behavior as the environment changes [SYL13]. On the other hand, most machine learning algorithms, like deep learning, are static and never spontaneously evolve regardless of the situation. With the dominant batch learning approach, there is a clear separation between training and inference. Machine learning models are trained off-line with large amounts of training data and the parameters are fixed and optimised according to specific objective functions. For most applications, such a methodology faces many limitations. Most businesses are dynamic, and the machine learning models must be updated continuously as the business operates. For e-commerce businesses, for example, the recommendation system must be updated as user preferences or market conditions change.

Continuous machine learning is the key towards machine intelligence and of increasing importance in research. Recent machine learning models such as reinforcement learning [SB16] and generative adversarial networks [GPAM+14] are continuous in nature. We argue that the continuous learning, adaptive and progressive features of the biological system are indispensable for machine learning applications.

We introduce TensorDB, a framework to enable continuous training, evaluation, analysis, and deployment of machine learning models. TensorDB is not an algorithm for continuous machine learning, but an autonomous life-cycle management system for machine learning models. The core idea is based on the observation that continuous learning is achieved by continuously updating the training sets and continuously training and evaluating the models. Assuming unlimited storage and computing power, we can build millions of models with different architectures while the evaluation system keeps on mining the model repositories and always selects the one with the best performance for deployment. When designing such a system, the key challenges lie in the data management of the models, training sets, parameters, and logs.

The data management system of TensorDB is based on a NoSQL database [MH13] and a map-reduce search engine [YDHP07]. It is implemented as a distributed system that fuses database infrastructure and machine learning application software. All the components of machine learning development are connected by the database query mechanism, which sets up TensorDB as a flexible system that enables each component to be updated continuously. Even when continuous learning is not required, machine learning developers can benefit from TensorDB for streamlining the development process. The framework can be used as a data warehouse for managing the training sets, a logging system for recording the training process, a version control system for machine learning models, an intelligent system for model selection, and a load balancing system for distributing machine learning jobs. The TensorDB system provides all the above functions and allows for fully autonomous pipelines for sampling, training, evaluating, analyzing and deploying machine learning models.

TensorDB is made of three components:

1. Cohort recruitment for management of training set and models

2. A building/job system for training models and a logging system for recording the training process

3. Model mining system for evaluation and model selection

4.3.2 Related work

While many libraries for building and training deep learning models have been developed, little work has been done on the management and provenance of the models. Recently, ModelHub [MLDD16] has been proposed as a version control system for exploring and storing all the models. However, TensorDB is more ambitious and focuses on the whole workflow. From the technology perspective, a critical difference is that TensorDB is based on NoSQL database technology and uses map-reduce as the searching method, while ModelHub is SQL based and thus less flexible. The current implementation of TensorDB is based on MongoDB and, thanks to the schemaless design, users can alter the database schema at any point in time. The key feature of the TensorDB design is the flexibility of the binding system. Flexible binding is implemented as a data search query thanks to the support of a database. This design is inspired by the Katana lighting system [HSL14] developed by Sony Image Works. The logging system is inspired by the EHR system for healthcare record management and it follows the NoSQL solution for Pharmacovigilance [AAoNLR+16].

4.3.3 Architecture

Figure 4.12: TensorDB workflow: the database query mechanism connects all the components. The work is distributed across multiple machines.

TensorDB is built around several core components, as presented in Figure 4.12. The cohort recruitment system is a data warehouse for managing training sets and models. Our MongoDB-based implementation stores the data as JSON documents, i.e. dictionaries of many key-value pairs (MongoDB replaces tables with collections, which are sets of documents). Searching is performed by MapReduce methods, which MongoDB supports natively thanks to built-in capabilities, and this is then mapped to the filtering, projection, and sorting methods. TensorDB maintains five collections:

1. The Model Collection, which contains all the model architectures and parameters.

2. The Dataset Collection, which contains all the training and validation sets.

3. The Training Log, which records the performance of each step during training.

4. The Evaluation Log, which records all metrics for models on different validation sets.

5. The Job System, which maintains all the building jobs and also serves as a queue for load balancing.

To be more precise, there are also two collections for models and datasets which are used as a file system. There is also corresponding software operating on the data to run the TensorDB system:

1. Data Importer which imports training datasets into the database.

2. Recruiter is implemented as a query program based on the MongoDB query language. Each cohort is not physically stored in the database; the search program is executed at run time when the cohort is assembled.

3. Builder reads the job queue and executes the training and evaluation for the job. All the model parameters of each epoch are stored in the database. The logs for each training step are stored in the logging system as well.

4. Evaluation and Model Mining system also relies on the job queue and executes all the validating jobs. The evaluation results of each validation set and model metrics are stored in the database.

5. The analysis and visualization system explores the training and evaluation logs.

In the application, most of the code and database schema is fixed. The flexibility comes from the search-based recruiter, which never directly binds a model to data. For example, the training set could be specified as the last 1000 imported samples, which will keep on changing if data are generated continuously, and the evaluation could be restricted to the five models that have the lowest training errors.
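A minimal sketch of such search-based bindings, expressed as MongoDB queries, is shown below; the collection and field names are assumptions made for the example.

from pymongo import MongoClient, ASCENDING, DESCENDING

db = MongoClient("mongodb://tensordb.local:27017").tensordb

# "The last 1000 imported" training samples: re-evaluated at run time,
# so the cohort changes as data keep being generated
training_cohort = list(
    db.datasets.find().sort("import_time", DESCENDING).limit(1000))

# "The five models with the least training errors", to be validated next
candidate_models = list(
    db.models.find().sort("training_error", ASCENDING).limit(5))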

The Cohort Recruitment System

The cohort management system is in principle a data warehouse that stores training datasets and models. There are three kinds of data: 1) training sets, 2) model parameters, and 3) model architectures. The storage scheme separates metadata files and actual data. The query is performed on the metadata fields, while the actual data are stored as chunks in a database-based file system (GridFS). In our implementation, the raw data are stored as binary strings, which are the result of a serialization operation. Images and parameters are based on the NumPy serializer; models are based on the TensorFlow graph serializer. Software tools are required to extract the meta information, which is the role of the Data Importer. The meta information is designed to be information rich, containing as much valuable information as possible. For medical images, besides the information in the DICOM headers, it can also include semantic information such as the parts of the body scanned, the volume of the brain ventricle, and the grade of the tumour. This is greatly facilitated by a NoSQL database, which enables a very flexible data structure. For training data, we put both data and labels in one document. For models, the architecture and parameters are stored separately. The meta information of the parameters includes references to training datasets, upload time, training errors, epoch number and step time of backpropagation. The recruitment system is implemented as a query-based search on the meta information fields. These queries are used for finding the cases of interest, which is very valuable for the development of machine learning models, for instance for controlling the balance of positive and negative cases. A daemon process is invoked regularly to check if all machines are running their jobs properly and, in the event of a machine becoming unavailable or breaking down, redeploys the jobs on an available one.
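The sketch below illustrates this storage scheme: parameters are serialised with NumPy and stored as chunks in GridFS, while the query-able metadata is attached to the corresponding files document; the field names are assumptions made for the example.

import io
import datetime
import numpy as np
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://tensordb.local:27017").tensordb
fs = gridfs.GridFS(db, collection="parameters")

def save_parameters(weights, model_id, epoch, training_error):
    # Serialise a dict of NumPy arrays and store it along with its metadata
    buffer = io.BytesIO()
    np.savez(buffer, **weights)
    return fs.put(buffer.getvalue(), model_id=model_id, epoch=epoch,
                  training_error=training_error,
                  upload_time=datetime.datetime.utcnow())

def load_parameters(file_id):
    npz = np.load(io.BytesIO(fs.get(file_id).read()))
    return {name: npz[name] for name in npz.files}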

The Building System

The Building System treats training a machine learning model as a software build. By combining datasets and algorithms, the building system just runs model training forever. In the implementation of TensorDB, we define a collection called jobs that serves as a queue and load balancer. The job document contains the model architecture, the model's initialisation parameters, the training datasets and a time stamp which indicates when the job was generated. De facto, the Building System continuously creates a hyperparameter space exploration from the initialisation parameters and the constraints provided when creating the request. The Building System generates the hyperparameter space randomly at first and then follows the paths towards the best model accuracy after each iteration. There is also a status flag to indicate whether the job is ready, being processed or finished. The evaluation system shares the job queue with the building system, with different job types. The building software automatically accesses the database, finds an idle job in the job queue, and updates its status in an atomic manner. When a job is running, the training results of each step are recorded in the logging system. Each record document contains the model architecture, epoch number, time, accuracy, performance metrics, study id, and information about the hosting computer. After each epoch, the building system uploads the parameters into the cohort system, which updates the job fields with the latest update time, current epoch, and model accuracy.
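The sketch below illustrates the two database interactions described above: claiming an idle job from the queue atomically, and recording the outcome of each training step in the logging system; the collection and field names are assumptions made for the example.

import datetime
import socket
from pymongo import MongoClient, ReturnDocument

db = MongoClient("mongodb://tensordb.local:27017").tensordb

def claim_building_job():
    # Atomically pick one ready building job and mark it as being processed
    return db.jobs.find_one_and_update(
        {"type": "build", "status": "ready"},
        {"$set": {"status": "processing",
                  "host": socket.gethostname(),
                  "started_at": datetime.datetime.utcnow()}},
        return_document=ReturnDocument.AFTER)

def log_step(job, epoch, step, accuracy, metrics):
    # Record the outcome of a single training step
    db.training_log.insert_one({
        "job_id": job["_id"], "model_id": job.get("model_id"),
        "epoch": epoch, "step": step, "accuracy": accuracy,
        "metrics": metrics, "host": socket.gethostname(),
        "time": datetime.datetime.utcnow()})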

The Analysis and Visualization System

The Analysis and Visualization system explores the system's log data for insights into model development. Its applications include visualising the model learning speed and comparing the convergence or execution speed of different architectures. The requirements are very flexible and open many new opportunities, such as meta-learning, which needs to select the best architecture. Rather than providing tools, TensorDB facilitates such analyses. Finally, MongoDB being a popular database, many packages are available to support further data analysis and visualizations or to export the data to other external software.

Evaluation and Model Mining

Such a system is straightforward to implement with building blocks from the cohort system, the building system, and the analysis system. In principle, the Model Mining system selects representative test data and evaluates all the models. After the system finds the best model, it can also continue to improve it by recruiting new training sets and launching new training jobs.

4.3.4 Application Evaluation

We tested TensorDB with several challenging deep learning research projects. For our recent work on brain tumor segmentation, several models were required to segment high/low-grade tumors from multiple imaging modalities [MGAS12]. In order to test our models of interest, we performed a 5-fold cross-validation. The combination of modality, training sets and model architecture produced more than 100 models. Manual management, such as moving the data, models and parameters across machines, is error-prone and wastes a lot of programming and computing time by forcing the machines into unnecessary idle time. It is also very hard to monitor all the training processes in one view. The introduction of TensorDB solves these problems by centralizing the data in a database; tools are easily built on top and give the researcher a global view of the performance of the different models and architectures. The building and testing of the 100 models shared the same application code, with only the datasets differing, as specified by the query language. For distributed training, TensorDB was deployed as an independent remote server. The drawbacks of TensorDB are mainly from a performance perspective. Logging to the database is more time consuming than printing on the screen. However, for our image segmentation, the overhead is marginal. Each step consumes about 1.4 seconds and logging costs about 1.5 ms. An epoch consumes about 600 seconds, while the upload of model parameters to the database takes about 3 seconds and the download of parameters about 2 seconds. The bottleneck of the TensorDB solution is the loading of training data. Compared with loading the training data locally, loading the training dataset with the recruiting system is usually about 10 times slower. Loading about 6000 high-resolution images from the local disk costs about 30 seconds, while from TensorDB it costs about 300 seconds. If we constrain the loading time to 5% of a model update and the model finishes in 20 epochs, each job will have a time resolution of about two hours. In terms of the storage burden, after running a hyperparameter exploration for a week, the system generated about 500 million log data points, where each point was only about 20 bytes. Each log entry contains the current performance of the step and a status code. Those logs can be used by the researcher for model development insights. The training data size was fixed at about 30GB. The real storage pressure comes from saving all the model parameters of each epoch, which consumed about 300GB of storage. For analysis, the speed depends on the count of query results. When the count is less than 100,000, it usually takes about 4 seconds, while 10,000 records consume about 0.5 seconds. The system is not real-time but can perform analysis interactively.

4.3.5 Conclusion

TensorDB is a flexible framework to support machine learning research and development. The database-based solution and the search-based binding make the learning development more flexible and easier to manage. It allows the system to continuously update the training sets, the learning mode and the deployment in a fully autonomous manner. The current system has some performance issues, is not real-time and the loading is time-consuming. However, it is good enough for machine learning applications where models do not have to be updated more than 10 times a day.

Chapter 5

eTRIKS Analytical Environment with Privacy

In this chapter, we will present the use of location data in the context of public health research, the work done in the context of data privacy for location data and the extension of the eAE to support privacy preserving analytics. This work illustrates the modularity and extensibility capabilities of the eTRIKS Analytical Environment architecture.

5.1 Building Privacy capabilities

5.1.1 Location data as a support for public health

Diseases propagation using mobility data

Epidemic outbreaks are an important healthcare challenge, especially in low- and middle-income countries where they represent one of the major causes of mortality [MVL+12, BCJ+10]. To make matters worse, it has been shown that recurring outbreaks due to preventable infectious diseases hinder economic development even in developed countries [SKBBT09]. However, the prevention and containment of an infectious disease outbreak

can be greatly improved if the health care response and outbreak control measures can be focused on areas predicted to be at the highest risk of experiencing new outbreaks [VBS+06].

Human mobility is indisputably one of the main spreading mechanisms of infectious diseases [CZG+17]. Therefore, understanding human movement and mobility is important for forecasting, characterizing and controlling the temporal and spatial spread of infectious diseases. The prediction of the temporal evolution of epidemics, once outbreaks have progressed beyond a small initial group of cases, has greatly improved in recent years. However, predicting spatial transmission routes of epidemics has proven to be remarkably difficult, due to the importance of rare, long-distance transmission events [Ril07], limited data on population mobility, unknown population immunity levels [GF08], low sensitivity and specificity of case reports [VBS+06] and limited access to accurate and spatio-temporally resolved case data.

In the last decade, the widespread adoption of mobile phones and other ubiquitous technologies has generated vast amounts of high-resolution location data. The penetration rate of mobile phones (percentage of mobile numbers in use per 100 citizens) in developing countries can vary from 70% in Colombia to 99.9% in Senegal, while most developed countries are well above the 100% mark. Indeed, it is not uncommon for professionals in developed countries to have both a professional and a personal number. Prepaid SIM cards also contribute to the multiple lines people may own. The analysis of individual Call Detail Records derived from mobile phone data has provided plentiful insights into the quantitative patterns that characterise human daily life [BDK15]. Notably, mobile phone data have proven to be an excellent source to describe human trajectories at the finest scales, providing unprecedented details on individual mobility and highlighting some shared features, such as the high degree of predictability of individual patterns which coexists with strong heterogeneities of collective patterns [SKWB10, GHB08, PTB+17].

Based on this observation, several studies have shown that it is indeed possible to describe countrywide-scale infectious disease spread, even as individuals change location over time, using mobile phone call data records. Wesolowski et al. [WET+12] have successfully demonstrated that human movements contribute to the transmission of malaria on spatial scales that exceed the limits of mosquito dispersal. They used spatially explicit mobile phone data of nearly 15 million individuals over the course of a year and malaria prevalence information from Kenya to identify the dynamics of human carriers that drive parasite importation between regions. The maps (from their study) in Figure 5.1 highlight remarkably well how human mobility fosters the propagation of the parasite.

Figure 5.1: Sources and sinks of people and parasites from Wesolowski et al.’s [WET+12] study. Kernel density maps showing ranked sources (red) and sinks (blue) of human travel and total parasite movement in Kenya, where each settlement was designated as a relative source or sink based on yearly estimates. (A) Travel sources and sinks. (B) Parasite sources and sinks.

Their mobility data contained a high spatial resolution which allowed them to pinpoint particular settlements that are expected to receive or transmit an unexpectedly high volume of parasites compared with surrounding regions.

From that ascertainment, Bengtsson et al. [BGL+15] have shown that it is possible not only to model the spread, but also to predict its spatial evolution. Using mobile phone data, they have highlighted in their study that approaches that can rapidly target sub-populations for surveillance and control are critical for enhancing containment and mitigation processes during epidemics. Modern epidemic modelling recognises the central role of population structure and patterns of interactions and mobility, as components that can considerably influence the likelihood of disease propagation [RSM17]. The consideration of this complexity is the first step towards the integration of location data into traditional epidemic models.

Population displacement during crises using density data

Most severe disasters, such as earthquakes and tsunamis, cause large population movements. In 2017, there were 30.6 million new displacements associated with conflict and disasters across 143 countries and territories [iDM18]. Among those 30.6 million displacements, 18.8 million were directly linked to disasters, which represents the equivalent of 51,500 people being displaced each day. These movements hinder the ability of relief organizations to efficiently reach people in need. They can take place prior to events, due to early warning messages, or occur post-event as a result of damages to refugees, livelihoods and long-term reconstruction efforts. Those displaced populations are immensely vulnerable and often in urgent need of support (e.g. clean water, food, blankets, etc.). Timely and accurate data on the numbers and location of displaced populations are exceedingly difficult to collect across temporal and spatial scales, especially in the aftermath of disasters.

Nonetheless, similarly to disease propagation, mobile phone Call Detail Records were shown by Lu et al. [LBH12] to be a reliable data source for estimates of population movements after the 2010 Haiti earthquake. They used data from 1.9 million mobile phone users, ranging from 42 days before up to 341 days after the Haiti earthquake of 12 January 2010. They showed that the predictability of people's trajectories remained significant and even increased a little during the three-month period after the earthquake. In addition to that predictability, they noticed a strong correlation between people's destinations and their mobility patterns during normal times, and specifically with the locations in which people had significant social bonds.

The importance of CDR analytics in the humanitarian space was confirmed by Wilson et al. [WZESAe16] in the context of the 2015 Nepal earthquake. Their analyses unveiled national-level population mobility patterns and return rates which are extremely difficult, if not impossible, to acquire using other methods. They estimated that 390,000 people above normal levels left the Kathmandu Valley soon after the earthquake. Many of these moved to the highly-populated areas in the central southern area of Nepal and the surrounding areas. People who left their home area after the earthquake have gradually returned to the affected areas, with the return rate varying between regions.

These analyses are of tremendous relevance to humanitarian agencies, as density patterns can help identify where aid should be directed, and low return rates can identify areas where recovery and reconstruction work may not be progressing well. However, in both cases the analyses presented two critical limitations: the speed with which they were delivered, and the absence of privacy guarantees regarding the data used by researchers. Indeed, the first analyses for responders to the Haiti earthquake were distributed several months after the earthquake, which is far too late to support any early humanitarian response, while it took nine days to provide spatiotemporally detailed estimates of population displacements for the Nepal earthquake.

Those examples illustrate perfectly the need for an open-source, scalable, and privacy-preserving platform to support real-time key statistics computations from location data for a wide range of potential use cases.

5.1.2 Attempts at sharing location data

In order to exploit the potential of location data, attempts at sharing that data have been made in the past. Orange's Data for Development (D4D-Senegal) challenge [dMST+14] was an open innovation data challenge on anonymous call patterns of Orange's mobile phone users in Senegal. The challenge took place in 2014 and consisted of three mobile phone datasets. The datasets were built upon Call Detail Records of text exchanges and phone calls between January 1, 2013 and December 31, 2013 of more than nine million of Orange's customers in Senegal. The datasets are: (1) antenna-to-antenna traffic for 1666 antennas on an hourly basis, (2) fine-grained mobility data on a rolling 2-week basis for a year with Bandicoot [dMRP16] behavioural indicators at the individual level for about 300,000 randomly sampled users, and (3) one year of coarse-grained mobility data at the arrondissement level with Bandicoot behavioural indicators.

Despite the coarse granularity of those datasets, many noteworthy achievements have been made and several papers have been published. Martinez-Cesena et al. [MCMNS15] have demonstrated that, even with coarse spatial and temporal resolution data, it was possible to offer unprecedented insights into the spatio-temporal distribution of people, thus enabling efficient electricity infrastructure planning in rural areas where information on human activity is typically limited. Those pieces of information are immensely valuable as a basis for planning the development of electric power infrastructure in a country where 70% of the rural population has no electricity while close to 99% of all Senegalese carry cell phones. Other contributions include the optimization of road networks [WdACdRS18], poverty analysis [PDG15], public health [FGM+16] and the assessment of the impact of natural disasters and disease outbreaks in real time [WZESAe16, DLM+14, BGL+15]. After the success of the Senegalese challenge, Orange has started two other similar challenges in Ivory Coast [BEC+12] and Niger.

It can be mentioned that a similar initiative was carried out by Telefonica in the context of their Smart Steps project1, which aims at providing specific business and sales optimization through the analysis of travellers' behaviour.

Given such promising research paths, it is reasonable to assume that, with higher spatial and temporal resolutions for the data, a lot more could be achieved in terms of developing new indicators, improving the accuracy of the proposed models, improving public health and disease control, and facilitating the economic growth of developing countries. However, mobility traces are extremely sensitive. In 2009, the Electronic Frontier Foundation listed examples of sensitive information that can be inferred about an individual from his location history. These include the attendance of a particular church, meeting an AIDS counselor, or an individual's presence in a specific house or at an abortion clinic. These legitimate privacy concerns about the potential misuse of mobility data need to be addressed.

5.1.3 Sensitivity of location data

Even though mobility data, containing accurate user-level trajectories of visited places across long time periods, has been extremely valuable for researchers and organizations, and still holds significant potential for the public good as Orange's D4D-Senegal challenge highlighted, these advancements have been possible only thanks to the widespread adoption of mobile phones.

The penetration rate (percentage of adults in possession of at least one phone) in developing countries can vary from 70% in Colombia to 99.9% in Senegal, while most developed countries are well above the 100% mark. While these types of engagements offered evidence of the promise and demand, these modalities limit the full realization of Big Data's social potential as they fail to meet the standards that experiments such as MIT's OpenPDS [dMSWP14] initiative have shown to be possible: safe, stable, and scalable access to data for public good purposes.

1 https://www.wholesale.telefonica.com/en/services/digital/big-data/smart-steps/

Historically, data has been anonymised through de-identification, i.e. the process of transforming personal data to mask the identity of participants. Fully de-identified data is not considered personal data and can be shared or sold without limitations. However, a large body of research has shown that de-identification is not resistant to a wide range of re-identification attacks [Swe97, NS08, dMHVB13, CRT17, Ohm10]. This is especially true for geolocation data, as individual mobility traces are highly unique even among large populations, making them particularly vulnerable to re-identification. de Montjoye et al. [dMHVB13] showed that 4 spatio-temporal points are enough to uniquely identify 95% of people in a mobile phone database of 1.5M people, and that just a couple more points are needed if the data is heavily coarsened.

OPAL (for Open Algorithms) is a project with the goal of allowing private data to be used in privacy-conscious ways for good and of unlocking the potential of mobility data. This is achieved by fundamentally changing the paradigm of data release: rather than publishing (de-identified) data, OPAL stores the data in a protected environment and allows analysts to send queries about the data. OPAL's core consists of an extended version of the eAE with a new set of services and a set of open algorithms that can be run on the servers of partner companies, behind their firewalls, to extract key development indicators of relevance for a wide range of potential users. Since analyses are computed using fine-grained data, it is possible to achieve both better utility and stronger privacy compared to de-identification techniques.

A typical use case for the platform would be to compute the population density and mobility of a certain area for any given time interval, without releasing the full geolocation dataset to the analysts. Ideally, a query-based system such as this one would enjoy some important properties:

Secure. The platform is secure against penetration attacks that aim at gaining unauthorised access to data.

Privacy-preserving. The outputs to queries should consist of aggregated data and should never disclose private information of users whose records are in the dataset. This guarantee should hold when analysts obtain and combine outputs for multiple queries.

Flexible. Data analysts should be able to submit different queries that serve a large array of statistical purposes by enabling developers to propose new algorithms that can be loaded on the platform.

Open. The code of the platform and of the algorithms should be open-source. This allows for better security, privacy and utility, as everybody can review and contribute to the code, or build new algorithms on top of the existing ones.

5.2 Privacy preserving eTRIKS Analytical Environment

This work has been supported by the OPAL project2 and the platform is based on the implementation presented in Chapter 4. The platform is designed with the same four layers (Endpoints Layer, Storage Layer, Management Layer, and Computation Layer), as shown in Figure 5.2. However, in addition to the privacy features, new services to support additional features required by the OPAL project and two algorithms (population density and population mobility) have been implemented. The algorithms and the privacy module for density have mostly been contributed by the project, while the rest of the work is my own contribution.

5.2.1 New services and features

A new set of services and features have been developed to address the requirements of the project. Some services are generic (with application-specific configurations) while others are data and application specific. The specialization of some services was necessary to meet performance requirements.

2 https://www.opalproject.org/

Figure 5.2: A schematic representation of the architecture of the OPAL platform.

Generic components

Endpoints Layer

In the current configuration, the REST API in the Endpoints Layer provides the only public interface to run analyses over the data sets. A query is a request for running an analysis against the data available for the requested time interval, and an answer is the output of the requested analysis. Each query must contain the name of the analysis (e.g. density, mobility, etc.), the start and end dates for the data on which it needs to run and other parameters required by the analysis. The platform technically supports a temporal resolution for queries down to the second, but the operational resolution has been set to 10 minutes for privacy reasons that will be detailed in Section 5.2.3. The Interface service validates each query by checking that all the required parameters are available and well-defined. If the request is not well-formed, the platform rejects the request, sends back to the user which field is not compliant and why, and finally logs the illegal request into the Logging service. We used Another JSON Schema Validator (Ajv) [Evg18], which supports JSON Schema draft-07, to ensure both strong guarantees on the validity of the parameters and flexibility/extensibility in future developments of the platform.
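The sketch below illustrates the kind of query validation performed by the Interface service. The platform itself uses Ajv (JSON Schema draft-07) in its Node.js services; Python's jsonschema library and the schema fields shown here are assumptions made purely for illustration.

from jsonschema import Draft7Validator

query_schema = {
    "type": "object",
    "required": ["algorithm", "startDate", "endDate"],
    "properties": {
        "algorithm": {"enum": ["density", "mobility"]},
        "startDate": {"type": "string", "format": "date-time"},
        "endDate": {"type": "string", "format": "date-time"},
        "resolution": {"type": "string"},
    },
    "additionalProperties": False,
}

validator = Draft7Validator(query_schema)

def validate_query(query):
    # Return human-readable errors telling the user which field is not compliant
    return ["%s: %s" % ("/".join(map(str, e.path)) or "query", e.message)
            for e in validator.iter_errors(query)]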

The Cache service works in the same way as in the vanilla version of the eAE, i.e. it stores the answers for all the computed queries along with the query parameters and other metadata (date and duration of computation, etc.). However, a new feature has been added which enables answers to be signed with a private key to allow users to verify them a posteriori.

The Logging service has also been further extended to log more information for audit and billing purposes. The invalid queries are now logged to enable periodic analyses for detecting trends of any possible attack on the system (see Section 5.2.3). A new set of APIs has been added to enable users to retrieve the public logs of all valid queries that have been run on the platform. The valid queries are stored in an append-only text file, making it harder to tamper with without gaining physical access to the system, while the invalid queries are stored in the Storage Layer.

Management Layer

The Management Layer is exactly the same as described before as all necessary features were already available. For more details please see Chapter 4.

Application specific

Compute Layer

The Compute Layer has been extensively modified, and the Algorithm and the Aggregation and Privacy services have been added to the layer. The Compute Layer has a combination of security and privacy features which enables us to run privacy-preserving computations.

The platform relies on the MapReduce [DG08b] paradigm for defining and executing analysis algorithms (see Section 5.2.3). Each analysis algorithm is defined as a Map function that runs on the data of an individual user and a Reduce method that aggregates the results of the map over all the users. For example, a Map function could receive the Call Detail Records of a user and return the antenna ID that appears most often in the data. A Reduce method would then count the number of times each antenna occurs in the outputs and return the count for each antenna ID as the computation output. All the algorithms are audited and then added to the platform (see Section 5.2.3).
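A minimal sketch of this Map/Reduce convention is given below; the function names and the record format are assumptions made for the example (the fields follow Table 5.1).

from collections import Counter

def map_most_frequent_antenna(user_cdrs):
    # user_cdrs: the Call Detail Records of a single user
    antennas = Counter(record["antenna_id"] for record in user_cdrs)
    most_frequent, _ = antennas.most_common(1)[0]
    return most_frequent

def reduce_antenna_counts(map_outputs):
    # Count how many users have each antenna as their most frequent one
    return dict(Counter(map_outputs))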

The Map function runs in a sandbox environment. This ensures that the Map function interacts with only one user's data in one system process, which guarantees that the computations run independently over the users (see Section 5.2.3). Furthermore, it prevents the function from accessing any files besides those of the user being processed and from accessing any other system resources such as the system network.

Privacy mechanisms mitigate the risk that an attacker can infer details about an individual from an output or a combination of the outputs of various analyses on the platform. We provide a functionality to add algorithm-specific privacy modules, which helps maximize utility and privacy for each use case (see Section 5.2.3).

The Algorithm service manages and versions the analysis algorithms, the Compute service fetches data and executes the Map functions, and the Aggregation and Privacy service aggregates the outputs from the Compute service and applies privacy mechanisms on the aggregated result. Each Compute service is associated with a list of the types of jobs it can execute, e.g., Python2, Python3, R, Spark.

On receiving the execution request for a job J, the Compute node updates its status to BUSY, sets the status of J to SCHEDULED and sends an acknowledgement to the Management Layer. It fetches the algorithm requested by J from the Algorithm service and the data for the requested interval from the Storage Layer. A request is sent to the Aggregation service signalling the start of the aggregation for J and the aggregation method to be used. The Compute node then executes the Map function, for each user, in a sandbox, and the results are sent to the Aggregation service. At the end of the computation, a request is sent to the Aggregation service indicating the end of the aggregation for job J and the Compute node updates its status to IDLE. Upon receiving the aggregation end request, the Aggregation service runs the privacy algorithm on the aggregated result, saves the privacy-preserving output to the database and updates the status of the job to DONE.
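
This lifecycle can be summarised by the hedged sketch below; every service client and method name in it is a hypothetical placeholder rather than the platform's actual API.

    # Hedged sketch of the job lifecycle described above; all service client
    # objects and method names are hypothetical placeholders.
    def execute_job(job, algorithm_service, storage, aggregation, sandbox, management):
        management.update_compute_status("BUSY")
        management.update_job_status(job.id, "SCHEDULED")

        map_fn = algorithm_service.fetch(job.algorithm, job.version)
        users = storage.fetch_users(job.start_date, job.end_date, job.sample)

        aggregation.start(job.id, method=job.reduce_method)   # e.g. count, sum, median
        for user_records in users:
            result = sandbox.run(map_fn, user_records)        # one user per sandboxed process
            aggregation.add(job.id, result)
        aggregation.end(job.id)                               # triggers the privacy module, sets DONE

        management.update_compute_status("IDLE")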

Storage Layer The Storage Layer has been modified to manage the mobile phone data stored directly in the platform. Mobile phone data consists of pseudonymised CDRs captured by the telecommunication service provider and the GPS coordinates of antennas. A CDR contains 9 fields (see Table 5.1) that form the basis for the development of an algorithm on the platform. An antenna is defined by a unique antenna ID, location details (borough, commune, and region) and the presence interval (installation and removal times). Storing installation and removal times is essential for mobile antennas, which can be moved from one location to another, so the same antenna might have different locations at different times.

Field                 Definition
Timestamp             Datetime at which the record is captured by the telecom operator
User ID               Pseudonymised ID of the user whose record was captured
User Country          Country code of the user
Correspond ID         Pseudonymised ID of the corresponding user
Correspond Country    Country code of the corresponding user
Antenna ID            ID of the antenna the user was connected to when the interaction was initiated
Interact Type         Type of interaction: call or text
Interact Direction    Out if the user initiated the interaction, In otherwise
Duration              Duration of the call in seconds; -1 for a text

Table 5.1: Structure of a Call Detail Record.

The mobile phone data is pseudonymised and ingested periodically into the database by the system administrators. A small to medium-sized country generates billions of records for a year of data [dMST+14]. Each computation fetches the data within the requested time interval, which can typically range from hundreds to billions of records. The database for mobile phone data needs to scale to terabytes of data without a significant decrease in performance and to provide high-speed data retrieval for each concurrent request. Ingestion speeds can, however, be slower without compromising the overall performance of the platform. Based on these requirements and a detailed evaluation (see Section 5.2.2), we chose Timescale [Tim] for storing mobile phone data.

Data Flow The data flow is designed to meet strict privacy requirements. Figure 5.3 describes the flow of the data and the subsequent transformations it undergoes from the raw data to the output of a query.

Figure 5.3: A schematic representation of the flow of the data from the raw data to the platform's output: (1) pseudonymising and ingesting the data; (2) fetching data for compute and creating user-specific CSVs; (3) executing the Map function; (4) aggregating the outputs and applying the privacy mechanisms.

The raw data, extracted from the data curator's database, consists of Call Detail Records (CDRs) and antenna details. Records of users choosing to opt out are removed and the remaining data is distributed across multiple files. Multiple parallel workers are created for data ingestion. Each worker fetches a file from the shared queue, extracts the country code from the phone number in each record, pseudonymises the phone numbers, and adds the modified record to a list. Records in the list are ingested in batches. The batch size and the number of workers are tuned according to the system configuration. All pseudonymisation steps are done using a salt and the MD5 [Riv92] hash function. MD5 has a sufficiently large domain to avoid collisions while being small enough to minimise the impact on storage.
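
A minimal sketch of the salted-MD5 pseudonymisation step is shown below; the record layout, the salt handling and the country-code extraction are assumptions made only for illustration.

    # Minimal sketch of the salted MD5 pseudonymisation step (record layout,
    # salt handling and country-code extraction are assumptions).
    import hashlib

    def pseudonymise(phone_number: str, salt: str) -> tuple:
        """Extract a country code and replace the number with a salted MD5 digest."""
        country_code = phone_number.lstrip("+")[:3]   # illustrative extraction only
        digest = hashlib.md5((salt + phone_number).encode("utf-8")).hexdigest()
        return country_code, digest

    # Usage: user_country, user_id = pseudonymise("+221771234567", ingestion_salt)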

The Compute node creates a unique salt for each job execution and the fetched data is pseudonymised again using this salt. This ensures that distinct computations receive the data of the same user under different user IDs, making attacks that combine multiple queries harder to accomplish. Each record is joined with the antenna database to add the antenna location to the record. Precaution is taken to ensure that the record timestamp lies between the antenna installation and removal times. The Compute node stores the fetched data of each unique user in a separate CSV file to sandbox the computations for each user.

5.2.2 Scalability of the platform

In this section, we will evaluate the scalability and performance of the new services of the platform.

Database

One key requirement of the platform is to be able to store and serve mobile phone data efficiently in the range of billions of records (see Section 5.2.1). Several potential solutions already existed; after a thorough review, we narrowed the list to four candidates for further testing: MongoDB, Timescale, InfluxDB, and Druid. To evaluate the candidates, we conducted the following benchmarks:

1. Insert a month of mobile phone data (74 GB, the typical size of a month of data for a small to medium-sized country, stored in a single CSV file) in a single process with a batch size of 10,000, asynchronously whenever possible. The evaluated insertion time is the average of two runs.

2. Run 5 different select queries fetching records in random 5-minute intervals. Each retrieval fetched on average 60,000 records. The average time over all the queries is reported.

Each solution was deployed in a container either provided by the project or created manually. The benchmarks were executed sequentially using two identical machines (24 cores, 100 GB RAM, 7200 RPM disks); one hosted the containers and the other executed the scripts.

Database      Insertion time    Select time
MongoDB       13h               34m
Timescale     46h               0.7s
InfluxDB      34h               7s
Druid         >48h              16m

Table 5.2: Comparison of core operations for the four potential solutions considered for the database.

From Table 5.2, we can see that Timescale provides a select time at least an order of magnitude lower than the other solutions while still offering acceptable insertion times. Select performance is of key importance for the platform as it allows for faster computation of requests in a production environment. It is for these reasons that we selected Timescale as the database. By choosing Timescale, we also gain a standard SQL engine and the extensibility capabilities of PostgreSQL.

We then evaluated the performance of Timescale in a single deployment instance as the scale of the data increases. All the experiments were performed with Timescale 0.11, Postgres 10, Python 3.5 and asyncpg [Pyt], deployed as a Docker container on a 48-core machine with 189 GB of memory and 8.7 TB of HDD (RAID 5, 10k RPM). We had 6 months of data, comprising 8,778,751,539 unique records.

First, we evaluated the ingestion speed when storing an increasing number of records in the database. It is calculated as the total amount of data ingested divided by the total time taken. The raw data was stored in gzipped CSV files, each file containing an hour of data. Eight clients were created; each retrieved a raw file from a shared queue, uncompressed it and parsed one record at a time. Records were processed to pseudonymise the user and correspondent IDs, and inserted into the database in batches of 2 million records. The ingestion speed was monitored in the background. The final results are reported as an average over 3 runs.
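
A hedged sketch of what such batched insertion could look like with asyncpg is given below; the table name, column names and connection string are hypothetical, and the production ingestion code is necessarily more involved.

    # Hedged sketch of batched insertion with asyncpg (table name, columns and
    # DSN are hypothetical; records are assumed to be parsed tuples).
    import asyncio
    import asyncpg

    async def ingest(records, dsn="postgresql://user:pass@localhost/cdr", batch_size=2_000_000):
        conn = await asyncpg.connect(dsn)
        try:
            for start in range(0, len(records), batch_size):
                batch = records[start:start + batch_size]
                await conn.copy_records_to_table(
                    "cdr", records=batch,
                    columns=["ts", "user_id", "user_country", "corr_id",
                             "corr_country", "antenna_id", "itype", "idir", "duration"])
        finally:
            await conn.close()

    # Usage: asyncio.run(ingest(parsed_records))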

We observe in Figure 5.4 that the ingestion speed remains essentially stable at around 13 MB/s as the amount of data ingested increases from 0 to 973 GB. This stability can be attributed to the transparent time-space partitioning provided by Timescale [Tim]. Overall, 6 months of data were inserted in less than a day.


Figure 5.4: The insertion performance of Timescale as a function of the amount of data inserted. We observe that the speed remains essentially stable as the amount of data being ingested increases.


Figure 5.5: The data fetching performance of Timescale as a function of the time interval of the query. We observe that fetching overheads are significant for smaller queries and decrease as the query interval size increases.

Then, we evaluated the selection speed provided by the database as we fetch data over a range of time intervals. We fetched data for 20 queries for each of 6 different interval lengths (30 minutes, 1 hour, 6 hours, 1 day, 7 days and 30 days) and measured the time taken from sending a query to completely parsing the result, as well as the number of records fetched. Figure 5.5 shows the speeds to be largely similar as the intervals go from 1 day to 1 month. The lower speeds for smaller interval lengths can be attributed to the database overheads incurred during selection, which become negligible as the interval length increases.

Compute Scalability

We evaluated the scalability of the Compute service to process large amounts of data while using the Codejail and AppArmor sandbox. This experiment was conducted with two weeks of data containing over 44 million records for 320,000 users. We measured the time to compute density for 3 different intervals (1 hour, 1 day and 1 week) over 3 different user sampling parameters (1%, 10% and 100%). Each compute fetches the data of users in batches (of size 50,000) and the processing happens in parallel. We used 6 workers for processing, 1 for fetching data and 1 master process. The experiment was done on a single machine (8 cores, 64 GB RAM, 7200 RPM disks) with the complete platform deployed, and requests were sent to the Interface service for the query to be processed.

In Figure 5.6, we plot the time taken for the computation and the number of users in each interval. Figure 5.6 shows that the increase in compute time for intervals of 1 hour to 1 day to 1 week is directly proportional to the number of users in that interval. This behaviour is attributed to the fact that sandboxing is the bottleneck in the computation. Sandboxing requires each user's data to be saved as a distinct file, and a new process is spawned to process the data of each user, making it IO and CPU intensive. It takes less than an hour and a half to compute density for all the users for a week of data, while the same analysis on 10% of the users over the same period requires less than 9 minutes. Thus, sampling is well suited to scenarios where quick answers are required and utility can be slightly compromised.


Figure 5.6: Time taken for the computation and number of users for various interval range sizes and sampling parameters. Sampling parameters used: blue: 1%, red: 10%, brown: 100%.

5.2.3 Privacy of the platform

This platform belongs to the class of query-based systems, offering data analysts a remote interface to ask questions and receive answers aggregated from several, potentially many, records. Granting access to the data only through queries, without releasing the underlying raw data, mitigates the risk of typical re-identification attacks [Swe97, NS08, dMHVB13, dMRSP15, CRT17, Ohm10]. Yet, a malicious analyst can often submit a series of seemingly innocuous queries whose outputs, when combined, will allow them to infer private information about participants in the dataset [DSSU17, GHRdM18]. In Section 5.2.6, we give an overview of the literature on privacy attacks and privacy-preserving mechanisms.

In this section, we describe how the platform protects privacy in query outputs, so that no individual-level data is leaked to the analyst. This is achieved by combining several protection layers that, together, significantly reduce the risk of personal data leakage.

User authentication When the platform receives a query, the request is first authenticated. A unique token is associated with each user account and must be supplied through the request headers. Each user is assigned an access level during their registration. The access level defines the potential restrictions on the user: which algorithms they can access, limits on the spatial and temporal resolution requested in each query, and other settings defined by the data curator.

Limited number of queries Many attacks on privacy employ a relatively large number of queries to circumvent privacy protections, e.g. by averaging out noise [DSSU17]. To mitigate this risk, the platform includes a query rate limitation mechanism. This ensures that any analyst can submit only a limited number of queries in a certain time period, defined by the curator (e.g. 100 queries in 7 days by default).
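
A minimal sliding-window sketch of such a rate limit (default: 100 queries per 7 days) is shown below; the state is kept in memory purely for illustration, whereas the platform would persist it.

    # Minimal in-memory sketch of the per-analyst query rate limit
    # (default: 100 queries per 7 days); the platform persists this state.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 7 * 24 * 3600
    MAX_QUERIES = 100
    _history = defaultdict(deque)   # analyst token -> timestamps of accepted queries

    def allow_query(token: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        window = _history[token]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()        # drop timestamps outside the sliding window
        if len(window) >= MAX_QUERIES:
            return False
        window.append(now)
        return True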

MapReduce The platform relies on the MapReduce [DG08a] paradigm for computation. The Reduce methods currently supported are count, sum and median. The use of MapReduce helps privacy in two ways. First, it is easier to audit algorithms, since the Map function is applied independently to each user in the dataset, which makes it very hard for an attacker to hide conditions that try to re-identify specific records. For example, an IF statement that checks whether the user made a call with duration 5m23s on 10/07/2018 would look suspicious and raise a flag in the auditing phase. Second, the MapReduce paradigm ensures that every output is the result of a final aggregation step (i.e. the Reduce method). For example, if the count function is selected, this ensures that every user can contribute at most 1 to the final output. While this is not enough to guarantee privacy, as we explain in the next section, it offers a first layer of protection and simplifies the design of additional privacy-preserving measures.

Algorithm auditing All algorithms are evaluated by a committee before being installed on the platform. Only system administrators can install an algorithm. Furthermore, if an algorithm needs to use its own privacy module (see below), a special token has to be passed in the request body and is verified against the token in the Algorithm service configuration file. This token is made available only to algorithm auditors and hence ensures that no algorithm with a custom privacy module is added without being audited. The audit phase includes automatic and semi-automatic techniques, such as fuzz testing.

Query monitoring The Logging service in the Endpoints Layer provides APIs for accessing the valid and the invalid requests made on the system. These logs will serve as the basis for future query monitoring algorithms to identify any potentially suspicious request or sequence of requests.

Privacy module Every algorithm consists of two components:

1. The analysis algorithm, composed of the Map function and Reduce method (see Section 5.2).

2. A privacy module that provides privacy-mechanisms for the query. It is not essential for each algorithm to have a privacy module. If not provided, the platform uses a default noise addition and query suppression mechanism.

The privacy module ensures that the final output of each query does not disclose personal data. This is typically achieved via noise addition, query set size restriction, and other techniques. The privacy module can provide differential privacy [DMNS06] or any other privacy protection that the developer wants to implement. Although solutions for general-purpose privacy-preserving data analytics have been proposed [McS09, PGM14, RSe10, Mea12, Fra17, JNS17], they present limitations in terms of utility, flexibility, or privacy [JNS17, McS18, GHRdM18]. Algorithm-specific techniques can give strong privacy guarantees and yield accurate results, but need to be designed and tuned for each new algorithm. The platform allows every algorithm to include a privacy module specific to that algorithm, allowing developers to achieve a better privacy/utility tradeoff in their algorithms. In the next sections, we present the two algorithms developed for the platform and a privacy module for the density algorithm that protects privacy while providing good utility. In particular, the algorithm enforces geo-indistinguishability [ABe13], a variant of differential privacy.

5.2.4 Algorithms on the platform

Two algorithms have been developed and deployed for the platform: the mobility algorithm and the density algorithm. These two algorithms present a high utility potential and are compatible with the framework of the platform on both the privacy and the computing (MapReduce) aspects. In the next section, we will give a formal verification of the privacy guarantees of our population density algorithm. Research is still ongoing to provide similar guarantees for the mobility algorithm, as this problem has not been solved yet. This is why, in order to avoid any privacy breach, only the most trusted users are given access to this algorithm on the platform.

The population mobility algorithm

The mobility algorithm is used to release the number of users who moved from a certain area during time interval T1 to another area during time interval T2. The algorithm accepts exactly seven parameters from the analyst: resolution, keySelector, startDate1, endDate1, startDate2, endDate2, sample.

The keySelector is a list of directed 2-tuples of areas (e.g. area1.area2, which translates to mobility from area1 to area2) for which the mobility needs to be computed. As before, an area could be, for example, the ID of a specific borough or the name of a city. The resolution parameter selects the spatial resolution of the requested locations. There are three different resolution levels: borough, commune and region. The list of all available areas and corresponding resolution levels is made available in the API documentation.

The startDate and endDate parameters specify the two time intervals of interest. Both parameters can select any day and any time of the form hh:m0:00. Finally, the sample parameter is a value between 0 and 1 that specifies the (random) fraction of users sampled by the algorithm to compute the query on. There are four available values: 0.01, 0.1, 0.25, 1. A larger sample parameter yields better accuracy but requires more time to compute, as the platform needs to process more users and more records per user. In our mobility algorithm, sampling is not used to improve privacy guarantees.

The output of the mobility query is a list of (key, value) pairs, where each key is one of the 2-tuples of areas specified in the keySelector and the value represents the number of users who moved from a certain area during time interval T1 to another area during time interval T2. Note that a user might visit multiple locations in the same time interval, but our algorithm adopts a winner-takes-all approach, i.e. it assigns each user to at most one area (the one where the user spent most of the time in the requested interval). A user can thus contribute to the count of only one element of the keySelector.

The population density algorithm

The density algorithm is used to release the number of users who spent most of their time in a certain area in a given time interval. The algorithm accepts exactly five parameters from the analyst: resolution, keySelector, startDate, endDate, sample.

The keySelector is a list of areas for which the density needs to be computed. This could be, for example, the ID of a specific borough or the name of a city. The resolution parameter selects the spatial resolution of the requested locations. There are three different resolution levels: borough, commune and region. The list of all available areas and corresponding resolution level is made available in the API documentation.

The startDate and endDate parameters specify the time interval of interest. Both parameters can select any day and any time of the form hh:m0:00. Finally, the sample parameter is a value between 0 and 1 that specifies the (random) fraction of users sampled by the algorithm to compute the query on. There are four available values: 0.01, 0.1, 0.25, 1. A larger sample parameter yields better accuracy but requires more time to compute, as the platform needs to process more users and more records per user. In our density algorithm, sampling is not used to improve privacy guarantees.

The output of the density query is a list of (key, value) pairs, where each key is one of the areas specified in the keySelector and the value represents the number of users that spent most of their time in that area during the specified time interval. Note that a user might visit multiple locations in the same time interval, but our algorithm adopts a winner-takes-all approach, i.e. it assigns each user to at most one area (the one where the user spent most of the time in the requested interval). A user can thus contribute to the count of only one element of the keySelector.
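
A minimal sketch of this winner-takes-all assignment is given below; the record field name is illustrative and the per-record area lookup is assumed to have been done already.

    # Minimal sketch of the winner-takes-all assignment described above
    # (record fields are illustrative; areas are assumed to be attached already).
    from collections import Counter

    def majority_area(user_records):
        """Return the single area where the user appears most often in the interval."""
        counts = Counter(r["area"] for r in user_records)
        area, _ = counts.most_common(1)[0]
        return area

    def density_counts(per_user_records, key_selector):
        """Each user contributes at most 1 to exactly one area of the keySelector."""
        assigned = Counter(majority_area(recs) for recs in per_user_records if recs)
        return {area: assigned.get(area, 0) for area in key_selector}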

We denote by density(L, T1, T2, ρ) the output of the density algorithm for the location L in the interval [T1, T2] with sampling parameter ρ. To simplify the exposition, in the rest of this section we present the details of density when a single location L is selected. L denotes a generic geographic area at an arbitrary resolution, and we denote by 𝓛 the set of locations at the resolution of L; for example, 𝓛 could be the set of cities or the set of regions. If keySelector contains a list of locations (L1, ..., Ln) that belong to 𝓛, the density algorithm (including its privacy module) is run independently on each Li.

5.2.5 Privacy module for density

We now present our design of the privacy module for the density query. This serves as an example to demonstrate the flexibility of the platform and offers a starting point for the development of other algorithms.

The density algorithm is based upon three layers of protection: geo-indistinguishability, query set size restriction, and output noise addition. Each layer relies on a solid body of established literature to maximise the privacy guarantees, while their specific combination, and the respective implementations, constitute a novel attempt at tackling the privacy issues of the density algorithm in a practical way.

Geo-indistinguishability Geo-indistinguishability (GI) is a formal notion of location privacy [ABe13] used to obfuscate single user locations, and is a variant of differential privacy (DP) [DMNS06]. GI is not a property of the data, but rather of a randomised algorithm that takes location points as input (and generally outputs location points). Intuitively, given as input a location x, the output y is determined according to a probability distribution that ensures privacy at a level that decreases exponentially with the distance from x. More concretely, if x is located in Dakar, y will likely point to a different location within Dakar, but it is extremely unlikely that y will point to a different city. In practice, this can be achieved by adding two-dimensional random noise to the input location, drawn from a specific distribution.

The formal definition of GI depends on a free parameter ε that essentially controls how quickly the privacy guarantees vanish for points that are far from the true location. The parameter ε is called privacy loss.

Definition 5.1 (ε-geo-indistinguishability) Let X be a set of locations and let A : X → X be a randomised algorithm. Denote by d(·, ·) the Euclidean distance. A satisfies ε-geo-indistinguishability if, for all x, x′ ∈ X and S ⊆ X,

Pr[A(x) ∈ S] ≤ e^(ε·d(x, x′)) · Pr[A(x′) ∈ S].

Definition 5.2 (Planar Laplace distribution) Let ε ∈ R⁺ and x ∈ R². The planar Laplace distribution centred at x is the probability distribution on R² with pdf

D_ε(x)(x′) = (ε² / 2π) · e^(−ε·d(x, x′)).

Consider the mechanism that, on each input x, outputs a value x′ drawn from D_ε(x)(·). It can be proven that this mechanism satisfies ε-GI. In practice, the obfuscated location x′ is obtained by adding to the true location a noise value sampled from the planar Laplace distribution centred at zero (see [ABe13]).
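
In practice, planar Laplace noise can be sampled by drawing an angle uniformly and a radius via the inverse radial CDF, which involves the Lambert W function, following the standard construction of Andrés et al. [ABe13]. The sketch below illustrates that construction; it is not necessarily the platform's exact implementation.

    # Sketch of the standard planar Laplace sampling construction from [ABe13]
    # (not necessarily the platform's exact implementation).
    import numpy as np
    from scipy.special import lambertw

    def planar_laplace_noise(epsilon: float, rng: np.random.Generator):
        """Draw a 2-D noise vector whose radius follows the planar Laplace radial CDF."""
        theta = rng.uniform(0.0, 2.0 * np.pi)
        p = rng.uniform(0.0, 1.0)
        # inverse radial CDF: r = -(1/eps) * (W_{-1}((p - 1)/e) + 1)
        r = -(1.0 / epsilon) * (lambertw((p - 1.0) / np.e, k=-1).real + 1.0)
        return r * np.cos(theta), r * np.sin(theta)

    def obfuscate(x, y, epsilon, rng):
        dx, dy = planar_laplace_noise(epsilon, rng)
        return x + dx, y + dy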

In density, the location associated with every CDR is obfuscated using planar Laplace noise with parameter ε, producing a sanitised database (this method is discussed in [Bor14]). In our implementation, we sanitise the data on the fly, using pseudo-random generators, to avoid the need to store a sanitised copy of the database and to give more flexibility to developers (see the next paragraphs). Nevertheless, to simplify the exposition, we sometimes use the expression “sanitised database” to refer to the database that we would obtain if we saved every obfuscated user location while executing the algorithm.

Query set size restriction The algorithm counts how many users spent most of their time in the selected area. The count is computed on the dataset sanitised with GI. If the count for a certain area is below a fixed threshold B, then the count is suppressed and the value associated with that area is set to ValueTooLow in the output.

Output noise addition Finally, density adds random noise to every count that is not suppressed by the query set size restriction. The random noise value is drawn from a normal distribution N(0, σ²) and is sampled using a pseudo-random number generator. The seed is set to:

seed = hash(L, T1, T2, ρ, salt), where salt is a long string known only to the data curator but accessible from any algorithm, and the default hash function is SHA-512. This ensures that the noise value is the same for a given area and time interval, and hence cannot be averaged out by repeating the same query [FPEM17].
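
The sketch below shows how such deterministic output noise could be derived; the use of NumPy, the parameter encoding and the seed-derivation details are assumptions, with only the SHA-512 hash and the seeding idea taken from the text.

    # Sketch of the deterministic output-noise step (SHA-512 seed as in the text;
    # the NumPy RNG and the exact parameter encoding are assumptions).
    import hashlib
    import numpy as np

    def seeded_gaussian_noise(area, t1, t2, rho, salt, sigma=10.0) -> float:
        material = f"{area}|{t1}|{t2}|{rho}|{salt}".encode("utf-8")
        digest = hashlib.sha512(material).digest()
        seed = int.from_bytes(digest[:8], "big")      # derive a 64-bit seed
        rng = np.random.default_rng(seed)
        return float(rng.normal(0.0, sigma))

    # The same (area, interval, sample, salt) always yields the same noise value,
    # so repeating an identical query cannot average the noise away.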

The full algorithm is presented in detail in Procedure density. In the algorithm, the majority element (or location) of a list is the element with the most occurrences. If there is a tie, the majority element is selected at random among the most frequent ones.

Choice of privacy parameters. The density algorithm depends on three parameters: ε, B and σ. As these parameters determine the privacy protections of the mechanism, they must be fixed by the data curator. The defaults are:

• ε = 10 km⁻¹. Intuitively, this means that locations within a distance d of the true location remain statistically indistinguishable from it up to a factor e^(εd); for the exact meaning of this guarantee, we refer to [ABe13]. In particular, the expected distance between the true and the obfuscated location is 2/ε, i.e. 200 metres for our default choice of ε.

• B = 50. We believe that such a high value does not affect utility significantly, as data analysts are not generally interested in precise density values for populations smaller than 50 individuals.

Procedure density(L, T1, T2, ρ; ε, B, σ)

Input: Defined by the analyst: location L ∈ 𝓛, time start T1, time end T2, sampling parameter ρ.
       Defined by the data curator: GI parameter ε, minimum threshold B, noise standard deviation σ.
Output: Number of sampled users who spent most of their time in L during [T1, T2].

    D ← fraction ρ of all users, selected at random
    density ← 0
    for i ← 1 to |D| do
        {x1, ..., xn} ← locations with timestamps of user_i between T1 and T2
        for j ← 1 to n do
            seed ← hash(user_i, x_j.time, x_j.loc, salt)
            x′_j ← random location drawn from the planar Laplace pdf D_ε(x_j)(·) seeded with seed
            l_j ← nearest location to x′_j in 𝓛
        end for
        for each 10-minute interval in [T1, T2], select the majority location of the user in that interval
        {L1, ..., Lm} ← majority locations, one for each 10-minute interval
        L* ← majority location of {L1, ..., Lm}
        if L* = L then
            density ← density + 1
        end if
    end for
    if density < B then
        return ValueTooLow
    else
        seed ← hash(L, T1, T2, ρ, salt)
        noise ← random value drawn from N(0, σ²) seeded with seed
        return density + noise
    end if

• σ = 10. Seen as an application of the Gaussian mechanism [DMNS06], this corresponds to the protection level guaranteed by (ξ, δ)-differential privacy with privacy loss ξ = 0.6 and parameter δ = 2.5·10⁻⁷ for a single query, which is generally believed to provide meaningful privacy [DS10]. Note that density does not enforce DP in general, as we do not limit the privacy budget over multiple queries. We discuss this in more detail later on.

The utility of density. The density algorithm presents several advantages for utility, especially with respect to the accuracy of the outputs and the running time. We tested density on a real mobility dataset with millions of users and thousands of antennas, using antenna-level resolution and setting ε = 10 km⁻¹. We ran density across all antennas over 100 different time intervals of length 1 hour, 6 hours, 1 day, 3 days and 10 days. When σ = 0 (i.e. there is no output noise addition), the average relative error between density outputs larger than B = 50 and the true density values was between 3% and 6% for all interval lengths. Setting the default σ = 10, the average relative error goes up to 6-9%. This is because, even for longer time intervals, a significant fraction of the outputs are rather small, and the parameter σ of the Gaussian noise is not scaled to the output; this is needed to provide stronger privacy guarantees [DR13].

As for computational efficiency, the average run-time overhead due to the GI sanitisation is only 3%.

The privacy guarantees of density. The density algorithm relies on GI as a formal notion of privacy. The specific choice of GI as an obfuscation method is due, among other things, to some of its mathematical properties. Most notably, it abstracts from the adversary’s background knowledge and the total privacy loss grows naturally (linearly) with the number of observed user locations [ABe13]. Additionally, GI’s guarantees are preserved under multiple queries and specifically they are compatible with sampling: running the same query with different sampling parameters does not affect the privacy protection provided by GI.

Moreover, they hold even if we assume that several malicious analysts are colluding, because the output for each query is always the same, independently of the analyst who submits it. Hence different analysts do not gain any advantage by combining the outputs they obtain (although for colluding attackers it might be easier to bypass the intrusion detection system).

Although inspired by DP, GI is fundamentally different. It is easy to check that simply releasing aggregate statistics computed on a dataset sanitised with GI does not enforce DP. Similarly, the density algorithm is not designed to enforce DP. Although the addition of Gaussian noise to each query output ensures (ξ, δ)-DP on that output, the total (theoretical) privacy loss for DP increases linearly with the number of queries. In the current implementation of density, we do not limit the privacy budget, hence the theoretical total privacy loss for DP is unbounded. Note, however, that the overall rate of queries per analyst is limited by the platform’s Interface service.

Applying DP to mobility data provides very strong guarantees, but it is extremely challenging to preserve utility and flexibility. One example is the mechanism by Acs et al. [AC14] to release the population density in Paris. While the proposed solution enforces DP, it presents two important limitations that make it impractical for our use case. First, the data is available only for a period of one week. This makes it possible to pre-sample a limited number of locations for each user, improving privacy while preserving good utility. In contrast, our algorithm can be used to query data that spans several months or even years. Second, the density is computed for slots of one hour. To obtain the density across larger time frames, one would then take the sum over the selected one-hour slots. However, this leads to biased estimates for larger time intervals, as the same user may be counted multiple times in different slots.

Attacks. We now present two attacks with different threat models to illustrate strengths and limits of GI and density in practice.

Suppose that a user (the victim) was the only user potentially connected in a specific region R for a relatively long time frame. Suppose that this is known to the attacker, but she does not know whether the victim was indeed in R. A completely accurate density query for that time frame would output 0 or 1, revealing to the attacker the presence (or absence) of the victim. Now consider density_{ε=10, B=0, σ=0}, i.e. the density algorithm that enforces only GI with ε = 10 km⁻¹, without query set size restriction or output noise addition. This algorithm already thwarts the attack: thanks to GI, the victim's noisy location might fall in a different area, while the noisy locations of other users might be assigned to R, making the output of density_{ε=10, B=0, σ=0} useless to the adversary. The probability that the attack is thwarted can be controlled with a rigorous choice of ε that takes into account the geography of the surrounding regions.

As an edge-case variation, assume that an attacker has no specific background information about the victim, but has access to the full dataset sanitised with GI, minus the full trace of the victim. Additionally, suppose the attacker can submit an unlimited number of density_{ε=10, B=0, σ=0} queries to the platform, which runs them on the full dataset (which includes the victim's records). By submitting many such queries for varying time intervals, the adversary can easily recover the victim's entire sanitised trace. This is done by comparing the platform's outputs with counts on the dataset without the victim (which differ by 0 or 1). Although the locations inferred by the attacker are obfuscated, obtaining the full obfuscated trace (or a significant portion of it) constitutes a risk for privacy [dMHVB13].

Such a scenario is very unlikely, but the design of density includes default output noise addition and query set size restriction to mitigate the risks. These two additional measures also make the second attack very hard to perform, although they do not completely rule out all attacks by even more powerful adversaries (i.e. adversaries with even more detailed background knowledge). GI alone does not provide near-perfect privacy in the way differential privacy does, but we employ GI as a measure of risk in worst-case scenarios. Together with output noise addition, query set size restriction and the platform's defence-in-depth measures, we are convinced that density prevents inference attacks in realistic, non-pathological settings.

5.2.6 Related work

A large range of attacks on query-based systems have been developed since the late 1970s [Den78, Bec80]. Most of these attacks show how to circumvent privacy safeguards (e.g. query set size restriction and noise addition) in specific setups. In 2003, Dinur et al. [DN03] proposed the first example of an attack that works on a large class of query-based systems. Since then, numerous other attacks have been proposed in the literature. These attacks address different limitations of previous ones, particularly the computational time required to perform them. A recent survey by Dwork et al. [DSSU17] gives a detailed overview of attacks on query-based systems.

Privacy research has been increasingly focused on providing provable privacy guarantees to defend query-based systems against such attacks. However, the development of a privacy-preserving platform for general-purpose analytics is still an open problem [JNS17].

General-purpose analytics usually refers to systems that allow analysts to send many queries of different types, using a rich and flexible query language. Some solutions based on differential privacy have been proposed, the main ones being PINQ [McS09], wPINQ [PGM14], Airavat [RSe10], and GUPT [Mea12]. All of these systems, however, present limitations [JNS17], e.g. in simplicity of use for the analyst, who must provide additional query parameters that interface with the differential privacy implementation. In particular, Airavat is based on MapReduce, like this platform, and enforces differential privacy by using a simple application of the Laplace mechanism [DMNS06]. Like other general differentially private mechanisms, a straightforward application of the Laplace mechanism often destroys the utility of the data for multiple queries [DR13]. Specifically, every aggregation method supported by the platform (count, sum, median) could easily be made differentially private with the standard Laplace or exponential mechanisms [DMNS06, MT07], but this solution would require adding a lot of noise to the outputs in order to provide meaningful guarantees.

In 2017, Johnson et al. [JNS17] proposed a new framework for general-purpose analytics, called FLEX. FLEX enforces differential privacy for SQL queries without requiring any knowledge about differential privacy from the analyst. However, the actual utility achieved (i.e. the level of noise added) by the current implementation of FLEX has been questioned [McS18]. Diffix, a patented commercial solution that acts as an SQL proxy between the analysts and a protected database [FPEM17, FPEO+18], has recently been proposed as an alternative to differential privacy. However, Diffix's anonymisation mechanism has been shown to be vulnerable to some re-identification attacks [GHRdM18].

5.3 Discussion and future work

The current implementation of the architecture presents some limitations.

System. There is currently no live ingestion of new data, as the data is loaded periodically in bulk. This limits the platform's real-time capabilities, e.g. to monitor a crisis and provide adequate information to search and rescue parties. However, this could easily be solved by bridging the platform with the CDR databases of the telecommunication companies, as there is no technical limitation on the platform itself. The current implementation is targeted to scale up to medium-sized countries (e.g. up to 50M people). In order to scale up to larger countries, further work would be required on the database side to ensure the efficient storage and retrieval of the data across the different workers, and to further improve the computation performance in the sandboxed environment. Furthermore, managing the platform currently requires the administrator either to go through the platform API or to connect directly to the physical servers. A more administrator-friendly set of tools could be implemented to facilitate platform monitoring (status of the services and privacy alerts). On the analytics side, moving from antenna locations to a geographic grid (e.g. a Voronoi tessellation) would make it possible to add a geo-semantic layer (forest, rural/urban, etc.) and to allocate presence probabilities. It would also address the concerns of some telecommunication companies which are not authorised to provide antenna locations.

Privacy. A strict cache, such as the one currently in place, presents a larger attack surface than a cache capable of detecting query similarity. Such a query similarity engine could also detect unusual query patterns and be used as another defence against potential attacks from malicious users. A more generic approach to privacy on the platform is another challenge. At the moment, generic noise addition, strict caching and query set size restriction offer only a basic layer of protection, and privacy is mainly provided by the privacy module, which is algorithm-specific. General privacy-preserving mechanisms that apply automatically to the outputs of any query could simplify the development of new algorithms. However, preserving good utility for general-purpose analytics is still an open research problem (see Section 5.2.6).

Chapter 6

Analytics Developed using the eTRIKS Analytical Environment

6.1 Analytics for tranSMART

To illustrate how the eTRIKS Analytical Environment can be used for managing and analysing large-scale translational research data in tranSMART, we implemented three bioinformatics analysis pipelines to demonstrate the performance of the proposed architecture: a) an iterative model generation and cross-validation pipeline for biomarker identification, b) a general statistical analysis pipeline for hypothesis testing, and c) a pathway enrichment pipeline using KEGG. Unlike the other two, the pathway enrichment pipeline can either form part of the iterative model generation pipeline or be run as a pipeline in its own right.

Each pipeline was implemented in the same fashion: the code was prototyped locally in a container (to ensure that the code operated as expected, using a subset of the data or a smaller number of iterations) and the full computation was then submitted to the central clusters. All these workflows were designed to be highly parallelisable and, in order to enable their seamless scalability, Spark was chosen for their implementation.


6.1.1 Iterative Model Generation and Cross-validation Pipeline

This pipeline is a generalisation of, and a more robust approach than, the traditional ones used in translational research (for example in the context of identifying the gene signature for stage II colon cancer [KBe11]). Commonly, translational research approaches focus on a specific algorithm and on published work to reduce the scope of genes in the analysis, which inherently creates a bias. In addition, it is not always certain that cross-validation has been carried out during the analysis to test the robustness of the model found. The iterative model generation and cross-validation pipeline, as described in Figure 6.1, aims at providing a robust and unbiased method for biomarker identification thanks to a massively parallel computational approach. Furthermore, the pipeline implementation illustrates the seamless scalability of the eAE: using the eAE, the pipeline scales at the same rate as the underlying hardware, a crucial aspect given the massive amounts of data involved.

In clinical trials, collecting further samples may be hazardous, costly or even impossible. In these cases, cross-validation is a powerful approach to avoid testing hypotheses suggested by the data (so-called “Type III errors” [Mos48]). Cross-validation is a model validation technique for assessing how a statistical or computational model will generalise to an independent data set. It is mainly used in settings where prediction is the main objective and one aims to estimate how well a predictive model will perform in practice. In a prediction problem, a model is normally given a data set of known data on which the training is run (i.e., the training set), and a data set of unknown, or first-seen, data against which the model is tested (i.e., the testing set). To reduce variability, different partitions of the data can be defined and multiple rounds of cross-validation can be performed on them.

In practice, many statistical and computational approaches can be used for model generation. Possible candidates include (but are not limited to): linear or nonlinear Support Vector Machines (SVM), Logistic Regression, Linear Regression, Alternating Least Squares, Lasso, etc. Non-parallelisable algorithms can be used as well in this context. Indeed, instead of distributing subsets of the data across different nodes (first type of parallelisation) to parallelise the computation of the model, we can distribute different sets of parameters (second type of parallelisation) to every worker to generate a different model each time.

Figure 6.1: Iterative model generation and cross-validation pipeline.

The drawback of the second type of parallelisation compared to the first one is that, when dealing with large datasets, the worker nodes need to be correspondingly large and the network might become a performance bottleneck.

This pipeline allows further scaling up by distributing the computation to multiple clusters which work independently to generate models. Each cluster randomly samples the training set and starts generating models using the selected algorithm. Each model is then tested against the test set to evaluate its fitness according to the specified set of indicators. As the number of iterations increases, the models are expected to converge towards the optimal solution.

Biomedical data always consist of a large number of features, i.e., individual measurable properties that describe the observed phenomenon. To find the best-fitted model, with the least amount of bias, we generate a family of models using the same dataset with different selections of features. Once a model is built, a certain number of features can be removed, and a new model is generated using the remaining features (Figure 6.1). An unbiased approach is to randomly remove a selected number of features and then check whether this improves the fitness of the model. If the metrics remain the same, we move to the next iteration; if not, we select another set of features to remove. Removing a fixed, small number of features at every recursion avoids overlooking any candidate, but it is computationally expensive. A relative step, e.g. a percentage of the total number of features, greatly speeds up the process.

A more efficient approach is to introduce a very small amount of bias by lowering the chances that a feature which we know beforehand to be a factor is removed. By introducing this, the generation of models naturally tends towards the optimal solution much faster. Another option, as suggested by Vladimir Vapnik's group [GWBV02] in the context of gene selection, is to use the weight magnitude as the ranking criterion: compute the ranking criterion c_i = (w_i)² for all i and remove the feature with the smallest ranking criterion, f = argmin_i(c_i).
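
The following sketch shows one elimination step of this weight-magnitude criterion, illustrated with a linear SVM from scikit-learn; the choice of library and model is an assumption, as the pipeline can use any model that exposes feature weights.

    # Sketch of one elimination step of the weight-magnitude ranking criterion
    # [GWBV02], illustrated with scikit-learn's linear SVM (an assumption; any
    # model exposing feature weights could be used instead).
    import numpy as np
    from sklearn.svm import SVC

    def eliminate_least_important(X, y, feature_names):
        model = SVC(kernel="linear").fit(X, y)
        c = np.square(model.coef_).sum(axis=0)     # ranking criterion c_i = (w_i)^2
        worst = int(np.argmin(c))                  # f = argmin_i(c_i)
        keep = [i for i in range(X.shape[1]) if i != worst]
        return X[:, keep], [feature_names[i] for i in keep], feature_names[worst]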

The scoring used to assess the fitness of models can be done through a wide variety of measures, such as the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), Sensitivity or True Positive Rate (TPR), Specificity or True Negative Rate (TNR), Negative Predictive Value (NPV), Positive Predictive Value (PPV) and F1-score. No single metric gives the right answer in every situation. These metrics can only eliminate obvious “failures” due to performance, complexity, similarity to other models developed in similar ways and general stability. In the case of multiple thresholds with largely identical performance, the Hazard Ratio (HR) from Cox proportional hazards regression [CHe09] can be used as a tiebreaker, favouring higher HR values.

Once that first selection has been done by each individual cluster, we gather the candidate models into a NoSQL cache where other services take over. The role of this stage is to further narrow down the candidate models by comparing the models against one another. Then, we enrich the results to give back as much information as possible to the scientist. It is thus important to implement different types of metrics to enable data scientists to select the right models for further extensive assessment or biological validation. Pathway enrichment is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins and may have an association with a disease phenotype. Applying a pathway enrichment using the Kyoto Encyclopaedia of Genes and Genomes (KEGG) [KG00] and the Gene Ontology (GO) [ABe00] with multiple test corrections (Bonferroni, Holm-Bonferroni and/or FDR) to a sub-selection of high-scoring models can provide additional insight into the results and an additional quality check for every model generated.

This type of unbiased approach to model generation is not well supported on standard platforms. The reason is that model generation is a long-running computation and, the longer a computation runs, the more likely a crash and the loss of intermediate results become. Indeed, the computations can run for days at a time, putting a lot of pressure on the hardware and software, with no possibility to create milestones, a mechanism that would help to prevent full recomputation in the event of a crash. To address the issue of long-running computations crashing, the eAE leverages the versioning mechanisms of Spark, and the temporary models are stored in MongoDB for later evaluation and provenance if the final model is not satisfactory. Each model is stored with its associated parameters and performances. The necessity to develop this mechanism gave rise to the idea of what would later become TensorDB (see Chapter 4.3). The integration of Spark with the additional eAE layers on top enables users to run these large-scale, compute-intensive experiments easily and seamlessly through the endpoint of their choice. The stability, robustness, and fault-tolerance of the platform enable these computations in a high-performance fashion: even if a physical machine or a worker fails, the tasks are automatically rescheduled, thereby avoiding having to rerun the entire computation. The integration of Docker, Jupyter, Toree¹ and the eAE components has enabled users to implement and prototype their algorithms efficiently without having to deal with how to set everything up or link all the pieces together. Finally, once the algorithm is ready, no modification is needed and the submission to the centralised cluster is straightforward.

6.1.2 General statistics

The general statistical analysis pipeline aims at providing statistical insights about the datasets for further research, without requiring any prior statistical knowledge, by performing multiple statistical tests on a given data set. Statistical methods test scientific theories when observations, processes or boundary conditions are subject to stochasticity. Performing multiple tests on the same data set at the same stage of analysis increases the chance of obtaining at least one invalid result. The benefit obtained from performing statistical methods across whole datasets, however, far outweighs this drawback.

¹https://toree.apache.org/

Figure 6.2: Modelling of the pipeline for an unbiased approach to statistical testing of whole datasets.

The first step of this pipeline is to divide the data into their basic data types: numerical, binary, categorical and unknown. The unknown category holds any column with three or fewer valid data points and any irrelevant data (e.g. phone numbers and free text). This data is not discarded as it might be used to extract insights at a later stage. The numerical data is then subdivided further by normality. Two methods are used to determine whether a variable follows a normal distribution: the Shapiro-Wilk test and the Anderson-Darling test. A variable is tagged as normally distributed only if both tests yield a positive answer. Applicable statistical methods are then applied to the data in each category.

For the categorical data, the χ² test is extensively used for assessing the associations between different clinical variables. The χ² test determines whether there is a significant difference between the observed frequencies and the expected frequencies in one or more categories. If one variable is categorical and one is numerical, the analysis of variance (ANOVA) test is used to provide a statistical test of whether or not the means of several groups are equal, which comes as a complement to the χ² test. For the binary data, the binomial test has been chosen, which is an exact, two-sided test of the null hypothesis that the probability of success in a Bernoulli experiment is p (we chose p = 0.5).

We use two methods to analyse the relationships between numerical variables, i.e., logistic regression (LR) analysis and correlation analysis. LR has been successfully used to identify independent predictors of prostate cancer and improve diagnostic accuracy [VGKS99]. LR models, built on the regression fit of the probabilistic odds between the compared conditions, require no specific distribution assumption (e.g., a Gaussian distribution) but are often shown to be less sensitive than other approaches. For correlation analysis, we chose the Spearman and Pearson correlations. The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while the Pearson correlation assesses linear relationships, the Spearman correlation assesses monotonic relationships. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other. Bonferroni's correction is used to adjust for multiple comparisons.
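
A simplified sketch of the per-type test dispatch described in this subsection is given below, using SciPy; the thresholds and the dispatch logic are illustrative simplifications rather than the pipeline's exact implementation.

    # Simplified sketch of the tests described above, using SciPy
    # (thresholds and dispatch logic are illustrative simplifications).
    from scipy import stats

    def is_normal(x, alpha=0.05):
        """Tag a variable as normal only if Shapiro-Wilk and Anderson-Darling agree."""
        shapiro_ok = stats.shapiro(x)[1] > alpha
        ad = stats.anderson(x, dist="norm")
        ad_ok = ad.statistic < ad.critical_values[2]   # critical value at the 5% level
        return shapiro_ok and ad_ok

    def categorical_vs_categorical(table):
        return stats.chi2_contingency(table)[1]        # p-value of the chi-squared test

    def categorical_vs_numerical(*groups):
        return stats.f_oneway(*groups).pvalue          # one-way ANOVA

    def binary_test(successes, n, p=0.5):
        return stats.binomtest(successes, n, p).pvalue # exact two-sided binomial test

    def numerical_vs_numerical(x, y):
        return stats.pearsonr(x, y)[0], stats.spearmanr(x, y)[0]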

6.1.3 Pathway Enrichment

Gene expression data is usually interpreted using gene set enrichment analyses based on the functional annotation of the differentially expressed genes. This is effective for uncovering whether the differentially expressed genes are associated with a molecular function or a certain biological process. The Gene Ontology (GO) [OGS+99], containing standardised annotations of gene products, is often used for this purpose. This analysis is carried out by comparing the frequency of individual annotations in the gene list with a reference list (usually all genes on the microarray or in the genome).

To help scientists select the right model after the list of genes has been output by the iterative model generation pipeline, or simply to perform a pathway enrichment using a list of genes obtained from another analysis, a pathway enrichment using the Kyoto Encyclopedia of Genes and Genomes (KEGG) [OGS+99] and the Gene Ontology (GO), with multiple test corrections (Bonferroni, Holm-Bonferroni and/or FDR) on the sub-selection of high-scoring models, has been implemented.

Figure 6.3: Illustration of a KEGG disease pathway with the differentially expressed genes associated with smoking.

The enrichment is done using a two-sided Fisher's exact test [Fis35] after building the associated contingency table. These enrichments add further insight to the results and an additional quality check to every model generated. The Spark re-implementation facilitates large-scale enrichments while the models are still being built, unlike traditional platforms that wait for the final model to be output before performing the enrichment. Moreover, the number of pathways and their complexity gradually increase, which requires more and more compute power to perform a single analysis. The proposed implementation offers a way to overcome that issue and gives researchers the possibility to integrate this procedure into any of their pipelines, thus improving the quality of their research seamlessly.
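
A minimal sketch of the per-pathway test is shown below: a 2x2 contingency table (genes in the list vs. the background, in the pathway vs. not) is built and a two-sided Fisher's exact test is applied, followed here by a Bonferroni correction; the pathway membership data structure is hypothetical.

    # Minimal sketch of the per-pathway enrichment test (two-sided Fisher's
    # exact test on a 2x2 contingency table) with a Bonferroni correction;
    # the pathway membership structure is hypothetical.
    from scipy.stats import fisher_exact

    def enrich(gene_list, background, pathways):
        """pathways: dict mapping a pathway name to the set of its member genes."""
        gene_list, background = set(gene_list), set(background)
        results = {}
        for name, members in pathways.items():
            a = len(gene_list & members)                   # listed genes in the pathway
            b = len(gene_list - members)                   # listed genes outside the pathway
            c = len((background - gene_list) & members)    # background-only genes in the pathway
            d = len((background - gene_list) - members)    # background-only genes outside
            _, p = fisher_exact([[a, b], [c, d]], alternative="two-sided")
            results[name] = min(1.0, p * len(pathways))    # Bonferroni correction
        return results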

6.2 DeepSleepNet

This work was carried out in collaboration with researchers specialising in sleep disorders from the Data Science Institute. This example highlights the new possibilities and scalability benefits that researchers gain in distributed deep learning by leveraging the eTRIKS Analytical Environment and TensorLayer. It proposes a new deep learning model, named DeepSleepNet [SDWG17], for automatic sleep stage scoring based on raw single-channel EEG.

6.2.1 Introduction

Sleep plays an important role in human health. Being able to monitor how well people sleep has a significant impact on medical research and practice [WGWF10]. Typically, sleep experts determine the quality of sleep using electrical activity recorded from sensors attached to different parts of the body. A set of signals from these sensors is called a polysomnogram (PSG), consisting of an electroencephalogram (EEG), an electrooculogram (EOG), an electromyogram (EMG), and an electrocardiogram (ECG). This PSG is segmented into 30-s epochs, which are then classified into different sleep stages by the experts according to sleep manuals such as the Rechtschaffen and Kales (R&K) [AH69] and the American Academy of Sleep Medicine (AASM) [IACe07] manuals. This process is called sleep stage scoring or sleep stage classification. This manual approach is, however, labor-intensive and time-consuming due to the need for PSG recordings from several sensors attached to subjects over several nights.

The new approach proposes a model for automatic sleep stage scoring based on raw single-channel EEG by utilizing the feature extraction capabilities of deep learning. The architecture of DeepSleepNet consists of two main parts, as shown in Figure 6.4. The first part is representation learning, which was trained to learn filters that extract time-invariant features from each raw single-channel EEG epoch. The second part is sequence residual learning, which was trained to encode temporal information, such as stage transition rules, from a sequence of the extracted features.

Figure 6.4 details the specifications of the hidden sizes of the forward and backward Long Short-Term Memories (LSTMs) along with the fully-connected layers. The fc block shows a hidden size, while each bidirect-lstm block shows the hidden sizes of the forward and backward LSTMs. For the representation learning part, we followed the guideline provided by Cohen et al. [Coh14] for capturing temporal and frequency information from the EEG.

Figure 6.4: An overview of the architecture of DeepSleepNet from Supratak et al. [SDWG17], consisting of two main parts: representation learning and sequence residual learning. Each trainable layer is a layer containing parameters to be optimised during a training process. The specifications of the first convolutional layers of the two CNNs depend on the sampling rate (Fs) of the EEG data.

6.2.2 Tackling class imbalance

Models built on large sleep datasets often suffer from class imbalance issues (i.e., they learn to classify only the majority sleep stages). In order to prevent this from happening, a two-step training algorithm (see Algorithm 1 in [SDWG17]) was developed to effectively train the model end-to-end via backpropagation while preventing it from suffering from the class imbalance problem. The representation learning part of the model is first pre-trained by the algorithm with a class-balanced training set (to avoid overfitting to the majority sleep stages); then the algorithm fine-tunes the whole model using two different learning rates. The fine-tuning step encoded the stage transition rules and was trained on the sequence training set using a mini-batch Adam optimiser with the two different learning rates. The class-balanced training set was obtained by oversampling the minority sleep stages in the original training set until all sleep stages had the same number of samples. A cross-entropy loss was used to quantify the agreement between the predicted and the target sleep stages in these training steps. The last layer in the DeepSleepNet architecture (see Figure 6.4) is a combination of the softmax function and the cross-entropy loss, which are used to train the model to output probabilities for mutually exclusive classes.
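The oversampling used to build the class-balanced pre-training set can be sketched as follows; this is an illustrative NumPy re-implementation of the idea, not the exact code of [SDWG17], and the toy epoch counts are invented.

```python
# Minimal sketch of building the class-balanced pre-training set by oversampling
# minority sleep stages until every stage has as many epochs as the largest one.
import numpy as np

def oversample_to_balance(x, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        class_idx = np.where(y == c)[0]
        # Sample with replacement so that every class reaches `target` epochs.
        idx.append(rng.choice(class_idx, size=target, replace=True))
    idx = rng.permutation(np.concatenate(idx))
    return x[idx], y[idx]

# Toy example: 5 sleep stages (W, N1, N2, N3, REM) with a strong imbalance.
y = np.array([0] * 500 + [1] * 60 + [2] * 2000 + [3] * 400 + [4] * 700)
x = np.random.randn(len(y), 3000)   # e.g. 30-s epochs sampled at 100 Hz
x_bal, y_bal = oversample_to_balance(x, y)
print(np.bincount(y_bal))           # every stage now has 2000 epochs
```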

6.2.3 Results

Data and Performance Metrics

Evaluation of the model against other datasets is important in order to assess the quality of the model. The evaluation of the model was done using different EEG channels from two public datasets: the Montreal Archive of Sleep Studies (MASS) [OGCN14] and Sleep-EDF [GAG+00, KZT+00].

The performance of the model was measured using per-class precision (PR), per-class recall (RE) and per-class F1-score (F1). The per-class metrics are computed by selecting a single class as the positive class and then combining all other classes into a single negative class.

Initial Experiments

Initially, experiments to determine the design of the architecture and the parameters of DeepSleepNet were conducted with the first fold of the 31-fold cross-validation using the MASS dataset. For each model architecture, several configurations were explored, such as increasing/decreasing the number of convolutional layers, changing the number of filters and the stride sizes, and changing the hidden sizes in the bidirectional-LSTMs and the fully-connected layer. The architecture in Figure 6.4 gave the best performance.

Experimental Design and Implementation

The implementation of DeepSleepNet is an illustration of the user-friendliness of the eAE for highly parallelisable computation and of its support for hardware accelerators (GPUs in this instance). The eAE allows users to seamlessly and quickly configure and launch the complex training of multiple models concurrently.

The project implementation had two major constraints: the first one was that the project involved three different researchers working concurrently on the workflow, and the second one was that the GPU cluster was only available outside of working hours. To make things even more complicated, the GPU cluster had to be switched between Windows and Ubuntu daily (except on weekends). This switch happened forcefully in an automated fashion, interrupting the computations in the morning. Thus, it was of paramount importance that the eAE restarted the compute services seamlessly and rescheduled interrupted jobs whenever a resource became available.

In order to overcome those constraints and develop the new workflow, an eAE Jupyter container was deployed on a specific box with two GPU resources available to enable the researchers to prototype their workflow simultaneously, share their code with one another seamlessly and access their data. To overcome the limited availability of the GPU cluster, the eAE's compute services were deployed in containers that were configured to restart whenever the machine booted into Ubuntu. If the reported health of a compute node was unavailable, the eAE would commandeer the host machine of the failing container and automatically attempt to restart the container up to three times to ensure the good health of the cluster.

The eAE has proven instrumental in tackling the class imbalance by providing an easy way to test different hypotheses quickly. The class imbalance proved a challenging issue and different approaches were necessary to appropriately develop the two-step training algorithm. The data provenance has greatly helped the researchers in versioning the different steps of the algorithms and tracing the associated data and parameters used at the time to evaluate the quality of the model and algorithm. Furthermore, the optimization of the two learning rates for the whole model required an extensive grid search. This step was facilitated by the eAE by providing easy access to a large pool of compute resources directly from the development environment of the researchers.

In order to build and assess the quality of the model, a k-fold cross-validation scheme was used on the eAE, where k was set to 31 for the MASS and Sleep-EDF datasets. In each fold, we used recordings from 60 subjects to train the model and the two remaining subjects to test it. This process is repeated 31 times so that all recordings are tested. Finally, we combine the predicted sleep stages from all folds and compute the performance metrics. We ran a large number (several hundred) of 31-fold cross-validation iterations for hyperparameter tuning of the model and various experiments. Each cross-validation task takes roughly 6-7 hours to execute, and the total execution time is consequently 170.5 hours. The eAE enabled the researchers to schedule during the day two iterations to run every night without any intervention necessary, as the tasks would be triggered as soon as the compute nodes became available thanks to the eAE's scheduler and management services.
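A minimal sketch of the subject-wise split behind this 31-fold scheme is shown below, assuming NumPy; the subject identifiers and epoch counts are illustrative and not taken from the actual datasets.

```python
# Minimal sketch of the subject-wise 31-fold cross-validation: in each fold, two
# subjects are held out for testing and the remaining 60 are used for training.
import numpy as np

def subject_folds(subject_ids, n_folds=31, seed=0):
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(subject_ids))
    for test_subjects in np.array_split(subjects, n_folds):
        test_mask = np.isin(subject_ids, test_subjects)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

# Toy example: 62 subjects, each contributing 900 30-s epochs.
subject_ids = np.repeat(np.arange(62), 900)
for fold, (train_idx, test_idx) in enumerate(subject_folds(subject_ids)):
    # train the model on train_idx, predict the sleep stages for test_idx,
    # then pool the predictions over all 31 folds before computing the metrics
    pass
```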

The only alternatives to this would be either to schedule the tasks on each machine individually or to run them sequentially on one machine. The former is tedious and far from practical, as one needs to give the user access to all machines, and the latter simply takes too long. The eAE provides a user-friendly web UI, which allows users to train multiple models with different configurations concurrently across a cluster of high-performance machines. The scheduling of these tasks through the eAE takes approximately 2-3 minutes, compared to an hour if done manually. Another benefit is the possibility to queue jobs to be run once machines become available. For this experiment, the GPU resources were only available at night as they were used for other projects during the day. The option to schedule two 31-fold cross-validation iterations at a time to be run during the night, without any external intervention, was a key feature for the timely delivery of DeepSleepNet. In the case of this workflow, the experiments spanned almost an entire year and were made possible only thanks to the eAE.

Sleep Stage Scoring Performance

Tables 6.1 and 6.2 show confusion matrices obtained from the cross-validation on the MASS and Sleep-EDF datasets respectively. The Fpz-Cz channel yielded the best performance when compared with the Pz-Oz channel from the Sleep-EDF dataset, thus we do not include the confusion matrix obtained from the Pz-Oz channel. Each row and column represent the number of 30-s EEG epochs of each sleep stage classified by the sleep expert and by the model respectively. The numbers on the diagonal indicate the number of epochs that were correctly classified by the model. The last three columns in each row indicate per-class performance metrics computed from the confusion matrix.

The poorest performance came from stage N1, with an F1 below 60, while the F1 scores for the other stages were significantly better, ranging between 81.5 and 90.3. It is also important to notice that the confusion matrix is almost symmetric about the diagonal (except for the N2-N3 pair), which indicates that the misclassifications were less likely to be due to the class imbalance problem. Figure 6.5 presents a manually scored hypnogram against one automatically scored by the DeepSleepNet model for Subject-1 from the MASS dataset.

                       Predicted                    Per-class Metrics
        W      N1      N2      N3     REM       PR      RE      F1
W     5433    572     107      13     102      87.3    87.2    87.3
N1     452   2802     827       4     639      60.4    59.3    59.8
N2     185    906   26786    1158     499      89.9    90.7    90.3
N3      18      4    1552    6077       0      83.8    79.4    81.5
REM    132    356     533       1    9442      88.4    90.2    89.3

Table 6.1: Confusion matrix from Supratak et al. [SDWG17] obtained from the cross-validation on the F4-EOG channel from the MASS dataset

                       Predicted                    Per-class Metrics
        W      N1      N2      N3     REM       PR      RE      F1
W     6614    745     181      81     306      86.0    83.4    84.7
N1     295   1406     631      30     442      43.5    50.1    46.6
N2     391    618   14542    1473     775      90.5    81.7    85.9
N3      29      9     291    5370       4      77.1    94.2    84.8
REM    360    457     419       7    6474      80.9    83.9    82.4

Table 6.2: Confusion matrix from Supratak et al. [SDWG17] obtained from the cross-validation on the Fpz-Cz channel from the Sleep-EDF dataset


Figure 6.5: Examples from Supratak et al. [SDWG17] of the hypnogram manually scored by a sleep expert (top) and the hypnogram automatically scored by DeepSleepNet (bottom) for Subject-1 from the MASS dataset.

Conclusion

The results demonstrated that the model could flexibly be applied to different EEG channels (F4-EOG, Fpz-Cz and Pz-Oz) without any change to either the model architecture or the training algorithm. Also, the model achieved similar overall accuracy and macro F1-score compared to the state-of-the-art hand-engineered methods on both the MASS and Sleep-EDF datasets, despite their different properties such as sampling rate and scoring standards (AASM and R&K). It is interesting to note that the temporal information learned by the sequence residual learning part helped improve the classification performance. We can conclude that the proposed model was capable of automatically learning features for sleep stage scoring from different raw single-channel EEGs. This work has moved us one step closer to the possibility of remote sleep monitoring from home environments, which would be less costly and less stressful for the patients and possible at a larger scale than current hospital setups. Remote monitoring could potentially help elderly people and people with stress or sleep disorders on a daily basis, and help doctors easily follow up on their patients.

In turn, the eAE has benefited as well from that close collaboration with the DeepSleepNet project. Firstly, the researchers provided valuable feedback on the user experience side of the first implementation of the eAE. That feedback has been incorporated in the design of the architecture, bringing more value to the users. Secondly, this project acted as a test bed for validating the architecture and identifying shortcomings of the implementation, which have been subsequently addressed. Finally, the integration and optimization of TensorDB and TensorLayer into the architecture was only made possible thanks to the collective efforts of the group. As this use case illustrates, all those innovations have opened the way for better science and continuous deep learning to build better applications.

6.3 Characterizing Political Deception On Twitter

In this section, we will present the work done on the identification of deceptive news, the relevant features identified and an ensemble model built from those features to facilitate the automatic labelling of news tweets. This work is a continuation of preliminary work done in [AOMS17a] and had been submitted to IEEE Access for publication at the time of writing.

6.3.1 Background

Political fake news have become a major challenge of our time, and flagging them successfully is a main source of concern for publishers, governments and social media. Our approach is focused on Twitter, and in this project we aimed at finding characteristic features that can help in the process of automating the identification of tweets containing fake news. In particular, we look into a dataset of four months' worth of tweets related to the 2016 US presidential election.

Even if the term fake news, understood as deliberately misleading pieces of news information, reached the mainstream in the 2016 electoral campaign in the United States, the phenomenon, and in particular the worries about how it affects people's beliefs and perceptions, has been present every time a new technological breakthrough in communications comes along [AG17, PR17].

However, the identification of the nature of fake news is a challenge of its own. There is no official definition that precisely establishes what fake news is, whether from a legal, philosophical or conceptual standpoint. Many governments, such as those of France and the UK, are currently drafting laws that would try to legally define and fight against fake news. The debate regarding those laws is at a stalemate because nobody agrees on a general definition: the perception of what constitutes fake news varies a lot between people, and any restrictive law would be regarded as a threat against freedom of speech.

This shift in perception is more acute when the time variable is taken into account: time may show that what was considered a genuine and honest piece of news turns out to be false. This transformation could be caused by an evolution of public opinion, a new discovery (a scientific discovery, for example) or a new piece of information that was unavailable before. The impact of perception on the classification of a news item as fake or not is especially strong as it is subject to emotions and not just rational thinking.

The introduction of cheap and improved presses and advertising business models allowed newspapers to increase their reach dramatically. Partisans and ideologues, as well as ‘entrepreneurs’, were among the many beneficiaries that, by adopting such innovations and promoting sensational and fake news stories, managed to cheaply and effectively spread their message and increase sales [Iof17, Tho17].

The 20th century saw new technologies such as radio and television further enabling the distribution of content, raising fears that whoever controlled these platforms could influence public opinion. More recently, the emergence of internet connectivity together with the proliferation of online social networking sites has dramatically reduced publishing and distribution costs and increased the diversity of viewpoints, but has also pushed out significant editorial filtering and fact-checking [Set17].

While technology facilitating the distribution of fake news has evolved, strategies and motives behind their consumption, production and distribution appear to remain constant. For instance, in relation to consumption, Pennycook and Rand [PR17] explain that the dominant psychological explanation for why individuals fall for fake news is that previous exposure to fake news primes individuals to become familiar with them and repetition reinforces previous beliefs about such news.

With respect to production, the online information ecosystem is particularly fertile ground for sowing misinformation. Fake news outlets follow strategies such as mimicking the names of legitimate news outlets and mixing factual articles with partisan slang and fake content. Social media can be easily exploited to manipulate public opinion thanks to the low cost of producing fraudulent websites and high volumes of software-controlled profiles or pages, known as social bots [SCe17]. Also, fake news outlets tend to disappear more often, but their content is shared as if it were coming from trusted sources. And although their analysis is limited to a sample of online data, anecdotal evidence gathered by Thompson [Tho17] seems to support these findings for different periods of time. In relation to motives, Allcott and Gentzkow [AG17] argue ‘there appears to be two main motivations for providing fake news. The first one is pecuniary [...]. The second motivation is ideological’. Regularities in strategies and motives, together with the abundance of data available, suggest the possibility of identifying features to automatically tag fake news outlets. As such, the aim of our work is to put forward a simple framework that facilitates identifying fake news content in online social media. Many attempts have been made at characterising fake news, most recently and famously the one published in Science [VRA18], where the authors leveraged external sources to automatically label tweets. However, this approach presents many limitations. First, it only works with news that contain a specific URL that can be traced back and checked against those verification websites, which drastically reduces the scope. The second issue is that there is no guarantee that those external websites can always be trusted either: it would be easy for them to suddenly shift the classification of a news item to better align with one of their interests. To solve this, complicated and very advanced NLP techniques would be required to parse the text automatically, extract the information and then check the veracity of the facts against verified data.

Though theoretically sound, this approach has a serious caveat: current NLP techniques and automatic fact-checking are far from being fully developed. With this path still closed, researchers are forced to look for alternatives or simplifications.

As related by several authors [Pol12, PR17, FNR17], previous exposure to fake news makes individuals become more familiar with the piece of information they were exposed to. This familiarity makes it more likely for individuals to believe a previously seen fake news item the next time they encounter it. Moreover, as reported several times [AG05, AG17, BMA15, BFJ+12], fake news articles tend to include more polarizing content. Putting these two elements together, we hypothesise that efforts aiming to identify fake news should look for large levels of exposure and large levels of polarization. To validate our hypothesis, we use a dataset containing tweets collected during the 4 months just after the 2016 US presidential election [ADLOMS17]. This dataset includes tweets that were retweeted more than 1000 times (regarded as viral tweets), and offers two sets of manual annotations classifying each tweet as fake news or not, according to the categories established by [CCR15]. From the dataset, characteristics signalling exposure, polarization and diffusion have been analysed. Among other features, the date tweets were created, their number of retweets, favourites and hashtags, their text, the number of friends and followers of the creator of the tweet, and whether the account was verified have been examined with care.

Among other results, it has been found that tweets containing fake news were retweeted more than those containing other types of viral content, but were in turn less ‘favourited’. The authors of fake news have more friends and favourites but fewer followers and lists than the authors of tweets labelled as not containing fake news. Those characteristics are representative of more ‘common’ Twitter users (i.e. not celebrities or official sources). Finally, fake news tend to be less positive than tweets not containing them.

This research can be related to different strands of literature and to work in political science and psychology [Pol12, PR17, PCR17, FNR17, SBLE17], to economics [AG17, BM18, Wir18], and to computer science [AG05, BMA15, BFJ+12, CCR15, RCC15].

The key contributions of this work are as follows:

• Identification of relevant indicators of political fake news from Twitter metadata and derived metadata, using statistical significance of differences in distributions.

• Proposal of additional indicators using state-of-the-art text processing techniques to improve the identification of political fake news.

• Proposal and testing of several classification methods (including an ensemble method which outperforms other state-of-the-art methods) using the above raw and derived data.

6.3.2 Data and Methodology

The wealth of data available online and through social networking sites presents certain difficulties when it comes to analysing it. Many approaches to identify fake news and its effects on behaviour have been suggested [CCR15]. However, the question of how viral fake news effectively differs from other types of viral content remains unanswered. Leveraging the literature presented in our first paper [AOMS17a], we sketched a model for singling out fake news based on two dimensions: exposure and polarization.

Exposure. Following findings from Polage [Pol12], Pennycook et al. [PR17], Pennycook et al. [PCR17] and Flynn et al. [FNR17], repetition and familiarization with previously seen information should increase the likelihood of believing certain pieces of information are true. In this way, we would expect that an individual will be more likely to consume a fake news item if he has been previously exposed to it.

Polarization. Swire and colleagues [SBLE17] explained that, in highly politicised environments, partisanship highly influences the way in which individuals process information. The authors claim that the specific mechanism through which this operates is confirmation bias. In view of this, it is reasonable to expect that fake news are more likely to contain highly polarizing content.

Following these two dimensions, we define the following typology that will allow us to identify viral fake news in online social media:

1. Fake News. High level of exposure and high level of polarization.

2. Viral Tweet. High level of exposure but low level of political polarization.

3. Regular Tweet. Low level of exposure and low level of political polarization.

4. Partisan Tweet. Low level of exposure and high level of political polarization.

We now turn to Twitter’s data to put forward evidence about the existence of these categories. We used Twitter’s public streaming API to collect publicly available tweets related to the 2016 presidential election in the United States. It is important to highlight that the tweets collected are subject to Twitter’s terms and conditions, which imply that users posting tweets consent to the collection, transfer, manipulation, storage and disclosure of the data generated by the tweet. Because of that, no ethical, legal or social implications are expected to derive from the usage of the tweets. The sample was collected using the following search terms and user handles:

#MyVote2016, #ElectionDay, #electionnight, @realDonaldTrump and @HillaryClinton. The dataset can be found in [ADLOMS17].

An important feature within Twitter is the ability to share someone’s tweet through ‘retweets’. This functionality enables users to pass forward to their followers an exact copy of someone else’s tweet.

                              Second Label
                     Other Tweets   Fake News   Unknown
First   Other Tweets     6482          1444        330
Label   Fake News         213           133          7
        Unknown           250            98         44

Table 6.3: Contingency table reporting the differences and similarities between the labelling performed by the two teams on the dataset used.

There are many reasons why users might decide to retweet; e.g. to spread information to new audiences, to show one’s role as a listener, or to agree with or validate someone else’s point of view [BGL10]. In the context of this research, we used retweets to define virality. We consider that a tweet went viral if it was retweeted more than 1000 times. We chose 1000 to single out those tweets that got some traction, but this threshold could be chosen differently. This filtering reduced the number of tweets to be categorised from 57 million to 9000 tweets (a factor of 6375).
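As an illustration, the virality filter can be expressed in a few lines of pandas; the file names are hypothetical and, in practice, the 57 million raw tweets were processed with Spark rather than loaded into a single DataFrame.

```python
# Minimal sketch of the virality filter: keep only tweets retweeted more than
# 1000 times. Field names follow the Twitter API ("retweet_count", "text").
import pandas as pd

VIRALITY_THRESHOLD = 1000

tweets = pd.read_json("election_tweets.jsonl", lines=True)      # raw collected tweets
viral = tweets[tweets["retweet_count"] > VIRALITY_THRESHOLD]     # keep viral tweets only
viral = viral.drop_duplicates(subset="text")                     # remove duplicated texts
viral.to_json("viral_tweets.jsonl", orient="records", lines=True)
```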

Having singled out every viral tweet, we proceeded to eliminate duplicates and manually inspected the text field of the tweets to categorise them. Two teams of individuals labelled tweets as viral fake news if their text could be considered within any of the categories mentioned below, and as regular tweets (i.e. viral tweets not containing fake news) otherwise.

In this way, the statistical analysis allows us to look for evidence supporting the dimensions proposed for our framework. Our focus moves on to finding statistically significant differences between viral tweets and viral fake news along those dimensions: exposure of the tweet to others, number of entities (i.e. URLs and images), life of the accounts/social connections, and the extent to which tweets differed in expressing polarised political views. We use the fullest extent of available fields within a tweet’s meta-data (such as created at, retweet count, favourites count, etc.) and some derived fields (such as user name caps, user screen name digits, etc.) to investigate differences along these dimensions.

The original dataset contained 57,379,672 tweets ranging from November 2016 until March 2017. That number includes original tweets and retweets. From them, we extracted only the viral tweets (defined as those that have more than 1000 retweets), resulting in a total of 9001 tweets. This dataset can be found at [ADLOMS17]. However, a portion of those tweets is no longer publicly available through the Twitter API: some accounts have been closed for infringing Twitter’s policy, resulting in their tweets no longer being available, while other tweets have been deleted by their authors. As can be seen from Table 6.3, there are some discrepancies between the two teams. For the rest of this study, we focus solely on the labelling of the second team, which contains 1675 fake news (18.6% of the total), and leave aside their 381 unknowns (4.23%) to avoid any conflict or ambiguity in the results. The dataset enabled us to perform a statistical analysis looking for statistically significant differences in the features between viral tweets and viral fake news.

It should be noted that there is a growing resistance to using the term fake news in the scientific literature, as it has somehow lost its literal meaning and its acute characterization and definition is elusive [LBBe18]. In turn, researchers are favouring the term false news in an attempt at reducing the complexity and side-lining interpretation and philosophical debates. We will nevertheless stick to the term fake news, because our dataset was compiled with its labellers using the broad interpretation of fake news and not merely acting as fact-checkers.

In order to be consistent with our previous research [AOMS17a], we followed the same labelling principles, i.e. the three categories defined by Rubin et al. [RCC15] plus two more used previously:

1. Serious fabrication. These are news stories created entirely to deceive readers. During the 2016 US presidential election, there were plenty of examples of this (e.g. claiming a celebrity had endorsed Donald Trump when that was not the case).

2. Large scale hoaxes. Deceptions that are then reported in good faith by reputable news sources. A recent example would be the story that the founder of Corona beer made everyone in his home village a millionaire in his will.

3. Jokes taken at face value.

Humour sites such as the Onion or Daily Mash present fake news stories in order to satirise the media. Issues can arise when readers see the story out of context and share it with others.

4. Slanted reporting of real facts. Selectively chosen but truthful elements of a story put together to serve an agenda. One of the most prevalent examples of this is the well-known problem of voting machine faults.

5. Stories where the ‘truth’ is contentious. On issues where ideologies or opinions clash (for example, territorial conflicts), there is sometimes no established baseline for truth. Reporters may be unconsciously partisan or perceived as such.

For each tweet collected, the Twitter API provided the main features (number of retweets, favourites, media, URL...). Additionally, we derived extra features such as the number of capital letters in the user name, the time a tweet takes to get to a certain number of retweets, or the global sentiment of the tweet.

In an attempt to minimise labelling errors and variability in perception, tweets were manually labelled by two different teams of people following the same guidelines and categories defined above, who were tasked with determining whether a tweet contained fake news or not. This resulted in large discrepancies between the two labels, as can be seen in Table 6.3. Understanding these discrepancies and getting more people to label the tweets could be a promising avenue for future improvement. As the second label was deemed more trustworthy, we will use it for the following studies, with 1675 fake news (18.6%) and 381 unknowns (4.23%).

The collection of the tweets over those few months was done using the Twitter Stream API with the desired filters, and the eAE coupled with Apache Kafka to store the tweets in MongoDB and produce part of the derived features using Spark. This setup was put in place as an initial step towards an automated platform for fake news identification using the eAE.

Figure 6.6 presents the proposed architecture for the FakeNews Platform using the eAE as back-end and the proposed model for identifying fake news.

Figure 6.6: A schematic representation of the architecture of the proposed FakeNews Platform.

We use the Twitter Stream API (1) and filter real-time tweets using the desired filters. Those tweets then get stored (2) into MongoDB, which acts both as a sink and a source, reading from one topic and writing (3) to another one. This new topic gets in turn consumed (4) by Spark to run our models against the collected tweets. Once a tweet is consumed, the resulting label is saved (5) back into MongoDB by updating the record. If the tweet is labelled as fake news, it is also added (6) to the fake news topic to be consumed by any external service either for further analysis or browsing. The by-default storage of all the streaming data into a database removes the need for researchers to worry about the storage of the data, making it readily available in a high-throughput fashion. We also considered developing a module for knowledge enrichment, pulling from different websites or external services as a preprocessing step before the tweets are stored in the database. However, this platform was never put in production, as the external service for the subsequent exploration was never implemented due to a lack of resources and was out of the scope of this research.
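A minimal sketch of steps (4) to (6) is given below, using kafka-python and pymongo instead of the actual Spark streaming job for readability; the topic names, connection strings and the stand-in heuristic classifier are all hypothetical.

```python
# Minimal sketch of the labelling loop: consume tweets from a topic, store the
# label in MongoDB and forward fake news to a dedicated topic.
import json
from kafka import KafkaConsumer, KafkaProducer
from pymongo import MongoClient

consumer = KafkaConsumer("tweets",  # (4) consume the topic fed by the collector
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda m: json.loads(m.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda m: json.dumps(m).encode("utf-8"))
tweets = MongoClient("mongodb://localhost:27017")["fakenews"]["tweets"]

def classify(tweet):
    # Stand-in heuristic for the trained model (e.g. the ensemble of Section 6.3.4).
    return tweet.get("retweet_count", 0) > 1000 and not tweet.get("user", {}).get("verified", False)

for message in consumer:
    tweet = message.value
    label = bool(classify(tweet))
    # (5) save the label back into MongoDB by updating the record
    tweets.update_one({"id": tweet["id"]}, {"$set": {"fake_news": label}}, upsert=True)
    # (6) forward fake news to a dedicated topic for downstream consumers
    if label:
        producer.send("fake-news", tweet)
```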

Regarding the analysis itself, we used the eAE as the analytical platform for all the analyses carried out in this project. The project illustrates the flexibility, the comprehensive end-to-end support of data science workflows and the multitenancy capabilities of the eAE to support large-scale concurrent loads. The submission of the jobs was done using the eAE’s Python PIP package. This work was implemented using two different languages (R and Python) and various packages (word2vec, scikit-learn, etc.) in order to use a broad variety of techniques for exploratory and analysis purposes. The purpose of using two different languages was to rely on the individual strengths of each one. In addition, every analysis was done using 5-fold cross-validation to ensure the robustness of the analyses. We used R 3.4 for the statistics and the NLP tasks. In order to speed up the computation, the NLP tasks were executed on GPUs. Python 3 was used for the initial data transformations and the remaining analyses, including the data filtering, PCA and the different machine learning algorithms.

Finally, the deployment of the eAE at the DSI has enabled the pooling of the hardware resources from different projects in a manner transparent to the users. This sharing gives individual projects access to larger compute capacities than they would otherwise have. It has been critical in this instance, as the exploration, hyperparameter tuning and cross-validations alone have consumed well over 200k hours of compute time over the span of the project. Consequently, the eAE has enabled faster iterations, access to resources that otherwise would not have been available (GPUs) and an increased average load across machines. All those benefits also de facto decreased the financial cost of conducting the research.

6.3.3 Feature Selection

As previously indicated, several pieces of information about the tweets are contained in the dataset or can be derived from it. This section focuses on describing those features (obtained directly from the Twitter API, derived from them, or extracted from the texts) and on checking whether or not their value distributions differ between viral tweets containing fake news and tweets not containing them. To do so, we look for statistically significant differences in the distributions of features through a Kolmogorov-Smirnov test [MJ51] between the set of tweets containing viral fake news and the complementary set (viral tweets not containing fake news). The null hypothesis was H0: the two samples come from the same distribution, against the alternative hypothesis H1: the two samples come from different distributions.

Feature                       p-value     t-stat   Mean Fake News   Mean Other Tweets
fav count                     0.00        0.184    2896             4022
user.followers count          0.00        0.129    140609           235797
user.listed count             0.00        0.121    907              1679
user.verified                 0.00        0.209    49.219%          70.161%
num hashtags                  0.00        0.169    0.721            0.550
num mentions                  0.00        0.267    1.731            2.076
user.favourites count         1.65e-12    0.103    1921             1161
user.friends count            5.07e-11    0.094    992              696
num media                     2.07e-10    0.091    0.268            0.177
user.statuses count           2.32e-08    0.081    4103             12355
-------------------------------------------------------------------------------------
retweet count                 5.35e-01    0.022    2284             2328
user.default profile          4.96e-01    0.022    20.879%          18.646%
num urls                      1.00e+00    0.005    0.353            0.346
user.default profile image    1.00e+00    0.001    0.231%           0.319%
user.profile use bg image     1.00e+00    0.009    76.750%          75.888%

Table 6.4: Analysis of features coming from Twitter API. The results (p-value and t-stat) come from the Kolmogorov-Smirnov test [MJ51] on the distributions between the viral fake news and the other viral tweets. Rows are ordered by p-value. Variables above the line are those whose differences are considered statistically significant (p-value smaller than 0.01).

We consider that a difference between sets is statistically significant if the p-value is lower than 1%. For the continuous variables with extreme values, we did the test on the (decimal) logarithm in order to have a more representative scale. For others (e.g. num hashtags, num mentions, num media, num urls), we compute the number of items per tweet. p-values smaller than 1e-16 are reported in all tables as 0.00.
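A minimal sketch of this comparison with SciPy’s two-sample Kolmogorov-Smirnov test is given below; the column names and the synthetic data are illustrative only.

```python
# Minimal sketch of the feature comparison: two-sample KS test between viral
# fake news and other viral tweets, on the decimal logarithm for heavy-tailed
# continuous features.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare_feature(df, feature, log_scale=False, alpha=0.01):
    fake = df.loc[df["fake_news"] == 1, feature].dropna()
    other = df.loc[df["fake_news"] == 0, feature].dropna()
    if log_scale:                      # decimal logarithm for extreme-valued features
        fake, other = np.log10(fake + 1), np.log10(other + 1)
    stat, p = ks_2samp(fake, other)
    return {"t-stat": stat, "p-value": p, "significant": p < alpha}

# Toy example with a synthetic favourite count.
rng = np.random.default_rng(0)
df = pd.DataFrame({"fake_news": rng.integers(0, 2, 2000),
                   "fav_count": rng.lognormal(7, 1.5, 2000)})
print(compare_feature(df, "fav_count", log_scale=True))
```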

Those key features (i.e. those with statistically significant differences in the distributions), if any, will be the ones we ultimately feed to different classification algorithms (see next section) in order to attempt the automatic flagging of tweets containing fake news.

Features from Twitter API

The main data features were those directly given by the Twitter API. Table 6.4 lists all the variables we considered (a detailed description of each one can be found on Twitter’s API website) together with the results from the Kolmogorov-Smirnov test. As seen, there are indeed several variables for which the differences in distributions are statistically significant (those above the horizontal line).

Figure 6.7: Density distribution of the decimal logarithm of the continuous variables from Table 6.4 that are statistically significant. From the image, we can see that viral tweets not containing fake news (in blue) tend to have peakier distributions.

Note that a statistically significant difference does not mean that the difference is large or ultimately meaningful. Therefore, in order to accurately select a group of promising features for the discrimination task, we need to consider both the size of the difference and its statistical significance. Visual inspection of the distributions, together with the actual value of the difference, can help to identify those features. Figures 6.7 and 6.8 visually display the density distributions of the features that do have statistically significant differences in their distributions.

Apart from the features listed in Table 6.4, we looked at the distribution of tweet sources (iPhone, Android, web client, media studio, etc.) for both viral tweets containing fake news and those not containing them, and both distributions are very similar: 40% vs 42% for iPhone, 33% vs 32% for the web client and 10% vs 9.5% for Android. Those marginal differences confirm there is no significant difference for this feature and do not offer any other meaningful insight.

Figure 6.8: Distribution of the four significant discrete variables (user.verified, num hashtags, num mentions and num media) from Table 6.4. The test for the proportion of verified accounts confirms an expected fact: the proportion of verified accounts is much lower for viral tweets containing fake news than for other viral tweets, suggesting that fake news tend to be created by more ‘anonymous’ people. Besides, tweets with fake news generally have more hashtags and media but fewer mentions.

Figure 6.9: Most recurrent words in the tweets (single words and bigrams).

We also analysed word frequencies (as single words and bigrams) within the text of the tweets; the most recurrent ones are depicted in Figure 6.9. Even with huge discrepancies between the distributions, we believe this is not a very resilient feature to consider for flagging fake news, as these words are too specific to this dataset and to the type of news, making them unusable for different topics, contexts and languages.

Finally, we also analysed the most used hashtags in both subsets of tweets. Of course, the most common were the ones used to collect the tweets. Leaving those aside, there is no statistical difference between the remaining hashtags used in tweets containing fake news and those used in tweets not containing them. Figure 6.10 shows their frequency distribution. It is interesting to see that a couple of hashtags only appear in tweets labelled as fake news. While this might be a product of the dataset, it is an issue that probably deserves further research. However, once again, the frequency of hashtags is hardly generalizable or usable on another dataset.

Figure 6.10: Frequency of appearance of most used hashtags in tweets containing fake news (red) and not containing them.

Features about diffusion of tweets

In order to obtain additional useful features that might have discriminating power, we enriched the original dataset by computing the day and the hour at which each tweet was published (in the EST time zone). As these variables are categorical, the average difference test does not make sense for them.

However, the number of days separating the creation of a tweet from the creation of its Twitter account is, on average, 1941 days for tweets containing fake news and 2100 days for tweets not containing them. This difference corresponds to a computed p-value of 1.04e-08, which is indeed highly significant. This result suggests that accounts spreading fake news were created more recently (as also reported by [VRA18]). This is coherent with the observed fact that users who spread fake news eventually have their accounts deleted and are forced to create new ones.

We believe that the viral propagation pattern of a tweet is important. In fact, the evolution in the number of retweets (and favourites) varies greatly depending on the tweet. For example, it can be linear, exponential or polynomial over time, and differences in these patterns between fake news and the rest could allow us to distinguish them easily. We can observe from Figure 6.7 that the retweet count and the favourite count are correlated (coefficient of 0.665), as are the number of followers of a user and the number of times he has been listed (coefficient of 0.911).

For each tweet for which we have the relevant data, we computed the time (in hours) it takes to get to 10, 20, 50, 100, 250, 500 and 1000 retweets (and to the same numbers of favourites, respectively). Figure 6.11 shows that, for most of the tweets, these features are very correlated, with the exception of 1000 retweets, which is not that correlated to any of the favourite features. It is also interesting to see that there are no negative correlations between these variables.

Figure 6.11: Correlation of the features related to the spreading of tweets. rt stands for retweet (e.g. rt timeto10, time to get to 10 retweets), and fav for favourite (e.g. fav timeto10, time to get to 10 favourites).

Figure 6.12: Distribution of the decimal logarithm of the time (in hours) to get to 1000 favourites (fav timeto1000) for both tweets containing fake news and tweets not containing them. The associated p-value is 9.60e-11, which demonstrates the significance of the propagation pattern.

Figure 6.12 shows, in particular, the different distributions of the time to get to 1000 favourites (which is the most distinctive feature of this set) for tweets containing fake news and those not containing them. Tweets containing fake news are generally slower to get a thousand favourites.

Features extracted from text

Finally, and in order to obtain a more complete picture of what viral fake news look like, we explored some textual features of the tweets; more precisely, we looked at:

• User’s name
• User’s screen name
• User’s description
• Tweet’s text

In particular, we extracted several text features (by means of regular expressions) from the previous variables, e.g. the percentage of capital letters, of digits, of special characters, etc. Our aim is again to analyse whether any of those have a different value distribution that can be useful to discriminate between tweets containing fake news and tweets not containing them.

Feature                           p-value    t-stat   Mean Fake News   Mean Other Tweets
text.digits                       8.02e-09   0.084    2.676%           2.104%
text.caps                         1.41e-08   0.082    14.016%          12.665%
user.description.caps             1.50e-04   0.059    11.970%          12.511%
user.name.caps                    8.27e-04   0.053    23.592%          23.287%
user.name.weird char              2.95e-03   0.049    7.533%           4.606%
user.screen name.caps             3.64e-03   0.048    18.444%          17.664%
text.exclam                       4.40e-03   0.047    0.222%           0.309%
-----------------------------------------------------------------------------------------
user.screen name.weird char       2.18e-01   0.028    3.162%           2.502%
user.description.exclam           2.53e-01   0.027    0.241%           0.283%
user.screen name.digits           4.44e-01   0.023    1.812%           1.462%
user.description.digits           6.52e-01   0.020    1.458%           1.552%
user.screen name.underscores      8.02e-01   0.017    1.350%           1.040%
user.description.nonstandard      8.14e-01   0.017    2.879%           2.892%
user.name.digits                  1.00e+00   0.006    0.510%           0.288%
user.name.underscores             1.00e+00   0.000    0.004%           0.006%
text.nonstandard                  1.00e+00   0.008    0.458%           0.447%

Table 6.5: Features extracted for the text analysis. Again, rows are ordered by statistical significance; significant variables are above the line. It is interesting to see that those are mostly the ones associated with spellings used by bots (randomly generated to avoid collisions).

The results reported in Table 6.5 lead us to conclude that individuals with higher numbers of special characters and capital letters in their user name are more likely to tweet fake news. In addition to that, a tweet with a high proportion of capital letters, digits and exclamation points has a higher chance of being fake news.
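For illustration, the kind of regex-derived features reported in Table 6.5 can be computed as follows; the feature names and the character class used for ‘weird’ characters are our own approximations of what was actually extracted.

```python
# Minimal sketch of the regex-derived text features: proportion of capital
# letters, digits, exclamation marks and unusual characters in a text field
# (tweet text, user name, screen name or description).
import re

def text_features(text):
    length = max(len(text), 1)
    return {
        "caps": len(re.findall(r"[A-Z]", text)) / length,
        "digits": len(re.findall(r"[0-9]", text)) / length,
        "exclam": text.count("!") / length,
        "weird_char": len(re.findall(r"[^\w\s.,!?@#:/'-]", text)) / length,
    }

print(text_features("BREAKING!!! Candidate X did it AGAIN... 100% confirmed"))
```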

Sentiment Analysis features

Finally, we performed a sentiment analysis on the actual content of the tweets. Sentiment analysis aims at identifying, extracting, quantifying and studying affective states and subjective information. Often, sentiment analysis is used to determine the attitude of a speaker [SOS+67]; in our case, the author of a tweet.

When human readers approach a text, they use their understanding of the emotional intent of words to infer whether a section of text is positive, negative or neutral, or perhaps characterised by some more nuanced emotion like surprise or disgust. Text mining tools are available to programmatically extract the emotional content of text [SR17], and they usually boil down to two main approaches [KSTA15]: the lexical approach and the machine learning approach.

Figure 6.13: Comparison of the different core sentiments between tweets containing fake news and tweets not containing them.

Our first step was to analyse the sentiments of the whole text field (which includes hashtags) of the tweets in our dataset using the National Research Council Canada (NRC) lexicons [MT13]. Figure 6.13 clearly shows that fake news generally carry less joy, trust and positive emotion but more surprise. By looking at the temporal evolution of the emotions (see Figure 6.14), we noticed that the emotions present a higher level of randomness in tweets with fake news, while trust and positive emotions dominated in tweets not containing fake news.
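A minimal sketch of the lexicon-based scoring is given below; it assumes the NRC Emotion Lexicon has been downloaded locally as a tab-separated word/emotion/flag file, which is an assumption on our part, and the file path and toy lexicon are hypothetical.

```python
# Minimal sketch of lexicon-based emotion counting over the text of a tweet.
import re
from collections import Counter, defaultdict

def load_nrc(path="NRC-Emotion-Lexicon.txt"):
    """Load a 'word<TAB>emotion<TAB>0/1' association file into a dict of sets."""
    lexicon = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, emotion, flag = line.rstrip("\n").split("\t")
            if flag == "1":
                lexicon[word].add(emotion)
    return lexicon

def emotion_counts(text, lexicon):
    counts = Counter()
    for token in re.findall(r"[a-z']+", text.lower()):
        counts.update(lexicon.get(token, ()))
    return counts

# Toy lexicon standing in for the real NRC file, to show the counting step.
toy = {"disaster": {"fear", "negative", "sadness"}, "fraud": {"anger", "negative"}}
print(emotion_counts("This election is a disaster and a total fraud!", toy))
```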

A sentiment score can also be computed using deep learning techniques [ZWL18], which sometimes perform better than the lexicon approach as they can capture some subtleties that the lexicon approach misses. In our case, we used both word2vec from Google [MCCD13] (to compute a sentiment score with the vectors given by a model pre-trained on the sentiment140 dataset [GBH17]) and fasttext from Facebook [JGBM16] (to identify informative text features that could be directly added to the final classification model).

Figure 6.14: Evolution of the different core sentiments over the course of the four months, for tweets containing fake news and tweets not containing them.

Figure 6.15 shows the overall probability of a tweet being positive according to the word2vec-based model. The average probability over the whole 4 months is more positive than negative, regardless of whether the tweets are fake news or not. However, the trend for fake news is lower than for the other tweets, which is coherent with the previous results stating that tweets with fake news tend to be more negative. The wider spread can be explained by the scarcity of fake news at some periods and does not constitute a significant indicator.

Figure 6.15: Difference in the evolution of the sentiment computed by word2vec between tweets containing fake news and other tweets. Each point represents a tweet in the timeline of our dataset and the probability of the tweet being positive. The blue line represents the average probability per day.

As the input model for fasttext, we used a model pre-trained on a Wikipedia database [BGJM16] with word vectors of dimension 300. For each tweet, we computed the associated vector for each word in the main text. Then, we computed the element-wise mean and standard deviation of these word vectors, finally obtaining two vectors of length 300 for each tweet. In order to reduce dimensionality, we applied a Principal Component Analysis (PCA), choosing only four components for the mean (text ft mean ci) and two for the standard deviation (text ft sd ci). Those correspond to 40% of the total information. This specific number of components has been selected using the elbow method.

In addition to those features from the main text of the tweets, we also analysed the emojis contained in them. 7.13% of the tweets in the dataset contained at least one emoji, with an average of 2.4 emojis per tweet. Some tweets had a surprisingly high number of emojis (the largest number being 55 emojis in a single tweet), but it is quite common that a tweet contains one single emoji repeated many times. We narrowed our study to tweets containing between 1 and 10 emojis (this covers 98.7% of the cases in which at least one emoji is present). Despite some interesting information yielded by the analysis (e.g. the American flag is by far the most used emoji, see Figure 6.16), this particular study did not lead to any meaningful insight that could contribute to the creation of an indicator separating fake news from non-fake news. Figure 6.17 shows that the distributions of emojis and their sentiment scores in the fake news and in the other tweets are very similar.

Figure 6.16: Most used emojis in the dataset

Our results confirm our previous findings [AOMS17b] on a smaller dataset (just the election day) about the relevance of the distribution of followers, the number of URLs in tweets, and the verification of the users.

6.3.4 Fake news classification

Figure 6.17: Distribution of the decimal logarithm of the number of emojis and of the sentiment score of tweets with emojis, for the fake news and the other tweets.

Having described and selected a set of appropriate features about tweets, we focus in this section on the task of using those features to classify tweets into fake news or not. In other words, our aim is to build a classifier capable of tagging unseen viral tweets as fake news or not.

In order to have a larger coverage of our experimentation, we considered six different subsets of features from the ones previously described:

• Only the features coming from Twitter API (main)

• Features from Twitter API + features extracted from text (main with regex)

• Features from Twitter API + features from the sentiment analysis - lexicon dictionary approach (main with sentiments dict)

• Features from Twitter API + features from the sentiment analysis - deep learning approach (main with DL NLP)

• Features from Twitter API + time-related features (main with time)

• All the features (all features)

For the last two subsets, we only used 3717 tweets. The reason is that the collection of the tweets in the dataset dates back to November 8th 2016, but some of the tweets were first published before that date (within the collection window, only retweets of them were collected). Hence, some pieces of information about the original tweets (i.e. data only available when first published) are not included in the dataset and cannot be obtained now. Because of that, we limit ourselves to using the time series only for tweets that were published after the collection started.

In all cases, we first split the dataset into:

• A training set (80% of the data). For the algorithms that need hyperparameter tuning, we applied 5-fold cross-validation within this subset.

• A test set (20% of the data) to compute the final performance of the models.

In order to check the predictive capability of our selected features, we tested different classification algorithms, including a polynomial kernel SVM, k-NN, Logistic Regression, Random Forests, and Boosted Trees (for the latter, we chose XGboost, which is usually used to achieve state-of-the-art results in this context [CG16]).
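A minimal sketch of this comparison with scikit-learn and XGBoost is shown below; a synthetic dataset with roughly the same class imbalance stands in for the real feature matrix, and the hyperparameters are left at illustrative defaults.

```python
# Minimal sketch of the model comparison: 80/20 split, then AUC and recall at a
# 0.5 threshold for each classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# ~9000 samples with ~18.6% positives, mimicking the imbalance of the dataset.
X, y = make_classification(n_samples=9000, n_features=20, weights=[0.814], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "SVM (poly)": SVC(kernel="poly", probability=True),
    "k-NN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_te, proba):.3f}, "
          f"recall={recall_score(y_te, proba > 0.5):.3f}")
```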

Due to our dataset being highly unbalanced (as shown in Table 6.3), accuracy is not a proper metric to measure the performance of the models. Instead, we turn our attention to two other metrics:

Recall (for a threshold of 0.5): since fake news have an impact, we want to optimise the ratio of fake news detected and be sure that the number of fake news not detected is really low, even if it means over-filtering and having some false positives.

AUC: it gives a good overview of how the true positive rate and the false positive rate trade off, regardless of any threshold.

The computed AUC for each of them on the different sets of features is summarised in Figure 6.18. Unsurprisingly, XGboost gave the best performance, which is coherent with what others have reported for this type of classification [OPSS11, Wai16, Gor17]. Because of this, we will solely focus on this model for the rest of this section.

Figure 6.18: AUC computed on all subsets of features for the different machine learning algorithms evaluated.

Figure 6.19: Best performances for each subset of features, and for each metric of performance.

Figure 6.19 shows the results obtained for those metrics for each set of variables. Accuracy, precision, recall, and F1-score have been computed for a threshold of 0.5 (meaning that each observation for which our model computed a probability over 0.5 has been labelled as fake news). It is clear from the figure that all these metrics improve as more features get added. In particular, the AUC went from 70.73% with only the Twitter features to 73.49% with all the features; likewise, the recall went from 64.18% to 71.13%.

Impact of the hyperparameters

We also looked at the impact of hyperparameters on the performance of the XGBoost model (which was the best performing amongst the ones tested). For this, we used 5-fold cross-validation and calculated optimised values for the following hyperparameters (a sketch of the tuning procedure is given after the list):

• The number of iterations
• The maximum depth of each tree at each step
• The learning rate
• The maximum number of features
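The sketch below illustrates, under the same assumptions as before, how such a search over these four hyperparameters could be run with 5-fold cross-validation. The grid values, and the mapping of the "maximum number of features" to XGBoost's colsample_bytree parameter, are illustrative choices rather than the exact grid used in the study.

```python
# Minimal sketch of the hyperparameter search (values are illustrative only).
# "Number of iterations" maps to n_estimators, "maximum depth" to max_depth,
# "learning rate" to learning_rate; the fraction of features sampled per tree
# (colsample_bytree) is used here as a proxy for the maximum number of features.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1],
    "colsample_bytree": [0.5, 0.75, 1.0],
}

search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    cv=5,                 # the same 5-fold cross-validation as above
    scoring="roc_auc",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV AUC:   ", search.best_score_)
```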

The main problem we notice from Figure 6.20 is that precision and recall are inversely correlated (which explains why the F1-score is so low). This effect is most obvious for the maximum depth of the trees: as it grows, the precision increases but the recall decreases, as shown in the figure.

Impact of selected variables

Finally, we looked at the contribution of the different variables to the final model (XGBoost) in order to evaluate their importance in the context of identifying viral political fake news. To do so, we have plotted in Figure 6.21 the most important variables for each model for which one metric is maximal (a sketch of how these importances can be extracted from the model is given after Figure 6.21). There are some interesting insights to report:

• It seems that the favourite count is always one of the three most important variables for every metric.

Figure 6.20: Evaluation of the impact of hyperparameters on XGboost performance

• The first component of the PCA of the standard deviation of the fastText embeddings is also a very important variable for the models with, respectively, the best accuracy, AUC and precision.

• The contribution of features coming from Twitter outweighs the contribution of the other features (propagation, extracted from text, and sentiment analysis). Indeed, accuracy and AUC only go up by 2.76% when all of them are added to the final model.

Figure 6.21: Most important variables for the best model for the AUC and the recall.
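As referenced above, the following minimal sketch shows how such variable importances can be extracted and ranked from the tuned XGBoost model; it assumes the feature matrix is a pandas DataFrame carrying column names, and omits the plotting itself.

```python
# Illustrative sketch of how per-feature importances can be read out of the
# tuned XGBoost model and ranked, as in Figure 6.21 (plotting code omitted).
import pandas as pd

best_model = search.best_estimator_   # tuned model from the grid search above
importances = pd.Series(best_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```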

6.3.5 Conclusion

In this section, we presented a study of features that could facilitate the automatic detection of political fake news, and validated our findings by building an ensemble method for fake news identification on Twitter. As reported, it is a promising line of action for efficient and effective early detection of fake news.

We used state-of-the-art machine learning techniques to identify significant features and used them to build an ensemble model of trees that identifies fake news with significant performance. We improved the model by introducing derived features describing how fast the tweets spread and the contents of their texts. However, the results are still not ideal, as we were forced to sacrifice some precision to get better recall. Arguably, it is more acceptable in this context to mistakenly flag true news as fake than to miss some fake news.

The first major limitation that we faced was the missing data from the public version of the Twitter API due to Twitter's limitations. We believe that our model would be more accurate if we could get access to more data from the Premium version of the Twitter API. The Premium version would allow us to:

• Have more metadata on each tweet (some of it is available for some tweets but is missing for others in the public open API or we failed to capture it properly).

• Build more accurate derived features (such as the evolution of the favourites and retweets counts).

• Study further interactions between tweets. Our hypothesis is that users react differently to fake news and real news (as recently reported by [VRA18]).

An added difficulty of the current dataset is obtaining accurate and unbiased labelling (as Table 6.3 illustrates). It seems clear that different cultural backgrounds, knowledge of American culture, and English language proficiency (among the most relevant factors) induced vastly different perceptions of whether a piece of news was fake or not. Additionally, the lack of a proper fake news definition (thus leaving it to individual interpretation) plays a major role, which other researchers have addressed by focusing on false news instead.

As the dataset is only based on the US elections of 2016, it is likely that our model might be overfitting. Indeed, the model we built might work well in this specific context but might not be applicable in another scenario. Our next steps would be to add data on other elections to make the model more robust for this context. Shared features are likely to be the ground truth of any model trying to identify political fake news.

Regarding this generalization effort, one area of improvement for our model would be to identify the actual context and the subject of the tweet automatically with NLP algorithms. An extensive amount of work has been carried out in this domain and we would start by reusing this work as much as possible [SLC17]. Once the subject and context are extracted, we could automatically target a specific model better suited for the identification. To further improve the accuracy, we plan on applying deep learning convolutional neural networks to extract features from the profile pictures and the background pictures to create new derived features.

Finally, our results are mostly in agreement with the findings reported by Vosoughi et al. [VRA18]. One notable exception concerns the spreading of fake news (generally slower than other tweets in our case; faster in the Science study). This can be due to the fact that our study is specific to the US elections, whereas Vosoughi et al. collected tweets from 2006 to 2017, and this property may be specific to this sample of the data.

Chapter 7

eTRIKS Analytical Environment supporting Open Science

In this chapter, we will present the measures that have been taken to ensure the sustainability, openness, and adoption of the platform. We will also present how reconciling the research and engineering worlds has enabled us to bring the best of both into this work. Indeed, a balance between good engineering practices (TDD, microservices architecture, etc.) and cutting-edge research provides optimal efficiency to researchers in the field of medical research.

7.1 Sustainability of the platform

7.1.1 Hosting of the project and supporting the users

We believe that quality research depends on the availability of tools and applications to researchers. This belief is aligned with the Infrastructure School [FF14] proposed by Fecher and Friesike for Open Science. With that goal in mind, we have chosen to host the eAE on GitHub (see Figure 7.1) to make the code available to the community in an open source fashion. GitHub is a web-based hosting service for version control using Git and is well known to be among

the top open source hosting platforms. The eAE has been open-sourced under the MIT license to be as permissive as possible. We leverage GitHub's built-in capabilities for bug reporting and tracking, documentation (through the wiki and README) and agile project management. Following that modus operandi, users and new developers can contribute to the projects in a straightforward way without scattering across different platforms. The issue tracker allows users to report issues to the developers, discuss with them to identify the root cause and track the progress of the fix.

The use of Git as the version-control system for tracking changes enables better management of the master branch by the owners and encourages potential contributors to contribute to the project in a safe fashion. Any contribution made through a ‘pull request’ can be reviewed by the other members and, if the request meets the quality standards and the build is successful, the request is securely merged into the code base.

In order to facilitate interactions with the platform and to make it accessible to a wider audience, we have published on Postman1 a set of example queries (status of the clusters, user creation/deletion, job creation, etc.) and their associated answers. Extensive documentation of each supported query is also available for users and administrators. Figure 7.2 illustrates all the queries available to users and administrators. Those queries can either be run directly in a user's local Postman, by importing the published examples from the web, or in Python by copying the code generated automatically by Postman for the given request. Postman can also generate the requests for PHP, Go, jQuery, Ruby, cURL and Node.js, giving substantial freedom to developers and users. This initiative is comparable to the Public School [FF14] of thought for Open Science by addressing the challenge of ease of use for interested non-experts.
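For illustration only, the snippet below shows the general shape of the Python code that Postman generates for such a query; the URL, payload fields and headers are hypothetical placeholders rather than the actual eAE API.

```python
# Hypothetical illustration of Postman-generated Python code for a
# job-creation query; the URL, header names and payload fields below are
# placeholders, not the actual eAE API.
import requests

payload = {
    "type": "python_job",            # placeholder job type
    "main": "analysis.py",           # placeholder entry point
    "params": ["--input", "data.csv"],
}
headers = {"Content-Type": "application/json"}

response = requests.post("https://eae.example.org/job/create",
                         json=payload, headers=headers)
print(response.status_code, response.json())
```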

1 https://www.getpostman.com/

Figure 7.1: Illustration of the eAE's Scheduling and Management service hosted on GitHub. The repository contains a README describing the main features of the service, the Docker file to build the Docker container, the YAML build file (.travis.yml) for Travis, the tests to automatically validate the build, and the code of the service. The issues tab contains all the issues reported by users and developers or future features related to the service. The pull requests tab contains all currently open pull requests to be merged into the development branch by the admins. The wiki tab contains all the necessary documentation of the service (design of the service, description of the API, comments, etc). The shields (build and dependencies) are dynamic, which allows anyone to check the current status of the project.

Figure 7.2: Illustration of the create job query in Postman with the associated code (the Python version of the request in this instance). We can see the type of request (POST), the URL, a description of the request, the parameters in the body of the query, as well as an example of the response for the query (a 200 example).

7.1.2 Continuous integration and system deployment

Continuous Integration (CI) is a development practice that requires developers to integrate code into a shared repository at regular intervals. The main aim of CI is to limit integration problems. The check-ins are done against individual branches dedicated to a specific bug-fix or feature, which are then merged into a general development branch. Each check-in is then verified by an automated build that compiles the code and runs the dedicated tests of the project, allowing teams to detect problems early. Those builds are carried out using Travis2, a continuous integration service that integrates natively with GitHub. Once the build is green (i.e. validated by the build factory), the repository is pushed to DockerHub3 to build the associated Docker container. Once successfully built, those containers become publicly available so that the projects can always deploy the latest version. To ensure the platform up-time is as close as possible to 100%, Docker Compose (in auto-restart mode) in a Docker Swarm environment is used for the deployment of the services on adopters' premises. This allows the services to be deployed multiple times, across different clusterised host machines, for scalability and resilience purposes. The Docker files are all located in the projects under the DockerHub folder to ease the individual modifications of the microservices by the developers.

To ensure the robustness and quality of the platform, we have deployed a dependency management tool for Node.js projects called David4 and a dependency-update automation tool called Renovate5. Those tools help ensure that the projects' dependencies are always up to date and that updated dependencies do not break the projects.

7.1.3 Agile methods

Agile methodology has become very popular for software development in the last decade. Agile methodologies are all based on the four core values outlined in the Agile Manifesto [BBB+01]. The first and foremost intent of agile methods is to enable the development of complex systems and products with dynamic, non-deterministic and non-linear characteristics to be sustainable and successful.

2 https://travis-ci.org/
3 https://hub.docker.com/
4 https://david-dm.org/
5 https://github.com/renovatebot

We broke down development work into small increments that minimise the amount of up-front planning and design. Those increments were particularly well suited to the microservices architecture of the eAE. Each iteration represented a single microservice and involved cross-functional work across all functions: planning, analysis, design, coding, unit testing, and acceptance testing. Our approach was very much akin to test-driven development (TDD) and, as such, every new function added to the code base was accompanied by a set of tests. For continuous integration, we relied on Travis CI, which automates the non-human part of the software development process and facilitates the technical aspects of continuous delivery. Thus, at the end of each iteration, a new microservice was implemented, thoroughly tested, documented and ready for production. That atomicity ensured that all necessary requirements were met. Once all iterations for a given Layer were completed, the microservices were integrated together and tested. Thanks to that incremental approach, the integration between services and layers has been almost instantaneous and seamless, and overall has saved a lot of development time.

The agile methods have also enabled easier communication with the two other people who contributed to the platform. The daily stand-up meetings helped coordinate the development efforts, ensure the soundness of the designs and smooth out any impediments that could hinder their timely delivery.

Finally, making the code quick to understand and easy to edit is key to the sustainability of the platform, as it encourages adoption, lowers entry costs and lowers the cost of developing new features. The agile methods have helped us keep the platform “evolvable” by decreasing the total cost of ownership of the application, and helped us design the application in a way that facilitates the adoption of newer frameworks (if necessary), encourages extensive unit tests and detailed documentation with illustrations, and reduces technical debt (duplicated/dead code or misleading comments).

7.2 Future of the platform

7.2.1 Adopters

The first adopters and strongest supporters have been the Data Science Institute and the eTRIKS project at Imperial College London. The adoption of the platform by some projects has already led to scientific contributions such as DeepSleepNet [SDWG17], FakeNews (publication pending) and OncoTrack [GYVdS+19]. We have also been able to support various analytical efforts in the context of the Innovative Medicines Initiative (IMI) and the European Union's Horizon 2020 projects such as BioVacSafe6, U-BIOPRED7, Pioneer8 and AiPBAND9. The ITMAT project10, which is part of the National Institute for Health Research Imperial BRC, has also adopted the eAE as their analytical environment to propel their analyses.

Another major adopter of the eAE has been International Business Machines Corporation (IBM). They added it to their portfolio of supported projects, and it is advertised to their clients as part of their large-scale computing platforms and POWER architecture.

The most recent adopter has been the Open Algorithms (OPAL) project, as detailed in Chapter 5. They have made substantial contributions to the project through the financing of development work for the privacy-related features and the use of the platform in pilot projects with Orange-Sonatel in Senegal and Telefónica in Colombia. Those two projects are now running production deployments of the eAE, and around a dozen people (from the Senegalese government, the United Nations, the Agence Nationale de Statistique et de la Démographie, and researchers from Orange and Telefónica, among others) are actively using them in each country. At the time of writing, an additional five actors are evaluating the adoption of the platform in their respective countries.

6 http://www.biovacsafe.eu/home
7 https://www.europeanlung.org/en/projects-and-research/projects/u-biopred/home
8 https://prostate-pioneer.eu/
9 https://www.aipband-itn.eu
10 https://www.imperial.ac.uk/itmat-data-science-group

7.2.2 Community building

Community building is an important aspect of the life of open source software, as it ensures lasting support and development of the platform. We have started building a community around the eAE by presenting the work done at various conferences11 and providing educational resources for parties interested in trying out the platform. Those educational resources are an important aspect of our Open Source efforts, as they aim at lowering the entry cost for the adoption of the eAE and encourage peers to review the platform and give us feedback for improvements.

Those presentations have enabled new collaborations (with the United Nations, for example) and created momentum around the project. Work on branding and a visual identity has been started to differentiate the eAE from its competitors and highlight its benefits, although we believe further work in that area would help project a clear, positive image to collaborators. To ensure further extension of the platform, we believe we should engage the medical and bioinformatics communities more directly by inviting them to demonstrations of the platform and showing how much they could benefit from it. Those communities are key to the future success of the platform and to ensuring the accumulation of credits and citations. The second community that will benefit greatly from the work done in the context of location data is telecommunication companies. They hold an enormous amount of location data through their Call Detail Records (CDRs) while being unable to use it efficiently for privacy reasons.

11 tranSMART conferences 2015 & 2016, BioTransR 2017, IEEE Big Data 2017, Early Warning Systems Workshop 2018, ICDCS 2018

Chapter 8

Conclusion

8.1 Summary of Thesis Achievements

The automated collection of biosignals (through IoT, connected devices, etc.), the recent development of new technologies in both the clinical and ’omics fields, the abundance of high-precision diagnostic devices (producing massive amounts of data), as well as the increasing inclination towards open science and public initiatives like UK Biobank, have led to massive amounts of diverse data being available to medical researchers. However, the availability of that data alone cannot enable a medical revolution and the move from symptom-based to evidence-based medical diagnosis (and treatment). In order to achieve that paradigm shift, new tools, infrastructure, and standards must be created.

In this thesis, we propose a new component-based, distributed framework for data exploration and high-performance computing in order to make that humongous amount of data accessible and seamlessly usable by any researcher. The concrete contributions of this thesis are as follows:

• Platform: The development of the eTRIKS Analytical Environment (eAE) as an answer to the needs of analysing and exploring massive amounts of medical data. The eAE is a modular framework which enables the analysis of medical data at scale. Its modular architecture allows for the quick addition or replacement of analytics tools and modules with little overhead, thereby ensuring support of users as data analytics needs and tools evolve. The eAE is flexible enough to support a variety of use cases across the biomedical domain, from statistical analysis to machine learning and deep learning. Each component has been developed with resiliency and redundancy in mind to ensure that the platform delivers the highest uptime and performance possible. A multi-master scheduler has been specifically developed to handle internally all the scheduling and management of jobs and services. The languages currently supported are Python, R and C, with the addition of Spark and TensorFlow.

• Privacy: The specialization of the eAE architecture into a privacy-preserving platform for location data and privacy-preserving analytics in the context of public health. This work illustrated the modularity, resilience and scalability of the platform in supporting new classes of problems. We also presented a formal verification of our population density algorithm, the privacy mechanisms that have been put in place on the platform, and relevant use cases in the context of public health analytics using location data.

• Analytics in life science: To illustrate how the eTRIKS Analytical Environment can be used for managing and analysing large-scale translational research data in tranSMART, we have implemented three bioinformatics analysis pipelines: an iterative model generation and cross-validation pipeline for biomarker identification, a general statistical analysis pipeline for hypothesis testing, and a pathway enrichment pipeline using KEGG to demonstrate the performance of the proposed architecture. We have also presented a deep learning model, named DeepSleepNet, for automatic sleep stage scoring based on raw single-channel EEG, and shown how the eAE has successfully and seamlessly supported the needs of the researchers for distributed deep learning computations at scale.

• Extensibility of the eAE and identification of deceptive news: Politically deceptive news has become a major challenge of our time, and its successful flagging a main source of concern for publishers, governments and social media. The approach we presented in the context of tweets collected during the 2016 US presidential election successfully identifies characteristic features (including temporal diffusion and NLP) that can help in the process of automating the identification of tweets containing deceptive news. Subsequently, we make use of these features to propose a predictive ensemble model that allows us to assess whether or not a tweet contains deceptive news.

8.2 Future Work

Even though we have developed an architecture for the efficient and scalable analysis of massive amounts of medical data, it is clear that many challenges remain in order for it to truly reach its full potential. Here, we discuss the remaining challenges and suggest potential future work to address them:

• The need for data privacy in the context of medical data is of paramount importance and the first efforts to address it have started. Even though the issues are now well understood and legislation and guidelines have been implemented in recent years in a growing number of countries, the number of systems with privacy by design is still extremely limited. In the future, we need to agree on a common platform to support those privacy requirements in the medical field while allowing the use of data distributed across countries to accelerate discoveries and facilitate reproducibility for better science. In addition to the platform, a paradigm shift in the design of analytics needs to take place to meet strict privacy rules, similar to our implementation of the density algorithm presented in Chapter 5.

• The current implementation of the architecture does not integrate the concept of federation of the computation and storage layers. There is no technical limitation from an architectural standpoint, and federation would enable researchers to leverage different data repositories for any analysis seamlessly and maximise the usage of the pooled resources. The compute service would have to be re-implemented, moving from a monolithic design to a master/slave design (already in use in grid computing), coupled with the integration of a grouping mechanism for geolocality of the data. The development and integration of

that feature would likely reveal the full potential of the architecture and open new technical possibilities to researchers, as demonstrated with the DeepSleepNet project in Section 6.2.

• As the dataset is only based on the US elections of 2016, it is likely that our proposed model for deceptive news detection might be overfitting. Indeed, the model we built might work well in this specific context but might not be applicable in another scenario. Our next steps would be to add data on other elections to make the model more robust for this context. Shared features are likely to be the ground truth for any model trying to identify political deceptive news. Regarding this generalisation effort, one area of improvement for our model would be to identify the actual context and the subject of the tweet automatically with NLP algorithms. An extensive amount of work has been carried out in this domain and we would start by reusing this work as much as possible. Once the subject and context are extracted, we could automatically target a specific model better suited for the identification. To further improve the accuracy, we plan on applying deep learning convolutional neural networks to extract features from the profile pictures and the background pictures to create new derived features.

Bibliography

[20117] GDPR et droit français: une ordonnance tardive et limitée en préparation, 2017.

[AA16] J. Archenaa and E. A. Mary Anita. Interactive Big Data Management in Healthcare Using Spark. In Proceedings of the 3rd International Symposium on Big Data and Cloud Computing Challenges, pages 265–272. Springer, Cham, 2016.

[AAoNLR+16] Yong American Academy of Neurology., Yang Li, Joel Raffel, Matt Craner, Cheryl Hemingway, Gavin Giovannoni, James Overell, Robert Hyde, Jo- han Van Beek, Fiona Thomas, Yike Guo, and Paul Matthews. Neurology., volume 86. Advanstar Communications, 4 2016.

[ABe00] Michael Ashburner, Catherine A. Ball, and et. al. Gene Ontology: tool for the unification of biology. Nature Genetics, 2000.

[ABe13] Miguel E Andrés, Nicolás E Bordenabe, and et al. Geo-Indistinguishability: Differential Privacy for Location-Based Systems. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013.

[AC14] Gergely Acs and Claude Castelluccia. A Case Study: Privacy Preserving Re- lease of Spatio-Temporal Density in Paris. In KDD ’14 Proc. of the 20th ACM SIGKDD int. conf., 8 2014.


[ADLOMS17] Julio Amador Diaz Lopez, Axel Oehmichen, and Miguel Molina-Solana. Fakenews on 2016 US elections viral tweets (November 2016 - March 2017), 11 2017.

[AG05] Lada A Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election. In Proceedings of the 3rd international workshop on Link discov- ery - LinkKDD ’05, New York, New York, USA, 2005. ACM Press.

[AG17] Hunt Allcott and Matthew Gentzkow. Social Media and Fake News in the 2016 election. Journal of Economic Perspectives, 2017.

[AH69] J. Allan Hobson. A manual of standardized terminology, techniques and scoring system for sleep stages of human subjects. Electroencephalography and Clinical Neurophysiology, 26(6):644, 6 1969.

[And04] David P. Anderson. BOINC: A system for public-resource computing and stor- age. In Proceedings - IEEE/ACM International Workshop on Grid Computing, 2004.

[ANe13] Fariba Aghajafari, Tharsiya Nagulesapillai, and et. al. Association between maternal serum 25-hydroxyvitamin D level and pregnancy and neonatal out- comes: systematic review and meta-analysis of observational studies. BMJ, 2013.

[AOMS17a] Julio Amador, Axel Oehmichen, and Miguel Molina-Solana. Characterizing Political Fake News in Twitter by its Meta-Data. arXiv, 2017.

[AOMS17b] Julio Amador, Axel Oehmichen, and Miguel Molina-Solana. Characterizing Political Fake News in Twitter by its Meta-Data. arXiv, 2017.

[Ash15] Euan A. Ashley. The precision medicine initiative: A new national effort, 2015.

[BBB+01] K. Beck, M. Beedle, A. Bennekum, A. Cockburn, W. Cunningham, M. Fowler, J. Grenning, J. Highsmith, A. Hunt, R. Jeffries, J. Kern, B. Marick, R.C. Martin, S. Mellor, K. Schwaber, J. Sutherland, and D. Thomas. Manifesto for Agile Software Development, 2001.

[BBM17] BBMRI-ERIC. The EU General Data Protection Regulation Answers to Fre- quently Asked Questions Updated Version 2.0. Technical report, BBMRI- ERIC, 2017.

[BCJ+10] Robert E Black, Simon Cousens, Hope L Johnson, Joy E Lawn, Igor Rudan, Diego G Bassani, Prabhat Jha, Harry Campbell, Christa Fischer Walker, Richard Cibulskis, Thomas Eisele, Li Liu, Colin Mathers, and Child Health Epidemiology Reference Group of WHO and UNICEF. Global, regional, and national causes of child mortality in 2008: a systematic analysis. The Lancet, 375(9730):1969–1987, 6 2010.

[BDK15] Vincent D Blondel, Adeline Decuyper, and Gautier Krings. A survey of results on mobile phone datasets analysis. EPJ Data Science, 4(1):10, 12 2015.

[Bec80] Leland Beck. A Security Machanism for Statistical Database. ACM Trans. Database Syst., 1980.

[BEC+12] Vincent D. Blondel, Markus Esch, Connie Chan, Fabrice Clerot, Pierre Deville, Etienne Huens, Frdric Morlot, Zbigniew Smoreda, and Cezary Ziemlicki. Data for Development: the D4D Challenge on Mobile Phone Data. arXiv, 9 2012.

[Ben09] Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends⃝R in Machine Learning, 2009.

[BFJ+12] Robert M Bond, Christopher J Fariss, Jason J Jones, Adam D I Kramer, Cameron Marlow, Jaime E Settle, and James H Fowler. A 61-million-person experiment in social influence and political mobilization. Nature, 2012.

[BGJM16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching Word Vectors with Subword Information. arXiv, 2016.

[BGL10] D Boyd, S Golder, and G Lotan. Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. Proc. 43rd Hawaii Int. Conf. on System Sciences, 2010.

[BGL+15] Linus Bengtsson, Jean Gaudart, Xin Lu, Sandra Moore, Erik Wetter, Kankoe Sallah, Stanislas Rebaudet, and Renaud Piarroux. Using Mobile Phone Data to Predict the Spatial Spread of Cholera. Scientific Reports, 2015.

[BGO+16] Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes. Borg, Omega, and Kubernetes. Queue, 14(1):70–93, 1 2016.

[BH00] I. A. Basheer and M. Hajmeer. Artificial neural networks: Fundamentals, computing, design, and application. Journal of Microbiological Methods, 2000.

[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York Inc., 2006.

[BJGS12] Ziv Bar-Joseph, Anthony Gitter, and Itamar Simon. Studying and modelling dynamic biological processes using time-series gene expression data. Nature Reviews Genetics, 2012.

[BM18] Vian Bakir and Andrew McStay. Fake News and The Economy of Emotions: Problems, causes, solutions. Digital Journalism, 2018.

[BMA15] Eytan Bakshy, Solomon Messing, and Lada A Adamic. Exposure to ideolog- ically diverse news and opinion on Facebook. Science, 348(6239):1130–1132, 2015.

[BOH11] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3 data-driven docu- ments. IEEE Transactions on Visualization and Computer Graphics, 2011.

[Bon00] André B. Bondi. Characteristics of scalability and their impact on performance. In Proceedings of the second international workshop on Software and performance - WOSP ’00, 2000.

[Bor14] Nicolás E Bordenabe. Measuring Privacy with Distinguishability Metrics: Definitions, Mechanisms and Application to Location Privacy. PhD thesis, École Polytechnique, 2014.

[Car15] Carter Moore. How long does it take to pass a bill in the US? - Quora, 2015.

[CCF+16] Yu-Ting Chen, Jason Cong, Zhenman Fang, Jie Lei, and Peng Wei. When Spark Meets FPGAs: A Case Study for Next-Generation {DNA} Sequencing Acceleration, 2016.

[CCR15] Niall J. Conroy, Yimin Chen, and Victoria L. Rubin. Automatic Deception Detection: Methods for Finding Fake News. Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, 2015.

[CDI04] CDISC. Clinical Data Interchange Standards Consortium, 2004.

[CEW+18] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. The rise of deep learning in drug discovery. Drug Discovery Today, 23(6):1241–1250, 6 2018.

[CG16] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. KDD, 2016.

[CGe12] Ethan Cerami, Jianjiong Gao, and et. al. The cBio Cancer Genomics Portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5):401–404, 2012.

[CHe09] Dan Cao, Mei Hou, and et. al. Expression of HIF-1alpha and VEGF in col- orectal cancer: association with clinical outcomes and prognostic implications. BMC Cancer, 2009.

[CMe15] Randall J. Cohrs, Tyler Martin, and et. al. Translational Medicine definition by the European Society for Translational Medicine, 2015.

[Coh14] Mike X Cohen. Analyzing Neural Time Series Data: Theory and Practice. MIT Press, 2014.

[Cos14] Fabricio F. Costa. Big data in biomedicine, 2014.

[Cox07] D.R. Cox. Principles of statistical inference. arXiv, 2007.

[CRe15] Vincent Canuel, Bastien Rance, and et. al. Translational research platforms integrating clinical and omics data: A review of publicly available solutions. Briefings in Bioinformatics, 16(2):280–290, 2015.

[CRT17] Chris Culnane, Benjamin I P Rubinstein, and Vanessa Teague. Health Data in an Open World. ArXiv e-prints, 12 2017.

[Csá01] B. Csáji. Approximation with artificial neural networks. MSc thesis, 2001.

[CSB14] Davide Chicco, Peter Sadowski, and Pierre Baldi. Deep autoencoder neural networks for gene ontology annotation predictions. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health In- formatics - BCB ’14, pages 533–540, New York, New York, USA, 2014. ACM Press.

[Cur] Inc. Curoverse. Arvados — Open Source Big Data Processing and Bioinfor- matics.

[Cyb89] Cybenko. Approximations by superpositions of sigmoidal functions. Mathe- matics of Control, Signals, and Systems, 1989.

[CZG+17] Vivek Charu, Scott Zeger, Julia Gog, Ottar N. Bjørnstad, Stephen Kissler, Lone Simonsen, Bryan T. Grenfell, and Ccile Viboud. Human mobility and the spatial transmission of influenza in the United States. PLOS Computational Biology, 13(2):e1005382, 2 2017.

[Dat14] Databricks. Apache Spark the fastest open source engine for sorting a petabyte, 2014.

[Dea16] Jeffrey Dean. Large-Scale Deep Learning For Building Intelligent Computer Systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining - WSDM ’16, 2016.

[Dee18] DeepMind. AlphaFold: Using AI for scientific discovery, 12 2018.

[Den78] Dorothy E Denning. Are statistical data bases secure. In Proc. AFIPS, 1978.

[DG08a] Jeffrey Dean and Sanjay Ghemawat. MapReduce. Communications of the ACM, 2008.

[DG08b] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.

[DJS14] George E. Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multi-task Neural Networks for QSAR Predictions. ArXiv, 6 2014.

[DLM+14] Pierre Deville, Catherine Linard, Samuel Martin, Marius Gilbert, Forrest R Stevens, Andrea E Gaughan, Vincent D Blondel, and Andrew J Tatem. Dy- namic population mapping using mobile phone data. Proceedings of the Na- tional Academy of Sciences of the United States of America, 2014.

[dMHVB13] Yves-Alexandre de Montjoye, Csar A. Hidalgo, Michel Verleysen, and Vin- cent D. Blondel. Unique in the Crowd: The privacy bounds of human mobility. Scientific Reports, 2013.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography. Springer, 2006.

[dMRP16] Yves-Alexandre de Montjoye, Luc Rocher, and Alex Sandy Pentland. bandi- coot: a Python Toolbox for Mobile Phone Metadata. Journal of Machine Learning Research, 2016.

[dMRSP15] Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, and Alex Sandy Pentland. Identity and privacy. Unique in the shopping mall: on the reidentifiability of credit card metadata. Science (New York, N.Y.), 2015.

[dMST+14] Yves-Alexandre de Montjoye, Zbigniew Smoreda, Romain Trinquart, Cezary Ziemlicki, and Vincent D. Blondel. D4D-Senegal: The Second Mobile Phone Data for Development Challenge. arXiv, 2014.

[dMSWP14] Yves-Alexandre de Montjoye, Erez Shmueli, Samuel S. Wang, and Alex Sandy Pentland. openPDS: Protecting the Privacy of Metadata through SafeAnswers. PLoS ONE, 2014.

[DN03] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2003.

[DQR12] Jens Dittrich and Jorge-Arnulfo Quian´e-Ruiz.Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment, 5(12):2014–2015, 8 2012.

[DR13] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2013.

[DRW07] Leticia Duboc, David Rosenblum, and Tony Wicks. A framework for character- ization and analysis of software system scalability. In Proceedings of the the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering - ESEC-FSE ’07, 2007.

[DS10] Cynthia Dwork and Adam Smith. Differential Privacy for Statistics: What We Know and What We Want to Learn. Journal of Privacy and Confidentiality, 2010.

[DSM+17] H. Dong, A. Supratak, L. Mai, F. Liu, A. Oehmichen, S. Yu, and Y. Guo. TensorLayer: A versatile library for efficient deep learning development. In MM 2017 - Proceedings of the 2017 ACM Multimedia Conference, 2017.

[DSSU17] Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. Ex- posed! A Survey of Attacks on Private Data. Annual Review of Statistics and Its Application, 2017.

[DY14] Li Deng and Dong Yu. Deep Learning: Methods and Applications, 5 2014.

[EKe17] Andre Esteva, Brett Kuprel, and et. al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 2017.

[ERAEB05] Hesham El-Rewini and Mostafa Abd-El-Barr. Advanced Computer Architec- ture and Parallel Processing. arXiv, 2005.

[Evg18] Evgeny Poberezkin. Ajv - Another JSON Schema Validator, 2018.

[Fei15] Dror G Feitelson. Workload Modeling for Computer Systems Performance Evaluation. Cambridge University Press, New York, NY, USA, 1st edition, 2015.

[FF14] Benedikt Fecher and Sascha Friesike. Open Science: One Term, Five Schools of Thought. In Opening Science, pages 17–47. Springer International Publishing, Cham, 2014.

[FGM+16] Flavio Finger, Tina Genolet, Lorenzo Mari, Guillaume Constantin de Magny, Nol Magloire Manga, Andrea Rinaldo, and Enrico Bertuzzo. Mobile phone data highlights the role of mass gatherings in the spreading of cholera out- breaks. Proceedings of the National Academy of Sciences of the United States of America, 113(23):6421–6, 6 2016.

[Fis35] R. A. Fisher. The Logic of Inductive Inference. Journal of the Royal Statistical Society, 1935.

[FNR17] D J Flynn, Brendan Nyhan, and Jason Reifler. The Nature and Origins of Misperceptions: Understanding False and Unsupported Beliefs about Politics. Advances in Political Psychology, 2017.

[FPEM17] Paul Francis, Sebastian Probst Eide, and Reinhard Munz. Diffix: High-Utility Database Anonymization. In Privacy Technologies and Policy. Springer Inter- national Publishing, 2017.

[FPEO+18] P Francis, S Probst-Eide, P Obrok, C Berneanu, S Juric, and R Munz. Ex- tended Diffix. ArXiv e-prints, 6 2018.

[Fra17] Paul Francis. MyData 2017 Workshop Abstract: Technical Issues and Approaches in Personal Data Management. https://aircloak.com/mydata- 2017-workshop-abstract-technical-issues-and-approaches-in-personal-data- management, 2017.

[GAG+00] Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, pages 591–596, 2000.

[Gal17] Galaxy Project. GitHub - galaxyproject/pulsar: Distributed job execution application built for Galaxy, 2017.

[GBH17] Alec Go, Richa Bhayani, and Lei Huang. Twitter Sentiment Classification using Distant Supervision. arXiv, 2017.

[GCe04] Robert Gentleman, Vincent Carey, and et. al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 2004.

[GF08] Nicholas C. Grassly and Christophe Fraser. Mathematical models of infectious disease transmission. Nature Reviews Microbiology, 6(6):477–487, 6 2008.

[GHB08] Marta C. González, César A. Hidalgo, and Albert-László Barabási. Understanding individual human mobility patterns. Nature, 453(7196):779–782, 6 2008.

[GHRdM18] Andrea Gadotti, Florimond Houssiau, Luc Rocher, and Yves-Alexandre de Montjoye. When the Signal Is in the Noise: The Limits of Diffix’s Sticky Noise. arXiv:1804.06752 [cs], 4 2018.

[GHS16] Erik Gawehn, Jan A. Hiss, and Gisbert Schneider. Deep Learning in Drug Discovery. Molecular Informatics, 35(1):3–14, 1 2016.

[Glo] Glosser.ca. File:Colored neural network.svg - Wikimedia Commons.

[Glu05] C. Gluud. Evidence based diagnostics. BMJ, 2005.

[GNT+10] Jeremy Goecks, Anton Nekrutenko, James Taylor, Enis Afgan, Guruprasad Ananda, Dannon Baker, Dan Blankenberg, Ramkrishna Chakrabarty, Nate Coraor, Jeremy Goecks, Greg Von Kuster, Ross Lazarus, Kanwei Li, James Taylor, and Kelly Vincent. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 2010.

[Gooa] Google. Balancing Strong and Eventual Consistency with Google Cloud Data- store — Cloud Datastore Documentation — Google Cloud.

[Goob] Google. Quantifying the performance of the TPU, our first machine learning chip — Google Cloud Blog.

[Gor17] Ben Gorman. Gradient Boosting Explained — GormAnalysis, 2017.

[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adver- sarial Nets, 2014.

[GPe16] Varun Gulshan, Lily Peng, and et. al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA: Journal of the American Medical Association, 2016.

[GWBV02] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002.

[GYVdS+19] Wei Gu, Reha Yildirimman, Emmanuel Van der Stuyft, Denny Verbeeck, and et al. Data and knowledge management in translational research: implemen- tation of the eTRIKS platform for the IMI OncoTrack consortium. BMC Bioinformatics, 20(1):164, 12 2019.

[HA17] Thomas Heinis and Anastasia Ailamaki. Data Infrastructure for Medical Re- search. Foundations and Trends in Databases, 2017.

[Has95] Mohamad H. Hassoun. Fundamentals of artificial neural networks. MIT Press, 1995.

[Hen95] Robert L. Henderson. Job scheduling under the Portable Batch System. In Job scheduling under the Portable Batch System, pages 279–294. Springer, Berlin, Heidelberg, 1995.

[HIL65] A B HILL. The environment and disease: association or causation? Proceed- ings of the Royal Society of Medicine, 1965.

[Hin09] Geoffrey Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009.

[HKP+11] Benjamin Hindman, Andy Konwinski, A Platform, Fine-Grained Resource, and Matei Zaharia. Mesos: A platform for fine-grained resource sharing in the data center. Proceedings of the . . . , 2011.

[Hor91] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991.

[HSK+12] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv, 7 2012.

[HSL14] Brian Hall, Jeremy Selan, and Steve LaVietes. Katana’s Geolib. In ACM SIGGRAPH 2014 Talks on - SIGGRAPH ’14, pages 1–1, New York, New York, USA, 2014. ACM Press.

[Hun07] John D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science and Engineering, 2007.

[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioin- formatics), 2016.

[IACe07] Iber C, Ancoli-Israel S, Chesson AL Jr., and et. al. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. In AASM Manual for Scoring Sleep. AASM, 2007.

[IBMa] IBM. IBM Platform LSF V9.1.3 documentation.

[IBMb] IBM. IBM Spectrum Conductor.

[IBMc] IBM Big Data Hub. The Four V’s of Big Data.

[iDM18] iDMC. Global Report on Internal Displacement. Technical report, iDMC, 2018.

[ILC13] Thomas R. Insel, Story C. Landis, and Francis S. Collins. The NIH BRAIN Initiative, 2013.

[Int] International Standardization Organization. ISO 25237:2017 - Health infor- matics – Pseudonymization.

[Iof17] Julia Ioffe. The History of Russian Involvement in America’s Race Wars, 2017.

[IWCCtT04] IFIP World Computer Congress (18th: 2004: Toulouse, France). History of computing in education: IFIP 18th World Congress, TC3/TC9 1st Conference on the History of Computing in Education, 22-27 August 2004, Toulouse, France. Kluwer Academic Publishers, 2004.

[JB16] Pranav Joshi and Muda Rajesh Babu. Openlava: An open source scheduler for high performance computing. In International Conference on Research Advances in Integrated Navigation Systems, RAINS 2016, 2016.

[JBBe17] Norman P. Jouppi, Al Borchers, Rick Boyle, and et al. In-Datacenter Per- formance Analysis of a Tensor Processing Unit. ACM SIGARCH Computer Architecture News, 2017.

[JC16] Ian T. Jolliffe and Jorge Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2016.

[JGBM16] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of Tricks for Efficient Text Classification. arxiv, 2016.

[JNS17] Noah Johnson, Joseph P Near, and Dawn Song. Towards Practical Differential Privacy for SQL Queries. ArXiv e-prints, 2017.

[Kar18] Karl Rupp. 42 Years of Microprocessor Trend Data — Karl Rupp, 2018.

[KBCG03] Yuval Kluger, Ronen Basri, Joseph T. Chang, and Mark Gerstein. Spectral biclustering of microarray data: Coclustering genes and conditions, 2003.

[KBe11] Richard D. Kennedy, Max Bylesjo, and et. al. Development and independent validation of a prognostic assay for stage ii colon cancer using formalin-fixed paraffin-embedded tissue. Journal of Clinical Oncology, 2011.

[KG00] M Kanehisa and S Goto. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 2000.

[KH12] Alex Krizhevsky and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems, 2012.

[KRKP+16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, and et al. Jupyter Notebooks - a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, 2016.

[KSTA15] Olga Kolchyna, Tharsis T. P. Souza, Philip Treleaven, and Tomaso Aste. Twit- ter Sentiment Analysis: Lexicon Method, Machine Learning Method and Their Combination. arxiv, 2015.

[KZT+00] Bastiaan Kemp, Aeilko H. Zwinderman, Bert Tuk, Hilbert A C Kamphuisen, and Josefien J L Obery´e. Analysis of a sleep-dependent neuronal feedback loop: The slow-wave microcontinuity of the EEG. IEEE Trans. Biomed. Eng., 47(9):1185–1194, 2000.

[Lan01] Doug Laney. 3D Data Management: Controlling Data Volume, Velocity, and Variety, 2001.

[LBBe18] David M. J. Lazer, Matthew A. Baum, Yochai Benkler, and et. al. The science of fake news. Science, 2018.

[LBH12] Xin Lu, Linus Bengtsson, and Petter Holme. Predictability of population displacement after the 2010 Haiti earthquake. Proceedings of the National Academy of Sciences of the United States of America, 109(29):11576–81, 7 2012.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 5 2015.

[Lig16] Lightning Viz team. Lightning Viz, 2016.

[LL73] C. L. Liu and James W. Layland. Scheduling Algorithms for Multiprogram- ming in a Hard-Real-Time Environment. Journal of the ACM, 20(1):46–61, 1 1973.

[LLM88] M.J. Litzkow, M. Livny, and M.W. Mutka. Condor - a hunter of idle workstations. In Proceedings. The 8th International Conference on Distributed, pages 104–111. IEEE Comput. Soc. Press, 1988.

[Lug09] George F Luger. Artificial Intelligence: Structures and Strategies for Complex Problem Solving. Addison Wesley, 2009.

[MA63] M.G. Kendall and A. Stuart. The Advanced Theory of Statistics, Volume 1 Distribution Theory. ournal of the Staple Inn Actuarial Society, 1963.

[Mar13] Vivien Marx. Drilling into big cancer-genome data. Nature Methods, 2013.

[MCCD13] T Mikolov, K Chen, G Corrado, and J Dean. Efficient estimation of word representations in vector space. NIPS, 2013.

[MCMNS15] Eduardo Alejandro Martinez-Cesena, Pierluigi Mancarella, Mamadou Ndiaye, and Markus Schläpfer. Using Mobile Phone Data for Electricity Infrastructure Planning. arXiv, 4 2015.

[McS09] Frank D McSherry. Privacy Integrated Queries: An Extensible Platform for Privacy-preserving Data Analysis. In Proceedings of the 2009 ACM SIGMOD, 2009.

[McS18] Frank McSherry. Uber’s differential privacy .. probably isn’t. https://github.com/frankmcsherry/blog/blob/master/posts/2018-02-25.md, 2018.

[MDe08] Patrick McConnell, Rajesh C Dash, and et. al. The cancer translational re- search informatics platform. BMC medical informatics and decision making, 8:60, 2008.

[Mea12] Prashanth Mohan and Thakurta et al. GUPT: Privacy Preserving Data Anal- ysis Made Easy. In Proceedings of the 2012 ACM SIGMOD, 2012.

[Met10] Michael L. Metzker. Sequencing technologies the next generation, 2010.

[MGAS12] Bjoern H. Menze, Ezequiel Geremia, Nicholas Ayache, and Gábor Székely. Segmenting Glioma in Multi-Modal Images using a Generative-Discriminative Model for Brain Lesion Segmentation, 2012.

[MGe11] Subha Madhavan, Yuriy Gusev, and et. al. G-DOC: a systems medicine plat- form for personalized oncology. Neoplasia (New York, N.Y.), 13(9):771–83, 2011.

[MH13] A B M Moniruzzaman and Syed Akhter Hossain. NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison. arXiv, 6 2013.

[MHBe10] Aaron McKenna, Matthew Hanna, Eric Banks, and et al. The Genome Anal- ysis Toolkit: a MapReduce framework for analyzing next-generation DNA se- quencing data. Genome research, 20(9):1297–303, 9 2010.

[Mit97] Tom M. (Tom Michael) Mitchell. Machine Learning. McGraw Hill, 1997.

[MJ51] Frank J. Massey and Jr. The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association, 46(253):68, 3 1951.

[MLDD16] Hui Miao, Ang Li, Larry S. Davis, and Amol Deshpande. ModelHub: Towards Unified Data and Lifecycle Management for Deep Learning. arXiv, 11 2016.

[MMe06] S Murphy, M Mendis, and et. al. Integration of clinical and genetic data in the i2b2 architecture. AMIA Annu Symp Proc, page 1040, 2006.

[MMIGRGE03] Carnegie Mellon Mike Mesnier, Intel, Carnegie Mellon Gregory R. Ganger, and Erik Riedel. Object-Based Storage. Technical report, Seagate Research, 2003.

[MML+11] Henry Markram, Karlheinz Meier, Thomas Lippert, Sten Grillner, Richard Frackowiak, and et al. Introducing the Human Brain Project. In Procedia Computer Science, 2011.

[MMSW07] Maged Michael, José E. Moreira, Doron Shiloach, and Robert W. Wisniewski. Scale-up x scale-out: A case study using Nutch/Lucene. In Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007.

[MNB+17] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. A manifesto for reproducible science. Nature Human Behaviour, 1(1):0021, 1 2017.

[Mos48] Frederick Mosteller. A k-Sample Slippage Test for an Extreme Population. The Annals of Mathematical Statistics, 1948.

[Mov] Movidius. Movidius powers world’s most intelligent drone.

[MT07] F McSherry and K Talwar. Mechanism Design via Differential Privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), 2007.

[MT13] Saif M. Mohammad and Peter D. Turney. Crowdsourcing a Word-Emotion Association Lexicon. arxiv, 8 2013.

[Mur12] Kevin P. Murphy. Machine Learning: A Probablistic Perspective. MIT Press, 2012.

[MVL+12] Christopher J L Murray, Theo Vos, Rafael Lozano, Mohsen Naghavi, Abraham D Flaxman, Catherine Michaud, Majid Ezzati, Kenji Shibuya, Joshua A Salomon, Safa Abdalla, Victor Aboyans, and et al. Disability-adjusted life years (DALYs) for 291 diseases and injuries in 21 regions, 1990–2010: a systematic analysis for the Global Burden of Disease Study 2010. The Lancet, 380(9859):2197–2223, 12 2012.

[Nat] National Institues of Health. Big Data to Knowledge — NIH Common Fund.

[NP06] T. Newhouse and J. Pasquale. ALPS: An Application-Level Proportional- Share Scheduler. In 15th IEEE International Conference on High Performance Distributed Computing, pages 279–290. IEEE, 2006.

[NS08] A Narayanan and V Shmatikov. Robust De-anonymization of Large Sparse Datasets. In IEEE Symposium on Security and Privacy, 2008.

[NVI] NVIDIA. Autonomous Car Development Platform from NVIDIA DRIVE PX2.

[ODC+08] Brian D. O’Connor, Allen Day, Scott Cain, Olivier Arnaiz, Linda Sperling, and Lincoln D. Stein. GMODWeb: A web framework for the generic model organism database. Genome Biology, 2008.

[OGA+18] Axel Oehmichen, Florian Guitton, Paul Agapow, Ibrahim Emam, and Yike Guo. A multi tenant computational platform for translational medicine. In Pro- ceedings - International Conference on Distributed Computing Systems, 2018.

[OGCN14] Christian O’Reilly, Nadia Gosselin, Julie Carrier, and Tore Nielsen. Montreal Archive of Sleep Studies: an open-access resource for instrument benchmarking and exploratory research. J. of Sleep Research, 23(6):628–635, 2014.

[OGS+99] Hiroyuki Ogata, Susumu Goto, Kazushige Sato, Wataru Fujibuchi, Hide- masa Bono, and Minoru Kanehisa. KEGG: Kyoto encyclopedia of genes and genomes, 1999.

[Ohm10] P Ohm. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Rev., 2010.

[OKHC14] Ali Oghabian, Sami Kilpinen, Sampsa Hautaniemi, and Elena Czeizler. Bi- clustering methods: Biological relevance and application in gene expression analysis. PLoS ONE, 2014.

[OMBe12] Lucila Ohno-Machado, Vineet Bafna, and et. al. iDASH: integrating data for analysis, anonymization, and sharing. Journal of the American Medical Informatics Association, 19(2):196–201, 2012.

[Opea] OpenStack Foundation. Heat - OpenStack.

[Opeb] OpenStack Foundation. OpenStack Docs: Welcome to Glances documentation!

[Opec] OpenStack Foundation. OpenStack Swift.

[OPSS11] Joseph O Ogutu, Hans-Peter Piepho, and Torben Schulz-Streeck. A comparison of random forests, boosting and support vector machines for genomic selection. BMC proceedings, 2011.

[OTPG17] Simon Oya, Carmela Troncoso, and Fernando P´erez-Gonz´alez. Is Geo- Indistinguishability What You Are Looking For? arXiv:1709.06318 [cs], 9 2017.

[PCR17] Gordon Pennycook, Tyrone D Cannon, and David G Rand. Prior Exposure Increases Perceived Accuracy of Fake News, 8 2017.

[PDG15] Neeti Pokhriyal, Wen Dong, and Venu Govindaraju. Virtual Networks and Poverty Analysis in Senegal. ArXiv, 6 2015.

[Pea01] Karl Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 11 1901.

[PFL10] Jaume Pellicer, Michael F. Fay, and Ilia J. Leitch. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society, 164(1):10–15, 9 2010.

[PGM14] Davide Proserpio, Sharon Goldberg, and Frank McSherry. Calibrating Data to Sensitivity in Private Data Analysis: A Platform for Differentially-private Analysis of Weighted Datasets. Proceedings of the VLDB Endowment, 2014.

[PMG98] D. L. Poole, Alan Mackworth, and R. G. Goebel. Computational Intelligence and Knowledge. In Computational Intelligence: A Logical Approach. Oxford University Press, 1998.

[Pol12] Danielle C Polage. Making up History: False Memories of Fake News Stories. Europe’s Journal of Psychology, 2012.

[Pos14] PostgreSQL. PostgreSQL: The world’s most advanced open source database, 2014.

[PPS15] T. Pohanka, V. Pechanec, and M. Solanska. Synchronization and replication of geodata in the ESRI platform. In SGEM, 2015.

[PR17] Gordon Pennycook and David G Rand. Who Falls for Fake News? The Roles of Analytic Thinking, Motivated Reasoning, Political Ideology, and Bullshit Receptivity, 9 2017.

[PTB+17] Cecilia Panigutti, Michele Tizzoni, Paolo Bajardi, Zbigniew Smoreda, and Vittoria Colizza. Assessing the use of mobile phone data to describe recurrent mobility patterns in spatial epidemic models. Royal Society Open Science, 4(5):160950, 5 2017.

[PVe11] Fabian Pedregosa, Gaël Varoquaux, and et. al. Scikit-learn: Machine Learning in Python. The Journal of Machine Learning Research, 2011.

[Pyt] Python Software Foundation. A fast PostgreSQL Database Client Library for Python/asyncio.

[QEG+10] Judy Qiu, Jaliya Ekanayake, Thilina Gunarathne, Jong Choi, Seung-Hee Bae, Hui Li, Bingjing Zhang, Tak-Lon Wu, Yang Ruan, Saliya Ekanayake, Adam Hughes, and Geoffrey Fox. Hybrid cloud and cluster computing paradigms for life science applications. BMC Bioinformatics, 11(Suppl 12):S3, 12 2010.

[Qua] Qualcomm. Qualcomm Research brings server-class machine learning to everyday devices, making them smarter — Qualcomm.

[R C14] R Core Team. R: A Language and Environment for Statistical Computing, 2014.

[RBA+18] Albert Reuther, Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Michael Jones, Peter Michaleas, Andrew Prout, Antonio Rosa, and Jeremy Kepner. Scalable system scheduling for HPC and big data. Journal of Parallel and Distributed Computing, 111:76–92, 1 2018.

[RBG17] Kew Royal Botanic Gardens. State of the World’s Plants. Technical report, Royal Botanic Gardens, London, 2017.

[RCC15] Victoria L. Rubin, Yimin Chen, and Niall J. Conroy. Deception detection for news: three types of fakes. Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, 2015.

[RD10] Guido van Rossum and Fred L Drake. Python Tutorial. History, 2010.

[Ril07] S. Riley. Large-Scale Spatial-Transmission Models of Infectious Disease. Science, 316(5829):1298–1301, 6 2007.

[Riv92] R. Rivest. The MD5 Message-Digest Algorithm. RFC 1321, IETF, 1992.

[RN12] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, 3rd Edition. Prentice Hall, 2012.

[Rot13] Mark A Rothstein. HIPAA Privacy Rule 2.0. Journal of Law, Medicine & Ethics, 2013.

[RSe10] Indrajit Roy, Srinath T V Setty, and et al. Airavat: Security and Privacy for MapReduce. In NSDI, 2010.

[RSM17] Stefania Rubrichi, Zbigniew Smoreda, and Mirco Musolesi. A Comparison of Spatial-based Targeted Disease Containment Strategies using Mobile Phone Data. 6 2017.

[SAE12] Omar Sefraoui, Mohammed Aissaoui, and Mohsine Eleuldj. OpenStack: Toward an Open-Source Solution for Cloud Computing. International Journal of Computer Applications, 2012.

[SB16] R S Sutton and A G Barto. Reinforcement Learning: An Introduction. MIT Press, 2016.

[SBLE17] Briony Swire, Adam J Berinsky, Stephan Lewandowsky, and Ullrich K H Ecker. Processing political misinformation: comprehending the Trump phenomenon. Royal Society Open Science, 4, 2017.

[SCe17] Chengcheng Shao, Giovanni Luca Ciampaglia, and et. al. The spread of misinformation by social bots. arXiv, 2017.

[Sch15] Jürgen Schmidhuber. Deep Learning in neural networks: An overview, 2015.

[SDWG17] Akara Supratak, Hao Dong, Chao Wu, and Yike Guo. DeepSleepNet: a Model for Automatic Sleep Stage Scoring based on Raw Single-Channel EEG. arXiv, 3 2017.

[Set17] Ricky J. Sethi. Spotting Fake News: A Social Argumentation Framework for Scrutinizing Alternative Facts. In 2017 IEEE International Conference on Web Services (ICWS). IEEE, 2017.

[SGG05] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Operating System Concepts. Wiley, 2005.

[SHM90] A F Subar, L C Harlan, and M E Mattson. Food and nutrient intake differences between smokers and non-smokers in the US. American journal of public health, 1990.

[SHMe16] David Silver, Aja Huang, Chris J. Maddison, and et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[SJFL+16] Aarti Sathyanarayana, Shafiq Joty, Luis Fernandez-Luque, Ferda Ofli, Jaideep Srivastava, Ahmed Elmagarmid, Teresa Arora, and Shahrad Taheri. Sleep Quality Prediction From Wearable Data Using Deep Learning. JMIR mHealth and uHealth, 4(4):e125, 11 2016.

[SKAEMW13] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems - EuroSys ’13, page 351, New York, New York, USA, 2013. ACM Press.

[SKBBT09] R. D Smith, M. R Keogh-Brown, T. Barnett, and J. Tait. The economy-wide impact of pandemic influenza on the UK: a computable general equilibrium modelling experiment. BMJ, 339(nov19 1):b4571–b4571, 11 2009.

[SKKP10] Sándor Szalma, Venkata Koka, Tatiana Khasanova, and Eric D Perakslis. Effective knowledge management in translational medicine. Journal of Translational Medicine, 8:68, 1 2010.

[SKWB10] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 10 2010.

[SLC17] Shiliang Sun, Chen Luo, and Junyu Chen. A review of natural language processing techniques for opinion mining systems. Information Fusion, 2017.

[SLJ+15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015.

[SMe10] Kazuro Shimokawa, Kaoru Mogushi, and et. al. iCOD: an integrated clinical omics database based on the systems-pathology view of disease. BMC Genomics, 11(Suppl 4):S19, 2010.

[SNK+17] Eitam Sheetrit, Nir Nissim, Denis Klimov, Lior Fuchs, Yuval Elovici, and Yuval Shahar. Temporal Pattern Discovery for Accurate Sepsis Diagnosis in ICU Patients. arXiv, 9 2017.

[SOS+67] Marshall S. Smith, Daniel M. Ogilvie, Philip J. Stone, Dexter C. Dunphy, and John J. Hartman. The General Inquirer: A Computer Approach to Content Analysis, 1967.

[SR17] Julia Silge and David Robinson. Text Mining with R: A Tidy Approach. O’Reilly, 2017.

[SRSe12] Susanna-Assunta Sansone, Philippe Rocca-Serra, and et. al. Toward interoperable bioscience data. Nature Genetics, 2012.

[ST12] Nigam H. Shah and Jessica D. Tenenbaum. The coming age of data-driven medicine: Translational bioinformatics’ next frontier. Journal of the American Medical Informatics Association, 2012.

[Sta] Stanford University CS231. Stanford University CS231: Convolutional Neural Networks for Visual Recognition.

[Swe97] L Sweeney. Weaving technology and policy together to maintain confidential- ity. J. Law Med. Ethics, 1997.

[SYL13] Daniel L Silver, Qiang Yang, and Lianghao Li. Lifelong Machine Learning Systems: Beyond Learning Algorithms. Technical report, AAAI, 2013.

[Tam18] Olivia Tambou. The French Adaptation of the GDPR. Technical report, Université Paris-Dauphine, 2018.

[Tay10] Ronald C Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 12 2010.

[Tho17] Derek Thompson. What Facebook and Google Can Learn From the First Major News Hoax, 2017.

[Tim] Timescale, Inc. Timescale: Open-source time-series database powered by PostgreSQL. www.timescale.com/.

[TT16] The European Parliament and The European Council. General Data Protection Regulation. Official Journal of the European Union, 2016.

[TTD11] Alan Tan, Ben Tripp, and Denise Daley. BRISK: research-oriented storage kit for biology-related data. Bioinformatics, 27(17):2422–2425, 2011.

[TV10] Stefan Tilkov and Steve Vinoski. Node.js: Using JavaScript to Build High-Performance Network Programs. IEEE Internet Computing, 2010.

[TWe12] Eleni Tsitsiou, Andrew E. Williams, and et. al. Transcriptome analysis shows activation of circulating CD8+ T cells in patients with severe asthma. Journal of Allergy and Clinical Immunology, 2012.

[UC ] UC Berkeley. UC Berkeley Committee for Protection of Human Subjects.

[VBS+06] C. Viboud, O. N. Bjornstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell. Synchrony, Waves, and Spatial Hierarchies in the Spread of Influenza. Science, 312(5772):447–451, 4 2006.

[VGKS99] Arja Virtanen, Mehran Gomari, Ries Kranse, and Ulf Stenman. Estimation of prostate cancer probability by logistic regression: Free and total prostate-specific antigen, digital rectal examination, and heredity are significant variables. Clinical Chemistry, 45(7):987–994, 1999.

[Vol18] Nicholas Vollmer. Article 4 EU General Data Protection Regulation (EU- GDPR). EU official journal, 9 2018.

[VRA18] Soroush Vosoughi, Deb Roy, and Sinan Aral. The spread of true and false news online. Science, 2018.

[Wai16] Jacques Wainer. Comparison of 14 different families of classification algorithms on 115 binary datasets. arXiv, 2016.

[WC54] William L. Watson and Alexander J. Conte. Smoking and lung cancer. Cancer, 1954.

[WCZQ18] Jingxue Wang, Huali Cao, John Z. H. Zhang, and Yifei Qi. Computational Protein Design with Deep Learning Neural Networks. Scientific Reports, 8(1):6349, 12 2018.

[WdACdRS18] Yihong Wang, Gonçalo de Almeida Correia, Erik de Romph, and Bruno Santos. Road Network Design in a Developing Country Using Mobile Phone Data: An Application to Senegal. IEEE Intelligent Transportation Systems Magazine, 7 2018.

[WET+12] Amy Wesolowski, Nathan Eagle, Andrew J Tatem, David L Smith, Abdisalan M Noor, Robert W Snow, and Caroline O Buckee. Quantifying the impact of human mobility on malaria. Science (New York, N.Y.), 338(6104):267–70, 10 2012.

[WGWF10] Katharina Wulff, Silvia Gatti, Joseph G. Wettstein, and Russell G. Foster. Sleep and circadian rhythm disruption in psychiatric and neurodegenerative disease. Nature Reviews Neuroscience, 11(8):589–599, 8 2010.

[Wir18] Wired UK. Fake news used to promote ’A Cure for Wellness’ film — WIRED UK, 2018.

[WPe14] Shicai Wang, Ioannis Pandis, and et. al. High dimensional biological data retrieval optimization with NoSQL technology. BMC genomics, 2014.

[WZESAe16] Robin Wilson, Elisabeth Zu Erbach-Schoenberg, Maximilian Albert, and et al. Rapid and Near Real-Time Assessments of Population Displacement Using Mobile Phone Data Following Disasters: The 2015 Nepal Earthquake. PLoS currents, 8, 2 2016.

[XNR14] Miguel Gomes Xavier, Marcelo Veiga Neves, and Cesar Augusto Fonticielha De Rose. A Performance Comparison of Container-Based Virtualization Systems for MapReduce Clusters. In 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. IEEE, 2014.

[XYH+15] Eric P. Xing, Yaoliang Yu, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, and Abhimanu Kumar. Petuum: A New Platform for Distributed Machine Learning on Big Data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’15, pages 1335–1344, New York, New York, USA, 2015. ACM Press.

[YDHP07] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data - SIGMOD ’07, page 1029, New York, New York, USA, 2007. ACM Press.

[YHG+16] Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Abdullah Gani, Salimah Mokhtar, Ejaz Ahmed, Nor Badrul Anuar, and Athanasios V. Vasilakos. Big data: From beginning to future. International Journal of Information Management, 36(6):1231–1247, 12 2016.

[YJG03] Andy B. Yoo, Morris A. Jette, and Mark Grondona. SLURM: Simple Linux Utility for Resource Management. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 44–60. Springer, Berlin, Heidelberg, 2003.

[ZCF+10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets, 2010.

[ZEEO16] Mark Ziemann, Yotam Eren, and Assam El-Osta. Gene name errors are widespread in the scientific literature. Genome Biology, 17(1):177, 12 2016.

[ZTHZ17] Lu Zhang, Jianjun Tan, Dan Han, and Hao Zhu. From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discovery Today, 22(11):1680–1685, 11 2017.

[ZWL18] Lei Zhang, Shuai Wang, and Bing Liu. Deep Learning for Sentiment Analysis: A Survey. arXiv, 2018.

[ZZWD93] Songnian Zhou, Xiaohu Zheng, Jingwen Wang, and Pierre Delisle. Utopia: A load sharing facility for large, heterogeneous distributed computer systems. Software: Practice and Experience, 23(12):1305–1336, 12 1993.