Standardisation and Organisation of Clinical Data and Disease Mechanisms for Comparison Over Heterogeneous Systems in the Context of Neurodegenerative Diseases
Total Page:16
File Type:pdf, Size:1020Kb
PhD-FSTC-2018-51 The Faculty of Sciences, Technology and Communication DISSERTATION Defence held on 03/07/2018 in Luxembourg to obtain the degree of DOCTEUR DE L’UNIVERSITÉ DU LUXEMBOURG EN BIOLOGIE by AISHWARYA ALEX NAMASIVAYAM Born on 12 November 1987 in Bharananganam (India) STANDARDISATION AND ORGANISATION OF CLINICAL DATA AND DISEASE MECHANISMS FOR COMPARISON OVER HETEROGENEOUS SYSTEMS IN THE CONTEXT OF NEURODEGENERATIVE DISEASES Dissertation defence committee Prof. Dr Reinhard Schneider, dissertation supervisor Professor, Université du Luxembourg Dr Inna Kuperstein Researcher and scientific coordinator, Institut Curie, Paris Dr Enrico Glaab, Chairman Senior research scientist, Université du Luxembourg Prof. Dr Karsten Hiller Professor, Technische Universität Braunschweig Dr Marek Ostaszewski, Vice Chairman Research associate, Université du Luxembourg Affidavit I hereby confirm that the PhD thesis entitled "Standardisation and Organisation of Clinical Data and Disease Mechanisms for Comparison Over Heterogeneous Systems in the Context of Neurodegenerative Diseases" has been written independently and without any other sources than cited. Luxembourg, July 26, 2018 Aishwarya Alex Namasivayam i ii Acknowledgements First and foremost, I would like to thank Dr. Reinhard Schneider, my supervisor for giving me the opportunity and support to pursue my PhD in the group. Biocore is a very wonderful working environment. I couldnt ask for a better boss! I would like to thank all my colleagues for their support and making this a memorable journey. Special thanks to Marek, Venkata, Wei and Piotr for their valuable suggestions and feedbacks. My sincere gratitude to Dr. Jochen Schneider and Dr. Karsten Hiller for agreeing to be part of the CET committee and the constructive criticism during the PhD. I would also like to thank Dr. Inna Kuperstein, for accepting to be on the defense committee. I am sincerely very happy to have a woman on the committee! A very special thank you to Adriano for the AETIONOMY times, without your support this wouldnt have been difficult. Thanks to Marie-Laurie for being the one stop person for everything at LCSB. These four years would not be possible without family and friends. Thank you Susana, for being my first friend in Luxembourg. I am glad we met and you are truly Smartinez. Shaman and Susi, thank you for all the times you welcomed me with open arms, provided me with shelter and food when I wanted to be lazy for the weekend (or any day). Thanks to Gaia, Dheeraj, Maharshi, Kavita, and Anshika for their friendship and interesting conversation during lunch time and beyond. Bruna, thank you for bringing a little bit of Brazil to us. A special thanks to Marouen, Gaia, Zuogong, Berto, Dheeraj and Shaman for RSG Luxembourg! It was an amazing experience, it wouldnt be possible without you guys. I would like to thank my friends at the ISCB student council, for a beautiful volunteering experience. Thank you Jakob, for believing I can deal with finances and bringing me on board. Farzana, thank you for a friendship beyond the Student Council. My sincere gratitude to my parents for always being supportive and believing in me, no questions asked. To my sister, for checking on me when I go into hibernate mode! To my in-laws, Achan, Amma and Krishnan, thank you for making me feel at home. Last but not the least, Raman, for realising that the wife will be 318km away but still marrying me! iii iv Dedication To my sister, Theresa, for never ever asking me \When do you finish the PhD?" and my husband, Raman, for asking me only five times (ok, maybe fifteen). v Abstract With increased interest in studying neurodegenerative diseases, data generated is grow- ing exponentially. The sheer amount of patient data being collected gives rise to the problem of how it can be stored, represented and classified. The representation of col- lected data varies from one centre to other, based on several factors such as language, region, standards adapted, study design. The results of this variation makes it difficult to study different cohorts by combining or comparing them. Therefore, variables that are collected in all these cohorts need to be standardised and harmonised for further re-use and analysis. Disease maps are another form of knowledge resources collecting existing biological facts in a single resource. Disease mechanisms are represented visually in the form of models or maps, capturing the knowledge extracted from the literature. There are several modelling languages which serve this purpose. Disease maps capture knowledge about disease related mechanisms at different molecular levels. Compar- ison of different disease maps can support co-morbidity studies to identify common disease mechanisms or drug targets. To this end, we developed a method to compare disease maps. We then compared Parkinson's and Alzheimer's disease maps using this approach. However, there are several modelling formats available. Therefore, here arises a need to harmonise and standardise the representation and make the different disease modelling formats convertible and comparable. For the course of the project, we focus on Open Biological Expression Language (OpenBEL) and Systems Biology Markup Language (SBML) modelling languages. A semi-automated convertor from OpenBEL format to SBML was developed to compare knowledge over heterogeneous systems his was then used to convert an Alzheimer's OpenBEL model to SBML format for better visualization and hierarchical representation and to enable comparison against other SBML models. In conclusion, the work presented in the thesis, emphasises the importance of standards in the representation and modelling of clinical and biological information to vi ensure interoperability between tools and models and facilitate data sharing, reusability and reproducibility. vii viii Contents Acknowledgements iii Abstract v 1 Introduction 1 1.1 Trends in biomedical research and their impact on bioinformatics . .3 1.2 Standards in biomedical research . .3 1.2.1 Need for standards . .4 1.2.2 Standardisation efforts . .6 1.3 Disease maps as knowledge resources . .9 1.3.1 Complexities of diseases and their comorbidities . 10 1.3.2 Representing diseases as a map . 11 1.3.3 Common modelling formats . 11 1.3.4 Available data . 18 1.4 Scope and Aim . 20 1.5 Thesis Overview . 21 ix x CONTENTS 2 Integrating heterogeneous data 23 2.1 Integrative platforms . 23 2.2 Data acquisition . 24 2.3 Data harmonisation . 25 2.4 Extraction, Transformation and Loading (ETL) . 26 2.5 From unstructured to structured data . 28 2.6 Results . 31 2.6.1 Use case 1: Alzheimer's disease cytokine study . 33 2.6.2 Use case 2: Expression data of substantia nigra from postmortem human brain of PD patients . 40 2.7 Summary . 42 3 Comparison of disease maps 44 3.1 Overview of existing comparison methods . 45 3.2 Methods for comparison . 46 3.3 Results . 51 3.3.1 Comparison of AlzPathway and PD map . 51 3.3.2 AKT1 Activity . 59 3.3.3 TAU (MAPT ) hyper-phosphorylation . 62 3.3.4 MAPK signalling . 64 3.3.5 Endoplasmic Reticulum Stress . 67 CONTENTS xi 3.3.6 Inflammation . 70 3.3.7 Wnt signalling . 72 3.3.8 Synaptic area . 73 3.4 Summary . 74 4 Comparison of different disease models 76 4.1 Interoperability between models . 77 4.2 Results . 84 4.3 Comparison of APP map to PD map . 88 4.4 Comparison of APP map to AlzPathway . 90 4.5 Summary . 96 5 Discussion 98 5.1 Integrating Heterogeneous Data . 99 5.2 Comparison and Conversion of Maps . 101 6 Summary and Outlook 104 6.1 Summary . 104 6.2 Outlook . 105 A Appendix A : Supplementary Materials 108 A.1 Alzheimer's and Parkinson's Disease datasets integrated in AETIONOMY109 A.2 AlzPathway overlayed on PD Map . 111 A.3 Semantic Mapping: CellDesigner (SBGN)-BEL . 119 A.4 Nodes extracted from APP BEL model . 121 A.5 Reactions extracted from APP BEL model . 122 B Publications 123 Bibliography 166 xii List of Tables 1.1 Features of SBML, OpenBEL, BioPAX . 15 3.1 Number of elements and reactions in the AlzPathway Map and PD Map and submaps . 52 3.2 Summary of model sizes . 54 3.3 Number of elements and reactions identified from AlzPathway in PD Map 55 3.4 Number of elements and reactions identified from PD Map in AlzPath- wayMap................................... 57 4.1 OpenBEL predicates and corresponding representation in CellDesigner 82 4.2 OpenBEL activity terms and corresponding GO annotation . 83 4.3 OpenBEL statements lost in conversion . 85 4.4 Submaps components based on cell type . 87 4.5 Reactions in the APP Map . 90 A.1 Summary of performance with different annotators . 111 A.2 Submap PD 180412 2 alzpath 8APR Element . 117 xiii A.3 Snapshot of reaction matches in PD Map and Alzpathway . 118 A.4 Example of nodes extracted from APP BEL model . 121 A.5 Example of reactions extracted from APP BEL model . 122 xiv List of Figures 1.1 Complexity of diseases and their comorbidities . 10 1.2 BEL Statement structure . 14 1.3 Disease models in SBML and OpenBEL formats . 16 1.4 Overview of the project . 20 2.1 Curation and harmonisation overview . 26 2.2 Unstructured or semi-structured to structured data . 28 2.3 Harmonised and structured datasets via tranSMART standard format files...................................... 29 2.4 Extraction Transformation and Loading of datasets . 30 2.5 Linking heterogeneous data in tranSMART . 31 2.6 Curated studies loaded in AETIONOMY tranSMART instance . 32 2.7 Overview of the AD Cytokine dataset loaded in tranSMART . 34 2.8 Distribution of MCP-1 and MIF levels in AD subjects compared to controls . 35 2.9 Correlation of cognitive functions with cytokine levels in AD subjects . 35 xv xvi LIST OF FIGURES 2.10 TNF-α was elevated in MCI compared to controls and AD subjects .