
Analysis and integration of heterogeneous large-scale genomics data: application to B cell differentiation and follicular lymphoma non coding mutations

Marine Louarn

To cite this version:
Marine Louarn. Analysis and integration of heterogeneous large-scale genomics data: application to B cell differentiation and follicular lymphoma non coding mutations. Bioinformatics [q-bio.QM]. Université Rennes 1, 2020. English. NNT: 2020REN1S088. tel-03244465

HAL Id: tel-03244465
https://tel.archives-ouvertes.fr/tel-03244465
Submitted on 1 Jun 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

DOCTORAL THESIS OF THE UNIVERSITÉ DE RENNES 1
Doctoral School No. 601: Mathématiques et Sciences et Technologies de l’Information et de la Communication
Specialty: Computer Science

By Marine LOUARN

"Analysis and integration of heterogeneous large-scale genomics data"
"Application to B cell differentiation and Follicular Lymphoma non coding mutations"

Thesis presented and defended in Rennes on 26 November 2020
Research units: INSERM and IRISA

Reviewers before the defense:
Adrien COULET, MCU, LORIA Nancy
Lydie LANE, Researcher, SIB Lausanne, Switzerland

Jury composition:
Examiners:
Salvatore SPICUGLIA, Research Director, INSERM U1090 Marseille
Sarah COHEN BOULAKIA, Professor, LRI Orsay
Fabrice CHATONNET, Research Engineer, CHU Rennes
Olivier DAMERON, Professor, Université Rennes 1, Rennes
Thesis supervisor: Anne SIEGEL, CNRS Research Director, IRISA Rennes
Thesis co-supervisor: Thierry FEST, PU-PH, INSERM / CHU Rennes

“I’m sure there are words that are simply in there ’cause I like them. I know I couldn’t justify each and every one of them.”
Neil GAIMAN

ACKNOWLEDGEMENT

I would like to sincerely thank Lydie Lane and Adrien Coulet for agreeing to review this thesis. I would also like to thank Salvatore Spicuglia and Sarah Cohen Boulakia for agreeing to be part of my jury.

I would also like to thank Anne Siegel, Olivier Dameron, Fabrice Chatonnet and Thierry Fest for offering me the internship, which now feels like ages ago, that led to this Ph.D. Thank you for the mentoring, the support and the push to better myself scientifically.

A huge thank-you to the Dyliss team, and also to GenOuest and GenScale: it has been a pleasure working with you guys and laughing together during breaks. Thank you for the overall good atmosphere that made those three and a half years a lot more fun. A special thank-you to Xavier, for his continuous support on AskOmics and for answering my numerous questions. Special thanks also to the ’Midi les doctorants’ group, for all our silly yet fascinating discussions about basically everything, and for some of my best laughs. There are too many names to give a full list but, from the bottom of my heart, thank you! You made this thesis a lot more fun. Thanks also for listening to me vent when everything was going wrong.

Thanks to the INSERM group for enduring my computer science presentations and for helping me better understand the biological context behind my thesis.

Thank you, Arty and Céleste, for the Boirlothe, which has become a tradition. Thank you to all the friends who helped make these years unforgettable, whether through that D&D campaign, which has now lasted as long as my thesis, or simply over a drink or a mug. A very special thank-you to Pol; as you told me, “Thank you and bravo to all”.
A huge thank-you to my parents, who have always believed in me and encouraged me every day. Thank you for supporting me and helping me through such long studies (10 years...). Thank you, Dad, for the technical support. Thank you, Mum, for the sugar supplies and the sewing! Thank you, Amaury, it was fun sharing these three years with you. Granted, you will probably get your thesis first, but I keep the advantage of age and of number of degrees, and that is what matters. Finally, Mamick, I know you can no longer read these words, but I know you would be proud of me. Thank you for everything.

1. If Covid allows.

Please find the solution of this sudoku in the Ph.D. thesis of Amaury LOUARN, called “A topological approach to virtual cinematography” (Université de Rennes 1, November 23rd, 2020).

Here is the solution to the sudoku proposed in his thesis:

2 8 5 6 4 9 1 7 3
3 7 9 8 5 1 6 4 2
6 1 4 2 3 7 8 9 5
1 6 3 7 9 5 2 8 4
9 4 8 1 6 2 3 5 7
7 5 2 4 8 3 9 1 6
4 3 1 5 2 8 7 6 9
5 9 7 3 1 6 4 2 8
8 2 6 9 7 4 5 3 1

[FIGURE 1: sudoku puzzle grid. Caption: For the friends and folks who came in support, but may not understand [...] much.]

TABLE OF CONTENTS

Résumé en Français
1 Introduction
2 State of the Art and Biological context
  2.1 Heterogeneity of data in Life science
  2.2 Methods of Regulatory network inference
  2.3 Data Structure in life science
  2.4 Synthesis
3 Regulatory Circuits and its limits
  3.1 Introduction
  3.2 What is Regulatory Circuits: biological model, input / output data and computational concept
  3.3 What is Regulatory Circuits: Detailed workflow
    3.3.1 Global formula
    3.3.2 Normalized expression activity of regions and transcripts
    3.3.3 Confidence score of the TF binding sites
    3.3.4 Distance weight of the regions
    3.3.5 From individual relations to networks
  3.4 Issues with Regulatory Circuits
    3.4.1 Understanding the files
    3.4.2 Regulatory Circuits scripts are not usable
    3.4.3 Intermediary files not present
    3.4.4 Regulatory Circuits methodology could not be reproduced
    3.4.5 Conceptual issues
    3.4.6 Problem with re-usability and application to new data/Fair ?
  3.5 Computing Regulatory Circuits
    3.5.1 Three ways of computing Regulatory Circuits
    3.5.2 Comparing the three ways to calculate Regulatory Circuits circuits
  3.6 Conclusion
4 Interpretation of Regulatory network inference pipeline as graph-based queries
  4.1 Introduction
  4.2 Contribution
    4.2.1 Identifying relevant files among all Regulatory Circuits resources
    4.2.2 Structuration
    4.2.3 Integration
    4.2.4 Queries
    4.2.5 Performances
  4.3 Discussion
  4.4 Conclusion
5 Workflow and intermediary results as graph-based query and model
  5.1 Introduction
  5.2 Approach
  5.3 Results
    5.3.1 Design principles and modular organization
    5.3.2 Biological and experimental data from Regulatory Circuits and metadata
    5.3.3 Sample-specific weights of the TF-gene regulations
    5.3.4 Tissue-specific weights and score of the TF-gene regulations
    5.3.5 Overall dataset
    5.3.6 Biologically-relevant queries
  5.4 Discussion and perspectives
  5.5 Conclusion
6 Design of a suitable pipeline for biologically-close and sparse cells types
  6.1 Introduction
  6.2 Design
  6.3 Pre-processing
    6.3.1 Discretization patterns for read densities and gene expression
    6.3.2 Neighborhood relationship
    6.3.3 Finding TF binding sites in our regions
  6.4 Data graph for the integration
  6.5 Compatibility table to assign sign to relations
  6.6 Automation
  6.7 Comparison between Regulatory Circuits workflow and our pipeline
  6.8 Validation
  6.9 Conclusion
7 Application to B cells and interpretation
  7.1 Introduction
  7.2 Application of the pipeline
    7.2.1 Input data
    7.2.2 Integration
    7.2.3 Networks extraction and filtering
  7.3 Patterns interactions
  7.4 Finding master candidates of the regulation
    7.4.1 Coverage
    7.4.2 Specificity
    7.4.3 Combination of coverage and specificity
    7.4.4 Consistency with the literature