Phd and Mphil Thesis Classes
Total Page:16
File Type:pdf, Size:1020Kb
Inference of gene networks from time series expression data and application to type 1 Diabetes Miguel Lopes Machine Learning Group (Faculte´ des Sciences, Department´ d’Informatique) ULB Center for Diabetes Research Universite´ Libre de Bruxelles A thesis submitted for the degree of Doctor of Philosophy (PhD) 2015 ii This thesis has been written under the supervision of the Prof. Gianluca Bontempi and Decio´ Eizirik. The members of the jury are: • Prof. Gianluca Bontempi (Universite´ Libre de Bruxelles) • Prof. Decio´ Eizirik (Universite´ Libre de Bruxelles) • Prof. Tom Lenaerts (Universite´ Libre de Bruxelles) • Prof. Maarten Jansen (Universite´ Libre de Bruxelles) • Prof. Kris Laukens (Universiteit Antwerpen)) This thesis has been written by the author, containing original work of his. Some described work is a product of a collaborative effort, whose contributors are acknowledged. iii iv Abstract The inference of gene regulatory networks (GRN) is of great importance to medical research, as causal mechanisms responsible for phenotypes are unravelled and potential therapeutical targets identified. In type 1 diabetes, insulin producing pancreatic beta-cells are the target of an auto-immune attack leading to apoptosis (cell suicide). Although key genes and regulations have been identified, a precise characterization of the process leading to beta-cell apoptosis has not been achieved yet. The inference of relevant molecular pathways in type 1 diabetes is then a crucial research topic. GRN inference from gene expression data (obtained from microarrays and RNA- seq technology) is a causal inference problem which may be tackled with well- established statistical and machine learning concepts. In particular, the use of time series facilitates the identification of the causal direction in cause-effect gene pairs. However, inference from gene expression data is a very challenging problem due to the large number of existing genes (in human, over twenty thousand) and the typical low number of samples in gene expression datasets. In this context, it is important to correctly assess the accuracy of network inference methods. The contributions of this thesis are on three distinct aspects. The first is on inference assessment using precision-recall curves, in particular using the area under the curve (AUPRC). The typical approach to assess AUPRC significance is using Monte Carlo, and a parametric alternative is proposed. It consists on deriving the mean and variance of the null AUPRC and then using these parameters to fit a beta distribution approximating the true distribution. The second contribution is an investigation on network inference from time series. Several state of the art strategies are experimentally assessed and novel heuristics are proposed. One is a fast approximation of first order Granger causality scores, suited for GRN inference in the large variable case. Another identifies co-regulated genes (ie. regulated by the same genes). Both are experimentally validated using microarray and simulated time series. The third contribution of this thesis is on the context of type 1 diabetes and is a study on beta cell gene expression after exposure to cytokines, emulating the mechanisms leading to apoptosis. 8 datasets of beta cell gene expression were used to identify differentially expressed genes before and after 24h, which were functionally characterized using bioinformatics tools. The two most differentially expressed genes, previously unknown in the type 1 Diabetes literature (RIPK2 and ELF3) were found to modulate cytokine induced apoptosis. A regulatory network was then inferred using a dynamic adaptation of a state of the art network inference method. Three out of four predicted regulations (involving RIPK2 and ELF3) were experimentally confirmed, providing a proof of concept for the adopted approach. Resum´ e´ L’inference´ des reseaux´ de regulation´ de genes` (GRN) est d’une grande impor- tante pour la recherche medicale,´ car les mecanismes´ de causalite´ responsable des phenotypes´ sont detect´ es´ et les cibles therapeutiques´ potentielles identifiees.´ Dans le diabete` de type 1, les cellules beta du pancreas´ qui produisent l’insuline sont la cible d’une attaque auto-immune conduisant a` l’apoptose (suicide cellulaire). Bien que des genes` et regulateurs´ cles´ ont et´ e´ identifies,´ une caracterisation´ precise´ du processus conduisant a` l’apoptose des cellules beta n’a pas encore et´ e´ accomplie. L’inference´ des voies moleculaires´ importantes pour le diabete` de type 1 est un sujet crucial de recherche. L’inference´ des GRN a` partir des donnees´ d’expression de genes` (obtenues par mi- croarray et par la technologie du RNA-seq) est un probleme` d’inference´ causale qui pourrait tre resolue´ par des concepts bien etablis´ de statistiques et de apprentissage automatique. En particulier, l’usage des series´ temporelles facilite l’identification de la direction causal dans les paires de genes` cause-effet. Cependant, l’inference´ a` partir des donnes´ d’expression de gene` est un probleme` tres` difficile du au large nombre de genes` existants (plus de 20000 chez l’homme) et le probleme` typique du peu d’echantillons´ disponible. Dans ce contexte, il est important d’evaluer´ correctement la prcision des methodes´ d’inference´ des reseaux.´ La contribution de cette these` porte sur trois aspects distincts. La premiere` porte sur l’evaluation´ de l’inference´ en utilisant les precision-recall curves, en particulier en utilisant l’aire sous la courbe (AUPRC). L’approche typique pour evaluer´ la significativite´ du AUPRC est d’utiliser Monte Carlo et une alternative parametrique´ est proposee.´ Il consiste a` deriver´ la moyennes et la variance de l’hypothese` nulle de l’AURPC et ensuite de les utiliser pour construire une distribution beta qui estime la vraie distribution. La deuxieme` contribution est une investigation sur les l’inference´ des reseaux´ a` partir des series´ temporelles. Plusieurs strategies´ actuelles sont experimentalement´ evalu´ ees´ et de nouveau heuristiques sont proposees.´ L’une est une approximation rapide des scores de causalites´ de Granger du premier ordre, appropriee´ pour l’inference´ des GRN dans le cas de large variable. Une autre identifie les genes` co-regul´ es´ (i.e regul´ es´ par les mmes genes).` Les deux sont experimentalement´ validees´ en utilisant les donnees´ de microarray et les series´ temporelles simulees.´ La troisieme` contribution de cette these` porte sur le diabetes` de type 1 et est une etude´ sur l’expression des cellules beta exposees´ aux cytokines, qui simulent les mecanismes´ conduisant a` l’apoptose. Huit jeux de donnees´ d’expression de genes` des cellules beta ont et´ e´ utilises´ pour identifier les genes` differentiellement´ exprimes´ avant et apres` 24 heures qui ont ensuite ont et´ e´ caracteris´ es´ fonctionnellement en utilisant les outils bioinformatiques. Les 2 genes` les plus differentiellement´ exprimes,´ prec´ edemment´ inconnus dans la litterature´ sur le diabetes` de type 1 (RIPK2 et ELF3) ont et´ e´ analyses´ et jouent un rle dans l’apoptose induite par les cytokines. Un reseau´ de regulation´ a et´ e´ inferre´ en utilisant une adaptation dynamique des methodes´ actuelles d’inference´ de reseaux.´ Trois sur quatre de regulations´ predites´ (impliquant RIPK2 et ELF3) ont et´ e´ experimentalement´ confirmees,´ donnant ainsi une proof of concept de l’approche que nous avons adoptee.´ To my dear family, old friends and new. ii Acknowledgements This thesis was funded by the ARC project Discovery of the molecular pathways regulating pancreatic beta cell dysfunction and apoptosis in diabetes using func- tional genomics and bioinformatics, based on a collaboration between the Machine Learning Group (Computer Science Department, ULB) and the ULB Center for Diabetes Research. The objectives of the project are the identification of relevant genes and regulatory pathways in β-cell apoptosis in diabetes; characterization of their role on cell function and fate; and experimental validation of predicted hypothesis, through bioinformatics and molecular biology techniques. The work here presented was supervised by the Professors Gianluca Bontempi and Decio´ Eizirik. I would like to thank my supervisors Gianluca and Decio´ for the invaluable guid- ance throughout these years as a PhD candidate; my friends and colleagues at the Machine Learning Group and the Center for Diabetes Research for all the enriching personal and professional interactions; Andrea, Elisa and Catharina for comments and suggestions regarding this thesis; and the members of the jury for kindly accepting to read and critically assess this work. Finally, I would also like to thank all my family and friends for all the support, availability, friendship and love. ii Contents List of Figures ix List of Tables xiii I Introduction 1 1 Introduction 3 1.1 Type 1 diabetes . .6 1.2 Transcription and translation . .7 1.3 Gene regulatory networks . .9 1.4 Measuring gene expression . 10 1.5 Machine learning . 11 1.6 The bias-variance trade off . 12 1.7 GRN inference . 13 1.7.1 The high variable to sample ratio in GRN inference . 14 1.7.2 Strategies for GRN inference . 15 1.8 Variable selection . 15 1.9 Statistical dependence and causal inference . 17 1.9.1 On causality . 17 1.9.2 Causal inference . 20 1.10 GRN inference assessment . 20 1.11 Thesis contributions . 23 1.11.1 On the null distribution of the precision-recall curve . 23 1.11.2 GRN inference from time series . 23 1.11.3 Knowledge inference in Type 1 Diabetes . 26 1.11.4 Publications used in this thesis . 26 1.11.5 Data and software . 28 iii CONTENTS 1.12 Structure of the thesis . 28 II Preliminaries and state of the art of network inference 29 2 Preliminaries 31 2.1 Basics of information theory . 31 2.1.1 Introduction . 31 2.1.2 Entropy and the mutual information . 32 2.1.3 Estimation of entropy and mutual information . 34 2.2 Estimation of linear dependences . 36 2.2.1 Linear regression . 36 2.2.2 Linear correlation . 37 2.2.3 Linear regression and partial correlation .