Unique Networks: a Method to Identify Disease-Specific Regulatory Networks from Microarray Data
Total Page:16
File Type:pdf, Size:1020Kb
UNIQUE NETWORKS: A METHOD TO IDENTIFY DISEASE-SPECIFIC REGULATORY NETWORKS FROM MICROARRAY DATA A thesis submitted for the degree of Doctor of Philosophy by Valeria Bo Department of Computer Science December 2014 Abstract The survival of any organism is determined by the mechanisms triggered in response to the inputs received. Underlying mechanisms are described by graphical networks that can be inferred from different types of data such as microarrays. Deriving robust and reliable networks can be complicated due to the microarray structure of the data characterized by a discrepancy between the number of genes and samples of several orders of magnitude, bias and noise. Researchers overcome this problem by integrating independent data together and deriving the common mechanisms through consensus network analysis. Different conditions generate different inputs to the organism which reacts triggering different mechanisms with similarities and differences. A lot of effort has been spent into identifying the commonalities under different conditions. Highlighting similarities may overshadow the differences which often identify the main characteristics of the triggered mechanisms. In this thesis we introduce the concept of study-specific mechanism. We develop a pipeline to semi- automatically identify study-specific networks called unique-networks through a combination of consensus approach, graphical similarities and network analysis. The main pipeline called UNIP (Unique Networks Identification Pipeline) takes a set of independent studies, builds gene regulatory networks for each of them, calculates an adapta- tion of the sensitivity measure based on the networks graphical similarities, applies clustering to group the studies who generate the most similar networks into study-clusters and derives the consensus networks. Once each study-cluster is associated with a consensus-network, we identify the links that appear only in the consensus network under consideration but not in the others (unique-connections). Considering the genes involved in the unique-connections we build Bayesian networks to derive the unique-networks. Finally, we exploit the inference tool to calculate each gene prediction-accuracy across all studies to further refine the unique-networks. Biological validation through different software and the literature are explored to validate our method. UNIP is first applied to a set of synthetic data perturbed with different levels of noise to study ii the performance and verify its reliability. Then, wheat under stress conditions and different types of cancer are explored. Finally, we develop a user-friendly interface to combine the set of studies by using AND and NOT logic operators. Based on the findings, UNIP is a robust and reliable method to analyse large sets of transcrip- tomic data. It easily detects the main complex relationships between transcriptional expression of genes specific for different conditions and also highlights structures and nodes that could be potential targets for further research. iii Acknowledgements First, I would like to thank my supervisor Allan Tucker for his constructive supervision, encour- agement and patience. Without his guidance, in both work and life, this work would have never been completed. It has been a privilege and a pleasure to work with him. I would also like to thank my second supervisor Stephen Swift for his invaluable advice and support. I am grateful to the members of the Rothamsted Research Centre Mansoor Saqi and Artem Lysenko. They have been great partners in collaboration. Thanks to Dimah Habash for her help using MapMan and to Tanya Curtis for the interesting discussions and the collaboration on the wheat data analysis. To my colleagues (past and present) Neda, Cici, Chandrika, Helga, Mahir, Ovidiu and many others for keeping the department a pleasure to work in. Thanks to Shav, Valentina and Valentina, Jessica, Massimo, Nicola and Gian Paolo for these amazing 3 years. A big thanks to Pam, for always being there for me. Last but not least a special thanks to my family for the love and incredible support over all these years. iv Publications The following publications have resulted from the research presented in this thesis: Bo V, Tucker A. Integrating Gene Regulatory Networks to identify cancer-specific genes. Submitted to AMIA - Joint Summit 2015. Accepted. Bo V, Curtis T, Lysenko A, Saqi M, Swift S, Tucker A. Discovering Study-Specific Gene Regulatory Networks. PloS one, 2014 Bo V, Lysenko A, Saqi M, Habash D, Tucker A. Integrating Multiple Studies of Wheat Microarray Data to Identify Treatment-Specific Regulatory Networks. Advances in Intel- ligent Data Analysis XII, 2013 Bo V, Lysenko A, Saqi M, Tucker A. Exploring the variation in gene regulatory networks: a study in wheat. European Conference on Computational Biology (ECCB), 2012 v Contents List of Figures x List of Tables xiv Glossary xvi 1 Introduction 1 1.1 Motivation ....................................... 1 1.2 Thesiscontributions ................................. 4 1.3 Thesisoutline...................................... 5 2 Background 6 2.1 GeneExpressionAnalysis .............................. 6 2.2 Microarrays....................................... 8 2.3 Analysisofmicroarraydata .. .. .. .. .. .. .. .. .. .. .. .. .. 12 2.4 Geneselection...................................... 14 2.5 GeneRegulatoryNetworks .. .. .. .. .. .. .. .. .. .. .. .. .. 16 2.5.1 BooleanNetworks ............................... 18 2.5.2 CorrelationNetworks ............................. 18 2.5.3 BayesianNetworks............................... 19 2.6 Identifying Gene Regulatory Networks structure . ........ 20 2.7 ModuleAnalysis .................................... 21 2.8 Constructionofrobustregulatorynetworks . ........ 22 2.9 Incorporatingexpertise. .... 24 2.10 Integration of multiple data . ... 25 2.11Conclusions....................................... 26 vi Contents 3 Key Concepts 27 3.1 Co-expression..................................... 27 3.2 Clustering........................................ 29 3.3 ScalefreevsRandomgraphs . .. .. .. .. .. .. .. .. .. .. .. .. 31 3.4 Weighted Gene Correlation Network Analysis . .... 32 3.4.1 WGCNAnetworksappliedtowheat . 34 3.5 Modelling GRNs using Glasso . 38 3.5.1 Inverse covariance and partial correlation . .... 38 3.5.2 Lasso ...................................... 38 3.5.3 Graphicallasso................................. 39 3.5.4 Glasso implementation in R . 40 3.5.5 Glasso networks applied to wheat . 42 3.6 Modelling GRNs using Bayesian Networks . .. 46 3.6.1 Modelselection................................. 47 3.6.2 D-separation, Markov property and conditional independence....... 49 3.6.3 Bayesian Network Inference Algorithms . .. 50 3.6.4 Prediction.................................... 51 3.6.5 Application to gene expression profiles . 52 3.7 Conclusion ....................................... 52 4 Analysis of synthetic data 53 4.1 Introduction...................................... 53 4.2 Methods......................................... 54 4.2.1 Single study glasso network . 56 4.2.2 Graph similarity . 56 4.2.3 Consensus networks and unique-connections . ..... 56 4.2.4 UniqueNetworks................................ 57 4.2.5 Bayesianunique-networks . 58 4.2.6 Predictionaccuracy .............................. 59 4.2.7 Biologicalsupport ............................... 59 4.2.8 Biclustering................................... 59 4.3 Datastructure .................................... 60 4.4 Resultsonsimulateddata .............................. 66 4.4.1 Unique-networks and intermediate results . ... 67 4.4.2 Prediction accuracy and final results . .. 69 vii Contents 4.5 Comparison with Biclustering . 74 4.6 Discussion........................................ 74 5 Analysis of Real Data 77 5.1 Introduction...................................... 77 5.2 Pipeline adaptation to real datasets . .... 78 5.2.1 Variables selection . 78 5.2.2 Consensusanduniquenetworks. 79 5.2.3 Biologicalsupport ............................... 79 5.3 Wheatresults...................................... 80 5.4 Wheatcomparison ................................... 89 5.4.1 Comparison with Bicluster . 89 5.4.2 ComparisonwithWGCNA .......................... 90 5.5 Biological validation - literature . .. 90 5.6 Fusariumresults .................................... 94 5.6.1 ComparisonwithWGCNA: . .. .. .. .. .. .. .. .. .. .. .. 95 5.7 Discussion........................................ 98 6 Cancer data and logic application 100 6.1 Introduction...................................... 100 6.2 Methoddescription.................................. 101 6.2.1 Variable selection . 102 6.2.2 Genecards validation and probability score . 103 6.2.3 LogicandGUI ................................. 103 6.3 Results and applications . 104 6.3.1 Identification of unique-genes through GeneCards . ...... 109 6.3.2 Gene-a-la-carte and source selection . 110 6.4 Interfacedescription-Logic . 111 6.5 Discussion........................................115 7 Conclusions 116 7.1 Thesiscontributions ................................. 116 7.1.1 Unique-networks ................................ 116 7.1.2 Unique Network Discovery Pipeline . 116 7.1.3 Application to several datasets . 117 7.1.4 Unique genes and probability score . 117 viii Contents 7.1.5 Logic Application . 118 7.2 Limitations .......................................118 7.3 Furtherwork ...................................... 119 7.3.1 NextGenerationSequencing . 119 7.3.2 Application to different kind of data . 120 7.3.3 Staticvsdynamicdata