Flexible Model-Based Joint Probabilistic Clustering of Binary and Continuous Inputs and Its Application to Genetic Regulation and Cancer
Total Page:16
File Type:pdf, Size:1020Kb
Flexible model-based joint probabilistic clustering of binary and continuous inputs and its application to genetic regulation and cancer by Fatin Nurzahirah Binti Zainul Abidin Submitted in accordance with the requirements for the degree of Doctor of Philosophy THE UNIVERSITY OF LEEDS in the Faculty of Biological Sciences School of Molecular and Cellular Biology August 2017 i Intellectual Property and Publication Statements The candidate confirms that the work submitted is his/her own, except where work which has formed part of jointly authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated below 1st Authored, used in this thesis, Chapter 2, 3 and 4: Fatin N. Zainul Abidin and David R. Westhead. "Flexible model- based clustering of mixed binary and continuous data: application to genetic regulation and cancer”. Nucleic Acids Res (2017) 45 (7):e53. This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement. The right of Fatin N. Zainul Abidin to be identified as Author of this work has been asserted by her in accordance with the copyright, Designs and Patents Act 1988. © 2017 The University of Leeds and Fatin N. Zainul Abidin ii Acknowledgements First and foremost, thanks to God for giving me the chance and strength to complete this thesis in time although I faced some personal difficulties along the way to complete this thesis, but I am glad I still manage to complete it. Then, thanks to my lovely family in Malaysia for their understanding and allowing me to further my study here and being away from home with a great distance for three and a half years. Next, I would like to express the deepest appreciation to my supervisor, Professor David Westhead, head of the School of Molecular and Cellular Biology who introduced me to the Bioinformatics field and for his patient, guidance and support whenever I ran into trouble or had a question about my research or writing. I was lucky to have learned a lot from Dave and his presence in each step of my PhD progressions and encouraging and kind to me whenever I get confused and frustrated was really a blessing for me. In addition, thank you to my fellow lab mates especially Vijay, Nisar, Francis, Chulin and Matt for providing me with unfailing support and continuous encouragement throughout my years of study. I would also like to acknowledge Dr. Joan Boyes as my secondary supervisor and also as the second reader of this thesis, and I am gratefully indebted for Dave’s and Joan’s very valuable comments in this thesis. Finally, I would like to thank the Majlis Amanah Rakyat (MARA), or the Council of Trust for the People, an agency under the Ministry of Rural and Regional Development Malaysia for their trust in me and willing to sponsor a huge amount of fund so that I am able to study here because of the lack of Bioinformatics concentration back home. iii Abstract Clustering is used widely in ‘omics’ studies and is often tackled with standard methods such as hierarchical clustering or k-means which are limited to a single data type. In addition, these methods are further limited by having to select a cut-off point at specific level of dendrogram- a tree diagram or needing a pre-defined number of clusters respectively. The increasing need for integration of multiple data sets leads to a requirement for clustering methods applicable to mixed data types, where the straightforward application of standard methods is not necessarily the best approach. A particularly common problem involves clustering entities characterized by a mixture of binary data, for example, presence or absence of mutations, binding, motifs, and/or epigenetic marks and continuous data, for example, gene expression, protein abundance and/or metabolite levels. In this work, we presented a generic method based on a probabilistic model for clustering this mixture of data types, and illustrate its application to genetic regulation and the clustering of cancer samples. It uses penalized maximum likelihood (ML) estimation of mixture model parameters using information criteria (model selection objective function) and meta-heuristic searches for optimum clusters. Compatibility of several information criteria with our model-based joint clustering was tested, including the well-known Akaike Information Criterion (AIC) and its empirically determined derivatives (AICλ), Bayesian Information Criterion (BIC) and its derivative (CAIC), and Hannan-Quinn Criterion (HQC). We have experimentally shown with simulated data that AIC and AIC (λ=2.5) worked well with our method. We show that the resulting clusters lead to useful hypotheses: in the case of genetic regulation these concern regulation of groups of genes by specific sets of transcription factors and in the case of cancer samples combinations of gene mutations are related to patterns of gene expression. The clusters have potential mechanistic significance and in the latter case are significantly linked to survival. iv Table of Contents Basic molecular biology .................................................................................. 1 Chromatin ....................................................................................................... 3 Experimental techniques for studying regulation ............................................. 6 Cancer .......................................................................................................... 11 Genomic projects – TCGA, etc. .................................................................... 15 Data analytical techniques ............................................................................ 16 Aims and objectives of the thesis .................................................................. 22 v Introduction ................................................................................................... 23 Methodology ................................................................................................. 34 Results ......................................................................................................... 47 Discussion .................................................................................................... 57 Conclusions .................................................................................................. 57 Introduction ................................................................................................... 59 Methodology ................................................................................................. 63 Results ......................................................................................................... 71 vi Discussion .................................................................................................... 96 Conclusion .................................................................................................... 96 Introduction ................................................................................................... 97 Methodology ............................................................................................... 102 Results and discussion ............................................................................... 107 Conclusion .................................................................................................. 119 Method development .................................................................................. 120 Transcriptional regulatory networks reconstruction ..................................... 121 Identification of cancer subtypes ................................................................. 122 Future work................................................................................................. 123 vii List of Figures Figure 1.1 A section of a chromosome (a gene) containing exon in between of introns. ................................................................................................... 2 Figure 1.2 Prokaryotic vs eukaryotic gene expression central dogmas of molecular biology. .................................................................................................. 3 Figure 1.3 Modulation of transcription of a eukaryotic gene in an active state. ........ 4 Figure 1.4 A genomic locus analysed by corresponding chromatin profiling experiments. .......................................................................................... 7 Figure 1.5 Workflow of RNA-seq from library preparation to RNA profiles quantification. ......................................................................................... 9 Figure 1.6 Representation of ChIP-seq of a DNA-binding protein on DNA. ........... 11 Figure 1.7 How DNA methylation silences a gene. ................................................ 14 Figure 1.8 Clustering methods applied for 40 genes which have been measured under two different experimental conditions. ........................................ 18 Figure 1.9 Top-down (agglomerative) and bottom-up (divisive) strategies in hierarchical clustering of six data points in this example....................... 19 Figure 2.1 Probability distribution for random variables 푿 = 풙풌 from tossing a fair coin twice. ............................................................................................ 25 Figure 2.2 Probability distribution for probability density function of random variables 푋 = 푥푘 from a list of continuous