Co-Evolving Protein Sites: Their Identification Using Novel, Highly-Parallel Algorithms, and Their Use in Classifying Hazardous

Co-evolving protein sites: their identification using novel, highly-parallel algorithms, and their use in classifying hazardous genetic mutations A thesis submitted in partial fulfilment of the requirement for the degree of Doctor of Philosophy Louise Knight 2017 Cardiff University School of Computer Science & Informatics Declaration This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree. Signed . (candidate) Date . Statement 1 This thesis is being submitted in partial fulfilment of the requirements for the degree of PhD. Signed . (candidate) Date . Statement 2 This thesis is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by explicit refer- ences. Signed . (candidate) Date . Statement 3 I hereby give consent for my thesis, if accepted, to be available for photo- copying and for inter-library loan, and for the title and summary to be made available to outside organisations. Signed . (candidate) Date . 1 Abstract Algorithms for detecting molecular co-evolution have until now been ap- plied only to individual protein families, but not to the human proteome. Linked to this is the problem that performing the computations for identify- ing co-evolving sites in the human proteome would take a prohibitively long time using the serial algorithms already in use. In addition, co-evolving sites have not been pursued as a possible way of classifying mutations according to their likelihood to cause disease. The main contributions of this thesis are as follows: identification of three suitable methods for detecting molecular co-evolution by comparing the performance of published state-of-the-art methods on simulated data; implementation of these methods in the parallel architecture CUDA, and evaluation of these methods’ performance in comparison to serial implementations of the same methods; and identification of co-evolving sites across the entire human proteome, and analysis of these sites according to what is already known about disease-causing mutations. Be- yond demonstrating the effectiveness of CUDA for implementing molecular co-evolution detection algorithms, we derive insights into techniques for effi- cient implementation of algorithms in CUDA (particularly algorithms which require tree-based structures, such as parametric methods), and our results provide preliminary insights into the relationship between co-evolving sites and mutation pathogenicity. 2 Acknowledgements First I wish to thank my supervisors, Dr. Andrew Jones, Dr. Christine Mumford, and Dr. Andrew Pocklington, for their invaluable advice and en- couragement throughout my PhD, and also particularly towards the end, when everything was coming together. I would like to express my gratitude for your constructive feedback on the numerous drafts this thesis has gone through. Again thank you for giving me this opportunity in the first place; it has been a privilege to learn and to grow my knowledge and experience in this fascinat- ing field. There are many members of staff in the School of Computer Science & Informatics that I have interacted with and who have helped me in various ways over the years; thanks go to you all. I must thank my fellow research students in the School, not just for the numerous conversations about research, but also for providing a welcome dis- traction when necessary. If I start naming all of you, I risk missing someone out; there are truly too many of you to name. I give thanks to my parents, John and Jean, for always encouraging me to go as far as I can, not just in my PhD, but throughout my whole life. It is through your constant support from the very beginning that I gained confidence in my abilities, without which I would never have got this far; you have always been my biggest cheerleaders. Thank you also for reminding me to take breaks (and ensuring I actually take those breaks!). Really, there is more to thank you for than I can actually write here; thank you. Finally, thanks go to Ivan; although we unfortunately met when I had just over 3 months to go until submission, you implicitly understood the impor- tance of this work to me and gave me the time, space, and support to finish everything up, for which I am very grateful. i Contents List of Tables vii List of Figures ix 1 Introduction 2 1.1 Overview . .2 1.2 Hypothesis . .4 1.3 Thesis contributions . .4 1.4 Thesis structure . .5 1.5 Summary . .6 2 Background 7 2.1 How proteins are made . .8 2.2 Mutations . 11 2.2.1 Gamma distribution to model mutation rate . 11 2.2.2 Amino acid similarity matrices . 12 2.3 Ancestry and phylogenetic trees . 14 2.3.1 Phylogenetic trees . 16 2.3.2 Homology . 17 2.4 Phylogeny representation and construction . 17 2.4.1 Newick format . 17 2.4.2 ClustalW2 . 19 2.5 Ancestral sequence reconstruction . 19 2.5.1 Fitch’s algorithm . 19 2.5.2 Maximum Likelihood . 20 2.6 Multiple Sequence Alignments . 21 2.6.1 FASTA . 22 2.6.2 PHYLIP . 23 2.7 Co-evolution . 24 2.7.1 Sources of randomness in correlation signal . 26 2.8 Site conservation . 27 2.9 Parallel computing . 27 2.10 Complex brain disorders . 31 ii 2.11 Summary . 32 3 Literature Review 33 3.1 Non-parametric methods . 34 3.1.1 Mutual Information (MI) . 34 3.1.2 Observed Minus Expected Squared (OMES) . 37 3.1.3 Perturbation-based algorithms . 38 3.1.4 McBASC . 39 3.1.5 PSICOV and DCA . 40 3.1.6 Sequence divergence-based approximation . 41 3.2 Parametric methods . 42 3.2.1 Maximum Likelihood Approximation . 42 3.2.2 Tracking changes on branches . 44 3.2.3 Bayesian Mutational Mapping . 46 3.3 Discussion of methods for detecting molecular co-evolution . 47 3.3.1 Alphabet choice . 47 3.3.2 Parallelisation potential . 48 3.3.3 Method comparison . 48 3.3.4 Methods summary . 52 3.3.5 Recommendations . 55 3.4 Methods for removing background noise . 57 3.4.1 Alignment curation . 58 3.4.2 Correcting the null hypothesis/test statistic to account for phylogeny . 58 3.4.3 Chosen method of background removal . 59 3.5 Summary . 61 4 Data Collection 63 4.1 Databases . 63 4.1.1 Ensembl . 64 4.1.2 HomoloGene . 65 4.1.3 UCSC Genome Bioinformatics . 65 4.1.4 Choice . 66 4.2 Data collection methodology . 67 4.2.1 Alignment . 68 4.2.2 Data collected . 71 4.3 Summary . 71 5 Simulation 72 5.1 Simulation program . 72 5.2 Simulation parameters . 74 iii 5.2.1 Evolution model . 74 5.2.2 Alignment dimensions . 74 5.2.3 Branch scale . 76 5.2.4 Gamma scale parameter . 78 5.2.5 Co-evolving clusters . 79 5.2.6 Amino acid frequencies . 80 5.3 Iterations . 80 5.4 Summary . 81 6 Comparison of Serial Methods 82 6.1 Results-gathering approach . 82 6.2 Choosing a Z-score type . 85 6.3 Analysing results per UAA threshold . 87 6.4 Choosing a “lower” and a “higher” unique amino acid threshold 91 6.5 Choosing Z-score thresholds . 92 6.6 Results by number of sequences . 93 6.6.1 “Lower” threshold . 94 6.6.2 “Higher” threshold . 95 6.7 Results by alignment length . 96 6.7.1 “Lower” threshold . 96 6.7.2 “Higher” threshold . 98 6.8 Results by rate variation parameter . 99 6.8.1 “Lower” threshold . 99 6.8.2 “Higher” threshold . 100 6.9 Results by percentage of sites co-evolving . 102 6.9.1 “Lower” threshold . 103 6.9.2 “Higher” threshold . 103 6.10 Combining methods . 104 6.11 Summary . 105 7 Parallelisation of Methods using CUDA 107 7.1 Approach taken to parallelisation . 107 7.1.1 PlotCorr . 110 7.1.2 Waddell parametric methods . 118 7.2 Full-program method times . 122 7.2.1 Experimental set-up . 122 7.2.2 Approach taken to timing . 122 7.2.3 Timing results . 123 7.3 Cost/benefit analysis . 131 7.3.1 CPU-only system . 131 7.3.2 Raven . 131 iv 7.3.3 GeForce GTX 660 Ti . 131 7.3.4 GeForce GTX Titan X . 132 7.3.5 Larger alignments . 132 7.4 Summary . 133 8 Application of Methods to Real Data 134 8.1 Alternative methods for classifying mutations . 134 8.1.1 SIFT . 134 8.1.2 PolyPhen . 136 8.2 Real data analysis . 137 8.2.1 Amino acid frequencies . 138 8.2.2 Introduction to the data . 144 8.2.3 Data analysis introduction . 145 8.3 Summary . 154 9 Conclusions 156 9.1 A suitable method for detecting molecular co-evolution . 156 9.2 A tree representation suitable for single-pass CUDA algorithms157 9.3 Parallelisation of co-evolution detection methods . 157 9.4 General guidelines for porting algorithms to CUDA . 158 9.4.1 PlotCorr . ..

Load more