Sze-To Hoyin.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
Discovering Patterns from Sequences with Applications to Protein-Protein and Protein-DNA Interaction by Ho Yin Sze-To A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Systems Design Engineering Waterloo, Ontario, Canada, 2018 c Ho Yin Sze-To 2018 Examining Committee Membership The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote. External Examiner Luis Rueda Professor Supervisor(s) Andrew K.C. Wong Distinguished Professor Emeritus Daniel Stashuk Professor Internal Member(s) Paul Fieguth Professor Department Chair Alexander Wong Associate Professor, P.Eng. Canada Research Chair in Medical Imaging Systems Internal-external Member Bin Ma Professor University Research Chair ii Author's Declaration This thesis consists of material all of which I authored or co-authored: see Statement of Contributions included in the thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. iii Statement of Contributions Content from 5 papers are used in this thesis. I was the co-author with major contri- butions on designing the methods, implementation and writing the papers: • Sze-To, A., & Wong, A. K. (2017). Pattern-Directed Aligned Pattern Clustering. In Bioinformatics and Biomedicine (BIBM), 2017 IEEE International Conference on. IEEE. Acceptance Rate: 19%. Won IEEE TCCLS Student Award. Incorpo- rated in Chapter 3 of this thesis. [126] • Sze-To, A., & Wong, A. K. (2018). Discovering Patterns from Sequences Using Pattern-Directed Aligned Pattern Clustering. NanoBioScience, IEEE Transactions on. Incorporated in Chapter 3 of this thesis. [127] • Lee, E. S. A., Sze-To, H. Y. A., Wong, M. H., Leung, K. S., Lau, T. C. K., & Wong, A. K. (2017). Discovering protein-dna binding cores by aligned pattern clustering. Computational biology and bioinformatics, IEEE/ACM transactions on, 14(2), 254-263. Incorporated in Chapter 4 of this thesis. [75] • Sze-To, A., Fung, S., Lee, E. S. A., & Wong, A. K. (2015, November). Predicting Protein-protein interaction using co-occurring Aligned Pattern Clusters. In Bioin- formatics and Biomedicine (BIBM), 2015 IEEE International Conference on (pp. 55-60). IEEE. Acceptance Rate: 19%. Incorporated in Chapter 5 of this thesis. [124] • Sze-To, A., Fung, S., Lee, E. S. A., & Wong, A. K. (2016). Prediction of Protein- Protein Interaction via Co-Occurring Aligned Pattern Clusters. Methods. Incorpo- rated in Chapter 5 of this thesis. [125] iv Abstract Understanding Protein-Protein and Protein-DNA interaction is of fundamental impor- tance in deciphering gene regulation and other biological processes in living cells. Tra- ditionally, new interaction knowledge is discovered through biochemical experiments that are often labor-intensive, expensive and time-consuming. Thus, computational approaches are preferred. Due to the abundance of sequence data available today, sequence-based interaction analysis becomes one of the most readily applicable and cost-effective methods. One important problem in sequence-based analysis is to identify the functional regions from a set of sequences within the same family or demonstrating similar biological func- tions in experiments. The rationale is that throughout evolution the functional regions normally remain conserved (intact), allowing them to be identified as patterns from a set of sequences. However, there are also mutations such as substitution, insertion, deletion in these functional regions. Existing methods, such as those based on position weight ma- trices, assume that the functional regions have a fixed width and thus cannot not identify functional regions with mutations, particularly those with insertion or deletion mutations. Recently, Aligned Pattern Clustering (APCn) was introduced to identify functional regions as Aligned Pattern Clusters (APCs) by grouping and aligning patterns with variable width. Nevertheless, APCn cannot discover functional regions with substitution, insertion and/or deletion mutations, since their frequencies of occurrences are too low to be considered as patterns. To overcome such an impasse, this thesis proposes a new APC discovery algorithm known as Pattern-Directed Aligned Pattern Clustering (PD-APCn). By first discovering seed patterns from the input sequence data, with their sequence positions located and recorded on an address table, PD-APCn can use the seed patterns to direct the incremental extension of functional regions with minor mutations. By grouping the aligned extended patterns, PD-APCn can recruit patterns adaptively and efficiently with variable width without relying on exhaustive optimal search. Experiments on synthetic datasets with different sizes and noise levels showed that PD-APCn can identify the implanted pattern with mutations, outperforming the popular existing motif-finding software MEME with much higher recall and Fmeasure over a computational speed-up of up to 665 times. When applying PD-APCn on datasets from Cytochrome C and Ubiquitin protein families, all key binding sites conserved in the families were captured in the APC outputs. In sequence-based interaction analysis, there is also a lack of a model for co-occurring functional regions with mutations, where co-occurring functional regions between interac- tion sequences are indicative of binding sites. This thesis proposes a new representation v model Co-Occurrence APCs to capture co-occurring functional regions with mutations from interaction sequences in database transaction format. Applications on Protein-DNA and Protein-Protein interaction validated the capability of Co-Occurrence APCs. In Protein-DNA interaction, a new representation model, Protein-DNA Co-Occurrence APC, was developed for modeling Protein-DNA binding cores. The new model is more com- pact than the traditional one-to-one pattern associations, as it packs many-to-many associ- ations in one model, yet it is detailed enough to allow site-specific variants. An algorithm, based on Co-Support Score, was also developed to discover Protein-DNA Co-Occurrence APCs from Protein-DNA interaction sequences. This algorithm is 1600x faster in run-time than its contemporaries. New Protein-DNA binding cores indicated by Protein-DNA Co- Occurrence APCs were also discovered via homology modeling as a proof-of-concept. In Protein-Protein interaction, a new representation model, Protein-Protein Co-Occurrence APC, was developed for modeling the co-occurring sequence patterns in Protein-Protein Interaction between two protein sequences. A new algorithm, WeMine-P2P, was developed for sequence-based Protein-Protein Interaction machine learning prediction by construct- ing feature vectors leveraging Protein-Protein Co-Occurrence APCs, based on novel scores such as Match Score, MaxMatch Score and APC-PPI score. Through 40 independent ex- periments, it outperformed the well-known algorithm, PIPE2, which also uses co-occurring functional regions while not allowing variable widths and mutations. Both applications on Protein-Protein and Protein-DNA interaction have indicated the potential use of Co- Occurrence APC for exploring other types of biosequence interaction in the future. vi Acknowledgements I would like to thank all the important people who made this thesis possible. Firstly, I would like to express my gratitude to my supervisors, Professor Andrew K. C. Wong and Professor Daniel Stashuk who guide and support me in my research. Professor Wong, thank you very much for providing me with opportunities in studying in University of Waterloo as an international student. I would not be able to study here without your support and assistance. Thank you for your patience and understanding. Your advice helps me to improve, and guides me throughout the entire Ph.D. process. Professor Daniel Stashuk, thank you very much for your consistent encouragement. Your words always boost my confidence, particularly when I need to face difficulties. I also appreciate your help in providing me with high-end computational equipment to run experiments. Secondly, I would like to thank my Ph.D. committee members Professor Paul Fieguth, from Department of Systems Design Engineering, Prof. Alexander Wong, from Department of Systems Design Engineering, Professor Bin Ma, from Cheriton School of Computer Science, for their time and commitment contributed in reviewing this thesis. I would also like to thank Professor Luis Rueda, from School of Computer Science in University of Windsor, for accepting reviewing this thesis, and sparing his vacation time to participate in my PhD defense. Thirdly, I would like to thank all my colleague whom I have worked and studied with at University of Waterloo, especially Dr. En-Shiun Annie Lee and Sanderz Fung for their outstanding collaborative work, as well as Dr. Dennis Zhuang, for his support in providing me with the relevant toolboxes. I would also like to specially thank Shawn Zhexuan Wang, who supported me tirelessly in my Ph.D study. Last but not least, I would like to thank Professor Fakhri Karray, the director of Centre for Pattern Analysis and Machine Intelligence, who granted me access to the laboratory, allowing me to conduct research in the laboratory. vii Dedication To my beloved parents, My beloved younger sister, My mentors and supporters throughout my life. viii Table of Contents Author's Declaration iii Abstract v Acknowledgements