Efficient Algorithms for Improving the Accuracy in Motifs Prediction
University of Connecticut
OpenCommons@UConn

Master's Theses, University of Connecticut Graduate School

9-19-2012

Efficient Algorithms for Improving the Accuracy in Motifs Prediction

Jerlin C. Merlin

Recommended Citation
Merlin, Jerlin C., "Efficient Algorithms for Improving the Accuracy in Motifs Prediction" (2012). Master's Theses. 351.
https://opencommons.uconn.edu/gs_theses/351

This work is brought to you for free and open access by the University of Connecticut Graduate School at OpenCommons@UConn. It has been accepted for inclusion in Master's Theses by an authorized administrator of OpenCommons@UConn. For more information, please contact [email protected].

Efficient Algorithms for Improving the Accuracy in Motifs Prediction

Jerlin Camilus Merlin
B.E., Anna University, India, 2005

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science at the University of Connecticut
2012

Copyright by Jerlin Camilus Merlin
2012

APPROVAL PAGE

Master of Science Thesis

Efficient Algorithms for Improving the Accuracy in Motifs Prediction

Presented by
Jerlin Camilus Merlin, B.E.

Major Advisor: Dr. Sanguthevar Rajasekaran
Associate Advisor: Dr. Reda A. Ammar
Associate Advisor: Dr. Chun-Hsi Huang

University of Connecticut
2012

PREFACE

The Master of Science Thesis “Efficient Algorithms for Improving the Accuracy in Motifs Prediction” is a part of the research conducted at the Department of Computer Science and Engineering, University of Connecticut. The computing facilities at the Department of Computer Science and Engineering and the Booth Engineering Center for Advanced Technologies (BECAT) were used for the purpose of this research. No human or animal subjects were involved in this research.
This research has been supported in part by the National Institutes of Health grants numbered LM010101 and GM079689, and NSF Grant 0829916.

ACKNOWLEDGEMENT

From the very bottom of my heart, I express my heartfelt thanks to my Creator and Savior, without whom this thesis would not have been possible. He was there for me and led me at all times and blessed me with His wisdom, understanding and knowledge.

I express my genuine gratitude and thanks to Prof. Raj, who took me in as a research student, taught me the intricacies and complexities of this research program, and guided me throughout this research work with valuable advice and corrections. He is not only a committed researcher who motivates research students but also a wise and skillful teacher whose classes were always interesting, and a trusted counselor whom I could depend upon. Without his constant support and guidance, this thesis would not have been possible. It was really a joy working with him. I fondly remember the discussions, which took place in a very warm and congenial ambience. I will certainly cherish the lessons I have learnt from him.

I am also thankful to Prof. Martin Schiller, University of Nevada, Las Vegas, for making our weekly teleconference lively by providing biological inputs, extending necessary facilities, expressing his expectations and setting deadlines.

I place on record my sincere gratitude to Prof. Reda, Head of the Department, for his constant encouragement. I would also like to thank Prof. Chun-Hsi Huang for agreeing to be part of my graduate committee.

I also wish to express my sincere gratitude to my uncle, Prof. Dr. R. S. Milton, SSN College of Engineering, India, for inspiring me to pursue higher education and constantly supporting me spiritually and emotionally. I also thank my fellow lab mates Tian Mi, Sudipta Pathak and others for the stimulating discussions, inputs, and encouragement.

And finally, I would like to thank my parents, without whom I would not be at this place.
No words can describe the love, care and affection they showered upon me at all times even though they were on the other side of the planet. And last but not least, I would like to thank my husband for his unflinching love, support and care.

Dedication

I dedicate this work
To my parents, Merlin and Selva Mary
To my husband, Vinoth
To my uncle, Milton
To my siblings, Evan and Sadhu
And to my pet dog, Violet, which still cannot focus on the webcam and bark.

Abstract

Efficient Algorithms for Improving the Accuracy in Motifs Prediction
Jerlin Camilus Merlin
University of Connecticut
2012

Analysis of sequence homology has always played a major role in understanding biological factors such as protein domain identification, gene product relationships, and gene function determination. Short contiguous protein sequences that are conserved across proteins provide important information about such factors. These segments, which are at most 15 residues long, are known as minimotifs. Identifying minimotifs has been of much use in formulating hypotheses about biological functions that might otherwise remain uncharacterized. Motif prediction tools such as Minimotif Miner are widely used for predicting minimotifs. However, because minimotifs are so short, the probability of a motif occurring by random chance is very high. That is, a motif predicted by an algorithm could be invalid: it may have no biological significance in the context of the examined protein sequence and merely share the sequence of a known minimotif by coincidence. This is one of the major difficulties faced by motif prediction algorithms. Hence there is a need for sophisticated filtering mechanisms that reduce the false-positive rate in motif prediction. This research proposes two major filtering algorithms, along with their extensions, to effectively reduce the false-positive rate in motif prediction.
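The claim that short motifs are likely to match by chance can be made concrete with a rough back-of-the-envelope estimate. The sketch below is not taken from the thesis: it assumes, purely for illustration, that the 20 amino acids occur independently and with equal frequency, and the function name `expected_chance_matches` and the example pattern are hypothetical.

```python
def expected_chance_matches(allowed_per_position, protein_length, alphabet_size=20):
    """Expected number of chance matches of a motif in a random protein.

    allowed_per_position: for each motif position, how many residues are
        accepted there (1 = fixed residue, 20 = full wildcard).
    protein_length: number of residues in the scanned protein.

    Assumption (illustrative only): residues are independent and uniformly
    distributed over the alphabet, which real proteins are not.
    """
    motif_length = len(allowed_per_position)
    if protein_length < motif_length:
        return 0.0

    # Probability that one fixed window matches the motif.
    match_probability = 1.0
    for allowed in allowed_per_position:
        match_probability *= allowed / alphabet_size

    # Number of windows of motif_length in the protein.
    windows = protein_length - motif_length + 1
    return windows * match_probability


# A hypothetical 4-position pattern with two fixed residues and two wildcards,
# scanned against a 500-residue protein.
short_degenerate = expected_chance_matches([1, 20, 20, 1], 500)

# A fully specified 15-residue motif in the same protein.
long_specific = expected_chance_matches([1] * 15, 500)
```

Under these assumptions the short degenerate pattern is expected to match about 1.24 times per 500-residue protein purely by chance (497 windows times 1/400), whereas the fully specified 15-residue motif is expected to match essentially never. This is the kind of gap that motivates filtering predicted minimotifs with additional evidence.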
TABLE OF CONTENTS

Chapter 1 - Introduction
Chapter 2 - Protein-Protein Interactions Filter
    2.1 Introduction
    2.2 Data Sources for PPI Filters
    2.3 Design of Protein-protein Interaction filters
        2.3.1 Basic PPI Filter
        2.3.2 PPI-Homologene Filter
        2.3.3 PPI-Similarity Filter
    2.4 Results and Analysis
        2.4.1 Evaluation of PPI Filter
        2.4.2 Evaluation of the PPI-Homologene Filter
        2.4.3 Evaluation of PPI-Similarity Filter
        2.4.5 Statistical significance of the PPI Filters
        2.4.6 Study of Grb2 using PPI filters
        2.4.7 Analysis
    2.5 MnM User interface for the PPI Filter
    2.6 Conclusions
Chapter 3 - Genetic Interactions Filter
    3.1 Introduction
    3.2 Data sources for gene interaction filters
    3.3 Design of Gene Interactions filtering algorithms
        3.3.1 Basic GI Filter
        3.3.2 GI-Node Filter
        3.3.3 GI-Homologene Filter
    3.4 Results and Analysis
        3.4.1 Methods of analysis of GI Filters
        3.4.2 Evaluation of GI filters
        3.4.3 Performance of the GI filter along with existing MnM Filters
        3.4.4 The performance of the GI Filter as a function of minimotif properties
        3.4.5 Analysis
    3.5 MnM implementation of GI filters
    3.6 MnM User interface for GI Filters
    3.7 Conclusions
Chapter 4 - Conclusions and Future Work
Bibliography
Appendix A – Percentage of minimotifs in the corresponding proteins that pass the Similarity-PPI filter

LIST OF FIGURES

Figure 1 Evaluation of the Similarity-PPI filter
Figure 2 ROC curve for the PPI-Similarity