Modularity Conserved During Evolution: Algorithms and Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Modularity Conserved during Evolution: Algorithms and Analysis Luqman Hodgkinson Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2015-17 http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-17.html May 1, 2015 Copyright © 2015, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. Acknowledgement This thesis presents joint work with several collaborators at Berkeley. The work on algorithm design and analysis is joint work with Richard M. Karp, the work on reproducible computation is joint work with Eric Brewer and Javier Rosa, and the work on visualization of modularity is joint work with Nicholas Kong. I thank all of them for their contributions. They have been pleasant collaborators with stimulating ideas and good designs. Our descriptions of the projects in this thesis overlap somewhat with our published descriptions in [Hodgkinson and Karp, 2011a], [Hodgkinson and Karp, 2011b], [Hodgkinson and Karp, 2012], and [Hodgkinson et al., 2012]. Modularity Conserved during Evolution: Algorithms and Analysis by Luqman Hodgkinson B.A. Hon. (Hiram College) 2003 M.S. (Columbia University) 2007 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science with Designated Emphasis in Computational and Genomic Biology in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY Committee in charge: Professor Richard M. Karp, Chair Professor Christos H. Papadimitriou Professor John P. Huelsenbeck Spring 2013 The dissertation of Luqman Hodgkinson is approved: Professor Richard M. Karp, Chair Date Professor Christos H. Papadimitriou Date Professor John P. Huelsenbeck Date University of California, Berkeley Spring 2013 Modularity Conserved during Evolution: Algorithms and Analysis Copyright c 2013 by Luqman Hodgkinson Abstract Modularity Conserved during Evolution: Algorithms and Analysis by Luqman Hodgkinson Doctor of Philosophy in Computer Science with Designated Emphasis in Computational and Genomic Biology University of California, Berkeley Professor Richard M. Karp, Chair Modularity is a defining feature of biological systems. This dissertation presents our work on the development of algorithms to detect modularity in protein interaction networks and techniques of analysis for interpreting the results. A multiprotein module is a collection of proteins exhibiting modularity in their interactions. Multiprotein modules may perform essential functions and be conserved by purifying selection. A new linear-time algorithm named Produles offers significant algorithmic advantages over previous approaches. An algorithmic framework for evaluation is presented that fa- cilitates evaluation of algorithms for detecting conserved modularity with respect to their algorithmic goals. Optimization criteria for detecting homologous multiprotein modules are examined, and their effects on biological process enrichment are quantified. Graph theoretic properties that arise from the physical construction of protein interaction networks account for 36 percent of the variance in biological process enrichment. Protein interaction similarities between conserved modules have only minor effects on biological process enrichment. As random modules increase in size, both biological process enrichment and modularity tend to improve, though modularity does not show this trend in small modules. To adjust for this trend, we recommend a size correction based on random sampling of modules when using biological process enrichment to evaluate module boundaries. 1 Supporting software has been developed useful for designing high quality algorithms for detecting conserved multiprotein modularity. EasyProt is a parallel implementation of scientific workflow software designed for cloud computing that retrieves data from several sources, runs algorithms in parallel, and computes evaluation statistics. VieProt is visual- ization software for conserved multiprotein modularity that uses a dynamic force-directed layout and displays quality measures and statistical summaries. With high quality protein interaction data, it may be possible to use modules to improve the prediction of proteins that are orthologous to each other and that have maintained their function. We present statistical methods that may be useful for this purpose. The utility of these models will depend on anticipated improvements in protein interaction data quality. Professor Richard M. Karp, Chair Date 2 Acknowledgements This thesis presents joint work with several collaborators at Berkeley. The work on al- gorithm design and analysis is joint work with Richard M. Karp, the work on reproducible computation is joint work with Eric Brewer and Javier Rosa, and the work on visualization of modularity is joint work with Nicholas Kong. I thank all of them for their contributions. They have been pleasant collaborators with stimulating ideas and good designs. Our de- scriptions of the projects in this thesis overlap somewhat with our published descriptions in [Hodgkinson and Karp, 2011a], [Hodgkinson and Karp, 2011b], [Hodgkinson and Karp, 2012], and [Hodgkinson et al., 2012]. With regard to the design of Produles, we thank Manikandan Narayanan for provid- ing a starting point with his Match-and-Split algorithm, graciously providing full source code and ideas. We thank Bonnie Kirkpatrick for her guidance at several stages of the project, and Shuai Cheng Li for providing interesting alternative ideas for various subrou- tines. We thank John Huelsenbeck and Christos Papadimitriou for listening to ideas and providing constructive feedback. We thank Kimmen Sj¨olanderand Ruchira Datta for their contributions of PHOG data and their encouragement. We thank Nicholas Kong for contributions to the design of EasyProt and the initial implementation of VieProt. We thank Bonnie Kirkpatrick and Shuai Cheng Li for their feedback with regards to both EasyProt and VieProt. The long hours working on the implementation of EasyProt together with Javier Rosa were some of the highlights of my time at Berkeley, and we thank Eric Brewer for his guidance on this work. We thank Sabry Razick and Ian Donaldson for their help with iRefIndex and Psicquic. We thank Julie Hopper and Chris DiVittorio for their guidance in statistical analy- sis techniques for biological data, and Julie Hopper for her guidance with R. We thank Martin Wainwright for his guidance in statistical methods for NetAlign, and Sehand Negahban who provided helpful discussions about the graphical model used in NetAlign during its for- mative stages. We thank Garvesh Raskutti for valuable advice regarding the analysis and i comparison of optimization criteria for detecting conserved multiprotein modularity. My PhD work has been supported by grants NSF IIS-0803937 and NSF CCF-1052553. I thank LaShana Porlaris and Xuan Quach for their guidance in the Ph.D. program, and Linda Stubbs for her management of funding issues. I am pleased to warmly thank those who made it possible for me to pursue my academic dreams. Dan and Nancy Hoffman invited me into their home when I was a construction worker and opened the doors of academia to me. Graham Creasey and Laura Holmes provided a family away from home during my young life. Jack Rumbaugh personally drove me to Hiram College to talk to the director of admissions, and secured my admission, shortly before tragically passing away. To these key figures in my life I remain eternally grateful. I am pleased to warmly thank key academic figures in my life for the critical con- tributions that they added to my success. My advisors at Hiram College, Ellen Walker and Oberta Slotterbeck (Obie), gave me the initial support I needed in my early educa- tion and have been extremely supportive and encouraging ever since. Denny Taylor and Sigrid Anderson provided supportive academic in-classroom and out-of-classroom instruc- tion with a personal touch. Feodor Dragan at Kent State University gave me important opportunities to advance in my studies. Mihalis Yannakakis at Columbia University pro- vided guidance in computational complexity, algorithms, and mathematics, and maintained an unfailing belief in my abilities, even during difficult times. Itsik Pe'er at Columbia Uni- versity provided initial instruction in computational biology, provided friendly support, and accepted me as his Ph.D. student during difficult times, though I chose to attend Berkeley. The professors who have taught me at the University of California, Berkeley, provided the education in natural sciences that I desperately needed. I am especially pleased to warmly thank my advisor, Professor Richard M. Karp. Through- out my studies at the University of California, Professor Karp has guided me towards in- teresting problems. I have learned much from my interactions with him about the pleasure of solving interesting problems that can be viewed as puzzles. I deeply appreciate that he took me as his student during a difficult time in my life, and for his supportiveness when ii I was filling large gaps in my education. It was a great pleasure to serve as his teaching assistant twice for CS270: Combinatorial