
UNIVERSITY OF CINCINNATI

Date:______

I, ______, hereby submit this work as part of the requirements for the degree of: in:

It is entitled:

This work and its defense approved by:

Chair: ______

On Applications of Statistical Learning to Biophysics

A dissertation submitted to the Division of Research and Advanced Studies of the University of Cincinnati in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY (Ph.D.) in the Department of Physics of the College of Arts and Sciences

Baoqiang Cao

B.S., 1998 Northwest University of China

M.S., 2000 Nanjing University (Thesis) and Northwest University of China (Certificate)

Abstract

In this dissertation, we develop statistical and machine learning methods for problems in biological systems and processes. In particular, we are interested in two problems: predicting structural properties of membrane proteins and clustering genes based on microarray experiments. For the membrane problem, we introduce a compact representation for amino acids and build a neural network predictor based on it to identify transmembrane domains in membrane proteins. Membrane proteins are divided into two classes based on the secondary structure of the segments spanning the lipid bilayer: alpha-helical and beta-barrel membrane proteins. We further build a support vector regression model to predict the lipid exposure levels of the amino acids within the transmembrane domains of alpha-helical membrane proteins, and we develop methods to predict pore-forming residues in beta-barrel membrane proteins. For the other problem, we apply a context-specific Bayesian clustering model to cluster genes based on their expression levels and cDNA copy numbers.

This dissertation is organized as follows. Chapter 1 introduces the most relevant biology and statistical and machine learning methods. Chapters 2 and 3 focus on prediction of transmembrane domains for the alpha-helix and the beta-barrel, respectively. Chapter 4 discusses the prediction of relative lipid accessibility, a different structural property for membrane proteins. The final chapter addresses the gene clustering approach.

Contents

Preface
List of Figures
List of Tables
Chapter 1. Introduction
1.1 Membrane proteins: structure and function
1.2 Microarray experiments and gene profiling
1.3 Classification and regression in structural bioinformatics
1.4 Unsupervised learning and clustering in genomics
1.5 Summary
Chapter 2. Transmembrane domain prediction in alpha-helical membrane proteins
2.1 Introduction
2.2 Methods and data
2.3 Results and discussion
2.4 Conclusion
Chapter 3. Transmembrane domain prediction in beta-barrel membrane proteins
3.1 Introduction
3.2 Methods and data
3.3 Results and discussion
3.4 Conclusion
Chapter 4. Relative lipid accessibility prediction in membrane proteins
4.1 Introduction
4.2 Materials and methods
4.3 Results and discussion
4.4 Conclusion
Chapter 5. Cluster analysis of array comparative genomic hybridization data
5.1 Introduction
5.2 Methods and data
5.3 Results and discussion
5.4 Future plans
5.5 Summary
Bibliography

Preface

I would like to thank my advisors, Dr. Mark Jarrell and Dr. Jaroslaw Meller, for their outstanding mentoring, inspiration and financial support. Their enthusiasm for and impressive understanding of various branches of computational sciences proved to be truly irresistible and left a permanent mark on my personality. For that, their kindness and patience, I am deeply grateful.

I also had the good fortune to work with Drs. Mario Medvedovic and Michael Wagner during my Ph.D. years. I am particularly indebted to Mario, who taught me to look at biological data in a critical and unbiased way. Michael helped me apply optimization-based techniques to problems in genomics, which proved very important for the success of our collaborative efforts and my interdisciplinary work stemming from these collaborations.

I would also like to thank the other members of my thesis committee, Dr. Rostislav Serota and Dr. Thomas Beck, for their instructive comments and their overall support.

Furthermore, I would like to thank Dr. Aleksey Porollo for his much appreciated technical help, and Han Yong Ng and Eric Moss for their critical reading of the dissertation.

I also want to thank my teachers and mentors from the Departments of Physics, Chemistry, and Biochemistry. I truly admire their knowledge and passion for teaching and research. There is an old saying in Chinese, which can be translated as "whoever is my teacher even for one day becomes my lifelong teacher". I cannot find any better words to express my appreciation and gratitude.

Last but not least, I would like to thank my wife, Ying Shen, and my parents, Haisheng Cao and Guaxian Suo, for their support and encouragement. Without them, I would have been unable to complete this journey.

Three chapters of this dissertation have been published or submitted for publication. In particular, chapter 2 was published in Bioinformatics [11], chapter 3 will be published as a chapter in a volume of the Advances in Computational and Systems Biology series [82], and chapter 4 was submitted to Proteins. The remaining parts of this dissertation have not yet been published or submitted. We are planning to submit one manuscript based on chapter 5 in the near future.

List of Figures

1.1 Functions of membrane proteins.
1.2 Various types of associations of membrane proteins with the bilayer lipid membrane.
1.3 Classes of membrane proteins.
1.4 Schematic experimental procedure to measure gene expression using a microarray.
2.1 The distribution of lengths for TM segments included in the MPtopo.
2.2 Dependence of prediction accuracy on the size of the sliding window, using 10-fold cross-validation on the non-redundant training set of 73 alpha-helical membrane proteins.
2.3 Examples of MINNOU transmembrane helix predictions for an ion channel (PDB code 1OTS) in panel A, and a glutamate transporter protein (PDB code 1XFH) in panel B.
3.1 An example of the TM segments prediction for a β-barrel (PDB code 1P4T).
4.1 The proposed position-specific error (PSE) function for ε-insensitive SVR models for the prediction of the relative lipid accessibility (RLA) in membrane proteins.
4.2 Average cross-validated RLA prediction accuracies (in terms of correlation coefficients) for the training set of 72 non-redundant chains with and without interface residues.
4.3 Example of RLA prediction for sodium/proton antiporter (PDB code 1ZCD, chain A).
4.4 TM domain and RLA prediction for sodium channel protein from human (SCN1A_HUMAN).
5.1 Directed acyclic graph for the Bayesian hierarchical model.
5.2 The ROC curves obtained using independent clustering of mRNA expression and cDNA copy number alteration patterns.
5.3 The ROC curves for the joint clustering of genes based on mRNA expression levels and cDNA copy number alteration.
5.4 Classification of samples using averaging of cDNA copy number alteration (right panel) and the corresponding mRNA expression levels (left panel).

List of Tables

2.1 Per-residue classification accuracy (Q3) of secondary structure predictions for TM proteins obtained using the SABLE and PSIPRED servers.
2.2 Accuracy of transmembrane segment (domain) prediction using alternative representations.
2.3 Accuracy of TM segment prediction using structural profiles, including KD and WW hydropathy profiles and predicted RSA and SS, estimated using cross-validation on non-redundant sets of alpha-helical and beta-barrel proteins.
2.4 Improvements due to 1st-layer, 2nd-layer and filter (F) based predictors according to the TMH Benchmark evaluation.
2.5 Assessment of TM helix prediction methods using the TMH Benchmark server.
2.6 Accuracy of different methods as measured by per-residue accuracy and Matthews correlation coefficients on a set of five helical membrane proteins not included in the training set.
2.7 Performance of MINNOU on the four control sets.
3.1 The non-redundant dataset of 13 β-barrel proteins (S13 set).
3.2 Performance of alternative predictors estimated using 10-fold cross-validation on different training sets.
3.3 Assessment of false positive rates for alternative TM β-strand segment predictors.
4.1 The effect of different representations and regression models on the RLA prediction accuracies.
4.2 Performance of the final (consensus) SVR model on the control set of 16 non-redundant membrane proteins.

Chapter 1. Introduction

Massive amounts of genomic and other data pertaining to biological systems and processes are being generated, as illustrated by the Human Genome Project and related efforts [1]. The availability of these data creates both an opportunity and a challenge for computational scientists: to develop algorithms and tools that improve our understanding of molecular processes, with the ultimate goal of facilitating progress in medicine [2]. One successful approach is to use statistical learning and data mining techniques to extract general rules and patterns from massive data sets [3]. In particular, the availability of efficient computational implementations of statistical and machine learning methods makes them suitable for high-throughput data mining [3].

The role of statistical learning methods in genomics is best highlighted by the growing number of studies that apply such methods to molecular systems [3]. In general, machine learning approaches can be divided into two classes, the so-called supervised and unsupervised learning, both of which are used in this dissertation. The difference between these two approaches stems from the availability (or lack thereof) of labeled data for which we know the outcomes, e.g., structural attributes of a protein that we are trying to predict or a clinical phenotype related to a given gene expression profile [3]. Thus, in supervised learning the goal is to adjust model parameters so that the outputs correspond more closely to the outcomes imposed in training [3]. In unsupervised learning, on the other hand, the goal is to discover inherent structures and patterns in the data [3].

In this dissertation, our goals are to develop new methods to extract complex patterns from genomic data and subsequently to use them for prediction. We capitalize on machine and statistical learning approaches to study two classes of biological systems and problems inspired by them. The first problem deals with the prediction of a number of structural properties of membrane proteins, including the location of transmembrane (TM) domains and the lipid exposure levels of residues inside the TM domains. The other problem involves the identification of genes with similar expression patterns in biological samples. From a statistical learning perspective, these two problems have essentially common roots; in other words, similar statistical learning techniques can be applied to both. This will be demonstrated in the following chapters of this dissertation.

1.1 Membrane Proteins: Structure and Function

In order to provide the context for applications of statistical learning approaches to the problem of membrane protein prediction, we introduce the basic concepts pertaining to membrane proteins and the motivation for our approach. The impermeable, oily lipid bilayer encloses cells and maintains the essential difference between the cytosol and the extracellular environment. Membrane proteins, which reside in the lipid membrane and control chemical processes across the membrane, are critical to cell survival and function [4]. Membrane proteins are involved in many diverse cellular events, such as cell-cell signaling, ion and small-molecule transport, and recognition [5].

Figure 1.1 illustrates some important functions of membrane proteins. For example, ion pumps are gated pores that transport ions through the membrane against electrochemical gradients by using the energy of adenosine triphosphate (ATP) hydrolysis [6]. As opposed to ion pumps, ion channels transport ions down their electrochemical gradients [6]. Ion transporters combine features of pumps and channels: one ion species moving along its electrochemical gradient powers the movement of another ion species against its electrochemical gradient [6]. Membrane protein receptors are involved in cellular communication and signal transduction; by binding to a signaling molecule, or a pair of them on each side of the membrane, they initiate a response on the other side [4].

Some enzymes are also integral membrane proteins; for example, the well-known ATPases are a class of enzymes catalyzing the decomposition of ATP into adenosine diphosphate and a phosphate ion [4]. The energy released by this process is used as fuel by all forms of life [4].

Understanding the functions of membrane proteins is of interest not only in academia but also in pharmacology [7]. For example, it has been estimated that more than 50% of drugs under development target G protein-coupled receptors, a large superfamily of membrane proteins [7].

Figure 1.1 Functions of membrane proteins: ion pumps, ion channels, ion transporters, receptors, and enzymes.

Membrane proteins function in the context of bilayer lipid membranes. They are associated with the bilayer lipid membrane in various ways, as shown in Figure 1.2 [4]. In cases 1 and 2, the protein has mass on both sides of the cell membrane; the only difference between the two cases is that the protein in case 1 has a single TM domain while the protein in case 2 has multiple TM domains. The vast majority of membrane proteins whose structures have been experimentally resolved belong to these two cases; that is, they have at least one TM domain with mass on both sides of the membrane [4,7]. In this dissertation, the term membrane proteins refers to proteins of the type shown in case 1 or 2. In cases 3 and 4, the protein extends one of its termini into the membrane via a segment of fatty lipid; the protein is thus attached to the membrane with all of its mass on one side. Such proteins are called anchored membrane proteins [4]. In the last two cases, the highlighted proteins associate with TM proteins by binding to them from either the outside or the inside of the cell membrane, and are called peripheral membrane proteins [4].

Figure 1.2 Various ways in which membrane proteins associate with the bilayer lipid membrane. This figure is taken from [4].

From a structural perspective, membrane proteins are divided into two classes: alpha-helical and beta-barrel membrane proteins. The difference lies in the secondary structure of the segments spanning the bilayer membrane, which are exclusively alpha helices in the former and beta barrels in the latter [5, 8]. For example, Figure 1.3 shows two membrane proteins with experimentally resolved 3-D structures together with their lipid membranes: an alpha-helical membrane protein in the left panel and a beta-barrel membrane protein in the right panel. Because of this difference in the membrane-spanning segments, it is advantageous to develop separate methods to predict TM domains for the two types of membrane proteins. We will discuss this in detail in chapters 2 and 3.

Figure 1.3 Classes of membrane proteins. In the left panel, the structure of an alpha-helical membrane protein is shown; a beta-barrel membrane protein is shown on the right. Gray planes indicate the position of the cell membrane as determined by the Protein Data Bank of Transmembrane Proteins (PDBTM) [9]. This figure is taken from the PDBTM website (http://pdbtm.enzim.hu/).

A high-resolution 3-D atomic structure is often the key to understanding membrane protein function, yet solving the 3-D structures of membrane proteins remains a challenge in experimental biology. Although the number of experimentally determined atomic structures for membrane proteins continues to increase, as of May 2006 only 1.6% of the structures [9] in the Protein Data Bank (PDB) [10] were membrane protein structures. To be precise, there were only 88 unique membrane protein structures deposited in the PDB [11]; of these, 73 were alpha-helical and 15 were beta-barrel proteins [http://blanco.biomol.uci.edu/mpex/, 11]. On the other hand, membrane proteins are estimated to account for about 20% to 30% of all proteins in the sequenced genomes [11, 12]. Therefore, computational methods have become significant tools for facilitating the recognition and prediction of membrane proteins and their properties [11].

Our first goal is to develop a method to predict TM domains given only the sequence information of membrane proteins. The details regarding the prediction of TM domains in alpha-helical and beta-barrel membrane proteins will be discussed in chapters 2 and 3, respectively. Our second goal is to develop a method to predict relative lipid accessibility in alpha-helical membrane proteins, again based only on sequence information. We will discuss the details of predicting relative lipid accessibility in chapter 4.

1.2 Microarray Experiments and Gene Expression Profiling

We introduce a simplified description of the most relevant aspects of genes, gene expression, and the experiments used to measure expression. A gene is a fragment of DNA that encodes the information needed to synthesize one or more proteins [4]. This process starts with the transcription of DNA into messenger RNA (mRNA) [4]. When the DNA double helix opens at certain segments in response to a command to express a gene, a precise copy of the information is transcribed into mRNA [13]. This process of mRNA production is called transcription. The mRNA is then deciphered into amino acids and the protein is synthesized from this sequence at the ribosome [4]. This process is called translation [4].

Almost every cell contains the entire set of genes that it may ever need [4]. Nonetheless, cells differ because of their different functions and the variety of proteins used to perform those functions [13]. A cell may need a larger amount of some proteins and a smaller amount of others; for example, the concentration of certain proteins in red blood cells is much higher than in hair cells [4]. Therefore, the genes in each cell may be expressed at very different levels [13]. A cell's expression profile consists of a set of numbers indicating the expression levels of all genes within the cell [13]. Genes with similar expression levels are generally assumed to be associated with common functions [1,4].

A microarray experiment is a high-throughput measurement of the cellular concentration of mRNA. In this experiment, mRNA molecules are extracted from tissues of interest and then reverse-transcribed into complementary DNA (cDNA) [1]. The resulting cDNA is labeled with fluorescent dyes of different colors [1]. In general, cDNA from a sample tissue of interest is fluorescently labeled with Cy5 (red) dye while cDNA from a reference sample is labeled with Cy3 (green) dye. Oligonucleotides, cDNA, or small fragments of the mRNA are spotted as probes on a glass slide [1]. A mixture of cDNAs with different fluorescent labels hybridizes with the probes in each spot. The glass slide or chip, which consists of several hundred thousand probes, is illuminated by a laser and the light intensity from each spot is measured. This intensity is proportional to the concentration of the cDNA (or, more precisely, mRNA) fragments in the investigated samples [1]. The ratio of Cy5 to Cy3 intensity is measured and then expressed on a pseudocolor scale based on the base-2 logarithm of the ratio. A table of log2 concentration ratios between samples and references is then reported [1]. A typical figure illustrating microarray experiments is shown below.

Figure 1.4 Schematic experimental procedure to measure gene expression using a microarray. In a microarray experiment, DNA or messenger RNA is isolated from biological samples of interest. The cDNA is then fluorescently labeled with different colors for the sample of interest and the reference sample. The differently labeled DNA is hybridized to the microarray and the microarray is scanned under laser light. cDNA: complementary DNA; mRNA: messenger RNA. This figure is taken from Wikipedia (http://en.wikipedia.org/).
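The log-ratio transformation described above is straightforward to compute; the following is a minimal sketch in Python, with hypothetical spot intensities:

```python
import math

def log2_ratio(cy5_intensity, cy3_intensity):
    """Log2 ratio of sample (Cy5) to reference (Cy3) intensity for one spot."""
    return math.log2(cy5_intensity / cy3_intensity)

# Hypothetical spot intensities: a gene expressed twice as highly in the
# sample as in the reference yields a log2 ratio of +1; two-fold
# under-expression yields -1.
print(log2_ratio(2000.0, 1000.0))  # 1.0
print(log2_ratio(500.0, 1000.0))   # -1.0
```

The symmetry of the log scale (equal fold-changes up and down map to +1 and -1) is the reason the log2 ratio, rather than the raw ratio, is reported.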

The array comparative genomic hybridization (CGH) experiment is a high-throughput measurement of DNA copy number alterations for thousands of genes simultaneously [14]. CGH can be performed on the same microarrays used for profiling gene expression. When performed in parallel, gene expression profiling and microarray-based CGH experiments allow for the identification of gene expression patterns together with the structural DNA changes that cause them. This experimental technique has been applied to various cancer studies, such as colon cancer and breast cancer [14, 15]. The additional information from DNA copy number alterations can be used to enhance the identification of gene expression patterns, which will be discussed in detail in chapter 5.

1.3 Classification and Regression in Structural Bioinformatics

In this section, we briefly introduce classification and regression in supervised learning. Suppose an $n$-dimensional input vector $\vec{x}_i$ represents the $i$-th instance of the data, and $X = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T)$ denotes all $T$ data points. On the other hand, $y_i$ denotes the property or target value of interest for data point $i$, and the outputs for all data points are denoted by $Y = (y_1, y_2, \ldots, y_T)$. The learning process deals with finding a mapping function [3], $f: \mathbb{R}^n \rightarrow \mathbb{R}$, such that

$f(\vec{x}_i) = y_i, \quad \forall i \in [1, T]. \quad (1)$

The outputs vary in general among the examples [3]. They can be quantitative measurements, which can be compared in relative terms [3]. The outputs can also be categorical or discrete, in which case they are labels rather than numbers [3]. The difference between classification and regression is that the outputs are discrete in classification but continuous in regression [3]. To some extent, the distinction is only a naming convention for the prediction task: regression for predicting quantitative outputs and classification for predicting discrete outputs [3]. We will see that they have many commonalities, in particular in the implementation of the learning in chapters 2, 3, and 4.

Due to the complexity of the problem, it is in general difficult to find the exact mapping function by solving equation (1). A variety of approximations are used to solve the equation by allowing for errors [3]. Let $\hat{y}_i = \hat{f}(\vec{x}_i)$ denote the approximation, with the predicted outputs for all data points denoted by $\hat{Y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T)$. Two typical errors are used as our measure of fit: the mean absolute error (MAE) and the mean squared error (MSE). They are commonly defined as [3]:

$\mathrm{MAE}(\hat{f}) = \frac{1}{T} \sum_{i=1}^{T} |\hat{y}_i - y_i| \quad (2)$

and

$\mathrm{MSE}(\hat{f}) = \frac{1}{T} \sum_{i=1}^{T} (\hat{y}_i - y_i)^2. \quad (3)$

Then, the approximate solution may be obtained by minimizing the errors defined by equation (2) or (3). Many methods exist that use different approximations for the function $\hat{f}$ and utilize different cost functions and optimization approaches. Here, we use support vector machines [16], artificial neural networks [17], linear discriminant analysis [18], and support vector regression [19], as described in detail in the following chapters of this dissertation. The adjustable parameters in these methods are optimized by minimizing a cost function, which is either one of the above error functions or one that additionally accounts for generalization [3]. The details of the tested methods will be discussed in the relevant chapters.
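Equations (2) and (3) translate directly into code; the following is a minimal sketch in Python (the toy predictions and targets are hypothetical):

```python
def mae(y_pred, y_true):
    """Mean absolute error, equation (2): average of |y_hat_i - y_i|."""
    T = len(y_true)
    return sum(abs(yp - yt) for yp, yt in zip(y_pred, y_true)) / T

def mse(y_pred, y_true):
    """Mean squared error, equation (3): average of (y_hat_i - y_i)^2."""
    T = len(y_true)
    return sum((yp - yt) ** 2 for yp, yt in zip(y_pred, y_true)) / T

y_true = [0.1, 0.5, 0.9]   # hypothetical target values
y_pred = [0.2, 0.4, 0.6]   # hypothetical model outputs
print(mae(y_pred, y_true))  # approximately 0.167
print(mse(y_pred, y_true))  # approximately 0.037
```

Note that MSE penalizes large deviations more heavily than MAE, which is one reason the choice of error function affects the fitted model.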

In chapters 2 and 3, we use classification to improve the prediction of protein structural properties from amino acid sequences. For example, in the prediction of TM domains, the outputs take binary values (0 or 1) representing two classes of residues, those outside or inside of the cell membrane; hence, the problem is cast as a classification problem. On the other hand, we use regression to predict relative lipid accessibility in chapter 4. Since accessibility is a real value, that problem can be cast in terms of regression. We will discuss the details of the strategy for implementing the classification and regression approaches in subsequent chapters.
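To make the classification setting concrete, the sketch below casts per-residue TM prediction as binary classification by thresholding the mean hydropathy in a sliding window. The window size, threshold, and toy sequence are illustrative only and do not reflect the actual predictors developed in chapters 2 and 3; the scale values are the standard Kyte-Doolittle hydropathy indices.

```python
# Kyte-Doolittle hydropathy indices for the 20 standard amino acids.
KD = {'I': 4.5, 'V': 4.2, 'L': 3.8, 'F': 2.8, 'C': 2.5, 'M': 1.9, 'A': 1.8,
      'G': -0.4, 'T': -0.7, 'S': -0.8, 'W': -0.9, 'Y': -1.3, 'P': -1.6,
      'H': -3.2, 'E': -3.5, 'Q': -3.5, 'D': -3.5, 'N': -3.5, 'K': -3.9,
      'R': -4.5}

def predict_tm(seq, window=7, threshold=1.5):
    """Label each residue 1 (membrane) or 0 (non-membrane) by thresholding
    the mean hydropathy of a window centered on the residue."""
    half = window // 2
    labels = []
    for i in range(len(seq)):
        ctx = seq[max(0, i - half): i + half + 1]  # truncated at the ends
        mean_h = sum(KD[r] for r in ctx) / len(ctx)
        labels.append(1 if mean_h > threshold else 0)
    return labels

# Hypothetical sequence: charged flanks around a hydrophobic stretch.
seq = "KKDE" + "LLIVALVIL" + "RKDE"
labels = predict_tm(seq)
# Residues inside the hydrophobic stretch are labeled 1, the flanks 0.
```

A real predictor replaces the fixed threshold with a trained model (e.g., a neural network) over a richer per-residue representation, but the input/output structure, a window of residue features mapped to a binary label, is the same.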

1.4 Unsupervised Learning and Clustering in Genomics

As opposed to supervised learning, in unsupervised learning class membership (or patterns) is not specified, and even the number of different classes generally needs to be established. From a statistical point of view, learning amounts to estimating the probability distribution of patterns given the observed data. That is, we model $P(\vec{\theta} \mid X)$, the probability distribution of a pattern $\vec{\theta}$ given the observed data $X$. In genomic expression experiments, it is of interest to biologists to cluster genes with similar expression [20]. A biological motivation for finding highly correlated genes is that they are good candidates, such as oncogenes, for reflecting biological processes of interest. Therefore, an unsupervised learning approach, in particular clustering, is suitable for such a problem.

Clustering can be implemented with various approaches, such as k-means, hierarchical clustering, or approaches formulated within a Bayesian inference framework [3, 20]. Here, we adopt the Bayesian inference approach because of its advantages over the alternatives. In chapter 5, we will compare and discuss the clustering results obtained with our approach and others, and conclude by listing the advantages of our approach.

Here, we briefly formulate the Bayesian inference approach. Suppose the inputs $X = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T)$ represent all $T$ data points and the model parameter $\vec{\theta}_i$ represents a cluster or pattern in the data. Clustering can then be formulated as finding the conditional probability distribution of $\vec{\theta}_i$ given the data $X$. From Bayes' theorem one has:

$P(\vec{\theta}_i \mid X) = \frac{P(X \mid \vec{\theta}_i) P(\vec{\theta}_i)}{P(X)} \quad (4)$

and

$P(X) = \int P(X \mid \vec{\theta}_i) P(\vec{\theta}_i) \, d\vec{\theta}_i. \quad (5)$

In equation (4), $P(\vec{\theta}_i \mid X)$ is the posterior probability distribution of the model parameter $\vec{\theta}_i$ given the data $X$, $P(\vec{\theta}_i)$ is the prior probability distribution of $\vec{\theta}_i$, and $P(X \mid \vec{\theta}_i)$ is the likelihood of $X$ given $\vec{\theta}_i$ [3].

The model parameter $\vec{\theta}_i$ can comprise a variety of random parameters, and in more complex cases these random parameters can themselves depend on other random parameters, in which case the approach is defined as a hierarchical Bayesian model [21]. For example, in gene clustering, $\vec{\theta}_i$ may be equal to $(\vec{C}, \alpha, \beta, \ldots)$, where $\vec{C} = (c_1, c_2, \ldots, c_T)$, $c_i$ is the cluster to which gene $i$ belongs, and $\alpha, \beta, \ldots$ are parameters drawn from possibly different priors. It should be stressed that the number of clusters can take any value between two extremes (1 through $T$). The minimum number of clusters is 1, which means all genes fall within the same cluster; in the other extreme, every gene forms a cluster containing only itself, and the number of clusters is then $T$ [22]. Therefore, the purpose of clustering is to find the conditional probability distribution of $\vec{C}$, $P(\vec{C} \mid X, \alpha, \beta, \ldots)$.

Unlike other approaches, in particular K-means, which is widely and successfully applied to many diverse problems, our approach requires no predefined number of clusters in the data [2,3,22]. The number and inner structure of the clusters are estimated by Bayesian inference given the data. This flexibility makes the approach superior to those in which the total number of possible clusters must be chosen arbitrarily in advance [22]. The details of building the model and applying this approach to clustering gene expression will be discussed in chapter 5.
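As a toy illustration of Bayesian cluster assignment via equation (4), the sketch below computes the posterior probability that a single one-dimensional expression value belongs to each of two hypothetical Gaussian clusters. This is only a sketch of the Bayes-theorem mechanics, not the context-specific hierarchical model used in chapter 5:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def cluster_posterior(x, clusters):
    """Posterior P(c | x) ∝ P(x | c) P(c) for each cluster c (Bayes' theorem)."""
    joint = {c: p["prior"] * gaussian_pdf(x, p["mu"], p["sigma"])
             for c, p in clusters.items()}
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {c: j / evidence for c, j in joint.items()}

# Two hypothetical clusters of log-expression values.
clusters = {"up":   {"mu": 1.0,  "sigma": 0.5, "prior": 0.5},
            "down": {"mu": -1.0, "sigma": 0.5, "prior": 0.5}}
post = cluster_posterior(0.9, clusters)
# An observation near mu = 1.0 is assigned to the "up" cluster
# with probability close to 1.
```

In a full Bayesian clustering model, the cluster parameters and the number of clusters are themselves random quantities integrated over via the priors, rather than being fixed as in this toy example.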

1.5 Summary

In the post-Human Genome Project era, biologists are often confronted with data that are complex not only in terms of the number of entities but also in the high dimensionality of each entity. Computational approaches, especially statistical and machine learning, have already been used to solve a variety of interesting biological problems.

After decades of effort, experimentalists have been able to resolve some membrane protein structures at high resolution. These structures are critical to understanding the causes of diseases, if we are to cure the diseases caused by alterations of these proteins. However, the structures of membrane proteins are difficult to resolve experimentally: so far, fewer than 100 unique membrane protein structures have been deposited in the PDB, compared to more than 30,000 experimentally resolved soluble protein structures. In this dissertation, we will focus on developing supervised learning methods to predict structural properties of membrane proteins from sequence alone, including transmembrane domains and relative lipid accessibility.

As microarray technology has matured, its use, and consequently the resulting data, have grown dramatically. For example, microarrays have been used to find oncogenes and tumor suppressor genes in cancers; hopefully, the identification of those genes will lead to the prevention and eventually the cure of cancers. We will develop unsupervised learning methods to cluster genes that respond to the experimental conditions.

The details will be discussed in the following four chapters. Chapters 2 and 3 will focus on the prediction of transmembrane domains for alpha-helical and beta-barrel membrane proteins, respectively. Chapter 4 deals with the prediction of relative lipid accessibility, a different structural property of membrane proteins. The last chapter will address the gene clustering approach.

Chapter 2. Transmembrane domain prediction in alpha-helical membrane proteins

Transmembrane domain prediction methods have been revisited by several groups. Based on these studies, it appears that the accuracy of existing methods has been significantly overestimated. Here, we propose an alternative approach to predicting TM domains in membrane proteins. In this approach, we introduce a compact representation of an amino acid and its local environment, which consists of the predicted accessibility and secondary structure of the amino acid residue of interest. Alternative representations were examined using a variety of machine learning techniques, including linear discriminant analysis and neural networks. In 10-fold cross-validated training on a non-redundant experimental dataset, our approach, based on the compact representation, outperformed other published methods that use multiple sequence alignment information. According to an external evaluation by the Transmembrane Helix Benchmark server, our final prediction for helical membrane proteins is competitive with state-of-the-art methods, achieving about 89% per-residue accuracy and 80% per-segment accuracy on a set of high-resolution structures from the Benchmark server. More importantly, the observed confusion rates between membrane proteins, signal peptides, and globular proteins are the lowest among the tested methods. By contrast, most of the publicly available methods share a well-known weakness: they falsely predict signal peptides and globular proteins as membrane proteins at high rates. These low confusion rates make our method a preferred candidate for high-throughput searches for membrane proteins in whole-genome data. We have developed a web server named MINNOU to predict transmembrane domains in membrane proteins. It is available on-line at http://minnou.cchmc.org.

2.1 Introduction

While genes coding for membrane proteins (MPs) constitute up to an estimated 30% of sequenced genomes, experimentally solving MP structures has remained a challenge, and fewer than 100 unique structures have been determined at high resolution. Therefore, computational prediction of membrane proteins has become an important tool for the annotation of sequenced genomes and for studies of membrane proteins.

Many methods have been developed for predicting transmembrane domains [7]. The majority of these methods rely primarily on hydropathy, observed preferences of amino acids for membrane proteins, and other physico-chemical properties of amino acids in order to identify sufficiently long stretches of mostly hydrophobic residues that may coincide with TM helices. Successful examples of such methods include SOSUI [23], TopPred [24], and TMpred [25]. Another very successful approach to membrane protein prediction is based on hidden Markov models (HMMs), which allow one to model the global topology of membrane domains rather than simply identifying local membrane segments. Highly successful examples of such methods include TMHMM [26], Phobius [27], and HMMTOP [28, 29]. Either individual sequences or evolutionary profiles of protein families, as encoded by their multiple alignments (MAs), may be used in this approach (see e.g., [30]). An alternative group of methods uses advanced machine learning techniques, such as neural networks (NNs), in order to map the single sequence-based or evolutionary information about an amino acid residue and its context into a classification (e.g., membrane vs. non-membrane). The NN-based PHDhtm method [31], which incorporates evolutionary profiles, is a particularly successful example of such an approach.
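The hydropathy-analysis idea underlying methods such as SOSUI, TopPred, and TMpred can be sketched as follows. This is an illustrative toy version only; the window length and cutoff below are common textbook choices, not parameters taken from any of the cited methods:

```python
# Average the Kyte-Doolittle (KD) hydropathy score over a sliding window and
# flag windows above a threshold as candidate TM segments. Real methods add
# topology rules and amino acid preference statistics on top of this idea.

KD = {  # standard Kyte-Doolittle hydropathy scale
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean KD score for each full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy exceeds the threshold."""
    return [i for i, h in enumerate(hydropathy_profile(seq, window))
            if h > threshold]
```

A stretch of 19 leucines (mean KD score 3.8) is flagged as a candidate TM window, while a lysine-rich stretch is not.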

An independent assessment of state-of-the-art TM helix prediction methods, referred to as the TMH Benchmark, was recently performed in the Rost Lab [32]. The top performing methods (both in terms of sensitivity and specificity), such as PHDhtm or TMHMM, achieved per-residue accuracies of up to 80% and per-segment accuracies of up to 84%, with estimated rates of false positive matches of 1% for globular proteins and 20-30% for signal peptides. Other recent studies also pointed out the overall low accuracy of TM domain prediction (see e.g. [7,33,34]). At the same time, published estimates and benchmark results [7,30,32] suggest that methods that employ evolutionary information appear to be more accurate than methods based on information derived from a single sequence.

While similar concepts apply to the prediction of beta-barrel membrane proteins as well, this problem appears to be more difficult due to the weakly hydrophobic nature of membrane-spanning beta-strands. Another fundamental limitation comes from the very limited number of known prototypes available for training. Nevertheless, a number of methods for the prediction of beta-membrane proteins, based on both single sequences and evolutionary profiles, have been proposed and are being used to annotate newly sequenced genomes (see e.g., [35-37]).

The prediction methods considered here represent an amino acid residue and its environment using a sliding window of a certain length. Typically, in MA-based prediction methods, each residue in a sliding window is represented by a column of 20 substitution scores from a position-specific scoring matrix (PSSM) obtained using Psi-BLAST [38]. The PSSMs generated in this way effectively represent the frequencies of different amino acid substitutions at a given position in a protein family. Consequently, an MA-based representation may require several hundred attributes to describe a residue (e.g., 420 numerical values when using a sliding window of length 21). In conjunction with certain machine learning techniques, such as NNs, this may lead to thousands of free parameters to be optimized. The higher complexity of the model may, in turn, hinder our ability to train a successful prediction protocol with good generalization properties, especially if only a limited number of training examples is available. The latter is particularly relevant in the case of membrane protein prediction.
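For concreteness, the input sizes implied by the two representations can be checked directly (the 420-value example is from the text; the compact profile introduced later uses at most 5 numbers per residue):

```python
# Number of input attributes for a residue-centered sliding window:
# window length times features per residue.

def input_size(window_length, features_per_residue):
    return window_length * features_per_residue

assert input_size(21, 20) == 420   # MA-based: one 20-number PSSM column/residue
assert input_size(31, 5) == 155    # compact profile, longest window tested here
```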

We considered an alternative strategy for membrane protein prediction, based on a compact representation of an amino acid and its environment, which may be applied to improve the recognition of both helical and beta transmembrane domains. Instead of using evolutionary profiles directly, we propose to use prediction-based “structural profiles” consisting of the predicted relative solvent accessibility (RSA) and secondary structure (SS) of each residue. RSA measures the degree to which an amino acid is exposed to its solvent environment. From a simple secondary structural point of view, each residue can be in a helix, a beta-strand, or a coil. This initial prediction step may be viewed as an effective projection of the information encoded by the MA into a reduced representation defined by the predicted RSA/SS profiles.

Here, we used our recently introduced, accurate, real-valued RSA prediction method, which is available through the SABLE (Solvent AccessiBiLitiEs predictor) server [39]. The SABLE server, which was rigorously evaluated by the EVA meta-server for the evaluation of SS prediction servers [40], is also used to generate state-of-the-art SS predictions [41]. Each amino acid residue is thus represented by up to five numbers: the predicted real-valued RSA, the confidence of the RSA prediction, and the predicted probabilities of each of the three secondary structure classes (i.e., helix, beta-strand, and coil or other).

This compact representation stemmed from the observation that a method trained only on soluble proteins [39], in order to effectively identify residues buried in the hydrophobic core of globular proteins, may in fact also be used to indicate residues that are “buried” in the membrane. In addition, secondary structure prediction methods trained on soluble proteins achieve, perhaps surprisingly, a relatively high accuracy on membrane proteins (comparable to that for soluble proteins). The latter may provide interesting hints as to how membrane proteins fold and how stable their secondary structures are in an aqueous environment.

The accuracy of the new approach is estimated using both cross-validated training and the TMH Benchmark evaluation server [32]. Alternative representations and classification models are assessed using several different machine learning techniques, including Linear Discriminant Analysis (LDA) and Neural Networks. In particular, the MA-based representation is directly compared with the reduced representation proposed here.

2.2 Methods and Data

2.2.1 Data

In order to derive a non-redundant and representative set of membrane proteins with known experimentally resolved structures, we explored the MPtopo database [42], an excellent up-to-date collection of membrane proteins with experimentally verified topologies. From the June 2004 version of MPtopo, we obtained 167 sequence chains, consisting of 101 3D_helix chains, 38 1D_helix chains, and 28 3D_other chains. The TM domain length distribution among the three datasets is shown in Figure 2.1.

[Figure: histogram of the number of TM segments (0-70) versus TM segment length (0-50 residues) for the 3dhelix, 1dhelix, and 3dother sets.]

Figure 2.1 The distribution of lengths for TM segments included in the MPtopo [42] database of membrane proteins. 3D_helix and 1D_helix refer to helical membrane proteins with or without resolved 3D structure, respectively.

In MPtopo terminology, 3D_helix and 1D_helix contain helix-bundle proteins segregated according to the existence or absence of experimentally resolved 3-D structures. 3D_other includes β-barrel and monotopic membrane proteins with experimentally resolved 3-D structures. Thus, in the case of the 1D_helix set, only low resolution information derived from various experimental studies is available. Consequently, the delineation of TM segments is laden with additional uncertainty. It was also suggested that prediction methods may have been used to derive the actual boundaries of TM segments for some of the 1D_helix proteins [7]. As can be seen from the figure, the distribution of lengths for low resolution structures is indeed qualitatively different from that for the high resolution (3D_helix) set. In particular, a sharp peak is observed around the length of an alpha-helical TM segment (20-22 residues) for low resolution structures. This is in stark contrast to the much wider distribution for high resolution structures. Because of these differences and the higher uncertainty in the annotation of low resolution structures, we used only the 3D_helix set in training MINNOU. The 3D_other set included beta-barrel membrane proteins with known 3D structures and was characterized by much shorter membrane spanning segments.

As discussed in a previous section, the nature of the difference between alpha-helical and beta-barrel membrane proteins made it appropriate to maintain two separate datasets to develop and evaluate models of TM domain prediction for the two classes. Both datasets were refined as follows. We excluded the 1D_helix chains from the alpha-helical set because they lack 3D structures, leaving 101 sequence chains in the alpha-helical dataset. We removed monotopic membrane proteins from the beta-barrel set because they are not β-barrels, leaving 28 sequence chains in the beta-barrel dataset.

The non-redundant sequence dataset was obtained by the procedure discussed below. We used pairwise sequence alignment to identify homologous sequence chains. The BLAST program was used to remove sequences with an E-value smaller than 10^-10 [38]. We used the default settings for BLAST; specifically, the substitution matrix was BLOSUM62 [38]. Consequently, 88 membrane protein sequence chains remain in our non-redundant datasets: 73 sequence chains with 15,598 residues in the alpha-helical membrane protein set, and 15 sequence chains with 4,623 residues in the beta-barrel membrane protein set. These two training sets were used to design and evaluate the features discussed in the next section.

We also evaluated the model on 4 independent control sets, which were created for evaluating the SABLE predictor [41]. The 4 control sets were named S163, S135, S147, and S156, indicating their numbers of sequence chains; for example, S163 contains 163 sequence chains. The dominant majority of the proteins in the 4 control sets are globular: there are only 2 membrane protein sequences in each of the S135 and S147 sets, 1 in S156, and 5 in S163. The advantage of the 4 control sets is that they were not used for training models in SABLE, which makes them good candidates for evaluating our method's performance on globular proteins. In fact, fragments of the control sets that were falsely classified by models initially trained only on the two membrane protein training sets were later included in an extended training set, so as to reinforce our models' recognition of those patterns.

In addition, we also considered an extended dataset including falsely classified fragments of globular proteins and signal peptides. At first, we trained our models only on the training sets. Then, when we applied the trained models to the 4 control sets, we chose the 10 most severely mispredicted fragments of varying lengths and added them to the training set; these fragments contributed 1,539 residues. Finally, we explored the database used for the development of PrediSi [43] and collected 15 signal peptides, each 100 residues long, randomly selecting 5 signal peptides from each of three kingdoms: eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. The resulting extended training set, with 18,627 residues in total, consisted of membrane protein chains, globular protein fragments, and signal peptides.

2.2.2 10-fold cross-validation

We used a 10-fold cross-validation procedure to train and evaluate our models. First, the dataset was randomly split into 10 subsets. One subset was used as an evaluation set, and the others were used as the training set. The models were designed and trained on the training set and then evaluated on the evaluation set. We then repeated the process, choosing a different subset as the evaluation set, until each of the 10 subsets had been used as an evaluation set once. In the end, the averaged performance of the trained models was reported.
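The procedure can be sketched as follows. This is a generic sketch; `train_fn` and `eval_fn` are hypothetical placeholders for the actual model training and evaluation routines, and the split is done at the level of whole protein chains, as in the protein-based setup described later:

```python
# Generic 10-fold cross-validation over a list of protein chains.
import random

def ten_fold_cv(chains, train_fn, eval_fn, k=10, seed=0):
    """Return one evaluation score per fold."""
    chains = chains[:]
    random.Random(seed).shuffle(chains)
    folds = [chains[i::k] for i in range(k)]      # k roughly equal subsets
    scores = []
    for i in range(k):
        held_out = folds[i]                        # evaluation set
        training = [c for j, f in enumerate(folds) if j != i for c in f]
        model = train_fn(training)
        scores.append(eval_fn(model, held_out))
    return scores
```

With 100 chains, each fold trains on 90 chains and evaluates on the remaining 10.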

In order to compare the results of the linear and non-linear classifiers used in this study (both LDA and NN approaches), the LDA-based classifiers were trained using the Tooldiag package [18]. The results of LDA are briefly discussed in the next section, as they are used primarily as a reference to assess the non-linear models. Here, we focus on the setup of the NNs that were used to derive our final prediction system in a cross-validation study assessing the relative contributions of the multiple alignment, hydropathy, and the novel prediction-based “structural profiles”.

The architecture of all NNs used here is similar: a feed-forward topology with three layers (the input layer, one hidden layer, and the output layer) is used. The adjacent layers are fully connected, and the logistic activation function is used for the nodes in the hidden and output layers. The number of features used to represent each amino acid residue varies between one and six in our tests for models that do not use evolutionary profiles, and equals 20 for MA-based methods (see also Tables 2.2 and 2.3). For example, when five features per amino acid residue are used, the input consists of up to 155 numbers, which represent the amino acid residues in a sliding window of length up to 31 (the longest sliding window tested here). The two output nodes represent the classes of “membrane residues” and “non-membrane residues”, respectively. Each residue (input vector) is assigned to the class with the larger excitation of its output node.
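A minimal sketch of one such network's forward pass follows. The weights here are random placeholders rather than trained parameters (the actual networks were trained with Rprop in SNNS, as described below); the sketch only illustrates the topology and the larger-excitation classification rule:

```python
# One-hidden-layer feed-forward network with logistic activations and two
# output nodes ("membrane" vs "non-membrane"). The input concatenates the
# per-residue features over the sliding window (31 residues x 5 features).
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = logistic(x @ w_hidden + b_hidden)
    return logistic(hidden @ w_out + b_out)   # excitations of the 2 outputs

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 155, 12, 2
w1, b1 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)

x = rng.normal(size=n_in)                     # one residue's window features
out = forward(x, w1, b1, w2, b2)
label = "membrane" if out[0] > out[1] else "non-membrane"
```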

All the networks were trained using the Rprop [44] algorithm, as implemented in the SNNS package [45]. The order of training examples was random, and the number of training iterations (epochs) was set to 500 since no significant improvement in terms of the sum of squares error function was observed in a longer training for the networks considered here. For each of the representations and sliding windows, we trained a number of networks with a different number of nodes in the hidden layer that was varied between 8 and 18 (with an increment of 2). In the cross-validation study, which aims at assessing relative merits of different representations discussed here, multiple networks with a different number of nodes in the hidden layer were trained and evaluated.

Alternative representations imply different sizes of the input vectors and may require a different number of nodes in the hidden layer in order to achieve the best performance. Therefore, for a fair comparison between different representations, the results for networks with the number of hidden nodes that yielded the best generalization in each case are included in Table 2.2.

For helical membrane proteins, the cross-validation involves splitting the training set into ten subsets, each consisting of several protein chains with approximately equal numbers of residues. As in standard 10-fold cross-validation, the training is then performed 10 times, using different unions of nine subsets for training with the remaining subset as a validation set. The results on the control sets are averaged and are referred to as 10-fold cross-validation results (even though this is not strictly correct).

A protein-based (rather than residue-based) definition of the training and control sets makes it more likely that specific patterns resulting from distinct membrane domains are observed in a control set only. Thus, this approach is expected to yield a more realistic assessment of generalization to novel membrane proteins that have not yet been resolved structurally.

For beta-barrel membrane proteins, we used a leave-one-out (protein) procedure to estimate the accuracy of alternative representations.

In addition to the simple NN-based classifier developed and assessed in the cross-validation study, we also developed a multistage protocol for the enhanced prediction of transmembrane helices. (A similar system for beta-barrel membrane proteins is the subject of a future study.) For the final predictor, we do not consider the MA-based representation, which was shown in cross-validation to yield a lower accuracy than our compact prediction-based “structural profiles.” Because the inclusion of hydropathy led in our tests to high levels of confusion with globular proteins and signal peptides, we also excluded hydropathy from the representation of an amino acid residue. Consequently, each residue is initially represented by five numbers: the predicted real-valued RSA, the confidence of the RSA prediction, and the probabilities of each of the three secondary structures (as predicted by SABLE, http://sable.cchmc.org ). SABLE predictions are derived from the multiple alignment, hydropathy, and other attributes that are commonly used by other state-of-the-art methods; therefore, other accurate methods for RSA and SS prediction are expected to be useful in that regard as well. We also considered Psipred [46], another secondary structure predictor, and applied its predictions to TM domain prediction. As shown in Table 2.1, SABLE outperforms Psipred in secondary structure prediction for alpha-helical membrane proteins, but not for beta-barrel membrane proteins. Accordingly, when Psipred predictions were used to predict TM domains in alpha-helical membrane proteins, SABLE-based predictions proved more accurate (results not shown). Psipred remains, however, a candidate for the prediction of TM domains in beta-barrels; this investigation is left for future research.

Test set                          SABLE    PSIPRED
Alpha-helical: all                78.2%    76.6%
Alpha-helical: TM segments only   83.3%    80.6%
Beta-barrel: all                  68.7%    75.7%
Beta-barrel: TM segments only     64.9%    77.6%

Table 2.1 Per-residue classification accuracy (Q3) of secondary structure predictions for TM proteins obtained using the SABLE [41] and PSIPRED [46] servers. The results on non-redundant sets of 73 alpha-helical and 15 beta-barrel membrane proteins, respectively, are shown.

Following in the footsteps of other studies [47], we use a two-stage prediction system, with the second layer (structure-to-structure) NNs allowing one to “average” and smooth over the initial classification obtained using the first (sequence-to-structure) layer predictor. The architecture of the first and second layer NNs is similar to that used for the cross-validation study. Namely, a simple feed-forward topology with one hidden layer is employed.

In order to reduce the danger of overfitting and to achieve regularization as well as an improvement in accuracy, a consensus of twenty different networks was used to generate predictions at each stage. These different networks were trained on the different subsets of the training set that were also used for the cross-validation study. Multiple NNs were trained on each subset of the data, with different numbers of nodes in the hidden layer and different sizes of the sliding window. The number of nodes in the hidden layer was again varied between 8 and 18, and the size of the sliding window was varied between 11 and 31. The stopping criteria were defined in terms of improvement (or lack thereof) on the corresponding validation set. From the multiple networks trained on each subset of the data, the one providing the highest per-residue accuracy on the corresponding validation set was selected for the final consensus predictor. In that sense, the whole training set of 73 protein chains is used here to optimize the parameters of each of the networks.

However, using different subsets of the data provides a better sampling of local minima.

The first ten networks were trained on different subsets of the data using the representation with five attributes per amino acid residue, as described above. The other ten networks included in the consensus were trained with only four attributes per residue: the two nodes related to the relative solvent accessibility prediction were replaced by a single node representing a discretized RSA assignment. In other words, this binary node indicates whether a given residue is predicted to be buried or exposed (with a threshold of 25% RSA used to project the real-valued SABLE prediction into two classes).

Adding these additional networks to the consensus was found to somewhat improve the per-segment classification accuracy. The consensus predictor is based on simple majority voting. The source code for the MINNOU package, with the definitions of all the NNs used for the consensus prediction (including a detailed description of their topology and parameters), can be downloaded from ftp://ftp.chmcc.org/pdi/jmeller/minnou/ .
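The majority-voting step can be sketched as follows. The tie-breaking rule shown here (toward non-membrane) is an assumption for illustration; MINNOU's actual rule may differ:

```python
# Per-residue majority vote over the consensus networks.
# "M" = membrane, "-" = non-membrane.

def consensus(predictions):
    """predictions: list of equal-length prediction strings, one per network."""
    result = []
    for votes in zip(*predictions):
        m = votes.count("M")
        result.append("M" if m > len(votes) - m else "-")
    return "".join(result)
```

For example, `consensus(["MM-", "M--", "MM-"])` returns `"MM-"`.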

To further improve the performance of the first layer classifier, we introduced a second layer prediction system. The output of the first layer is used at this stage as input; in addition, the SABLE predictions used in the 1st layer are included in the input as well. After some experimentation, we found that the best results were obtained when the output of the first layer system was combined in the input of the 2nd layer as follows. For each residue, two additional nodes are used to represent the averaged (over the networks included in the consensus) excitations of the two output nodes corresponding to the two classes. Another pair of nodes is used to represent the maximum (again over the different networks) excitation for each class. Thus, the number of nodes per residue in the 2nd layer is either nine or eight, depending on the number of attributes used in the first layer. The choice of optimal topology, the other settings for the neural networks, and the training procedures follow those for the 1st layer (see also the MINNOU package for further details).
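Assembling the 2nd-layer input for a single residue might then look as follows (a sketch under the five-attribute first-layer variant; the function name is hypothetical):

```python
# Per-residue 2nd-layer features: 5 SABLE-derived numbers plus the averaged
# and maximum excitations of the two output classes over the first-layer
# consensus networks (5 + 2 + 2 = 9 nodes per residue).
import numpy as np

def second_layer_features(sable_features, first_layer_outputs):
    """
    sable_features: array of shape (5,) for this residue.
    first_layer_outputs: array of shape (n_networks, 2), the output-node
    excitations produced by each first-layer network.
    """
    avg = first_layer_outputs.mean(axis=0)   # 2 averaged excitations
    mx = first_layer_outputs.max(axis=0)     # 2 maximum excitations
    return np.concatenate([sable_features, avg, mx])

feats = second_layer_features(np.zeros(5), np.array([[0.9, 0.1], [0.7, 0.3]]))
assert feats.shape == (9,)
```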

While the second layer NNs led to significant smoothing of the prediction and improved the overall accuracy in terms of both sensitivity and specificity, some overly long or short helices were still occasionally predicted. We estimated the probability density distribution for the length of TM helices and used it as a guideline in the design of a filter, applied to the second layer prediction in order to avoid such unphysical predictions. Similar filters have been used before by other groups [47]. Basically, the final filter is applied to either split predicted TM helices that are too long or delete ones that are too short, lengths which are observed with very low frequency in the known sample of helical TM domains. The presence of relatively short (e.g., horizontally oriented) membrane-embedded helices, as well as relatively long (“skewed”) helices that occur in some ion channels, influenced the choice of the length thresholds applied here.

Specifically, if only a single membrane segment is predicted in a protein, then it is deleted if its length is shorter than 14 residues; if more than one membrane segment is predicted, then a segment is deleted if it is shorter than 8 residues. On the other hand, if there is a continuous membrane segment of length greater than 44, it is split into two segments in the middle by introducing an artificial loop of length one. If the predicted segment is longer than 66 residues (the latter happened seven times in the set of 2,247 segments predicted for the TMH Benchmark test), then it is split into three segments. Even though this final post-processing step does not significantly affect the per-residue accuracy, it does help to “smooth” the predictions and to improve the per-segment accuracy, Q_OK, as defined in [7] and [32]. It also helps to further reduce the observed level of confusion with signal peptides and globular proteins by filtering out unphysically short TM helices (see Table 2.4 and the next section for further discussion). However, this final prediction stage is likely to be improved further by explicitly considering the overall topology of membrane proteins and the characteristics of loops connecting TM segments. This is a subject of future work.
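The filtering rules above can be sketched directly. This is a hypothetical implementation of the stated thresholds; the exact split positions used by MINNOU may differ:

```python
# Length filter for predicted TM segments, given as (start, end) index
# pairs with end inclusive. Thresholds follow the text: delete a lone
# segment shorter than 14 (or any segment shorter than 8 when several are
# predicted); split segments longer than 44 into two, and longer than 66
# into three, separated by artificial 1-residue loops.

def split_segment(start, end, parts):
    """Split into `parts` pieces separated by 1-residue artificial loops."""
    n = end - start + 1
    size = (n - (parts - 1)) // parts
    pieces, s = [], start
    for i in range(parts):
        e = end if i == parts - 1 else s + size - 1
        pieces.append((s, e))
        s = e + 2              # skip one residue: the artificial loop
    return pieces

def length_filter(segments):
    min_len = 14 if len(segments) == 1 else 8
    kept = [(s, e) for s, e in segments if e - s + 1 >= min_len]
    out = []
    for s, e in kept:
        n = e - s + 1
        parts = 3 if n > 66 else 2 if n > 44 else 1
        out.extend(split_segment(s, e, parts) if parts > 1 else [(s, e)])
    return out
```

For example, a lone 10-residue segment is deleted, while a 50-residue segment in a multi-segment prediction is split in two around a 1-residue loop.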


2.3 Results and Discussion

2.3.1 Cross-validation

In order to train and evaluate both LDA-based and NN-based classification systems, we used the non-redundant training sets of alpha-helical and beta-barrel membrane proteins described in the previous section, with 10-fold (or leave-one-out) cross-validation. In Table 2.2 we compare the results obtained using NNs and different sliding windows for the two alternative representations considered here. The novel compact representation is estimated to achieve per-residue classification accuracies of 88% and 78% and correlation coefficients of 0.73 and 0.53 for the prediction of transmembrane helices and beta membrane proteins, respectively. For comparison, the MA-based prediction achieves in cross-validation a per-residue classification accuracy of up to 87% and a correlation coefficient of 0.70 for alpha-helical proteins, and 74% and 0.42 for beta proteins, respectively.

                  Alpha-helical            Beta-barrel
Features          Q2 %       MCC           Q2 %       MCC
RSA+SS (11)       87.9±0.8   0.74±0.02     77.9±3.3   0.50±0.08
RSA+SS (21)       88.0±0.6   0.73±0.02     78.7±3.3   0.53±0.08
RSA+SS (31)       87.4±0.7   0.73±0.02     77.9±3.6   0.53±0.08
MA (11)           85.0±1.3   0.67±0.03     71.6±2.9   0.37±0.07
MA (21)           86.0±1.4   0.70±0.03     73.3±3.4   0.41±0.08
MA (31)           86.5±1.4   0.70±0.03     73.6±3.6   0.42±0.09

Table 2.2 Accuracy of transmembrane segment (domain) prediction using alternative representations, consisting of predicted RSA/SS profiles and MA-based evolutionary profiles, compared using cross-validation on non-redundant sets of alpha-helical and beta-barrel proteins for three different sizes of the sliding window (11, 21, and 31 residues). Averaged two-class per-residue classification accuracies (Q2), Matthews Correlation Coefficients (MCC) [48], and standard deviations are included.
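The Q2 and MCC figures of merit reported in the table are computed from per-residue confusion counts, e.g.:

```python
# Two-class per-residue accuracy (Q2) and Matthews correlation coefficient
# (MCC) from counts of true/false positives and negatives.
import math

def q2(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A perfect classifier gives Q2 = 1 and MCC = 1, while uninformative predictions give MCC = 0 even when Q2 is inflated by class imbalance, which is why both are reported.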

It is interesting to note that the differences in accuracy between the two alternative representations, as well as the error bars (as measured by standard deviations from cross-validated training), are significantly higher for beta-barrel proteins, reflecting the very limited number of training examples in this case and highlighting the problems with reliable parameter estimation. While the difference in accuracy is not as large for alpha-helical proteins, for which the number of training examples is significantly higher, the prediction-based structural profiles still yield improved accuracies and lower error bars in cross-validation, despite (or perhaps thanks to) the much simpler representation.

The error bars observed in cross-validation and the drop in accuracy between training and validation sets may also be used to assess the level of overfitting for both representations. In general, a higher variability on the control sets (e.g. 1.4% for MA- based vs. 0.7% for RSA-based representation in case of alpha-helical proteins) and higher accuracy in the training with respect to control sets indicate a higher level of overfitting.

In that regard, the classification accuracy in the training (as measured by the average accuracy on ten training sets used in cross-validation) is only about 3% higher than on control sets in case of the novel RSA-based representation, as opposed to about 5% difference in case of the more complex MA-based models. Given that some of the proteins included in our limited training and control sets are likely to share common characteristics (despite the lack of sequence homology), the above estimates of the level of overfitting are expected to be overly optimistic. Nevertheless, the relative differences between the two representations are clear and reinforce our proposition that the novel compact representation enables more reliable parameter estimation for prediction of transmembrane domains.

Note also that the differences in accuracy between the shorter and longer sliding windows considered in Table 2.2 are not statistically significant, even though the error bars tend to be somewhat higher for longer sliding windows. In principle, longer sliding windows imply more parameters to be estimated from the limited data and are more prone to overfitting. However, for a fair comparison between the two alternative representations, we report here the best results achieved for each size of the sliding window by selecting the optimal number of nodes in the hidden layer (see Section 2.2). In contrast, if the topology of the network is fixed, implying a monotone increase of the number of parameters with the size of the sliding window, then a decrease in accuracy is observed in cross-validation for windows longer than 20 residues. A significant drop in accuracy is also observed for very short windows. In that context, it is also important to realize that the distribution of lengths of TM segments is quite wide, with many “non-canonical” TM helices providing difficult-to-classify prototypes. Therefore, we decided to incorporate into the final consensus-based predictor networks that use different sliding windows, for further smoothing of the results.

Further results are summarized in Table 2.3. Several compact representations are compared, including the simple hydropathy scale based methods (with one attribute per amino acid in the sliding window), the RSA and SS predictions alone (with two and three attributes per residue, respectively), and the combined profiles. A sliding window of length 25 is used for this comparison since it was found to be optimal for the compact representations introduced here. For both alpha-helical and beta-barrel membrane proteins, the RSA and SS based structural profiles perform significantly better than a simple hydropathy-based method (with the KD scale [49] working somewhat better than the “biological” scale proposed recently [50]). Combining RSA and SS predictions with hydropathy profiles does not result in a further increase in accuracy. On the other hand, the RSA and SS predictions alone are sufficient to achieve performance close to that of the best combination of features.

Alpha-helical Beta-barrel Features Q2 % MCC Q2 % MCC HKD 84.2±1.5 0.67±0.02 67.8±2.0 0.32±0.04 HWW 82.6±1.5 0.63±0.02 67.3±2.6 0.31±0.04 SS 81.2±5.0 0.66±0.06 77.0±3.9 0.52±0.08 RSA 86.8±0.8 0.71±0.02 76.0±2.2 0.42±0.08 SS+RSA 88.2±0.6 0.74±0.02 77.4±3.7 0.53±0.08

SS+RSA+H KD 87.8±0.7 0.74±0.02 77.7±3.1 0.53±0.07 SS+RSA+H WW 88.0±0.8 0.74±0.01 77.4±3.3 0.53±0.07 Table 2.3 Accuracy of TM segment prediction using structural profiles including KD and WW hydropathy profiles [49,50], predicted RSA and SS, estimated using cross-validation on non-redundant sets of alpha- helical and beta-barrel proteins.

Even though we do not present a detailed analysis of the LDA results here, it is interesting to note that the same trends are observed in that case as well (with the overall level of classification accuracy lower by about 4%). The fact that cross-validated accuracy of the LDA based classification is significantly lower suggests that the more general, non-linear characteristics of NN-based classifiers play an important role in this case. On the other hand, however, the risk of overestimating the accuracy increases when using more complex models. From that point of view, the accuracy of the simple LDA model with its 125 free parameters (when using five attributes per residue and a sliding window of length 25) provides an additional support for the hypothesis that the compact representation proposed here, consisting of prediction-based structural profiles, is likely to contribute to a further increase in accuracy of membrane domain prediction methods.

We also hypothesize that the new representation may prove advantageous over explicit use of evolutionary profiles not only in the context of machine learning-based methods, as directly tested here, but also in the context of grammar-based methods. This is the subject of future work.

[Figure 2.2: plot of MCC (y-axis, 0.55 to 0.75) versus sliding window size (x-axis, 0 to 30).]

Figure 2.2 Dependence of prediction accuracy on the size of the sliding window using 10-fold cross- validation on the non-redundant training set of 73 alpha-helical membrane proteins.

The results of a simple NN with one hidden layer consisting of 10 nodes, trained for 1000 cycles with the learning parameter set to 0.1, are compared in terms of Matthews Correlation Coefficients (MCC). The fixed topology and meta-parameters of the networks allow one to compare the results directly and assess the extent of overfitting due to increased window size. Each amino acid residue in the sliding window is represented here by five numbers using the prediction-based structural profiles described in the text.

Thus, increasing the size of the sliding window from 11 to 21 increases the number of weights to be optimized from 550 to 1050 (neglecting the weights for edges between the hidden layer and the two nodes in the output layer, as well as the parameters for the activation functions of the nodes in the hidden and output layers). As can be seen from the figure, the cross-validation estimates of the accuracy increase quickly at first, reach a plateau, and then gradually decrease with the growing size of the sliding window, pointing to overfitting for more complex models. It should be noted that by changing the topology of the network we were able to obtain somewhat better results (at the level of MCC=0.74, as reported in Tables 2.1 and 2.2 for different sliding windows). We would also like to comment that while very short sliding windows seem capable of capturing the essential information about an amino acid residue and its environment relatively well (since few residues predicted to be fully buried and in alpha helices are found in the loop domains of the TM proteins included in the training set), the overall quality of predictions based on such short windows is low, due to a high level of confusion with globular proteins and low per-segment accuracy.

The evaluations were based on 10-fold cross-validation. The training sets for alpha-helical and beta-barrel proteins consist of MPs only. The alpha-helical training set comprises 15598 residues in total, 6469 of which are in the 255 transmembrane segments. The beta-barrel training set comprises 4723 residues, 1922 of which are in the 161 transmembrane segments.

                     High resolution    Low resolution     cSP    cGP
                     Qok %    Q2 %      Qok %    Q2 %      %      %
1st layer (0)        66       88        32       83        78     27
1st layer (1)        61       88        40       85        54     12
1st layer (2)        66       89        24       82        64     23
2nd layer            66       88        39       84        27     8
2nd layer + filter   80       89        55       85        8      1

Table 2.4 Improvements due to the 1st layer, 2nd layer, and filter based predictors according to the TMH Benchmark evaluation. Per-segment (Qok) and per-residue (Q2) classification accuracies and confusion levels with signal peptides (cSP) and globular proteins (cGP) are given in per cent.

In Table 2.4, five different prediction systems are compared. The first two involve the first-layer predictions: training with the standard training set containing membrane proteins only, and training with the augmented training set that includes signal peptides and some false positives from the first iteration. The next two include the corresponding results for the second-layer predictors, which use the results of the first-layer classifiers as part of their input. The last row contains the results of the second-layer prediction with filtering of short, unphysical segments, which improves the prediction in some respects.

2.3.2 TMH Benchmark Assessment

In this section, we present the evaluation of our final two-stage NN-based prediction system for transmembrane helices. The new method will be referred to as "Membrane protein IdeNtificatioN withOUt explicit use of hydropathy profiles and alignments" (MINNOU). Using the TMH Benchmark server, we assess both the sensitivity and specificity of the new method (especially in terms of confusion with globular proteins and signal peptides that may be incorrectly predicted as having TM segments). At the same time, the performance of MINNOU is compared with that of other state-of-the-art prediction methods. Table 2.5 summarizes the results on a set of high resolution structures with well defined boundaries of transmembrane helices. The levels of confusion with globular proteins and signal peptides are also shown in Table 2.5, whereas Table 2.4 illustrates the contributions of the different steps in our multistage protocol.

As can be seen from Table 2.5, MINNOU achieves the highest per-residue accuracy (89%) among the methods included in the TMH Benchmark evaluation. At the same time, its per-segment accuracy (80%) is worse than that of PHDhtm and HMMTOP2. It should be noted, however, that per-segment accuracy is very sensitive to falsely predicted short TM helices and to short non-membrane segments within true TM helices. This can also be seen from the effect of including a filter that removes some of these incorrectly predicted short segments without significantly affecting per-residue accuracy (see Section 2.4). Therefore, further improvement in that regard is likely to be achieved by optimizing this step.

We would also like to comment that several methods (including the two mentioned above) achieve a higher per-residue accuracy (89-90%, as opposed to 85% for MINNOU) on the set of low resolution structures included in the TMH Benchmark.

These structures were not included in our original training set because of the uncertain assignment of their membrane segments. However, for comparison we performed cross-validated training using both high and low resolution structures and observed a decreased accuracy on such a joint set (by about 2% in terms of classification accuracy and 0.03 in terms of correlation coefficient). Thus, the two sets of TM segments appear to have distinct characteristics. This is further highlighted by the much narrower (relative to high resolution structures) distribution of lengths of the TM segments derived from the low resolution structures. The fact that some of the prediction methods actually achieve higher accuracy on low resolution than on high resolution structures could indicate that prediction methods may have played a role in delineating TM segments in these low resolution structures, as suggested before in [7].

It should also be noted that, due to the very small number of structurally resolved membrane proteins, any benchmark is likely to use evaluation proteins which are homologous to those used by most of the evaluated methods (including MINNOU in the case of high resolution structures) for training. Therefore, the observed levels of accuracy are, in fact, based on variations of the training set and are unlikely to hold in the future.

Nevertheless, the TMH Benchmark is a very useful resource for independent (static) evaluation of the results and comparison between different methods. Moreover, the TMH Benchmark evaluation revealed significant levels of confusion with globular proteins and even higher levels of confusion with signal peptides. It is encouraging that the new method appears to be significantly better in that regard than any other method evaluated in [32], which is partly attributable to the use of an augmented training set. In the latter aspect, MINNOU is similar to the recently published Phobius method [27].

Method     Q2    Qok   cGP   cSP
MINNOU     89    80    1     8
PHDhtm     80    84    2     23
HMMTOP2    80    83    6     48
TMHMM1     80    71    1     34
DAS        72    79    16    97
TopPred2   77    75    10    82
SOSUI      75    71    1     61

Table 2.5 Assessment of TM helix prediction methods using the TMH Benchmark server. Per-segment (Qok) and per-residue (Q2) classification accuracies [7] and confusion levels with signal peptides (cSP) and globular proteins (cGP) are given in per cent.

Another interesting observation is the relatively higher accuracy of MINNOU predictions for ion channel proteins, which are characterized by the occurrence of both very long and relatively short membrane helices. For the seven ion channels included in the set of high resolution structures used for training, MINNOU achieved an average per-residue accuracy of 92.0% and a correlation coefficient of 0.81, as opposed to 86.4% and 0.68 for HMMTOP2 or 85.4% and 0.68 for DAS. For a newly solved ion channel structure, 1ots, which is homologous to one of the proteins included in the training set, the accuracy of the MINNOU prediction is also significantly higher than that of any other method (a correlation coefficient of 0.67, as opposed to 0.44 for the second best method). While these differences are not statistically significant and are clouded by the lack of truly independent test sets, we find these trends, and the ability of MINNOU to predict membrane segments in ion channels largely correctly without losing the ability to make relatively accurate predictions for other types of helical TM proteins, rather encouraging.

Method     1tn0_A   1vfp_A   1umx_L   1u7c_A   1xfh_A
HMMTOP2    75.6     88.3     81.9     79.0     70.0
           0.51     0.62     0.64     0.59     0.38
SOSUI      82.8     91.8     88.3     78.0     66.7
           0.64     0.74     0.76     0.56     0.34
TopPred2   80.0     88.5     91.8     85.2     63.1
           0.59     0.64     0.83     0.71     0.27
DAS        80.4     91.0     84.0     82.1     68.5
           0.62     0.70     0.66     0.66     0.33
MINNOU     84.8     91.4     77.6     75.6     65.5
           0.68     0.75     0.63     0.52     0.34

Table 2.6 Accuracy of different methods as measured by per-residue accuracy (first line in each row) and Matthews Correlation Coefficients (second line in each row) on a set of five helical membrane proteins not included in the training set (including two, 1u7c and 1xfh, that are not homologous to proteins included in the training).

Finally, in order to illustrate the performance of several top ranking methods on individual proteins, we used a set of five recently solved membrane proteins. The results are shown in Table 2.6. Three of these five proteins (including a bacterial rhodopsin structure, 1tn0, and a photosynthetic reaction center protein, 1umx) exhibit homology to those included in the training set and are merely used to show the variation in accuracy observed for all the methods on different proteins. It is interesting to note that none of these methods appears to be clearly better than the others, and all of them failed quite badly for one of the non-redundant new proteins, the glutamate transporter 1xfh, for which the highest correlation coefficient is only 0.38 (the MINNOU prediction for 1xfh is at the level of the other methods). One should note, however, that the assignment of TM segments for these newly solved proteins is based on a theoretical analysis of the structures using the method of [9] and is thus laden with additional uncertainty. Nevertheless, we believe that the limited accuracy of the top ranking methods included in Table 2.6 further underscores the need for continued development of improved methods for membrane domain prediction.

2.3.3 The Independent Control Sets

The four control sets were obtained from SABLE's documentation. Each control set consists of helical MPs and non-helical MPs, the latter being either globular proteins or beta-barrel MPs. All the helical membrane proteins are distinguished with little confusion with either globular proteins or beta-barrels. In Table 2.7, "Helical MP predicted" and "Non helical MP predicted" stand for the numbers of predicted helical MPs and non-helical MPs in each control set, respectively; "Helical MP observed" and "Non helical MP observed" are the numbers of actual helical MPs and non-helical MPs, respectively.

We would like to point out that we also performed an independent analysis of the specificity of our predictions. Using non-redundant sets of signal peptides and globular proteins, we found good agreement with the low estimates of confusion reported in Table 2.4. For example, on a set of 314 non-redundant soluble proteins with no homology to proteins used in the training of our SABLE prediction method (a subset of proteins used before in [41] to evaluate SABLE), only three proteins were falsely predicted to have membrane segments.

        Helical MP   Helical MP   Non helical MP   Non helical MP
        observed     predicted    observed         predicted
S156    1            1            155              155
S135    2            5            133              130
S163    4            6            159              157
S147    2            5            145              142

Table 2.7 Performance of MINNOU on the four control sets.

2.3.4 An Example

In Figure 2.3, we show two examples of predictions from MINNOU. The amino acid sequence is included in the first row, with the actual and predicted membrane segments highlighted using bold and yellow boxes, respectively. The secondary structures and relative solvent accessibilities predicted using the SABLE server are shown in the second and third rows. The alpha-helices are represented by red ribbons, the beta-strands by green arrows, and coils by blue lines. The level of predicted aqueous solvent exposure is represented by shaded boxes, with fully "buried" and fully exposed residues represented by black and white boxes, respectively. The secondary structures observed in the experimentally resolved structures are shown in the last row for comparison.


Figure 2.3 Examples of MINNOU transmembrane helix predictions for an ion channel (PDB code 1OTS) in panel A and a glutamate transporter protein (PDB code 1XFH) in panel B.

2.4 Conclusion

We proposed a novel representation of an amino acid residue and its environment for membrane protein prediction. The new approach does not explicitly use evolutionary profiles or hydropathy. Instead, it relies on prediction-based structural profiles, consisting of the predicted relative solvent accessibility and secondary structure of amino acid residues. In particular, the predicted level of aqueous solvent exposure, obtained from an accurate RSA prediction method trained on soluble proteins only [39], is used to identify segments of residues that are "buried" in the membrane in order to "avoid" contact with water.

In cross-validation with a simple, one-layer NN-based classifier, the new representation is estimated to yield an accuracy of 0.74 for TM helix and 0.53 for beta membrane prediction, as measured by the correlation coefficient between the predicted and observed classes. For comparison, the MA-based prediction is estimated to achieve a lower accuracy, with correlation coefficients of 0.70 for alpha-helical and 0.42 for beta proteins, respectively. The final protocol for TM helix prediction, based on a two-step NN-based classifier, is estimated by the TMH Benchmark server to achieve a per-residue accuracy of 89% (significantly higher than any of the methods evaluated in [32]) and a per-segment accuracy of 80%, with the lowest rates of confusion with globular proteins and signal peptides among the methods tested (and similar to the recently published Phobius method [27]).

Thus, using the new representation we were able to achieve accuracy competitive with other state-of-the-art methods for alpha-helical TM domains, as assessed by the TMH Benchmark server. High sensitivity of TM domain prediction is achieved with very low levels of confusion with globular proteins and signal peptides. Moreover, in our internal cross-validation tests, the new representation outperformed multiple alignment-based approaches for both alpha-helical and beta-barrel membrane proteins. Therefore, we conclude that applying predicted RSA and SS is likely to further contribute to the development of accurate methods for the prediction of protein membrane domains.

Chapter 3 Prediction of transmembrane domains in beta-barrel membrane proteins

β-barrel membrane proteins play an important physiological role: their specific topology enables pore formation and transport through cell membranes. Identification and characterization of β-barrel proteins from sequence is challenging, since only a limited number of structurally resolved membrane proteins of this type are available to train and validate prediction methods. We developed a novel method for the prediction of transmembrane domains and pore-facing residues in β-barrel membrane proteins. Our approach is based on a compact representation of an amino acid and its environment, which consists of the predicted solvent accessibility and secondary structure of each residue in a sliding window. Using cross-validated training on a set of 13 β-barrel membrane proteins, augmented to include fragments of globular proteins as negative examples, we demonstrate that this novel feature space enables accuracy competitive with that of state-of-the-art methods. Our neural network-based transmembrane domain predictor is estimated to yield a per-residue classification accuracy of 84.1% and a Matthews Correlation Coefficient (MCC) of 0.69. We also extended our effort to identify the pore-forming residues within the transmembrane segments. The new method is estimated to predict the pore-facing residues along the transmembrane strands with a per-residue accuracy of 89.5% and an MCC of 0.75. The new method is available at http://minnou.cchmc.org .

3.1 Introduction

Many methods have been developed in the past for the prediction (from amino acid sequence) of TM segments and the topology of helical membrane proteins. For a comprehensive overview of the state of the art in membrane protein prediction and the concepts underlying different approaches, the reader is referred to [7]. In contrast, very few methods have been developed to predict transmembrane segments and topology for β-barrel membrane proteins. Several alternative machine and statistical learning techniques have been employed to develop such methods, including Neural Networks (NNs) [51,52] and Hidden Markov Models (HMMs) [37,53]. While the reported accuracies and successful genome-wide applications [37] are encouraging, these methods are overall much less reliable than those for the prediction of alpha-helical transmembrane segments [7,11].

In this chapter, we introduce a novel approach to predict transmembrane (TM) domains and pore-facing residues in β-barrels. Our approach is based on a new compact representation of an amino acid and its environment, defined here by a sliding window of a certain length centered at the position of interest. Instead of commonly used physico-chemical properties of amino acid residues, such as hydrophobicity, or multiple alignment (MA)-based representations [7], each residue in a sliding window is described in terms of its predicted (from sequence) relative solvent accessibility (RSA) and secondary structure (SS). These prediction-based "structural profiles" were introduced recently and shown to improve upon state-of-the-art MA-based representations for helical TM proteins, also helping to reduce the number of false positives, i.e., the number of soluble proteins predicted as containing TM segments [11].

Here, we extend this strategy to β-barrel proteins, for which the problems of limited data from which to learn and of substantial levels of false positives are of even greater importance. We used several machine learning techniques, including Support Vector Machines (SVMs) and NNs, to carefully evaluate both linear and non-linear classifiers on the available data. We concluded that the new compact representation, in conjunction with NN-based classifiers and consensus predictors, enables reliable prediction of TM domains in β-barrel proteins that appears to be competitive with previously published and carefully evaluated methods, such as PROFtmb [37]. The new method also yielded very low confusion (false positive) rates between globular proteins and β-barrel membrane proteins. The chapter is organized as follows. In the next section, we discuss the training sets, learning approaches, and validation procedures used. In Section 3.3, we present the results of the new methods and a discussion, followed by conclusions.

3.2 Methods and Data

3.2.1 Data

In order to develop and test the prediction methods considered here, we used several different datasets. The first dataset consists of eight non-redundant β-barrel proteins (Protein Data Bank codes: 1fep, 1qjp, 1qj9, 1qd5, 1a0s, 1af6, and 1bt9), which were used before to develop the PROFtmb method [37]. This set was used to directly compare the new method with PROFtmb. In an attempt to further compare our method directly with those from the literature, we explored data sets used by [54] and [55] to develop their respective TM β-barrel protein predictors. Unfortunately, we found some inconsistencies in the definition of these sets. In particular, in the training set developed by [55], we found some protein chains that appear to be highly redundant (e.g., 1af6 and 2mpr), whereas the set developed by [54] appears to contain non-membrane proteins (e.g., 1nqw). In light of the very limited number of resolved structures, these inconsistencies may result in overly optimistic estimates of accuracy and invalidate a direct comparison with published results.

PDB ID   Protein name   Number of β-strands   Residues in TM seg./Total
1kmo     FecA           8                     247 / 741
1mal     Maltoporin     18                    222 / 421
1uun     Porin          2                     37 / 184
1p4t     NspA           8                     104 / 155
1qjp     OmpA           8                     89 / 325
1opf     OmpF           16                    191 / 340
1i78     OmpT           10                    120 / 297
1qj9     OmpX           8                     94 / 148
1k24     OpcA           10                    114 / 253
1qd5     OmpLA          12                    148 / 269
1prn     Porin          16                    176 / 289
1ek9     TolC           4                     48 / 471
1mm4     CrcA           8                     90 / 186

Table 3.1 The non-redundant dataset of 13 β-barrel proteins (S13 set). In addition to PDB codes and (shortened) protein names, the number of β-strands, the number of residues in TM segments, and the total number of residues in each protein, as given in MPtopo, are reported in the third and fourth columns.

In an effort to identify as many non-redundant TM β-barrel proteins as possible, we further explored the MPtopo database [42], which provides a comprehensive collection of membrane proteins with experimentally verified topologies. We used sequence alignment to exclude redundant protein chains. Specifically, we used the BLASTP program with default settings [38] to detect homologous sequences among the TM β-barrel proteins initially derived from MPtopo. In addition, several structures with non-typical topologies or uncertain annotations were excluded. As a result, we obtained a set of 13 non-redundant β-barrel proteins, with the BLASTP E-value between any pair of sequences in this set being larger than 10^-3. A detailed description of the proteins in this set (referred to as S13 throughout the text) is included in Table 3.1.
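The redundancy-removal step can be sketched as a greedy filter over precomputed pairwise E-values. The helper below is illustrative only: the `evalue` map and the toy E-values are hypothetical stand-ins for actual BLASTP output, not data from this work.

```python
def nonredundant(chains, evalue, cutoff=1e-3):
    """Greedy non-redundant selection: keep a chain only if its BLASTP
    E-value against every chain already kept exceeds the cutoff.

    `evalue` maps unordered PDB-code pairs (as frozensets) to E-values;
    in practice these would be precomputed with BLASTP.
    """
    kept = []
    for chain in chains:
        if all(evalue[frozenset((chain, k))] > cutoff for k in kept):
            kept.append(chain)
    return kept

# Toy E-values: 1af6 and 2mpr are treated as a redundant pair.
ev = {frozenset(p): e for p, e in [
    (("1af6", "2mpr"), 1e-40),
    (("1af6", "1kmo"), 5.0),
    (("2mpr", "1kmo"), 3.0),
]}
print(nonredundant(["1af6", "2mpr", "1kmo"], ev))  # ['1af6', '1kmo']
```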

For further evaluation of the method in terms of confusion with soluble (globular) proteins (which may be falsely predicted as containing TM β-strands and thus incorrectly identified as membrane proteins), we used several non-redundant sets (both internally and between them) of soluble proteins that were used to evaluate the SABLE method for RSA and SS prediction [39,41,56]. Two of these sets, denoted S149 and S156 in [39], were used to identify a number of negative examples (false positives obtained after the initial training on TM β-barrels only) with which to augment the S13 training set and retrain the method. This augmented set (referred to as S13g) includes 18 fragments (of varied length, ranging from 99 to 330 amino acid residues) of soluble proteins that were initially incorrectly predicted to have TM β-strands. Finally, we used two other non-redundant sets of proteins used for the evaluation of SABLE (denoted S135 and S163), which together include 298 soluble proteins, for the assessment of the final predictor in terms of false positive rates and confusion with globular proteins.

3.2.2 Amino Acid Representation

Prediction methods considered here use an amino acid sequence as input and assign to each amino acid residue a class (binary output): membrane vs. non-membrane residue in the case of TM segment prediction, or pore-facing vs. lipid-facing residue in the case of pore prediction. Since machine learning approaches typically require a vector representation of fixed dimensionality (defined by a number of features or attributes), a sliding window consisting of some number of residues around the residue of interest is commonly used to represent an amino acid's environment. We used this approach, varying the number of residues in the sliding window from 5 to 31 and assessing the results on validation sets.

Each amino acid residue in a sliding window is encoded by five features representing the predicted relative solvent accessibility, the confidence level of the RSA prediction, and the probabilities of the residue being in each of three secondary structure classes (helix, β-strand, and coil or other). These intermediate attributes are predicted from sequence using the SABLE method for RSA and SS prediction [39,41], which was shown to provide an accurate compact representation for the subsequent prediction of TM segments in helical membrane proteins [11]. In addition, for the prediction of pore-facing residues, we use another (sequence-independent) attribute of an amino acid, namely its hydrophobicity as defined by the KD hydropathy scale [49]. Thus, the total number of attributes used is five (or six) times the size of the sliding window.
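As a sketch of this encoding, the function below flattens a window of per-residue structural profiles into a single feature vector. The zero-padding at chain termini is our assumption; the text does not specify how window positions beyond the sequence ends are handled.

```python
def window_features(profile, center, window=11):
    """Flatten predicted structural profiles in a sliding window into one
    feature vector of length 5 * window.

    `profile` holds one 5-tuple per residue:
    (RSA, RSA confidence, P(helix), P(strand), P(coil)),
    values that would come from a predictor such as SABLE.
    Positions beyond the termini are zero-padded (an assumption).
    """
    half = window // 2
    feats = []
    for i in range(center - half, center + half + 1):
        if 0 <= i < len(profile):
            feats.extend(profile[i])
        else:
            feats.extend([0.0] * 5)
    return feats

# A 3-residue toy profile; the resulting vector has 5 * 11 entries.
toy = [(0.1, 0.9, 0.7, 0.1, 0.2)] * 3
v = window_features(toy, center=1, window=11)
print(len(v))  # 55
```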

3.2.3 Multistage Protocol for the Prediction of TM Domains

In this section we describe the learning algorithms, training protocols, and predictors that are evaluated in the Results section. Following established prediction methods (e.g., [47]) and guided by the assessment of intermediate predictors, we developed a two-level NN-based model to predict TM domains in β-barrel membrane proteins. The input for the first-level neural network is based on the compact representation described in the previous section. The second-level neural network uses the output of the first-level networks as additional features in order to "smooth over" the predictions of the first layer and improve the overall performance. Both first- and second-layer networks are trained and evaluated using cross-validation on both the S13 dataset of TM β-barrels only and the extended S13g set, which includes fragments from globular proteins as well. The networks trained on the latter (augmented) set proved to yield significantly lower rates of false positives and are therefore used for the final predictor.

The architecture of all neural networks used here is similar. A simple feed-forward three-layer network is used, consisting of an input layer, one hidden layer, and an output layer with two nodes. Since the size of the sliding window is varied from 5 to 31, the number of features, and thus the number of nodes in the input layer, varies from 25 to 155 for the compact representation with five features per residue used here. Several alternative networks, with the number of hidden nodes varying from 2 to 18, were trained and evaluated. The two nodes in the output layer correspond to the structural classes of the residue of interest, e.g., non-TM vs. TM domain. Binary encoding is imposed in training for the output nodes, e.g., [1,0] represents a residue in a TM domain, whereas [0,1] corresponds to a non-membrane residue. After some experimentation, we found that a threshold of 0.7 (as opposed to 0.5) on the normalized TM-class output to trigger the prediction of a membrane residue provides more balanced sensitivity and specificity. All the networks were trained using the Rprop algorithm [44] as implemented in the SNNS package [45]. Unless specified otherwise, training was stopped if there was no improvement on the corresponding validation set within 500 epochs. The learning parameter was fixed to 0.1 for all the networks.
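The thresholded decision rule can be sketched as follows. Normalizing the two output activations to sum to one is our assumption about what "normalized TM class output" denotes; the text does not spell out the normalization.

```python
def classify(tm_out, non_tm_out, threshold=0.7):
    """Decision rule sketched in the text: the two output-node
    activations are normalized to sum to one, and a residue is called
    transmembrane only when the normalized TM output exceeds the
    threshold (0.7 rather than 0.5, trading sensitivity for specificity).
    """
    total = tm_out + non_tm_out
    p_tm = tm_out / total if total > 0 else 0.0
    return "TM" if p_tm > threshold else "non-TM"

print(classify(0.9, 0.2))  # TM (0.9 / 1.1 ~ 0.82 > 0.7)
print(classify(0.6, 0.4))  # non-TM (0.6 > 0.5 but not > 0.7)
```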

The accuracy of alternative predictors is assessed using cross-validated training. Due to the limited number of data points from which to learn, leave-one-protein-out or an approximate ten-fold cross-validation is used. In the latter case, training and control subsets are still defined in terms of individual chains and contain approximately equal numbers of residues. In order to provide better generalization, the final consensus predictor consists of a simple majority vote of 10 second-layer networks trained on different subsets of the data in 10-fold cross-validation. Furthermore, an additional post-processing step is applied to filter out predicted TM β-strands that are shorter than five residues (and thus would most likely be false positives, as around ten residues are necessary for a TM-spanning strand). Also, since the smallest structural unit in β-barrel proteins is the β-hairpin [57,58], we ignore TM domains predicted as the only TM strand in a given protein. Similar filters were used previously to improve the prediction of TM segments in alpha-helical membrane proteins [11,47]. The effects of the filter and other arbitrary choices are illustrated in detail in the next section.

Since non-linear NN-based classifiers increase the risk of overfitting, even in the case of the relatively compact representations considered here, the contributions of different steps in the protocol were also assessed using simple linear Support Vector Machine (SVM)-based classifiers trained using the different training sets and representations described above. While the SVM approach, which we used before to evaluate the performance of linear predictors for RSA prediction [56], yielded accuracies somewhat lower than the NN-based predictors, we were able to confirm the overall trends and additionally validate the more involved non-linear classifiers. Three standard measures of accuracy are used to evaluate the different methods: per-residue classification accuracy (defined as the per cent of correctly predicted residues), the Matthews Correlation Coefficient (MCC) between predicted and true class labels [48], and the false positive rate, which reflects the level of confusion with soluble proteins incorrectly predicted as having TM segments.
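The MCC used throughout can be computed directly from the confusion-matrix counts; a minimal sketch for binary labels (1 = TM, 0 = non-TM):

```python
import math

def mcc(pred, true):
    """Matthews Correlation Coefficient between binary predicted and
    observed class labels. Returns 0.0 when any marginal is empty."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, true))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, true))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, true))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, true))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0 (perfect prediction)
```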

3.2.4 Prediction of Pore-facing Residues

As an extension of the TM segment prediction, we developed a new method to identify putative pore-facing residues in β-barrel membrane proteins. Following [59], we define the pore-facing residues using geometric considerations. Specifically, pore residues are identified as those whose Cα atom is closer than the Cβ atom to the central axis of the pore. For glycine residues, the Hα2 atom was used in place of the Cβ atom. The following β-barrel structures were used to derive a set of TM residues for cross-validated training: 1mm4, 1kmo, 1prn, 1qd6, 1k24, 1qj9, 1i78, 1opf, 1qjp, 1p4t, 1mal, and 7ahl. The central axis of each pore was found and visually inspected using the VMD program [60]. The center of mass of each protein was identified, helping to define a line perpendicular to the membrane plane and located inside the pore. The distances between Cα and Cβ and the central axis were also measured using VMD [60]. As mentioned before, for the pore prediction we used the same compact representation, extended by hydropathy profiles [49].
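The geometric rule can be sketched as follows; in practice the measurements were made with VMD, so this is only an illustration of the criterion (the axis point, axis direction, and atom coordinates are hypothetical inputs).

```python
import math

def dist_to_axis(point, axis_point, axis_dir):
    """Perpendicular distance from `point` to the pore axis, given a
    point on the axis and a unit direction vector along it."""
    v = [p - a for p, a in zip(point, axis_point)]
    t = sum(vi * di for vi, di in zip(v, axis_dir))
    perp = [vi - t * di for vi, di in zip(v, axis_dir)]
    return math.sqrt(sum(c * c for c in perp))

def is_pore_facing(ca, cb, axis_point, axis_dir):
    """Geometric rule from the text: a residue faces the pore when its
    C-alpha atom is closer to the central axis than its C-beta atom
    (with H-alpha2 substituting for C-beta in glycine)."""
    return dist_to_axis(ca, axis_point, axis_dir) < dist_to_axis(cb, axis_point, axis_dir)

# Axis along z through the origin; C-alpha inside, C-beta pointing outward.
axis_p, axis_d = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(is_pore_facing((1.0, 0.0, 5.0), (2.0, 0.0, 5.0), axis_p, axis_d))  # True
```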

3.3. Results and Discussion

3.3.1. Comparison with PROFtmb

In order to compare our approach with that of Bigelow and colleagues [37], we used the set of 8 non-redundant β-barrel TM proteins used originally in [37] to develop

59 the PROFtmb method. A simple NN-based predictor (without the second stage predictor used below to further improve the performance) was trained on this set for direct comparison of the results. NNs with a fixed architecture, consisting of 55 input nodes corresponding to a sliding window of 11 residues, 3 nodes in a single hidden layer and binary output layer, were trained on eight different subsets of the data as described in the previous section (using however a constant number of 1000 training epochs). The novel compact representation yields in the leave-one-out validation average per-residue classification accuracy of 86.0±1% and the MCC of 0.73±0.02, as compared with the classification accuracy of 83% and the MCC of 0.70 reported by [37]. We would like to stress that the number of data points that can be sued for training and validation is very limited, suggesting that the current estimates of the accuracy may not hold in the future.

However, we find that in relative terms the new approach is competitive with state-of-the-art methods for TM segment prediction in β-barrel proteins as represented here by PROFtmb.

3.3.2. Cross-validation for TM Domain Prediction

In this section, we discuss the results of a multistage NN-based predictor on our primary training set S13 and its augmented version S13g (see Section 3.4). The first stage, second stage, and second stage predictor with an additional filter are evaluated using (approximate) 10-fold cross-validation on both sets. The results are summarized in Table 3.2. As can be seen from the table, the overall performance of NN-based predictors on the S13 set is estimated to be similar to that observed on the set of eight proteins used in the previous section. At the same time, the performance of both first and second stage predictors is estimated to be significantly better on the S13 set of TM proteins only (MCC of 0.73 for the first stage predictor) compared to the S13g set (MCC of 0.62 for the first stage predictor), which is augmented with soluble proteins that are likely to be incorrectly predicted as having TM β strands.

                 S13                    S13g
                 Q2 [%]     MCC         Q2 [%]    MCC
    1st          86.7±2.0   0.73±0.04   80.0±3    0.62±0.03
    2nd          88.2±1.2   0.76±0.04   84.1±3    0.69±0.05
    2nd+Filter   87.9±2.0   0.75±0.05   82.6±3    0.63±0.06

Table 3.2 Performance of alternative predictors estimated using 10-fold cross-validation on different training sets (S13 consists of only β-barrel membrane proteins, while S13g is an extended set augmented with fragments of globular proteins initially predicted as containing TM β strands). Average per-residue classification accuracy (Q2) and MCC, as well as error bars (standard deviations), are given for the 1st level, 2nd level, and 2nd level with a filter predictors.

While the second stage predictors improve results significantly due to smoothing of the initial prediction, many relatively short putative TM segments are still predicted in soluble proteins (including soluble domains of TM proteins included in S13). Therefore, as discussed in Section 3.4, an additional filter was applied to remove unphysically short TM segments, resulting in significantly lower false positive rates. On the other hand, the use of a filter decreases the accuracy of the second stage predictor on TM proteins (MCC of 0.63) to the level of the first stage predictor without a filter. Since large-scale applications to annotate newly sequenced genomes require low rates of false positives (i.e., high specificity of predictions), this loss in sensitivity is a necessary price to pay at present. It remains to be seen if a better trade-off between specificity and sensitivity can be achieved when more training data is available, enabling a more reliable assessment of alternative models and the arbitrary choices involved. Finally, we would like to comment that some of the problems discussed here (e.g., many falsely predicted short TM segments) may be less pronounced when using an alternative to a sliding window-based approach that relies on a "grammatical" structure of TM protein sequences, as utilized in HMM-based methods that model the entire sequence and impose explicitly expected length distributions for TM segments [37]. Investigating the advantages of the new compact representation considered here in the context of methods with grammar is the subject of future work.
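The post-processing filter described above can be sketched as follows. The minimum segment length is a tunable parameter (the exact cutoff used in the dissertation is not specified here), and the two-letter label alphabet is illustrative:

```python
import re

def filter_short_segments(labels, min_len=6):
    """Remove predicted TM segments shorter than min_len residues.
    `labels` is a per-residue string over {'M' (TM), '-' (non-TM)};
    min_len is a tunable cutoff, not the dissertation's exact value."""
    out = list(labels)
    for m in re.finditer(r'M+', labels):
        if m.end() - m.start() < min_len:
            # relabel the whole spurious segment as non-TM
            out[m.start():m.end()] = '-' * (m.end() - m.start())
    return ''.join(out)
```

A two-residue "strand" is wiped out while a physically plausible seven-residue strand survives, which is exactly the specificity/sensitivity trade-off discussed in the text.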

3.3.3. Pore Prediction

The pore-facing residue prediction was assessed using leave-one-out cross-validated training on a set of TM segments derived from the set of 12 β-barrel membrane proteins described in the System and Methods section. A simple one-stage NN-based predictor, similar to those used for TM segment prediction, in conjunction with the compact representation of an amino acid in terms of predicted RSA and SS, as well as average hydrophobicity, yields an average per-residue classification accuracy of 89.5±2% and an MCC of 0.75±0.05 in leave-one-out cross-validation.

Figure 3.1 An example of TM segment prediction for a β-barrel membrane protein (PDB code 1P4T). Starting from the top row: the amino acid sequence, the actual secondary structures (with β-strands represented as green arrows) and relative solvent accessibility (with fully buried residues represented as black boxes), and the SABLE-predicted SS and RSA are shown in the subsequent rows, respectively. The actual TM domains are highlighted in yellow, whereas the predicted TM residues are shown in bold.

We would like to stress again that only TM residues are classified here as either pore-facing or lipid-facing. Thus, an additional uncertainty due to errors in prediction of the boundaries of TM segments needs to be considered for sequence-based annotations.

Figure 3.1 provides an example of TM segment prediction and illustrates possible errors, such as two TM β-strands being predicted as one long strand. Other types of errors include TM strands not being predicted as such and falsely identified TM segments in soluble proteins, a problem discussed in the next section.

3.3.4. Confusion with Soluble Proteins

In order to test the confusion with globular proteins (i.e., the rate of false positive predictions that lead to misclassification of soluble proteins as TM β-barrels), we used two control sets that were used before to evaluate SABLE RSA and SS predictions. Thus, these representative sets (consisting of a total of 298 proteins without significant sequence homology [39]) are non-redundant with the SABLE (used here to derive amino acid attributes from sequence for TM segment prediction) and S13 (as well as S13g) training sets. The results of two second stage consensus predictors (see System and Methods section) developed using the S13 and S13g sets are shown in Table 3.3.

    Method                         S135   S163
    Trained on S13    No filter     117    137
                      Filter         85     89
    Trained on S13g   No filter      46     54
                      Filter          1      4

Table 3.3 Assessment of false positive rates for alternative TM β-strand segment predictors. Methods trained on the S13 and S13g sets (without and with a filter that removes spuriously short predicted TM segments) are compared on two sets of non-redundant soluble proteins denoted as S135 and S163 (see text for details). The number of false positive matches (soluble proteins with at least one falsely predicted TM segment) is given in each case.

As can be seen from Table 3.3, the first predictor, which was trained on the original S13 set, predicts (without a filter) at least one TM segment in about 85% of soluble proteins. While the use of a filter that removes spuriously short TM β strands improves the results somewhat, false positive rates are still very high. Training an alternative predictor on the S13g set, which was augmented with examples of false positives observed (in other sets of soluble proteins without homology to sequences included in S135 and S163) with an initially trained predictor, improves the results significantly. However, only combining the improved predictor with an additional filter brings the false positive rate to a low level (five false positives in 298 proteins, i.e., a false positive rate of about 2%), which makes the new method applicable to genome-wide annotations.

3.4. Conclusion

We developed new methods for the prediction of TM domains and pore-facing residues in β-barrel membrane proteins. These methods are based on a compact representation of an amino acid and its environment, which involves prediction-based structural profiles consisting of predicted secondary structures and relative solvent accessibilities of amino acid residues. These intermediate attributes are predicted (from sequence) using the SABLE method, which was demonstrated to achieve state-of-the-art performance [39,41,56] and was shown to provide a basis for subsequent accurate prediction of TM segments in alpha-helical TM proteins [11]. Careful evaluation using cross-validated training on currently available structurally resolved membrane β-barrels suggests that the new representation and predictors are competitive with other state-of-the-art methods.

Our final NN-based predictor is estimated to yield a per-residue accuracy of about 83% and an MCC of about 0.63. At the same time, a relatively low false positive rate (of about 2%), which we were able to achieve due to the inclusion of negative examples in the training and further post-processing of predictions to remove short putative TM segments, makes the new method a candidate for large-scale applications in genomics.

The prediction of pore-facing residues is estimated to achieve a per-residue accuracy of about 90% and an MCC of about 0.75. While these estimates are likely to be revised as more data becomes available for training and validation, we conclude that the new method can already become a useful tool for structural and functional studies of membrane proteins. We also developed a website for predicting transmembrane domains and pore-facing residues in β-barrel membrane proteins, which is available at http://minnou.cchmc.org.

Chapter 4 Relative lipid accessibility prediction in membrane proteins

The level of surface exposure of an amino acid residue in a soluble protein can be reliably predicted from sequence and subsequently applied to facilitate overall structure predictions and functional annotations. We propose to extend these efforts to membrane proteins. In particular, we developed novel methods to predict the relative lipid accessibility (RLA) of an amino acid residue in a membrane domain, which represents the lipid exposed surface area of that residue in relative terms. In analogy to relative solvent accessibility prediction in soluble proteins, the problem of predicting RLA from the amino acid sequence can be cast as a regression problem and solved using machine learning techniques. The critical difference between soluble and membrane proteins, which makes the latter significantly more challenging, is the relatively small number of high resolution structures from which to learn. It is thus essential to carefully design and evaluate compact representations and low-complexity models for RLA prediction. In this work, we developed linear Support Vector Regression (SVR) approaches that are suitable for training on the limited number of structurally resolved membrane proteins. Moreover, we evaluate flexible SVR-based models to represent the uncertainty of RLA assignments for residues at the membrane-water interfaces. Using cross-validation on a non-redundant set of alpha-helical membrane domains as well as an independent control set, we estimate that our methods yield correlation coefficients between the observed and predicted RLAs of about 0.5 and mean absolute errors of about 17% RLA. When the real-valued predictions are discretized with a threshold of 15% RLA to define lipid exposed and buried residues, the two-state classification accuracy is about 68% (with a baseline of 53%). We conclude that the proposed RLA prediction methods show promise towards further applications to structure prediction and identification of membrane domain interactions.

4.1 Introduction

One characteristic of membrane proteins that can be computationally predicted from sequence with high accuracy is the location of transmembrane (TM) segments, especially in alpha helical domains. While some early estimates of the accuracy of TM helix prediction methods have been revised recently, top performing methods achieve per residue classification accuracies of about 80%, as measured by the TMH Benchmark server [31]. Moreover, recent studies suggest that further progress can be achieved by better incorporating evolutionary information [7] or by carefully designing and validating low-complexity models [11]. Given that TM segments can be reliably predicted, the next challenge lies in providing more detailed characterizations of structural and functional attributes of individual residues in the membrane.

In the present study, we propose novel methods for the prediction of an important structural attribute of amino acid residues in membrane proteins: relative lipid accessibility (RLA). We define the RLA of amino acid residue i as the ratio of its lipid exposed surface area observed in a given structure (LA_i) and the maximum achievable surface area for this amino acid (MSA_i):

    RLA_i = (LA_i / MSA_i) × 100 [%].

RLA_i can hence adopt values between 0% and 100%, with 0% corresponding to a fully buried and 100% to a fully lipid accessible residue.
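As a worked example of this definition, a residue whose lipid exposed area is exactly half of its maximum achievable area has RLA = 50%. A minimal helper (the function name and the clipping to the physical 0-100% range are our additions):

```python
def rla(lipid_exposed_area, max_area):
    """Relative lipid accessibility: RLA_i = (LA_i / MSA_i) * 100 [%],
    clipped to the physically meaningful 0-100% range."""
    return max(0.0, min(100.0, 100.0 * lipid_exposed_area / max_area))
```

In practice LA_i comes from a surface parameterization of the structure and MSA_i from tabulated maximum areas, as described in Section 4.2.2.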

Conceptually, RLA for membrane proteins is a natural extension of the commonly used notion of relative solvent accessibility (RSA) for soluble proteins [39,61]. However, the problem of RLA prediction is much more challenging than that of RSA. Most importantly, there is a limited number of high resolution membrane protein structures that provide detailed information about the lipid exposed surface area of individual residues. As a result, careful strategies are required when applying statistical and machine learning methods to this problem. Secondly, given a set of 3D coordinates and a protein surface parameterization, the exposed surface area of an amino acid residue is classified as either lipid exposed or water exposed depending on the location of that residue (in either membrane or soluble domains). This may lead to difficulties with RLA assignments for residues at the lipid-water interface that are transiently exposed to both environments. Other confounding factors include the presence of pores and channels, as well as the relative abundance of multimeric forms in membrane proteins [62-64].

On the other hand, there are many potential applications of RLA predictions. For example, in analogy to the RSA case, sufficiently accurate RLA estimates can facilitate folding simulations and overall structure prediction for membrane proteins. In particular, predicted RLAs can be used to bias the search in conformational space towards those models that are consistent with the predicted patterns of lipid accessibilities. RLA prediction can also be used in fold recognition, enabling the identification of the most compatible structural template. Additionally, it is likely that RLA prediction can be used to identify potential interactions between membrane proteins. (RSA predictions were found to yield fingerprints of protein-protein interaction sites for soluble proteins [65].)

To the best of our knowledge, a direct RLA prediction for individual residues in membrane proteins has not been considered before in the literature. Recently, new lipophilicity scales, such as TMLIP [66], were derived using statistical analysis of lipid propensities of amino acid residues in TM domains. These average lipophilicities were then used to classify entire TM helices as either buried or lipid exposed [67]. We expect that detailed predictions of the level of lipid exposure for individual amino acid residues would be more informative than average properties and thus provide more accurate characterizations of multispan alpha helical membrane proteins.

Here, we use several machine learning-based approaches in order to develop and test alternative RLA predictors for alpha helical membrane proteins. In particular, following our previous work on RSA prediction [39,56], we advocate the use of regression approaches (as opposed to classification models) to obtain real-valued estimates of RLA. However, in contrast to RSA prediction, the comparative scarcity of data from high-resolution structures of membrane proteins imposes limits on the complexity of prediction models. With this in mind, we specifically compare linear Support Vector Regression models with more involved non-linear predictors. The latter are indeed found to suffer from overfitting, as demonstrated in the Results section.

4.2 Materials and Methods

4.2.1 Training and Control Sets

We explored the MPtopo [42] and PDB_TM [9] membrane protein databases in order to derive non-redundant and representative sets of membrane proteins for training and validation of the proposed RLA predictors. While MPtopo provides a set of carefully annotated membrane proteins with experimentally verified topologies, the annotation process in PDB_TM is fully automated with the goal of providing a comprehensive and continuously updated collection of membrane proteins. It should be noted, however, that even for structurally resolved membrane proteins the identification of the membrane segment boundaries is laden with some uncertainty, and, as a result, the two databases are not fully consistent with each other [9].

We initially obtained 101 high resolution structures of helical membrane proteins from MPtopo. We subsequently used pairwise sequence alignments to identify homologous chains and removed redundant entries from the training set. Specifically, the BLAST program [38] was used to remove sequences resulting in matches with E-values smaller than 10^-10. As a result, a set of 72 non-redundant alpha-helical protein chains was obtained with a total of 6,307 residues located in membrane segments. The same set of chains (minus one protein, which was excluded due to missing side chain information) had previously been used to develop MINNOU, an accurate TM domain prediction method that utilizes a compact RSA-based representation of amino acid residues in TM segments [11].

We next removed individual residues for which no structural information about side chains was available. Finally, proteins that may form pores or channels needed to be handled properly. The HOLE program [68], followed by manual inspection, was used to detect channels and to identify residues that form the pore or channel inside the protein. These residues were subsequently removed, as they are considered water exposed rather than lipid exposed. This is a deliberate choice to remove potentially confounding cases from the training set, resulting finally in a set of 6,156 residues for cross-validated training. In addition, we also considered a further reduced subset of 4,635 residues obtained by removing from training those residues that were located at interaction interfaces between membrane domains (see the next subsection for details).

Figure 4.1 The proposed position specific error (PSE) function for ε-insensitive SVR models (represented in the figure by the red curve as a function of the position in the membrane), for the prediction of the relative lipid exposure (RLA) in membrane proteins. By allowing larger errors for residues at the lipid-water interface that are more difficult to predict, improved prediction accuracies can be achieved for the core residues that occupy the center of the membrane (where the error function is constant).

In addition, we created an independent control set using the PDB_TM database [9]. As of May 2006, there were 1,762 sequence chains in PDB_TM that were derived from the 3D structures deposited in PDB using an automated algorithm to determine TM domains. This initial set was reduced to a non-redundant control set as follows. First, entries with incomplete sequence information were excluded. Next, sequences with homology to those in the training set (defined using a more rigorous E-value threshold of 10^-3 for BLAST matches in order to exclude even relatively distant homologs) were removed. Thirdly, this set was further reduced by checking internal redundancy (with the same threshold) and also removing modeled 3D structures. Finally, by removing structures either with missing side chains or consisting of only short fragments (e.g., individual TM helices), which would not be sufficient for obtaining reliable multiple alignments, we arrived at a non-redundant set of 16 chains with known 3D structures and with a total of 1,482 residues in membrane segments (241 of them located at known interaction sites).

4.2.2 Feature Space and RLA Computation

Given the limited size of available training data, it is essential to carefully consider the complexity of the proposed prediction models for RLA. A direct consequence is that the representation that we choose to encode amino acid residues and their environment should involve a limited number of descriptors, especially if more complex prediction models such as those based on NNs are used. As part of this study, we specifically addressed the issue of the overall complexity of prediction models and the dimensionality of the feature space used to capture propensities to lipid exposure vs. burial inside the core of the protein. In particular, we compared multiple sequence alignment (MSA)-based representations with compact representations derived from our RSA predictions and hydropathy profiles in the context of the comparison between SVR and NN-based regression methods.

In the case of MSA-based representations, each amino acid residue is represented by a vector of 20 numbers (a column of the Position Specific Scoring Matrix (PSSM) at that position). In order to generate family profiles encoded in the form of PSSMs, we used the Psi-BLAST program [38] (version 2.6 with default options) and the NR database [69] (version as of May 2006, with 2,869,298 sequences). Three iterations of Psi-BLAST were performed to generate family profiles; no masking of low complexity regions or membrane domains was used. The local structural environment and evolutionary context of each residue is then characterized by a sliding window of n amino acids, with the residue of interest at the central position, implying 20 × n features for this representation. Different window sizes were tested. We note that sliding windows for residues at the edges of the alignment are created using short artificial N- and C-terminal peptide extensions that were added to the original query sequence, following our strategy for RSA prediction [39].
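The window construction above can be sketched as follows. For simplicity this sketch pads beyond the termini with zero columns, whereas the dissertation instead appends short artificial peptide extensions; the function name is ours.

```python
import numpy as np

def window_features(pssm, i, w=15):
    """Build the 20*w feature vector for residue i from an (L, 20) PSSM.
    Positions beyond the termini are padded with zero columns (a
    simplification of the artificial-extension strategy in the text)."""
    pssm = np.asarray(pssm, float)
    half = w // 2
    cols = []
    for j in range(i - half, i + half + 1):
        if 0 <= j < len(pssm):
            cols.append(pssm[j])
        else:
            cols.append(np.zeros(pssm.shape[1]))
    # concatenate the w columns into one flat 20*w vector
    return np.concatenate(cols)
```

With w = 15 this yields the 300-dimensional MSA representation referred to in Table 4.1.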

As an alternative, we also considered a more compact representation with each amino acid residue being represented by the predicted RSA and its associated confidence factor, as obtained using the SABLE server [39]. This representation was motivated by observed periodic patterns in the SABLE-predicted confidence factors that qualitatively seemed to follow the periodicity of TM helices. We put this hypothesis to the test by developing and assessing an RLA prediction method based on this simple representation. In addition, we also considered adding hydropathy and lipophilicity profiles, as derived from the TMLIP2H [66] and WW [50] scales.

As mentioned in Section 4.1, the concept of lipid surface exposure in membrane proteins is essentially analogous to that of water solvent accessibility for soluble proteins. For this study, and similarly to RSA computation, we computed actual RLAs using the parameterization of the protein surface implemented in the DSSP program [70]. Only residues from membrane domains were taken into account, as mentioned in Section 4.2.1. The exposed surface area of the amino acid residue was then normalized by the maximum obtainable value of the surface exposed area (as defined in the commonly referenced study by Chothia [71]) in order to arrive at a relative measure of lipid accessibility (RLA).

Many of the structures included in the training set are, in fact, protein complexes with interacting membrane domains. This poses a dilemma as to how to assign RLA to interfacial residues. It is a priori unclear if these residues are lipid exposed without the presence of the interacting partners. Therefore, we consider two alternatives: we first compute their RLAs derived from complexes; and secondly, we derive RLAs from individual structures, disregarding interacting chains. As we will see, global accuracy estimates suggest that using data derived from complexes results in somewhat better performance. Moreover, removing interacting residues altogether from the training set results in further improvements, indicating that these are indeed difficult cases.

4.2.3 Support Vector Regression Models

For what follows, let each amino acid in the training set be encoded by a vector a_i as described in Section 4.2.2, and let the corresponding true RLA value be denoted by y_i ∈ [0,1]. Support Vector Regression models can be seen as generalizations of Least Squares (LS) models in which an ε-insensitive penalty function is minimized instead of the sum of squared errors. The interested reader is referred, e.g., to the excellent monograph [3] for details. For the purposes of this chapter, we restrict ourselves to stating the overall optimization problem that is solved by SVRs:

    min   ||w||_p + C ||ξ||_1
    s.t.  |a_i^T w + β − y_i| − ξ_i ≤ ε   for all i.

Here C is an a priori penalty parameter, which balances the regression error term ||ξ||_1 ≡ Σ_i ξ_i and the normalization term ||w||_p ≡ (Σ_i w_i^p)^(1/p) (which corresponds to the margin maximization term for SVMs). Given a solution (ŵ, β̂), the predicted RLA for a new amino acid characterized by a_k is then given by ŷ_k = a_k^T ŵ + β̂. To compare and contrast with LS, note that the term |a_i^T w + β − y_i| in the constraints corresponds to the objective function for LS. SVRs penalize this deviation via the slack variable ξ_i only if it exceeds ε, an error insensitivity parameter that must be set by the user a priori. In addition, SVRs generally use the 1-norm ||ξ||_1 to penalize the regression error and are thus less outlier-sensitive, which again is especially advantageous in our particular application context with noisy data [56]. The norm parameter p in the objective function is typically chosen to be equal to 2, which implies that the SVR problem is a quadratic optimization problem. However, for the sake of computational efficiency, we opted for p = 1, since, in this case, the problem can be reformulated as a linear programming (LP) problem and solved with efficient large-scale linear optimization codes. We use a tailored implementation of the interior-point linear programming solver PCx [72] to solve our training problems on a standard Linux server. The implementation is similar to that described for a different context [19].
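The p = 1 reformulation can be illustrated with a small LP sketch. This is our illustration using SciPy's generic solver in place of the tailored PCx implementation; the variable splitting w = u − v is the standard trick that makes the 1-norm objective linear.

```python
import numpy as np
from scipy.optimize import linprog

def train_svr_lp(A, y, C=0.03, eps=0.1):
    """1-norm epsilon-insensitive linear SVR cast as a linear program:
    min ||w||_1 + C*sum(xi)  s.t.  |a_i.w + beta - y_i| <= eps + xi_i, xi >= 0.
    Variables are split into nonnegative parts so linprog's default
    x >= 0 bounds apply (a sketch, not the dissertation's PCx code)."""
    A = np.asarray(A, float)
    y = np.asarray(y, float)
    n, d = A.shape
    # variable layout: [u (d), v (d), beta+, beta-, xi (n)], all >= 0; w = u - v
    c = np.concatenate([np.ones(2 * d), np.zeros(2), C * np.ones(n)])
    I = np.eye(n)
    one = np.ones((n, 1))
    upper = np.hstack([A, -A, one, -one, -I])    #  (a.w + b - y) - xi <= eps
    lower = np.hstack([-A, A, -one, one, -I])    # -(a.w + b - y) - xi <= eps
    res = linprog(c, A_ub=np.vstack([upper, lower]),
                  b_ub=np.concatenate([eps + y, eps - y]), method="highs")
    x = res.x
    w = x[:d] - x[d:2 * d]
    beta = x[2 * d] - x[2 * d + 1]
    return w, beta

def predict(w, beta, A):
    """Predicted RLA for rows of A: y_hat = a.w + beta."""
    return np.asarray(A, float) @ w + beta
```

On noiseless linear toy data the LP recovers the smallest slope that keeps all residuals inside the ε tube, illustrating both the sparsity-inducing 1-norm and the ε-insensitive loss.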

While C and ε_i are typically chosen as fixed parameters, there is no real reason not to allow them to be functions of i. In our previous work, we were able to improve the results of RSA prediction by choosing higher error insensitivities for more exposed residues to reflect their higher flexibility [56]. Here, we modeled the dual nature and higher uncertainty of lipid exposure for residues at the lipid-water interface by choosing a larger ε_i compared to residues at the center of the membrane (see Figure 4.1). Such models were also expected to be less sensitive to errors in the recognition of the exact boundaries of TM segments. However, these additional degrees of freedom in the model also present the modeler with the additional problem of finding an "optimal" (in some sense) choice of meta-parameters ε_i and C_i, depending on the particular objective. We performed cross-validation experiments and used primarily several global accuracy measures, such as correlation coefficients between predicted and observed RLAs (see Section 4.3), to find appropriate values for the parameters C_i and ε_i (or ε).

We evaluated two particular SVR implementations to illustrate how modeling errors with physical insights may be attractive for the RLA prediction problem. For the first SVR predictor we chose constant error tolerances ε_i = 0.1, which will be referred to as CE-SVR for constant error SVR. For the second run we chose to emphasize accuracy in the central region of TM helices by increasing ε_i for residues closer to the interface with water. We experimented (albeit not exhaustively) with different position specific error insensitivities (referred to as PSE-SVR). The best results in our tests were obtained for the following choice of error function, which is consistent with the pictorial representation included in Figure 4.1: ε_i = 0.4 − 0.05 × k, where k ≤ 5 is the distance from the membrane boundary, and ε = 0.1 for all remaining (central) residues. The penalty parameter was set to C = 0.03, which proved to be optimal in cross-validation. We noted that the results were relatively insensitive to the particular choice of C, as long as it was above a certain threshold. We stress that these are somewhat arbitrary choices that follow physical intuition, and it remains to be seen if further improvement can be achieved once the parameters are fine-tuned for a particular optimization criterion. Finally, we would like to comment that for the final predictor (evaluated in Section 4.3.2) we used a simple consensus model, consisting of an arithmetic average of ten individual SVRs trained in cross-validation on different subsets of the data. This additional step provides some smoothing and a slight improvement in the accuracy on the independent control set.

    Representation [Num. of features]   NN          SVR
    RSA [30]                            0.34±0.02   0.36±0.04
    MSA [300]                           0.32±0.02   0.45±0.02
    MSA+WW [315]                        0.33±0.02   0.45±0.02
    MSA+TMLIP2H [315]                   0.32±0.02   0.46±0.02
    MSA+RSA [330]                       0.35±0.02   0.47±0.02
    MSA+RSA+WW [345]                    0.33±0.03   0.47±0.02
    MSA+RSA+TMLIP2H [345]               0.36±0.02   0.47±0.02

Table 4.1 The effect of different representations and regression models on the RLA prediction accuracies (the number of features in each model is given for a sliding window of length 15 used here). Cross-validation accuracies are given as CCs. Note that NNs clearly suffer from the small training set size and overfitting.
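The position-specific error schedule amounts to a one-line helper (assuming the constants read as ε_i = 0.4 − 0.05k near the boundary and ε = 0.1 in the core; the function name and the integer-distance convention are ours):

```python
def pse_epsilon(k, eps_core=0.1):
    """Position-specific error insensitivity for PSE-SVR:
    eps_i = 0.4 - 0.05*k for residues within 5 positions of the
    membrane boundary (k = distance from the boundary), eps_core
    (0.1, matching CE-SVR) for all remaining central residues."""
    return 0.4 - 0.05 * k if k <= 5 else eps_core
```

At the boundary (k = 0) the tolerance is largest (0.4) and it decreases linearly until it joins the constant core value, mirroring the red curve in Figure 4.1.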

4.2.4 Review of Neural Network Based Regression Models

In our previous work, we demonstrated that accurate RSA prediction models can be developed using nonlinear NN-based regression [39]. However, the number of available examples from which a predictor may learn is much larger in the case of soluble proteins. Consequently, reliable parameter estimation is possible even for more complex models, such as those based on NNs. On the other hand, it is not clear a priori if the same trends will hold for membrane proteins. In order to investigate the usefulness of NNs for RLA prediction, we developed a simple NN-based predictor. We used a multilayer perceptron (MLP) architecture with a single hidden layer, consisting of either five or ten nodes with a logistic activation function, and a single logistic output node that approximates real-valued RLAs. Hence, if the PSSM-derived representation with a sliding window of size 15 is used in conjunction with ten nodes in the hidden layer, then the overall number of edges and, consequently, weights to be optimized equals about 3,000, which is of the same order of magnitude as the number of residues in our training set. This suggests the possibility of overfitting, as discussed below. All NNs considered here were trained using the SNNS package [45]. The Rprop learning algorithm was used with the default parameters, and the training was stopped after 500 epochs, since no significant improvement was observed in terms of the standard sum of squares error function.
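The weight-count arithmetic can be checked directly: 300 PSSM inputs (a window of 15 × 20 values) feeding ten hidden nodes, plus the hidden-to-output edges, give roughly 3,000 weights. A small sketch (the bias option is our addition; the text's count excludes biases):

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs=1, biases=False):
    """Number of edge weights in a single-hidden-layer MLP."""
    w = n_inputs * n_hidden + n_hidden * n_outputs
    if biases:
        w += n_hidden + n_outputs
    return w
```

With 6,156 training residues, fitting on the order of 3,000 free parameters leaves roughly two examples per parameter, which is why overfitting is a concern here.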

4.3 Results and Discussion

4.3.1 Cross-validation Study

In this section, we report the results of our cross-validation study, using the training sets defined in Section 4.2.1 and comparing both different representations and alternative machine learning approaches. We used standard error measures for regression-based methods, including the root mean squared error (RMSE), the mean absolute error (MAE), and the correlation coefficient (CC) between the predicted and observed values of RLA, as defined before [56]. Average accuracies in 10-fold cross-validation (with different subsets of the data defined in terms of protein chains and consisting of approximately equal numbers of individual residues) are reported below. We also measure errors on subsets of amino acids depending on their position in the membrane and their location at protein-protein interaction sites in order to assess the prediction quality for different environments in the membrane.
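The three error measures can be computed as follows (a generic numpy sketch; the function name is ours):

```python
import numpy as np

def regression_errors(pred, obs):
    """RMSE, MAE, and Pearson correlation coefficient (CC) between
    predicted and observed RLA values."""
    pred = np.asarray(pred, float)
    obs = np.asarray(obs, float)
    rmse = float(np.sqrt(np.mean((pred - obs) ** 2)))
    mae = float(np.mean(np.abs(pred - obs)))
    cc = float(np.corrcoef(pred, obs)[0, 1])
    return rmse, mae, cc
```

RMSE penalizes large deviations more heavily than MAE, while CC is insensitive to a constant offset or rescaling of the predictions, so the three measures capture complementary aspects of prediction quality.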

In Figure 4.2, two different representations are compared: one based on MSA alone and one extended by inclusion of the SABLE-predicted RSAs and TMLIP2H lipophilicities for each amino acid in a sliding window. As can be seen from the figure, the combined MSA+RSA+TMLIP2H representation yields consistently higher accuracies than evolutionary profiles alone, even though the differences are small. In addition, a sliding window of length 15 yields the best results in cross-validation for both representations. This optimal window size is therefore used for further analysis and for the final predictor.

The two solid curves in the figure correspond to results obtained for the two different representations using the full training set of 6,307 residues, with the actual RLAs derived from intact complexes, i.e., bound structures, whenever applicable. In order to assess the effect of residues located at interaction interfaces between membrane domains of multimeric structures, we also computed cross-validated accuracies using a subset of 4,635 residues obtained by removing known interaction sites (see also section 4.2.2). Significantly higher accuracies are obtained in the latter case for both representations (CCs of 0.53 vs. 0.47 for the extended MSA+RSA+TMLIP2H representation), indicating that residues located at interaction interfaces are indeed difficult to classify. We also note that improvements due to the extended representation become more pronounced in the case of the reduced training set, as illustrated by the distance between the pairs of solid and dashed curves, respectively.

[Figure 4.2: correlation coefficient (y-axis, 0.37-0.53) vs. sliding window size (x-axis, 3-19) for four curves: MSA with interaction sites, MSA without interaction sites, MSA+RSA+TMLIP2H with interaction sites, and MSA+RSA+TMLIP2H without interaction sites.]

Figure 4.2 Average cross-validated RLA prediction accuracies (in terms of correlation coefficients) for the training set of 72 non-redundant chains with (solid lines) and without (dotted lines) interface residues.

Significant differences between accuracies with and without interface residues indicate that the latter are more difficult to predict. Two different representations are also compared: one based on multiple sequence alignment (MSA) alone and an extended representation combining MSA, predicted RSAs, and TMLIP2H lipophilicities.

We further contrasted alternative representations and prediction models in Table 4.1. In particular, the results for SVR-based and NN-based regression models were compared for several representations. As can be seen from the table, the MSA-based representation yields CCs of about 0.45 for SVR and 0.32 for NN. These accuracies were further improved when a combination of different features, including the SABLE-predicted RSAs and hydrophobicity (WW) or lipophilicity (TMLIP2H) scales, was included. Note, however, that these average amino acid properties result in rather small (statistically insignificant) improvements when combined with an MSA-based representation. In particular, we observed very small differences between the new TMLIP2H lipophilicity scale (and its other variants [66]; data not shown) and older hydrophobicity scales, such as WW.

The final representation (MSA+RSA+TMLIP2H) yielded CCs of 0.47 for SVR and 0.36 for NN. Thus, NNs clearly suffer from overfitting and poor generalization, a likely consequence of the small training set. On the other hand, SVR-based models achieved consistently higher accuracy for all representations considered here. It is interesting to note, however, that both SVRs and NNs achieved quite comparable accuracies in the case of the most compact representation, i.e., the one based on SABLE-predicted RSAs and confidence factors. In this case, the implied number of parameters was the smallest, reducing the level of overfitting.

The SVR results discussed so far were obtained using the constant error (CE-SVR) model. As indicated in Section 4.2.3, one advantage of SVRs is that they are very flexible in terms of how prediction errors are penalized: as previously mentioned, both the insensitivity parameter ε and the penalization factor C can, in principle, be chosen as functions of the position in the membrane. We performed a limited search to find the best combination of meta-parameters for the position-specific error (PSE) model, as described in Section 4.3. While we observed a slight improvement for PSE-SVR (by about 0.01-0.02 in correlation coefficient) for the central residues, where the allowed errors are constant (ε = 1.0, which corresponds to an MAE of 10%), this gain was offset by somewhat lower accuracy for the residues at the lipid-water interface. However, as we will see in the next section, PSE-SVR models do perform better than constant error models on our independent control set, indicating that flexible SVRs might help achieve better generalization.
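Scikit-learn's SVR does not expose a per-residue insensitivity ε, but position-specific penalization (the factor C) can be emulated with per-sample weights, which scale C sample by sample. The weighting scheme and data below are hypothetical, not those used in this work.

```python
# Sketch of a position-specific error (PSE) variant: per-residue
# penalization via sample weights on a linear epsilon-SVR.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))          # toy residue feature vectors
y = rng.uniform(0.0, 1.0, size=200)     # toy RLA targets
# Position in the membrane: 0 = center ... 1 = lipid-water interface
pos = rng.uniform(0.0, 1.0, size=200)

# Hypothetical scheme: penalize errors more heavily near the interface
weights = 1.0 + pos

model = SVR(kernel="linear", C=1.0, epsilon=0.1)
model.fit(X, y, sample_weight=weights)  # per-sample weight scales C
print(model.predict(X[:3]).shape)
```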

4.3.2 Independent Control Set

In order to further test the generalization of the SVR models, we evaluated them on an independent control set consisting of 74 TM segments in 16 non-redundant chains and 1,482 TM residues. Global accuracy measures obtained on the control set are consistent with the estimates obtained using cross-validated training. In particular, the combined MSA+RSA+TMLIP2H representation and the PSE-SVR model, which proved to perform best on the control set, yielded a CC of 0.49. The slight increase in accuracy compared to cross-validation results (CC of 0.47; see Table 4.1) can be attributed to the use of a simple consensus, as defined in Section 4.2.3, for the final predictor. The consensus predictor yields CCs 0.01-0.02 higher than individual SVR models. In addition, the fraction of residues located at protein-protein interaction sites, which are difficult to predict (see Figure 4.2), was somewhat lower in the control set than in the training set (see section 4.2.2). Since the errors are similar to those observed in cross-validation, we conclude that the method is robust and avoids overfitting.

PDB Chain ID   CC          RMSE [%]    MAE [%]
1xfh_C         0.50        21.5        16.4
1vry_A         0.35        34.4        31.1
1xqf_A         0.62        15.6        12.7
1yq3_D         0.57        17.5        13.8
1s5l_z         0.53        18.5        14.7
1s5l_x         0.40        14.4        10.9
1w5c_F         0.56        23.7        20.6
2axt_h         0.56        19.7        17.6
1yew_K         0.59        16.7        14.4
1yew_J         0.33        25.6        22.1
1yew_A         0.80        19.2        17.1
1q90_M         0.60        20.8        16.3
2bbj_E         0.51        13.9        12.3
1zcd_A         0.64        16.3        13.6
1c17_M         0.40        24.2        18.8
2a65_A         0.51        18.9        16.0
Average        0.53±0.03   19.9±1.3    16.6±1.2

Table 4.2 Performance of the final (consensus) SVR model on the control set of 16 non-redundant membrane proteins.

In order to illustrate the performance of the new predictor on specific proteins, the results for all individual chains in the control set are listed in Table 4.2. The per-protein average accuracies and error bars (standard deviations) for CC, RMSE, and MAE are included in the last row. As can be seen from the table, a per-protein average CC of 0.53 was achieved, which is higher than the per-residue estimate of 0.49. At the same time, the per-protein average RMSE and MAE errors were somewhat higher than their per-residue counterparts, which were estimated to be 19.4% and 15.7%, respectively. We also note that the constant error CE-SVR (consensus) model achieves an average (per-protein) CC of 0.51 on the control set. Thus, although systematic, the differences between PSE-SVR and CE-SVR were not statistically significant.

Our regression models provide real-valued approximations of the actual RLA, without resorting to cruder classification approaches that use arbitrary thresholds to define lipid-exposed and buried residues. However, real-valued predictions can, in analogy to the RSA case [39,56], clearly be projected into discrete classes. When our RLA predictions were discretized with a threshold of 15% RLA, the two-state average (per-protein) classification accuracy on the control set was about 68% (with a baseline of 53%). The classification accuracy for the threshold of 25% RLA was about 71% (with a baseline of 63%). The corresponding standard deviations were 3.5% and 3.3% RLA, respectively.
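The two-state projection described above (discretize at a threshold, compare against the majority-class baseline) can be sketched as follows; the observed and predicted RLA values are illustrative.

```python
# Project real-valued RLA predictions onto a two-state (buried/exposed)
# classification at a given threshold, with the majority-class baseline.
import numpy as np

def two_state_accuracy(rla_obs, rla_pred, threshold):
    obs = np.asarray(rla_obs) < threshold    # True = buried
    pred = np.asarray(rla_pred) < threshold
    accuracy = float(np.mean(obs == pred))
    # Baseline: always predicting the more frequent observed class
    baseline = float(max(np.mean(obs), 1.0 - np.mean(obs)))
    return accuracy, baseline

obs = [0.05, 0.10, 0.30, 0.50, 0.20, 0.08]   # toy observed RLAs
pred = [0.10, 0.12, 0.40, 0.45, 0.10, 0.05]  # toy predicted RLAs
acc, base = two_state_accuracy(obs, pred, threshold=0.15)
print(acc, base)
```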

Figure 4.3 Example of RLA prediction for the sodium/proton antiporter (PDB code 1zcd, chain A), which was included in our independent control set. The amino acid sequence is given in the first row (labeled A); the predicted RLA is shown in the second row (labeled B); the SABLE-predicted secondary structures and RSA (indicating clearly the location of TM segments that are not accessible to water) are shown in the third and fourth rows (labeled C and D, respectively); whereas, in the last two rows (labeled E and F, respectively), secondary structures and RSA determined from the complex structure are shown for comparison. Solvent (lipid or water) inaccessible residues are indicated by black boxes, whereas (partially) exposed residues are indicated by (shaded) white boxes. The segments highlighted in yellow correspond to TM segments according to PDB_TM annotations.

The results on the control set discussed so far were obtained using RLA assignments derived from intact complexes. When the actual RLAs were derived from individual chains (thus disregarding interacting chains in multimeric structures), the accuracy of our RLA predictor, which was trained on data derived from complexes, decreased significantly, yielding an average (per-protein) CC of 0.41, RMSE of 24.5%, and MAE of 21.0%. This significant drop in accuracy reflects the presence of about 16% interacting residues, for which predicted RLAs are more consistent with bound (complex) states (hence the lower accuracy when the data from unbound structures were used to define the actual RLAs). We suggest that such biases in RLA predictions can be used to identify potential protein-protein interaction sites in membrane proteins, in analogy with similar trends for soluble proteins [65].

4.3.3 Identification of Buried Helices

We would like to briefly comment on one possible application of RLA prediction for individual residues in membrane domains, namely, using predicted RLAs to estimate which of the TM segments are fully or largely buried (and thus not exposed to lipid) in multispan alpha-helical membrane proteins. This kind of prediction for whole TM alpha-helices has been considered recently, based on average amino acid properties (lipophilicities) and sequence conservation patterns [67]. Here, we evaluate a simple projection of RLA predictions onto whole TM helices in order to obtain a predictive fingerprint of fully (or largely) buried TM segments. For assessment of the two-state prediction, we classified TM helices by averaging the actual RLA over all residues within a helix (using experimentally derived TM segments). Namely, a helix is classified as buried if the average RLA is less than 15% (consistent with the most informative threshold for two-state classification of individual residues). For prediction, the average predicted RLA is computed for each TM helix, and a threshold of 25% RLA is used to classify helices as buried or exposed (reflecting systematic biases in RLA predictions, which are shifted with respect to observed values).
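The helix-level projection can be sketched as follows; the 15%/25% thresholds follow the text, while the helices themselves are illustrative.

```python
# Classify whole TM helices as buried or exposed: observed label uses a
# 15% mean-RLA threshold, predicted label uses the shifted 25% threshold
# described in the text (compensating for systematic prediction bias).
import numpy as np

def classify_helices(helices, obs_thresh=0.15, pred_thresh=0.25):
    """helices: list of (observed_rla_array, predicted_rla_array), one per TM helix."""
    results = []
    for obs, pred in helices:
        actual_buried = float(np.mean(obs)) < obs_thresh
        predicted_buried = float(np.mean(pred)) < pred_thresh
        results.append((actual_buried, predicted_buried))
    return results

helices = [
    (np.array([0.05, 0.10, 0.12]), np.array([0.15, 0.20, 0.22])),  # buried helix
    (np.array([0.40, 0.35, 0.50]), np.array([0.30, 0.28, 0.35])),  # exposed helix
]
print(classify_helices(helices))  # [(True, True), (False, False)]
```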

We performed a preliminary evaluation of this simple approach using three proteins that were not used for training the new RLA predictor (and that were also included in the evaluation of RANTS, which was estimated to achieve a classification accuracy of 78% [67]), namely 2a65, chain A; 1q90, chain B; and 1ocr, chain A, comprising 28 TM helices (18 of them classified as buried). An overall classification accuracy of about 82% (with a baseline of 64%) is achieved on this small set, with 1 true positive (TP) and 2 false positives (FP) predicted for 1q90 (with just one helix out of four in this protein classified as buried), 6 TP and 1 FP predicted for 2a65 (with six helices out of twelve classified as buried), and 10 TP, 1 FP, and 1 false negative predicted for 1ocr (with eleven helices out of twelve classified as buried). Since RLA predictions for individual residues provide a more detailed representation of the state of exposure of entire helices, further improvements are expected when the number of training examples increases, enabling a more precise mapping of RLAs into classifications of TM helices.

4.3.4 Prediction of Sodium Channel Protein

Figure 4.4 Predictions of TM domains and RLA for the sodium channel protein from human (SCN1A_HUMAN). The sequence is given in the first row, and the predicted transmembrane residues are highlighted. The second row shows the predicted RLA, indicated by boxes: the darker the box, the lower the lipid accessibility of the residue; a black box means the residue is not accessible to lipid at all.

Sodium channel proteins are intensively studied in the literature [73]. However, solving the 3-D structures of sodium channels remains a challenge [73], and structural properties, such as secondary and partial tertiary structures, can only be deduced from the primary sequence [73]. Here, we apply MINNOU to predict the transmembrane domains and RLA for a human sodium channel protein. The results are shown in Figure 4.4. We predicted 24 TM domains in the channel and, in addition, predicted the RLA for all 24 TM domains. The black boxes in the RLA line indicate several residues that interact closely with other residues of the protein. The evaluation of our prediction is subject to the availability of further experiments.

4.4 Conclusion

The bottleneck in the field of computational structure prediction for membrane proteins is the relatively small number of high-resolution structures, limiting one's ability to develop reliable prediction methods that learn from known examples. In this chapter, we proposed novel methods for the prediction of relative lipid accessibility in membrane proteins. We emphasized several aspects of the prediction problem that are, in our view, essential for a realistic and careful machine learning study: a careful assembly of an appropriate representative training set, an appropriate and deliberate choice of representation for the amino acid residues, and careful selection and validation of models. After these considerations, we concluded that RLA prediction is feasible, can be achieved with good accuracy, and has great potential for applications in prediction protocols.

In particular, we developed several regression models using a linear Support Vector Regression approach. One appealing feature of SVRs is that, in analogy to the Support Vector Machine (SVM) approach for classification, the trade-off between the (in this case ε-insensitive) regression error and the generalization ("margin") of the solution can be explicitly controlled [3]. Additionally, the SVR approach allows one to incorporate the expected (or permissible) error levels, e.g., for buried and exposed residues in the case of RSA prediction [56], or for residues at the lipid-water interface vs. those located in the center of the membrane.

Accurate predictions of intermediate attributes of amino acid residues in a protein, such as secondary structure or solvent accessibility, greatly facilitate the search for the correct structural template in fold recognition, as well as the search for the native conformation in de novo simulations [45]. Our current results indicate that RLA can be predicted at the level of a 0.5 correlation coefficient (which is still lower than the estimated 0.6-0.7 for state-of-the-art real-valued RSA prediction methods for soluble proteins [61]). We also observed that the predicted RLAs are qualitatively consistent with the periodic pattern of surface exposure in TM helices (see Figure 4.3 for an example).

We propose that the RLA prediction methods developed here can already be used to improve structural studies of membrane proteins. Furthermore, we hypothesize that RLA prediction accuracy will further increase when more high-resolution structural data for membrane proteins become available. The novel predictor has been incorporated into the on-line MINNOU server and can be accessed via the URL http://minnou.cchmc.org.

Chapter 5 Cluster analysis of array comparative genomic hybridization data

In conventional use of microarray technology, only mRNA expression levels are measured. Array comparative genomic hybridization (CGH) can be used to investigate DNA copy number alterations using the microarrays developed for assessing mRNA expression. Because DNA copy number is associated with gene expression, this additional copy number alteration information creates an opportunity to cluster gene expression profiles with extra knowledge from copy number profiles. We propose a novel approach to cluster genes by borrowing strength across expression and copy number alteration, implemented with an unsupervised machine learning method, the context-specific Bayesian infinite mixture model. According to an independent evaluation based on KEGG pathways, our approach outperforms other simple methods.

5.1 Introduction

In normal cells, every gene has two DNA copies, one from each parent. DNA copy number alteration, which typically manifests as amplification or deletion in tumors, is strongly associated with cancer prognosis [74]. Specifically, DNA copy number amplification is associated with overexpression of genes promoting cell growth (oncogenes), and the deletion of DNA regions is associated with underexpression of genes suppressing cell growth (tumor suppressor genes) [74]. The DNA copy number alterations detected by array comparative genomic hybridization (array CGH) can be directly related to the expression of genes within these DNA regions, enhancing the identification of oncogenes that cause the disease phenotype and improving therapeutic approaches [74].

Array comparative genomic hybridization has recently been used to simultaneously investigate DNA copy number alterations and mRNA expression levels in breast and colorectal cancers. For example, in the breast cancer study, Pollack et al. measured the DNA copy number changes between samples from tumors and cell lines and a sample of normal female leukocyte DNA from a single donor [14]. In parallel, mRNA expression was measured for the samples using the same cDNA array. Their analysis of the simultaneous measurements of both DNA copy number alterations and mRNA expression levels led to the identification of a significant impact of widespread DNA copy number alteration on mRNA expression levels [14]. In fact, 62% of highly amplified genes showed either moderately or highly elevated expression [14]. The study also concluded that elevated expression of an amplified gene cannot alone be considered strong independent evidence of a candidate oncogene. Therefore, sophisticated approaches are needed to seek candidate oncogenes.

The analysis methods in the literature focus on seeking chromosomal amplification and deletion regions. Engler et al. proposed a pseudolikelihood approach to detect genetic alterations [75]. Rouveirol et al. developed two algorithms for computing minimal and minimal constrained regions of gain and loss from discretized CGH profiles [76]. These methods all focus on precisely predicting chromosomal gains and losses, or the boundaries of losses, while ignoring expression levels.

In this chapter, we propose an alternative approach of seeking correlated genes by integrating information on both DNA copy number alterations and mRNA expression levels. A context-specific Bayesian infinite mixture model was applied to cluster genes based on their expression levels within contexts defined by their DNA copy number alterations. According to an independent test based on KEGG pathways, our approach is able to detect the expression patterns and clusters, and outperforms other simple methods.

5.2 Methods and Data

5.2.1 Data

We used the data set of Pollack et al. [14] exclusively, because it was the only available data set containing both DNA copy number alterations and mRNA expression levels. They measured DNA copy number alterations and mRNA expression levels in breast cancer for 6,095 genes in 41 samples: 4 breast cancer tumor cell lines and 37 tumors [14]. In general, microarray data contain some missing values due to experimental limits or human errors. The missing values in this data set were imputed using the "impute" package [77] in R with default settings [78]. In addition, the physiological status of those tumors, such as tumor size, tumor grade, and histology, was also recorded. These phenotypes will be used to investigate phenotype-genotype associations in a follow-up study.
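The R "impute" package performs k-nearest-neighbour imputation; an analogous step can be sketched in Python with scikit-learn's KNNImputer. The toy expression matrix below is illustrative, not the Pollack data.

```python
# k-nearest-neighbour imputation of missing microarray values,
# analogous to R's "impute" package with default settings.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],   # one missing expression value
              [1.1, 2.1, 0.5],
              [0.9, 1.9, 0.4]])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # False: no missing values remain
```

The missing entry is replaced by the (uniformly weighted) mean of the corresponding values in the two nearest rows.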

5.2.2 Hierarchical Bayesian Model

As we discussed in Chapter 1, the Bayesian inference approach to unsupervised learning for pattern recognition is to calculate the posterior probability distribution of patterns given the data and other parameters. As introduced in [22] and [79], a directed acyclic graph (DAG) is used to define the dependences of all parameters in the Bayesian hierarchical model. In a DAG, the probability distribution of each node is determined by all its parent nodes, or priors. Thus, the joint probability of the model parameters and the data is the product of the probability distributions of all nodes given their parents, as shown in the following equation [79]:

p(X, C, M, S, α, λ, τ, β, φ) = p(X | C, M, S) p(C | α) p(M | λ, τ) p(S | β, φ) p(α) p(λ) p(τ) p(β) p(φ),    (5.1)

where M = (μ_1, μ_2, ..., μ_Q) is the set of all means associated with the Q global patterns for the data X, S = (Σ_1, Σ_2, ..., Σ_Q) is the corresponding set of variances, and C = (c_1, c_2, ..., c_T) is the corresponding set of cluster assignments. The number of global patterns, Q, is not chosen in advance and is determined through learning from the data. The parameters α; λ and τ; and β and φ are hyperparameters for the priors on C, M, and S, respectively.

Figure 5.1 is a simplified version of the diagram shown in [79] for the hierarchical Bayesian model. The priors α, λ, τ, β, and φ determine the distribution of the model parameters C, M, and S. In turn, the profile vectors X depend on their parents, C, M, and S. The details of setting the distributions for α, λ, τ, β, and φ were discussed in [22, 79].

[Figure 5.1: three-level DAG with the hyperparameters (λ, τ), α, and (β, φ) at the top; the model parameters M, C, and S in the middle; and the data X at the bottom.]

Figure 5.1 Directed acyclic graph for the Bayesian hierarchical model.

Clustering proceeds by approximating the joint posterior distribution of the classification given the data, p(c | x_1, x_2, ..., x_T), where x_i is the profile for gene i among all T genes and c = (c_1, c_2, ..., c_T). The classification variables, given the other model parameters and the data, are specified by the following equations [79]:

p(c_i = j | c_-i, x_i, μ_j, σ_j²) ∝ [n_-i,j / (T − 1 + α)] f_N(x_i | μ_j, σ_j² I),

p(c_i ≠ c_j for all j ≠ i | c_-i, x_i, λ, τ) ∝ [α / (T − 1 + α)] ∫ f_N(x_i | μ_j, σ_j² I) p(μ_j, σ_j² | λ, τ) dμ_j dσ_j².

Here f_N(x_i | μ_j, σ_j² I) is the normal density of x_i given the mean μ_j and covariance σ_j² I of cluster j, n_-i,j is the number of expression profiles assigned to cluster j without counting the i-th profile, and c_-i = (c_1, c_2, ..., c_{i−1}, c_{i+1}, ..., c_T).

A Gibbs sampler is then used to sample from the resulting multivariate posterior distribution [22]. The procedure starts from the assumption that all profiles belong to a single cluster. After the k-th step, the (k+1)-th set of cluster assignments is obtained by drawing c^{k+1} according to the posterior probability distributions described in the equations above. As the process iterates, clusters are created and removed: a new cluster is created when a profile is assigned a label c_i ≠ c_j for all j ≠ i, and a cluster is removed when no profile remains in it. The whole procedure is implemented in the GIMM package developed by Medvedovic et al., who made the package available for this study [22].
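A heavily simplified sketch of these Gibbs updates: one-dimensional profiles, a fixed within-cluster variance σ², and an N(0, τ²) prior on cluster means, so that both the existing-cluster and new-cluster weights are available in closed form. GIMM's actual model is multivariate and fully hierarchical; every quantity here is illustrative.

```python
# One collapsed-Gibbs sweep for a 1-D infinite (Dirichlet process)
# mixture of normals: existing clusters weighted by n_{-i,j}/(T-1+alpha),
# a new cluster by alpha/(T-1+alpha), as in the equations above.
import math, random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gibbs_sweep(x, c, rng, alpha=1.0, sigma2=0.5, tau2=4.0):
    T = len(x)
    for i in range(T):
        c[i] = None  # remove profile i from its current cluster
        labels = sorted({cj for cj in c if cj is not None})
        weights = []
        for k in labels:
            members = [x[j] for j in range(T) if c[j] == k]
            n = len(members)
            # posterior predictive of x_i under existing cluster k
            prec = 1.0 / tau2 + n / sigma2
            mu_post = (sum(members) / sigma2) / prec
            weights.append(n / (T - 1 + alpha) * normal_pdf(x[i], mu_post, sigma2 + 1.0 / prec))
        # weight for opening a new cluster (mean integrated over its prior)
        weights.append(alpha / (T - 1 + alpha) * normal_pdf(x[i], 0.0, sigma2 + tau2))
        candidates = labels + [max(labels, default=-1) + 1]
        r = rng.random() * sum(weights)
        acc, chosen = 0.0, candidates[-1]
        for k, w in zip(candidates, weights):
            acc += w
            if r <= acc:
                chosen = k
                break
        c[i] = chosen

x = [-3.1, -2.9, -3.0, 3.0, 3.2, 2.8]   # two well-separated toy "profiles"
c = [0] * len(x)                         # start with all profiles in one cluster
rng = random.Random(0)
for _ in range(20):
    gibbs_sweep(x, c, rng)
print(sorted(set(c)))                    # cluster labels after 20 sweeps
```

Clusters with no remaining members simply drop out of `labels` on the next sweep, mirroring the create/remove behavior described in the text.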

The resulting posterior probability distribution is then used to cluster genes: the probability that a pair of genes belongs to the same cluster defines the distance between them. If P_ij is the probability of genes i and j being within the same cluster, the distance between them is defined as D_ij = 1 − P_ij. For example, in the extreme case P_ij = 1, meaning that genes i and j are always within the same cluster under any conditions, the distance between them is 0. This distance measure is then used to cluster genes so that the distance between genes from the same cluster is smaller than between genes from different clusters.
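Turning the pairwise posterior probabilities into distances D_ij = 1 − P_ij and clustering on them can be sketched with SciPy; the small probability matrix below is illustrative.

```python
# Convert pairwise co-clustering probabilities into distances and feed
# them to average-linkage hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

P = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])   # posterior P(gene i, gene j co-clustered)
D = 1.0 - P                       # D_ij = 1 - P_ij; D_ii = 0
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # genes 0 and 1 cluster together; gene 2 stays separate
```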

5.2.3 The Independent Test

We used information from the KEGG pathway database [80], which is independent of gene expression, to evaluate our approach. KEGG is a well-maintained database collecting massive amounts of information on biological pathways. Under the assumption that genes sharing the same pathway belong to the same cluster, we assessed the performance of our approach by constructing Receiver Operating Characteristic (ROC) curves [81]. An ROC curve plots the true positive rate vs. the false positive rate for a binary classifier as the discrimination threshold varies [81]. The true positive rate is the proportion of correct cluster assignments, while the false positive rate is the proportion of incorrect cluster assignments. In the ideal case, if every cluster assignment is correct (i.e., clustered and non-clustered genes are perfectly distinguished), the ROC curve consists of only two straight lines: one along the y-axis, where the false positive rate is always 0, and one parallel to the x-axis, where the true positive rate is always 1. In reality, the closer the ROC curve is to this ideal, the better the performance of the classifier.
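The ROC construction described above can be sketched for the pairwise evaluation, scoring each gene pair by 1 − D_ij and labeling it positive when the two genes share a KEGG pathway; the scores and labels below are illustrative.

```python
# Trace an ROC curve by sweeping the threshold over sorted pair scores:
# each positive pair raises TPR, each negative pair raises FPR.
import numpy as np

def roc_curve_points(scores, labels):
    order = np.argsort(-np.asarray(scores))       # descending by score
    labels = np.asarray(labels)[order]
    P = int(labels.sum())
    N = len(labels) - P
    tpr, fpr = [0.0], [0.0]
    tp = fp = 0
    for lab in labels:
        if lab:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / P)
        fpr.append(fp / N)
    return fpr, tpr

scores = [0.95, 0.90, 0.60, 0.40, 0.20]   # 1 - D_ij for five gene pairs
labels = [1, 1, 0, 1, 0]                  # 1 = same KEGG pathway
fpr, tpr = roc_curve_points(scores, labels)
print(list(zip(fpr, tpr)))
```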

5.3 Results and Discussion

We applied a number of clustering algorithms to find highly correlated genes based on mRNA expression levels alone. As shown in Figure 5.2, in terms of the ROCs, none of the tested methods performed better than random clustering, because of the noisiness of the data.


Figure 5.2 The ROC curves obtained using independent clustering of mRNA expression and cDNA copy number alteration patterns. The dotted line indicates random clustering.

In order to exploit the DNA copy number alteration information to assist the clustering of expression levels, we applied a newly developed context-specific Bayesian infinite mixture model to cluster genes using both DNA copy number alteration and mRNA expression level information. At first, genes were clustered based on expression levels only, without any context information. We then selected clusters in which any pair of genes had a probability of at least 90% of being correlated, yielding 103 clusters of 694 genes for the 41 tissue samples. The structure of these clusters was then duplicated exactly to profile average DNA copy numbers: in the new profile, each sample is represented by a 103-dimensional vector whose components are the averages of the DNA copy number alterations over all member genes of each cluster. We then clustered samples based on the new average DNA alteration profiles with both hierarchical and k-means clustering methods, with the number of sample clusters varying from 2 to 18. The resulting clusters of samples were then used as the defined contexts for gene expression levels, and we applied GIMM, a context-specific Bayesian infinite mixture model, to cluster genes with the defined contexts. We also iterated the whole procedure in order to optimally use all the information in the clustering. The results are discussed in the next section.
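The context-building step (averaging the copy number profile within each gene cluster, then clustering samples on the averaged profiles) can be sketched as follows; the dimensions (3 gene clusters, 4 samples) are illustrative stand-ins for the 103 clusters and 41 samples of the study, and the k-means step stands in for either of the two methods used.

```python
# Build per-sample context profiles by averaging copy-number log-ratios
# within each gene cluster, then cluster the samples on those profiles.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(2)
cnv = rng.normal(size=(10, 4))                     # copy-number ratios: genes x samples
gene_cluster = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])

# average profile of each gene cluster -> each sample becomes a 3-dim vector
profiles = np.vstack([cnv[gene_cluster == k].mean(axis=0) for k in range(3)]).T
_, sample_context = kmeans2(profiles, 2, seed=0)   # sample contexts for GIMM
print(profiles.shape, sample_context.shape)
```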

As discussed in [22] and [79], the context-specific Bayesian mixture model improves gene clustering. In Figure 5.3, we compare, in terms of ROCs, the performance of the contexts obtained by iteratively integrating DNA copy number alteration and mRNA expression levels. At first, we applied GIMM to the mRNA expression levels alone, without any context; the result is close to random clustering. The clustering was improved by integrating the DNA copy number alterations and defining contexts of samples. With further iterations, however, we found a drop in accuracy based on the ROC, as shown in Figure 5.3: the ROCs from the second iteration are closer to the random line than those from the first iteration. We concluded that the optimum was achieved in the first iteration, and the resulting optimal clustering was used for further analysis.


Figure 5.3 The ROC curves for the iterative clustering of genes based on mRNA expression levels and cDNA copy number alteration. Genes were clustered based on mRNA expression levels without any context (curve labeled No_context). The curve labeled Loop1_hc shows the clustering results from the first iteration, in which samples were classified into 5 classes using average DNA copy number alteration profiles. The curves labeled Loop2_hc_2, Loop2_hc_5, and Loop2_hc_9 show the results from the second iteration, in which samples were classified into 2, 5, and 9 classes, respectively, using the DNA copy number alteration profiles. The straight line (Random Line) indicates random clustering.

As shown in Figure 5.3, this clustering is not optimal, and a further consideration of the contexts of all the samples improves it. At first, we applied the Bayesian infinite mixture model to mRNA expression levels without any context information from the DNA copy numbers. The resulting posterior probability was the probability of every pair of genes being co-expressed, given the measurements of all other genes across all samples. We then selected clusters of genes in which any pair of genes had a probability of at least 90% of being co-expressed, leading to 103 clusters of 694 genes for the 41 tissue samples. The structure of these clusters was then duplicated exactly to profile average DNA copy numbers: in the new profile, each sample was represented by a 103-dimensional vector whose components are the averages of the DNA copy number alterations over all member genes of each cluster. We then clustered samples based on the new average DNA alteration profiles with hierarchical clustering. The contexts used in the first iteration are shown in Figure 5.4, with the averaged mRNA profiles in the left panel and the averaged DNA profiles in the right panel.


Figure 5.4 Classification of samples using averages of cDNA copy number alterations (right panel) and the corresponding mRNA expression levels (left panel).

5.4 Future Plans

In order to understand the underlying reasons for the improvements gained by separating samples into contexts based on the CGH analysis, we would like to further investigate the genes responsible for the creation of the contexts. With these further investigations, more practical suggestions for clinical therapy can be expected. Further examination of the results of our analysis is likely to yield additional biological insights into the mechanisms of carcinogenesis.

5.5 Summary

We proposed an alternative approach of seeking correlated genes and their association with breast cancer tumor phenotypes by integrating information on both DNA copy number alterations and mRNA expression levels measured from array CGH. A context-specific Bayesian infinite mixture model was applied to cluster genes based on their expression levels within contexts defined by their DNA copy number alterations. Further investigations regarding the exact contributions of the contexts and the clustering of samples are needed.

Bibliography

[1]. Gibson, G. and Muse, S., A Primer of Genome Science, 2nd edition, Sinauer Associates, Inc., 2004.

[2]. Jiang, T., Xu, Y., and Zhang, M., Current Topics in Computational Biology, MIT Press, 2002.

[3]. Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 2001.

[4]. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., and Watson, J. M., Molecular Biology of the Cell, 3rd edition, Garland Publishing, Inc., 1994.

[5]. White, S. and von Heijne, G., The machinery of membrane protein assembly, Curr. Opin. Struct. Biol., 2004; 14, 397-404.

[6]. Ashcroft, F., From molecule to malady, Nature, 2006; 440, 440-447.

[7]. Chen, C. and Rost, B., State-of-the-art in membrane protein prediction, Applied Bioinformatics, 2002; 1, 21-35.

[8]. Schulz, G., β-barrel membrane proteins, Curr. Opin. Struct. Biol., 10, 443-447.

[9]. Tusnady, G., Dosztanyi, Z., and Simon, I., Transmembrane proteins in the Protein Data Bank: identification and classification, Bioinformatics, 2004; 20, 2964-2972; Tusnady, G., Dosztanyi, Z., and Simon, I., PDB_TM: selection and membrane localization of transmembrane proteins in the Protein Data Bank, Nucleic Acids Res., 2005; 33, Database issue, D275-278.

[10]. Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., and Bourne, P., The Protein Data Bank, Nucleic Acids Res., 2000; 28, 235-242.

[11]. Cao, B., Porollo, A., Adamczak, R., Jarrell, M., and Meller, J., Prediction-based structural profiles enhance recognition of transmembrane proteins, Bioinformatics, 2006; 22, 303-309.

[12]. Wallin, E. and von Heijne, G., Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms, Protein Sci., 1998; 7, 1029-1038.

[13]. Domany, E., Cluster analysis of gene expression data, J. Stat. Phys., 2003; 110, 1117-1139.

[14]. Pollack, J., Sorlie, T., Perou, C., Rees, C., Jeffery, S., Lonning, P. E., Tibshirani, R., Botstein, D., Borresen-Dale, A.-L., and Brown, P., Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors, Proc. Natl. Acad. Sci. USA, 2002; 99, 12963-12968.

[15]. Douglas, E., Fiegler, H., Rowan, A., Halford, S., Bicknell, D., Bodmer, W., Tomlinson, I., and

Carter, N., Array comparative genomic hybridization analysis of colorectal cancer cell lines and

primary carcinomas, Cancer Research , 2004; 64, 4817-4825.

[16]. Schölkopf, B., Burges, C., and Smola, A. (eds.), Advances in Kernel Methods - Support Vector

Learning , MIT-Press, 1998.

[17]. Lingireddy, S. and Brion, G. M., Artificial neural networks in water supply engineering, The

American society of civil engineers , 2005.

[18]. Rauber, T., Barata, M., and Steiger-Garcao, A., A Toolbox for Analysis and Visualization of

Sensor Data in Supervision. Proceedings of the International Conference on Fault Diagnosis ,

Toulouse, France. 1993.

[19]. Wagner M., Meller J. and Elber R. Large-scale linear programming techniques for the design of

potentials. Math. Program. 2004; 101, 301-318.

[20]. Rowe, G., Theoretical models in biology: the origin of life, the immune system, and the brain ,

Clarendon Press, Oxford, 1994.

[21]. Gelman, A., Carlin, J., Stern, H., and Rubin, D., Bayesian data analysis 2nd , Chapman &

Hall/CRC, 2003.

[22]. Medvedovic, M. and Sivaganesan, S., Bayesian infinite mixture model based clustering of gene

expression profiles. Bioinformatics , 2002; 18, 1194-1206.

[23]. Hirokawa, T., Seah, B.C. and Mitaku. S., SOSUI: Classification and Secondary Structure

Prediction System for Membrane Proteins. Bioinformatics , 1998; 14, 378-379.

[24]. von Heijne, G., Membrane Protein Structure Prediction, Hydrophobicity Analysis and the

Positive-inside Rule, J. Mol. Biol. 1992; 225, 487-494.

[25]. Hofman, K. and Stoffel, W., TMbase--a database of membrane spanning protein segments, Bio.

Chem.., Hoppe-Seyler , 1993; 374, 166.

105 [26]. Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E., Predicting transmembrane protein

topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. , 2001;

305, 567-580.

[27]. Kall L., Krogh, A., and Sonnhammer, E., A combined transmembrane topology and signal peptide

prediction method, J. Mol. Biol. , 2004; 338, 1027-36.

[28]. Tusnády, G. and Simon I., Principles governing amino acid composition of integral membrane

ptoeins: application to topology prediction. J. Mol. Biol. , 1998; 283, 489-506.

[29]. Tusnády, G. and Simon, I., The HMMTOP transmembrane topology prediction server.

Bioinformatics , 2001; 17, 849-850.

[30]. Viklund H, and Elofsson A., Best alpha-helical transmembrane predictions are

achieved using HMMs and evolutionary information. Protein Sci. , 2004; 13, 1908-17.

[31]. Rost, B., PHD: predicting one dimensional protein structure by profile based neural networks.

Meth Enzymol , 1996; 266,525-539.

[32]. Kernytsky, A. and Rost, B., Static benchmarking of membrane helix predictions. Nucleic Acids

Res. , 2003; 31, 3642-3644.

[33]. Chen, C., Kernytsky, A., and Rost, B., Transmembrane helix predictions revisited, Protein Sci. ,

2002; 11, 2774-91.

[34]. Moller, S., Croning, M., and Apweiler, R., Evaluation of methods for the prediction of membrane

spanning regions. Bioinformatics , 2001; 17, 646–653.

[35]. Wimely, W., Toward genomic identification of ß-barrel membrane proteins: Composition and

architecture of known structures. Protein Sci. , 2002; 11, 301-312.

[36]. Casadio, R., Fariselli, P., Finocchiaro, G., and Martelli P., Fishing new proteins in the twilight

zone of genomes: The test case of outer membrane proteins in Escherichia coli K12, Escherichia

coli O157:H7, and other Gram-negative bacteria . Protein Sci. , 2003; 12, 1158 - 1168.

[37]. Bigelow H.R., Petrey, D.S., Liu, J., Przybylski, D., and Rost, B., Predicting transmembrane beta-

barrels in , Nucleic Acids Res. , 2004; 32, DOI: 10.1093/nar/gkh580

106 [38]. Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D., Gapped

BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids

Res. , 1997; 25, 3389-3402.

[39]. Adamczak, R., Porollo, A., and Meller, J., Accurate prediction of solvent accessibility using

neural networks-based regression. Proteins , 2004; 56, 753-67.

[40]. Eyrich, V., Marti-Renom, M., Madhusudhan, M., Fiser, A., Pazos, F., Valencia, A., Sali, A., and

Rost, B., EVA: continuous automatic evaluation of protein structure prediction servers.

Bioinformatics , 2001; 17, 1242-1243.

[41]. Adamczak, R., Porollo, A., and Meller, J. Combining Prediction of Secondary Structures and

Solvent Accessibility in Proteins, Proteins , 2005; 59, 467-75.

[42]. Jayasinghe, S., Hristova, K., and White, S. H., MPtopo: A database of membrane protein

topology. Protein Sci. , 2001; 10, 455-8.

[43]. Hiller, K., Grote, A., Scheer, M., Münch, R., and Jahn, D., PrediSi: prediction of signal peptides

and their cleavage positions. Nucleic Acids Res. , 2004; 32, W375-W379.

[44]. Riedmiller M, Braun H., A direct adaptive method for faster backpropagation learning: the

RPROP algorithm. Proc IEEE Int Conf Neural Networks , 1993; pp. 123–134.

[45]. Zell A., Mamier G., Vogt, M., et al., The SNNS users manual version 4.1 Available online at

http://www-ra.informatik.uni-tuebingen.de/snns .

[46]. Jones, D., Protein secondary structure prediction based on position-specific scoring matrices. J.

Mol. Biol. , 1999; 292, 195-202.

[47]. Rost, B., Casadio, R., Fariselli, P., and Sander, C., Prediction of helical transmembrane segments

at 95% accuracy. Protein Sci. , 1995; 4, 521-533.

[48]. Matthews, B., Comparison of predicted and observed secondary structure of T4 ohage lysozyme.

Biochim Biophys Acta , 1975; 405, 442-451.

[49]. Kyte, J. and Doolittle, R., A simple method for displaying the hydropathic character of a protein.

J. Mol. Biol., 1982; 157, 105-132.

[50]. White S., and Wimley W., Membrane protein folding and stability: physical principles. Annu Rev

Biophys Biomol Struct. , 1998; 28, 319-65.

107 [51]. Diederichs, K., Freigang, J., Umhau, S., Zeth, K., and Breed, J., Prediction by a neural network of

outer membrane beta-strand protein topology, Protein Sci. , 1998; 7:2413-2420.

[52]. Jacoboni, I., Martelli, P., Fariselli, P., Pinto, V., and Casadio, R., Prediction of the transmembrane

regions of beta-barrel membrane proteins with a neural network-based predictor, Protein Sci. ,

2001; 10:779-787.

[53]. Martelli, P., Farselli, P., Krogh, A., and Casadio, R., A sequence-profile-based HMM for

predicting and discriminating beta barrel membrane proteins, Bioinformatics , 2002; 18:S46-S53

[54]. Gromiha, M. and Suwa, M., A simple statistical method for discriminating outer membrane

proteins with better accuracy, Bioinformatics , 2004; 21: 961-968.

[55]. Natt, N., Kaur, H. and Raghava, G., Prediction of Transmembrane regions of beta-barrel proteins

using ANN and SVM based method. Proteins: Structure, Function, and Bioinformatics , 2004;

56:11-8.

[56]. Wagner M., Adamczak R., Porollo A., and Meller J. Linear Regression Models for Solvent

Accessibility Prediction in Proteins. J. Comp. Bio. , 2005; 12, 355-369.

[57]. Wimley, W., The versatile beta-barrel membrane protein, Current Opinion in Structural Biology ,

2003; 13: 404-411.

[58]. Tamm, L., Arora, A. and Kleinschmidt, J., Structure and assembly of beta-barrel membrane

proteins, J. Bio. Chem. , 2001; 276:32399-32402.

[59]. Tamm, L., Arora, A., and Kleinschmidt, J., Structure and assembly of beta-barrel membrane

proteins, J. Bio. Chem. , 2001; 276:32399-32402.

[60]. Humphrey, W., Dalke, A., and Schulten, K., VMD - Visual Molecular Dynamics, J. Molec.

Graphics, 1996; 14, 33-38

[61]. Rost B., and Sander C., Conservation and prediction of solvent accessibility in protein families.

Proteins. 1994; 20, 216-226.

[62]. Popot J. and Engelman D., Helical membrane protein folding, stability, and evolution. Annu. Rev.

Biochem . 2000; 69, 881-922.

[63]. Bowie J., Solving the membrane protein folding problem. Nature 2005; 438, 581-589.

108 [64]. Eyre T., Partridge L., and Thornton J., Computational analysis of α-helical membrane protein

structure: implications for the prediction of 3D structural models. Protein Engineering, Design &

Selection 2004; 17, 613-624.

[65]. Porollo A., and Meller J. Prediction-based Fingerprints of Protein Interactions, Proteins , to appear

2006.

[66]. Adamian L., Nanda V., DeGrado W., and Liang J. Empirical lipid propensities of amino acid

residues in multispan alpha helical membrane proteins, Proteins , 2005; 59, 496-509.

[67]. Adamian L. and Liang J., Prediction of buried helices in multispan alpha helical membrane

proteins. Proteins: Structure, Function, and Bioinformatics , 2006; 63, 1-5.

[68]. Smart O., Goodfellow J., and Wallace B., The Pore Dimensions of Gramicidin A. Biophysical J. ,

1993; 65, 2455-2460.

[69]. Benson D., Karsch-Mizrachi I., Lipman D., Ostell J., and Wheeler D. GenBank. Nucleic Acids

Res., 2003; 31, 23-7.

[70]. Kabsch W., and Sander C. Dictionary of protein secondary structure: pattern recognition of

hydrogen-bonded and geometrical features. Biopolymers. 1983; 22, 2577-2637.

[71]. Chothia C. The nature of accessible and buried surfaces in proteins. J. Mol. Biol. , 1976; 105, 1-14.

[72]. Czyzyk J., Mehrotra S., Wagner M., and Wright S. PCx: An interior-point code for linear

programming. Optim. Method. Softw. , 1999; 12, 397-430.

[73]. Hille, B., Ion channels of excitable membranes 3 rd , Sinauer Associates, Inc., 2001.

[74]. Le Beau, M. Molecular biology of cancer: cytogenetics. In: V. DeVita, S. Hellman and S.

Rosenberg (eds.), Cancer principles and practice of oncology , pp. 103-119. Philadelphia:

Lippincott-Raven, 1997.

[75]. Engler, D. A., Mohapatra, G., Louis, D. N., and Betensky, R. A., A pseudolikelihood approach for

simultaneous analysis of array comparative genomic hybridizations. Biostatistics , 2006; 7, 3, 399-

421.

[76]. Rouveirol, C., Stransky, N., Hupe, Ph., La Rosa, Ph., Viara, E., Barillot, E., and Radvanyi, F.

Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics,

2006; 22, 849-856.

109 [77]. Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani,

David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays.

Bioinformatics , 2001; 17, 520-525.

[78]. R Development Core Team. R: A language and environment for statistical computing. R

Foundation for Statistical Computing , 2005, Vienna, Austria. ISBN 3-900051-07-0, URL

http://www.R-project.org .

[79]. Liu, X., Sivaganesan, S., Yeung, K.Y., Guo, J., Bumgarner, R.E. and Medvedovic, M., Context-

specific infinite mixtures for clustering gene expression profiles across diverse microarray dataset.

Bioinformatics , to appear 2006.

[80]. Kanehisa, M., et. al., The KEGG resource for deciphering the genome. Nucleic Acids Res. , 2004;

32 Database issue, D277-D280.

[81]. Fawcett, T., ROC convex hull, tutorial, program and paper,

http://home.comcast.net/~tom.fawcett/public_html/ROCCH/index.html

[82]. Cao, B., Medvedovic, M., and Meller, J., Prediction of Transmembrane Domains and Pore-facing

Residues in Beta-barrel Membrane Proteins, in Applications of Statistical and Machine Learning

in Bioinformatics: a volume in the series Advances in Computational and Systems Biology , eds.

Meller, J. and Nowak, W., Peter Lang Gmbh, to appear 2006.

110