Residue Associations in Protein Family Alignments

RESIDUE ASSOCIATIONS IN PROTEIN FAMILY ALIGNMENTS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Hatice Gulcin Ozer, B.S., M.S.

The Ohio State University
2008

Dissertation Committee:
Dr. William C. Ray, Adviser
Dr. Hakan Ferhatosmanoglu
Dr. Charles Daniels
Dr. Thomas Magliery

Approved by: Adviser, Biophysics Graduate Program

© 2008 Hatice Gulcin Ozer
All Rights Reserved

ABSTRACT

The increasing amount of data on biomolecule sequences and their multiple alignments for families has promoted an interest in discovering structural and functional characteristics of proteins from sequence alone. This has commonly been addressed by using the primary sequence and highly conserved positions as key signatures of the protein, while the demand for discovering interpositional dependencies remains. However, in many cases, structural interactions between residues appear to be key players in structure and function. The growing number of sequences makes analysis and explanation of this information possible. Interpositional correlations and alternating motifs within a family alignment cannot be detected by means of consensus sequences, weight matrices, or hidden Markov models. We propose and analyze a method for detecting interpositional correlations and examine the applicability of this method to structural prediction.

In the first part of this thesis, we present the Multiple Alignment Variation Linker (MAVL) and StickWRLD to analyze biomolecule sequence alignments and visualize positive and negative interpositional residue associations [Ray, 2004; Ray, 2005; Ozer and Ray, 2006]. In the MAVL analysis system, the expected number of sequences that should share identities at a particular pair of positions is calculated from the positional probabilities, and residuals are calculated by comparing this expectation with the observed number of sequences that actually share the residues.
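The expected-versus-observed calculation described above can be illustrated with a minimal sketch. It assumes that the expected count is taken under independence of the two columns (N times the product of the two positional probabilities) and that the residual is the plain difference between observed and expected counts; the abstract does not state the exact formula, so both choices are assumptions, and the function name `pair_residuals` is hypothetical rather than part of MAVL.

```python
from collections import Counter


def pair_residuals(alignment):
    """Return {(i, a, j, b): observed - expected} for every column pair i < j.

    `alignment` is a list of equal-length aligned sequences (strings).
    Assumption: expected counts come from column independence,
    expected = N * p_i(a) * p_j(b), and the residual is unnormalized.
    """
    n = len(alignment)
    length = len(alignment[0])
    # Per-column residue counts; dividing by n gives positional probabilities.
    col_counts = [Counter(seq[i] for seq in alignment) for i in range(length)]

    residuals = {}
    for i in range(length):
        for j in range(i + 1, length):
            # Observed number of sequences carrying residue a at i and b at j.
            pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
            for a, count_a in col_counts[i].items():
                for b, count_b in col_counts[j].items():
                    # N * (count_a / N) * (count_b / N) simplifies to this.
                    expected = count_a * count_b / n
                    observed = pair_counts[(a, b)]
                    residuals[(i, a, j, b)] = observed - expected
    return residuals
```

For the toy alignment `["AC", "AC", "GT", "GT"]`, the pair A at position 0 with C at position 1 is expected once but observed twice, giving a positive residual of 1.0, while A with T is expected once but never observed, giving a negative residual of -1.0.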
Correlating pairs of residues identified from these residuals are visualized in a StickWRLD diagram. This analysis system allows us to extract additional information from the alignments, such as conditional dependencies between columns, that is not accessible to traditional column-based methods. In addition, a StickWRLD diagram enables the user to visualize the family alignment and positional dependencies in 3D and to adjust the correlation parameters.

In the second part of the thesis, we discuss methodologies to identify residue associations in protein family alignments. We discuss the use of the residuals and the phi coefficient to determine the strength of a residue association, and Fisher's exact probability test to evaluate the statistical significance of the association. We computed identitywise residue associations for 961 Pfam family alignments and examined the physical proximity and physicochemical properties of associated residues in the alignments, as well as their presence on secondary structural elements. We observed that the proximity of residues increases as the strength of association and its statistical significance increase. Specifically, associations between aromatic residues and hydrophilic residues occur in closer proximity than associations involving other physicochemical properties. Of the three parameters, the residual has the highest amino acid contact predictivity, exceeding both the phi coefficient and the statistical significance. Compared to the expected distributions, we observed larger proportions of pairs in which both residues are in a helix, one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. William C. Ray, for his mentorship and support. I also sincerely thank my dissertation committee members, Dr. Charles Daniels, Dr. Hakan Ferhatosmanoglu, and Dr. Thomas Magliery.
This research was partially supported by the US Department of Defense (DOD) Grant STTR W911NF-06-C-017¢.

VITA

2004 – 2008: Graduate Student, The Ohio State University, Columbus, OH
2001 – 2003: M.S., Biophysics, Gaziantep University, Gaziantep, Turkey
1995 – 2000: B.S., Computer Engineering, Bogazici University, Istanbul, Turkey

PUBLICATIONS

Ozer, H.G. and Ray, W.C. (2007) Informative motifs in protein family alignments. R. Giancarlo and S. Hannenhalli (Eds.): WABI 2007, LNBI 4645: 161 – 170.

Ozer, H.G. and Ray, W.C. (2006) MAVL/StickWRLD: Analyzing Structural Constraints using Interpositional Dependencies in Biomolecular Sequence Alignments. Nucleic Acids Res. 34: W133-W136.

Ray, W.C. and Ozer, H.G. (2005) Discovering Biostructure Constraints using VRML Visualization. Association for Computing Machinery SIGGRAPH, 32nd International Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, July/August 2005.

Ozer, H.G., Chen, J., Zhang, F., Yuan, B. (2005) Clustering of Eukaryotic Orthologs Based on Sequence and Domain Similarities Using Markov Graph-Flow Algorithm. M. He, G. Narasimhan and S. Petouklov (Eds.): Advances in Bioinformatics and its Applications, Proceedings of the International Conference on Bioinformatics and Its Applications (ICBA '04). World Scientific Press.

FIELDS OF STUDY

Major Field: Biophysics
Studies in Bioinformatics and Computational Biology: Dr. William C. Ray

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
VITA
LIST OF TABLES
LIST OF FIGURES

CHAPTERS

1 Background and Review of Literature
2 MAVL/StickWRLD
  2.1 Introduction
  2.2 Algorithm
  2.3 Implementation
    2.3.1 Results and Discussion
  2.4 Conclusion
3 Analysis of Residue Associations in the Pfam Database
  3.1 Introduction
  3.2 Methods
    3.2.1 Protein Family Alignments
    3.2.2 Protein Structures
    3.2.3 Calculation of Pairwise Correlations
      3.2.3.1 Calculation of Residuals
      3.2.3.2 Calculation of the Phi Coefficient
      3.2.3.3 Calculation of Statistical Significances
    3.2.4 Grouping of Amino Acids
    3.2.5 Classification of Secondary Structural Components
    3.2.6 Random Correlations
  3.3 Results
    3.3.1 Distance Distribution of Associated Residues
    3.3.2 Residue Associations in Contact
    3.3.3 Comparison of Observed and Expected Proportions of Propertywise Residue Associations and Residence on Secondary Structural Elements
  3.4 Discussion
  3.5 Conclusion
4 CONCLUSION

REFERENCES

LIST OF TABLES

Table 2.1. Color codes for sticks depicting pairwise correlations based on types of predicted interactions
