RESIDUE ASSOCIATIONS IN PROTEIN FAMILY

ALIGNMENTS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Hatice Gulcin Ozer, B.S., M.S.

* * * * *

The Ohio State University

2008

Dissertation Committee:

Dr. William C. Ray, Adviser

Dr. Hakan Ferhatosmanoglu

Dr. Charles Daniels

Dr. Thomas Magliery

Approved by

Adviser

Biophysics Graduate Program

© 2008

Hatice Gulcin Ozer

All Rights Reserved

ABSTRACT

The increasing amount of data on biomolecule sequences and their multiple

alignments for families, has promoted an interest in discovering structural and functional

characteristics of proteins from sequence alone. This has popularly been addressed the

demand for discovering interpositional dependencies using the primary sequence and highly conserved positions as key signatures of the protein. However, in many cases, structural interactions between residues appear to be key players in structure and function. The

growing number of sequences makes analysis and explanation of this information possible. It is not possible to detect interpositional correlations and alternating motifs within a family alignment by means of consensus, or weight matrix, or hidden Markov

models. We propose and analyze a method for detecting interpositional correlations and

examine the applicability of this method to structural prediction.

In the first part of this thesis, we present the Multiple Alignment Variation Linker

(MAVL) and StickWRLD to analyze biomolecule sequence alignments and visualize

positive and negative interpositional residue associations [Ray, 2004, Ray, 2005, Ozer

and Ray, 2006]. In the MAVL analysis system, the expected number of sequences that

should share identities at a particular pair of positions is calculated based on positional

probabilities, and residuals are calculated based on the observed population of sequences

actually sharing the residues. Correlating pairs of residues based on these residuals are

visualized in a StickWRLD diagram. This analysis system allows us to extract additional

information from the alignments, such as conditional dependencies between columns,

which are not accessible to traditional column-based methods. In addition, a StickWRLD

diagram enables the user to visualize the family alignment and positional dependencies in

3D and tweak the parameters of correlation.

In the second part of the thesis, we discuss methodologies to identify residue

associations in protein family alignments. We discuss the use of the residuals and the phi coefficient to determine the strength of a residue association, and Fisher’s Exact probability test to evaluate the statistical significance of the association. We computed identitywise residue associations for 961 Pfam family alignments and examined the physical proximity and physicochemical properties of associated residues in the alignments and their presence on secondary structural elements. We observed that the proximity of residues increases as the strength of association and its statistical significance increase.

Specifically, associations between aromatic residues and hydrophilic residues are present in closer proximity compared to other physicochemical properties. The amino acid contact predictivity of the residual parameter is the highest compared to the phi coefficient and the statistical significance. Compared to the expected distributions, we observed larger proportions of the pairs such that both residues are in a helix or one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region.

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. William C. Ray for his mentorship and support.

I also want to sincerely thank my dissertation committee members Dr. Charles Daniels, Dr.

Hakan Ferhatosmanoglu, and Dr. Thomas Magliery.

This research was partially supported by the US Department of Defense (DOD) Grant

STTR W911NF-06-C-017¢.

VITA

2004 – 2008 ………………………………. Graduate Student, The Ohio State University, Columbus, OH

2001 – 2003 ………………………………. M.S., Biophysics, Gaziantep University, Gaziantep, Turkey

1995 – 2000 ………………………………. B.S., Computer Engineering, Bogazici University, Istanbul, Turkey

PUBLICATIONS

Ozer, H.G. and Ray, W.C. (2007) Informative motifs in protein family alignments. R. Giancarlo and S. Hannenhalli (Eds.): WABI 2007, LNBI 4645: 161 – 170.

Ozer, H.G. and Ray, W.C. (2006) MAVL/StickWRLD: Analyzing Structural Constraints using Interpositional Dependencies in Biomolecular Sequence Alignments. Nucleic Acids Res. 34: W133-W136.

Ray, W.C. and Ozer, H.G. “Discovering Biostructure Constraints using VRML Visualization”, Association for Computing Machinery SIGGRAPH, 32nd International Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA. July/August 2005.

Ozer, H.G., Chen, J., Zhang, F., Yuan, B. (2005) Clustering of Eukaryotic Orthologs Based on Sequence and Domain Similarities Using Markov Graph-Flow Algorithm. M. He, G. Narasimhan and S. Petoukhov (Eds.): Advances in Bioinformatics and its Applications, Proceedings of the International Conference on Bioinformatics and Its Applications (ICBA '04). World Scientific Press.

FIELDS OF STUDY

Major Field: Biophysics

Studies in Bioinformatics and Computational Biology: Dr. William C. Ray

TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGMENTS

VITA

LIST OF TABLES

LIST OF FIGURES

CHAPTERS

1 Background and Review of Literature

2 MAVL/StickWRLD

2.1 Introduction

2.2 Algorithm

2.3 Implementation

2.3.1 Results and Discussion

2.4 Conclusion

3 Analysis of Residue Associations in the Pfam Database

3.1 Introduction

3.2 Methods

3.2.1 Protein Family Alignments

3.2.2 Protein Structures

3.2.3 Calculation of Pairwise Correlations

3.2.3.1 Calculation of Residuals

3.2.3.2 Calculation of the Phi Coefficient

3.2.3.3 Calculation of Statistical Significances

3.2.4 Grouping of amino acids

3.2.5 Classification of Secondary Structural Components

3.2.6 Random Correlations

3.3 Results

3.3.1 Distance distribution of associated residues

3.3.2 Residue Associations in Contact

3.3.3 Comparison of Observed and Expected Proportions of Propertywise Residue Associations and Residence on Secondary Structural Elements

3.4 Discussion

3.5 Conclusion

4 CONCLUSION

REFERENCES

LIST OF TABLES

Table 2.1. Color codes for sticks depicting pairwise correlations based on types of predicted interactions.

Table 2.2. Amino acid properties and favorable interactions.

Table 2.3. Amino acid property values that are applied by the ‘Order residues by’ interface option.

Table 2.4. Grouping schema applied by ‘Group residues by’ interface option.

Table 3.1. DSSP codes and their definitions for secondary structure assignments.

LIST OF FIGURES

Figure 1.1. Popular methods to model biosequence family alignments.

Figure 2.1. Entrance page of MAVL/StickWRLD web site.

Figure 2.2. StickWRLD diagram and interface controls.

Figure 2.3. StickWRLD graph representation of Integrin_alpha family.

Figure 2.4. Pairwise correlations displayed in StickWRLD graph for total residual cutoffs of (A) 0.100, (B) 0.075, (C) 0.050, and (D) 0.025.

Figure 2.5. Input sequence alignment screen.

Figure 2.6. StickWRLD representation for the Pfam family of adenylate kinase active site lid (Pfam ID: ADK_lid, accession no. PF05191).

Figure 2.7. Three-dimensional structures of ADK_lid domains from (A) Bacillus stearothermophilus (PDB: 1ZIP) and (B) Bovine (PDB: 1AK2).

Figure 2.8. Pfam ADK_lid family alignment.

Figure 2.9. Sequence logo of Pfam ADK_lid family alignment.

Figure 2.10. Sequence search results from http://pfam.sanger.ac.uk/.

Figure 2.11. StickWRLD graph representations of the integrin alpha cytoplasmic domain from Pfam family alignment, and a corresponding 3D structure.

Figure 3.1. Histogram of the number of sequences for all Pfam v22.0 families.

Figure 3.2. Histogram of the length of sequences for all Pfam v22.0 families.

Figure 3.3. Histogram of the number of distinct PDB references for all Pfam v22.0 families.

Figure 3.4. Histogram of the number of sequences for Pfam families examined in this study (961 families).

Figure 3.5. Histogram of the length of sequences for Pfam families examined in this study (961 families).

Figure 3.6. Histogram of the number of distinct PDB references for Pfam families examined in this study (961 families).

Figure 3.7. Maximum and minimum possible values of the residual parameter (RNiMj) for given probabilities of P(Ni) and P(Mj).

Figure 3.8. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus residual parameter and number of sequences in the analyzed families.

Figure 3.9. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus the residual parameter and sequence length of the analyzed families.

Figure 3.10. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus phi coefficient and number of sequences in the analyzed families.

Figure 3.11. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus phi coefficient and sequence length of the analyzed families.

Figure 3.12. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus base-10 logarithm of statistical significance and the number of sequences in the analyzed families.

Figure 3.13. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus base-10 logarithm of statistical significance and sequence length of the analyzed families.

Figure 3.14. Hydrophobic similarity matrix by George et al. [1990] and grouping of amino acids based on their hydrophobic similarity score.

Figure 3.15. The distribution of the number of amino acids in each property group for the sequence collection of 961 protein families analyzed in this study.

Figure 3.16. The distribution of the secondary structural elements for the 2,972 structures that are referred by 961 protein families analyzed in this study.

Figure 3.17. The distribution of the aggregated secondary structural elements for the 2,972 structures that are referred by 961 protein families analyzed in this study.

Figure 3.18. The percentage of random correlations that are in close contact and not in contact.

Figure 3.19. Box plots of the closest distances (left) and relative distances (right) between random amino acid pairs.

Figure 3.20. Box plots of the closest distances between associated residues against phi coefficient.

Figure 3.21. Box plots of the relative distances between associated residues against phi coefficient.

Figure 3.22. Box plots of the closest distances between associated residues against phi coefficient for different residue properties.

Figure 3.23. Box plots of the closest distances between associated residues against the base-10 logarithm of statistical significance.

Figure 3.24. Box plots of the relative distances between associated residues against base-10 logarithm of statistical significance.

Figure 3.25. Box plots of the closest distances between associated residues against the base-10 logarithm of statistical significance for different residue properties.

Figure 3.26. Box plots of the closest distances between associated residues against residual.

Figure 3.27. Box plots of the relative distances between associated residues against residual.

Figure 3.28. Histogram of the number of correlations between aggregate physical properties for the Pfam families examined in this study (961 families) versus phi coefficient.

Figure 3.29. Histogram of the number of correlations between aggregate physical properties for the Pfam families examined in this study (961 families) versus the base-10 logarithm of the statistical significance.

Figure 3.30. Histogram of the number of correlations between aggregate physical properties for the Pfam families examined in this study (961 families) versus the residual.

Figure 3.31. The percentage of residue contacts versus phi coefficient.

Figure 3.32. The percentage of residue contacts versus phi coefficient (Statistical Significance < 0.0001).

Figure 3.33. The percentage of residue contacts versus statistical significance.

Figure 3.34. The percentage of residue contacts versus residual parameter.

Figure 3.35. The percentage of residue contacts grouped by chemical properties versus phi coefficient.

Figure 3.36. The percentage of residue contacts grouped by chemical properties versus phi coefficient for highly significant pairs (Statistical Significance < 0.0001).

Figure 3.37. The percentage of residue contacts grouped by chemical properties versus statistical significance.

Figure 3.38. The percentage of residue contacts grouped by chemical properties versus the residual parameter.

Figure 3.39. The number of correlations in contact with other chains against the phi coefficient (left), base-10 logarithm of statistical significance (middle) and residual (right) in 564 protein families that are homo-oligomers.

Figure 3.40. Observed and expected distributions of the propertywise correlations against the phi coefficient.

Figure 3.41. The observed and expected distributions of the propertywise correlations against the base-10 logarithm of the statistical significance.

Figure 3.42. The observed and expected distributions of the propertywise correlations against the residual.

Figure 3.43. The observed and expected distributions of the residence of correlated residues on secondary structural elements against the phi coefficient.

Figure 3.44. The observed and expected distributions of the residence of correlated residues on secondary structural elements against the base-10 logarithm of the statistical significance.

Figure 3.45. The observed and expected distributions of the residence of correlated residues on secondary structural elements against the residual.

Figure 3.46. Three dimensional scatter plot of the phi coefficient, the base-10 logarithm of statistical significance and the number of sequences in the family alignment.

Figure 3.47. Three dimensional scatter plot of the residual, the base-10 logarithm of statistical significance and the number of sequences in the family alignment.

Figure 3.48. Three dimensional scatter plot of the phi coefficient, the residual and the number of sequences in the family alignment.

CHAPTER 1

1 Background and Review of Literature

Multiple sequence alignment methods were introduced in the early 1980s in order to

examine common sequence features across a whole family. When all of the sequences are

considered together, shared properties, typically characterized as identical, or

biochemically similar protein or nucleic acid residues, can be identified. Conserved sequence regions across a group of sequences are hypothesized to be evolutionarily related. Such conserved regions in a group of functionally related sequences can be used in conjunction with biochemical information to determine structurally and functionally critical positions in the family. Therefore, multiple sequence alignments have become an essential tool for understanding biomolecules’ structure and function [Edgar and

Batzoglou, 2006].

Computational and statistical methods are used to build a model to concisely summarize and describe the important characteristics of a known family of biosequences.

The simplest method for describing an alignment is the consensus sequence, where a diagnostic sequence motif is built by concatenating the most frequently occurring identities at each position of the alignment. However, this simple method does not represent all of the sequences in the alignment since information about occasionally or

rarely occurring identities is lost. Schneider [2002] explains how deceptive a simple consensus analysis can be for describing families that are composed of distant relatives.
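
The consensus construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `consensus` and the toy alignment are hypothetical and are not taken from any tool cited in this chapter.

```python
from collections import Counter

def consensus(alignment):
    """Return the most frequent character in each column of an alignment."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        # most_common(1) returns [(character, count)] for the top character;
        # ties are resolved by first appearance in the column.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)

# Hypothetical toy alignment (not a real family).
toy_alignment = ["ACGT", "ACGA", "TCGA", "ACCA"]
toy_consensus = consensus(toy_alignment)
print(toy_consensus)
```

Note that the T in the first column and the C in the third column leave no trace in the result, which is the information loss discussed above.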

To overcome the limitations of the consensus sequence, Staden [1984] described the

Position Weight Matrix (PWM) by modeling the nucleotide distributions at each position using purely frequentist statistics. Gribskov et al. [1987, 1990] proposed the usage of a position specific scoring matrix (PSSM) or profile analysis to describe protein families and detect distantly related family members. In both of these models, information in the sequence alignment is expressed in a matrix, which lists the frequency of each identity at each position of the alignment. Since weight matrix methods take into account all possibilities at each position, they completely represent the positional possibilities of the aligned sequence family. To identify new members of the family, the test sequence is aligned to the weight matrix and its likelihood score is calculated by summing up the weights for the identity aligned in each position.
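
A minimal sketch of building and applying a weight matrix follows. It is not the formulation of Staden [1984] or Gribskov et al. [1987, 1990]: the log-odds weights, the uniform background, the pseudocount of 1, and all names (`position_weight_matrix`, `score`, `toy_alignment`) are assumptions made for illustration.

```python
import math
from collections import Counter

def position_weight_matrix(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build per-column log-odds weights against a uniform background.

    The pseudocount and uniform background are illustrative assumptions.
    """
    n_seqs = len(alignment)
    background = 1.0 / len(alphabet)
    total = n_seqs + pseudocount * len(alphabet)
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        weights = {}
        for ch in alphabet:
            freq = (counts[ch] + pseudocount) / total
            weights[ch] = math.log2(freq / background)
        matrix.append(weights)
    return matrix

def score(matrix, sequence):
    """Sum the weights of the identity aligned at each position."""
    return sum(col[ch] for col, ch in zip(matrix, sequence))

# Hypothetical toy alignment (not a real family).
toy_alignment = ["ACGT", "ACGA", "TCGA", "ACCA"]
pwm = position_weight_matrix(toy_alignment)
print(score(pwm, "ACGA"), score(pwm, "TTTT"))
```

Because the score is a sum over columns, each position contributes independently of the identities at every other position.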

One disadvantage of the weight matrix methods is that they assume the positions are independent, that is, that the appearance of an identity in a column is not influenced by identities in other columns. For certain types of sequences this assumption is trivially false. Zhang and Marr [1993] proposed the weight array model (WAM) to partially overcome this problem. They generalized the positional weight matrix by allowing for dependencies between adjacent positions. However, in real sequences, spatial proximity is more important than sequential proximity, and WAM cannot capture this information. Another problem with these models is that, although they are very descriptive for the given family alignment, they cannot be interpreted visually, which reduces their utility for conveying information to the user.

To address the visualization deficit of weight matrix methods, Schneider and

Stephens [1990, Shaner et al., 1993] introduced Sequence Logos for graphical representation of amino acid or nucleic acid multiple sequence alignments. The method begins by generating a weight matrix from the frequencies of each nucleotide or amino acid at each position of the aligned sequences. Then for each position the characters representing the sequence are stacked on top of each other. The height of each letter is made proportional to its frequency, and the letters are sorted so that the most common one is on top. The height of the entire stack is then adjusted to signify the information content of the sequences at that position. From this representation one can determine not only the consensus sequence but also the relative frequency of bases and the information content at every position. The logo displays both significant residues and subtle sequence patterns. Therefore, this is a very informative and visually interpretable model for visualizing sequence alignments. In addition to family description, methods have been developed to determine the sequence conservation of individual sequences in a family alignment. These individual information distributions can be used in combination with

PWM scoring to improve search for new members of the family [Schneider, 1997].
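
The stack and letter heights described above can be sketched as follows. This assumes the usual Shannon-entropy definition of information content and omits the small-sample correction used in practice; the function names are hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed frequencies; no small-sample correction is applied.
    """
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

def logo_heights(column, alphabet_size=4):
    """Height of each letter: its frequency times the column information."""
    info = column_information(column, alphabet_size)
    n = len(column)
    return {ch: (c / n) * info for ch, c in Counter(column).items()}

print(column_information("AAAA"))  # fully conserved column
print(column_information("ACGT"))  # all four bases equally frequent
print(logo_heights("AAAC"))
```

A fully conserved nucleotide column reaches the maximum of 2 bits, while a column with all four bases equally frequent carries no information and would be drawn with zero height.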

Sequence Logos are very good at representing family consensus and displaying the information content at each position when there is no positional dependency. However, positional dependencies remain unaddressed: in families with alternating identities with strong dependencies (e.g. stem identities in an RNA stem-loop structure), Sequence

Logos will neglect these positions by displaying their information content close to zero.

To overcome this problem Beitz [2005] introduced subfamily logos to visualize subfamily specific sequence deviations. The display is similar to classical sequence logos.

Residues that are characteristic of the subfamily are displayed as upright characters

and residues typical for the remaining sequences are displayed upside-down. This representation is very informative to understand subfamily features, but still cannot visualize positional dependencies.
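
The limitation of column-based displays can be made concrete with a toy example. The sketch below compares the observed frequency of each pair of identities at two positions with the frequency expected if the positions were independent. Taking the difference as a "residual" is an assumption made here for illustration and is not necessarily the exact residual parameter defined later in this thesis; the alignment and the function name are hypothetical.

```python
from collections import Counter

def pair_residuals(alignment, i, j):
    """Observed minus expected pair frequency for columns i and j.

    Expected frequency assumes independence: P(a at i) * P(b at j).
    """
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    residuals = {}
    for a in col_i:
        for b in col_j:
            observed = pairs[(a, b)] / n
            expected = (col_i[a] / n) * (col_j[b] / n)
            residuals[(a, b)] = observed - expected
    return residuals

# Hypothetical stem-like alignment: positions 0 and 3 always pair
# (G with C, C with G, A with U, U with A).
toy_stem = ["GAAC", "CAAG", "AAAU", "UAAA"]
residuals = pair_residuals(toy_stem, 0, 3)
print(residuals[("G", "C")], residuals[("G", "G")])
```

In this toy alignment, positions 0 and 3 each contain all four bases at equal frequency, so a logo would show zero information at both positions, yet the paired identities have positive residuals and the unobserved combinations have negative ones.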

Hidden Markov model (HMM) techniques were introduced to the computational biology community by Krogh et al. [1994] and gained increasing acceptance as a means of sequence modeling, multiple alignment and profiling [Baldi et al., 1994, Eddy et al.,

1995]. An HMM is a statistical model that describes a probability distribution over a potentially infinite number of sequences. Profile HMMs [Krogh et al., 1994] are the hidden

Markov model equivalent of profile analysis. A linear profile HMM for a family of nucleotide or amino acid sequences is a set of nodes that correspond to columns in a multiple alignment. Typically each node will have three states: match, insert, and delete.

A ‘match’ state models the distribution of nucleotides or amino acids allowed in the column. An ‘insert’ state and ‘delete’ state at each column allow for insertion of one or more bases or residues between that column and the next, or for deleting the consensus.

Probabilities are associated with each character match and each transition between states.

An HMM can answer two questions: for a given observed sequence, which model is most likely to explain the data; and for a given sequence and a given model, what is the most likely reconstruction of the path through the states [Eddy, 1998, Birney, 2001].

HMMs can successfully represent some family alignments and can be used to identify very distant relatives.

Although HMMs theoretically provide the best possibility for modeling positional dependencies, practically HMMs are limited to near neighbor dependencies. This is due

to the branching structure of HMMs, which limits the distance over which interpositional

relationships can be modeled to the “depth” of the HMM. Long-range interdependencies

approach intractability, and crossing dependencies are impossible to represent in the HMM

paradigm [Eddy, 2004]. Additionally, while HMMs may accurately model certain

features of an alignment, it is impractical to tease human understanding of the represented

motifs back out of the HMM model.

Another popular method to describe sequence family alignments is the discovery of

sequence patterns that are common to (matches) all, or most of the sequences in the set.

Sequence patterns representing a single motif can be encoded in regular expressions

which represent features by logical combinations of characters. This model performs best

when a given family can be characterized by a highly conserved single motif. Regular

expressions possess two main limitations. First, they lose information, since only the

features that are judged to be the most conserved or significant from the given alignment

are modeled. Secondly, they are deterministic models, i.e. matches have to be exact [Brazma

et al., 1997, Orengo et al., 2003]. To overcome this second drawback, fuzzy regular

expressions can be built by incorporating information on shared biochemical properties of amino acids.

Although this approach allows us to detect more distant relatives, it also increases the

chance of matches without any biological significance [Orengo et al., 2003].
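
The difference between an exact and a fuzzy regular expression can be sketched with Python's standard `re` module. The motif, the residue groups, and the test sequences below are hypothetical and chosen only to show the behavior; they do not correspond to any real protein family.

```python
import re

# Hypothetical exact motif: C, any two residues, C, then one of L, I, V, M.
exact_motif = re.compile(r"C..C[LIVM]")

# Hypothetical "fuzzy" variant: each constrained position accepts a group of
# biochemically similar residues instead of a single identity.
fuzzy_motif = re.compile(r"[CS]..[CS][LIVMFA]")

sequences = ["AACKTCLGG", "AASKTCAGG", "AAGKTGLGG"]

exact_hits = [s for s in sequences if exact_motif.search(s)]
fuzzy_hits = [s for s in sequences if fuzzy_motif.search(s)]

print(exact_hits)
print(fuzzy_hits)
```

The exact pattern accepts only the first sequence, while the fuzzy pattern also accepts the second. Widening the residue groups further would admit still more sequences, including ones with no biological relationship to the family.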

All of these commonly used models for describing family alignments have their own

advantages and disadvantages. Figure 1.1 depicts popular methods to model family alignments.

Figure 1.1 Popular methods to model biosequence family alignments.

Integrin alpha cytoplasmic region (Pfam accession number: PF00357) alignment. a) Colored alignment by using the ClustalX coloring scheme in Jalview [Clamp et al., 2004]. b) Multi-level consensus sequence showing the most conserved letter(s) at each motif position by MEME [Bailey and Gribskov, 1998]. c) Position specific probability matrix, by MEME, specifies the probability of each possible letter appearing at each position. In order to make it easier to see which letters are most likely in each of the columns of the alignment, the simplified representation shows the letter probabilities multiplied by 10 and rounded to the nearest integer. d) The Sequence Logo, by WebLogo [Crooks et al., 2004], shows all possible residues present at each alignment position as a stack of letters. The overall height of the stack indicates the sequence conservation at that position, while the height of each letter within the stack indicates the relative frequency of that amino acid at that position. e) Information content diagram, by MEME, provides an idea of which positions are most highly conserved. Each position can be characterized by the amount of information it contains (measured in bits). Highly conserved positions have high information, while positions that have all letters equally likely have information close to zero.

Consensus sequence models are applicable only to short, highly conserved sequences. Deterministic models like regular expressions are simpler and easier to interpret than probabilistic patterns like profiles and HMMs. On the other hand, probabilistic models have more modeling power. Unfortunately, all of these popular approaches inherently assume that positions within the alignment are independent of each other. This assumption is not sound for alignments of proteins and structural RNAs, since there are residue contact and base-pairing requirements, and preferences to build certain structures. Several methods have been proposed to incorporate position-dependence information into the description of family alignments.

Numerous studies have explored dependencies within nucleotide and amino

acid sequences. Burge and Karlin [1997] presented a maximal dependence decomposition

(MDD) procedure to generate a model which captures the most significant dependencies between positions from an aligned set of signal sequences of moderate to large size.

Agarwal and Bafna [1998] suggested the tree network model to detect non-adjacent correlations within signals in DNA. Bulyk et al. [2002] determined the effects of sequence variations in transcription factor binding sites using microarray binding experiments. Ellrott et al. [2002] presented a novel hidden Markov model method, which tries to capture dependencies between non-adjacent positions using a position reordering method. The mixture of positional weight matrix and tree-based Bayesian network models developed by Barash et al. [2003] can capture positional dependencies within motifs. Xing et al. [2003] developed the hidden Markov Dirichlet-Multinomial (HMDM) model for motif alignment, which captures site dependencies inside the motifs and incorporates prior knowledge of nucleotide distributions of all motif sites from

biologically known motifs. Osada et al. [2004] introduced per-position information

content and local pairwise nucleotide dependencies to improve the motif search

performance. Unfortunately, such advanced techniques are either motif specific or

incorporate prior biological knowledge, and they are not integrated into current popular

modeling tools and motif discovery algorithms [Hu et al., 2005].

Prediction of correlated mutations from a protein family alignment is another popular

research area [Taylor and Hatrick, 1994, Afonnikov and Kolchanov, 2004]. Discovered correlated mutations are generally used to predict residue contacts [Casari et al., 1995,

Olmea et al., 1999, Pollastri et al., 2001]. These types of studies give insight into understanding properties of amino acid sequence alignments and possible ways to explore dependencies within them.

In the mid-90s, correlation based methods were proposed to identify compensating changes between residues at positions in multiple sequence alignments [Altschuh et al.,

1987, Taylor and Hatrick, 1994, Gobel et al., 1994, Tuffley and Steel, 1998, Pritchard et al., 2001, Tillier and Lui, 2003, Afonnikov and Kolchanov, 2004]. In correlation analysis, the underlying idea is to define the correlation between the residue conservation patterns of the two columns in a sequence alignment for intra-molecular analysis, or to define the correlation between the distributions of two columns of different sequence

alignments for intermolecular analysis. The measurements proposed for correlated

mutations are highly dependent on residue conservation scoring matrices. There are

numerous diverse and sophisticated studies to define both correlated mutation

measurements [Halperin et al., 2006] and residue conservation scoring [Valdar, 2002].

Correlated mutations have been extensively studied to predict both intra- and

intermolecular contacts [Lapedes et al., 1993, Gobel et al., 1994, Shindyalov et al., 1994,

Hatrick and Taylor, 1994, Thomas et al., 1996, Olmea and Valencia, 1997,

Chelvanayagam et al., 1997, Pazos et al., 1997, Pollock et al., 1999, Olmea et al., 1999,

Larson et al., 2000, Fariselli et al., 2001, Rigden, 2002, Nemoto et al., 2004, Kundrotas

and Alexov, 2006]. Most of these studies were developed or applied to either a specific

protein family or a small group of protein families. For instance, Larson et al. [2000]

applied covariation analysis on the SH3 domain sequence alignment to predict tertiary

contacts and design compensating hydrophobic core substitutions, while Nemoto et al.

[2004] employed covariation analysis to detect pairwise residue proximity for G-protein-coupled receptors. The results of early studies were bound by the limited number of known structures available at the time. Gobel et al. [1994] generated contact maps for protein families based on correlated mutation analysis and evaluated the prediction accuracy of these contact maps on only 11 protein families. The correlated mutations method proposed by Shindyalov et al. [1994] was based on a statistical analysis of the distribution of mutations in the branches of a phylogenetic tree. They analyzed pairs of positions with correlated mutations in 67 protein families. Fariselli et al. [2001] proposed incorporating neural networks into correlated mutation analysis to predict inter-residue contacts of proteins and evaluated their results on 173 non-homologous protein structures. More recently, Kundrotas and Alexov [2006] reported a new implementation of a correlated mutations method with an added set of selection rules (filters). The

parameters of the algorithm were optimized against 15 non-homologous high resolution

structures and tested on 65 non-homologous high resolution structures. All of these

studies reported improved residue contact prediction in protein structures. Although

several correlated mutation measurements yield reasonable accuracy for predicting

residue contacts for some families, general reviews point out that current methodologies

of correlated mutations analysis are not suitable for large scale residue contact prediction

[Halperin et al., 2006, Pollock and Taylor, 1997].

Besides pure residue contact prediction approaches based on covariation analysis,

there are more generalized approaches to understand the meaning of correlations in family

alignments. Olmea et al. [1999] effectively used sequence conservation and correlation

to understand localization of residues on the structure and to improve the fold recognition

of threading approaches. Rigden [2002] proposed the use of covariance analysis for the

prediction of structural domain boundaries from multiple protein sequence alignments

and employed 52 non-homologous two-domain protein structures to train and evaluate

the proposed method. Socolich et al. [2005] attempted to define the sequence rules for

specifying a protein fold by computationally creating artificial protein sequences using

only statistical information encoded in a multiple sequence alignment and no tertiary

structure information. By experimental testing of libraries of artificial WW domain

sequences, the authors have shown that a simple statistical energy function capturing coevolution between amino acid residues is necessary and sufficient to specify sequences that fold into native structures. We found the approach and results of this study quite valuable and a good step towards understanding the effect of residue correlations on structure and function stability; however, it is hard to generalize the proposed method and the results from such a small protein domain (sequence length of alignment: 31).

Finally, Bayesian networks have been used to model dependencies between positions

in protein motifs that were aligned according to 3-D structure [Klingler and Brutlag,

1994]. Recently, the maximal dependence decomposition (MDD) algorithm [Burge and

Karlin, 1997] was applied to group protein phosphorylation substrates into subgroups

[Huang et al., 2005]. This kinase-specific phosphorylation site prediction tool works with

both high sensitivity and specificity. However, these and other similar studies suffer from

either being too specific or from dependence on additional information such as structure.

Although the results of the above-mentioned correlation studies in protein family

alignments are either subject-specific or tested on a small dataset, and cannot be

generalized, such studies confirm the significance of amino acid correlations in protein

structure and function. Therefore, common tools to analyze and visualize correlations for

a given family alignment become a necessity in the field. Afonnikov and Kolchanov

[2004] introduced CRASP, an internet-available software for detection and analysis of

correlated residue substitutions in multiple alignments of protein families. The approach

is based on estimation of the correlation coefficient between the values of a physicochemical parameter at a pair of positions of sequence alignment. The main result

of the method is a correlation matrix enabling the user to examine pairwise relationships

between amino acid substitutions at protein sequence positions, and to estimate the

contribution of the coordinated substitutions to the evolutionary invariance or variability

in integral protein physicochemical characteristics such as the net charge of protein

residues and hydrophobic core volume.

We developed the MAVL/StickWRLD analysis to display and visualize positional

dependencies discovered in nucleic acid [Ray, 2004] and amino acid [Ray, 2005, Ozer

and Ray, 2006] family alignments. In the analysis system, the expected number of

sequences that should share identities at a particular pair of positions is calculated based

on positional probabilities, and residuals are calculated based on the observed population

of sequences actually sharing the residues. Correlating pairs of residues based on these

residuals are visualized in a StickWRLD diagram. This approach differs from correlated

mutation analysis by examining identitywise correlations between the columns of a

family alignment, for every possible identity combination in every pair of columns, rather

than simply for each column. This allows us to extract additional information from the

alignments, such as conditional dependencies between columns, which are not accessible

to traditional column-based methods. In addition, a StickWRLD diagram enables the user

to visualize the family alignment and positional dependencies in 3D and tweak the

parameters of correlation.

In the first part of this thesis, I will explain the underlying algorithm of the

MAVL/StickWRLD tool and the implementation details. I will demonstrate its usage on an example and discuss the results.

In the next chapter, I will discuss the appropriate statistical methods to extract

positional dependencies in family alignments. Then, I will investigate physical proximity,

physicochemical properties and residence on secondary structural elements for associated

residues by examining 961 protein families from Pfam [Finn et al., 2006] database and

Pfam-related structures in PDB [Berman et al., 2000] database.


CHAPTER 2

2 MAVL/StickWRLD

2.1 Introduction

The increasing availability of structurally aligned protein families has made it possible to

use statistical methods to discover regions of interpositional relationships of residue

identity. Such dependencies amongst residues often have structural or functional

implications, and their discovery can supply valuable constraints that assist in the

refinement of measured, or predicted molecular structure assignments.

Calculation of interpositional dependencies produces a four-dimensional matrix of

positions, identities, related positions and strengths of relationship. For a small RNA

molecule of 80 nucleic acids, this 4D matrix contains 160,000 values, while a small

protein of 300 amino acids produces a cross-correlation matrix containing 39,690,000

values. Since there is no single value or distinctly definable pattern amongst these values

that signals a statistically interesting relationship, an abstraction that allows the expert

researcher to visualize, interpret and query the matrix is clearly necessary [Ray and Ozer,

2005]. Multiple Alignment Variation Linker (MAVL) and StickWRLD were developed

to analyze and visualize interpositional dependencies within nucleic acid and amino acid

sequence alignments [Ray, 2004, 2005]. In this representation, a positional weight matrix

representing the alignment is wrapped around a cylinder by placing spheres for each

identity and position. The diameters of the spheres are proportional to the percentage

frequency of the corresponding identity. Statistically significant pairwise correlations are

depicted as sticks between the related pairs of spheres. The diameter of the sticks is based

on the over-representation or under-representation of sequences sharing these identities as

compared with a weight-matrix-based expectation for the sequences as a family. This

visualization method allows the researcher to detect patterns of pairwise positional

dependencies easily, and comment on possible implications of these observations.
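The matrix sizes quoted above can be reproduced with a short calculation. The sketch below assumes that each alignment column carries five possible identities for nucleic acids (four bases plus a gap) and twenty-one for proteins (twenty amino acids plus a gap); these alphabet sizes are inferred from the quoted totals rather than stated in the text, and the function name is hypothetical.

```python
def correlation_matrix_size(n_positions: int, n_identities: int) -> int:
    """Count the cells of a (position, identity) x (position, identity) matrix."""
    cells_per_axis = n_positions * n_identities
    return cells_per_axis ** 2

# Assumed alphabet sizes: 4 bases + gap = 5, 20 amino acids + gap = 21.
print(correlation_matrix_size(80, 5))    # 160000
print(correlation_matrix_size(300, 21))  # 39690000
```

The quadratic growth in both alignment length and alphabet size is what makes a direct inspection of the raw values impractical.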

We completely redesigned and rewrote Ray’s [2004] original VRML implementation

of MAVL/StickWRLD to function as a platform-independent Java applet, with real-time

dynamic controls that enable much more intuitive exploration of, and interaction with, the

data. This implementation was also upgraded to enable visualization of a range of aggregate

residue properties, an extensive database of pre-computed StickWRLD diagrams based

on Pfam families [Finn et al., 2006] and visualization of correlations on known PDB

structures [Bateman et al., 2004]. As a Java applet these features are now available

directly from the interface, and do not require reprocessing of the data on the server to

accommodate every representational change.

2.2 Algorithm

Calculation of interpositional dependencies is explained in detail by Ray [2004]. MAVL calculates the residuals (the difference between the observed probability and the expected probability) for every pair of residues in every pair of columns of the family alignment.

First, a positional probability matrix is generated for the subject family alignment. Then, the expected number of sequences that should share identities at a particular pair of

positions is calculated based on the positional probabilities. To obtain a residual for every

possible pair of positions and identities in the alignment, this expected value is subtracted

from the observed number of sequences with that characteristic.

The expected value $E_{N_i,M_j}$ is the number of sequences that would be predicted to share

identity N at position i and identity M at position j. That is,

$$E_{N_i,M_j} = p_{N_i} \times p_{M_j} \times TS$$

where $p_{N_i}$ is the probability of finding identity N at position i, $p_{M_j}$ is the probability of finding

identity M at position j, and TS is the total number of sequences under observation. The

observed value $O_{N_i,M_j}$ is the actual number of sequences that share identity N at position i

and identity M at position j. Then, magnitudes of residuals are calculated as:

$$R_{N_i,M_j} = (O_{N_i,M_j} - E_{N_i,M_j}) / TS$$
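The calculation of expected counts, observed counts and residuals can be illustrated on a toy alignment. The following is a minimal sketch, not the MAVL implementation (which is a C program): the function names are hypothetical, columns are indexed from zero, and the four-sequence alignment is invented for illustration.

```python
from collections import Counter

def positional_probabilities(alignment):
    """Return one {identity: probability} dict per alignment column."""
    total = len(alignment)
    n_cols = len(alignment[0])
    return [
        {res: count / total
         for res, count in Counter(seq[col] for seq in alignment).items()}
        for col in range(n_cols)
    ]

def residual(alignment, i, n, j, m):
    """Return (O, E, R) for identity n at column i and identity m at column j."""
    ts = len(alignment)  # TS, the total number of sequences
    probs = positional_probabilities(alignment)
    expected = probs[i].get(n, 0.0) * probs[j].get(m, 0.0) * ts
    observed = sum(1 for seq in alignment if seq[i] == n and seq[j] == m)
    return observed, expected, (observed - expected) / ts

# Hypothetical alignment with two perfectly coupled columns.
toy = ["CC", "CC", "HS", "HS"]
print(residual(toy, 0, "C", 1, "C"))  # (2, 1.0, 0.25)
print(residual(toy, 0, "C", 1, "S"))  # (0, 1.0, -0.25)
```

In this toy case the pair (C, C) is observed twice as often as the positional probabilities predict, giving a positive residual, while the pair (C, S) is never observed and gives a negative residual of the same magnitude.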

The binomial distribution can be used to calculate the likelihood of observing any

particular outcome when the expected probabilities are known in a population. Using the

binomial distribution we can calculate the probability of observing a particular outcome x

times out of n number of trials. If the probability of getting this particular outcome

(success) on an individual trial is P, then the binomial probability is:

$$b(x, n, P) = \binom{n}{x} P^{x} (1 - P)^{n - x}$$

The statistical significance, $\alpha_{N_i,M_j}$, of a residual is calculated as the sum of the binomial

probabilities from $O_{N_i,M_j}$ to TS successes in TS trials where the probability of success is

$E_{N_i,M_j}/TS$ per trial. That is,

$$\alpha_{N_i,M_j} = \sum_{i=k}^{n} \frac{n!}{i!\,(n-i)!}\, p^{i} (1 - p)^{n-i}$$

where n is the number of trials (TS), k is the number of successes ($O_{N_i,M_j}$) and p is the probability of success ($E_{N_i,M_j}/TS$). In this sum, i is the summation index rather than an alignment position.
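The significance calculation is an upper-tail binomial sum, which can be sketched directly from the definition. The function names below are hypothetical, and the sketch covers only the sum from the observed count to TS as described in the text; it uses `math.comb` from the Python standard library.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def significance(observed, expected, ts):
    """Sum of binomial probabilities from `observed` to `ts` successes in `ts` trials."""
    p = expected / ts  # per-trial probability of success, E / TS
    return sum(binomial_pmf(i, ts, p) for i in range(observed, ts + 1))

# Hypothetical case: 2 sequences observed where 1 is expected out of 4.
alpha = significance(observed=2, expected=1.0, ts=4)
print(alpha)  # 0.26171875
```

With so few sequences the observed excess is not significant; the same twofold excess in a larger family would give a much smaller value.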

The residual and its statistical significance are used in the implementation to restrict the visualization to pairwise correlations that score better than user-selected cutoffs. The residual parameter is divided into three independently filterable residual types: total residual (Tr), partial positive residual (Pr) and negative residual (Nr). This was done to allow the user more fine-grained control of the visualization. Each parameter allows the user to control the display of a different subset of the positional correlations, each with a different biological interpretation. The total residual cutoff restricts the display to correlations where all residuals meet:

$$Tr \le |O_{N_i,M_j} - E_{N_i,M_j}| / TS$$

That is, the correlations where the observed population differs significantly from the expected population. The partial positive residual cutoff restricts the display of correlations where positive residuals meet:

$$Pr \le (O_{N_i,M_j} - E_{N_i,M_j}) / E_{N_i,M_j}$$

That is, the correlations where a large fraction of the correlation population is part of the residual (a correlation between two very under-represented positives may be quite significant while not meeting the absolute magnitude requirements of the Tr cutoff).

Finally, negative residual enables the display of correlations where negative residuals meet:

$$Nr \le |O_{N_i,M_j} - E_{N_i,M_j}| / TS$$

The use of all three residual controls allows positive and negative relationships to be displayed with equal weight, or either positive or negative relationships to be masked to improve the visualization of the other.
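The three residual cutoffs can be sketched as independent tests on a single pair of identities. This is an illustrative sketch only: it assumes that Tr and Nr are compared against the magnitude of the residual, it reports each test separately because the text does not state how the applet combines them, and the function name and cutoff values are hypothetical.

```python
def cutoff_flags(observed, expected, ts, tr, pr, nr):
    """Report which of the Tr, Pr and Nr cutoffs a correlation meets."""
    diff = observed - expected
    return {
        # Total residual: any large deviation, positive or negative.
        "Tr": abs(diff) / ts >= tr,
        # Partial positive residual: excess relative to the expected count.
        "Pr": diff > 0 and expected > 0 and diff / expected >= pr,
        # Negative residual: large under-representation.
        "Nr": diff < 0 and abs(diff) / ts >= nr,
    }

# A rare pair seen 3 times where 0.5 is expected, out of 100 sequences.
rare_pair = cutoff_flags(3, 0.5, 100, tr=0.1, pr=1.0, nr=0.1)
print(rare_pair)  # {'Tr': False, 'Pr': True, 'Nr': False}

# A common pair never seen where 20 are expected, out of 100 sequences.
missing_pair = cutoff_flags(0, 20, 100, tr=0.1, pr=1.0, nr=0.1)
print(missing_pair)  # {'Tr': True, 'Pr': False, 'Nr': True}
```

The first case shows why the partial positive residual is useful: a correlation between two rare identities fails the absolute Tr cutoff even though the observed count is six times the expectation.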

2.3 Implementation

The original system developed by Ray analyzed users’ data from a web-form submission and presented the user with a static VRML diagram describing their data. The new implementation of MAVL/StickWRLD is written in Java3D. This system leaves the computationally intensive calculation of the complete interpositional correlation matrix on the server, but moves the selection of features for display to the client applet running on the user’s computer. The Java3D implementation has four significant areas of improvement over the VRML version: enhanced real-time user interaction; clustering of residues by aggregate physicochemical properties; availability of pre-calculated

StickWRLD WRLDs (list of all possible interpositional dependencies) for all families in the Pfam database; and visualization of correlations on available PDB structures.

The entrance page of the MAVL/StickWRLD tool consists of a form to submit the family alignment and a list to retrieve pre-computed WRLDs for the families in Pfam database v21.0 (Figure 2.1). The user can input his/her own family alignment and select the type of the alignment, i.e. nucleic acid or amino acid sequence alignment. Only alignments in FASTA format or raw, block-aligned sequences are accepted. If the user inputs his/her own alignment, the input alignment and its type are posted to a CGI script and all possible interpositional dependencies are calculated by a C program. If the user selects a

Pfam family from the list, the name of the family is posted to the CGI script and the pre-computed WRLD for the family is retrieved from the server.

The list of interpositional dependencies is passed as a variable to the Java3D applet on the next page. In a StickWRLD representation, first, the positional weight matrix of the alignment is wrapped around a cylinder by placing spheres for each identity (along the vertical axis of the cylinder) and position (around the cylinder). The sizes of the spheres are proportional to the percentage frequency of the corresponding identities.

The StickWRLD graph can be scaled, translated and rotated. Figure 2.2 displays the

StickWRLD representation page for the Integrin alpha cytoplasmic region alignment (Pfam

ID: Integrin_alpha), the same sequence family shown in Figure 2.1.
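The cylindrical layout described above can be sketched as a simple coordinate calculation: positions are spread evenly around the cylinder, identities are stacked along its vertical axis, and sphere diameters scale with frequency. The radius, spacing, diameter scale and function name below are assumptions chosen for illustration and are not taken from the Java3D applet.

```python
import math

def sphere_layout(frequencies, identities, radius=10.0, spacing=1.0):
    """Place one sphere per (position, identity) with non-zero frequency.

    frequencies: list with one {identity: fraction} dict per alignment position.
    identities:  vertical ordering of the identities, bottom to top.
    """
    n_positions = len(frequencies)
    spheres = []
    for pos, column in enumerate(frequencies):
        angle = 2 * math.pi * pos / n_positions  # position around the cylinder
        x, z = radius * math.cos(angle), radius * math.sin(angle)
        for row, identity in enumerate(identities):
            freq = column.get(identity, 0.0)
            if freq > 0:
                spheres.append({
                    "position": pos,
                    "identity": identity,
                    "center": (x, row * spacing, z),  # height encodes identity
                    "diameter": freq * spacing,       # proportional to frequency
                })
    return spheres

# Hypothetical four-position nucleic acid profile.
profile = [{"A": 1.0}, {"A": 0.5, "C": 0.5}, {"C": 1.0}, {"G": 1.0}]
spheres = sphere_layout(profile, "ACGU")
print(len(spheres), spheres[0]["center"])
```

A stick for a correlated pair would then be drawn between the centers of the two corresponding spheres.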


Figure 2.1. Entrance page of MAVL/StickWRLD web site.

The user can input his/her own sequence alignment in FASTA format or as block aligned sequences, select the type of input sequences (nucleic acid or amino acid), and post the alignment for the computation of interpositional dependencies. Alternatively, the user can select the name of a protein family from the list of 8,957 Pfam v21 families and post the pre-computed WRLD (list of interpositional dependencies) for StickWRLD visualization.


Figure 2.2. StickWRLD diagram and interface controls.

A StickWRLD diagram is displayed as a Java3D applet. Controls in the left-side column can be used to tweak display parameters. The family alignment is listed under the StickWRLD diagram. Sequences in the family alignment can be traced visually on the StickWRLD diagram simply by clicking on them. The ‘View Alignment and Structure References’ button opens the family alignment in a separate window with the available structural references.

Statistically significant pairwise correlations are depicted as sticks between the identity pairs based on overpopulation or underpopulation of sequences sharing these identities as compared with the consensus expectation. The thickness of each stick is proportional to the total residual calculated for this correlation. Sticks with striped textures represent negative correlations. The color of each stick is assigned based on the predicted type of interaction (Table 2.1). Amino acids are assigned to groups as hydrophobic, non-hydrophobic, hydrophilic, negatively and positively charged based on the theory that favorable (that is, physically/biochemically attractive, rather than repulsive) interactions are more indicative of structural interactions. Favorable interactions are color coded based on these properties (Table 2.2). Figure 2.3 depicts front and top views of the StickWRLD representation of the Integrin alpha cytoplasmic region alignment with pairwise correlations that have total residuals greater than 0.1. The

StickWRLD representation shows that positions 1, 3, 4, 5, 6 and 7 are strongly conserved and that, amongst the varying positions, 9, 10, 11, 12 and 13 display strong correlations.

Color    Type of Predicted Interaction

nonhydrophilic - nonhydrophilic interaction

nonhydrophilic - hydrophobic interaction

hydrophobic - hydrophobic interaction

nonhydrophobic - hydrophilic interaction

hydrophilic - hydrophilic interaction

positive charge - negative charge interaction

negative correlations

other interactions

Table 2.1. Color codes for sticks depicting pairwise correlations based on types of predicted interactions.


Gly Ala Val Leu Ile Pro Phe Tyr Trp Ser Thr Asn Gln Cys Met Asp Glu His Lys Arg

Property                           Amino acid
non-hydrophilic                    Gly
hydrophobic                        Ala
hydrophobic                        Val
hydrophobic                        Leu
hydrophobic                        Ile
non-hydrophobic                    Pro
hydrophobic                        Phe
non-hydrophobic                    Tyr
hydrophobic                        Trp
hydrophilic                        Ser
hydrophilic                        Thr
hydrophilic                        Asn
hydrophilic                        Gln
non-hydrophobic                    Cys
hydrophobic                        Met
negatively charged, hydrophilic    Asp
negatively charged, hydrophilic    Glu
positively charged, hydrophilic    His
positively charged, hydrophilic    Lys
positively charged, hydrophilic    Arg

Table 2.2. Amino acid properties and favorable interactions.

Amino acids are assigned to groups as hydrophobic, non-hydrophobic, hydrophilic, negatively and positively charged as shown in the first column. Favorable interactions are color coded as shown in the table. These color codes are used to differentiate predicted types of interactions for the pairwise correlations visualized in StickWRLD.

Figure 2.3. StickWRLD graph representation of Integrin_alpha family.

In these front (A) and top (B) views of the StickWRLD graph, correlations with Tr >= 0.1 are visualized, amino acids are colored by hydrophobicity and ordered by hydropathy scores. A large portion of the domain, positions 1-7, is dominated by consensus features. Positions 8, 14 and 15 contain essentially randomly distributed residues, while positions 9-13 display complex inter-positional dependencies. For example, Arg at position 9 and Glu at position 13 are significantly overrepresented, while Tyr at position 9 and Pro at position 10 are underrepresented.


StickWRLD statistical parameters (Tr, the global over/underpopulation threshold; Pr, the per-edge overpopulation threshold; Nr, the per-edge underpopulation threshold; and alpha, the edge-significance threshold), and display parameters such as residue coloring, ordering and grouping can be adjusted by using the controls on the left column of the user interface (Figure 2.2). This allows the parameters to be adjusted to suit the complexity of the alignment being visualized. The user can toggle between coarse visualization parameters to quickly explore and locate interesting features of an alignment, and fine parameters to investigate detailed aspects of the relationship. Figure 2.4 shows visualization of correlations with four different Tr cutoffs.

Amino acids can be ordered along the vertical axis based on nine physical parameters.

The ‘Order residues by’ interface element allows the user to control this ordering:

Hydropathy scores by Kyte and Doolittle [1982], Average Surrounding Hydrophobicity

(ASH) index by Manavalan and Ponnuswamy [1978], hydrophobicity scales by

Ponnuswamy [1993], 8 Å contact number by Nishikawa and Ooi [1980], optimal structure-discriminative AAindex by Leary [2004], composition, polarity and volume by Grantham

[1974], and isoelectric point by Zimmerman [1968]. Numeric values of these physical parameters are extracted from AAIndex [Kawashima and Kanehisa, 2000] and listed in

Table 2.3.
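Ordering residues by a physical parameter amounts to sorting the amino acids by their value on the chosen scale. The sketch below uses the Kyte and Doolittle hydropathy scores listed in Table 2.3; the function name is hypothetical, and placing the most hydrophobic residue first is an assumption about the display direction.

```python
# Kyte and Doolittle hydropathy scores, as listed in Table 2.3.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def order_residues(scale, descending=True):
    """Return the amino acid codes sorted by their value on the given scale."""
    return sorted(scale, key=scale.get, reverse=descending)

ordering = order_residues(KYTE_DOOLITTLE)
print("".join(ordering[:7]))  # IVLFCMA
```

Residues with equal scores (N, D, Q and E on this scale) keep their input order, so their relative placement is arbitrary.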

Amino acids can be put into color groups based on six physical parameters: hydrophobicity by Janin [1976], Wolfenden [1981], Kyte and Doolittle [1982], and Rose

[1985], default coloring schema of ClustalX [Thompson, 1997] and charge (Table 2.4).


Figure 2.4. Pairwise correlations displayed in StickWRLD graph for total residual cutoffs of (A) 0.100, (B) 0.075, (C) 0.050, and (D) 0.025.

As the cutoff is decreased, allowing more relationships to be displayed, it becomes apparent that the residues at 8, 14 and 15 are not actually randomly distributed. Positions 1-7 remain primarily consensus dominated even when the cutoff is reduced to the point that other residues show relationships to almost every other position in the family.

Columns: (1) hydropathy score; (2) Average Surrounding Hydrophobicity index; (3) hydrophobicity scale; (4) 8 Å contact number; (5) optimal structure-discriminative AAindex; (6) composition; (7) polarity; (8) volume; (9) isoelectric point.

AA   (1)    (2)     (3)     (4)     (5)      (6)    (7)    (8)     (9)
A    1.8    12.97   0.85    0.23    0.045    0      8.1    31.0    6.000
R    -4.5   11.72   0.20    -0.26   -0.152   0.65   10.5   124.0   10.760
N    -3.5   11.42   -0.48   -0.94   -0.231   1.33   11.6   56.0    5.410
D    -3.5   10.85   -1.10   -1.13   -0.272   1.38   13.0   54.0    2.770
C    2.5    14.63   2.10    1.78    0.223    2.75   5.5    55.0    5.050
Q    -3.5   11.76   -0.42   -0.57   -0.179   0.89   10.5   85.0    5.650
E    -3.5   11.89   -0.79   -0.75   -0.237   0.92   12.3   83.0    3.220
G    -0.4   12.43   0       -0.07   -0.206   0.74   9.0    3.0     5.970
H    -3.2   12.16   0.22    0.11    -0.117   0.58   10.4   96.0    7.590
I    4.5    15.67   3.14    1.19    0.405    0      5.2    111.0   6.020
L    3.8    14.90   1.99    1.03    0.334    0      4.9    111.0   5.980
K    -3.9   11.36   -1.19   -1.05   -0.203   0.33   11.3   119.0   9.740
M    1.9    14.39   1.42    0.66    0.183    0      5.7    105.0   5.740
F    2.8    14.00   1.69    0.48    0.280    0      5.2    132.0   5.480
P    -1.6   11.37   -1.14   -0.76   -0.205   0.39   8.0    32.5    6.300
S    -0.8   11.23   -0.52   -0.67   -0.169   1.42   9.2    32.0    5.680
T    -0.7   11.69   -0.08   -0.36   -0.055   0.71   8.6    61.0    5.660
W    -0.9   13.93   1.76    0.90    0.065    0.13   5.4    170.0   5.890
Y    -1.3   13.42   1.37    0.59    0.121    0.20   6.2    136.0   5.660
V    4.2    15.71   2.53    1.24    0.365    0      5.9    84.0    5.960

Table 2.3. Amino acid property values that are applied by the ‘Order residues by’ interface option.

Hydropathy scores by Kyte and Doolittle [1982], Average Surrounding Hydrophobicity (ASH) index by Manavalan and Ponnuswamy [1978], hydrophobicity scales by Ponnuswamy [1993], 8 Å contact number by Nishikawa and Ooi [1980], optimal structure-discriminative AAindex by Leary [2004], composition, polarity and volume by Grantham [1974], and isoelectric point by Zimmerman [1968] are used to order amino acids along the vertical axis in the StickWRLD representation.


Hydrophobicity (Janin) C I V L F M A G W H S T P Y N D Q E R K

Hydrophobicity (Wolfenden) G L I V A F C M T S W Y N K Q E H D R

Hydrophobicity (Kyte&Doolittle) I V L F C M A G T S W Y P H N Q D E K R

Hydrophobicity (Rose) Most Hydrophobic C F I V L M W H Y A G T S P R N Q D E K Least Hydrophobic

Composition (Grantham) C S D N E Q G T R H P K Y W V F M L I A

Polarity (Grantham) D E N K R Q H S G T A P Y V M C W F I L

Volume (Grantham) W Y F R K L I M H Q V E T N C D P S A G

Charge R H K D E N Q S T Y C W V A G I L M F P

ClustalX A V L I M F W S T N Q D E R K G P H Y

ClustalX groups: small and hydrophobic amino acids, hydroxyl and amine amino acids, charged amino acids, G, P, H Y

Table 2.4. Grouping schema applied by ‘Group residues by’ interface option.

Amino acids are grouped based on hydrophobicity scores by Janin [1976], Wolfenden [1981], Kyte and Doolittle [1982], and Rose [1985], composition, polarity and volume by Grantham [1974], charge, and default grouping schema of ClustalX [Thompson, 1997]. Amino acids are colored as shown on the table. Color codes for hydrophobicity, ClustalX and charge are also used by the ‘Color residues by’ interface option.

To allow the system to more intuitively deal with the potentially weak relationships amongst poorly constrained sequences, clustering of residues by aggregate physicochemical properties has been implemented. This allows, for example, the conserved residue properties and interpositional requirements of distant homologs to be discovered, even in the absence of specifically conserved residues. Residues can be grouped by hydrophobicity [Janin, 1976, Wolfenden, 1981, Kyte and Doolittle, 1982,

Rose, 1985], composition, polarity, volume [Grantham, 1974], charge and default grouping schema of ClustalX [Thompson, 2000]. The relationship between these aggregate properties at each position is visualized, just as the specific sequence identities

can be. Using this option in many protein families results in the discovery of correlated physical properties, even when the contributing residue identities fall below the significance threshold, or are obscured by other more dramatic individual-residue relationships. A detailed example is discussed in Section 2.3.1, Results and Discussion.

Table 2.4 lists the available grouping schemas and color codes for the groups.
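The effect of grouping residues by an aggregate property can be illustrated by collapsing one alignment column into property classes before counting frequencies. This sketch uses the charge assignments of Table 2.2 (Asp and Glu negative; His, Lys and Arg positive) and treats all other residues as neutral; the function name and the example column are hypothetical.

```python
from collections import Counter

# Charge classes taken from Table 2.2; every other residue is treated as neutral.
CHARGE_CLASS = {"D": "negative", "E": "negative",
                "H": "positive", "K": "positive", "R": "positive"}

def class_frequencies(column, classes=CHARGE_CLASS, default="neutral"):
    """Frequency of each property class in one alignment column."""
    counts = Counter(classes.get(res, default) for res in column)
    total = len(column)
    return {name: count / total for name, count in counts.items()}

# Hypothetical column: no single residue dominates, but the charge is conserved.
column = "KRHKRD"
print(class_frequencies(column))
```

No residue in this column exceeds a frequency of one third, yet five of six sequences carry a positive charge. Applying the same residual analysis to class labels instead of residue identities is what allows such property-level dependencies to surface.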

Individual sequences in the family alignment can be traced visually by simply clicking on the family alignment displayed under the StickWRLD graph (Figure 2.2).

Positional propensities and the presence of correlations in a single sequence can be examined in comparison with the whole family. The ‘View Alignment and Structure References’ button displays the family alignment in a new window with existing structural references (Figure 2.5). In this window, the user can select from the available PDB structures or load his/her own PDB structure for a sequence to visualize correlations on the structure. The PDB structure is displayed in ribbon form and correlations from the StickWRLD graph that exist within this structure are depicted as sticks between corresponding amino acids.

Finally, in this implementation we included the distribution of the variety of interpositional relationships highlighted by MAVL/StickWRLD, amongst known sequence families. To facilitate access to this information, and to make it easier for researchers to examine popular sequence families, we have pre-computed StickWRLD

WRLDs for the complete Pfam database v21. The Pfam alignments and the WRLDs generated from them are available directly from the MAVL/StickWRLD interface. We currently house 8,957 sequence families.

MAVL/StickWRLD is available via the WWW at http://www.microbial-pathogenesis.org/stickwrld/.


Figure 2.5. Input sequence alignment screen.

The user can click on a sequence to visually track it in the StickWRLD graph. Available PDB references for the sequences are listed in comboboxes. The user can either select an available structural reference or load a PDB file to visualize pairwise correlations on the structure.

2.3.1 Results and Discussion

The new implementation of MAVL/StickWRLD results in fast, easy and accurate examinations of sequence families. Enhanced user interaction, availability of pre-calculated StickWRLD WRLDs for Pfam alignments, and visualization of dependencies on the available structures provide fast and convenient analysis. Grouping of the residues based on physicochemical properties supplies more information on conserved patterns within families than is available through individual residues alone.

Applying MAVL/StickWRLD to a typical protein domain results in improved understanding of the interacting residues, and the physical properties required to maintain structure or function. For example, the active site lid domain of adenylate kinases has a particular variation in the sequence. In Gram-negative bacteria and eukaryotes a network of hydrogen bonds restricts the conformation of a pair of structural loops at the active site lid. In Gram-positive bacteria, on the other hand, residues in the active site lid have been replaced by four cysteine residues and a zinc ion is bound to these tetrahedrally coordinated cysteine residues [Berry and Phillips, 1998]. Figure 2.6 shows the StickWRLD graph of the ADK_lid family alignment and positional dependencies found by MAVL analysis (Pfam ID: ADK_lid, accession no. PF05191). Subfamily preferences at positions 4, 7, 25 and 28 as (C4, C7, C25, C28) and (H4, S7, D25, T28) are clearly represented in the StickWRLD graph with positive correlations. Negative correlations clarify the existence of two separate subfamilies. Moreover, MAVL/StickWRLD analysis suggests that R at 9 strongly correlates with (H4, S7) and negatively correlates with (C4, C7), and E at 31 strongly correlates with (D25, T28) and negatively correlates

with (C25, C28). The structural role of R9 and E31 in the (H4, S7, D25, T28) variant of the ADK_lid domain was first identified by MAVL/StickWRLD analysis [Ray, 2005]. Figure 2.7 shows three-dimensional structures of the (C4, C7, C25, C28) and (H4, S7, D25, T28) variants on

Bacillus stearothermophilus (PDB: 1ZIP) and Bovine (PDB: 1AK2) ADK_lid domains respectively. In the latter variant, clearly, R9 and E31 are oriented to structurally interact with H4, S7, D25 and T28.

Sequence variation of this nature is particularly damaging to sequence family modeling efforts because the information contained in the covariation is critical to the family characteristic, but works directly against the most common tools used to statistically model the families. MAVL/StickWRLD’s first goal was to identify these situations and highlight them for the user so that they can avoid inappropriate uses of statistical modeling tools.

The alternating preferences within the ADK_lid family alignment are highlighted in

Figure 2.8 (A). An HMM is the most powerful and commonly used probabilistic model to describe protein families. A Sequence Logo provides a convenient visualization of this complex probabilistic family description [Schuster-Boeckler et al., 2004]. Figure 2.9 shows the Sequence Logo generated for the ADK_lid family alignment. Identity probabilities at each position clearly indicate the positions with alternating preferences. However, it is not possible to infer the relationships among these alternating subfamily positions.


Figure 2.6. StickWRLD representation for the Pfam family of adenylate kinase active site lid (Pfam ID: ADK_lid, accession no. PF05191).

Strong correlations amongst four cysteine residues (C4, C7, C25, C28) are depicted with green sticks indicating possible non-hydrophobic-non-hydrophobic interactions. Strong correlations amongst (H4, S7, R9, D25, T28, E31) are depicted with red and orange sticks indicating possible charge-charge and hydrophilic-hydrophilic interactions respectively. Negative correlations (depicted as stripe-textured sticks) clarify the separate populations that exist for these two motifs.


Figure 2.7. Three-dimensional structures of ADK_lid domains from (A) Bacillus stearothermophilus (PDB: 1ZIP) and (B) Bovine (PDB: 1AK2).

At the active site lid of Bacillus stearothermophilus a zinc ion is bound to tetrahedrally coordinated cysteines (C4, C7, C25, C28). In Bovine the conformation of the active site lid is maintained by a network of hydrogen bonds amongst (H4, S7, D25, T28) and supported by R9 and E31.


Figure 2.8. Pfam ADK_lid family alignment

(A) Pfam ADK_lid family alignment. Subfamily preferences (H4, S7, D25, T28) and (C4, C7, C25, C28) are highlighted with cyan and yellow colors respectively. Complementing residues of (H4, S7, D25, T28) motif, R at position 9 and E at position 31, are also highlighted with magenta color. (B) An artificial sequence that was searched against the ADK_lid family alignment.


Figure 2.9. Sequence logo of Pfam ADK_lid family alignment.

The logo visually summarizes the preferences at each position of the alignment. Strongly conserved residues at positions (1, 11, 26, 36, 38) and alternate residues at positions (4, 7, 12, 13, 25, 28) are clearly seen. However, the logo does not suggest any information about the dependencies amongst positions 4, 7, 25 and 28.

HMMs can successfully represent a given family alignment and can be used to identify very distant relatives. Pfam WWW servers provide users the ability to compare input protein sequences against Pfam HMMs. Pfam HMM search results report the matching families and the corresponding bit-scores and E-values. The bit-score reflects whether the sequence is a better match to the profile model (positive score) or to the null model of nonhomologous sequences (negative score). The higher the bit-score, the better the match. The best criterion of statistical significance is the E-value

(expectation value). The E-value is calculated from the bit score as the number of hits that would be expected to have a score equal or better than this by chance alone. A good

E-value is much less than 1. Around 1 is what we expect just by chance.

For example, if we query the artificial sequence in Figure 2.8 (B) against the HMM family models in the Pfam database, this sequence is identified as a member of the ADK_lid family with an E-value of 1.4e-21 (Figure 2.10). When the real sequences in the ADK_lid family are searched against the Pfam database, the search results yield E-values between 1.4e-11 and 1.8e-22. The artificial sequence is therefore scored as well as, or better than, the majority of actual sequences, despite containing a residue pattern that is absolutely excluded by the actual family. This artificial query sequence has residues (C4, S7, C25,

T28) in the active site lid. According to the subfamily preferences highlighted by Berry and

Phillips [1998] and StickWRLD analysis [Ray, 2005], this combination of residues is not compatible with the family description and function.

Figure 2.10. Sequence search results from http://pfam.sanger.ac.uk/.

The sequence search tool reports matching families in the Pfam database for the given query sequence. The result shows that the artificial sequence in Figure 2.8 (B) is a good match with the ADK_lid family, with an E-value of 1.4e-21. This is comparable to searches with actual members (1.4e-11 to 1.8e-22). These results would be dramatically improved, and the synthetic chimera rejected as an impossible member, if the positional dependencies in the sequence family were recognized and the family were modeled as two independent subfamilies.

Theoretically, HMMs provide the best possibility for modeling interpositional dependencies, but in practice they are limited to near-neighbor dependencies and cannot represent dependencies between distant positions. Therefore, the proposed MAVL analysis and StickWRLD visualization provide invaluable information about interpositional dependencies within the family alignments and improve our understanding of the constraints in the family alignments.

The MAVL/StickWRLD system also provides computation and visualization of property-wise interpositional dependencies. Clustering of amino acids by aggregate physicochemical properties leads to the discovery of interpositional requirements of distant homologs even in the absence of specifically conserved residues. Figure 2.11 [Ozer and Ray, 2006] shows StickWRLD representations and corresponding three-dimensional structures for the Pfam family of the integrin alpha cytoplasmic region (Pfam ID: integrin_alpha, accession no. PF00357; PDB ID: 1M8O). This short cytoplasmic region of the integrin alpha chain consists of a small, strongly conserved helix, followed by a generally acidic region [Vinogradova et al., 2002, Humphries et al., 2003]. MAVL/StickWRLD analysis (Figure 2.11 (A)) suggests that there are two alternative preferred sequences within the family, (R9 P10 P11 Q12 E13) or (Y9 K10 M12). In the former pattern the conserved α-helix terminates in R9, broken by prolines at positions 10 and 11, and followed by Q12 and E13 to complete the hairpin. The latter (alternative) pattern contains neither P10 nor P11, but instead a preferred tyrosine, lysine, methionine triplet at positions 9, 10 and 12.

Furthermore, grouping of the residues by charge (Figure 2.11 (B) and (C)) reveals more molecular preferences of the distinct sequence subfamilies. The turn following the α-helix of the cytoplasmic domain appears to be stabilized by either 5 residues following the pattern (9+, 10NP, 11NP, 12P, 13-) or 8 residues following the pattern (8+, 9P, 10+, 11-, 12NP, 13NP, 14P, 15-) (+: positively charged, -: negatively charged, P: polar, NP: nonpolar). There is also a preference in the first pattern for a slightly polar residue at position 2, and against a non-polar residue at this position. Consensus residues of this small domain are colored and labeled in Figure 2.11 (D). Strongly conserved residues participate in the 10-residue N-terminal α-helix of the domain. However, consensus methods only identify the remainder of the domain as ‘acidic’, and do not capture the sequence or property requirements, or suggest their involvement in stabilizing the hairpin. The canonical (proline-containing) motif is highlighted in Figure 2.11 (E). It is clear that the positively charged arginine (9R) and negatively charged glutamate (13E) are brought together by the hairpin, and can act in concert to stabilize the hairpin with the support of other polar (10P and 11P) and non-polar (12L) residues. While crystal or solution structures are not available for any non-proline-motif integrin alpha domains, we predict that the alternative positive residues at 8 and 10, and negative residues at 11 and 15, will be found to interact similarly to stabilize the domain structure [Ozer and Ray, 2006].

2.4 Conclusion

The MAVL/StickWRLD tool provides rapid exploration of alignment properties and insight into potential structural requirements that are embedded in the sequence identities.

The analysis and visualization method has proven both intuitive and accurate for predicting likely structural features in protein [Ray, 2005, Ozer and Ray, 2006] and nucleic acid [Ray, 2004] families for which representative crystal or solution structures are known. The visualization capabilities allow the user to rapidly identify and avoid defects in typical tools used to statistically model sequence families.


Figure 2.11. StickWRLD graph representations of the integrin alpha cytoplasmic domain from Pfam family alignment, and a corresponding 3D structure.

(A) shows interpositional dependencies within the family in the form of a default StickWRLD graph; (B) (detail) and (C) (overview) show correlations obtained when residues are grouped by charge (red Positive, blue Negative, cyan Polar, tan Slightly-Polar, green Non-Polar, grey Gap); (D) illustrates the position of strongly conserved residues on the domain’s structure; (E) displays strongly correlating residues based on MAVL/StickWRLD analysis of the aligned members of the domain [Ozer and Ray, 2006].


CHAPTER 3

3 Analysis of Residue Associations in the Pfam Database

3.1 Introduction

Correlation analysis on sequence family alignments has been extensively studied to identify residue contacts in the structure; advanced statistical methods, incorporation of evolutionary rates, and selection rules have been proposed to improve the accuracy of contact prediction. Most of the previous studies in the literature suffer from either being too specific or being limited by the size of the test dataset. Moreover, possible physical contact of correlated residues should not be the only structural and functional consequence considered. In this chapter, we address both problems: the limited dataset and the consequences of residue associations beyond physical contact. We discuss appropriate methods to identify residue associations and their statistical significance, compute residue associations for 961 Pfam family alignments, and examine the physical proximity and physicochemical properties of associated residues in protein family alignments and their presence on secondary structural elements.

3.2 Methods

3.2.1 Protein Family Alignments

The Pfam database is a large collection of protein domains and families. Each family in the Pfam database consists of four elements: 1) an annotation, 2) a seed alignment, 3) a profile HMM (Hidden Markov Model) and 4) a full alignment. Each annotation contains a brief description of the domain, links to other databases and the record of family construction. The seed alignment is a manually curated multiple alignment representing the family. HMMs are derived from the seed alignments and can be used to identify new members of the family. Finally, the full alignment is an automatic alignment of the domain [Finn et al., 2006]. Currently Pfam matches 72% of known protein sequences, and 95% of proteins with known structure [Sammut et al., 2008]. This impressive sequence coverage makes the Pfam database the first choice for a large-scale protein family analysis.

Pfam v22.0, released in July 2007, contains multiple sequence alignments of 9,318 protein families. Histograms of the number of sequences in, and sequence lengths of, all families in the Pfam database are shown in Figure 3.1 and Figure 3.2 respectively. Although the average number of sequences is 143, the majority of the Pfam families (95%) have 2-80 sequences. The average sequence length is 68, and the majority of the Pfam families (86%) have a sequence length of 30-410.

Since we aim to summarize the structural correspondence of correlating residues in family alignments, we eliminated the families with no structural information. There are 2,588 Pfam families that refer to PDB structural information. These 2,588 families refer to a total of 33,502 RCSB PDB structures. However, many individual sequences have multiple PDB structure references. For example, the TIM domain family alignment refers to 73 PDB structures, but these are distributed amongst only 4 sequences in this family. Each of these has had its structure determined multiple times with multiple methods. If we eliminate multiple structure references, there are only 5,795 sequences scattered throughout Pfam that have corresponding structures. Figure 3.3 shows a histogram of the number of PDB references. Although on average there are 2.2 structures per family, in the actual distribution 60% of the families have 1, 20% have 2, and 9% have 3 known structures, and only 7 families (0.3%) have more than 20 structures.

Figure 3.1. Histogram of the number of sequences for all Pfam v22.0 families.

There are five more families not depicted on this histogram with sequence counts of 930, 1160, 1560, 1870 and 2450.


Figure 3.2. Histogram of the length of sequences for all Pfam v22.0 families.

There is only one family not depicted on this histogram with sequence length of 2300.

Figure 3.3. Histogram of the number of distinct PDB references for all Pfam v22.0 families. There are two families not depicted on this histogram with 62 and 79 PDB references each.

To be able to calculate acceptable statistical significance values, we further eliminated the families that have fewer than 20 or more than 500 sequences. Fewer than 20 sequences would not be enough for statistical analysis, and families with more than 500 sequences (only 20 families) would skew the results. There are 1,139 families with 20-500 sequences, referring to 3,442 unique known structures. Amongst these families, we eliminated 178 due to insufficient overlap between the sequence and the structure. After applying all limiting criteria, we analyzed the seed alignments of 961 Pfam families that have at least one PDB reference and 20-500 sequences. Because Pfam curates the seed alignments and eliminates duplicates, we used the seed rather than the full family alignments. Figure 3.4, Figure 3.5 and Figure 3.6 show histograms of the number of sequences, sequence lengths and number of structural references for the 961 families examined in this study.

Figure 3.4. Histogram of the number of sequences for Pfam families examined in this study (961 families). This is our analyzed subset of Figure 3.1.


Figure 3.5. Histogram of the length of sequences for Pfam families examined in this study (961 families).

There is only one family not depicted on this histogram, with a sequence length of 1670. This is a subset of Figure 3.2.

Figure 3.6. Histogram of the number of distinct PDB references for Pfam families examined in this study (961 families). This is a subset of Figure 3.3.

3.2.2 Protein Structures

We examined 2,792 available RCSB PDB structures for the above-mentioned 961 Pfam families to compute the actual physical distances between the correlating residues and to understand where correlating residues reside in the secondary structural elements. On average there are 3 structures per family. As Figure 3.6 shows in more detail, 42% of the families have 1, 21% have 2, 11% have 3, and 6% have 4 known structures, and only 2 families have more than 20 known structures. We also observed that only about 5% of all pairwise correlations that exist in a family are present in any particular referenced structure. Therefore, selecting only one structure per family would result in the loss of considerable information, and including all referenced structures in the computations would only rarely result in overrepresentation of individual correlations.

We calculated the average diameter of each PDB structure as the average of the three sides of the bounding box of the molecule or domain (structure). We defined the bounding box using only the residues present in the family alignment. We incorporated these diameters into our calculations to examine the relative distances of correlating residues. As the sequence length histograms (Figure 3.2 and Figure 3.5) suggest, the sizes of the structures can differ greatly. A distance of 10 Å between two amino acids can be considered large for a small domain; however, such a distance would be considered close for a large structure, and these residues would be considered localized in the same fragment of the structure. Therefore, we examined the distribution of relative distances between correlating residues to understand whether these correlations imply certain fragmentations of the structure.
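The diameter calculation described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names are hypothetical, the bounding box is assumed to be axis-aligned in the coordinate frame of the PDB file, and the relative distance is assumed to be the Euclidean distance divided by the average diameter, which the text does not state explicitly.

```python
import math

def average_diameter(coords):
    """Average of the three side lengths of the axis-aligned bounding box.

    coords: iterable of (x, y, z) positions in angstroms, restricted to the
    residues present in the family alignment (assumed to be done by the caller).
    """
    xs, ys, zs = zip(*coords)
    sides = (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))
    return sum(sides) / 3.0

def relative_distance(p, q, coords):
    """Distance between two residues as a fraction of the average diameter.

    Assumption: 'relative distance' means distance / average diameter.
    """
    return math.dist(p, q) / average_diameter(coords)

# Toy coordinates (made up for illustration, not taken from any PDB entry).
domain_coords = [(0.0, 0.0, 0.0), (10.0, 20.0, 30.0), (5.0, 5.0, 5.0)]
print(average_diameter(domain_coords))                               # 20.0
print(relative_distance((0.0, 0.0, 0.0), (0.0, 0.0, 10.0), domain_coords))  # 0.5
```

With this normalization, the same 10 Å separation maps to a large relative distance in a small domain and a small one in a large structure, which is the comparison the paragraph above motivates.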

3.2.3 Calculation of Pairwise Correlations

We computed pairwise correlations by using the residual approach as described in Algorithm 2.2 and also by using the phi coefficient. We calculated the statistical significance of a correlation using Fisher’s Exact test. The details of each calculation are described below.

3.2.3.1 Calculation of Residuals

The existence of a pairwise correlation between two pairs of position and identity in a family alignment is estimated by subtracting the expected number of occurrences from the observed number of occurrences of the pairs [Ray, 2004]. First, the positional probability matrix is generated for the subject family alignment. Then, the expected number of sequences that should share identities at a particular pair of positions is calculated based on positional probabilities. To obtain a residual for every possible pair of positions and identities in the alignment, this expected value is subtracted from the observed number of sequences with that characteristic.

The expected value $E_{N_i,M_j}$ is the number of sequences that would be predicted to share identity N at position i and identity M at position j. That is,

$$E_{N_i,M_j} = P(N_i) \times P(M_j) \times TS$$

where $P(N_i)$ is the probability of finding identity N at position i, $P(M_j)$ is the probability of finding identity M at position j, and TS is the total number of sequences under observation. The observed value $O_{N_i,M_j}$ is the actual number of sequences that share identity N at position i and identity M at position j. Then, the magnitudes of the residuals are calculated as:

$$R_{N_i,M_j} = \frac{O_{N_i,M_j} - E_{N_i,M_j}}{TS}$$

The final value is normalized by the number of sequences to provide a percentage residual that is comparable across all sequence families. Residuals greater than 0 indicate the existence of a positive correlation between the pair of residues. On the other hand, residuals less than 0 indicate the existence of a negative correlation between them.

The residual parameter is simple and easy to interpret. For example, a residual of 0.1 between residues Ni and Mj means that the NiMj pair occurs in 10% more of the sequences of the family alignment than expected, given the raw occurrences of N at i and M at j. The residual parameter can take values between -0.25 and 0.25. This small range is inherent in the construction of the residual from the co-occurrence expectation. Occurrences that would intuitively be sufficient to push the residual over 0.25 become part of the expectation and lower the residual value. For example, assume we have 10 sequences in our family alignment. If the sum of P(Ni) and P(Mj) does not exceed 1, the extreme case is P(Ni)=P(Mj)=0.5, where the expected number of pairs is 0.5×0.5×10=2.5, while the maximum and minimum for the observed number of pairs are 5 and 0 respectively. In that case the residual could be (0-2.5)/10=-0.25 at minimum and (5-2.5)/10=0.25 at maximum. If the sum of P(Ni) and P(Mj) exceeds 1, there will be at least some observed NiMj pairs purely due to the unavailability of alternate pairings ([P(Ni)×10+P(Mj)×10]-10). In that case, as P(Ni)×P(Mj) (the expected probability) increases, the number of observed pairs is forced to increase. For instance, if P(Ni)=0.8 and P(Mj)=0.7, the expected number of pairs will be 0.8×0.7×10=5.6, while the maximum for the observed number of pairs will be 7 (if all Mj’s pair with an Ni) and the minimum will be 5 ([0.8×10+0.7×10]-10=5). In that case the residual could be (5-5.6)/10=-0.06 at minimum and (7-5.6)/10=0.14 at maximum. Figure 3.7 shows maximum (A, B) and minimum (C, D) possible values for the residuals for any given probabilities of P(Ni) and P(Mj).
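The arithmetic above can be checked with a short Python sketch. This is an illustration only, not the MAVL implementation; the function names `residual` and `residual_bounds` are hypothetical, and the bounds follow the reasoning in the text (the observed count is at most the smaller of the two residue counts, and at least the forced overlap when the two counts together exceed the number of sequences).

```python
def residual(observed, p_n, p_m, total_seqs):
    """Residual = (observed - expected) / total, with expected = P(Ni) * P(Mj) * TS."""
    expected = p_n * p_m * total_seqs
    return (observed - expected) / total_seqs

def residual_bounds(p_n, p_m, total_seqs):
    """Smallest and largest residual attainable for the given positional probabilities."""
    count_n = p_n * total_seqs
    count_m = p_m * total_seqs
    max_observed = min(count_n, count_m)                       # every rarer residue is paired
    min_observed = max(0.0, count_n + count_m - total_seqs)    # forced overlap, if any
    return (residual(min_observed, p_n, p_m, total_seqs),
            residual(max_observed, p_n, p_m, total_seqs))

# The two worked examples from the text, with 10 sequences.
print(tuple(round(r, 2) for r in residual_bounds(0.5, 0.5, 10)))  # (-0.25, 0.25)
print(tuple(round(r, 2) for r in residual_bounds(0.8, 0.7, 10)))  # (-0.06, 0.14)
```

Evaluating `residual_bounds` over a grid of probabilities would give surfaces of the kind summarized in Figure 3.7, under the assumptions stated here.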

The disadvantage of the residual parameter is that it cannot precisely reflect the degree of association between two residues. For instance, if P(Ni)=0.1 and P(Mj)=0.1 and Ni and Mj are always paired in the alignment, the expected number of pairs will be 0.01×TS, while the observed number of pairs will be 0.1×TS. This gives a residual of (0.1×TS-0.01×TS)/TS=0.09. A correlation with a residual of 0.09 will not be very important in a 20-sequence family alignment, while it will be quite valuable in a 200-sequence family alignment. Figure 3.8 (A) shows the histogram of calculated residuals over the number of sequences in the analyzed families, while Figure 3.8 (B) shows the average number of correlations per structure over the residual parameter and the number of sequences in the analyzed families. It is important to note that as the number of sequences increases, the family alignment gets more diverse and more variable. This is contrary to simple and somewhat customary intuition regarding family alignments. As a family grows larger, even the strongest associations are represented with smaller residual values because sequence diversity increases and the probability of any specific residue falls.

Figure 3.9 (A) and (B) show the histogram of the number of calculated correlations and the average number of correlations per structure, respectively, over the residual parameter and sequence lengths of the analyzed families. As the sequences in the alignment get longer, the number of possible position and identity pairs increases quadratically. Therefore we expect to observe more correlations for longer sequences. As shown in Figure 3.9 (B), the average number of correlations per structure increases as the family alignment gets longer.


Figure 3.7. Maximum and minimum possible values of the residual parameter ($R_{N_i,M_j}$) for given probabilities of P(Ni) and P(Mj).

(A) and (B) show maximum possible residual values for given probabilities. (B) is 180° rotated with respect to (A) to visualize maximum residual values at higher probability combinations. (C) and (D) show minimum possible residual values for given probabilities. (D) is 180° rotated with respect to (C) to visualize minimum residual values at higher probability combinations. Maximum and minimum potential residual values are obtained when both probabilities are equal to 0.5. Columns in bright orange (A, B) and bright green (C, D) show maximum residual value of 0.25 and minimum residual value of -0.25 respectively.


Figure 3.8. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus residual parameter and number of sequences in the analyzed families.

Since the majority of the analyzed families (70%) have 20-80 sequences (Figure 3.4), the number of correlations is amplified in that range (A). The average number of correlations per structure is calculated simply by dividing the total number of correlations by the number of family-structure pairs analyzed within the given range of sequence counts (B). This normalized histogram clearly shows that we do not observe large residuals as the number of sequences increases in the family alignments. This is mostly because the columns of the alignment become more variable as the number of sequences in a family alignment increases. As shown in Figure 3.7, amino acid probabilities between 0.4 and 0.6 maximize the residual parameter. As the family gets larger, it becomes less likely to find amino acids that occupy 40-60% of a column. Therefore, although there is no theoretical limitation, in practice we do not observe correlations with large residuals as the number of sequences increases in the family alignments.


Figure 3.9. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus the residual parameter and sequence length of the analyzed families.

Longer sequences mean more columns are incorporated in the correlation calculations. Therefore, we expect to observe more correlations as the sequences get longer. Since the majority of the analyzed families (77%) have a sequence length of 20-300 (Figure 3.5), the number of correlations is amplified in that range (A). The average number of correlations per structure is calculated simply by dividing the total number of correlations by the number of family-structure pairs analyzed within the given sequence length range (B). This normalized histogram clearly shows that we observe more correlations as the sequences in the family alignments get longer.


3.2.3.2 Calculation of the Phi Coefficient

The phi coefficient is a measure of the degree of association between two binary variables [Edwards, 1976]. If two characteristics are tabulated for a population of S objects according to their possession by each of the objects, a 2×2 contingency table of the population can be built as:

          X-     X+     Totals
Y-        a      b      a+b
Y+        c      d      c+d
Totals    a+c    b+d    S

• a+b objects in the population do not have the characteristic Y

• c+d objects in the population do have the characteristic Y

• a+c objects in the population do not have the characteristic X

• b+d objects in the population do have the characteristic X

• a objects in the population have neither characteristic X, nor characteristic Y

• b objects in the population have characteristic X but not characteristic Y

• c objects in the population have characteristic Y but not characteristic X

• d objects in the population have both characteristics X and Y

Based on this contingency table the phi coefficient is defined as:

$$\phi = \frac{a \times d - b \times c}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$

which can range from -1 to 1.
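As a minimal illustration of this definition, the following Python sketch computes the phi coefficient from the four cell counts. The function name is hypothetical, and raising an error for an empty row or column (where the denominator is zero and phi is undefined) is a design choice of this sketch rather than something specified in the text.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient of a 2x2 table.

    a: neither X nor Y, b: X only, c: Y only, d: both X and Y.
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # A row or column total is zero, so the association is undefined.
        raise ValueError("phi is undefined when a marginal total is zero")
    return (a * d - b * c) / denominator

# Perfect positive association: X and Y always occur together.
print(phi_coefficient(90, 0, 0, 10))   # 1.0
# Perfect negative association: X and Y never occur together.
print(phi_coefficient(0, 10, 10, 0))   # -1.0
```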

The phi coefficient can be effectively used to evaluate association between two pairs of position and identity in a family alignment. For instance, for a family alignment composed of S sequences, association between residue N at position i and residue M at position j can be tabulated as follows:

          Ni-    Ni+    Totals
Mj-       a      b      a+b
Mj+       c      d      c+d
Totals    a+c    b+d    S

• a+b sequences do not have residue M at position j

• c+d sequences do have residue M at position j

• a+c sequences do not have residue N at position i

• b+d sequences do have residue N at position i

• a sequences have neither Ni nor Mj

• b sequences have Ni but not Mj

• c sequences have Mj but not Ni

• d sequences have the NiMj residue pair

Unlike the residual calculation, the phi coefficient also takes into account the nonexistence of associations, that is, the exclusion of the association from the family. Besides the existence of the NiMj pair, the joint absence of Ni and Mj affects the degree of association in the positive direction, while the existence of only one of Ni or Mj affects the degree of association in the negative direction. This makes the phi coefficient very susceptible to the gaps in the family alignment. If gaps in both positions i and j were counted as the nonexistence of the NiMj pair, the phi coefficient would be exaggerated. Having gaps in both positions i and j actually means these columns do not exist in these sequences. Therefore, such pairs should not be taken into account when calculating residue associations in the family alignments. On the other hand, pairing of a residue with a gap has to be included in the phi coefficient calculation, since such pairs constitute an important portion of the negative correlations. So, the value of a in the contingency table will be the number of sequences that have neither Ni nor Mj, excluding the sequences that have gaps in both positions i and j. As a result, S, the total number of sequences used for the phi coefficient calculation, could be quite different within the same family alignment depending on the columns of interest.
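The gap-handling rule described above can be sketched in Python as follows. This is an illustrative reading of the text, not the code used in this study: the function name is hypothetical, each alignment column is assumed to be given as a string with one character per sequence, and "-" is assumed to be the only gap symbol.

```python
def contingency_counts(col_i, col_j, res_n, res_m, gap="-"):
    """Count the cells (a, b, c, d) for residue res_n in column i and res_m in column j.

    Sequences with a gap in BOTH columns are skipped entirely; a residue
    paired with a single gap is still counted.
    """
    a = b = c = d = 0
    for x, y in zip(col_i, col_j):
        if x == gap and y == gap:
            continue            # neither column exists in this sequence
        has_n = (x == res_n)
        has_m = (y == res_m)
        if has_n and has_m:
            d += 1              # the NiMj pair
        elif has_n:
            b += 1              # Ni without Mj
        elif has_m:
            c += 1              # Mj without Ni
        else:
            a += 1              # neither Ni nor Mj
    return a, b, c, d

# Toy columns for seven sequences (made up for illustration).
column_i = "CCHH-A-"
column_j = "CCSS--C"
print(contingency_counts(column_i, column_j, "C", "C"))  # (3, 0, 1, 2)
```

In this toy case the table total is 6 rather than 7, because the one sequence that is gapped at both positions is dropped; this is why S can differ between column pairs of the same alignment.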

Figure 3.10 (A) shows the histogram of the calculated phi coefficients over the number of sequences in the analyzed families, while Figure 3.10 (B) shows the average number of correlations per structure over the phi coefficient and the number of sequences in the analyzed families. As the number of sequences increases, the family alignment gets more diverse and more variable. Therefore, we observe a decrease in the number of strong residue associations as the family gets larger. Unlike the residual parameter, it is still possible, both theoretically and practically, to observe residue associations with a phi coefficient of 1 or -1, even with a large number of sequences in the family.

Figure 3.11 (A) and (B) show the histogram of the number of calculated correlations and the average number of correlations per structure, respectively, over the phi coefficient and the sequence lengths of the analyzed families. As expected, more residue associations are calculated for the families with longer sequences (Figure 3.11 (B)).

Figure 3.10. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus phi coefficient and number of sequences in the analyzed families.

Since the majority of the analyzed families (70%) have 20-80 sequences (Figure 3.4), the number of correlations is amplified in that range (A). The average number of correlations per structure is calculated simply by dividing the total number of correlations by the number of family-structure pairs analyzed within the given range of sequence counts (B). This normalized histogram shows that the number of observed strong correlations gets smaller as the number of sequences increases in the family alignments. This is mostly because the columns of the alignment become more variable as the number of sequences in a family alignment increases. As the family gets larger, it becomes less likely to find strong residue associations. However, unlike the residual parameter, we still observe correlations with a phi coefficient of 1 or -1 even in the largest families.


Figure 3.11. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus phi coefficient and sequence length of the analyzed families.

Longer sequences mean more columns are incorporated in the correlation calculations. Therefore, we expect to observe more correlations as the sequences get longer. Since the majority of the analyzed families (77%) have a sequence length of 20-300 (Figure 3.5), the number of correlations is amplified in that range (A). The average number of correlations per structure is calculated simply by dividing the total number of correlations by the number of family-structure pairs analyzed within the given sequence length range (B). This normalized histogram clearly shows that we observe more correlations as the sequences in the family alignments get longer.


Going back to the last example given in Section 3.2.3.1: if P(Ni)=0.1 and P(Mj)=0.1 and Ni and Mj are always paired in the alignment, the residual for this association will be 0.09. Let’s assume we have S pairs for the columns i and j to be considered in the phi coefficient calculation. In that case the contingency table will be as follows:

          Ni-      Ni+      Totals
Mj-       0.9×S    0        0.9×S
Mj+       0        0.1×S    0.1×S
Totals    0.9×S    0.1×S    S

The phi coefficient will be:

$$\phi = \frac{(0.9 \times S)(0.1 \times S) - 0 \times 0}{\sqrt{(0.9 \times S)(0.1 \times S)(0.9 \times S)(0.1 \times S)}} = 1$$

As the phi coefficient of 1 indicates, Ni and Mj are very strongly associated, but the importance of such an association will be dramatically different depending on whether it comes from a 20-sequence family or a 200-sequence family. Although the phi coefficient can identify this 100% association, we still do not know whether this association is due to chance or not. Therefore, it is necessary to evaluate the statistical significance of the observed association.

3.2.3.3 Calculation of Statistical Significances

There is no satisfactory general measure of the degree of association. The important thing is the statistical significance of the observed association rather than any numerical measure of the association itself [Wilson, 1931]. The chi-square test is frequently applied to a contingency table to determine whether the two categorical variables are associated. This version of the chi-square test is named the chi-square test of association [Lowry, 2000]. It is not an ideal approach for this type of data, as will be illustrated here, but its familiarity makes it a good basis for description. The idea behind this test is to compare the observed frequencies with the frequencies that would be expected. The null hypothesis assumes that there is no association between the variables, while the alternative hypothesis claims that some association does exist. The alternative hypothesis does not specify the type of association. Therefore, the chi-square test is intrinsically non-directional and close attention to the data is required to interpret the information provided by the test. The chi-square test of association is based on a test that measures the divergence of the observed data from the values that would be expected under the null hypothesis of no association. This requires calculation of the expected values based on the data. For our analysis, given the contingency table:

          Ni-    Ni+    Totals
Mj-       a      b      a+b
Mj+       c      d      c+d
Totals    a+c    b+d    S

the expected values for each cell will be equal to (row total*column total)/S, where S is the total number of sequences included in the table.

          Ni-               Ni+               Totals
Mj-       (a+b)(a+c)/S      (b+d)(a+b)/S      a+b
Mj+       (a+c)(c+d)/S      (b+d)(c+d)/S      c+d
Totals    a+c               b+d               S

The logical validity of the chi-square test is greatest when the expected cell frequencies are fairly large, and decreases as these frequencies become smaller. Chi-square tests can be legitimately applied only if all values of expected cell frequencies are equal to or greater than 5 [Lowry, 2000]. For example, in a 100-sequence alignment, if Ni occurs 17 times, Mj occurs 16 times, and the NiMj pair occurs 15 times, the contingency table and expected cell frequencies (shown in parentheses) will be as follows:

          Ni-          Ni+          Totals
Mj-       82 (69.7)    2 (14.3)     84
Mj+       1 (13.3)     15 (2.7)     16
Totals    83           17           100

Based on this contingency table, the phi coefficient for the association of Ni and Mj is 0.9. Since the expected frequency of cell [Ni+, Mj+] is 2.7, the chi-square test cannot be applied to this example to test the statistical significance of the association.
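The expected cell frequencies in this example, and the resulting applicability check, can be reproduced with a small Python sketch. The function names are hypothetical; the threshold of 5 is the rule of thumb cited above.

```python
def expected_frequencies(a, b, c, d):
    """Expected counts (row total * column total / S) for each cell of the 2x2 table.

    Returned in the same layout as the observed table:
    [[Mj-/Ni-, Mj-/Ni+], [Mj+/Ni-, Mj+/Ni+]].
    """
    total = a + b + c + d
    row_minus, row_plus = a + b, c + d
    col_minus, col_plus = a + c, b + d
    return [
        [row_minus * col_minus / total, row_minus * col_plus / total],
        [row_plus * col_minus / total, row_plus * col_plus / total],
    ]

def chi_square_applicable(a, b, c, d, minimum=5):
    """True only if every expected cell frequency reaches the minimum."""
    return all(cell >= minimum
               for row in expected_frequencies(a, b, c, d)
               for cell in row)

# The 100-sequence example: a=82, b=2, c=1, d=15.
for row in expected_frequencies(82, 2, 1, 15):
    print([round(cell, 1) for cell in row])   # [69.7, 14.3] then [13.3, 2.7]
print(chi_square_applicable(82, 2, 1, 15))    # False
```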

Most of the previously published studies of residue correlations [Larson et al., 2000, Kass and Horovitz, 2002, Nemato et al., 2004] use the chi-square test to evaluate the statistical significance of correlations/co-variations/co-evolutions in family alignments. Because of the above-mentioned limitations of the chi-square test, in these studies correlations are eliminated if an expected cell frequency is less than 5. Although this elimination step naively appears to apply only to insignificant correlations or alignments with a smaller number of sequences, as seen in the above example it will also eliminate quite strong correlations with numerous sequences where there is little variability in the subfamily sharing the correlation.

This limitation of the chi-square test of association can be circumvented through application of the Fisher Exact Probability test [Lowry, 2000]. The Fisher test gives a statistical significance of 8.4e-15 for the above example correlation. Since it has no minimum expected cell frequency, the Fisher test is plainly superior to a chi-square test even in the cases where chi-square might be legitimately employed. In addition, the chi-square test is intrinsically non-directional, whereas the Fisher test can be applied as either a directional or a non-directional test [Lowry, 2000]. Therefore, in our analysis we used the Fisher Exact Probability test to evaluate the statistical significance of identified amino acid associations.

Let’s consider the following example to understand the Fisher procedure. In a 15 sequence alignment, Ni occurs 7 times, Mj occurs 6 times, and the NiMj pair occurs 5 times. The phi coefficient for the association of NiMj will be 0.6.

            Ni-    Ni+    Totals
Mj-          7      2        9
Mj+          1      5        6
Totals       8      7       15

Given the marginal totals of Ni-, Ni+, Mj- and Mj+ as 8, 7, 9 and 6, there are 7 possible ways in which the specific correspondences between Ni and Mj might sort themselves out by mere chance. These possible outcomes are labeled as O1, O2, O3, O4, O5, O6 and O7 in the following array of tables.

            O1         O2         O3         O4         O5         O6         O7
          Ni- Ni+    Ni- Ni+    Ni- Ni+    Ni- Ni+    Ni- Ni+    Ni- Ni+    Ni- Ni+    Totals
Mj-        2   7      3   6      4   5      5   4      6   3      7   2      8   1        9
Mj+        6   0      5   1      4   2      3   3      2   4      1   5      0   6        6
Totals     8   7      8   7      8   7      8   7      8   7      8   7      8   7       15

Outcomes falling toward the right end of this array, where Ni and Mj co-occur more often than expected, would indicate positive associations between Ni and Mj, those falling toward the left end would indicate negative associations between Ni and Mj, and those lying toward the middle would approximate zero association. The mere chance probability of an outcome is greatest toward the middle of the range and decreases sharply toward either extreme. The null hypothesis in this example is that there is no association between Ni and Mj. The particular outcome observed in this example is O6. So, the question of statistical significance is: if the null hypothesis were true, how likely is it that we would end up with an outcome at least as extreme as the observed one, i.e. either O6 or O7?

In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme, or more extreme than was actually observed, given a true null hypothesis, we have to calculate the values of p for both these tables, and add up these separate disjunctive probabilities. The exact probability of observing a particular arrangement of the data can be calculated with the following formula:

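$$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{S}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{S!\,a!\,b!\,c!\,d!}$$

where a, b, c and d are the four cell counts of the contingency table read row by row, and S = a + b + c + d is the total number of observations.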

According to this formula, p values for the outcomes O6 and O7 can be calculated as:

$$p_{O6} = \frac{9!\,6!\,8!\,7!}{15!\,7!\,2!\,1!\,5!} = 0.03357$$

$$p_{O7} = \frac{9!\,6!\,8!\,7!}{15!\,8!\,1!\,0!\,6!} = 0.00139$$

As a result, the p-value of this example is p_O6 + p_O7 = 0.03357 + 0.00139 = 0.03496.
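The worked example can be checked by enumerating the observed table and every more extreme table with the same marginal totals. The sketch below is a Python illustration under the assumption of a one-tailed test in the direction of positive association; the function names are hypothetical and exact integer binomials from the standard library stand in for the factorial formula.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of one 2x2 table (a, b / c, d) given its fixed marginal totals."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_one_tailed(a, b, c, d):
    """Sum the probabilities of the observed table and of all tables that are
    more extreme toward positive association (larger d, margins unchanged)."""
    p = 0.0
    while b >= 0 and c >= 0:
        p += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1   # move one count onto the diagonal
    return p

p_o6 = table_probability(7, 2, 1, 5)   # outcome O6
p_o7 = table_probability(8, 1, 0, 6)   # outcome O7
p_value = fisher_one_tailed(7, 2, 1, 5)
print(p_o6, p_o7, p_value)
```

The two terms agree with the hand calculation to the precision shown in the text, and their sum is approximately 0.035.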

Because of the numerical and computational difficulties resulting from the large values of the factorials, the Fisher Exact Probability test can be applied in the way described above only to contingency tables with up to about 100 observations (S). To exceed this practical limitation, rapid approximation methods are typically employed. The Gamma approximation is the most popular way to overcome this computational difficulty. In mathematics, the Gamma function is an extension of the factorial function to real and complex numbers. The logarithm of the Gamma function can be incorporated into the p value calculations. The combination calculation $\binom{n}{k}$ can be written in terms of the natural logarithm of the Gamma function (lngamma) as follows:

$$\binom{n}{k} = e^{\ln\binom{n}{k}} = e^{\ln\left(\frac{n!}{k!\,(n-k)!}\right)} = e^{\mathrm{lngamma}(n+1) - \mathrm{lngamma}(k+1) - \mathrm{lngamma}(n-k+1)}$$

Here the identity n! = Γ(n+1) is used, so that ln(n!) = lngamma(n+1). Then, the p function to calculate the exact probability of observing a particular arrangement of the data can be written in terms of the lngamma function as:

$$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{S}{a+c}} = \frac{e^{\mathrm{lngamma}(a+b+1) - \mathrm{lngamma}(a+1) - \mathrm{lngamma}(b+1)}\; e^{\mathrm{lngamma}(c+d+1) - \mathrm{lngamma}(c+1) - \mathrm{lngamma}(d+1)}}{e^{\mathrm{lngamma}(S+1) - \mathrm{lngamma}(a+c+1) - \mathrm{lngamma}(S-a-c+1)}}$$

We used Lanczos’ [1964] approximation to the Gamma function in our calculations. We adopted Alan Miller’s FORTRAN implementation of the lngamma function and Oyvind Langsrud’s web implementation of the Fisher Exact Probability test. Our Perl implementation of the natural logarithm of the Gamma function is as follows:

    # Natural logarithm of the Gamma function (Lanczos approximation).
    # Translation of Alan Miller's FORTRAN-implementation
    # http://lib.stat.cmu.edu/apstat/245
    sub lngamma
    {
        my ($z) = @_;
        my $x = 0;    # accumulates the Lanczos series
        $x += 0.1659470187408462e-06/($z+7);
        $x += 0.9934937113930748e-05/($z+6);
        $x -= 0.1385710331296526    /($z+5);
        $x += 12.50734324009056     /($z+4);
        $x -= 176.6150291498386     /($z+3);
        $x += 771.3234287757674     /($z+2);
        $x -= 1259.139216722289     /($z+1);
        $x += 676.5203681218835     /($z);
        $x += 0.9999999999995183;
        return (log($x)-5.58106146679532777-$z+($z-0.5)*log($z+6.5));
    }
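To illustrate how the log-gamma formulation avoids large factorials, the following Python sketch evaluates the Fisher probability entirely in log space. It is an assumption-laden illustration rather than the thesis code: the standard library function `math.lgamma` is swapped in for the Lanczos routine above, the function names are hypothetical, and a one-tailed test toward positive association is assumed.

```python
from math import exp, lgamma

def ln_choose(n, k):
    """ln C(n, k), using ln(n!) = lgamma(n + 1)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def table_probability(a, b, c, d):
    """Probability of one 2x2 table; logs are summed and exponentiated once."""
    s = a + b + c + d
    return exp(ln_choose(a + b, a) + ln_choose(c + d, c) - ln_choose(s, a + c))

def fisher_one_tailed(a, b, c, d):
    """Observed table plus all tables more extreme toward positive association."""
    p = 0.0
    while b >= 0 and c >= 0:
        p += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return p

p_large = fisher_one_tailed(82, 2, 1, 15)   # 100-sequence example
p_small = fisher_one_tailed(7, 2, 1, 5)     # 15-sequence example
print(p_large, p_small)
```

Under these assumptions the 100-sequence example gives a value of about 8.4e-15, in line with the significance quoted earlier, and the 15-sequence example gives about 0.035.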

Figure 3.12 (A) shows the histogram of the base-10 logarithm of calculated statistical significances over the number of sequences in the analyzed families, while Figure 3.12 (B) shows the average number of correlations per structure over the base-10 logarithm of the statistical significance and the number of sequences in the analyzed families. When the sample size is large, very small differences will be detected as significant by statistical tests. In our case, when the number of sequences increases, small residue associations will be detected as statistically significant. Therefore, as the number of sequences increases, correlations with higher significance are more likely to be observed. On the other hand, while a significant test result permits us to reject the null hypothesis of the statistical test, i.e. that there is no association, with a certain degree of confidence, a non-significant result does not allow us to accept the null hypothesis. When the number of sequences is insufficient to obtain statistically significant correlations, strong correlation as measured by the phi coefficient can be taken into account.

Figure 3.13 (A) and (B) show the histogram of the number of calculated correlations and the average number of correlations per structure, respectively, over the base-10 logarithm of statistical significance and the sequence lengths of the analyzed families. As expected, more residue associations are calculated for the families with longer sequences (B).

Figure 3.12. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus base-10 logarithm of statistical significance and the number of sequences in the analyzed families.

Since the majority of the analyzed families (70%) have 20-80 sequences (Figure 3.4), the number of correlations is amplified in that range, with lower statistical significances (A). The average number of correlations per structure is calculated simply by dividing the total number of correlations by the number of family-structure pairs analyzed within the given range of sequence counts (B). This normalized histogram shows that correlations calculated for smaller families are less significant, and that more significant correlations can be calculated as the families get larger. A large number of correlations are calculated for some families with 300, 330, 370 and 400 sequences. This is mostly because these families are composed of very long sequences.


Figure 3.13. (A) Histogram of the total number of correlations and (B) average number of correlations per structure versus base-10 logarithm of statistical significance and sequence length of the analyzed families.

Longer sequences mean that more columns are incorporated in the correlation calculations. Therefore, we expect to observe more correlations as the sequences get longer. Since the majority of the analyzed families (77%) have a sequence length of 20-300 (Figure 3.5), the number of correlations is amplified in that range (A). The average number of correlations per structure is calculated simply by dividing the total number of correlations by the number of family-structure pairs analyzed within the given sequence length range (B). This normalized histogram clearly shows that we observe more correlations as the sequences in the family alignments get longer. Since the existence of statistically significant correlations depends on the number of sequences in the family alignment (Figure 3.12), families with a large number of sequences increase the average number of correlations at the highest significance levels.


3.2.4 Grouping of amino acids

Amino acids can be grouped together, for example, by their hydrophobicity, polarity, structure, size, etc. We have listed some of the important indices for amino acid properties in Table 2.3. The foremost characteristics of the amino acids are hydrophobicity and hydrophilicity. Depending on the polarity of the side chain, amino acids vary in their hydrophilic and hydrophobic character. Hydrophobicity is a measure of how strongly the side chains are pushed out of water, while hydrophilicity is a measure of how likely the side chains are to be in contact with the aqueous environment. Amongst numerous hydrophobicity indices, Janin [1976], Wolfenden [1981], Kyte and Doolittle [1982], and Rose [1985] are the most popular (Table 2.4). These are not identical because hydrophobicity scales are experimentally derived from the solvation and cavitation free energy calculations for the side chains. Therefore, the hydrophobicity scale of a side chain varies according to the conformational variation of the peptide [Junhyoung et al., 2003]. Thus, no single hydrophobicity scale is the best in all circumstances. Besides these expected variations among the scales, it is also challenging to cluster the continuous values of hydrophobicity scales into discrete groups.

In our analysis, we aim to extract and summarize correlations between residues in reasonably similar classes of hydrophobicity and hydrophilicity. To group amino acids with similar hydrophobicities, we used the hydrophobic similarity matrix by George et al. [1990]. Since the matrix already has pairwise similarity scores for amino acids, clustering of the residues with similar hydrophobicities was straightforward. Figure 3.14 shows the similarity matrix and the grouping schema. Figure 3.15 shows the distribution of the number of amino acids in each property group for the 961 family alignments analyzed in this study. We summarized associations between the residues of the same property group based on this grouping schema. This allowed us to examine the structural differences amongst the correlations of distinct property groups.

Figure 3.14. Hydrophobic similarity matrix by George et al. [1990] and grouping of amino acids based on their hydrophobic similarity score.

We interpreted regions of coherently high similarity as clusters of similar residues. According to this grouping schema, (D,E,K,R), (W,Y,F), (N,Q,G,S), (V,C,I,L,M,P) and (A,H,T) are grouped together as charged, aromatic, hydrophilic, very hydrophobic and hydrophobic amino acids, respectively.
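The grouping schema can be expressed as a simple lookup table. The following Python sketch is illustrative only; the names `PROPERTY_GROUPS`, `GROUP_OF` and `shared_group` are assumptions, while the group memberships are those listed above.

```python
# Property groups as listed in the grouping schema (one-letter amino acid codes)
PROPERTY_GROUPS = {
    "charged": "DEKR",
    "aromatic": "WYF",
    "hydrophilic": "NQGS",
    "very hydrophobic": "VCILMP",
    "hydrophobic": "AHT",
}

# Reverse lookup: amino acid -> property group
GROUP_OF = {aa: name for name, members in PROPERTY_GROUPS.items() for aa in members}

def shared_group(res_a, res_b):
    """Return the common property group of two residues, or None if they differ."""
    group_a = GROUP_OF.get(res_a)
    if group_a is not None and group_a == GROUP_OF.get(res_b):
        return group_a
    return None

print(shared_group("W", "F"), shared_group("D", "W"))
```

A helper of this kind would let an association be counted in a property-group summary only when both residues belong to the same group.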


Figure 3.15. The distribution of the number of amino acids in each property group for the sequence collection of 961 protein families analyzed in this study.

(D,E,K,R), (W,Y,F), (N,Q,G,S), (V,C,I,L,M,P) and (A,H,T) are grouped as charged, aromatic, hydrophilic, very hydrophobic and hydrophobic amino acids respectively.

3.2.5 Classification of Secondary Structural Components

In a protein family, structural conformation is more conserved than sequence. Since residue associations in family alignments represent the variable characteristics of the family, the propensities of residue associations to be located on different secondary structural elements can help us understand compensating correlations that conserve the structure.

Pfam family alignments provide secondary structural information for all sequences with a known structure. Secondary structural information was taken from the DSSP (Database of Secondary Structure in Proteins) program, which was developed to standardize secondary structure assignment [Kabsch and Sander, 1983]. Table 3.1 lists the DSSP codes and definitions for secondary structural elements.

DSSP code    DSSP definition
C            Random coil
H            Alpha-helix
G            3(10) helix
I            Pi-helix
E            Hydrogen bonded beta-strand (extended strand)
B            Residue in isolated beta-bridge
T            H-bonded turn (3-turn, 4-turn, or 5-turn)
S            Bend (five-residue bend centered at residue i)

Table 3.1. DSSP codes and their definitions for secondary structure assignments.

Figure 3.16 shows the distribution of these secondary structural elements in the 2,972 structures that are referred to by the 961 families analyzed in this study. There are 8 different secondary structure assignments provided by DSSP, and there are also cases where there is no secondary structural assignment. These 9 possible secondary structure assignments yield 45 possible pairings amongst them. Since studying the distribution of pairwise residue associations over 45 different secondary structural pairings would be quite complicated and confusing, we aggregated similar secondary structural elements: alpha-helix, 3(10) helix and pi-helix are aggregated as helix; H-bonded turn and bend are aggregated as turn; random coil and residue in isolated beta-bridge are aggregated into the no secondary structure (none) assignment; and hydrogen bonded beta-strand is kept by itself. Figure 3.17 shows the distribution of the aggregated secondary structural assignments for our dataset. We summarized the residence of pairwise residue associations on secondary structural elements based on these aggregated secondary structural assignments.
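The aggregation and the pairing counts can be sketched as follows. This Python illustration is not the thesis code: the dictionary name, the class label "strand" and the use of "-" to represent a missing assignment are assumptions. Pairings are counted as unordered pairs, including an element paired with itself.

```python
from itertools import combinations_with_replacement

# DSSP code -> aggregated class; "-" stands for "no assignment" (assumed symbol)
DSSP_TO_CLASS = {
    "H": "helix", "G": "helix", "I": "helix",
    "E": "strand",
    "T": "turn", "S": "turn",
    "C": "none", "B": "none", "-": "none",
}

def pair_class(ss_a, ss_b):
    """Aggregated, order-independent secondary structure pairing of two residues."""
    return tuple(sorted((DSSP_TO_CLASS[ss_a], DSSP_TO_CLASS[ss_b])))

# Unordered pairings before and after aggregation
all_pairings = list(combinations_with_replacement(sorted(DSSP_TO_CLASS), 2))
classes = sorted(set(DSSP_TO_CLASS.values()))
aggregated_pairings = list(combinations_with_replacement(classes, 2))

print(len(all_pairings), len(aggregated_pairings), pair_class("T", "G"))
```

With nine raw assignments there are 45 unordered pairings, while the four aggregated classes give 10, which is a far more manageable number of categories to summarize.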


Figure 3.16. The distribution of the secondary structural elements for the 2,972 structures that are referred by 961 protein families analyzed in this study.

The assignment of secondary structural elements was done by DSSP.

Figure 3.17. The distribution of the aggregated secondary structural elements for the 2,972 structures that are referred by 961 protein families analyzed in this study.

The secondary structural groups shown in Figure 3.16 are aggregated into similar groups. H-bonded turn and bend are grouped into turn; pi-helix, alpha-helix and 3(10) helix are grouped into helix; hydrogen bonded beta-strand is taken by itself as beta sheet; and random coil, residue in isolated beta-bridge and no assignment are grouped into none.

3.2.6 Random Correlations

To better evaluate the results of the analysis of physical distances between associated residues, we examined the distances between random amino acid pairs. We performed our random correlation analysis on the same set of families used in this study. Associated residues identified by our analysis can be close to or quite far away from each other on the protein sequence. Selecting completely random pairs of residues for a random correlation analysis would not provide the same sequential distance distribution as observed with real data. This is a problem because as the sequential distance between associated residues gets smaller, the probability of being in contact increases. Therefore, we shifted the real family alignments randomly and examined the distances of associated residues on these shifted sequences. By doing so, we used the same biological data and the same sequential distance distribution for associated residues to compute the distances between random “correlated” positions. Since the influence neighborhood of an amino acid could be up to 8Å [Manavalan and Ponnuswamy, 1977], amino acids within 8Å of each other were considered to be in close contact. Using our sequence permutation randomization, we observed that only 10% of the random correlations are in close contact (Figure 3.18). Figure 3.19 shows box plots of the closest and relative distances between random correlations.
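One possible reading of this randomization is sketched below in Python. It is an illustration under stated assumptions, not the procedure actually used: each associated position pair is moved by a single random offset so that its sequential separation is preserved, and a pair counts as being in contact when its closest distance is at most 8Å. The function names and the 0-based position convention are hypothetical.

```python
import random

def shift_pair(i, j, length, rng):
    """Shift positions i and j by one random offset, keeping both inside
    [0, length) so that the sequential separation |i - j| is unchanged."""
    lowest = -min(i, j)
    highest = length - 1 - max(i, j)
    offset = rng.randint(lowest, highest)   # inclusive bounds; may be 0
    return i + offset, j + offset

def contact_fraction(closest_distances, cutoff=8.0):
    """Fraction of residue pairs whose closest distance (in angstroms) is within the cutoff."""
    if not closest_distances:
        return 0.0
    return sum(d <= cutoff for d in closest_distances) / len(closest_distances)

rng = random.Random(0)                      # fixed seed for reproducibility
shifted = [shift_pair(3, 10, 50, rng) for _ in range(100)]
fraction = contact_fraction([5.0, 8.0, 12.0, 30.0])
print(shifted[:3], fraction)
```

Because both positions receive the same offset, the shifted pairs retain the sequential distance distribution of the real associations, which is the property the control relies on.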


Figure 3.18. The percentage of random correlations that are in close contact and not in contact.

Figure 3.19. Box plots of the closest distances (left) and relative distances (right) between random amino acid pairs.

Half of the random residue pairs are within 17-32Å of each other, with a median distance of 21Å. If we consider the unit distance as the molecule’s average diameter, half of the random residue pairs are within 0.27-0.56 unit of each other, with a median distance of 0.43 unit.

3.3 Results

3.3.1 Distance distribution of associated residues

We examined the distribution of the closest distances between associated residues. Besides absolute distances, we also computed the ratio of the distances to the molecule’s average diameter. This allowed us to understand the relative distances between associated residues compared to the molecule’s size, essentially asking “are these residues near each other in the context of the molecule?” To be able to evaluate absolute and relative distance distributions correctly, we depicted the distribution of the closest distances and relative distances between random correlations as box plots. As shown in Figure 3.19, the median distance between random residue pairs is 21Å and 50% of them are in the range of 17-32Å. The median relative distance between random residue pairs is 0.43 of the average molecule diameter and 50% of them are in the range of 0.27-0.56 of the average molecule diameter.

In Figure 3.20 and Figure 3.21, we present the relationship between the phi coefficient and the closest and relative distances between the associated residues of the 961 families analyzed in this study. We observed that absolute and relative distances between associated residues get smaller as the phi coefficient gets larger in the positive direction. However, the distribution of distances remains similar for the negative values of the phi coefficient. When compared with the distance distribution of random pairs, the distance distribution of strongly associated residues is shifted down by about 7Å. Relative distances between strongly associated residues are also closer than for a random pair by about 0.1 of the average molecule diameter, or about 25% closer on average. Figure 3.22 shows box plots of the closest distances between associated residues over the phi coefficient for different residue properties. Residues are grouped as explained in 3.2.4 and only the residue associations within these groups are plotted in this graph. We observed that associations between aromatic amino acids are present in closer proximity and have fewer outliers compared to other property groups.

Figure 3.20. Box plots of the closest distances between associated residues against phi coefficient.

As the phi coefficient gets stronger, in both the positive and negative directions, the distance distribution narrows and the number of outliers decreases. Especially in the case of positive correlations, the distance between associated residue pairs gets smaller with stronger correlation. For instance, when the phi coefficient is 1 or 0.9, 50% of associated residues are within 10-24Å of each other with a median distance of 16Å, and within 11-25Å of each other with a median distance of 18Å, respectively. Compared to the distance distribution between random pairs (50% of pairs within 17-32Å of each other with a median distance of 21Å), box plots of distances between strongly associated residues are shifted down by about 7Å, i.e. strongly associated residues tend to be 7Å closer than a random pair.


Figure 3.21. Box plots of the relative distances between associated residues against phi coefficient.

Distance ratios are calculated to obtain relative distances between associated residues simply by dividing the closest distance between the pair by the molecule’s average diameter. Relative distances between associated residue pairs get smaller with stronger positive correlations; however, they seem to maintain a similar distribution for negative correlations. If we consider the unit distance as the molecule’s average diameter, when the phi coefficient is 1 or 0.9, 50% of associated residues are within 0.17-0.46 unit of each other with a median distance of 0.32 unit, and within 0.20-0.47 unit of each other with a median distance of 0.35 unit, respectively. Compared to the relative distance distribution between random pairs (50% of pairs within 0.27-0.56 unit of each other with a median distance of 0.43 unit), box plots of relative distances between strongly associated residues are shifted down by about 0.1 unit, i.e. strongly associated residues tend to be 0.1 unit (10% of the molecule’s average diameter) closer than a random pair.


Figure 3.22. Box plots of the closest distances between associated residues against phi coefficient for different residue properties.

Associations between aromatic amino acids are present in closer proximity and have fewer outliers compared to other property groups.


In Figure 3.23 and Figure 3.24 we present the relationship between the base-10 logarithm of the statistical significance of the associations and the closest and relative distances between associated residues of the 961 families analyzed in this study. We observed that absolute and relative distances between associated residues get smaller as the statistical significance of the association increases. Positively associated amino acids are present in closer proximity compared to negatively associated ones. When compared with the distance distribution of random pairs, the distance distributions of statistically significant positive and negative associations are shifted down by about 8Å and 4Å, respectively. The relative distances between statistically significant residue associations are also closer than for random pairs. Figure 3.25 shows box plots of the closest distances between associated residues over the base-10 logarithm of statistical significance for different residue properties. We observed that associations between aromatic amino acids are present in closer proximity and have fewer outliers compared to other property groups, while associations between charged residues seem to span a larger distribution than other property groups.

In Figure 3.26 and Figure 3.27, we present the relationship between the residual parameter and the closest and relative distances between associated residues of the 961 families analyzed in this study. We observed that the absolute and relative distances between associated residues get smaller as the residual parameter gets larger in both the positive and negative directions. When compared with the distance distribution of random pairs, the distance distributions of residue associations with large residuals are shifted down by about 8Å, and their relative distances are also closer than for a random pair by about 0.1 of the average molecule diameter.

Histograms of the number of correlations computed for the 961 Pfam families examined in this study versus the phi coefficient, the base-10 logarithm of statistical significance and the residual parameter are plotted in Figure 3.28, Figure 3.29 and Figure 3.30, respectively. The proportions of propertywise correlations and residence on secondary structural elements are also color coded in these figures.


Figure 3.23. Box plots of the closest distances between associated residues against the base-10 logarithm of statistical significance.

Bisque and white colored box plots show the distances between positively and negatively associated residues, respectively. As the statistical significance increases, the distance distribution narrows and the number of outliers decreases. Positively associated residues are present in closer proximity compared to negatively associated residues. For instance, when the statistical significance of the associations is 10^-10, 50% of positively associated residues are within 9-23Å of each other with a median distance of 15Å, and 50% of negatively associated residues are within 13-26Å of each other with a median distance of 20Å. Compared to the distance distribution between random pairs (50% of pairs within 17-32Å of each other with a median distance of 21Å), box plots of distances between strongly associated residues are shifted down by about 4-8Å.


Figure 3.24. Box plots of the relative distances between associated residues against base-10 logarithm of statistical significance.

Bisque and white colored box plots show the relative distances between positively and negatively associated residues, respectively. Distance ratios are calculated to obtain relative distances between associated residues simply by dividing the closest distance between the pair by the molecule’s average diameter. If we consider the unit distance as the molecule’s average diameter, when the statistical significance is 10^-10, 50% of positively associated residues are within 0.19-0.45 unit of each other with a median distance of 0.31 unit, and 50% of negatively associated residues are within 0.23-0.47 unit of each other with a median distance of 0.35 unit. Compared to the distance distribution between random pairs (50% of pairs within 0.27-0.56 unit of each other with a median distance of 0.43 unit), box plots of the relative distances between strongly associated residues are shifted down by about 0.1 unit.


Figure 3.25. Box plots of the closest distances between associated residues against the base-10 logarithm of statistical significance for different residue properties.

Associations between aromatic amino acids are present in closer proximity and have fewer outliers compared to other property groups. Associations between charged residues seem to span a larger distribution than other property groups; however, their distribution also narrows as statistical significance increases.


Figure 3.26. Box plots of the closest distances between associated residues against residual.

As the residual increases, the distance distribution narrows and the number of outliers decreases. For instance, when the residual is 0.25, 0.24 or 0.23, 50% of positively associated residues are within 8-22Å of each other with a median distance of 15Å, within 10-24Å of each other with a median distance of 16Å, and within 9-24Å of each other with a median distance of 16Å, respectively. Compared to the distance distribution between random pairs -50% of pairs within 17-32Å of each other with a median distance of 21Å-, box plots of distances between strongly associated residues are shifted down about 8Å.


Figure 3.27. Box plots of the relative distances between associated residues against residual.

Distance ratios are calculated to obtain relative distances between associated residues simply by dividing closest distances between the pair by the molecule’s average diameter. Relative distances between associated residue pairs get smaller as the residual gets larger. If we consider the unit distance as the molecule’s average diameter, when the residual is 0.25 or 0.24, 50% of associated residues are within 0.16-0.37 unit of each other with a median distance of 0.26 unit, and within 0.21-0.43 unit of each other with a median distance of 0.31 unit respectively. Compared to the relative distance distribution between random pairs -50% of pairs within 0.27-0.56 unit of each other with a median distance of 0.43 unit-, box plots of relative distances between correlating residues with large residuals are shifted down about 0.1 unit and also the distribution is narrower by about 0.09 unit.


Figure 3.28. Histogram of the number of correlations between aggregate physical properties for the Pfam families examined in this study (961 families) versus phi coefficient.

Proportions of secondary structural elements (A) and physicochemical propertywise pairs (B) are shown.


Figure 3.29. Histogram of the number of correlations between aggregate physical properties for the Pfam families examined in this study (961 families) versus the base-10 logarithm of the statistical significance.

Proportions of secondary structural elements (A) and propertywise pairs (B) are shown.


Figure 3.30. Histogram of the number of correlations between aggregate physical properties for the Pfam families examined in this study (961 families) versus the residual.

Proportions of secondary structural elements (A) and propertywise pairs (B) are shown.

3.3.2 Residue Associations in Contact

To determine the statistic most capable of predicting residue contact-type associations, we examined the percentage of residue contacts in all pairwise residue associations computed for the 961 Pfam families for a given parameter value, i.e. phi coefficient, statistical significance or residual. As explained in 3.2.6, 10% of random residue pairs in this dataset are in contact. So, we interpreted our results in comparison to this expected proportion.

In Figure 3.31, the percentage of residue pairs that are in contact is plotted against the phi coefficient. When the phi coefficient is high, correlated pairs are in contact 50% to 90% more often than randomly chosen pairs. On the other hand, negative associations are in contact about as often as random pairs. In our analysis, thousands of pairwise correlations are generated for a given family alignment. A statistical significance of 0.0001 means that such an observation is expected by chance in 1 out of 10,000 random observations. Therefore, among thousands of correlations, statistical significance values larger than 0.0001 can be considered insignificant. In Figure 3.32, we plotted the percentage of residue pairs that are in contact against the phi coefficient only for the correlations with statistical significance smaller than 0.0001. We observed that when the phi coefficient is high, these pairs are between 60% more likely and more than twice as likely to be in contact as randomly chosen pairs. The contact proportion of strongly negatively associated residues stays around the random expectation.

In Figure 3.33, the percentage of residue pairs that are in contact is plotted against the base-10 logarithm of statistical significance. Statistically significant residue associations (<10^-5) are more than twice as likely to be in contact as randomly chosen pairs. For instance, when the statistical significance of a positive residue association is smaller than 10^-9, 22% of these associated residues are in contact. The percentage of residue contacts increases as the value of the statistical significance decreases, for both positive and negative associations. However, the proportion of residue contacts for positive correlations is always higher than for negative correlations.

In Figure 3.34, the percentage of residue pairs that are in contact is plotted against the residual. When compared to the random residue pairs, correlated residues with a considerable residual (absolute value greater than 0.1, as shown in this graph) always have a larger proportion of residue contacts. For instance, 27% and 20% of the correlated residue pairs with the strongest residual values of 0.25 and -0.25, respectively, are in contact. This simple statistic, where applicable, is surprisingly more powerful than the others for detecting contact-type interactions.
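The comparisons in this section all relate an observed contact percentage to the 10% contact rate of random pairs. A minimal Python sketch of that arithmetic is given below; the constant and function names are illustrative assumptions, and the input percentages are the ones quoted in the text.

```python
RANDOM_CONTACT_RATE = 0.10   # fraction of random residue pairs in contact (section 3.2.6)

def fold_enrichment(observed_rate, baseline=RANDOM_CONTACT_RATE):
    """How many times more often associated pairs are in contact than random pairs."""
    return observed_rate / baseline

significant_positive = fold_enrichment(0.22)   # significance below 10^-9
residual_positive = fold_enrichment(0.27)      # residual of 0.25
residual_negative = fold_enrichment(0.20)      # residual of -0.25
print(significant_positive, residual_positive, residual_negative)
```

Expressed this way, a 22% contact rate corresponds to roughly 2.2 times the random expectation, and the strongest positive and negative residuals correspond to roughly 2.7 and 2.0 times, respectively.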


Figure 3.31. The percentage of residue contacts versus phi coefficient.

When the phi coefficient is high, correlated pairs are in contact 50% to 90% more often than randomly chosen pairs. Negative associations give no indication of possible residue contacts.


Figure 3.32. The percentage of residue contacts versus phi coefficient (Statistical Significance <0.0001)

When only the residue associations with statistical significance smaller than 10^-4 are considered, associated residues with a large positive phi coefficient are between 60% more likely and more than twice as likely to be in contact compared to random residue pairs. Negative associations give no indication of possible residue contacts.


Figure 3.33. The percentage of residue contacts versus statistical significance.

The percentage of residue contacts between positively correlated residues increases as their statistical significance increases. When the statistical significance of a positive association is smaller than 10^-9, 22% of these associated residues are observed to be in contact. That is more than twice as frequent as close contacts amongst random pairs. Negatively associated residues also show more contacts than random residue pairs as the statistical significance increases.


Figure 3.34. The percentage of residue contacts versus residual parameter.

Correlated residues with a large residual, whether positive or negative, are in close contact between 2 and 2.5 times as often as random pairs.

Figure 3.35, Figure 3.36, Figure 3.37 and Figure 3.38 show the percentage of residue contacts within propertywise correlations versus the phi coefficient, the phi coefficient for pairs with statistical significance smaller than 10^-4, the statistical significance, and the residual, respectively. Correlations between aromatic residues have higher proportions of residue contacts than the other property groups for any given phi coefficient and statistical significance. However, the proportion of aromatic residue contacts decreases considerably at large positive residual values. Either residuals are unable to detect these interactions at highly associated positions, or these interactions are overrepresented at “medium” residual values and underrepresented at high residual values. Correlations between charged residues also have higher proportions of residue contacts for large positive phi coefficients and large residuals. In addition, the contact proportion of correlations between hydrophobic residues increases dramatically at the strongest positive phi coefficient, the highest statistical significance and the largest positive residual values.

Figure 3.35. The percentage of residue contacts grouped by chemical properties versus phi coefficient

Figure 3.36. The percentage of residue contacts grouped by chemical properties versus phi coefficient for highly significant pairs (Statistical Significance < 0.0001)

Figure 3.37. The percentage of residue contacts grouped by chemical properties versus statistical significance.

Figure 3.38. The percentage of residue contacts grouped by chemical properties versus the residual parameter

Figure 3.28

Amongst the 961 Pfam families analyzed in this study, 564 are oligomers, i.e. composed of two or more monomers. We examined the number of inter-chain correlations that are in contact for this subgroup. In Figure 3.39, the number of correlations in contact across the other chains is plotted against the phi coefficient, the base-10 logarithm of the statistical significance and the residual. The increased number of correlations in contact at the smaller values of the parameters is due to the increased number of observations at these values. On average, we observed 25 inter-chain contacts per structure within the given range of phi coefficient and statistical significance, and 13 inter-chain contacts per structure within the given range of residual.

Figure 3.39. The number of correlations in contact with other chains against the phi coefficient (left), base-10 logarithm of statistical significance (middle) and residual (right) in 564 protein families that are homo-oligomers.

3.3.3 Comparison of Observed and Expected Proportions of Propertywise Residue Associations and Residence on Secondary Structural Elements

In the last section of our analysis we compared observed and expected proportions of propertywise residue associations and residence of associated residues on secondary structural elements. The expected proportions of propertywise residue pairs and secondary structural element pairs are calculated based on the observed proportions of residue properties (Figure 3.15) and secondary structural elements (Figure 3.17) in the analyzed data respectively.
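As a rough illustration of how expected pair proportions can be derived from observed marginal proportions, the following minimal sketch assumes that the two members of a pair are drawn independently, so that the expected proportion of an unordered pair is the product of its marginals (doubled for mixed pairs). The text does not state the exact formula used, so this independence model, the function name and the example proportions are all assumptions for illustration only.

```python
from itertools import combinations_with_replacement


def expected_pair_proportions(marginals):
    """Expected proportion of each unordered category pair under independence."""
    expected = {}
    for a, b in combinations_with_replacement(sorted(marginals), 2):
        p = marginals[a] * marginals[b]
        # An unordered mixed pair can arise as (a, b) or as (b, a).
        expected[(a, b)] = p if a == b else 2 * p
    return expected


# Hypothetical marginal proportions, for illustration only (not thesis data).
example_marginals = {"helix": 0.5, "sheet": 0.3, "flexible": 0.2}
expected = expected_pair_proportions(example_marginals)
```

The observed proportion of each pair type among the correlated residues can then be compared against the corresponding entry of `expected`.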

In Figure 3.40, Figure 3.41 and Figure 3.42, the observed and expected distributions of propertywise residue associations are plotted next to each other over the phi coefficient, the base-10 logarithm of the statistical significance and the residual, respectively. We observed that, as the magnitude of the phi coefficient increases, correlations between aromatic residues and between hydrophilic residues become more prevalent, while correlations between very hydrophobic residues are less prevalent and correlations between charged residues are slightly less prevalent compared to the expected distribution. We made the same observations for the base-10 logarithm of the statistical significance and for the residual, with the exception that correlations between charged residues have a higher proportion than expected when the statistical significance is equal to or smaller than 10^-10, or when the residual is 0.25 or between -0.21 and -0.25.

In Figure 3.43, Figure 3.44 and Figure 3.45, the observed and expected distributions of secondary structural element pairs are plotted next to each other over the phi coefficient, the base-10 logarithm of the statistical significance and the residual, respectively. Compared to the expected distributions, we observed larger proportions of pairs in which both residues are in a helix, one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region. On the other hand, we observed smaller proportions of pairs in which both residues are in a beta sheet or one is in a beta sheet and the other is in a helix.

Figure 3.40. Observed and expected distributions of the propertywise correlations against the phi coefficient.

The observed proportions are shown in a solid color and the expected proportions are shown in a striped pattern. We observed that the correlations between aromatic and hydrophilic amino acids are more prevalent, while the correlations between very hydrophobic and charged residues are less prevalent compared to the expected proportions.

Figure 3.41. The observed and expected distributions of the propertywise correlations against the base-10 logarithm of the statistical significance.

The observed proportions are shown in a solid color and the expected proportions are shown in a striped pattern. We observed that the correlations between aromatic and hydrophilic amino acids are more prevalent, while the correlations between very hydrophobic and charged residues are less prevalent compared to the expected proportions.

Figure 3.42. The observed and expected distributions of the propertywise correlations against the residual.

The observed proportions are shown in a solid color and the expected proportions are shown in a striped pattern. We observed that, compared to the expected proportions, the correlations between aromatic and hydrophilic amino acids are more prevalent, while the correlations between very hydrophobic and charged residues are less prevalent, as the residual grows larger in either the positive or the negative direction.

Figure 3.43. The observed and expected distributions of the residence of correlated residues on secondary structural elements against the phi coefficient.

The observed proportions are shown in a solid color and the expected proportions are shown in a striped pattern. Compared to the expected proportions, we observed larger proportions of correlations where both residues are in a helix, one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region. On the other hand, we observed a smaller proportion of correlations with both residues in a beta sheet or with one in a beta sheet and the other in a helix.

Figure 3.44. The observed and expected distributions of the residence of correlated residues on secondary structural elements against the base-10 logarithm of the statistical significance.

The observed proportions are shown in a solid color and the expected proportions are shown in a striped pattern. Compared to the expected proportions, we observed larger proportions of correlations where both residues are in a helix, one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region. On the other hand, we observed a smaller proportion of correlations where both residues are in a beta sheet or one is in a beta sheet and the other is in a helix.

Figure 3.45. The observed and expected distributions of the residence of correlated residues on secondary structural elements against the residual.

The observed proportions are shown in a solid color and the expected proportions are shown in a striped pattern. Compared to the expected proportions, we observed larger proportions of correlations where both residues are in a helix or one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region. On the other hand, compared to the expected proportions, we observed a smaller proportion of correlations where one residue is in a beta sheet and the other is in a helix.

3.4 Discussion

In this study, we analyzed methods to detect residue associations in protein family alignments and presented the results of a large scale analysis of these associations on protein family alignments and structures.

We observed that the number of sequences in the family alignment places practical limitations on the quantity and the strength of the computed residue associations. Figure 3.46 shows a scatter plot of the relationship between the phi coefficient, the base-10 logarithm of the statistical significance and the number of sequences in the alignment. As the number of sequences increases, it becomes less likely to observe strong residue associations and more likely to observe statistically more significant ones. Conversely, when the number of sequences is small (<50), we observe phi coefficients all along their range, while statistical significances are not observed to be better than 10^-5. Note that the former effect (weaker associations in large families) is a practical limitation, while the latter (bounded significance in small families) is a theoretical one. Theoretically, it is possible to observe any phi coefficient for any given number of sequences. In practice, however, protein families become more diverse as they grow in the number of sequences, and it becomes less likely to observe strong identitywise associations along the columns of such families. Therefore, for families with a large number of sequences, the statistical significance of the association should be considered rather than just its magnitude. On the other hand, statistical significance tells only how sure we are that there is an association. After finding a statistically significant association, it is important to evaluate its strength, i.e. the phi coefficient. Consequently, this practical relationship between the phi coefficient, the statistical significance and the number of sequences in the family alignment should be considered when evaluating the results of a residue association study.
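The interplay between association strength, significance and family size can be illustrated with a minimal sketch. It computes the phi coefficient and a two-sided Fisher's Exact p-value for a 2x2 table of sequence counts (residue pair present or absent at the two positions). The function names, the two-sided convention and the two example tables are assumptions made here for illustration; they are not taken from the MAVL implementation.

```python
from math import comb, sqrt


def phi_coefficient(a, b, c, d):
    """Phi coefficient for the 2x2 table [[a, b], [c, d]]."""
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact p-value: with the margins fixed, sum the
    probabilities of all tables that are no more likely than the observed one."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical tables, for illustration only.
small = (10, 0, 0, 10)        # 20 sequences, perfect association
large = (600, 400, 400, 600)  # 2000 sequences, much weaker association

print(phi_coefficient(*small), fisher_exact_two_sided(*small))
print(phi_coefficient(*large), fisher_exact_two_sided(*large))
```

In this sketch the 20-sequence table reaches the maximum phi coefficient of 1, yet its p-value cannot fall below roughly 10^-5, whereas the 2000-sequence table has a phi coefficient of only 0.2 and a far smaller p-value.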

Figure 3.47 shows the scatter plot of the relationship between the residual, the base-10 logarithm of the statistical significance and the number of sequences in the alignment. The effect of the number of sequences on the residual range is similar to, but more dramatic than, its effect on the phi coefficient. We do not observe any correlations with a large residual for families with a large number of sequences. Again, statistical significance should be evaluated in the context of the strength of association, but in general residuals tend to be very small for large families. Thus, the residual parameter is suitable only for interpreting residue associations from small families.

Figure 3.48 shows the scatter plot of the relationship between the residual, the phi coefficient and the number of sequences in the alignment. There are a small number of cases where the phi coefficient is very small and negative while the residual is small and positive. This is due to the difference in the treatment of gaps in the alignment. The residual calculation uses the total number of sequences in the alignment to calculate expected probabilities. The phi coefficient calculation, on the other hand, uses only the non-gap pairs as the total number of sequences to be considered. When there is a large number of gaps in both columns, such conflicting results between the residual and the phi coefficient may occur.
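The following minimal sketch shows how this difference in gap handling can produce opposite signs. It assumes the residual is the observed joint frequency minus the product of the positional frequencies, all taken over every sequence, while the phi coefficient is computed from a 2x2 table restricted to sequences with no gap at either position. The function names, the gap symbol and the example columns are hypothetical and chosen only to make the effect visible.

```python
from math import sqrt


def residual(col_i, col_j, res_i, res_j):
    """Observed minus expected joint frequency, using ALL sequences as the total."""
    n = len(col_i)
    f_i = sum(x == res_i for x in col_i) / n
    f_j = sum(y == res_j for y in col_j) / n
    f_ij = sum(x == res_i and y == res_j for x, y in zip(col_i, col_j)) / n
    return f_ij - f_i * f_j


def phi_nongap(col_i, col_j, res_i, res_j, gap="-"):
    """Phi coefficient over sequences that have no gap at either position."""
    pairs = [(x, y) for x, y in zip(col_i, col_j) if x != gap and y != gap]
    a = sum(x == res_i and y == res_j for x, y in pairs)
    b = sum(x == res_i and y != res_j for x, y in pairs)
    c = sum(x != res_i and y == res_j for x, y in pairs)
    d = len(pairs) - a - b - c
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


# Hypothetical columns: 20 ungapped sequences plus 80 with gaps at both positions.
rows = ([("A", "B")] * 4 + [("A", "C")] * 6 + [("D", "B")] * 6
        + [("D", "C")] * 4 + [("-", "-")] * 80)
col_i = [r[0] for r in rows]
col_j = [r[1] for r in rows]

print(residual(col_i, col_j, "A", "B"))    # small and positive
print(phi_nongap(col_i, col_j, "A", "B"))  # negative
```

Among the ungapped sequences the pair A/B occurs less often than expected, so the phi coefficient is negative, but the many gapped sequences dilute the positional frequencies enough that the residual comes out slightly positive.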

In general, for a given residual value the range of the phi coefficient is limited. For example, when the residual is 0.2, the phi coefficient can take values between 0.75 and 1.00; when the residual is -0.15, the phi coefficient can take values between -0.55 and -1. However, the converse is not true: for a given residual there is a lower limit on the magnitude of the phi coefficient, while for a given phi coefficient there is an upper limit on the residual but no lower limit. As summarized in Section 3.3, the amino acid contact predictivity of the residual parameter is the highest compared to the phi coefficient and the statistical significance. Although large values of the residual can be observed only for small to mid-sized families, and such values are not numerous, amino acid correlations with large residuals should be considered carefully. On the other hand, as the number of sequences increases, statistically more significant associations are observed, but the residuals are near 0 even when the phi coefficient is large. This correspondence between the phi coefficient and the residual confirms that the residual parameter is not suitable for identifying the strength of residue associations in large families.
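A short derivation suggests why a given residual bounds the phi coefficient from below. It is a sketch under a simplifying assumption: both quantities are computed over the same set of sequences, i.e. gaps are ignored. The symbols $p_a$, $p_b$ and $p_{ab}$ are notation introduced here for the frequency of residue a at position i, of residue b at position j, and of their joint occurrence.

$$ r = p_{ab} - p_a p_b, \qquad \phi = \frac{p_{ab} - p_a p_b}{\sqrt{p_a (1 - p_a)\, p_b (1 - p_b)}} = \frac{r}{\sqrt{p_a (1 - p_a)\, p_b (1 - p_b)}} $$

Since $p(1-p) \le 1/4$ for any frequency $p$, the denominator is at most $1/4$, and therefore

$$ |\phi| \ge 4\,|r|, \qquad \text{equivalently} \qquad |r| \le \frac{|\phi|}{4}. $$

Under this assumption a residual of 0.2 implies a phi coefficient of at least 0.8, and a residual of -0.15 implies a phi coefficient of at most -0.6, which is close to the observed ranges quoted above; the small differences may stem from the binning of values and from the different treatment of gaps. The bound also shows why a given phi coefficient limits the residual from above but not from below: the residual can be arbitrarily close to 0 when either positional frequency is far from 0.5.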

Figure 3.46. Three dimensional scatter plot of the phi coefficient, the base-10 logarithm of statistical significance and the number of sequences in the family alignment.

Figure 3.47. Three dimensional scatter plot of the residual, the base-10 logarithm of statistical significance and the number of sequences in the family alignment.

Figure 3.48. Three dimensional scatter plot of the phi coefficient, the residual and the number of sequences in the family alignment.

The statistical significance of an association is important even when the other parameters are not informative: for small families, to understand a disagreement between the phi coefficient and the residual, and for large families, to understand whether a small association is significant or not. Therefore, appropriate calculation of the statistical significance for any given association is crucial. In the literature, the statistical significance of covariations in family alignments is most commonly evaluated by the chi-square test, for example in [Larson et al., 2000, Kass and Horovitz, 2002, Nemoto et al., 2004]. The general formula for the chi-square test is as follows:

$$\chi^2(i,j) = \sum_{n} \frac{\left(N_{n,OBS} - N_{n,EXP}\right)^2}{N_{n,EXP}}$$

where n ranges over the set of different amino acid pairs that can be found at positions i and j, and $N_{n,EXP}$ and $N_{n,OBS}$ are the expected and observed numbers of sequences with this specific pair of amino acids. Due to the nature of the chi-square test, results for positions i, j should be disregarded when $N_{n,EXP}$ is smaller than 5. As explained in 3.2.3.3, this minimum requirement on the expected number of observations can eliminate many residue associations from consideration, even for large families. Therefore, the results of a chi-square test in this context are always questionable. In this study, we proposed the use of Fisher’s Exact probability test to compute the statistical significance of residue associations and demonstrated its applicability.
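To make the expected-count requirement concrete, the sketch below computes the chi-square statistic for two alignment columns and reports the smallest expected count. It assumes that the expected count of a residue pair is the product of the two positional counts divided by the number of sequences; the function name, the threshold parameter and the example columns are hypothetical.

```python
from collections import Counter


def chi_square_pairs(col_i, col_j, min_expected=5.0):
    """Chi-square statistic over all residue pairs at two alignment columns.

    Returns (statistic, smallest expected count, whether every expected
    count reaches min_expected).
    """
    n = len(col_i)
    count_i, count_j = Counter(col_i), Counter(col_j)
    observed = Counter(zip(col_i, col_j))
    statistic, smallest = 0.0, float("inf")
    for a, n_a in count_i.items():
        for b, n_b in count_j.items():
            expected = n_a * n_b / n  # expected count under independence
            smallest = min(smallest, expected)
            statistic += (observed[(a, b)] - expected) ** 2 / expected
    return statistic, smallest, smallest >= min_expected


# Hypothetical columns: 40 sequences with one rare residue at each position.
col_i = ["A"] * 36 + ["G"] * 4
col_j = ["L"] * 36 + ["V"] * 4
stat, smallest, usable = chi_square_pairs(col_i, col_j)
print(stat, smallest, usable)
```

Here the rare residues G and V always co-occur, yet the smallest expected count is far below 5, so the chi-square result for this column pair would have to be disregarded even though the association is perfect.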

The above-mentioned studies [Larson et al., 2000, Kass and Horovitz, 2002, Nemoto et al., 2004] and the other large scale studies have always analyzed columnwise amino acid correlations [Gobel et al., 1994, Shindyalov et al., 1994, Fariselli et al., 2001, Kundrotas and Alexov, 2006]. Correlation analysis in family alignments is highly dependent on the correctness of the multiple sequence alignment. Columnwise correlation calculations assume that the family alignment is correct, then apply advanced statistical analysis and incorporate amino acid similarity or mutation matrices to detect correlated mutations, or co-varying positions. Conserved positions in a family can be aligned easily by any multiple sequence alignment approach. However, varying positions in the family are very susceptible to the multiple sequence alignment method used. The selection of the scoring matrix, i.e. mismatch scores and gap penalties, affects the final alignment dramatically at the variable positions. Thus, columnwise correlation analysis can be quite misleading. Even if the alignment is correct, a subgroup in the family might have an association preference while the rest may not. In large family alignments there may be errors in the alignment, while sequences with similar variational characteristics remain aligned correctly within their group. Any of these factors confounds the application of columnwise correlation analysis. Our proposed identitywise correlations, however, are considerably more robust to these types of perturbations. Misaligned subgroups in a family simply show up as containing residue associations that are independent of the associations in the rest of the family. This increases the complexity of the analysis and somewhat skews values such as the significance, but it does not mask the existence of associations, as columnwise correlation does. Therefore, the identitywise correlations proposed in this study can identify residue associations of subgroups even when the alignment is not perfect.

3.5 Conclusion

We have presented methodologies and the results of a large scale analysis of residue associations in protein family alignments. We calculated identitywise residue associations based on the observed population of sequences actually sharing the residues. Three main parameters were employed to measure the residue associations: 1) the residual, based on the difference between observed and expected residue frequencies in the alignment, 2) the phi coefficient, to quantify the degree of association, and 3) the statistical significance of the association, by Fisher’s Exact probability test. These parameters were computed for the 961 Pfam family alignments. Then, we examined the physical proximity and physicochemical properties of associated residues in protein family alignments and their presence on secondary structural elements. We observed that physical distances between associated residue pairs decrease as the strength of association and its statistical significance increase. Specifically, associations between aromatic residues and between hydrophilic residues are present in closer proximity compared to other physicochemical properties. Despite its surprisingly simple statistical formulation, the amino acid contact predictivity of the residual parameter is the highest compared to the phi coefficient and the statistical significance. Compared to the expected distributions, we observed larger proportions of pairs in which both residues are in a helix, one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region. Our method is robust to misalignment in a fashion that cannot be accomplished with columnwise methods, and the results with misaligned sequences may be applicable to improving structural alignments by examination of the predicted associations.

CHAPTER 4

4 CONCLUSION

In this research, we proposed and analyzed a method for detecting residue associations in protein family alignments and examined the applicability of this method to structural prediction.

In the first part, we presented the Multiple Alignment Variation Linker (MAVL) and StickWRLD to analyze biomolecule sequence alignments and visualize positive and negative identitywise interpositional residue associations [Ray, 2004, Ray, 2005, Ozer and Ray, 2006]. The MAVL/StickWRLD tool provides rapid exploration of alignment properties and insight into potential structural requirements that are embedded in the sequence identities. The visualization capabilities allow the user to rapidly identify and avoid defects in typical tools used to statistically model sequence families. This method has proven both intuitive and accurate for predicting likely structural features in protein [Ray, 2005, Ozer and Ray, 2006] and nucleic acid [Ray, 2004] families for which representative crystal or solution structures are known.

In the second part, we discussed the use of the residuals and the phi coefficient to determine the strength of a residue association, and of Fisher’s Exact probability test to evaluate the statistical significance of the association. We computed identitywise residue associations for 961 Pfam family alignments. Then, we examined the physical proximity and physicochemical properties of associated residues in the alignments and their presence on secondary structural elements. We observed that the proximity of residues increases as the strength of association and its statistical significance increase. The amino acid contact predictivity of the residual parameter is the highest compared to the phi coefficient and the statistical significance. Compared to the expected distributions, we observed larger proportions of pairs in which both residues are in a helix, one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region.

Compared to columnwise correlation methods, our proposed identitywise correlation approach can identify residue associations of subgroups even when the family alignment is not perfect. This increases the complexity of the analysis, but it does not mask the existence of associations, as columnwise correlation does.

REFERENCES

Afonnikov, D.A. and Kolchanov, N.A. (2004) CRASP: a program for analysis of coordinated substitutions in multiple alignments of protein sequences. Nucleic Acids Res. 32: W64–W68.

Agarwal, P. and Bafna, V. (1998) Detecting non-adjacent correlations within signals in DNA. In RECOMB’98.

Alan Miller. FORTRAN implementation of logarithm of Gamma function. http://lib.stat.cmu.edu/apstat/245

Altschuh D., Lesk A.M., Bloomer A.C. and Klug A. (1987) Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J Mol Biol. 193:693–707.

Bailey, T.L. and Gribskov, M. (1998) Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14: 48-54.

Baldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M.A. (1994) Hidden Markov models of biological primary sequence information. Proc Natl Acad Sci. 91(3): 1059–1063.

Barash, Y., Elidan, G., Friedman, N. and Kaplan T. (2003) Modeling Dependencies in Protein-DNA Binding Sites. In RECOMB’03.

Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L.L. et al. (2004) The Pfam Protein Families Database. Nucleic Acids Res., 32, D138–D141.

Beitz, E. (2006) Subfamily logos: visualization of sequence deviations at alignment positions with high information content. BMC Bioinformatics, 7(1): 313.

Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne: The Protein Data Bank. Nucleic Acids Res., 28 pp. 235-242 (2000).

Berry, M.B., Phillips,G.N. Jr. (1998) Crystal structures of Bacillus stearothermophilus adenylate kinase with bound Ap5A, Mg2+ Ap5A, and Mn2+ Ap5A reveal an intermediate lid position and six coordinate octahedral geometry for bound Mg2+ and Mn2+. Proteins, 32:276-288.

Birney, E. (2001) Hidden Markov models in biological sequence analysis. IBM J. Res. and Dev. 45(3/4)

Biro, J.C. (2006) Amino acid size, charge, hydropathy indices and matrices for protein structure analysis. Theor Biol Med Model, 3: 15.

Brazma, A., Jonassen, I., Eidhammer, I. and Gilbert, D. (1997) Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 5(2):279-305.

Bulyk, M.L., Johnson, P.L. and Church, G.M. (2002) Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res., 30:1255–61.

Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA J. Mol. Biol., 268, 78–94.

Casari, G., Sander, C. and Valencia, A. (1995) A method to predict functional residues in proteins. Nat. Struct. Biol. 2(2):171-178

Chelvanayagam, G., Eggenschwiler, A., Knecht, L., Connet, G.H. and Benner, S.A. (1997) An analysis of simultaneous variation in protein structures. Protein Eng. 10: 307–316.

Clamp, M., Cuff, J., Searle, S. M. and Barton, G. J. (2004) The Jalview Java Alignment Editor. Bioinformatics, 12: 426-7

Crooks, G.E., Hon, G., Chandonia, J.M. and Brenner, S.E. (2004) WebLogo: A sequence logo generator. Genome Research, 14:1188-1190.

Eddy, S. (2004) What is a hidden Markov model? Nature Biotechnology 22: 1315-1316.

Eddy, S., Mitchison, G. and Durbin, R. (1995) Maximum discrimination hidden Markov models of sequence consensus. J. Comput. Biol. 2: 9-23.

Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14(9): 755--763.

Edgar, R.C. and Batzoglou, S. (2006) Multiple sequence alignment. Current Opinion in Structural Biology, 16:368–373.

Edwards, A. L. (1976) "The Phi Coefficient." §7.2 in An Introduction to Linear Regression and Correlation. San Francisco, CA: W. H. Freeman, pp. 68-72.

Ellrott, K., Yang, C., Sladek, F.M. and Jiang, T. (2002) Identifying transcription factor binding sites through Markov chain optimization. Bioinformatics, 18: S100–S109.

Fariselli P, Olmea O, Valencia A, Casadio R. (2001) Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins, 45(Suppl 5):157–162.

Fersht A. (1999) The three-dimensional structure of proteins pp.1-53 in Structure and mechanism in protein science. W.H. Freeman and Company.

Finn, R.D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V. , Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S.R., Sonnhammer E.L.L. and Bateman A. (2006) Pfam: clans, web tools and services. Nucleic Acids Res., 34:D247-D251.

Gobel, U., Sander, C., Schneider, R. and Valencia, A. (1994) Correlated mutations and residue contacts in proteins. Proteins, 18:309–317.

Gribskov, M., Luthy, R., Eisenberg, D. (1990) Profile analysis. Methods in Enzymology, 183, 146.

Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84, 4355-4358.

Halperin, I., Wolfson, H. and Nussinov, R. (2006) Correlated Mutations: Advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin family. Proteins, 63: 832–845.

Hatrick, K. and Taylor, W.R. (1994) Sequence conservation and correlation measures in protein structure prediction. Comp. Chem. 18(3): 245-249.

Hu, J., Li, B. and Kihara, D. (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 33(15): 4899-913.

Huang, H.D., Lee, T.Y., Tzeng, S.W., and Horng, J.T. (2005) KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res. 33: W226-W229.

Humphries,M.J., McEwan,P.A., Barton,S.J., Buckley,P.A., Bella,J. and Mould,A.P. (2003) Integrin Structure: heady advances in ligand binding, but activation still makes the knees wobble. Trends Biochem. Sci., 28, 313–319.

Junhyoung, K., Ky-Youb, N., Kwang-Hwi, C., Seung-Hoon, C.,Jae, S.N. and Kyoung, T. (2003) Theoretical Study on Hydrophobicity of Amino Acids by the Solvation Free Energy Density Model. Bull. Korean Chem. Soc. 12(24): 1742-1750.

Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 12 (22):2577-637.

Kawashima,S. and Kanehisa,M. (2000) Aaindex: amino acid index database. Nucleic Acids Res., 28: 374.

Klingler, T.M. and Brutlag, D.L. (1994) Discovering structural correlations in α-Helices. Prot. Sci., 3:1847–1857.

Krogh,A., Brown,M., Mian,I.S., Sjolander,K., and Haussler,D. (1994) Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235, 1501-1531.

Kundrotas, P.J. and Alexov, E.G. (2006) Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives. BMC Bioinformatics, 16:7-503.

Kyte, J. and Doolittle, R.F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157: 105-132.

Lanczos, C. (1964) A Precision Approximation of the Gamma Function. SIAM Journal on Numerical Analysis series B, 1:86-96.

Lapedes, A.S., Giraud, B.G., Liu, L.C. and Stormo, G.D. (1993) Correlated mutations in protein sequences: phylogenetic and structural effects. In Proceedings of the AMS/SIAM Conference on Statistics in Molecular Biology Vol. 33, Monograph Series of the Institute for Mathematical Statistics, Hayward, CA, 1993, 236–256.

Larson, S.M., Di Nardo, A.A. and Davidson, A.R. (2000) Analysis of covariation in an SH3 domain sequence alignment: applications in tertiary contact prediction and the design of compensating hydrophobic core substitutions. J Mol Biol. 303:433–446.

Leary, R. H., Rosen, J. B. and Jambeck, P.(2004) An Optimal Structure-Discriminative Amino Acid Index for Protein Fold Recognition. Biophysical Journal 86:411-419.

Lowry, R. (2000) faculty.vassar.edu/~lowry/VassarStats.html (Featured in Science 26 May 2000: Vol. 288. no. 5470, p. 1295)

Manavalan, P. and Ponnuswamy, P.K. (1978) Hydrophobic character of amino acid residues in globular proteins. Nature 275: 673-674.

Nemoto, W., Imai, T., Takahashi, T., Kikuchi, T. and Fujita, N. (2004) Detection of Pairwise Residue Proximity by Covariation Analysis for 3D-Structure Prediction of G-Protein-Coupled Receptors. Protein J. 23(6):427-35.

Nishikawa, K. and Ooi, T. (1980) Prediction of the surface-interior diagram of globular proteins by an empirical method. Int. J. Peptide Protein Res. 16: 19-32.

Olmea, O. and Valencia, A. (1997) Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des. 2:25–32.

Olmea, O., Rost, B. and Valencia, A. (1999) Effective use of sequence correlation and conservation in fold recognition. J Mol Biol. 293:1221-1239.

Orengo, C.A., Jones, D.T., and Thornton, J.M. (2003) Bioinformatics: Genes, proteins and computers. BIOS Scientific Publishers Ltd.

Osada, R., Zaslavsky, E. and Singh, M. (2004) Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics, 20: 3516–3525.

Oyvind Langsrud. Web implementation of Fisher Exact Probability test. http://www.langsrud.com/fisher.htm

Ozer, H.G. and Ray, W.C. (2006) MAVL/StickWRLD: Analyzing Structural Constraints using Interpositional Dependencies in Biomolecular Sequence Alignments. Nucleic Acids Res., 34, W133-W136.

Pazos, F., Helmer-Citterich, M., Ausiello, G. and Valencia, A. (1997) Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. 271: 511-523.

Pollastri, G., Baldi, P., Fariselli, P. and Casadio, R. (2001) Improved prediction of the number of residue contacts in proteins by recurrent neural networks. Bioinformatics, 17: S234-S242.

Pollock, D.D. and Taylor, W.R. (1997) Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10: 647-657.

Pollock, D.D., Taylor, W.R. and Goldman, N. (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287: 187–198.

Ponnuswamy, P.K. (1993) Hydrophobic characteristics of folded proteins. Prog Biophys Mol Biol. 59: 57-103.

Pritchard, L., Bladon, P.M.O., Mitchell, J.J. and Dufton, M.J. (2001) Evaluation of a novel method for the identification of coevolving protein residues. Protein Eng. 14: 549–555.

Ray, W.C. and Ozer, H.G. “Discovering Biostructure Constraints using VRML Visualization”, Association for Computing Machinery SIGGRAPH, 32nd International Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA. July/August 2005.

Ray, W. (2004) MAVL and StickWRLD: visually exploring relationships in nucleic acid sequence alignments. Nucleic Acids Res., 32, W59–W63.

Ray, W. (2005) MAVL/StickWRLD for protein: visualizing protein sequence families to detect non-consensus features. Nucleic Acids Res., 33, W315–W319.

Rigden, D.J. (2002) Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments. Protein Engineering, 15 (13) 65-77.

Sammut, S.J., Finn, R.D. and Bateman, A. (2008) Pfam 10 years on: 10 000 families and still growing. Briefings in Bioinformatics, 0: bbn010v1-bbn010

Schneider, T.D. (1997) Information content of individual genetic sequences. J Theor Biol. 189(4):427-41.

Schneider, T.D. (2002) Consensus sequence Zen. Appl Bioinformatics 1(3):111-9.

Schneider, T.D. and Stephens, R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18(20):6097-100.

Schuster-Boeckler, B., Schultz, J. and Rahmann, S. (2004) HMM Logos for visualization of protein families. BMC Bioinformatics, 5:7.

Shaner, M.C., Blair, I.M. and Schneider, T.D. (1993) Sequence Logos: A Powerful, Yet Simple, Tool. In Mudge, T.N., Milutinovic, V. and Hunter, L. (eds), Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, Vol. 1: Architecture and Biotechnology Computing. IEEE Computer Society Press, Los Alamitos, CA, pp. 813–821.

Shindyalov, I.N., Kolchanov, N.A. and Sander, C. (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng. 7: 349–358.

Socolich, M., Lockless, S.W., Russ, W.P., Lee, H., Gardner, K.H. and Ranganathan, R. (2005) Evolutionary information for specifying a protein fold. Nature 437, 512-518.

Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 12(1):505-519.

Taylor, W.R. and Hatrick, K. (1994) Compensating changes in protein multiple sequence alignments. Protein Eng. 7: 341–348.

Thomas, D.J., Casari, G. and Sander, C. (1996) The prediction of protein contacts from multiple sequence alignments. Protein Eng. 9:941–948.

Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res., 25: 4876–4882.

Tillier, E.R. and Lui, T.W. (2003) Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics, 19: 750–755.

Tuffley, C. and Steel, M. (1998) Modelling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147: 63–91.

Valdar, W.S.J. (2002) Scoring residue conservation. Proteins, 48: 227–241.

Vinogradova, O., Velyvis, A., Velyviene, A., Hu, B., Haas, T.A., Plow, E.F. and Qin, J. (2002) A structural mechanism of integrin alphaIIb beta3 ‘inside-out’ activation as regulated by its cytoplasmic face. Cell, 110, 587–597.

Wilson, E.B. (1931) Correlation and association. Journal of the American Statistical Association, 26 (173): 250-257.

Xing, E.P., Jordan, M.I., Karp, R.M. and Russell, S. (2003) A hierarchical Bayesian Markovian model for motifs in biopolymer sequences. Proc. of Advances in Neural Information Processing Systems 16.

Zhang, M.Q. and Marr, T.G. (1993) A weight array method for splicing signal analysis. Computer Applications in the Biosciences, 9(5): 499-509.
