
Defining functional distance using manifold embeddings of gene ontology annotations

Gilad Lerman*† and Boris E. Shakhnovich†‡

*Department of Mathematics, University of Minnesota, Minneapolis, MN 55455; and ‡Program in Bioinformatics, Boston University, Boston, MA 02215

Communicated by Ronald R. Coifman, Yale University, New Haven, CT, April 9, 2007 (received for review June 7, 2006)

Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure–function relationships in biological context, resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules.

methods | diffusion geometry | domain evolution | functional annotation | homology modeling

Author contributions: G.L. and B.E.S. performed research and wrote the paper.
The authors declare no conflict of interest.
Abbreviations: GO, Gene Ontology; DAG, directed acyclic graph.
†To whom correspondence may be addressed. E-mail: [email protected] or [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0702965104/DC1.
© 2007 by The National Academy of Sciences of the USA

One of the fundamental questions in biology deals with the inter-relationship between structure, function and evolution. The need to precisely and quantitatively measure evolutionary relationships encouraged the development of robust and accurate sequence (1) and structure (2, 3) comparison methods. The importance of these algorithms to computational biology cannot be overstated. For example, the efficacy of transferring functional annotation depends on the precision of these sequence and structure comparison algorithms (4, 5). Although significant progress has been made in defining distance between sequences and structures, a rigorous understanding of functional distance is still limited.

At first glance, the notion of functional distance is qualitative and subjective. The development of annotation systems that depict function in a machine readable format was the first step in treating functional annotation rigorously. For example, the Gene Ontology (6) (GO) has become the gold standard for describing molecular functions of genes and proteins. However, the GO is not naturally amenable to measuring distance.

One complication is an intrinsic bias in annotation where large numbers of unrelated genes share the same annotation (ATPase), making those categories uninformative. Previous attempts at identifying functional relationships between genes focused mostly on calculating statistical over-representation of functional categories (7). These methods are well suited for quantifying coherence of function in sets of genes, but not useful for exploring structure–function or sequence–function relationships.

Recently, researchers have recognized the importance of measuring distance between annotations (8) and proposed a simple measure of distance using the shortest path algorithm (9). However, these kinds of distances lack resolution and are complicated by somewhat arbitrary characteristics of the ontology, e.g., when annotations on the same level differ in their degree of generality. Accordingly, we show that functional metrics based on shortest path algorithms perform significantly worse than methods based on diffusion-type manifold embedding (10) proposed in this work.

Defining distance between functional categories is integrally important due to potential insights into the coevolution of sequence, structure and function (11). For example, function, broadly defined as all activities performed by a set of sequences that fold into a domain structure, can be represented as a weighted subgraph of the GO directed acyclic graph (DAG) (12). This representation of function was used to establish the importance of considering homology relationships in a phylogenetic context. In this paper, we introduce more accurate and sensitive functional distances based on diffusion-type manifold embeddings of GO annotations to explore the structure–function relationship in detail.

Manifold embedding techniques are based on kernels (see definition in Materials and Methods), which have already been successfully applied to various problems in bioinformatics (13). In particular, computational approaches aimed at integrating various data sets have explored the effect of adding GO kernels for use in subsequent classification by SVM (14). Although our approach also employs kernels defined on GO, there are several fundamental differences. Most importantly, we apply these kernels to quantify functional distances as opposed to applications centered on classification of data into specific categories. Moreover, our approach naturally extends the notion of functional distance to protein domains by using the geometric interpretation of the manifold embedding (see Materials and Methods). Finally, we apply functional distances to exploring coevolution of sequence, structure, and function.

Functional distances defined here via diffusion-type manifold embedding techniques allow for increased sensitivity and arbitrary levels of granularity. Using our measures of functional distance, we can estimate the average divergence of function with respect to structure, sequence or phylogenetic similarity. Although this is clearly an area of active research, we show that functional distances are already accurate enough to discover specific relationships between protein domain functions. Finally, we show how functional distances can be used to explore divergent as well as convergent evolution.

Results

Defining Functional Distance. The molecular function component of the GO represents functional annotations as nodes on a DAG (6). We can capitalize on the hierarchical structure of the DAG to define local distances between functional annotations. Consider that there are only 20 possible annotations at the top, and >2,000 on the fifth level of the Gene Ontology. Thus, comparison at the top level of hierarchy will be, by design, less precise

11334–11339 | PNAS | July 3, 2007 | vol. 104 | no. 27 www.pnas.org/cgi/doi/10.1073/pnas.0702965104

Fig. 1. Correlation of functional distance with sequence, structure, and phylogenetic similarity. Functional distances between domains are based on manifold embedding of GO annotations using three different kernels and also formed by geodesic distances. Structural comparisons (black squares) are based on DALI Z-Scores (16), sequence comparisons (green triangles) are based on BLAST scores, and phylogenetic comparisons (red circles) are based on MI between phylogenetic profiles (see Materials and Methods). The structural, sequence and phylogenetic scores are normalized to be in the range of [0,1]. They appear on the y axis, whereas functional distances appear on the x axis. The data are binned such that there are 50 equidistant bins for each choice of kernel. Each bin represents thousands of functional comparisons. The best fit by a logistic function to the correlation between sequence alignment score and functional distance is shown as a dashed line. (a) Functional distance calculated using the local linear embedding kernel (28, 27). (b) Functional distance calculated using a diffusion kernel with power m = 7 (10). (c) Functional distance calculated using a pseudoinverse of the graph Laplacian (27). (d) Functional distance based on shortest path algorithm.

than at the bottom level. One way to address this would be to model the inherent bias of the ontology by taking into account node usage (14). For example, consider a case where a large proportion of proteins are coannotated with a pair of GO terms: the distance between these nodes on the GO DAG will be large because their cooccurrence is not specifically correlated with shared function.

Thus, the basic idea behind building an appropriate kernel is that GO terms shared by few protein sequences will be assigned small local distances or equivalently high values of local similarities. Alternatively, general annotations appearing at the top of the ontology will be assigned large local distances (or small similarities). Using the intuition outlined above, we form a graph where weights represent local similarities and use several techniques of manifold and graph embedding to calculate global distances between functional annotations. Embedding strategies exploit the underlying geometry of the graph and can implicitly correct ambiguities in the ontology. Finally, we use a global measure of distance between GO terms in combination with the representation of domain function as a GO subgraph (12) to compute meaningful functional distances between protein domains [see Materials and Methods and supporting information (SI) Text].

Correlating Functional Distance with Sequence, Structure, and Phylogenetic Proximity. We use the well known correlations of function with sequence, structure (12), and phylogenetic profiles (15) to evaluate the efficacy of using manifold embedding to quantify functional relationships between domains. The embedding procedure involves defining local similarity weights as described above and using them to form a kernel (the types of kernels used here and their direct relation to the notion of manifold embedding are described in Materials and Methods). The choice of kernel is arbitrary, but integrally important in the definition of distance. Thus, we compared the performance of several kernels in their ability to accurately represent functional distance between protein domains. We report results for four different choices of kernels. The first three are formed by diffusion-type kernels, whereas the fourth is similar to the previously proposed shortest distance between GO annotations (9).

We use Z scores (2) from DALI (16) to quantify structural proximity, BLAST (1) for sequence similarity and mutual information (MI) between phylogenetic profiles (15) for phylogenetic similarity (see Materials and Methods). We find that functional distances between protein domains calculated using diffusion-type kernels correlate well with sequence alignment, structural proximity and phylogenetic similarity (Fig. 1 a–c). Importantly, the dynamic range of the correlations is very large and the averaging due to binning almost insignificant. On the other hand, the distance metric based on the shortest path algorithm shows no significant correlation with either homology or phylogenetic similarity (Fig. 1d). A clear benefit of developing a rigorous functional distance metric is the comparison of functional information in sequence, structure alignment, and phylogenetic profiling.

One thing to note from Fig. 1 is the dependence of the observed correlations on the choice of kernel. For example, the correlation between protein structure similarity and functional

distance can be described by a first-order exponential decay, along the full range from far-diverged folds (Z = 6) to superfamily (Z = 9), and closely related proteins that belong to the same structural family (Z > 12) (11) and often share the same function. This behavior is similar for all diffusion-type kernels considered in the present work. However, the rate of exponential decay depends on the kernel. We observed that the LLE kernel shows the slowest decay rate (T = 0.54; T is the mean lifetime), whereas the inverse Laplacian kernel (pseudoinverse of the graph Laplacian) shows the steepest one (T = 0.23) (Table 1). Furthermore, it is reasonable to assume that sequence alignment will relate to functional distance through a logistic function. Indeed, good sequence alignment is highly informative of similarity in function, whereas above a certain threshold, sequence alignment provides little information about functional proximity. Once again, we note that the different diffusion kernels can be characterized by the slope of the transition in the logistic fit. Consistent with results on correlation with structure, we find that the LLE kernel shows the shallowest slope (S = 6.38), whereas the inverse Laplacian has the steepest transition slope (S = 13.88) (Table 1).

We find that the differences in the observed correlations between functional distances derived from each kernel and sequence, structure and phylogenetic proximity measures can provide insight into the behavior of the kernel at different scales of resolution. The chosen diffusion-type kernels (pseudoinverse of the graph Laplacian, LLE and diffusion powers) emphasize different ranges of interaction between GO annotations and consequently result in range-specific resolutions. Specifically, the LLE kernel corresponds to a low power of diffusion and thus emphasizes shorter-range interactions between annotations. The diffusion kernel of power m = 7 represents a functional distance with good resolution at medium distances because it takes into account larger paths along the unified GO annotation graph. Consequently, the range of approximately linear correlation with sequence alignment shortens. Finally, the inverse Laplacian takes into account all powers of diffusion and thus incorporates all paths along the unified GO annotation graph. Therefore, it has impressive resolution at longer functional distances.

Consistent with the explanation presented above, both the structural alignment and sequence alignment show increasingly sharp transitions when applying the LLE kernel, followed by the diffusion kernel with power m = 7 and finally the inverse Laplacian kernel (Table 1). Thus, manifold embedding of GO can produce functional distances at the needed resolution by choosing a kernel appropriate to the specific application. For maximum resolution at small functional distances, the LLE kernel is most appropriate, whereas maximum resolution at long distances can be achieved by using the inverse Laplacian kernel. However, as expected, the qualitative behavior of the correlations remains the same for all choices of diffusion kernels.

Table 1. Fitted parameters to observed correlations between functional distance and sequence, structure, or phylogenetic similarity measures

Parameter                                LLE     Diff    Inv. Lapl.
Structure (parameter T)                  0.54    0.28    0.23
Phylogenetic similarity (parameter T)    0.18    0.21    0.19*
Sequence (parameter S)                   6.38    8.13    13.88

The structure and phylogenetic similarity correlations were fit to a first-order exponential decay curve (see Materials and Methods), and the exponential decay parameter T corresponds to the mean lifetime reported. The correlation with sequence is fit to a logistic function and the slope S of the transition is reported (see Materials and Methods).
*The differences between values in this row are not significant.

Building a Functional Domain Universe Graph. Next, we wanted to explore whether our definition of functional distance that correlates on average with sequence, structural, and phylogenetic similarities (Fig. 1) is accurate enough to yield biologically meaningful insights into the structure–function relationships of specific protein domains. We begin by creating a graph where nodes are domains colored by SCOP (17) fold annotation, and edges represent functional proximity calculated using the diffusion kernel (with m = 7). The graph is transformed into an unweighted version using an empirically derived threshold of F = 0.23. The resulting graph (Fig. 2a) illustrates both the specific functional relationships between individual domains and global relationships between folds and functions.

Two things become immediately apparent from functional embedding of the protein domain universe. First, at short functional distances, domains sharing fold classification form clusters sharing common function. Second, at intermediate functional distances, clusters of domains with related functions are proximal on the graph. For example, DNA-binding domains form a cluster that is close to the cluster containing exonuclease domains and transcription factor domains. As another example, Rossmann fold domains performing oxidoreductase activity are separated by only one step from domains with dehydrogenase activity.

Although the graph shows separation of domains by fold and function, the structure–function relationship is clearly multifaceted. Functional clusters are not entirely monochromatic, e.g., functions are usually fulfilled by domains of several different folds. Some folds are also multifunctional and appear in clusters that are far from each other, e.g., Ferredoxins. Other folds are more functionally exclusive and only participate in clusters that are in close proximity, e.g., TIM beta/alpha barrels, which mostly perform enzymatic functions. Finally, it appears that this representation of relationships between protein domain functions captures the separation of folds into functionally related superfamilies (17).

Exploring Structure–Function Coevolution. Interestingly, there are certain domains that link proximal clusters. These domains may represent the intermediates in the evolutionary path from one function to another. For example, consider two clusters (labeled B and E on Fig. 2a) populated mostly by 3-helical bundle domains. The B cluster contains domains responsible for DNA binding. Domains in this cluster bind to DNA nonspecifically (a representative structure is 1hlv, which is a centromere binding protein; ref. 18). On the other hand, the cluster labeled E is dominated by domains with the same 3-helical bundle structure, but those that bind to specific DNA sequences. These are mostly domains that carry out transcription initiation activity (a representative structure is the engrailed transcription factor 2hdd; ref. 19). Interestingly, there is one domain that also has the structure of a 3-helical bundle that is functionally proximal to both clusters and appears as the connecting hub. This domain is coded by a family of gamma-delta resolvases (1gdt; ref. 20). This is a family of proteins that binds to imperfectly conserved sequences (21) (Fig. 2b).

Clearly, sequence binding specificity is not explicitly described by GO. However, the 3-helical bundles are a remarkable example of how GO embedding and the subsequent graph theoretical treatment can uncover relationships between structures by placing their functions in biological context. Subsequent application of evolutionary trace methods to the three families can uncover the residues responsible for the differential binding specificity of the 3-helical bundles and their mutational dynamics.

Specificity of DNA binding in 3-helical bundle domains is an example of divergent evolution where sequences are related by common ancestry (22). On the other hand, convergent evolution is often defined as two proteins with no apparent homology


Fig. 2. Functional domain universe graph and structure-function coevolution. (a) Functional domain universe graph. The vertices of the graphs represent protein domains, whereas its edges represent functional similarity between domains. To draw the graph, we chose the top nine most commonly occurring folds and colored them according to SCOP classification. If functional score based on the diffusion kernel (m = 7) was less than 0.23, we drew an edge between the nodes. The boxes represent sets of domains involved in common function: A, heat shock proteins; B, DNA binding; C, RNA binding; D, exonucleases; E, transcription factors; F, glucose/galactose enzymes; G, AAA-ATPases; H, oxidoreductases; I, dehydrogenases; J, retroviral integrases; K, kinases. (b) An example of traversing the functional protein domain universe graph. A representative domain [1hlv (18), a Centromere Binding Protein] from the "B" cluster is blown up and the structure shown. The same is done for a representative domain [2hdd (19) Engrailed Transcription Factor] from the "E" cluster. The domain which has observable functional similarity to both clusters is 1gdt (20) (Site Specific Resolvase). The crystal structures clearly show that the protein chain gets increasingly closer to the DNA as site specificity is increased.

performing the same function (22). An additional benefit of defining functional distances is that we can easily detect instances of convergence by examining domains with close functional distance and no structural similarity. For example, using functional distances, we easily confirmed the well documented case of convergence of tRNA synthases [1pys (23) and 1a8h (24), F score = 0.001 and Z score < 2].

Discussion

Machine readable representations of function, e.g., GO, are a necessary first step toward high-throughput functional annotation of data from whole-genome sequencing and structural genomics projects. Although these databases represent an intuitively appealing representation of function, they are not immediately amenable to accurate definitions of functional distance. Using nonlinear manifold embedding techniques, we were able to define distances between functional annotations and use those to quantify distances between protein domains. We find that diffusion kernels perform remarkably well in creating an accurate global distance metric applicable to quantifying functional relationships between protein domains.

As an example of specific insights that can be uncovered using the proposed distance metric, we explore functional relationships between 3-helical bundle domains which form two clusters in function space. These functional clusters turn out to be separable by the specificity of DNA binding. The family of sequences that are functionally similar to both clusters binds with intermediate specificity. We were also able to confirm examples of convergence where domains sharing close functional proximity appear to have evolved independently. Further exploration of

this representation of the protein domain universe will undoubtedly uncover many more insights into the relationship between evolution of structure and function.

Kernel-based functional distance metrics have several important advantages over previously described methods (14), Euclidean measures (12), and shortest path algorithms (9). First, the diffusion-type manifold embedding techniques give rise to distances taking into account both the geometry of the ontology and intrinsic biases in annotations in a robust way (insensitive to small amounts of noise). In particular, distances between subgraphs of annotation (e.g., those representing protein domains) have a clear geometric interpretation. Secondly, manifold embedding learns distances between annotations, rather than using kernels for classifications or defining distances between genes. Consequently, this approach is more natural for evaluating and comparing relationships between sequence, structure, and function as opposed to previous metrics that focused on applying GO kernels as part of a heterogeneous dataset for classification of protein–protein interactions (14). As a result, these methods are significantly more general and can be applied in calculations of functional distances between arbitrary numbers of genes. Additionally, techniques presented here can be easily adapted to other ontologies. Finally, correlations with sequence, structure (2, 15) and phylogenetic proximity (Fig. 1) show that metrics based on diffusion-type manifold embedding are significantly more accurate than previously proposed measures (9).

Having the ability to estimate "distance" in function space is fundamental to computational biology in the postgenomic era. A variety of computational tasks including assessment of annotation accuracy from homology modeling and module detection from microarray data can be facilitated by an accurate measurement of functional relationship between genes.

Materials and Methods

The GO DAG (6) can be found at www.geneontology.org. For structural proximity calculations, we use the Dali domain dictionary (2). The list of domains (3306) can be found at romi.bu.edu/kernel_mapping/dali.txt. We use ASTRAL (25) to determine the SCOP (17) annotation for each domain. We use BLAST (1) to compare domain sequences. Matlab codes computing the following functional distances between annotations and protein domains can be found in www.math.umn.edu/~lerman/supp/protein_distance. More specific details of the methods are discussed in SI Text.

Annotating Each Structure as a Subgraph on the GO. To annotate structures using GO (6), we use the strategy (12) of collecting all annotations for sequences (from NRDB; ref. 26) that fold into the structure and reconstructing all paths up to the root of the GO DAG.

Local Similarities Between GO Annotations. Formally, we form a unified graph G whose nodes are all annotations of GO appearing in protein domains and whose edges are the union of all edges of subgraphs representing protein domains. The local similarity weight $w_{ij}$ on an edge connecting annotations i and j is defined as follows: $w_{ij} = 1/n_{ij}$, where $n_{ij}$ is the number of domain subgraphs containing that edge.

Similarities by Diffusion and LLE Kernels. A (positive definite) kernel K for the unified graph is a real symmetric matrix whose size is N, the number of vertices of the unified graph, and whose eigenvalues are nonnegative. Its elements $K_{i,j}$ represent local similarities between corresponding graph nodes (i and j). The diffusion kernels are based on a local diffusion process on the unified graph. We first normalize the local similarity weights defined above by the degree matrix D, which is defined as follows:

$$d_{ii} = \sum_{j} w_{ij}, \qquad d_{ij} = 0 \ \text{if}\ i \neq j.$$

The normalized matrix

$$P := D^{-1} W$$

represents local transition probabilities between GO annotations. Its symmetric version is

$$\tilde{P} := D^{-1/2} W D^{-1/2}.$$

Following Coifman et al. (10), the diffusion kernel of "power" (or transition step) m is the matrix

$$K = \tilde{P}^{2m}.$$

A related diffusion kernel, suggested by Ham et al. (27), is formed by taking the pseudoinverse of the graph Laplacian, that is,

$$K = (D - W)^{+}.$$

A similar LLE (28) (local linear embedding) kernel is obtained by following Ham et al. (27): We denote by e the uniform column vector of size N and length 1, that is, its elements are $1/\sqrt{N}$. We then set

$$M = (I - ee^{T})(I - P)^{T}(I - P)(I - ee^{T}).$$

Finally, we denote by $\lambda_{\max}$ the largest eigenvalue of M and form the LLE kernel by the formula

$$K = \lambda_{\max} I - M.$$

Other forms of diffusion kernels (29, 30) are described in SI Text.

Distances Between Annotations and Their Relation to Manifold Embedding. Given a kernel K, we compute the distance d(x, y) between GO annotations x and y as follows:

$$d^{2}(x, y) = K(x, x) + K(y, y) - 2K(x, y). \qquad [1]$$

This formula has a straightforward interpretation. Any kernel K can be written in the form $K(x, y) = \langle F(x), F(y) \rangle$, where F embeds the graph vertices into a Euclidean space (usually referred to as feature space). Consequently, Eq. 1 can be written as:

$$d^{2}(x, y) = \| F(x) - F(y) \|^{2}.$$

The distance d(x, y) thus represents the Euclidean distance between the embedded annotations (in feature space).

Assuming that the graph approximates a low-dimensional manifold or another continuous geometric structure, we view the graph embedding, F, as an approximation to a corresponding manifold embedding. The embedding and its corresponding distance are determined by the choice of kernel, which reflects geometric properties of the underlying graph or manifold. Indeed, when applying the diffusion kernel of power m (10), the corresponding distances measure the rate of connectivity between vertices according to paths of length m. The distances obtained by the inverse Laplacian represent the expected time to travel from one vertex to another vertex and then back to the original vertex (27). The LLE distance is similar to a diffusion kernel with low powers. The corresponding LLE embedding tries to preserve local distances to nearest points along the graph (see SI Text).

In the SI Text, we discuss efficient numerical evaluation of the functional distances for different kernels and large N.
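As an illustration of the construction above, the following Python sketch builds the symmetric matrix P̃ = D^(-1/2) W D^(-1/2), raises it to the power 2m to obtain a diffusion kernel, and evaluates Eq. 1 on a toy graph of three annotations. It is not the Matlab code referenced in Materials and Methods: the function names, the toy weights, and the plain-list matrix product are assumptions made for the example, and every node is assumed to have at least one edge so that its degree is nonzero.

```python
import math

def symmetric_transition(W):
    """Return P~ = D^(-1/2) W D^(-1/2), where d_ii = sum_j w_ij."""
    deg = [sum(row) for row in W]  # assumes every degree is > 0
    n = len(W)
    return [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(A, B):
    """Plain-list matrix product, adequate for small illustrative graphs."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def diffusion_kernel(W, m):
    """Diffusion kernel of power m: K = P~^(2m)."""
    P = symmetric_transition(W)
    n = len(W)
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(2 * m):
        K = matmul(K, P)
    return K

def kernel_distance(K, x, y):
    """Eq. 1: d^2(x, y) = K(x, x) + K(y, y) - 2 K(x, y)."""
    d2 = K[x][x] + K[y][y] - 2.0 * K[x][y]
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding errors

# Hypothetical weights w_ij = 1/n_ij: edge 0-1 occurs in one domain
# subgraph, edge 1-2 occurs in two domain subgraphs.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
K = diffusion_kernel(W, m=7)
print(kernel_distance(K, 0, 1), kernel_distance(K, 1, 2), kernel_distance(K, 0, 2))
```

Because the kernel is an even power of a symmetric matrix, its eigenvalues are nonnegative, so the quantity under the square root is nonnegative up to rounding. For large N, repeated dense multiplication would be replaced by the more efficient evaluation discussed in SI Text.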

The geodesic distances were calculated using Dijkstra's algorithm on the global GO graph (with local distances $n_{ij}$).

Distances Between Subgraphs Representing Protein Domain Functions. We define the distance d(x, A) between a node x and the set of vertices A as follows:

$$d(x, A) = \min_{y \in A} d(x, y).$$

The distance between the two sets of vertices A and B is then computed using the formula:

$$d(A, B) = \frac{1}{2}\left(\sqrt{\frac{1}{|B|}\sum_{x_i \in B} d^{2}(x_i, A)} + \sqrt{\frac{1}{|A|}\sum_{x_i \in A} d^{2}(x_i, B)}\right).$$

Variants of this "distance" and their properties are discussed in refs. 31 and 32.

Phylogenetic Similarity Between Protein Domains Based on Phylogenetic Profiles (P Score). We evaluate the phylogenetic similarity between structures by BLASTing (1) the set of nonredundant sequences found to fold into each domain against all fully sequenced genomes. The similarity between any two domains is then just the empirical mutual information, MI, between their phylogenetic profiles (15). If x and y are two phylogenetic profiles, then

$$MI(x, y) = \sum_{i=0}^{1}\sum_{j=0}^{1} p_{ij}(x, y) \cdot \log\left[\frac{p_{ij}(x, y)}{p_{i}(x)\, p_{j}(y)}\right],$$

where $p_{ij}(x, y)$, i, j = 0, 1, describe the frequencies of occurrence of all four possible combinations of presence (i/j = 1) or absence (i/j = 0) in the same genome for the two domains, $p_{i}(x)$, i = 0, 1, are the frequencies of occurrence (i = 1) or absence (i = 0) in profile x and $p_{j}(y)$, j = 0, 1, are defined similarly. MI will be maximal if $p_{00}(x, y) = p_{11}(x, y) = 0.5$. That is, half of the terms of the two phylogenetic vectors are perfectly correlated (as a measure of nonorthologous gene displacement; ref. 33), whereas the terms in the other half are perfectly anticorrelated.

Curve Fitting (Fig. 1). All curve fitting was done using Origin 7 SR1 (www.originlab.com). Exponential decay was modeled using the equation

$$Y = y_{0} + A e^{-x/T}.$$

The values of T when correlating functional distance with structure and phylogenetic similarity are reported in Table 1. The correlation between sequence alignment and functional distance was modeled by the logistic function:

$$Y = A_{2} + \frac{A_{1} - A_{2}}{1 + e^{(x - x_{0})/\kappa}}.$$

Here, the slope reported in Table 1 is simply

$$S = \frac{A_{1} - A_{2}}{\kappa}.$$

All fitted functions had coefficients of determination in the range 0.89 < R² < 0.97.

We thank Mark Green and Institute for Pure and Applied Mathematics (University of California, Los Angeles) for inviting us to participate in a proteomics workshop, where we first met and started our discussion that led to this paper. G.L. thanks Ronald R. Coifman, Stephane Lafon, and Mauro Maggioni for introducing him to diffusion geometries and for forwarding him some of their papers and software. B.S. thanks Eugene Shakhnovich, Nick Grishin, Tim Reddy, and Joe Mellor for fruitful discussions and critical reading of the manuscript. G.L. is supported by National Science Foundation Grant 0612608.
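The set distance d(A, B) and the mutual information score defined in Materials and Methods can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the node distance d is passed in as any callable, profiles are assumed to be equal-length lists of 0/1 values, and the natural logarithm is an assumption because the text does not state the base.

```python
import math

def node_to_set(d, x, A):
    """d(x, A) = min over y in A of d(x, y)."""
    return min(d(x, y) for y in A)

def set_distance(d, A, B):
    """Average of the two root-mean-square node-to-set distances."""
    term_b = math.sqrt(sum(node_to_set(d, x, A) ** 2 for x in B) / len(B))
    term_a = math.sqrt(sum(node_to_set(d, x, B) ** 2 for x in A) / len(A))
    return 0.5 * (term_b + term_a)

def mutual_information(x, y):
    """Empirical MI between two binary presence/absence profiles."""
    n = len(x)
    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            p_ij = sum(1 for a, b in zip(x, y) if a == i and b == j) / n
            p_i = sum(1 for a in x if a == i) / n
            p_j = sum(1 for b in y if b == j) / n
            if p_ij > 0:  # terms with p_ij = 0 contribute nothing
                mi += p_ij * math.log(p_ij / (p_i * p_j))
    return mi

# Toy check with points on a line standing in for embedded annotations.
line = lambda a, b: abs(a - b)
print(set_distance(line, [0, 1], [1, 4]))
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
```

In practice d would be the kernel distance of Eq. 1 evaluated on the unified GO graph, and A and B would be the node sets of two domain subgraphs.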

1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Nucleic Acids Res 25:3389–3402.
2. Dietmann S, Holm L (2001) Nat Struct Biol 8:953–957.
3. Shindyalov IN, Bourne PE (2001) Nucleic Acids Res 29:228–229.
4. Sauder JM, Arthur JW, Dunbrack RL, Jr (2000) Proteins 40:6–22.
5. Gerstein M, Levitt M (1998) Protein Sci 7:445–456.
6. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. (2000) Nat Genet 25:25–29.
7. Berriz GF, King OD, Bryant B, Sander C, Roth FP (2003) Bioinformatics 19:2502–2504.
8. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M (2003) Science 302:449–453.
9. Lord PW, Stevens RD, Brass A, Goble CA (2003) Bioinformatics 19:1275–1283.
10. Coifman RR, Lafon S, Lee AB, Maggioni M, Nadler B, Warner F, Zucker SW (2005) Proc Natl Acad Sci USA 102:7432–7437.
11. Shakhnovich BE, Max Harvey J (2004) J Mol Biol 337:933–949.
12. Shakhnovich BE (2005) PLoS Comput Biol 1:e9.
13. Schölkopf B, Tsuda K, Vert J-P (2004) Kernel Methods in Computational Biology (MIT Press, Cambridge, MA).
14. Ben-Hur A, Noble WS (2005) Bioinformatics 21(Suppl 1):i38–i46.
15. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Proc Natl Acad Sci USA 96:4285–4288.
16. Dietmann S, Park J, Notredame C, Heger A, Lappe M, Holm L (2001) Nucleic Acids Res 29:55–57.
17. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG (2004) Nucleic Acids Res 32:D226–D229.
18. Tanaka Y, Nureki O, Kurumizaka H, Fukai S, Kawaguchi S, Ikuta M, Iwahara J, Okazaki T, Yokoyama S (2001) EMBO J 20:6612–6618.
19. Tucker-Kellogg L, Rould MA, Chambers KA, Ades SE, Sauer RT, Pabo CO (1997) Structure (London) 5:1047–1054.
20. Yang W, Steitz TA (1995) Cell 82:193–207.
21. Graham KS, Dervan PB (1990) J Biol Chem 265:16534–40.
22. Ponting CP, Russell RR (2002) Annu Rev Biophys Biomol Struct 31:45–71.
23. Mosyak L, Reshetnikova L, Goldgur Y, Delarue M, Safro MG (1995) Nat Struct Biol 2:537–547.
24. Sugiura I, Nureki O, Ugaji-Yoshikawa Y, Kuwabara S, Shimada A, Tateno M, Lorber B, Giege R, Moras D, Yokoyama S, Konno M (2000) Structure (London) 8:197–208.
25. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE (2004) Nucleic Acids Res 32:D189–D92.
26. Holm L, Sander C (1998) Bioinformatics 14:423–429.
27. Ham J, Lee DD, Mika S, Schölkopf B (2004) Proceedings of the Twenty-First International Conference on Machine Learning (AAAI Press, Menlo Park, CA), 47–54.
28. Roweis ST, Saul LK (2000) Science 290:2323–2326.
29. Kondor RI, Lafferty J (2002) Machine Learning: Proceedings of the Nineteenth International Conference (ICML) (Morgan Kaufmann, San Francisco), pp 315–322.
30. Belkin M, Niyogi P (2003) Neural Computation 15:1373–1396.
31. Memoli F, Sapiro G (2005) Found Comp Math 5:313–347.
32. Dubuisson MP, Jain AK (1994) Proceedings of the 12th IAPR (IEEE Comp Soc Press, Los Alamitos, CA) 566–568.
33. Koonin EV, Mushegian AR, Bork P (1996) Trends Genet 12:334–336.
