Finding New Order in Biological Functions from the Network Structure of Gene Annotations

Kimberly Glass 1,2,3∗ and Michelle Girvan 3,4,5
1 Biostatistics and Computational , Dana-Farber Institute, Boston, MA, USA
2 Department of , Harvard School of Public Health, Boston, MA, USA
3 Physics Department, University of Maryland, College Park, MD, USA
4 Institute for Physical Science and Technology, University of Maryland, College Park, MD, USA
5 Santa Fe Institute, Santa Fe, NM, USA

The Gene Ontology (GO) provides biologists with a controlled terminology that describes how genes are associated with functions and how functional terms are related to each other. These term-term relationships encode how biologists conceive the organization of biological functions, and they take the form of a directed acyclic graph (DAG). Here, we propose that the network structure of gene-term annotations made using GO can be employed to establish an alternate, natural way to group the functional terms that differs from the hierarchical structure established in the GO DAG. Instead of relying on an externally defined organization for biological functions, our method connects biological functions together if they are performed by the same genes, as indicated in a compendium of gene annotation data from numerous different experiments. We show that grouping terms by this alternate scheme is distinct from term relationships defined in the ontological structure and provides a new framework with which to describe and predict the functions of experimentally identified sets of genes. Tools and supplemental material related to this work can be found online at http://www.networks.umd.edu.

arXiv:1210.0024v2 [q-bio.QM] 3 May 2013

∗contact: [email protected]

1. INTRODUCTION

The Gene Ontology (GO) [4, 34] has been around for over a decade, during which time it has been widely utilized both to validate and to predict the results of biological experiments (see, for example, [11, 17, 19, 20, 25, 38, 39]). The structure of the ontology, where different “categories” or terms are related to each other in a hierarchical fashion, provides a well-established format with which to classify and subclassify all biological functions and processes. This classification approach is well-structured and well-characterized; however, we seek to determine if it is the only natural way in which to classify this type of biological information. We address two main questions. First, does there exist another natural way to organize the functional terms that is distinct from the ontological organization? Second, if such an alternate classification exists, can it be used to interpret biological data?

In recent years, complex networks tools have been used alongside traditional techniques to study many different kinds of biological networks [27], including, but not limited to, gene regulatory networks [24, 33], protein-protein interaction networks [18, 36], and metabolic networks [16, 40]. Developments in network theory provide the computational tools needed to calculate global properties of such networks, lending insights into the behavior of the systems represented by these networks. For example, many networks exhibit community structure, meaning that there are clusters of nodes in the network within which edges are relatively dense [13]. Within the field of complex networks, many recent research papers [21, 26, 28, 31] have focused on the development of methods to detect such modules in various types of networks [21, 26] in a computationally efficient and accurate manner [6]. In this study, we leverage the community structure in gene annotation networks to develop an endogenous organization of biological functions.

Ontologies are utilized across many disciplines including economics, artificial intelligence, engineering, library science, and biomedical informatics (for example, [5, 22, 32]). The Gene Ontology, specifically, describes the relationships between different biological concepts or functions [4]. It breaks these concepts into three main domains, or distinct ontologies: “Biological Process” (BP), describing sets of molecular events, “Molecular Function” (MF), describing the activities of gene products, and “Cellular Component” (CC), describing parts of a cell or its external environment. Each of the three primary domains in GO takes the form of a directed acyclic graph (DAG), in which “child” functional categories, or “terms”, are subclassified under one or more “parent” terms. Each parent and all its subsequent progeny therefore define multiple, overlapping sets of terms, or “branches” in GO. Using GO, genes are annotated to individual terms representing their particular role in a cell, and these annotations are transitive up the relationships in the DAG such that each “parent” term takes on all the gene annotations associated with any of its progeny [35].
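The transitivity of annotations up the DAG can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the procedure used by the GO consortium or in this work: the function names (`ancestors`, `propagate_annotations`), the dictionary-based inputs, and the toy term and gene names are all hypothetical.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate_annotations(direct, parents):
    """Give each term its own genes plus the genes of all of its progeny.

    direct:  dict mapping a term to the genes annotated directly to it.
    parents: dict mapping a term to its parent terms (a DAG, so a term
             may have more than one parent).
    """
    full = defaultdict(set)
    for term, genes in direct.items():
        # A gene annotated to `term` is also annotated to every ancestor.
        for target in {term} | ancestors(term, parents):
            full[target] |= set(genes)
    return dict(full)


# Hypothetical toy hierarchy: two children under "mid", which sits under "root".
parents = {"child_a": ["mid"], "child_b": ["mid", "root"], "mid": ["root"]}
direct = {"child_a": {"g1"}, "child_b": {"g2"}, "mid": {"g3"}}
full = propagate_annotations(direct, parents)
```

In this toy case the terms "mid" and "root" each end up with all three genes, while each child keeps only its own gene, which is why branches defined by a parent and its progeny overlap.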

In the past there has only been minimal investigation of how biological functions might be related to each other outside of the ontology structure, with the majority focusing on discovering individual links between functions [19] rather than investigating the structure as a whole. In this work, we propose an alternate classification of functional terms that relies on gene-term annotations rather than ontological relationships.

Our complex networks approach to organizing biological functions using annotations made to the Gene Ontology is outlined in Figure 1. We begin by considering term relationships defined by the GO hierarchy. We then add in gene-term annotation information collected from numerous different experiments and encapsulate these connections in the form of a bipartite network. Next, we use this bipartite network to construct another network describing the relationships between functional terms based on shared gene annotations. We apply community structure finding algorithms to partition this annotation-driven network into communities of terms and compare these communities to branches (ontological groupings of terms) from the GO hierarchy. We show that, although there are some similarities, there are also very strong differences between the two ways of organizing terms. Finally, we test the applicability of the community-derived classification, utilizing functional analysis techniques to evaluate the enrichment of cancer signatures (sets of genes associated with cancer) in both term communities and GO branches. We find that certain signatures are enriched primarily in our term communities and not GO branches. Therefore, we suggest that by linking functional terms based on shared genes, we can create an alternate, biologically meaningful, network-derived organization of terms that is both distinct from the GO DAG and can be used to investigate biological systems. Tools and supplemental material related to this work can be found online at http://www.networks.umd.edu.

[Figure 1 diagram: Gene Ontology Term Hierarchy (DAG), with Branch1, Branch2, Branch3 → incorporate annotations → Gene-Term Annotations (Bipartite Graph) → project network → Annotation-Driven Term Network → partition terms → Annotation-Driven Term Communities, with Comm1, Comm2, Comm3; the communities are used to compare structure against the branches and to evaluate enrichment in experimental gene sets.]

FIG. 1: Visual representation of our approach. First, we summarize gene annotations made to functional terms in the Gene Ontology hierarchy as a gene-term bipartite graph. From these gene-term relationships, we project a term-term network. We partition this network into communities and compare those term communities to branches of terms in the DAG. Finally, we perform functional enrichment analysis on experimentally-defined gene sets using both the term communities and GO branches.

2. METHODS

2.1. Characterizing Gene Ontology Annotations in a Bipartite Graph

In the following analysis we explore if there exists an alternate, natural way to classify terms that is distinct from the ontology structure. To begin, we use term-term ontology relationships and gene-term annotation information for human genes downloaded from the GO website (geneontology.org) to construct a gene-term bipartite network. We choose to represent this network in the form of an $n_G \times n_T$ adjacency matrix, where $n_G$ is the total number of genes and $n_T$ is the total number of terms listed in the annotation file. In this matrix a value of one indicates a known connection between the corresponding gene and term, and a value of zero indicates that the gene is not associated with that term. Thus,

$$B_{pi} = \begin{cases} 1 & \text{if gene } p \text{ is annotated to term } i \\ 0 & \text{if gene } p \text{ is not annotated to term } i \end{cases} \qquad (1)$$

This bipartite network represents a summary of the relationships between 18930 genes and 15033 functional terms, derived from many different types of biological evidence and contributed to by multiple laboratories [10]. We note that although GO is broken into three primary domains and gene annotations are made to the ontology for many species, for simplicity in the following analysis we will combine information from all three domains and use only annotation information that pertains to human genes. Domain-specific and comparative species analysis is provided in the Supplemental Material.

2.2. Constructing a Term Network from Gene Ontology Annotations

Next, we used gene-term annotations to construct a network representing term-term relationships. Using the bipartite network (Equation 1) one could create a term network by simply joining together any pair of terms that share common genes; however, the number of genes annotated to each term has a heavy-tailed distribution (see Supplemental Figure S1(a)). This approach would therefore lose a large amount of information, as connections between pairs of highly-annotated terms would be given the same weight as connections between pairs of terms with only a few annotations. We correct for the skewed term degree distribution by constructing a diagonal weighting matrix, $w$, and then projecting a term network $T$, whose edges are modified by this weighting matrix:

$$w_{ij} = \frac{\delta_{ij}}{\sum_{q=1}^{n_G} B_{qi}}, \qquad T = w' B' B w. \qquad (2)$$

Here primes denote matrix transposes. The values of $T_{ij}$ take a maximum value of one when terms $i$ and $j$ each only have the same single gene annotation and a minimum value of zero when none of the genes annotated to term $i$ are annotated to term $j$. The use of the weighting matrix emphasizes the weights of network edges between low degree terms. Since these terms represent biological functions performed by only a handful of genes, we believe this weighting is more likely to capture highly-specific shared biological information.

2.3. Identifying Communities of GO terms

Finally, we seek to identify the community structure in annotation-driven term-term relationships, or clusters of terms within which there are many or high-weight relationships in our projected network (Equation 2), but between which there are only few or low-weight relationships. In order to quantify the strength of community structure we use a quantity known as modularity [28]. Modularity ($Q$) can be defined as:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \left( 1 + \frac{r}{\langle k \rangle} \right) \frac{k_i k_j}{2m} \right] \delta(x_i, x_j) \qquad (3)$$

where $\delta$ is the Kronecker delta function, $x_i$ is the community of node $i$, $k_i$ is the degree of node $i$, $\langle k \rangle$ is the mean degree, $A$ is the adjacency matrix, a matrix with values representing the weight between nodes $i$ and $j$, and $m$ is the total weight of the edges in the network [3, 26]. Traditionally, in order to partition a network into communities, the resolution parameter, $r$ in Equation 3, is set equal to zero. Varying this value allows one to look for alternate divisions of a network into communities at different scales, or resolutions, with $r > 0$ uncovering sub-structures in the network [3].

We used a weighted version of the Fast Greedy Community Structure algorithm [6] to investigate the community structure of our term network, and found fifty-six communities at maximum modularity. We then implemented a modified version of the Fast Greedy algorithm that maximizes modularity for non-zero values of the resolution parameter in order to find many different viable partitions. We varied the resolution parameter over several orders of magnitude and found 11491 different communities (see Supplemental Table S1). We gave our communities numeric identities that vary from TC:0000001 to TC:0011491 and will refer to them as such in the following analysis. The different values of the resolution parameter were chosen to give community sizes that were roughly similar to those defined by the branches at different levels of the GO DAG (see Supplemental Figure S1(a)). Like GO branches, which represent overlapping sets of functional categories rather than one discrete partition of terms, communities found at different resolutions are highly overlapping and represent functional structure at many different levels of specificity.

3. RESULTS: AN ALTERNATE “NATURAL” GROUPING OF GO TERMS

3.1. Term Communities and GO Branches Represent Distinct Collections of Biological Functions

To better understand the relationships between the communities found at different resolutions, we visualized the term communities with ten or more members for the six lowest values of resolution used (Figure 2). In this visualization each community is represented by a single circle, whose radius scales as the log of the number of terms belonging to that community and whose color corresponds to the percentage of members from each primary domain that belong to that community. Between the communities found at adjacent resolutions, we draw a line from a community at a higher resolution to a community at a lower resolution if at least 10% of the members of the community from the higher resolution also belong to the community at the lower resolution. The thickness of the line is indicative of the overlap between the two communities. For more details on the visualization process see the Supplemental Material.

The structure of annotation-driven term relationships is distinct from the structure of those relationships as defined by GO branches. This is evidenced clearly by the fact that, although each GO branch can only belong to one primary ontology, and thus would be pure yellow, cyan or magenta in this type of visualization, communities, even smaller ones and those found at higher resolutions, generally contain members from multiple ontologies, resulting in a rainbow of colors. We also observe that communities at higher resolutions do not merely represent the “splitting apart” of communities at lower resolutions (represented by a child community only connecting to a single parent), but instead each resolution often brings about a new way of partitioning the network. An analogous visualization of GO branches reveals a similarly complex partitioning of terms in the GO DAG, albeit segregated by primary domain (see Supplemental Figure S2).

Next we directly compared the membership of the term communities with that of branches in the GO DAG. In order to quantify the similarity between each community and branch, we calculated the Jaccard similarity, which takes the value $J(x, y) = |x \cap y| / |x \cup y|$. Then, for each community ($x$), we determined the corresponding branch ($y$) that has the highest overlap in membership
by this measure: $J_m(x) = \max\{J(x, y) : y \in Y\}$, and vice versa. Because the exact value of the Jaccard similarity is highly sensitive to incremental changes in set membership when comparing sets with only a few members, we will limit all the following analysis to communities and branches that contain ten or more terms in order to focus on the most robust results. Figure 3(a) shows the distribution of $J_m$ comparing these 2370 communities and 2151 branches. Although a handful of communities and branches are quite similar to each other, the majority of communities are dissimilar to the GO branches and vice versa. We have repeated this analysis, constructing the term network and corresponding partitions three more times using annotations specific to each of the three primary ontologies, and observe similar results (see Supplemental Figure S1(b)-S1(d) and Supplemental Table S2).

FIG. 2: Visualization of communities (circles) of GO terms found at the six lowest levels of resolution (rows), in increasing order (top to bottom). The width of the line connecting two communities is proportional to the percentage of terms in the child community that are also in the parent community. The size of communities is proportional to the log of the number of terms in the community. Color represents the normalized percentage of terms in the community which belong to the BP (yellow), MF (cyan) and CC (magenta) primary domains.

To better interpret these values, we selected several communities to inspect more closely. First we selected a community with a very high $J_m$ value to inspect (Figure 3(b)). TC:0007391 is most similar to GO:0070570 (“regulation of neuron projection regeneration”) with $J_m = 0.6667$. It is interesting that in addition to members from the BP domain, TC:0007391 also includes two members from the MF and CC domains, “neurotrophin receptor activity” and “perineuronal net” respectively, the former of which is involved in the regeneration of injured axons [12] while the degradation of the latter has been shown to favor axon regeneration [30]. This indicates that terms found in the community but not the branch are consistent with known biology.

Next we selected TC:0000936, which is most similar ($J_m = 0.1667$) to GO:0060538 (Figure 3(c)). We note that the dissimilarity found between this community and branch cannot be attributed to community membership from multiple primary domains, as all of TC:0000936’s members belong to the “Biological Process” primary domain. Interestingly, the branch defined by GO:0060538 has members that belong to eleven distinct communities, demonstrating that not only are communities often distinct from branches, but within the branches themselves the annotation-driven classification is often very distinct from the defined ontological relationships. This pair is a representative example of the maximal shared information that is typically found between a community and branches. We therefore conclude that although there is occasional similarity between our found communities and GO branches, the communities are not simply a recapitulation of the DAG.

[Figure 3(a) plot: histogram with series “Branches” and “Communities”; x-axis “Maximum Jaccard (Jm)” from 0 to 1; y-axis “# of Branches/Communities” from 0 to 300.]

3.2. Capturing the Biological Information in Term Communities

We have illustrated that our term communities represent a natural partitioning of functional terms that is distinct from the GO DAG; however, the biological meaning of these communities is, at this point, unclear. On a mathematical level they represent sets of biological functions that are generally performed by the same collection of genes. Labeling and understanding the biological meaning behind these communities is vital if they are to have the same wide-ranging applications as the GO branches. Therefore, in order to easily interpret the contents of an individual community we choose to summarize the descriptions of its member terms in the form of a word cloud, coloring each word in the cloud based on the normalized percentage of times the terms it is derived from belong to each primary domain, and scaling the size of each word by the statistical enrichment of its frequency in that community (for details see Supplemental Material). We illustrate the biological content of two communities in Figure 4(a)-4(b).

The word clouds illustrate a richness of biological information in term communities. Although Community TC:0000400 (Figure 4(a)) contains 335 members drawn from all three primary domains, the word cloud presentation easily summarizes this information. The individual words are often contained in terms associated with multiple domains, resulting in a complex coloration, but reveal that this community includes biological concepts related to various types of RNA, including “rRNA”, “tRNA”, “mRNA”, “LSU-rRNA”, “SSU-rRNA”, “ncRNA”, “RNA-polymerase” and more. In contrast, TC:0000061 contains many words related to the heart such as “cardiac”, “muscle”, “ventricle”, “ventricular” and “heart” (Figure 4(b)). Neither community is very similar to any particular branch in GO, although they represent similar biological information. TC:0000400 is most similar ($J_m = 0.22146$) to GO:0016070 or “RNA metabolic process”, and TC:0000061 has the highest similarity ($J_m = 0.149$) with GO:0072358, or “cardiovascular system development.”
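The maximum Jaccard similarity values quoted for individual communities follow directly from the definition given in Section 3.1. The sketch below is a minimal illustration rather than the code used in this work; the function names (`jaccard`, `best_match`), the `min_size` argument (standing in for the ten-term cutoff), and the toy term sets are hypothetical.

```python
def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two collections of terms."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def best_match(group, candidates, min_size=10):
    """Return (name, J_m): the candidate set most similar to `group`.

    Candidates with fewer than `min_size` members are skipped, mirroring
    the restriction to sets with ten or more terms.
    """
    best_name, best_score = None, 0.0
    for name, members in candidates.items():
        if len(members) < min_size:
            continue
        score = jaccard(group, members)
        if best_name is None or score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical toy data; min_size is lowered so the small sets are kept.
community = {"t1", "t2", "t3", "t4"}
branches = {
    "branch_a": {"t1", "t2", "t3", "t4", "t5", "t6"},  # J = 4/6
    "branch_b": {"t1", "t7", "t8"},                     # J = 1/6
}
name, j_m = best_match(community, branches, min_size=1)
```

Applying the same function with the roles of communities and branches swapped gives the branch-side values, since the measure is computed in both directions.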

We point out that one can also represent the biological information contained in branches in the form of word clouds, although, because the members of each branch can only belong to one of the three primary domains, all the words in the cloud will be the same color. Two branches are illustrated in Figure 4(c)-4(d). The first, GO:0000003, or “reproduction”, clearly contains terms pertaining to sex-related processes as it contains words such as “female”, “sex”, “prostate” and “male.” Similarly, the cloud for GO:0002376, whose parent term name is “immune system process”, contains words pertaining to the immune system.

(a) Distribution of Jm for Branches and Communities
(b) TC:0007391, Jm = 0.6667 with GO:0070570
(c) TC:0000936, Jm = 0.1667 with GO:0060538

FIG. 3: A comparison of branches in the GO DAG and term communities found by partitioning the term network. (A) Distribution of Jm, the maximum similarity a community or branch with ten or more members has compared to all other branches or communities with ten or more members, respectively. Although a small number of communities and branches have similar memberships, most are highly dissimilar. (B)-(C) Two example comparisons between communities and branches: (B) TC:0007391 compared to GO:0070570, and (C) TC:0000936 compared to GO:0060538. In each panel, the left hand side shows a community and its inter-community connections in the annotation-driven term network, and the right hand side shows the branch with which that community has the highest Jaccard similarity. In the right panel edges represent the ontological associations defined by the Gene Ontology term hierarchy. Each term member of the community or branch is colored both by its associated primary domain (inner color - BP:yellow, MF:cyan, CC:magenta) and its community membership (outer color), determined at the same resolution value as the illustrated community. Terms common between each community and branch pair are circled.

3.3. Term Communities can be used to Evaluate and Predict Genetic Function

Finally, we wanted to test how our communities might be used in one common application of the Gene Ontology: functional enrichment analysis. To begin, we downloaded a collection of experimentally derived gene sets from the Gene Signatures Database (GeneSigDB) [8]. This database is a manual curation of previously published gene expression signatures, focusing primarily on cancer and stem cell signatures [7], and includes 509 human signatures that contain at least 100 and fewer than 1000 genes annotated in the Gene Ontology. To perform the functional enrichment analysis we use Annotation Enrichment Analysis [14] since it has been shown to better estimate the biological functions of experimentally derived sets of genes, and has a conceptual framework conducive to estimating functional enrichment between sets of terms and genes, rather than simply between two sets of genes (for more details see Supplemental Material).

In general, we observe that term communities contain slightly more statistically enriched associations with these experimental signatures than GO branches (which are widely used in functional enrichment analysis) and we verified that this level of enrichment is absent for randomly constructed communities (see Supplemental Figure S3). Knowing that our communities are statistically associated with experimental gene signatures, we next sought to know if there was a context in which our term communities captured biological information from these signatures that is missed by the branches, or vice versa.

Thus we selected gene signatures that were significantly enriched ($p < 10^{-6}$) in at least one community/branch but not significantly enriched ($p > 5 \times 10^{-5}$) in any GO branch/community, respectively. Figure 4(e) shows a heat map of the enrichment values for the nineteen signatures that met these criteria across any community or branch statistically enriched in at least one of those signatures.

It is immediately striking that of these signatures, the majority are enriched in communities and not GO branches. Two signatures, in particular, are enriched in a collection of communities. The first, an embryonic stem cell signature [37], represents genes that are up-regulated in cardiomyocytes compared to non-selected embryoid bodies and hESC. The communities represented in this signature contain several different themes, all consistent with the expected properties of genes selected from stem cells and related to the heart. The corresponding clouds emphasize words such as “cardiac” and “muscle” (TC:0000012), “actin”, “myosin”, and “filament” (TC:0000249), “morphogenesis” and “development” (TC:0000365), “blood”, “pressure” and “contraction” (TC:0000582), with the other clouds generally containing these words in different combinations (see for example TC:0000061, illustrated in Figure 4(b)). The second signature is a list of bladder cancer specific genes [29]. Most of the words emphasized by the community clouds are related to cell proliferation. For example, it is enriched in TC:0000400 (illustrated in Figure 4(a)) and the other clouds emphasize words such as “cell-cycle”, “mitotic”, “meiotic”, “checkpoint”, “repair”, “replication”, “recombination”, “telomere”, “spindle”, “complex”, “DNA”, “chromosome”, “histone”, and “methylation”. Although the connection to the bladder is not obvious, the connection to cancer and the high rate of cell proliferation in tumor cells [2] is apparent.

(a) TC:0000400 Word Cloud
(b) TC:0000061 Word Cloud
(c) GO:0000003 Word Cloud
(d) GO:0002376 Word Cloud
(e) Annotation Enrichment Analysis

[Figure 4(e) heat map: color scale is −log10 p-value, from 0 to 6.
Rows (signatures): Ovarian (Ouellet ’05), Bone (Heller ’08), Sarcoma (Lee ’04), Leukemia (Walker ’06), StemCell (Osada ’05), HeadandNeck (Bredel ’06), Sarcoma (Missiaglia ’10), Viral (Urosevic ’07), EmbryonicStemCell (Xu ’09), Bladder (Osman ’06), Sarcoma (Filion ’09), Breast (Creighton ’08), Intestine (Vecchi ’07), Breast (Mazurek ’05), HeadandNeck (Roepman ’06), Lung (Hou ’10), Breast (Mira ’09), Colon (Grade ’06), Leukemia (Calln ’08).
Columns (term communities): TC:0000012, TC:0000061, TC:0000249, TC:0000365, TC:0000410, TC:0000488, TC:0000582, TC:0000611, TC:0000836, TC:0000881, TC:0001501, TC:0002025, TC:0000000, TC:0000062, TC:0000116, TC:0000122, TC:0000152, TC:0000247, TC:0000256, TC:0000400, TC:0000960, TC:0000004, TC:0000059, TC:0000114, TC:0000163, TC:0000063, TC:0000118, TC:0000179, TC:0000261, TC:0000057, TC:0000258, TC:0000369, TC:0000869, TC:0000165.
Columns (GO branches): GO:0006139, GO:0009987, GO:0005622, GO:0044424, GO:0032501, GO:0042221, GO:0002376, GO:0050896.]

FIG. 4: (A-D) Term Communities (TC:0000400, TC:0000061) and branches (GO:0000003, GO:0002376) summarized as word clouds. In each case the color of a word represents how often the term description containing that word belongs to each of the primary domains (BP:yellow, MF:cyan, CC:magenta, also see Figure 2 for mixed-domain coloration) and size represents that word’s statistical enrichment in that community/branch. (E) A heat map showing the statistical enrichment of selected cancer signatures (see text) in GO branches and term communities.

4. CONCLUSION

The network structure of gene annotations made to the Gene Ontology has not previously been exploited in a manner that reveals an organization of biological function distinct from the published hierarchical classification of the GO DAG. By analyzing functional annotation data we were able to construct an alternate, natural, and biologically-relevant way in which to categorize cellular functions. This categorization is structurally and conceptually distinct from the GO DAG and allows us to uncover multiple, strong connections between terms that do not share a parent/child relationship. It takes advantage of a large amount of data from a variety of sources and creates a classification scheme that is motivated primarily by the data reported rather than the organization of human conceptions.

The term communities defined in this work represent an integration of information across all three primary domains in GO that, to the authors’ knowledge, has not previously been investigated systematically. Using the simple principle of co-annotation we suggest that in the future biological concepts from other classification databases could also be analyzed or even combined with these results. We concede that the communities defined here likely do not represent the only way to group functional terms outside of the ontology structure. Even so, we believe that our functional enrichment analysis demonstrates that these term communities, in particular, are more than a mathematical phenomenon and have a high potential to be used to better interpret biological data.
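For readers who wish to apply the co-annotation principle to another classification database, the two core computations of Section 2 can be sketched in a few lines. This is a minimal sketch under explicit assumptions, not the authors' implementation: it assumes the weighting matrix of Equation 2 has diagonal entries equal to the reciprocal of each term's annotation count, and that Equation 3 reads Q = (1/2m) Σ_ij [A_ij − (1 + r/⟨k⟩) k_i k_j / 2m] δ(x_i, x_j) with ⟨k⟩ the mean degree. The function names, the toy matrix, and the choice to drop self-links before scoring are all hypothetical.

```python
import numpy as np


def project_terms(B):
    """Weighted term-term projection T = w' B' B w of a gene-by-term matrix B.

    Assumes w is diagonal with w_ii = 1 / (number of genes annotated to term i).
    """
    B = np.asarray(B, dtype=float)
    term_degree = B.sum(axis=0)  # genes annotated to each term
    if np.any(term_degree == 0):
        raise ValueError("every term needs at least one annotated gene")
    w = np.diag(1.0 / term_degree)
    return w.T @ B.T @ B @ w


def modularity(A, labels, r=0.0):
    """Modularity of a partition with resolution parameter r (r = 0 is standard)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)            # weighted degree of each node
    two_m = k.sum()              # twice the total edge weight
    same = labels[:, None] == labels[None, :]   # Kronecker delta on communities
    null = (1.0 + r / k.mean()) * np.outer(k, k) / two_m
    return float(((A - null) * same).sum() / two_m)


# Hypothetical toy data: 3 genes (rows) by 4 terms (columns).
B = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
T = project_terms(B)
A = T - np.diag(np.diag(T))      # assumption: ignore self-links when scoring
q = modularity(A, [0, 0, 1, 1])
```

In the toy example, the two terms that share a single gene receive an edge weight of one, while the two terms that share the same two genes receive a weight of one half, which illustrates how the weighting emphasizes edges between low degree terms. Finding the partition that maximizes this score is a separate step, handled in this work by the Fast Greedy algorithm [6].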

Acknowledgements

We wish to thank Geet Duggal for supplying an implementation of the Fast Greedy Community Structure algorithm that included the resolution parameter.

[1] http://www.softpedia.com/progdownload/ibm-word-cloud-generator-download-160283.html. URL http://www.softpedia.com/progDownload/IBM-Word-Cloud-Generator-Download-160283.html.
[2] M Andreeff, Goodrich DW, and Pardee AB. Holland-Frei Cancer Medicine. BC Decker, Hamilton, ON, 5th edition, 2000.
[3] Alex Arenas, Alberto Fernandez, and Sergio Gomez. Analysis of the structure of complex networks at different resolution levels. e-pub: http://arxiv.org/abs/physics/0703218, January 2008. doi: 10.1088/1367-2630/10/5/053039. URL http://dx.doi.org/10.1088/1367-2630/10/5/053039.
[4] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nature genetics, 25(1):25–29, May 2000. ISSN 1061-4036. doi: 10.1038/75556. URL http://dx.doi.org/10.1038/75556.
[5] B. Chandrasekaran, John R. Josephson, and V. Richard Benjamins. What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1):20–26, January 1999. ISSN 1541-1672. doi: 10.1109/5254.747902. URL http://dx.doi.org/10.1109/5254.747902.
[6] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111+, December 2004. doi: 10.1103/PhysRevE.70.066111. URL http://dx.doi.org/10.1103/PhysRevE.70.066111.
[7] Aedín C. Culhane, Thomas Schwarzl, Razvan Sultana, Kermshlise C. Picard, Shaita C. Picard, Tim H. Lu, Katherine R. Franklin, Simon J. French, Gerald Papenhausen, Mick Correll, and John Quackenbush. GeneSigDB–a curated database of gene expression signatures. Nucleic acids research, 38(Database issue):D716–D725, January 2010. ISSN 1362-4962. doi: 10.1093/nar/gkp1015. URL http://dx.doi.org/10.1093/nar/gkp1015.
[8] Aedín C. Culhane, Markus S. Schröder, Razvan Sultana, Shaita C. Picard, Enzo N. Martinelli, Caroline Kelly, Benjamin Haibe-Kains, Misha Kapushesky, Anne-Alyssa St Pierre, William Flahive, Kermshlise C. Picard, Daniel Gusenleitner, Gerald Papenhausen, Niall O’Connor, Mick Correll, and John Quackenbush. GeneSigDB: a manually curated database and resource for analysis of gene expression signatures. Nucleic Acids Research, 40(D1):D1060–D1066, January 2012. ISSN 1362-4962. doi: 10.1093/nar/gkr901. URL http://dx.doi.org/10.1093/nar/gkr901.
[9] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, September 2005. ISSN 1742-5468. doi: 10.1088/1742-5468/2005/09/P09008. URL http://dx.doi.org/10.1088/1742-5468/2005/09/P09008.
[10] Emily C. Dimmer, Rachael P. Huntley, Yasmin Alam-Faruque, Tony Sawford, Claire O’Donovan, Maria J. Martin, Benoit Bely, Paul Browne, Wei Mun Chan, Ruth Eberhardt, Michael Gardner, Kati Laiho, Duncan Legge, Michele Magrane, Klemens Pichler, Diego Poggioli, Harminder Sehra, Andrea Auchincloss, Kristian Axelsen, Marie-Claude C. Blatter, Emmanuel Boutet, Silvia Braconi-Quintaje, Lionel Breuza, Alan Bridge, Elizabeth Coudert, Anne Estreicher, Livia Famiglietti, Serenella Ferro-Rojas, Marc Feuermann, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Janet James, Silvia Jimenez, Florence Jungo, Guillaume Keller, Phillippe Lemercier, Damien Lieberherr, Patrick Masson, Madelaine Moinat, Ivo Pedruzzi, Sylvain Poux, Catherine Rivoire, Bernd Roechert, Michael Schneider, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Lydie Bougueleret, Ghislaine Argoud-Puy, Isabelle Cusin, Paula Duek-Roggli, Ioannis Xenarios, and . The UniProt-GO annotation database in 2011. Nucleic acids research, 40(Database issue):D565–D570, January 2012. ISSN 1362-4962. doi: 10.1093/nar/gkr1048. URL http://dx.doi.org/10.1093/nar/gkr1048.
[11] Lude Franke, Harm van Bakel, Like Fokkens, Edwin D. de Jong, Michael Egmont-Petersen, and Cisca Wijmenga. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. American journal of human genetics, 78(6):1011–1025, June 2006. ISSN 0002-9297. doi: 10.1086/504300. URL http://dx.doi.org/10.1086/504300.
[12] H. Funakoshi, J. Frisén, G. Barbany, T. Timmusk, O. Zachrisson, V. M. Verge, and H. Persson. Differential expression of mRNAs for neurotrophins and their receptors after axotomy of the sciatic nerve. The Journal of cell biology, 123(2):455–465, October 1993. ISSN 0021-9525. URL http://view.ncbi.nlm.nih.gov/pubmed/8408225.
[13] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, June 2002. ISSN 0027-8424. doi: 10.1073/pnas.122653799. URL http://dx.doi.org/10.1073/pnas.122653799.
[14] Kimberly Glass and Michelle Girvan. Annotation enrichment analysis: An alternative method for evaluating the functional properties of gene sets. e-pub: http://arxiv.org/abs/1208.4127, August 2012. URL http://arxiv.org/abs/1208.4127.
[15] Kimberly Glass, Edward Ott, Wolfgang Losert, and Michelle Girvan. Implications of functional similarity for gene regulatory interactions. Journal of the Royal Society, Interface / the Royal Society, February 2012. ISSN 1742-5662. doi: 10.1098/rsif.2011.0585. URL http://dx.doi.org/10.1098/rsif.2011.0585.

[16] Roger Guimera and Luis A. Nunes Amaral. Functional cartography of complex metabolic networks. Nature, 433(7028):895–900, February 2005. ISSN 0028-0836. doi: 10.1038/nature03288. URL http://dx.doi.org/10.1038/nature03288.

[17] Da W. Huang, Brad T. Sherman, Qina Tan, Joseph Kir, David Liu, David Bryant, Yongjian Guo, Robert Stephens, Michael W. Baseler, H. Clifford Lane, and Richard A. Lempicki. David bioinformatics resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucl. Acids Res., 35(Web Server issue):gkm415+, June 2007. ISSN 1362-4962. doi: 10.1093/nar/gkm415. URL http://dx.doi.org/10.1093/nar/gkm415.

[18] H. Jeong, S. P. Mason, A. L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, May 2001. ISSN 0028-0836. doi: 10.1038/35075138. URL http://dx.doi.org/10.1038/35075138.

[19] O. D. King, R. E. Foulger, S. S. Dwight, J. V. White, and F. P. Roth. Predicting gene function from patterns of annotation. research, 13(5):896–904, May 2003. ISSN 1088-9051. doi: 10.1101/gr.440803. URL http://dx.doi.org/10.1101/gr.440803.

[20] Insuk Lee, Shailesh V. Date, Alex T. Adai, and Edward M. Marcotte. A probabilistic functional network of yeast genes. Science, 306(5701):1555–1558, November 2004. ISSN 1095-9203. doi: 10.1126/science.1099511. URL http://dx.doi.org/10.1126/science.1099511.

[21] E. A. Leicht and M. E. J. Newman. Community structure in directed networks. Physical Review Letters, 100:118703+, March 2008. doi: 10.1103/PhysRevLett.100.118703. URL http://dx.doi.org/10.1103/PhysRevLett.100.118703.

[22] Usakali Mäki, editor. The Economic World View: Studies in the Ontology of Economics. The Press Syndicate of the University of Cambridge, Cambridge, UK, 2001.

[23] Marina Meila. Comparing clusterings by the variation of information. Learning Theory and Kernel Machines, pages 173–187, 2003. URL http://www.springerlink.com/content/rdda4r5bm9ajlbw3.

[24] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824–827, October 2002. ISSN 1095-9203. doi: 10.1126/science.298.5594.824. URL http://dx.doi.org/10.1126/science.298.5594.824.

[25] Sara Mostafavi and Quaid Morris. Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics, 26(14):1759–1765, July 2010. ISSN 1367-4811. doi: 10.1093/bioinformatics/btq262. URL http://dx.doi.org/10.1093/bioinformatics/btq262.

[26] M. E. Newman. Analysis of weighted networks. Phys Rev E Stat Nonlin Soft Matter Phys, 70(5 Pt 2), November 2004. ISSN 1539-3755. URL http://view.ncbi.nlm.nih.gov/pubmed/15600716.

[27] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003. URL http://scitation.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=SIREAD000045000002000167000001&idtype=cvips&gifs=yes.

[28] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113+, February 2004. doi: 10.1103/physreve.69.026113. URL http://dx.doi.org/10.1103/physreve.69.026113.

[29] Iman Osman, Dean F. Bajorin, Tung-Tien T. Sun, Hong Zhong, Diah Douglas, Joseph Scattergood, Run Zheng, Mark Han, K. Wayne Marshall, and Choong-Chin C. Liew. Novel blood biomarkers of human urinary bladder cancer. Clinical cancer research : an official journal of the American Association for Cancer Research, 12(11 Pt 1):3374–3380, June 2006. ISSN 1078-0432. doi: 10.1158/1078-0432.CCR-05-2081. URL http://dx.doi.org/10.1158/1078-0432.CCR-05-2081.

[30] Erika Pastrana, Maria Teresa T. Moreno-Flores, Esteban N. Gurzov, Jesus Avila, Francisco Wandosell, and Javier Diaz-Nido. Genes associated with adult axon regeneration promoted by olfactory ensheathing cells: a new role for matrix metalloproteinase 2. The Journal of neuroscience : the official journal of the Society for Neuroscience, 26(20):5347–5359, May 2006. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.1111-06.2006. URL http://dx.doi.org/10.1523/JNEUROSCI.1111-06.2006.

[31] Mason A. Porter, Jukka-Pekka Onnela, and Peter J. Mucha. Communities in networks. e-pub: http://arxiv.org/abs/0902.3788, September 2009. URL http://arxiv.org/abs/0902.3788.

[32] Simon B. Shum, Enrico Motta, and John Domingue. ScholOnto: an Ontology-Based digital library server for research documents and discourse. International Journal on Digital Libraries, 3:237–248, 2000. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.3835.

[33] Ricard V. Solé, Ramon F. Cancho, Jose M. Montoya, and Sergi Valverde. Selection, tinkering, and emergence in complex networks. Complex., 8(1):20–33, September 2002. ISSN 1076-2787. URL http://portal.acm.org/citation.cfm?id=770715.770720.

[34] Robert Stevens, Carole A. Goble, and Sean Bechhofer. Ontology-based knowledge representation for bioinformatics. Brief Bioinform, 1(4):398–414, January 2000. doi: 10.1093/bib/1.4.398. URL http://dx.doi.org/10.1093/bib/1.4.398.

[35] The gene ontology consortium. Creating the gene ontology resource: design and implementation. Genome Res., 11(8):1425–1433, August 2001. ISSN 1088-9051. doi: 10.1101/gr.180801. URL http://dx.doi.org/10.1101/gr.180801.

[36] Andreas Wagner. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol, 18(7):1283–1292, July 2001. ISSN 0737-4038. URL http://mbe.oxfordjournals.org/cgi/content/abstract/18/7/1283.

[37] Xiu Qin Q. Xu, Set Yen Y. Soo, William Sun, and Robert Zweigerdt. Global expression profile of highly enriched cardiomyocytes derived from human embryonic stem cells. Stem cells (Dayton, Ohio), 27(9):2163–2174, September 2009. ISSN 1549-4918. doi: 10.1002/stem.166. URL http://dx.doi.org/10.1002/stem.166.

[38] Xuerui Yang, Yang Zhou, Rong Jin, and Christina Chan. Reconstruct modular phenotype-specific gene networks by knowledge-driven matrix factorization. Bioinformatics, 25(17):2236–2243, September 2009. ISSN 1367-4811. doi: 10.1093/bioinformatics/btp376. URL http://dx.doi.org/10.1093/bioinformatics/btp376.

[39] Ahrim Youn, David J. Reiss, and Werner Stuetzle. Learning transcriptional networks from the integration of ChIP-chip and expression data in a non-parametric model. Bioinformatics, 26(15):1879–1886, August 2010. doi: 10.1093/bioinformatics/btq289. URL http://dx.doi.org/10.1093/bioinformatics/btq289.

[40] Jing Zhao, Hong Yu, Jianhua Luo, Z. Cao, and Yixue Li. Complex networks theory for analyzing metabolic networks. Chinese Science Bulletin, 51(13):1529–1537, July 2006. ISSN 1001-6538. doi: 10.1007/s11434-006-2015-2. URL http://dx.doi.org/10.1007/s11434-006-2015-2.

Supplemental Material

In this section we provide additional materials meant to complement and expand upon the analysis in “Finding New Order in Biological Functions from the Network Structure of Gene Annotations.” In the first section, “Communities Found Across Multiple Resolutions”, we provide more detail about how we used the resolution parameter to generate multiple, overlapping, yet unique communities of terms. In the second section, “Domain-specific Analysis”, we compare communities of terms to branches within each primary domain of GO. Next, in “Visualizing GO Branches”, we illustrate relationships between GO branches at different “levels” of the GO DAG. In the section entitled “Functional Enrichment Analysis”, we briefly explain the methodology used to evaluate the statistical enrichment of gene sets in the term communities and GO branches, and provide analysis showing that this enrichment is absent for randomly generated term communities or randomly generated sets of genes. Finally, in “Comparative Species Analysis”, we construct and analyze annotation-driven term networks using gene annotations for sixteen additional organisms and compare those networks with each other as well as with the human results presented in the main text.

For information regarding the methodology used to illustrate the term communities or generate the word clouds in the main text, see “Supplemental Methods” below.

4.1. Communities Found Across Multiple Resolutions

In order to find additional viable partitions of the annotation-driven term network, representing different resolutions, we varied the weighting parameter (see r in Equation 3, and “Methods” in the main text). In addition to the fundamental partition (r = 0) we chose values of r in geometrically increasing steps ranging from $r = 2^{-2} = 0.25$ to $r = 2^{10} = 1024$ such that the final distribution of community sizes would resemble that of the branches. This procedure resulted in fourteen different partitions of the terms in the annotation-driven network such that, initially, each term can be assigned to exactly fourteen communities, one at each resolution. We emphasize that while communities at any given resolution do not overlap, communities at different resolutions can be highly overlapping.

We point out that it is possible for the membership of a community found at one resolution to be identical to the membership of a community found at another resolution. In order to eliminate completely redundant community information from our found communities, at each resolution, we determined if any of the communities found at that resolution were identical in membership to a community found at a lower resolution. If so, we “collapsed” the two community assignments into that of the community from the lower resolution. For example, of the 80 communities found at a resolution value of r = 0.25, 29 had an identical membership to one of the 56 communities found at r = 0. We removed those 29 “redundant” communities from the r = 0.25 partitioning, retaining only their r = 0 assignment, and record that at r = 0.25 only 51 additional communities are found. Similarly, of the 89 communities found at a resolution of r = 0.5, 45 of them are identical in membership to one of the 107 unique communities found at r = 0 or r = 0.25. These “redundant” communities were removed and we record only 44 additional communities found at r = 0.5. This procedure was repeated for the remaining resolutions, resulting in 11491 “unique” communities (see Table S1).

Value of r      Number of       Number of New
(Resolution)    Communities     Communities
0               56              56
0.25            80              51
0.5             89              44
1               116             67
2               146             95
4               193             148
8               320             257
16              576             422
32              1007            703
64              1557            965
128             2409            1439
256             3585            1965
512             4983            2375
1024            6730            2904

TABLE S1: The number of communities found at each value of resolution used. As the resolution is increased, the number of communities found at that resolution increases as well.

The cumulative distribution of the number of members in these communities is shown in Figure S1(a). For comparison, on the same graph we also plot the cumulative distribution of the number of annotations made to GO branches. These distributions are heavy-tailed and demonstrate that the number and sizes of branches and communities are similar. The heavy-tailed distribution for branches is a result of the hierarchical DAG structure, as the members of each branch are also members of their parent branch(es) (see [14, 15]).

4.2. Domain-specific Analysis

The Gene Ontology is broken into three fully independent primary domains, each of which takes the form of a directed acyclic graph [35]. In the main text we combined information from all three primary domains to construct the annotation-driven term network. Because of this, the dissimilarity between the GO branches and the
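The redundancy-removal procedure of Section 4.1 can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the function name `unique_communities` and the input layout (a list of `(r, communities)` pairs sorted by increasing r) are assumptions made here. The constant `TABLE_S1_NEW` simply copies the "Number of New Communities" column of Table S1.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# "Number of New Communities" column of Table S1, r = 0 through r = 1024.
TABLE_S1_NEW = [56, 51, 44, 67, 95, 148, 257, 422, 703, 965, 1439, 1965, 2375, 2904]


def unique_communities(
    partitions: Iterable[Tuple[float, List[Set[str]]]],
) -> Tuple[List[Tuple[float, FrozenSet[str]]], Dict[float, int]]:
    """Keep each distinct membership once, credited to the lowest resolution.

    `partitions` is assumed (hypothetical layout) to be sorted by increasing r.
    Returns the list of (r, members) for unique communities and, per r, the
    number of communities that were not already seen at a lower resolution.
    """
    seen: Set[FrozenSet[str]] = set()
    unique: List[Tuple[float, FrozenSet[str]]] = []
    new_counts: Dict[float, int] = {}
    for r, communities in partitions:
        new_here = 0
        for members in communities:
            key = frozenset(members)  # membership only; order is irrelevant
            if key not in seen:
                seen.add(key)
                unique.append((r, key))
                new_here += 1
        new_counts[r] = new_here
    return unique, new_counts


# Toy example with four terms and three resolutions (made-up data).
toy = [
    (0.0, [{"a", "b"}, {"c", "d"}]),
    (0.25, [{"a", "b"}, {"c"}, {"d"}]),
    (0.5, [{"a"}, {"b"}, {"c"}, {"d"}]),
]
toy_unique, toy_new = unique_communities(toy)
```

In the toy example the community {a, b} found at r = 0.25 is collapsed into its r = 0 assignment, mirroring the treatment of the 29 redundant communities described above. The Table S1 column sums to the 11491 unique communities quoted in the text.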

[Figure S1. Panels: (a) Distribution of Branch and Community Sizes (x-axis: Number of Members (N); y-axis: # of Branches/Communities (Members ≥ N)); (b) Biological Process Terms; (c) Molecular Function Terms; (d) Cellular Component Terms. Panels (b)-(d): x-axis: Maximum Jaccard; y-axis: Number of Branches/Communities. Legend: Branches, Communities.]

FIG. S1: (A) The cumulative distribution for the sizes of all branches in the Gene Ontology and all unique term communities found at the various resolutions. (B-D) Distribution of $J_m$, the maximum similarity a domain-specific community or branch with ten or more members has compared to all other domain-specific branches or communities with ten or more members, respectively. Although a small number of communities and branches have similar memberships, most are very dissimilar.

term communities we find by partitioning this network may be at least partially attributable to the fact that members of a particular community can belong to any of the primary domains whereas members of a particular GO branch must all belong to the same primary domain based on the construction of the hierarchy. To address the extent of this issue, we constructed three “domain-specific” term networks by using only terms specific to a particular primary domain and the gene annotations made to those specific terms. We partitioned each of these networks using the same resolution values as were used previously and, as with the partitions using all annotations, we retained only the “unique” set of communities for the following analysis (see Section 4.1). Table S2 shows the number of communities found and the number of branches defined in GO for each of the primary domains.

             Number of    Number of
             Branches     Communities
All Terms    15033        11491
BP Terms     10192        9043
MF Terms     3634         5423
CC Terms     1207         2107

TABLE S2: The number of branches defined within each primary domain as well as the number of communities found by varying the resolution parameter and partitioning a term network derived from gene annotations made to terms in this primary domain. The large size of the “Biological Process” primary domain compared to the others is evident.

Next we compared the membership of these communities and branches using the Jaccard similarity (see “Term Communities and GO Branches Represent Distinct Collections of Biological Functions” in the main text). We did the comparison three times, each time confining the comparison to those branches defined by a particular primary domain and the term communities derived from the network constructed using those terms and their corresponding gene annotations. The distribution of $J_m$ (Figure S1), the maximum similarity a community has to any branch, or vice versa, for each of the domains, is not strikingly different from that presented when using information collected from all three domains (see Figure 3(a) in the main text). There might be slightly more similarity when directly comparing communities and branches derived from either the “Molecular Function” or “Cellular Component” primary domains, but we remind the reader that the majority (approximately two-thirds) of the terms in GO belong to the “Biological Process” primary domain, and this distribution (Figure S1(b)) is practically indistinguishable from the one presented in Figure 3(a) of the main text.

4.3. Visualizing GO Branches

To better understand the relationships between the branches in the Gene Ontology, we visualized branches in a manner similar to the way we visualized the relationships between term communities found at different resolutions (see Figure 2 in the main text). First, we determined a “level” for each GO branch in order to segregate the branches in a manner similar to the resolution parameter. To begin, the head nodes of the three primary domains (“Biological Process”, “Molecular Function”, and “Cellular Component”) were assigned to level one. Next, we determined branches for which the head-node is a term that has a parent-child relationship with only one of these three level-one terms, and assigned those terms a level of two. Continuing, head-nodes that have parent-child relationships with only level-one or level-two terms were given a level assignment of three, and so on. This assignment was repeated until all head-nodes were assigned a “level”. The level of each branch was then defined as the level of its head-node.

We visualized branches with one hundred or more term-members at the six lowest levels (Figure S2), lining up branches with the same level assignment horizontally. Each branch is represented by a single circle, whose radius scales as the log of the number of terms belonging to that branch and whose color represents whether the terms in the branch belong to the BP (yellow), MF (cyan) or CC (magenta) primary domain. Between the branches found at adjacent levels, we draw a line from a branch at a higher level to a branch at a lower level if at least 10% of the members of the branch from the higher level also belong to the branch at the lower level. The thickness of the line is indicative of the overlap in membership between the two branches.

FIG. S2: Visualization of branches of GO terms. Each branch is represented as a circle whose radius scales as the log of the number of terms in the branch. The width of the line connecting two branches is proportional to the percentage of terms in the child branch that are also in the parent branch. Color represents whether the terms in the branch belong to the BP (yellow), MF (cyan) or CC (magenta) primary domain.

Unsurprisingly, we observe three distinct sets of interconnected branches corresponding to branches belonging to each of the three primary domains. As in the visual representation of the term communities across different resolutions (see Figure 2 in the main text), we observe a lot of “cross-talk” between branches at adjacent levels, whereby a branch at a given level is very likely to contain members from multiple branches at a lower level. We note that in this representation, an individual term can be a member of multiple branches at the same “level”. This is in contrast to segregating communities by resolution, in which case each term only appears once on a given resolution-row. As a consequence, the inter-level connections between branches are somewhat structurally different from connections between communities found at adjacent resolutions. Namely, branches that share a term will necessarily also share a set of parent branches. Redundancy of the same term member(s) across multiple branches at a given level is visually evident among “Biological Process” (yellow) branches, where there exist groups of branches at each level that connect primarily to the same set of branches at a lower level.

4.4. Functional Enrichment Analysis

We also wanted to test how our communities might be used in functional enrichment analysis, a very common application of the Gene Ontology. Traditionally, each branch of GO is collapsed to its head node and all the genes annotated to that head node are grouped into one “set.” (Note that this set is the same as the genes annotated to the entire branch because of the propagation of gene annotation assignment; see the “Introduction” in the main text.) Recently there has been evidence that this approach over-simplifies the complex structure of the Gene Ontology and has the potential to misrepresent the enrichment of gene sets in branches [14]. Therefore, we chose to use Annotation Enrichment Analysis (AEA) to evaluate the functional enrichment of experimentally-derived gene sets in both the GO branches and our term communities. AEA allows the user to specify a collection of terms and a collection of genes and uses a randomization protocol to evaluate the probability that these two sets are more connected than expected by chance.

We ran AEA using one million randomizations, defining collections of genes using a public database of Cancer Signatures [8] and using collections of terms defined by (1) membership in the term communities; or (2) membership in GO branches. We plot the number of pairs of term and gene “collections” that are enriched beneath various significance thresholds (Figure S3(a)) and observe clear statistical enrichment of experimentally-derived gene signatures in both term communities and GO branches, with several thousand community-signature and branch-signature pairs enriched at a p-value significance less than $10^{-6}$.

Next, we wished to verify that this enrichment was not an artifact of the community construction; namely, we wanted to verify that experimental gene signatures are not enriched in random collections of terms and that term communities are not enriched in random collections of genes. Therefore we constructed “random” term communities by taking the community assignments of terms and swapping term labels. Similarly, we constructed “random” gene sets by randomly swapping gene labels. This gives us random term communities and random gene sets with both the same size and relative overlap as the real term communities and the experimentally-defined signatures. We then ran AEA an additional four times, using: (1) random communities and the experimentally-defined signatures; (2) term communities and random gene sets; (3) branches and random gene sets; and (4) random communities and random gene sets. The result of AEA indicated no enrichment using either random term communities or random sets of genes (Figure S3(a)), showing that the term communities generated by partitioning the annotation-driven term network contain useful biological information.

AEA evaluates the overlap in annotations made by a set of genes or to a set of terms. Other, traditional functional enrichment analysis procedures, however, often use Fisher’s Exact Test (FET) to evaluate the overlap in two
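The level assignment described in Section 4.3 can be sketched as an iterative sweep over the DAG. The sketch below reflects one reading of that description, namely that a head-node receives level k once all of its parents have already received a level below k; the function name `assign_levels` and the `parents` dictionary layout are hypothetical and not taken from the original work.

```python
from typing import Dict, Set


def assign_levels(parents: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign a level to every term of a DAG given as term -> set of parents.

    Root terms (empty parent set) get level 1. At each later step, every
    still-unassigned term whose parents are all assigned gets the next level.
    """
    levels = {term: 1 for term, ps in parents.items() if not ps}
    level = 1
    while len(levels) < len(parents):
        level += 1
        # Collect first, assign afterwards, so terms assigned at this step
        # do not unlock their own children within the same step.
        ready = [
            term
            for term, ps in parents.items()
            if term not in levels and all(p in levels for p in ps)
        ]
        if not ready:
            raise ValueError("cycle or unknown parent: input is not a valid DAG")
        for term in ready:
            levels[term] = level
    return levels


# Toy hierarchy (made-up terms): C has parents at levels 1 and 3.
toy_parents = {
    "BP": set(),
    "MF": set(),
    "A": {"BP"},
    "B": {"A"},
    "C": {"BP", "B"},
}
toy_levels = assign_levels(toy_parents)
```

Under this reading a term's level is one more than the largest level among its parents, so a term with several parents is placed below all of them. Each branch would then inherit the level of its head-node.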

[Figure S3. Panels: (a) Annotation Enrichment Analysis; (b) Fisher’s Exact Test. x-axis: significance level, from $p \le 10^{-6}$ to $p \le 10^{-1}$; y-axis: Number of Comparisons Meeting Significance Level. Legend: GO Branches (Cancer Signatures), GO Branches (Random Signatures), Term Communities (Cancer Signatures), Term Communities (Random Signatures), Random Communities (Cancer Signatures), Random Communities (Random Signatures).]

FIG. S3: Level of statistical enrichment found when comparing branches, communities, and randomly generated communities with either cancer signatures or randomly generated gene sets. Only branches and our identified term communities show statistical enrichment in experimentally-derived gene signatures, with slightly more enrichment for the communities compared to the branches.

sets of genes: one defined by a gene set or signature of interest, and the other defined by taking the set of genes annotated to all the terms in a GO branch (the same as the genes annotated to the parent node). Although there is evidence that such analysis is highly sensitive to the annotation degree of genes and terms [14], we wanted to see if our communities were enriched in cancer signatures using this more traditional approach. Therefore, for each GO branch, term community and random community, we took the collection of genes annotated to all terms in that branch/community/random community, and assigned this set of genes to represent that branch/community/random community. We then evaluated the significance of overlap between these sets of genes and sets of genes as defined by cancer signatures, or random sets of genes.

As with AEA, both term communities and GO branches show enrichment in experimentally-defined gene signatures using FET, with term communities perhaps having a slightly greater level of enrichment (Figure S3(b)). At first it is surprising to observe that random communities also show a large amount of enrichment in the experimental gene signatures, much more than either the branches or the real communities. In retrospect, however, this serves to highlight a known weakness of FET: it over-estimates the significance of overlap when genes in a set carry a higher than expected number of annotations. Whereas our random gene sets represent a random sampling from all genes annotated to GO, the genes collected in the signatures published by GeneSigDB are biased in that they are generally annotated much more frequently to GO than one would expect by chance (see [14]). Furthermore, by taking all genes annotated to a collection of terms, highly annotated genes are also more likely to be represented in the gene sets representing the branches, term communities, and the random communities. The enrichment of the experimental gene signatures in the random communities, therefore, is attributable to the fact that FET finds significant overlap between sets containing an abundance of highly annotated genes, independent of the biological content of those sets. For more discussion, see [14].

4.5. Comparative Species Analysis

Even though the Gene Ontology hierarchy establishes a species-independent terminology, one could imagine constructing networks of terms using species-specific gene annotations, and thereby constructing species-specific term communities. These communities would reflect the biological terms that are performed by the same sets of genes in a particular species and would not necessarily be the same across different species. In this section we evaluate the similarity between partitions of GO terms derived from the annotations of seventeen model organisms, including thale cress (Arabidopsis thaliana), Escherichia coli, slime mold (Dictyostelium discoideum), Aspergillus nidulans, three types of yeast including Candida albicans, budding yeast (Saccharomyces cerevisiae), and fission yeast (Schizosaccharomyces pombe), worm (Caenorhabditis elegans), fruit fly (Drosophila melanogaster), zebrafish (Danio rerio), chicken (Gallus gallus), pig (Sus scrofa), cow (Bos taurus), dog (Canis lupus familiaris), mouse (Mus musculus), rat (Rattus norvegicus) and human (Homo sapiens). We downloaded gene annotation files for each of these species and projected term-term networks (see Section “Constructing a Term Network from Gene Ontology Annotations” in the main text). We then partitioned each of these networks into communities (see Section “Identifying Communities of GO terms” in the main text). For simplicity we chose to focus only on the fundamental partition (resolution parameter r = 0, see Equation 3 in the main text). This results in exactly one discrete partitioning of GO terms associated with each species.

[Figure S4: heat map of the variation of information (color scale from 0 to 2) for every pair of species. y-axis (common names): Thale cress, E. coli, Slime mold, A. nidulans, Yeast, Budding yeast, Fission yeast, Worm, Fruit fly, Zebrafish, Chicken, Pig, Cow, Dog, Mouse, Rat, Human. x-axis (scientific names): Arabidopsis thaliana, Escherichia coli, Dictyostelium discoideum, Aspergillus nidulans, Candida albicans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Gallus gallus, Sus scrofa, Bos taurus, Canis lupus familiaris, Mus musculus, Rattus norvegicus, Homo sapiens.]

FIG. S4: The variation of information between term partitions generated from species-specific projected term networks. Labels along the x-axis are the scientific names of the species and labels along the y-axis are the common names of the species. Although the partitions are more similar than random (VI ≈ 2.4), they are still far from being identical (VI = 0), indicating that these species-specific communities of functional terms carry distinct information.

Comparing community structure identification is an ongoing area of research in the complex systems field and there are multiple proposed methods for comparing two discrete partitions of a set of nodes (e.g. [9, 23]). The one we employ here is the variation of information (VI) [23]:

$$\mathrm{VI}(X,Y) = H(X) + H(Y) - 2\,\mathrm{MI}(X,Y) \tag{4}$$

where $H(X)$ is the Shannon entropy associated with a partition, $X$, and $\mathrm{MI}(X,Y)$ is the mutual information between two partitions, $X$ and $Y$. $\mathrm{VI}(X,Y)$ represents a distance between two partitions, $X$ and $Y$, based on the information they share, with $\mathrm{VI}(X,Y) = 0$ indicating identical partitions.

We calculated the VI between the partitions for every pair of species, using only terms common to both partitions when different sets of terms are annotated in the different species. Figure S4 shows these values. We also calculated a VI value for a random shuffling of community assignments in each pair of species and observe that the “random” VI takes a value of approximately 2.4 ± 0.12. The actual VI is generally lower than this random value, showing that there is some shared information; however, most comparisons are much closer to this random value than to “perfect” agreement (VI = 0), indicating that these term partitions are far from identical.

Interestingly, there is a relatively higher level of similarity between the species belonging to the Fungi kingdom (A. nidulans, yeast, budding yeast and fission yeast). On the other hand, there is higher dissimilarity between animals belonging to the Chordata phylum (zebrafish, chicken, pig, cow, dog, mouse, rat, human), both between each other and compared to the other species. This could be a consequence of evolutionary diversity playing a more dominant role in the organization of biological function in these organisms, perhaps through more complex regulatory mechanisms such as epigenetics. The exception is the comparison of mouse, rat and human, which is not entirely surprising given the extent to which mouse is used to mimic the human system in laboratory experiments.

Although some of the differences between the species-specific term communities may be due to variations in the annotation practices among groups that supply annotations to GO, it is also likely that they reflect real, biological differences in the cellular organization of these systems. We suggest that using these partitions of terms in a species-specific context may enhance the results of functional analysis for these model organisms. Furthermore, identifying the exact differences between these communities may uncover important cellular properties of various species, an investigation we leave to future work.

Supplemental Methods

In this section we provide additional information on the methodology used to illustrate the term communities across different resolutions (Figure 2 in the main text) and generate the word clouds representing the biological content of those communities (Figure 4 in the main text).

4.6. Illustrating Community Structure at different resolutions

To better understand the relationships between the communities found at different resolutions, we visualized the uniquely-found term communities with ten or more members for the six lowest values of resolution used (r = {0, 0.25, 0.5, 1, 2, 4}) (Figure 2 in the main text). We line up the communities found at each resolution and visualize each as a circle whose radius scales as the log of the number of term members found in that community and whose color corresponds to the percentage of “Biological Process”, “Molecular Function” and “Cellular Component” terms that belong to that community. In other words, for each community we count the number of members in that community from the “Biological Process” domain and divide by the number of members in the entire “Biological Process” domain. We then do the same thing for the other two domains. After these percentages are calculated, within each community they are “normalized” by dividing by the maximum found percentage such that the vector representing that community’s domain content has at least one member with a value of one. We can think of these values in terms of a three-part cmy color vector. The normalization process causes communities with an equal percentage from all three primary domains to be colored black (cmy color vector equal to [1,1,1]), and those with members only from one primary domain to be exactly yellow ([0,0,1]) for “Biological Process”, cyan ([1,0,0]) for “Molecular Function”, or magenta ([0,1,0]) for “Cellular Component”.

Between the communities found at adjacent resolutions, we draw a line from a community at a higher resolution to a community at a lower resolution if at least 10% of the members of the community from the higher resolution also belong to the community at the lower resolution. Line thickness scales linearly based on the percentage of members of the community from the higher resolution that belong to the community at the lower resolution.

value) of the frequency of that word in the community compared to its frequency across the descriptions of all GO terms using the hypergeometric probability:

$$p = P(N \ge N_{wc} \mid N_w, N_c, N_{tot}) = \sum_{i=N_{wc}}^{\min[N_w, N_c]} \frac{\binom{N_c}{i}\binom{N_{tot}-N_c}{N_w-i}}{\binom{N_{tot}}{N_w}}, \tag{5}$$

where $N_{wc}$ is the number of times that word appears in the term descriptions specific to a community, $N_w$ is the number of times that word appears across all term descriptions, $N_c$ is the number of individual words in the term descriptions specific to a community, and $N_{tot}$ is the number of individual words across all term descriptions. We scale the sizes of the words in the word cloud as $-\log_{10}(p)$ such that words with the lowest probability of being in the community by chance are given the largest size, and those one might expect by chance are given a size close to zero.

We also colored each word based on the percentage of times the terms that word comes from in a community belong to each of the primary domains. Specifically, for each instance of a word in the term descriptions, we determine the domain assignment of that term and give that word instance the same domain assignment. To color the word in a community-specific context, we determine the domain assignments made to all instances of that word in the community. We count these instances and divide by the domain assignments made to all words. This will generate a three-part vector representing the percentage of the primary domain represented by the word in the community. We “normalize” this vector
by dividing by the maximum found percentage, resulting Note that although connections are only made between in a three-part cmy color vector has at least one mem- communities in adjacent resolutions, sometimes the par- ber with a value of one. The described normalization ent community is identical to another community found causes a word with an equal percentage from all three at an even lower resolution, in which case the connection primary domains to be colored black (cmy color vector is made from the child community to the copy of the equal to [1,1,1]), and those with members only from one parent at its lowest found resolution. primary domain to be exactly yellow ([0,0,1]) for “Bio- logical Process”, cyan ([1,0,0]) for “Molecular Function”, or magenta ([0,1,0]) for “Cellular Component”. 4.7. Capturing Biological Information In Word Clouds

In order to easily interpret the contents of our commu- nities we summarize the information contained in each in the form of word clouds using a free word-cloud making program [1]. This program automatically configures the orientation of the words in the clouds, but we manually assign each word a relative size and color to represent that word’s statistical enrichment in the community and the primary domain that word is representing in the com- munity, respectively. To begin, for each community, we determine all the descriptions corresponding to the member terms of that community and count the number of times an individual word appears across all these descriptions. Then, for each of these words, we calculate the statistical enrichment (p-
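As an illustration of the word-enrichment calculation (the hypergeometric probability of Equation 5 and the −log10(p) size scaling), the following is a minimal Python sketch. The function names `enrichment_p_value` and `word_size` are hypothetical and chosen only for this example; the arguments follow the definitions of Nwc, Nw, Nc and Ntot given in the text. It is not the authors' code, only one way the stated formula could be evaluated.

```python
from math import comb, log10


def enrichment_p_value(n_wc: int, n_w: int, n_c: int, n_tot: int) -> float:
    """Upper-tail hypergeometric probability P(N >= n_wc).

    n_wc:  occurrences of the word in the community's term descriptions
    n_w:   occurrences of the word across all term descriptions
    n_c:   number of words in the community's term descriptions
    n_tot: number of words across all term descriptions
    """
    upper = min(n_w, n_c)
    # Exact integer arithmetic; math.comb returns 0 when k > n,
    # so impossible terms drop out of the sum automatically.
    numerator = sum(
        comb(n_c, i) * comb(n_tot - n_c, n_w - i)
        for i in range(n_wc, upper + 1)
    )
    return numerator / comb(n_tot, n_w)


def word_size(p: float) -> float:
    """Relative word size, scaled as -log10(p); requires p > 0."""
    return -log10(p)


if __name__ == "__main__":
    # Toy numbers (assumed for illustration): a word seen 3 times overall,
    # all 3 inside a community holding 4 of the 10 words in total.
    p = enrichment_p_value(3, 3, 4, 10)
    print(p, word_size(p))
```

Using exact integer binomials avoids floating-point overflow for large vocabularies; for very small p the final division can underflow to zero, in which case `word_size` would need a floor value, a choice the text does not specify.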
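The cmy color normalization used for both the community circles and the word clouds can be sketched in the same way. This is a hedged example: the function names are hypothetical, and the conversion to RGB (one minus each cmy component) is an assumption added here for display purposes, not something stated in the text. The channel mapping (cyan for “Molecular Function”, magenta for “Cellular Component”, yellow for “Biological Process”) follows the description above.

```python
from typing import Tuple

Color = Tuple[float, float, float]


def cmy_from_fractions(bp: float, mf: float, cc: float) -> Color:
    """Normalize domain percentages into a (c, m, y) vector.

    bp, mf, cc are the percentages for "Biological Process",
    "Molecular Function" and "Cellular Component". Dividing by the
    maximum guarantees at least one component equals one.
    """
    peak = max(bp, mf, cc)
    if peak <= 0:
        raise ValueError("at least one domain percentage must be positive")
    # cyan <- Molecular Function, magenta <- Cellular Component,
    # yellow <- Biological Process
    return (mf / peak, cc / peak, bp / peak)


def cmy_to_rgb(cmy: Color) -> Color:
    """Simple subtractive conversion (an assumption for display only)."""
    c, m, y = cmy
    return (1.0 - c, 1.0 - m, 1.0 - y)


if __name__ == "__main__":
    print(cmy_from_fractions(0.2, 0.2, 0.2))  # equal mix of all three domains
    print(cmy_from_fractions(0.3, 0.0, 0.0))  # "Biological Process" only
```

An equal mix maps to [1,1,1] (black) and a single-domain community or word maps to a pure cyan, magenta or yellow vector, matching the cases listed in the text.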