UC San Diego UC San Diego Electronic Theses and Dissertations

Title Assembling, analyzing, refining, and cataloging molecular interaction network

Permalink https://escholarship.org/uc/item/0hf2t2fp

Author Mak, Huajiang Craig

Publication Date 2008

Peer reviewed|Thesis/dissertation

eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA, SAN DIEGO

ASSEMBLING, ANALYZING, REFINING, AND CATALOGING MOLECULAR INTERACTION NETWORKS

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

Biology

Huajiang Craig Mak

Committee in charge:

Professor Trey Ideker, Chair Professor Joseph Ecker Professor Amy Kiger Professor Richard Kolodner Professor William Loomis

2008 Copyright

Huajiang Craig Mak, 2008

______

______Chair

University of California, San Diego

2008

iii I stepped from Plank to Plank A slow and cautious way The Stars about my Head I felt About my Feet the Sea.

I knew not but the next Would be my ﬁ nal inch⎯ This gave me the precarious Gait Some call Experience

Emily Dickinson

iv TABLE OF CONTENTS

SIGNATURE PAGE ...... iii

EPIGRAPH ...... iv

TABLE OF CONTENTS ...... v

LIST OF FIGURES ...... x

LIST OF TABLES ...... xii

ACKNOWLEDGEMENTS ...... xiii

VITA ...... xv

ABSTRACT OF THE DISSERTATION ...... xvi

Chapter 1: Introduction

Networks: the output of high-throughput experiments ...... 1 The Network Lifecycle, Part I: Assemble and Analyze ...... 2 The Network Lifecycle, Part II: Reﬁ ne and Catalog ...... 5 Contributions of this dissertation ...... 8

(continued next page)

v Chapter 2: Assembling a DNA damage response network

Abstract ...... 11 Introduction ...... 12 Results ...... 13 Identifying transcription factors involved in the MMS response ...... 13 Measuring the MMS-induced transcriptional network ...... 14 Deletion-buffering analysis ...... 18 Building a network of regulatory pathways ...... 20 Experimentally testing the model ...... 23 Methods ...... 25 Strains and Media ...... 25 Phenotypic Sensitivity Assays ...... 25 Whole Genome Expression Analysis ...... 26 Genome Wide Transcription Factor Binding Analysis ...... 26 Array Preparation and Hybridization ...... 27 Data Post-processing and Signiﬁ cance Assessment ...... 27 Overlap in Binding Between Conditions ...... 28 Signiﬁ cance of expansion or contraction ...... 29 Motif Enrichment in Transcription Factor Binding Data ...... 30 Deletion Buffering Analysis ...... 31 Assembly of Transcriptional Pathways From Integrated Data ...... 32 Assessing overlaps in binding between TF pairs ...... 33 Tables ...... 35 Supplemental Figures ...... 36 Acknowledgements ...... 44

(continued next page)

vi Chapter 3: Analyses of position effects within the yeast transcriptional network

Abstract ...... 45 Introduction ...... 46 Results ...... 47 Discovery of a large family of Subtelomere Binding Transcription Factors ...... 47 SBTFs do not behave like known telomere-binding complexes or transcription factors but do interact with these proteins ...... 52 SBTFs regulate stress and carbon metabolism in three broad clusters54 Six SBTFs target heterochromatin domains that are derepressed during growth on alternative carbon sources ...... 55 Dynamic subtelomeric binding is correlated with gene expression ...... 58 Discussion ...... 60 Methods ...... 63 Computational screen for SBTFs ...... 63 Analysis of protein complexes and interactions involving SBTFs ...... 64 Analysis of repetitive element binding by SBTFs ...... 64 Screening SBTF binding profi les for correlation with hda1Δ sensitivity ...... 65 Dilution assays for growth of SBTF deletions on alternative carbon sources ...... 66 Inferring SBTF modes of action by integrated analysis of matched binding and expression profi les ...... 66 Identifi cation of SBTF orthologs ...... 67 Supplemental Figures ...... 69 Supplemental Tables ...... 74 Acknowledgements ...... 77

(continued next page)

vii Chapter 4: Reﬁ nement of a transcriptional regulatory network using graphical probabilistic models

Abstract ...... 78 Introduction ...... 79 Results ...... 80 Summary of physical regulatory models ...... 80 Experiment selection ...... 83 Model validation ...... 85 Automated model reﬁ nement ...... 87 Learning curve analysis ...... 88 Discussion ...... 89 Conclusions ...... 91 Methods ...... 92 Model building and inference ...... 92 Experiment scoring ...... 92 Expression proﬁ ling ...... 93 Expression coherence ...... 94 Tables ...... 96 Supplemental Tables ...... 98 Acknowlegements ...... 102

(continued next page)

viii Chapter 5: Cataloging network models in a searchable online database

Abstract ...... 103 Introduction ...... 104 Results ...... 106 A spectrum of network models ...... 106 Database coverage and assembly ...... 108 Network model query ...... 109 Meta-analysis of models ...... 111 Discussion ...... 113 Methods ...... 114 Data processing ...... 114 Web interface ...... 114 Scoring models for Gene Ontology (GO) annotation ...... 115 Scoring similarity between publications ...... 115 Tables ...... 116 Supplemental Tables ...... 117 Acknowledgements ...... 118

Chapter 6: Conclusions

Physiologically relevant interpretation of high-throughput data ...... 119 Harnessing complementary relationships ...... 120 Controlling for missing data and spurious data ...... 122 Remaining grounded in biology ...... 123 Beyond wiring diagrams ...... 125 Final remarks ...... 127

References ...... 128

ix LIST OF FIGURES

Figure 2.1. Overview of the systems approach...... 13 Figure 2.2. TF selection, chIP-chip, and expression profi ling experiments...... 15 Figure 2.3. Overlap in transcription factor binding patterns in response to MMS...... 17 Figure 2.4. Deletion buffering of the MMS response...... 20 Figure 2.5. A network of regulatory interactions connecting transcription factors to deletion-buffered genes...... 22 Supplemental Figure 2.1. MMS sensitivity assay of transcription factor deletions strains...... 36 Supplemental Figure 2.2. Scoring overlap in TF binding between conditions...... 37 Supplemental Figure 2.3. Clustering of MMS expression responses with other stresses...... 38 Supplemental Figure 2.4. MMS responsive genes implicate G1/S checkpoint...... 39 Supplemental Figure 2.5. Crt1 deletion-buffering P-values...... 40 Supplemental Figure 2.6. Number of genes buffered by each transcription factor deletion (buffering p-value <0.005)...... 41 Supplemental Figure 2.7. Correspondence between expression and regulation of the RNR genes by Crt1p...... 42 Supplemental Figure 2.8. Direct binding interactions supported by deletion-buffering effects...... 43 Figure 3.1. Transcription Factors that Preferentially Bind Sequences at Subtelomeres...... 48 Figure 3.2. Example SBTFs that Show Dynamic Binding Preferences as a Function of Growth or Stress Condition...... 50 Figure 3.3. Subtelomeric Binding Preference is Dynamic and Distributed into Distinct Clusters...... 51 Figure 3.4. Interactions Between SBTFs and Telomere-related Genes...... 53 Figure 3.5. Binding Profi les, Expression, and Growth Phenotypes Link SBTFs to Hda1p...... 57 Figure 3.6. Dynamic Binding and Expression Profi les Suggest Models of SBTF Function and Link SBTFs to Poorly Characterized Genes at Subtelomeres...... 59

(continued next page)

x Supplemental Figure 3.1. Estimating the error in identifying SBTFs...... 69 Supplemental Figure 3.2. A map of the subtelomeric regulatory circuitry...... 70 Supplemental Figure 3.3. Analysis of protein complexes...... 71 Supplemental Figure 3.4. Screening expression proﬁ les of singe gene deletions for enrichment of subtelomeric genes...... 72 Supplemental Figure 3.5. Expression responses of subtelomeric genes to environmental perturbations...... 73 Figure 4.1. Wiring diagrams for example network models...... 81 Figure 4.2. Schematic of the experimental design approach...... 84 Figure 4.3. Validation and reﬁ nement of Swi4 transcriptional cascades...... 86 Figure 4.4. Simulated learning curves of three experimental design methods...... 89 Figure 5.1. The need for a new type of database...... 104 Figure 5.2. Representative network models stored in CellCircuits...... 107 Figure 5.3. Web interface (http://www.cellcircuits.org)...... 110 Figure 5.4. Meta-analysis of models...... 111 Figure 6.1. Contributions of the dissertation to each stage of the network lifecycle. .119

xi LIST OF TABLES

Table 2.1. DNA sequence motifs found in promoters bound in only one condition .....35 Table 3.1. Summary of Subtelomere Binding Transcription Factors ...... 68 Supplemental Table 3.1. List of telomere-related genes curated from the literature. ...74 Supplemental Table 3.2. Correlation of binding proﬁ les with the presence of genomic features...... 75 Supplemental Table 3.3. Orthologs of SBTFs in other species...... 76 Table 4.1. Internal validation for 21 of the 38 inferred models...... 96 Table 4.2. Top-ranking knock-out experiments proposed for model discrimination. ..97 Supplemental Table 4.1. Internal validation for 17 out of 24 restricted network models...... 98 Supplemental Table 4.2. Correlations between swi4Δ and gcn4Δ data from Rosetta and the new experiments...... 99 Supplemental Table 4.3. Restricted subsets used to evaluate the reproducibility...... 100 Supplemental Table 4.4. Gene sets for external validation...... 101 Table 5.1. Sources of data...... 116 Supplemental Table 5.1. Figures curated...... 117

xii ACKNOWLEDGEMENTS

I thank Trey Ideker, Chris Workman, Scott McCuine, Chen-Hsiang Yeang, and Mike Daly for the opportunity to collaborate on this research. Your insight, advice, and work were critical and will continue to shape my thinking in the future. To my advisor, Trey, I owe the opportunity to work in the lab, continued guidance and advice, and an uncompromising standard of excellence. I am grateful to Tim Ravasi, Kai Tan, Sourav Bandyopadhay, Ryan Kelley, Dwight Kuo, and Bianca Gruebel for their discussions, comments, and contributions to my research. I would also like to thank Sandi Jacobson, Scott McCuine, Colin Luo, Kate Licon, and George Kassavites for experimental assistance. Sandi Jacobson and Lorraine Pillus welcomed me into their lab and shared their love of yeast, chromatin, and telomeres – Thank you. I am grateful to Joe Ecker, Amy Kiger, Richard Kolodner, and Bill Loomis for helpful advice and discussions. I also owe thanks to Geoff Rosenfeld and Valentina Perissi for an initial foray into the biology of nuclear receptors. That exposure has turned out to be invaluable. Finally, I thank everyone in the Ideker lab. Over the years, you all shaped my work for the better in one way or another. Vasanthan Dasan, Tae Kim, and Ann Togasaki helped me get to where I am today and led by example along the way. Thank you. My time in La Jolla was greatly enriched, and my sanity preserved, by friends who helped remind me about life outside of biology. I thank John, Hudson, Dan, Lauren, Dave Shear, Ambika, and Dave and Naomi Spinak. I am indebted to my family, Nina, Mom, Dad, and T, for encouragement, patience, and an unwavering foundation from which to go forward. Last, but not least, I owe everything to Kiran for keeping it real. This is for all of you.

xiii Chapter 2, in full, is a reprint of material as it appeared on May 19, 2006 in Science, 2006, vol. 312, p. 1054-1059. Workman, Christopher T.; Mak, H. Craig; McCuine, Scott; Tagne, Jean-Bosco; Agarwal, Maya; Ozier, Owen; Begley, Thomas J.; Samson, Leona D.; Ideker, Trey. The dissertation author was one of three primary investigators (with C.T. Workman and S. McCuine) and two primary authors (with C.T. Workman) of this paper.

Chapter 3, in full, has been submitted for publication as it may appear in Genome Research, 2008, Mak, H. Craig; Pillus, Lorraine; Ideker, Trey. The dissertation author was the primary investigator and author of this paper.

Chapter 4, in full, is a reprint of material as it appeared on July 1, 2005 in Genome Biology, 2005, 6:R62. Yeang, Chen-Hsiang; Mak, H. Craig; McCuine, Scott; Workman, Christopher; Jaakkola, Tommi; Ideker, Trey. The dissertation author was one of two primary investigators and authors (with C.H. Yeang) of this paper.

Chapter 5, in full, is a reprint of material as it appeared on November 29, 2006 in Nucleic Acids Research, 2006, vol. 35, D538–D545, Database issue 2007. Mak, H. Craig; Daly, Mike; Gruebel, Bianca; Ideker, Trey. The dissertation author was one of two primary investigators and authors (with M. Daly) of this paper.

xiv VITA

1999 Bachelor of Science, Massachusetts Institute of Technology 1999-2002 Member of Technical Staff, Sun Microsystems Inc. 2003-2005 Teaching Assistant, University of California, San Diego 2002-2008 Research Assistant, University of California, San Diego 2008 Doctor of Philosophy, University of California, San Diego

PUBLICATIONS

“Validation and reﬁ nement of gene regulatory pathways on a network of physical interactions” Genome Biology, Volume 6:R62, July 2005

“A systems approach to mapping DNA damage response pathways” Science, vol 312, pp 1054-1059, May 2006

“CellCircuits: A database of protein network models” Nucleic Acids Research, vol 35, D538–D545, Database issue 2007, November 2006

“Dynamic reprogramming of transcription factors to and from the subtelomere” Genome Research, submitted, August 2008

FIELDS OF STUDY

Major Field: Biology (Computational and Systems Biology)

Studies in Computational and Systems Biology Professor Trey Ideker

xv ABSTRACT OF THE DISSERTATION

ASSEMBLING, ANALYZING, REFINING, AND CATALOGING MOLECULAR INTERACTION NETWORKS

Huajiang Craig Mak

Doctor of Philosophy in Biology

University of California, San Diego, 2008

Professor Trey Ideker, Chair

Life within an organism is sustained by biomolecular interactions. Mapping networks of interactions and deciphering their functions – at both individual and global scales – is a long-standing challenge. Network analyses promise to illuminate how complex behaviors emerge from collections of individual interactions. Many disease states and physiological responses, for instance, emerge from the complex interconnections between biological pathways.

xvi This dissertation addresses four challenges inherent in deriving biological meaning from networks. We studied transcriptional regulatory networks, although the insights gained are more widely applicable. First, we investigated methods for assembling regulatory networks by integrating physical and functional data. Second, we analyzed individual regulatory pathways and global properties of networks. Third, we investigated methods for effi cient network validation and refi nement. Fourth, we describe a database of network models: its value, its implementation, and its potential uses as a research tool. We present the fi rst large-scale network model of the transcriptional response to a DNA damaging agent in a eukaryotic cell. We developed novel methods for statistically assigning confi dence scores to high-throughput data and for integrating physical and functional data to generate a network that is induced by an environmental perturbation. In a second study, we developed computational methods to discover and characterize a group of 22 yeast transcription factors that specifi cally bind to targets in the subtelomeric regions near the ends of chromosomes. In a third study, we used a measure of information gain to prioritize potential experiments for validating a network. We then performed three top-ranking experiments and used the results to refi ne the network. In a fourth study, we describe CellCircuits, an online, searchable database of network models containing over 1000 models from eleven published studies. This database was made available to the wider research community. As a result of our work, we identifi ed general guidelines and themes for working with molecular interaction networks. We conclude with remarks on grounding computational research in the realities of biology.

xvii Chapter 1: Introduction

Networks: the output of high-throughput experiments

Imagine the molecules in a cell. Little machines that carry out life’s tasks: breaking, synthesizing, modifying, transporting. Replicating. Each has specifi c roles, but none work in isolation. Each has the potential to work with many others, but not at the same time or same place within the cell. If we could map the web of interactions between molecules, we would have a network: a record of chemical affi nities, of what infl uences what, of how cells get things done. Today we can, in fact, map networks, albeit to an incomplete but ever improving extent. Key advances in technology have enabled biological measurement on a massive, so called high-throughput, scale. Whereas the Southern blot has been used since 1975 to measure the amount of messenger RNA produced from a single gene1, the development of the microarray in 1995 allowed us to measure in a single experiment the RNA produced from all genes2. The measurement capacity of microarrays has been paired with other experimental protocols from molecular biology to identify the genomic locations bound by regulatory proteins3, the targets of kinase enzymes4, and the individual cells among a pool of a thousand variants that grow the fastest5. Microarrays derive their power by performing many experiments in parallel. Parallelization is a powerful strategy that has also recently been applied to other experimental techniques including DNA sequencing6-8, generating mutants with gene deletions9, and mutant phenotyping9. The net results of these parallel experiments are millions of data points.

Each experimental technique illuminates a different facet of life in a cell, possibly at different times, possibly under different environmental conditions, possibly with different components of the cell modiﬁ ed or disabled. Each experiment produces a

1 2 network. And combined, these networks together form a more complex network that is more comprehensive in its representation of cellular activity. The challenges inherent in extracting meaningful information from networks are the central subjects of this dissertation. We conceptually group these into four stages: network assembly, network analysis, network reﬁ nement, and network cataloging. Together these four stages constitute the lifecycle of a network.

The Network Lifecycle, Part I: Assemble and Analyze

In assembling a network, data collection is the fi rst step. At this point the network may contain an impressive amount of raw data, but it is not very useful. Among the thousands of genes and interactions represented by the network, only a handful may be involved in a biochemical pathway or disease being studied. Filtering becomes an important task. And as alluded to previously, the activities within a cell are multi-faceted, involving protein, DNA, RNA, and metabolite interactions. Some of which represent direct contact between molecules, while others represent the indirect effects observed when one or more molecules are disrupted. Integrating these different interactions becomes a second important task. Together, data collection, fi ltering, and integration are the challenges faced in network assembly. Filtering is often simple. It is aimed at answering questions such as: Where is my favorite gene in the network? And what does it interact with? Software tools have been developed that make it easy to answer these simple questions10,11. Online databases of molecular interactions also provide the ability to retrieve data for a specifi c gene or molecular pathway12-16. Moving beyond single genes, networks have been analyzed for global properties – consistent organizing principles that illuminate, for instance, how a cell functions as a system, how it is robust to failures, how it adapts to the environment, or how one external signal can reprogram several internal 3 pathways17. For example, one property of a protein in a network is the number of other proteins with which it interacts. Most proteins have a small number of interacting partners, while fewer have many partners, forming a hub-and-spoke topology18-21. The hubs have been interpreted as “master regulators,” proteins that sit at the top of a regulatory hierarchy, responsible for propagating one signal to many downstream targets in the network. To the spokes are ascribed more specialized roles: perhaps functioning in one tissue in the body or one biochemical pathway among many22. On top of this hub-and-spoke model other data can be layered. By layering the time during the cell cycle when each protein is expressed, for instance, we observe that some hub proteins act as gatekeepers, regulating progression through the cell cycle23-25. Yet other hubs appear to interact with the same sets of partners, but only one hub is ever expressed in the cell at a given time, suggesting that their roles may be interchangeable, but not redundant19. Layering additional information on to a network is an example of what is referred to as data integration. Data integration is usually more complex than fi ltering. But what, exactly, does it involve? Integration. Layering. These are words borrowed from other disciplines – mathematics or stone masonry perhaps – to describe the biologist’s abstract aspirations, not actual actions. They describe what we feel must be possible, but do not know exactly how to do yet, otherwise we would have a more precise word. We think: new technology generates so much data; there must be a way to integrate it all to better understand a living organism. Somehow, someway we should be able to mush it all together, wringing out biological insight. But how? What way? The methods used to date have been typically grounded in either statistical pattern fi nding or in a search for patterns directed by biologically inspired templates. Computer scientists use the terms ‘unsupervised’ versus ‘supervised’ to express 4 this distinction in methodology. Both methods have been successful. Statistical pattern fi nding approaches use techniques from a branch of computer science known as machine learning and have successfully identifi ed small, repeated patterns of interactions (so called network motifs)26, groups of genes and regulators involved in different cancer types27, and modules of highly interconnected genes that respond coherently to external stimuli28. Patterns directed by biologically inspired templates have been used to search for gene expression signatures among large collections of microarray data29,30, evolutionarily conserved DNA regulatory motifs31, interconnectivity between molecular machines in protein interaction networks32, and pathways conserved across the networks of multiple species33. Two themes emerge from past efforts at data integration. First, data integration often produces additional networks – either composite networks made of interactions drawn from several data sources, focused networks made of a subset of genes, or a network both composite and focused. Second, network assembly and analyses are often interconnected activities. The act of integration is, itself, one of analysis – an application of rules or heuristics that formalize the biologist’s insight about how one data type is related to another. In this dissertation we posit that data integration can be defi ned as the act of formalizing these relationships. To integrate, to layer data onto the network, to mush it all together is to articulate a deep understanding of how those data are related. If we observe an interaction between proteins, what does that tell us about their functional relationship? If two interacting proteins are expressed at different times or at different locations in the cell, what does that imply? If deleting a regulatory protein only affects the expression of its targets under specifi c environmental conditions, do we expect that regulator to directly bind the promoters of those targets? The variations on these questions are endless. Proposing answers, codifying, and then testing them 5 against the data collected is the methodology we will use to extract meaning from networks. The biological focus of this dissertation will be on transcriptional regulatory networks. These networks represent interactions between proteins that regulate gene expression and their target genes. Regulatory proteins may include transcription factors (TFs), signaling molecules, chromatin proteins that package DNA, or the RNA polymerase holoenzyme. We articulate relationships between direct (physical binding) and indirect (functional dependence) interactions linking TFs to their targets (Chapters 2 and 4). We also articulate relationships between the physical location of a gene along the chromosome and TF-target binding interactions (Chapter 3). We describe computational methods that codify these relationships and then apply these methods to study transcriptional responses to genetic perturbations (Chapter 4), responses to DNA damaging chemicals (Chapter 2), and properties related to the organization of genes along chromosomes (Chapter 3).

The Network Lifecycle, Part II: Reﬁ ne and Catalog

Next, we turn to the challenges of working with a network after it has been assembled and perhaps analyzed. Among the questions the biologist might ask when presented with a network: What is new and what is already known? How do the network results compare with previous fi ndings? Why is a well-known observation not represented in the network? What conclusions can be drawn from this network about the function of my favorite gene? As a result of exploring these questions, the network may need to be refi ned, either by adding new data, revisiting analysis methods, or throwing it all out and starting over. In many cases, the network may be known to be incomplete. It is widely recognized, for example, that networks assembled using data from high- 6 throughput experiments often contain many false negatives (missing interactions) and false positives (spurious interactions)34. Alternatively, a network may be interpreted in more than one way that is consistent with the observed data35. In all of the aforementioned scenarios we are interested in the best ways to refi ne the network – either by adding missing interactions or pruning erroneous ones. Network refi nement is an open problem that is often overlooked in the rush to gather, assemble, and analyze. Networks can be refi ned by quantitatively scoring the accuracy of individual interactions. Low scoring interactions may be discarded, leaving only the proteins and interactions judged most likely to be accurate. Krogan and Greenblatt identifi ed a high-confi dence set of protein interactions using a decision trees and Bayesian statistics – tools from a branch of computer science known as machine learning36. Networks have also been compared to each other and to “gold standard” data sets of interactions known to be true. Collins and Krogan used a probabilistic scoring method to assess the accuracy of two protein interaction networks gathered using the same experimental technique, affi nity purifi cation followed by mass spectrometry37. Reguly and colleagues curated a network of over 33,000 interactions from the published literature and compared it to networks obtained by high-throughput methods38. After applying methods like these, the resulting refi ned networks are more likely to contain true biological interactions and less likely to contain interactions measured in error. A second approach to network refi nement involves using the network to predict a biological response – perhaps to a gene deletion or to an environmental stimulus. Then, experiments are performed to observe the actual response. The agreement between the predicted and the observed responses is a measure of how well the network represents the real biological system. Next, computational approaches can be used to refi ne the network by adding or removing interactions, and then using 7 the refi ned network to predict the response. The best refi nements will improve the network’s predictive power. This predict-compare-refi ne strategy has been applied to the yeast metabolic network39 and to transcriptional regulatory networks40,41. In Chapter 4, we apply this approach to refi ne transcriptional regulatory networks that model responses to genetic perturbations. ______

Biological discovery is an ongoing process in which databases of biological information have been recognized as important tools. DNA sequences42,43, protein structures44, microarray results45,46, and protein interactions13,15 are all stored and disseminated in online databases. The scientifi c community continually adds new and refi nes existing data. Databases of interaction networks, though, have only recently begun to emerge. Designing a database for networks is primarily an engineering challenge. However, as we discuss in Chapter 5, the resulting database can be used to address scientifi c questions. Engineering design criteria include the mechanisms used to store networks, the interface for displaying networks to a user, and the tools provided for searching or browsing the repository. Networks are typically stored and presented in two ways: in a graphical format conducive to human interpretation or in a textual format that is easier to manipulate using computers. The software programs used for network analysis have typically used simple text formats10,11. But specialized markup languages, such as BioPAX47 and SBML48, have been developed to describe complex networks. In contrast, networks in journal articles are often presented in graphical fi gures, which may be easier for people to interpret, but are not amenable to further computational analysis. A number of network-centric databases exist. The BioModels database stores 8 quantitative dynamic network models49. Reactome50, MetaCyc51, and KEGG52 archive metabolic networks. GeNet53 and Science STKE54 are focused on signaling networks. These databases are all designed to fi ll a need for storing a particular kind of network, and are intended to store validated pathways or networks. The growth in sequence and protein data has been explosive in recent years. The growth of biological network data is likely to follow a similar trajectory, suggesting that network databases will become invaluable, much like sequence databases are today. We present a database of networks in Chapter 5.

Contributions of this dissertation

In this dissertation, we describe four studies that address biological questions by using molecular interaction networks. These studies are broadly focused on gene regulatory networks – networks that model how proteins, typically transcription factors, regulate the expression of genes. We describe methods for assembling regulatory networks from high-throughput experiments that measures TF-promoter interactions, protein interactions, and gene expression levels. We also describe methods for analyzing regulatory networks. Several analysis methods seek to identify combinations of proteins that work together to regulate expression, either at same promoter or in the context of a regulatory pathway. The experimental and computational tasks performed in these studies fall into four categories – network assembly, network analysis, network reﬁ nement, and network cataloging – that together can be thought of as the network lifecycle.

Assembly. In Chapter 2, we describe a map of the transcriptional response in yeast to a DNA damaging chemical, methyl-methanesulfonate (MMS). MMS alkylates DNA molecules in multiple places and is used as a model for understanding 9 chemotherapeutic drugs55. We codifi ed the relationship between two types of network data: (1) physical binding interactions between transcription factors and gene promoters, and (2) functional interactions derived from expression profi les of transcription factor deletions. We applied this knowledge to model how a select group of 30 transcription factors regulate DNA damage induced expression changes through a network of TF-promoter and protein-protein interactions. We present the fi rst large scale regulatory network of the DNA damage response, and identify many instances of cross-talk, where a factor typically associated with one biological process in fact appears to regulate several processes. Analysis. In Chapter 3, we describe an analysis of position effects in the yeast transcriptional regulatory network. We identifi ed 22 transcription factors that preferentially bind the subtelomeric regions of chromosomes. Subtelomeric binding is a novel property of these transcription factors and of the transcriptional regulatory network. These fi ndings suggest that genome position effects involve not only co-expression of neighboring genes along a genome but are also evident in the architecture, and perhaps evolution, of entire transcriptional regulatory networks. Refi nement. In Chapter 4, we describe a combined computational and experimental method for refi ning a transcriptional regulatory network. We used an information-theoretic score to rank genes in the network for further experimental investigation. We then performed four highly ranked experiments and assessed how well the observed responses were predicted by the network. We use simulations to demonstrate that our refi nement method is theoretically superior to randomly picking genes and to picking ‘hub’ genes, those with many interaction partners. Cataloging. In Chapter 5, we describe CellCircuits, a database of 1052 network models obtained from 11 published datasets. CellCircuits archives networks in graphical and text formats. It is accessible via the World Wide Web, and provides 10 an interface for searching for network models by a specifi c gene or Gene Ontology annotations. We demonstrate that a database of network models is useful for more that just storing models. By performing meta-analysis of the 1052 models in CellCircuits, we show how a database of models can be used to assess the novelty of new network results. Chapter 2: Assembling a DNA damage response network

Abstract

Failure of cells to respond to DNA damage is a primary event associated with mutagenesis and environmental toxicity. To map the transcriptional network controlling the damage response, we measured genome-wide binding locations for 30 damage-related transcription factors (TFs) after exposure of yeast to methyl- methanesulfonate. The resulting 5,272 TF-target interactions revealed extensive changes in the pattern of promoter binding and identifi ed damage-specifi c binding motifs. As systematic functional validation, we identifi ed interactions for which the target changed expression in wild-type cells in response to MMS but was non- responsive in cells lacking the TF. Validated interactions were assembled into causal pathway models that provide global hypotheses of how signaling, transcription, and phenotype are integrated after damage.

11 12

Introduction

Exposure of cells to chemical and physical damaging agents can result in DNA lesions that contribute to the onset of cancer, aging, immune defi ciencies, and other degenerative diseases56. Initially, DNA damage is sensed by a highly-conserved mechanism consisting of the ATM and ATR protein kinases (the Tel1 and Mec1 enzymes in yeast) which aggregate at DNA lesions57 and activate signaling cascades that include the Chk protein kinases (Chk1, Rad53, and Dun1 in yeast). These kinases, in turn, trigger both transcriptional and transcription-independent responses, including activation of DNA repair machinery and cell-cycle arrest56. Beyond the known DNA repair genes, genome-wide expression profi ling in yeast has identifi ed several hundred genes58-60 that undergo increased or decreased expression in response to alkylation damage by methyl-methanesulfonate (MMS). At the level of growth phenotype, systematic deletion studies have also identifi ed several hundred genes that are required for normal recovery from alkylation damage61- 63. Surprisingly, the set of genes that, when deleted, affect damage recovery is not enriched for genes whose transcript levels change upon damage exposure62,64. Thus, transcriptional profi ling alone, or genomic phenotyping alone, does not adequately defi ne the cellular response to DNA damaging agents. However, these studies do suggest that the DNA damage response involves multiple levels of regulation, affecting not only DNA repair genes but also genes that infl uence protein and lipid turnover, cytoskeleton remodeling, and general stress pathways. 13

Results

Identifying transcription factors involved in the MMS response

To construct a global model of yeast transcriptional networks activated by MMS, we applied a systems approach65 that integrated data from genome-wide chromatin immunoprecipitation assays, expression proﬁ ling, systematic phenotyping, and protein interaction databases. First, we performed a systematic screen for transcription factors (TFs) involved in the MMS response. TFs were chosen from a set of 141 yeast DNA-binding factors66 and were selected according to any one of three criteria T, B, or S (Figure 2.1A). These criteria were TF expression (that is, the TF was differentially expressed after exposure to 0.03% MMS); expression of Bound genes (that is, the TF had been previously shown66 to bind the promoters of genes that were differentially expressed in the above MMS experiment); or Sensitivity (that is, deletion of the TF gene, if not lethal, caused growth sensitivity in MMS relative to nominal conditions, see Fig. S1).

TF Selection A T B S L

(30 TFs) TF-Binding B 5272 +MMS Condition ChIP-chip Diff. (3845 both) dependent +/- MMS binding interactions 4996 -MMS (6,423 TF-binding interactions) Expression

C TF∆ Bayesian (341) Deletion- expression analysis buffering +/- MMS relationships

Validation & integration (119 TF-binding interactions) Results D Causal networks Regulatory TF-motif modules associations C TGATTC TF TF deletion Target gene Diff. expressed Figure 2.1. Overview of the systems approach. 14

A set of 23 TFs was identiﬁ ed (Figure 2.2A). Nine TFs were implicated by multiple criteria (Yap1, Gcn4, Cin5, Fkh2, Swi5, Swi6, Ixr1, Rim101, Rpn4), and three were encoded by essential genes (Hsf1, Mcm1, Ndd1). Two of the 14 MMS- sensitive TF-deletion strains (sok2Δ and ecm22Δ) had not been reported in a previous genome-wide assay for MMS sensitivity61. The set also included seven of the nine known cell-cycle regulators25, as might be expected given that DNA damage affects cell cycle progression56. A search of the literature identiﬁ ed 15 TFs that had been previously associated with regulation of DNA repair67-69. Eight of these were also detected by our systematic criteria; the remaining seven were added to our list to yield a total of 30 TFs associated with the DNA damage response.

Measuring the MMS-induced transcriptional network

We then used the technique of chromatin immunoprecipitation coupled with microarray chip hybridization (chIP-chip) to identify the MMS-induced transcriptional network immediately downstream of each TF. Exponentially-growing yeast cultures were exposed to 0.03% MMS for one hour, resulting in ~50% cell viability58,61. At a signifi cance threshold of p ≤ 0.001, the number of promoters bound by each TF ranged from 13 (Adr1 and Dig1) to 1,078 (Ino4) with an average of 214 per factor (Figure 2.2B). A total of 5,272 protein-DNA interactions with 2,599 distinct genes were identifi ed for the 30 factors. To reveal how the transcriptional network was reprogrammed between damaging and non-damaging conditions, we compared the MMS-induced interactions for each TF to the corresponding interactions as reported in nominal (non-induced) growth conditions by Lee et al.66. Raw data from nominal conditions were reanalyzed using an identical data processing pipeline and signifi cance threshold (p ≤ 0.001) as for the MMS experiments, yielding a total of 4,996 protein-DNA interactions with 15

A B Number of genes bound T B S L PC PE Cad1 26,17 69 Pdr1 2, 25 48 Ino4 50 568 460 Rim101 14 124 107 Uga3 25 84 74

Dal81 “Expanded” 11 136 113 Mcm1 * 17 261 137 Rpn4 0 198 74 Ecm22 23 291 93 Swi4 38 398 90 Cin5 29 263 12 Sok2 17 255 9 MMS only Swi6 � 22 160 8 �MMS only Fkh2 35 193 8 Both Yap1 12 58 6 Ixr1 � 35 104 5 Ndd1 * 30 78 15 Dig1 �� 2,10 1 Ace2 36 56 11 ��

Adr1 P -value scale 5, 7 1 Gcn4 �� 31, 42 5 Sko1 153 199 14 Fzf1 12, 2 0 Msn4 15, 3 2 Ash1 119 103 6 Crt1 47, 25 0 Swi5 61 36, 10 Hsf1 143 113 41

* “Contracted” Rtg3 37, 8 5 Yap5 104 28, 3

Figure 2.2. TF selection, chIP-chip, and expression proﬁ ling experiments. (A) Results of the four criteria used to select the 30 TFs (TF expression, expression of Bound genes, Sensitivity or Literature; see text). In column T, a ‘+’ or ‘−’ represents increased or decreased respectively. For column S, a ‘*’ denotes essential TFs. (B) Number of gene promoters bound by each TF. The three regions represent promoters bound exclusively in the absence of MMS (blue), presence of MMS (orange), or in both conditions (green). The proportion of genes in each region was compared to a negative control data set using Fisher’s exact test (see Methods). A red square in the left column

(PC, contracted) indicates that the proportion both/(both+absence) is signiﬁ cantly lower than expected

in the negative control. Similarly, red in the right column (PE, expanded) indicates that the proportion both/(both+presence) is lower. 16

2,588 distinct genes for the same 30 factors in nominal conditions. For each TF, we applied a pairwise statistical analysis to score the overlap in promoter binding between the damage-induced and non-induced conditions. This method exploits the dependency in p-values of binding between two related chIP- chip data sets to identify TF-promoter interactions with more sensitivity than can be obtained if each data set were analyzed separately (see Methods and Supplemental Figure 2.2). Three to six factors bound signifi cantly more genes in the presence of MMS, and conversely, eight factors bound signifi cantly more genes under nominal growth conditions (Figure 2.2B). Several promoter sets were enriched for DNA sequence binding motifs reported in the literature (Table 2.1)70-72. The observed shifts in promoter binding for Cad1 and Hsf1 correlated with the known Cad1 and Hsf1 binding motifs, which were enriched in the sets of promoters bound before but not after MMS exposure. We also used the ANN-Spec algorithm73 (see Methods) to search each set of promoters for motifs that had not been previously reported. Four such motifs were found, associated with Gcn4, Sko1, Ndd1, or Swi5, respectively (Table 2.1). The GCTCGAAAA motif was found upstream of 12 of 15 genes bound by Ndd1 in the presence of MMS, but was found upstream of only three of 30 genes bound in the absence of MMS. Such motifs may have been missed in past analyses because they are not active prior to damage exposure. They may represent DNA-binding sequences bound directly by the associated factor (Gcn4, Sko1, Ndd1, or Swi5) upon post-translational modifi cation, or alternatively, they may be bound by a different TF that is coordinately recruited to the promoters after MMS treatment. To identify combinations of TFs that regulate genes in common, we scored the signifi cance of overlap between the gene sets bound by each pair of the 30 TFs (Figure 2.3A; hypergeometric test at p ≤ 0.01). Cad1 was found to pair with Yap1 in 17

Significant overlap 1 2 1 0 1 2 4 4 3 1 1 2 4 1 1 1 3 2 1 1 2 1 8 1 5 5 6 4 1 1 4 1 5 1 I I I M

of genes bound A N N K D T E L M N R D F R H O B G H P P

A F O M N G R C S C A I P G D I I R A S T C W C S W W D K O D K Z A A N X +MMS A S I D M R E R U A F N S S C Y D I F C A G C H S S P R M Y -MMS YAP5 both

Ribosome biogenesis CRT1

MSN4 RPN4 ECM22

PDR1 RIM101 DAL81 RTG3 Pyrimidine metabolism UGA3

MCM1 INO4

ASH1 Heat shock and protein folding FKH2 SWI5 SWI4 HSF1 CAD1

Cell cycle CIN5 Redox SOK2 homeostasis ACE2 SWI6 Sugar transport; RNA pol II YAP1 NDD1 transcription

IXR1 SKO1

Figure 2.3. Overlap in transcription factor binding patterns in response to MMS. (A) TFs are linked if they bind sets of promoters that signiﬁ cantly overlap. Thicker lines represent overlap that is more statistically signiﬁ cant than overlap represented by thin lines. Linked TFs are labeled with the high-level functions of their associated sets of (co)-regulated genes. (B) Hierarchical clustering matrix for the 30 TFs (columns) across the top 1,000 gene targets with highest variance (in log p-value of binding; rows) pooled over –MMS and +MMS conditions. Color is based on the binding condition(s), as is in (A), or is black when not bound in either condition. the absence of MMS but paired with Hsf1 in the presence of MMS. Several cell-cycle TFs that normally co-regulate large numbers of genes (such as Fkh2 and Swi6 or Ace2 and Swi5)25 no longer appeared to do so after MMS treatment, perhaps because DNA damage causes delayed progression through the cell cycle. A prominent combination that emerged after MMS treatment consisted of Ino4 and six other factors (Dal81, Mcm1, Rim101, Ecm22, Rpn4, and Uga3) from the “expanded” set, which regulated several hundred genes in common (Figure 2.3B). An additional set of >200 genes were targeted uniquely by Ino4 both before and after damage exposure (Figure 2.3B), supporting a previous hypothesis that Ino4 is actually a global regulator of gene expression74. 18

Deletion-buffering analysis

Next, we sought to validate transcriptional effects of the measured binding interactions and to pinpoint the particular interaction pathways involved in transmission of the damage response signal. For this purpose, we used yeast genome microarrays to monitor MMS-induced gene expression changes across the viable knockout strains75 for the TFs found to be important in the DNA damage response (Methods). Of the 30 TFs identifi ed, 27 were non-essential (Figure 2.2A column “S”) and could be profi led. Hierarchical clustering over all genes confi rmed that these knockout profi les were globally more similar to the responses of wild-type cells to MMS [as measured in this study and previously58,59] than to the expression responses of wild-type cells to other stress conditions60,76 (Supplemental Figure 2.3). Furthermore, although ~20% of the transcriptional response to MMS may be due to slowed cell cycle progression through G1/S phase59,60, the majority of responsive genes were not periodically expressed during the cell cycle (Supplemental Figure 2.4). Processed expression data were analyzed to identify genes that were genetically “buffered” by one or more TF deletions. In this context, we defi ne “deletion buffering” to mean an effect in which genes that are normally differentially expressed become unresponsive in a specifi c knockout background. For each gene-TF combination, wild-type and knockout profi les were analyzed to score the signifi cance of the deletion-buffering effect using a Bayesian scoring scheme (Supplemental Figure 2.5 and Methods). At p ≤ 0.005, a total of 341 genes showed deletion buffering in the 27 knockouts, corresponding to 27 genes on average over a range of 90 genes for the Adr1 knockout to four genes for the Ecm22 knockout (Supplemental Figure 2.6). As a positive control, we examined the deletion-buffering results for Crt1 (Rfx1), a transcriptional repressor of the ribonucleotide-diphosphate reductase (RNR) complex which catalyzes synthesis of new nucleotides during DNA repair77,78. As 19 expected, the expression levels of RNR2, 3, and 4 were deletion-buffered in the crt1Δ strain but not in most other strains, and Crt1 bound the promoters of these genes before but not after MMS treatment (Figure 2.4A; Supplemental Figure 2.7). Many of the remaining deletion-buffering events represent previously undocumented regulatory relationships. For example, contradictory to a prior report79 we found that both members of the Swi4-Swi6 complex could bind the DUN1 promoter and were required for the DUN1 transcriptional response (Figure 2.4B,C). Beyond validation of individual interactions, the deletion-buffering analysis provided insights at the level of the damage response system as a whole. Paradoxically, the set of genes that are differentially expressed in response to DNA damage does not signifi cantly overlap with the set of genes required for growth under damaging conditions62,64. Our new expression data confi rmed these fi ndings for MMS, but also showed that the number of genes buffered by a TF knockout was highly correlated with its degree of MMS sensitivity (r=0.72; Figure 2.4D). For example, adr1Δ, the most sensitive TF knockout in our study, also buffered the largest number of genes. Thus, the transcription factors most essential for cellular recovery after MMS exposure are, apparently, also the most central to the MMS transcriptional response. The opposite of deletion-buffering is deletion-enhancement, that is, genes that are MMS-responsive in a TF deletion strain but not in wild type. Deletion enhancement was a much rarer event than deletion buffering. At the same p-value threshold, only 16 genes showed deletion enhancement whereas 341 showed deletion buffering. TFs associated with deletion-enhancement are apparently required to maintain stable expression of a set of genes, which become MMS responsive in their absence. 20

Building a network of regulatory pathways

ChIP Differential expression +/− MMS Gene A RNR3 RNR4 1 1

t RNR2 r YPR015C g C ARN1 0.1 n AUT7 i d

OLE1 n i -2

b 10 : f

RNR4 p o B i h

RNR1 e -3

c 10 u -

DUN1 l P a 4 I

i CDC21 v h -

w PIR1 -4

C p 10

S SCW11 MET6 : -3 YNL300W n 10 o i

s -2

C DUN1 s 10 e

DIN7 r p PRE8 e 0.1 x u l PRE1 e 6 i a YLR297W l v a w - i

MTR4 t p S -0.1 BAR1 n d e r FAR1 e -2

e -10 n f

COX7 f g i i

s -3 e ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ D s S S

t -10 p 1 1 2 3 1 2 4 1 2 1 1 4 1 5 5 1 2 1 4 4 6 5 3 1 4 1 1 i i i f t M M r r r m y k h g o n e h d n p p n 2 n a g o 0 8 r t z i i l x t d d w w w p i f k k o s c s c c a a a p g n 1 M M r c f d i a m s s s a p r x s s d a a c g y y u − + l m c d m i e i e r l w l D a 0 0

1 ADR1 s e n 0 e

8 DAL81 g RPN4 d e r 0 e 6 f f u b - 0 n 4 SWI5 o i t RTG3 e l SWI4 GCN4 e 0 FZF1

2 FKH2 D ASH1 YAP1 SWI6

# IXR1 RIM101 CIN5 0

0 5 10 15 20 Sensitivity score

Figure 2.4. Deletion buffering of the MMS response. Differential expression and chIP-chip binding p-values of representative genes showing deletion- buffering for (A) Crt1, (B) Swi4, or (C) Swi6. Expression changes are colored yellow for up-regulation or cyan for down-regulation. Genes highlighted in red are those previously known to function in the DNA damage response. (D) Correlation between phenotypic sensitivity of each deletion strain in MMS (x-axis) versus the number of genes buffered by the TF-deletion (y-axis). Sensitivity scores were drawn from61. Scores range from 0 to 30, where high scores indicate that a TF-deletion was found to be sensitive to low concentrations (less than 0.03%) of MMS over replicate growth experiments. Only TFs with detectable sensitivity (score > 0) are shown. 21

Only 11% of the observed deletion-buffering events (37 out of 341) coincided with a direct chIP-chip binding interaction (Supplemental Figure 2.8). Such low overlap might occur for two reasons. First, failure to detect deletion buffering does not necessarily invalidate TF-promoter binding. Second, the observed deletion- buffering effect might be indirect, that is, mediated by longer regulatory pathways connecting the deleted TF to its regulated genes through one or more intermediate factors. To identify these longer pathways, we applied a Bayesian modeling procedure35 to search the known physical network for the smallest set of paths (of two interactions) that were supported by the largest number of deletion-buffering events (Methods). Types of regulatory pathways identifi ed are shown in Figure 2.5A. The physical network consisted of TF-promoter binding interactions from: [1] the 5,272 interactions measured in the presence of MMS for the 30 TFs in our study; [2] the 4,996 interactions measured for these TFs under nominal conditions; and [3] 5,903 interactions for the 74 additional TFs assayed by Lee et al.66, also in nominal conditions. We also included [4] a set of 14,319 high-throughput protein-protein interaction measurements from the Database of Interacting Proteins16. These protein- protein interactions were measured in cells grown in the absence of MMS, and hence might or might not be present in MMS-treated cells. In the combined network, a link from protein a to b represented the observation that a directly targets the promoter of the gene encoding b (sources [1] through [3]) or that a and b physically interact (source [4]). In total, we identifi ed 68 buffering events that validated 88 longer paths. These paths were combined with the 37 direct effects identifi ed earlier to formulate a model of the transcriptional response to MMS (Figure 2.5B). This model explains the MMS expression response of 82 genes and provides the basic scaffold on which inter-process 22

A YAP5 DAL81 FKH2 C RTG1 FKH1 MBP1 SWI6

INO4 SOK2 SFL1 PDR1 ACE2 RAD51 HNM1 PUT3

MAG1 RFA1 HOR7 ZWF1 RPA34 YIL169C GLC3 RNR1 YLR108C HXK1 DIN7 FAR1 YML131W Direct Indirect Reinforcing Feed-forward loop MRD1

B Metabolism Proteasome

RPN4 FKH2 MCM1 ALD3 INO4 RTG3 HSP78 ASH1 GRE3 RPA190 YHB1 NAB2 BUD9 DAL81 YER079W YIL169C PCL5 CPA2 ARX1 LYS20 SFP1 YGR146C SWI5 NDD1 ACE2 PUT3 GCN4 PDR1 HAP4 REB1 AZF1 ZMS1 ADR1

IXR1 HOR7 TSA1 TPS1 UGA3 ECM40 YOL155C BNA1 OM45 YFR017C VPS73 YHR020W YJL161W ARG3 YDL110C CDC20 YIL158W AAD4 DIP5 ZWF1 HST3 HOM3 YJL051W GPG1 RFA2 YHR138C

Cell cycle Stress response

EGT2 DNA repair SCW11

ENP1 GTT1 YLR297W DIN7 DUN1 HSP30 MAK5 MSS11 ARG80 COX7 CYT1 MBP1 RPA34 ECM22 RNR1

SKO1 SWI6 GTS1 HO

YAP1 SOK2 YAP5 RIM101 PHD1 SWI4 YOR114W YGR079W YBR053C

CWP1 YAP6 SER3 GAT1 SFL1 NRG1 YBL054W ABF1 CIN5 MET6 PTR2 TEC1 BET4 YGR035C FLR1 TRM2 ISU2 SUN4 ATO3 HSP12 AAD6 FUR1 GSH1 OYE2 GTT2 RFA1 ATR1 MAG1 Stress response YLR460C SUM1 Cell cycle RNR4 CRT1 Genes Interactions YPR015C Deleted TF Protein-DNA, �MMS only RNR2 Intermediate TF Protein-DNA, +MMS only Buffered gene Protein-DNA, both �/�MMS WT response Protein-protein DNA repair Down-regulated Up-regulated Buffering effect

Figure 2.5. A network of regulatory interactions connecting transcription factors to deletion- buffered genes. (A) Example regulatory paths in which deletion-buffering effects (squiggled arrows) support binding interactions (straight arrows). (B) The full validated model based on overlap between binding and buffering (see text). Buffering effects are omitted for clarity. Regions of the network are organized based on known functions of the deleted TFs. (C) Models resulting from follow-up experiments to generate buffering proﬁ les for Mbp1 and Rtg1. 23 communication is achieved in the transcriptional response to an alkylating agent. At the core of the model are the known damage response genes RNR1, RNR2, RNR4, RFA1, RFA2, DIN7, DUN1, and MAG1 58. Although some of the TFs regulating these genes are expected based on previous studies (e.g., Swi4, Swi6, or Crt1), many others would not have been predicted, including those that have been previously associated with lipid metabolism (Ino4), stress response (e.g., Yap5), or cAMP-dependent signal transduction (Sok2). Overall, the model highlights extensive regulatory cross-talk among the processes of DNA replication and repair, cell cycle and cell-cycle arrest, stress responses, and metabolic pathways.

Experimentally testing the model

Every regulatory path of length two implicates an intermediate factor which is expected to regulate a similar set of genes as the deleted TF. A majority of these paths (49 out of 88), such as Rtg3→Ino4→RFA2 and Swi4→Sok2→MAG1, were already consistent with available data for both the source (e.g., Rtg3 or Swi4) and the intermediate (e.g., Ino4 or Sok2) factor, as both TFs were already included among the 27 assayed (“Reinforcing” in Figure 2.5A). Other paths included intermediate factors for which the transcription profi les were implied but untested (“Indirect” in Figure 2.5A), suggesting follow up experiments to refi ne the model. Swi6 is thought to bind DNA in a complex with either Swi4 or Mbp180. To discriminate which Swi6 targets were Mbp1 dependent, we analyzed an mbp1 knockout strain. The set of genes deletion-buffered by mbp1Δ was found to overlap with the swi6Δ buffered set, including the genes RNR1 and DIN7 (Figure 2.5C). Thus RNR1 and DIN7 appear to depend on both Mbp1 and Swi6 for proper regulation. Noting that loss of Rtg3 caused deletion-buffering of many downstream genes through the path Rtg3→Ino4, we investigated whether loss of Rtg1, a binding partner 24 of Rtg3 81, would produce a similar outcome. Indeed, loss of Rtg1 buffered many genes that were bound by Ino4 (Figure 2.5C). However, the sets of Rtg1 versus Rtg3 deletion-buffered genes did not strongly overlap. Thus, in response to MMS exposure, the three factors Rtg1, Rtg3, and Ino4 apparently collaborate to regulate a battery of genes (whose products infl uence phospholipid metabolism and retrograde transport), but their functional roles are not interchangeable. We have integrated transcription factor binding profi les with genetic perturbations, mRNA expression, and protein interaction data to reveal direct and indirect interactions between transcription factors and MMS-responsive genes. The result is a highly interconnected physical map of regulatory pathways supported by both binding and deletion-buffering profi les. Some relations in this map are confi rmed by previous studies but most represent the basis for new hypotheses. As systems-level approaches continue to map the connectivity of large cellular systems, an important goal will be to make these maps even more integrative, and to learn how to use them to predict the effects of different drugs, dosages, and genetic dispositions on pathway function. 25

Methods

Strains and Media

Strains used in phenotyping and expression proﬁ ling were derived from the haploid BY4741. The parent strain was obtained from ATCC (Manassas, Virginia, USA), and all nonessential deletion strains constructed by the Saccharomyces Gene Deletion Project were obtained from Research Genetics (Huntsville, AL). Epitope- tagged strains (c-myc) used in construction of the physical regulatory network were derived from W303 and obtained from the laboratory of Dr. Richard A. Young at the Whitehead Institute (Cambridge, MA). Cells were cultured in standard yeast rich media (YPD) at 30°C except when noted.

Phenotypic Sensitivity Assays

Parent and all viable deletion strains harboring single knockouts of transcription factors (TFs) proﬁ led in Lee et al. 24were arrayed into 96-well plates containing YPD broth and grown to saturation. Settled cells in each well were resuspended to ensure homogeneity, and 5µl from each well was spotted simultaneously onto YPD agar plates with the use of a Hydra liquid handling apparatus (Robbins Scientiﬁ c). The process was repeated on additional YPD agar plates containing a range (0.01% to 0.1%) of methyl methanesulfonate concentrations (Sigma Chemical

Company) to determine the optimum concentration to adequately detect sensitive knockout strains. MMS was added to cooled agar and used fresh within one day to ensure the stability of the agent. Spotted plates were grown for 60 h at 30°C and imaged using a Gel Doc 1000 (BioRad) running Quantity One software, and all screens were performed in triplicate using fresh cultures. 26

Whole Genome Expression Analysis

Gene expression experiments were processed in parallel with at least two distinct biological samples from each strain (colonies of similar size picked from fresh YPD + G418 agar plates) grown to saturation in YPD overnight at 30°C. The overnight culture was diluted 1:100 in a fl ask of 100ml fresh YPD and grown in a shaking incubator at 30°C 180 rpm until the culture reached an OD600 of 0.8 – 1.0. Every effort was made to closely match the fi nal optical density of each of the samples. Once a culture reached the appropriate cell density, the sample was split in half, MMS added to a fi nal concentration of 0.03% in the non-reference samples, and both cultures grown for an additional hour. Cells were harvested by centrifugation at 3000 rpm for fi ve minutes at room temperature in a Legend RT centrifuge (Kendro Laboratory Products, Asheville, NC, USA). Cell pellets were immediately snap- frozen in liquid nitrogen to suspend gene expression (including any temperature stress response genes) and stored at −20°C prior to RNA extraction. Total RNA from each sample was isolated by hot acid phenol extraction and mRNA-purifi ed via Poly(A)Pure kits (Ambion). Labeling of cDNA was performed in a dye-reversal scheme by direct incorporation using a CyScribe First-Strand cDNA Labeling Kit (Amersham Biosciences). Corresponding Cy-3 and Cy-5 labeled samples were co-hybridized to microarrays containing the Yeast Genome Oligo Set Version 1.1 (Qiagen).

Genome Wide Transcription Factor Binding Analysis

Samples were processed in parallel using three distinct biological replicates from each epitope-tagged strain and grown to saturation in YPD overnight at 30°C. The overnight culture was diluted 1:100 in a ﬂ ask of 50 ml fresh YPD and grown in a shaking incubator at 30°C at 180 rpm until the culture reached an OD600 of 0.8 – 1.0. 27

MMS was added to a ﬁ nal concentration of 0.03%, and the culture was grown for an additional hour. Protein-DNA binding locations were assayed as previously described by Lee et al. 24with corresponding IP-enriched and unenriched samples co-hybridized to a single cDNA microarray containing all yeast intergenic sequences derived from PCR ampliﬁ cation.

Array Preparation and Hybridization

Microarrays were spotted on UltraGAPS II slides (Corning) using an OmniGrid100 microarrayer (GeneMachines). After spotting, arrays were baked at 80°C for 2 hours, cross-linked at 300 mJ in a UV Stratalinker 2400 (Stratagene), and stored under vacuum. Hybridizations were conducted at 42°C for 15 hours using the Lucidea SlidePro automated hybridization machine (Amersham Biosciences), and arrays scanned using GenePix 4000A (Axon Instruments) or ScanArray Express scanners (Perkin-Elmer) at a 10.0 μm resolution.

Data Post-processing and Signiﬁ cance Assessment

Scanned images were processed using GenePix Pro 3.0 (Axon Instruments) or QuantArray (PerkinElmer) software to obtain raw Cy-3 and Cy-5 foreground

(f3, f5) and background (b3, b5) intensity measurements for each spot on the array. Background intensities were smoothed using a 7x7 median spatial ﬁ lter, separately

´ ´ for the Cy-3 and Cy-5 channels to obtain (b3 , b5 ). Background-adjusted log ratios

´ ´ L=log10[(f5−b5 )/(f3−b3 )] were corrected for cyanine-dye dependent bias using Qspline normalization82. Corrected log ratios L´ were spatially normalized by subtracting the median L value within a 9x9 window.window. The four replicate arrays for each gene expression experiment were processed using the VERA package83 to estimate multiplicative and additive errors and to associate a p-value of differential 28 expression with each gene. This approach was also applied for error modeling and signifi cance assessment of the chIP-chip data. However, unlike for gene expression analysis, in which both increases and decreases in fl uorescent intensity are of interest, DNA binding is indicated only for increases in intensity, representing increased promoter binding in the IP-enriched sample versus the IP-unenriched sample. Thus, signifi cance of DNA binding must be assessed using a one-sided test. The VERA likelihood ratio test was therefore modifi ed to use a one-sided statistic [force μx ≥ μy in the denominator of Eqn. 5 of reference83].

Overlap in Binding Between Conditions

For each transcription factor, p-values of binding from +MMS and –MMS experiments are integrated to select sets of genes bound in –MMS only, +MMS only, or in both conditions, as displayed in Figure 2.1 of the main text. Typically, genes are selected in a single condition by including those with p-values below a certain threshold, such as t = 0.01. However, given that two conditions c1 and c2 are under consideration, the dependency in the corresponding pair of p-values (p1, p2) measured for each gene can be exploited to achieve a more sensitive selection. Treating the pair of values as replicate measurements of binding, a compounded p-value of binding (in any/either condition) can be computed by multiplying p-values, i.e., genes for which

´ p1 · p2 < t are chosen as bound. Based on this observation, we combine the one- and two-condition thresholding methods, associated with t and t´ above, to partition the two-dimensional space of p-values into four regions as shown in Figure S2. Region (A) is deﬁ ned by

´ (p1 ≤ t, p2 > t /t) and represents binding in c1 only. Region (C) is symmetrically deﬁ ned

´ by (p1 > t /t, p2 ≤ t) and represents binding in c2 only. Region (B) is the area deﬁ ned by

´ ´ ´ (p1 · p2 ≤ t , p1 ≤ t /t, p2 ≤ t /t) and represents binding in c1 and c2. The remaining values 29

of (p1, p2) not covered by regions (A-C) represent lack of binding. Furthermore, for a given simple threshold t, the compound threshold t´ is set so that regions (A-C) are equal in area, which ensures that under the null hypothesis (i.e., no genes are bound in either condition) the expected numbers of genes which fall into each region are the same. This two-dimensional thresholding scheme is similar to one proposed by Zaykin et al.84 for epidemiological studies.

Signiﬁ cance of expansion or contraction

Given overlap counts (A, B, C) for each TF, we computed the signifi cance of a possible change in genome-wide binding pattern between untreated and MMS-treated conditions. To construct a null model in which the TF binding profi les did not change across conditions, additional chIP-chip experiments were performed in untreated conditions for each of three transcription factors Gcn4, Crt1, and Rpn4. As with all other experiments, three replicate experiments per TF were normalized and processed using the VERA/SAM package. These data were compared to the original chIP experiments for Gcn4, Crt1, and Rpn4 generated by the Young laboratory24,70, also in untreated conditions. Each of these untreated (original Young lab) vs. untreated (new Ideker lab) comparisons was analyzed to compute the overlap in binding, exactly as described above for the untreated (original Young lab) vs. treated (new Ideker lab) binding profi les. Promoter counts falling into regions A, B, or C were aggregated across Gcn4, Crt1, and Rpn4 yielding A=347, B=459, C=239. These totals were used to derive a pooled estimate of B/(B+(A+C)/2) = 61% for the proportion of promoters bound in both conditions in the negative control. This expected value was compared to the B/(B+A) and B/(B+C) proportions for the -/+ MMS comparison across each of the 30 TFs in our study using a one-sided 30

Fisher’s Exact Test (FET). The p-value of contraction or expansion was defi ned as the FET signifi cance that the fi rst or second fraction, respectively, was lower than expected under the null model.

Motif Enrichment in Transcription Factor Binding Data

Transcription factor binding site motifs were defi ned as position-specifi c weight matrices (PWMs) of log-likelihood ratios85. PWMs were compiled for 114 different individual TFs from Harbison et al.70 and public data bases71,72. When more than one matrix was defi ned for the same TF, the PWM with the highest information content per position (relative entropy) was selected. Using the PWM scoring functionality of ANN-Spec73, the score distribution for each motif was determined over all possible subsequences of the intergenic regions (as defi ned by the probes on the chIP-chip array) such that a score threshold could be selected to ensure a rate of < 10-4 predicted sites per base pair. ANN-Spec was used to discover potentially new motifs in promoter sets defi ned by the method described in “Overlap in Binding Between Conditions” (t=10-3). This defi ned three non-overlapping sets for each TF; +MMS-only, –MMS- only, and BOTH. For each sequence set, 100 training runs were performed for pattern widths 8 to 14 using ANN-Spec’s discriminative mode against all intergenic regions. Training allowed for zero or more sites per alignment (so called ‘ZOPS’ occurrence model). Redundant or approximately redundant alignments were merged (correlation coeffi cient >0.8, over 6 or more contiguous alignment positions in either orientation) and the remaining alignments were fi ltered against the 114 known TF-motifs in a similar way. The remaining 222 alignments were converted to PWMs and represented our novel motif models (108 in –MMS-only, 71 in +MMS-only and 43 in BOTH). For each TF and known and novel motif, PWM enrichments were calculated 31

(hypergeometric p-value) in either the –MMS-only or +MMS-only non-overlapping binding sets to identify differentially enriched motifs. Differential enrichment was deﬁ ned when binding p-values <10-7 (<10-3 after Bonferroni correction) were observed in one condition, and p-values >10-2 were observed in the other condition.

Deletion Buffering Analysis

Expression profi les (+/−MMS treatment for wild type and 27 TF knockout strains) were analyzed to identify genes that were differentially expressed in the vast majority of profi les but did not change in expression in a particular TF knockout background. In these cases, the gene was said to be deletion-buffered by the TF in question and the TF was “epistatic” to this gene. For each (gene, TF) combination, the probability of buffering was computed using a Bayesian score function, as follows. Let Tko vs. Tother represent the event of true differential expression of the gene in the TF knockout vs. all other strains. Similarly, let pko and pother represent the corresponding observed values for the gene, which in this case are p-values of differential expression. The value for pother is computed as the combined p-value of differential expression over all profi les excluding the knockout under consideration, according to Fisher’s rule of multiplication 86:

m p = ∏ i ∀i≠ko n−1 (− ln m)i p = m other ∑ i! i=0 Since most genes are likely regulated (directly or indirectly) by only a fraction of the TFs encoded by the genome, they are expected to behave similarly between the wild type and most of the 27 knockout backgrounds. The combined p-value thus provides a measure of differential expression in + vs. – MMS conditions that is more robust than 32

the single p-value obtained for a wild type experiment. Given measurements of pko and pother, the probability of deletion-buffering is equivalent to the probability that the gene is truly differentially expressed in other experiments but not in the knockout in question. Using Bayes’ rule:

Pr(pko , pother | Tko = 0,Tother = 1)Pr(Tko = 0,Tother = 1) Pr(Tko = 0,Tother = 1| pko , pother )= Pr(pko , pother )

Pr(p | p ,T = 0,T = 1)Pr(p | T = 0,T = 1) ≅ ko other ko other other ko other Pr(pko , pother )

[Because Pr(Tko=0, Tother=1) is constant]

Pr(p | T = 0)Pr(p | T = 1) ≅ ko ko other other Pr(pko , pother ) [Given T, p is conditionally independent of other variables]

By identity, Pr(pko | Tko = 0) = pko. The remaining two terms are estimated empirically from the available frequency distributions. Pr(pko, pother) is estimated by a two-dimensional histogram of the observed values over all genes and experimental conditions in our study. Pr(pother | Tother = 1) is estimated by a histogram of pother restricted to “true” differentially-expressed genes, deﬁ ned as the set of 255 genes for which the median absolute expression change was greater than two-fold over at least two of seven time points (5, 15, 30, 45, 60, 90, 120 min.) monitored by Gasch et al. after MMS addition58.

Assembly of Transcriptional Pathways From Integrated Data

Physical mechanisms of transcriptional regulation were modeled using an approach described previously35. Briefl y, we postulated that the regulatory effects of deleting a gene are propagated along paths of physical interactions (protein-protein 33 and protein-DNA). We formalized the properties of these paths and interactions using a factor graph and found the most probable set of paths using the max-product algorithm 87. The raw data used in the modeling procedure included: 7,055 promoter- binding interactions found in –MMS only; 3,851 promoter binding interactions found in both + and – MMS; and 1,431 promoter-binding interactions found in +MMS only; the set of all 15,166 pair-wise protein-protein interactions recorded in the Database of Interacting Proteins as of April 2004 16; and the 341 buffering interactions found using the method described in “Genome Wide Epistasis Analysis” (buffering p-value ≤ 0.005). For the 30 transcription factors in our study (see Figure 2.2a), we used the method described in “Overlap in Binding Between Conditions” to identify genes bound in either +MMS only, -MMS only, or both conditions. For 75 additional factors from Lee et. al. for which we had data for the –MMS condition only, we used a strict p-value threshold of p ≤ 0.001 to select signifi cant promoter-binding interactions and classifi ed these as –MMS only interactions. In Figure 2.5b of the main text all paths that directly or indirectly (via one intermediate TF) connect TFs to deletion-buffered genes are shown. In supplemental web fi gures “Integrated models by TF” these paths are partitioned to show the paths connecting each individual TF to genes that it buffers. To generate integrated models for the Mbp1 and Rtg1 validation experiments (Figure 2.5c), the same promoter- binding and protein-protein interaction network was used along with either 14 buffering interactions from the Mbp1 experiment or 20 from the Rtg1 experiment (buffering p-value ≤ 0.004).

Assessing overlaps in binding between TF pairs

Each of the sets of genes bound by a TF (in –MMS or +MMS conditions using a strict threshold of p ≤ 0.001) was systematically compared to the sets of genes bound 34 by each of the other TFs using the hypergeometric test. Signiﬁ cant overlaps between sets (p ≤ 0.01 after Bonferroni correction) are displayed in Figure 2.3 of the main text. TFs are linked by a green line if a signiﬁ cant number of genes were bound by both TFs in –MMS and (a possibly different set of genes were bound) in +MMS conditions. If two TFs only bound the same genes in either –MMS or +MMS, they are linked with a blue or orange line, respectively. Hierarchical clustering was performed using the ClustArray program (http://www.cbs.dtu.dk/services/DNAarray/). 35

Tables

Table 2.1. DNA sequence motifs found in promoters bound in only one condition

1A. Motifs found in -MMS, but not +MMS1 # promoters # promoters Promoter w/motif w/motif Consensus sequence2 set3 (-MMS)4 (+MMS)4 Motif source5 TGACTC Gcn4 20/31 0/5 GCN4†, BAS1∗ ATTAGTAAGC Cad1 18/26 0/69 CAD1∗ cGGGGG Hsf1 35/143 1/41 ADR1∗ CGGGGCACnCTcStCCG Hsf1 43/143 3/41 GAL4†, PUT3∗ TTCtannnnnTTC Hsf1 44/143 7/41 HSF1∗ CCGGtACCGG Hsf1 37/143 0/41 LEU3∗ CGGCGCCCGCGAn Hsf1 43/143 6/41 RFA2 GCCSnGSCC Hsf1 40/143 1/41 SKN7∗ GCGGCnnnGCGGC Hsf1 39/143 2/41 STP1∗ ACCCGTACAt Yap5 44/104 0/3 SFP1∗

1B. Motifs found in +MMS, but not –MMS6 # promoters # promoters Promoter w/motif w/motif Consensus sequence2 set3 (-MMS)4 (+MMS)4 Motif source5 TMSCTGCAAAnTT Gcn4 0/31 5/5 Predicted (ANNspec) GCTCGAAAA Ndd1 3/30 12/15 Predicted (ANNspec) TGAYTAACn Sko1 7/153 11/14 Predicted (ANNspec) TnTCnCTCAT Swi5 6/61 9/10 Predicted (ANNspec) GCGGCnnnGCGGC Pdr1 0/2 21/48 STP1∗ GCSGGGnCGG Pdr1 0/2 19/48 SUT1∗

1 Enriched in -MMS only promoters (hypergeometric p≤10-7) but not in +MMS only (p>10-2) 2 Letter based on IUPAC code, lowercase used when information content < 0.3 3 The set of promoters bound by each TF was analyzed for the presence of known and novel motifs 4 Motif presence was scored using ANN-spec at a threshold that predicted sites at a rate of <10-4 overall intergenic regions. 5 Novel motifs were identiﬁ ed using ANN-spec. Known motifs were drawn from the † SCPD databases or from * Harbison et al. 6 Enriched in +MMS only promoters (hypergeometric p≤10-7) but not in –MMS only (p>10-2) 36

Supplemental Figures

A YPD 30oC, 3 days B YPD + 0.025% MMS 30oC, 3 days

A A B B C C D D E E F F G G H H I I J J K K 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 C wild type sensitive A 1 wild type C 1 IXR1 E 1 SKO1 G 1 wild type I 1 HAA1 K 1 SFP1* A 2 YAP7 C 2 RGT1 E 2 blank G 2 GAT3 I 2 MBP1 K 2 HAP5 A 3 RTG1 C 3 PHD1 E 3 MTH1 G 3 CHA4 I 3 STP1 K 3 HAP4 A 4 blank C 4 MSN4 E 4 SUM1 G 4 YAP1 I 4 GTS1 K 4 YAP5 A 5 USV1 C 5 wild type E 5 GAT1 G 5 SOK2 I 5 wild type K 5 blank A 6 CUP9 C 6 ASH1 E 6 YFL044C G 6 MAC1* I 6 SWI4 K 6 blank A 7 SMP1 C 7 RME1 E 7 MAL13 G 7 ARG80 I 7 HIR2 K 7 blank A 8 THI2 C 8 AZF1 E 8 wild type G 8 MSS11 I 8 ARO80 K 8 wild type A 9 YBR267W C 9 SFL1 E 9 MOT3 G 9 RGM1 I 9 PUT3 K 9 blank A 10 INO2 C 10 YJL206C E 10 HAL9 G 10 DAL82 I 10 ADR1 K 10 blank B 1 CAD1 D 1 HAP2 F 1 DOT6 H 1 STB1 J 1 GCR2 B 2 GCN4 D 2 FZF1 F 2 ARG81 H 2 CIN5 J 2 MSN1 B 3 GLN3 D 3 RLM1 F 3 ROX1 H 3 HMS1 J 3 SWI5 B 4 blank D 4 DIG1 F 4 SIP4 H 4 UGA3 J 4 SWI6 B 5 RIM101 D 5 MET31 F 5 STP2 H 5 CRZ1 J 5 ECM22 B 6 YAP3 D 6 PHO4 F 6 ZMS1 H 6 RTG3 J 6 RCS1* B 7 SKN7 D 7 HIR1 F 7 BAS1 H 7 FKH2 J 7 INO4 B 8 LEU3 D 8 HAP3 F 8 NRG1 H 8 YJL206C J 8 PHO2 B 9 ACE2 D 9 PDR1 F 9 MAL33 H 9 DAL81 J 9 RPH1 B 10 RFX1 D 10 MIG1 F 10 CBF1 H 10 MSN2 J 10 FKH1

Supplemental Figure 2.1. MMS sensitivity assay of transcription factor deletions strains. Light images of yeast colonies on agar plates are shown. Each position in the grid contains the colonies for TF-deletion or wild-type strains grown for 3 days in YPD (A) and YPD +0.025% MMS (B). The key for the location of yeast strains is shown in (C). Sensitive mutants for this assay are indicated in red. Factors marked with ‘*’ were not found to be sensitive in subsequent trials. 37 0 . 1 (A) 0.2 8 . 0 t' = 0.002 e 6 u . l 0

a t = 0.01

v 0

- 0 0.2 p 4 . 0 S M M 2 . − 0 (B) (C) 0 . 0

0.0 0.2 0.4 0.6 0.8 1.0 +MMS p-value

Supplemental Figure 2.2. Scoring overlap in TF binding between conditions. For a given TF, each gene is assigned a pair of p-values for binding in +MMS vs. –MMS. Genes with p-value pairs that fall into the region (A), (B), or (C) are considered bound in +MMS only, both conditions, or –MMS only, respectively. The inset shows an enlargement of the plot for p-values < 0.25. The regions shown correspond to p-value thresholds of t=0.01 and t’=0.002. The thresholds used in the actual study were t=0.001 and t’=1.5×10-4. 38

DTT, early (15-30 min)

Diauxic shift, early (timepts 1-4) YPD, early (2-4 hrs) 1M Sorbitol, late (60-120 min)

+/-MMS knockouts: cad1, dal81

Amino acid starvation (0-6 hrs)

N2 depletion, early (0-4 hrs)

H2O2 0.32 mM, late (80-160 min)

H2O2 0.32 mM, early (10-60 min) 1 mM Menadione (10-160 min)

+/-MMS wildtype +/-MMS knockouts: ace2, ash1, cin5, dig1, ecm22, fkh2, fzf1, gcn4, ino4, ixr1, msn4, pdr1, rfx1, rim101, rpn4, rtg3, sko1, sok2, swi4, swi5, swi6, uga3, yap1, yap5 Gasch MMS timecourse: 15, 30, 45, 60, 90, 120 min. Jelinsky MMS timecourse: 30, 60 min.

DTT, late (60-480 min) 1 M Sorbitol, early (5-45 min) Heat shock (5-60 min) 1.5 mM Diamide (5-90 min) N2 depletion, late (8 hrs-5 days) Diauxic shift, late (timepts 5-7) YPD, late (6 hrs-5 days)

YP ethanol YP galactose YP raffinose YP sucrose YP fructose Steady state 15, 17, 21, 25, 29, 36 deg C

Supplemental Figure 2.3. Clustering of MMS expression responses with other stresses. Expression profi les generated in this study (wild type and TF knockouts +/− 0.03% MMS) were clustered together with previous expression data for MMS [Gasch et al. 58and Jelinsky et al.59] and other environmental stresses76. Hierarchical clustering was performed using ClustArray (http://www.cbs.dtu. dk/services/DNAarray/) to construct a dendrogram on the conditions. The dendrogram shows that the +/− MMS wild type profi le from this study is in relative agreement with previous ones. Moreover, all but two (cad1Δ and dal81Δ) of the +/− MMS knockout profi les are more similar to the MMS wild type response than to other stress responses. 39

log ratio 1 (111 genes +/-MMS) S

G median M / A B C 1 2

0.5 0.0 0.5 M G S G M M log-ratios

G1 G1 (32) differentially cell cycle genes 0.5 expressed genes (800) S S (9) +/-MMS (514) S/G2 S/G2 (10) 0 403 111 689 G2/M G2/M (35)

M/G1 (25) M/G1 -0.5 (29)(19) (12) (9) (16) (17)

Supplemental Figure 2.4. MMS responsive genes implicate G1/S checkpoint. (A) Approximately 22% of the MMS-responsive genes (111 of 514) have been shown to be temporally regulated during the cell cycle205. (B) Distributions of the median log-ratios (over the 27 TF knockouts and wild-type) for genes found to be up-regulated in G1, S, S/G2, G2/M, M/G1. Boxplot colors correspond to the color scale in C. (C) Clustering these MMS responsive genes with representative profi les from cell cycle phases G1, S, G2, M and M/G1 205 shows that the MMS profi le is most similar to that of late G1 and S phases. Further evidence of G1 and S phase arrest can be found in the expression responses of ACE2 and SWI5, cell cycle regulators expressed in M/G1 phase205. Both factors are down regulated in response to MMS (Figure 2.2a in the main text), and have contracting binding profi les (Figure 2.2b). Moreover, the Ace2 and Swi5 TFs only show signifi cant binding overlap in –MMS (Figure 2.3). Together, these fi ndings confi rm that exposure to MMS causes cells to progress more slowly through G1/S and S phase60. 40

−10 10 Deletion ARN1 buffering P -value P-value YFR017C YBR241C YGL188C OLE1 −20 1 10

YIL065C 0.1

FLR1 GLR1 (+/- MMS, all experiments) (+/- MMS, −30 Differential expression 10 0.01

0.001

−40 10 RNR2 RNR3 RNR4 -4 10 0.0 0.2 0.4 0.6 0.8 1.0 Differential expression P-value (+/- MMS, crt1�)

Supplemental Figure 2.5. Crt1 deletion-buffering P-values. Deletion buffering P-values (colors ranging from green to white) are computed for each gene (black points) based on their P-values of differential expression for crt1Δ (x-axis) versus all experiments (y- axis). 41

90 ADR1 75 DAL81 67 PDR1 67 ACE2 44 RIM101 36 SKO1 30 MCM1 29 RPN4 28 SWI6 22 RFX1 21 FZF1 20 RTG3 20 GCN4 20 FKH2 19 SOK2 18 SWI5 18 ASH1 17 MSN4 17 DIG1 16 CAD1 15 INO4 11 NDD1 11 HSF1 Average = 26.5 9 SWI4 8 IXR1 7 CIN5 4 ECM22 4 WT

0 20 40 60 80 100

Number of genes buffered

Supplemental Figure 2.6. Number of genes buffered by each transcription factor deletion (buffering p-value <0.005). Because more than one transcription factor may buffer each gene, the number of buffered genes (341) is less than the total number of buffering interactions observed (757). As indicated by the dashed grey line, the average number of genes buffered by each TF is 26.5. 42

chIP-chip binding A WT vs crt1� expression for RNR1-4 B for Crt1p �/�MMS

1 RNR1 8000 RNR2

0.1 RNR3 RNR4 6000 RNR3 0.01 RNR1

4000 0.001 Binding p-value 1e�4 Normalized spot intensity 2000 RNR4

1e�5 RNR2

0 1e�6

WT WT crt1� crt1� �MMS �MMS �MMS �MMS �MMS �MMS

Supplemental Figure 2.7. Correspondence between expression and regulation of the RNR genes by Crt1p. (A) Distributions of normalized spot intensities (cy3 and cy5 channels) for the four RNR genes in crt1Δ + vs. –MMS over four replicate microarrays (16 spot intensities for each distribution). The wild-type expression levels show an induction of these genes after exposure to MMS while the crt1Δ shows constitutive expression both + and –MMS. (B) The binding data show strong evidence for Crt1p binding upstream of RNR2,4 (and marginal evidence for RNR1,3) in nominal growth condition while binding evidence suggests a lack of Crt1p binding for all RNR genes after exposure to MMS. These data support the induction (by derepression) observed in the ﬁ rst two distributions in (A). 43

SWI6 SWI4 CRT1 SKO1 RPN4

MET6 RNR1 NAB2 DIN7 DUN1 RNR2 RNR4 YPR015C CWP1 HO YLR297W YIL169C

ACE2 FKH2 PDR1 UGA3 INO4 YAP1

OYE2 YHB1 YER079W BUD9 ZWF1 YFR017C HOR7 RFA2 ISU2 AAD6 GSH1 GTT2 ATR1 GCN4 DAL81 SOK2 YLR460C

CPA2 BNA1 ALD3 ARX1 RPA190 YGR146C MAG1 PCL5 ARG3 HSP78 LYS20

GENES INTERACTIONS deleted TF Protein-DNA, -MMS only intermediate TF Protein-DNA, +MMS only buffered gene Protein-DNA, both +/-MMS Protein-Protein RESPONSE TO MMS IN WT down-regulated up-regulated

Supplemental Figure 2.8. Direct binding interactions supported by deletion-buffering effects. Circles or squares represent deleted TFs or target genes, respectively. Protein-DNA interactions between TFs and the gene promoters they bind are represented as arrows colored blue (bound in −MMS only), orange (+MMS only), or green (both +/−MMS). Genes are colored orange or blue, representing induction or repression in response to MMS. Each binding interaction shown is supported by a deletion-buffering effect. For example, deletion of the transcription factor SWI6 results in lack of expression of DIN7 and DUN1 in response to MMS. In total, 42 binding interactions covering 37 distinct target genes are supported. 44

Acknowledgements

We acknowledge funding from the National Institute of Environmental Health Sciences (NIEHS), the National Cancer Institute (NCI), and the David and Lucille Packard Foundation. LDS is the Ellison American Cancer Society Research Professor. We thank R. Young, D. Odom, T.I. Lee, and B. Ren for assistance with chIP-chip analysis, and J. Kadonaga, W. McGinnis, and G. Kassavetis for consultations on the binding motif predictions. Chapter 2, in full, is a reprint of material as it appeared on May 19 2006 in Science, 2006, vol. 312, p. 1054-1059. Workman, Christopher T.; Mak, H. Craig; McCuine, Scott; Tagne, Jean-Bosco; Agarwal, Maya; Ozier, Owen; Begley, Thomas J.; Samson, Leona D.; Ideker, Trey. The dissertation author was one of three primary investigators (with C.T. Workman and S. McCuine) and two primary authors (with C.T. Workman) of this paper. Chapter 3: Analyses of position effects within the yeast transcriptional network

Abstract

Transcription factors are most commonly thought of as proteins that regulate expression of specifi c genes, independently of the order of those genes along the chromosome. By screening genome-wide chromatin immunoprecipitation (chIP) profi les in yeast, we fi nd that more than 10% of DNA-binding transcription factors concentrate at the subtelomeric regions near to chromosome ends. None of the proteins identifi ed were previously implicated in regulation at telomeres, yet genomic and proteomic studies reveal that almost all have interactions with established telomere binding complexes. For many factors, the subtelomeric binding pattern is dynamic and undergoes fl ux toward or away from the telomere as physiological conditions shift. We fi nd that subtelomeric binding is dependent on environmental conditions and correlates with the induction of gene expression in response to stress. Taken together, these results suggest that genome position effects involve not only co-expression of neighboring genes along a genome but have implications for the architecture and evolution of entire transcriptional regulatory networks.

45 46

Introduction

It is becoming increasingly evident that gene order is not random88,89. Expression profi ling of diverse eukaryotic species has revealed that co-expressed genes are often clustered along the chromosome90-93, and that such clusters of genes function to varying degrees in the same metabolic pathways94. In yeast, for example, adjacent genes are co-regulated during the cell-cycle95 or in response to changing growth conditions96. In multicellular eukaryotes such as fl ies and humans, extended tracts of co-expression have been observed encompassing up to thirty genes97,98. Broadly speaking, the effects of gene order on expression are known as “position effects.” Yet despite intensive study, the mechanisms controlling position effects are incompletely understood88. The chromosome ends of the yeast Saccharomyces, though, have been used as a model to link position effects to epigenetic factors such as the state of surrounding chromatin and the spatial compartmentalization of the nucleus99. These epigenetic factors are, in part, assembled by DNA-binding transcription factors and chromatin modifying proteins100,101. At chromosome ends, such transcription factors (including Rap1p) and chromatin modifi ers (including Hda1p and the Sir silencing complex) appear to bind in a position-specifi c manner across multiple genes in a broad genomic region102-104, but do not completely explain the observed expression patterns of genes located near these ends103,105,106. Transcription factors have traditionally been assumed to regulate specifi c gene promoters regardless of their genomic locations24,70,107. The vast majority of transcription factors therefore have not been investigated for position-specifi c binding, which, if found, would suggest a broader role for those factors in position effects than has been previously appreciated. Thus, we developed methods for analyzing the wealth of available transcription binding profi les and applied them to yeast to 47 discern whether position-specifi c binding could be observed as a general phenomenon, focusing in particular on position effects at chromosome ends. We fi nd that a surprising number (more than 10%) of all profi led yeast DNA- binding transcription factors display a marked preference for binding genomic locations within 25 kilobases of the telomere, and much of this position-specifi c binding is responsive to changes in physiological conditions. We also assay the phenotypes of single and double deletions of these transcription factors in response to physiological challenges. None of these factors have been previously known to have any correlation with genome position, but we fi nd that most interact with proteins with known telomere functions. Taken together, our fi ndings suggest that genome position effects involve not only co-expression of neighboring genes along a genome but are also evident in the architecture, and perhaps evolution, of entire transcriptional regulatory networks.

Results

Discovery of a large family of Subtelomere Binding Transcription Factors

We screened the existing compendium of transcription factor binding locations gathered using the technique of chromatin immunoprecipitation followed by microarray analysis for each of 203 S. cerevisiae transcription factors70. These data were collected from yeast grown in rich medium and under stress conditions. Each transcription factor (TF) was scored using a quantitative measure of telomere-proximal binding that we defi ned to be its Telomere Distance Profi le (TDP). We computed the TDP for a TF by measuring the distance to the closest telomere for every target sequence reported to be bound by that TF, resulting in a distribution of distances. Then, we compared each TDP to a background Telomere Distance Profi le consisting of all yeast genes (Figure 3.1A; see Methods). 48

A 1614 All genes B 10-25 • Gat3

• Yap5 1000 Hap4 Nrg1 Phd1 Gal4 -10 Cup9 P-value 10 Yjl206c Dat1 500 Ace2 • Msn4 • Rgm1 Yap6 • Pdr1 Ypr196w No. of genes • • Mig1 • -3 ••• • •• • 10 • • • • • • • • •• ••••• ••• • • • • • • 1 ••••••••••• •••••••••••• •••• •• • 0.3 1 10 100 1000 0 20 40 60 80 100 Distance from closest tel. (kbp) Subtelomeric promoters bound (%) C Gat3p D Msn4p E Ace2p 8 7 18 6 6

4 10 4 No. of genes No. of genes 2 No. of genes 2 5

0 0 0 0.1 1 25 1000 0.1 2 25 1000 0.1 3 25 1000 Distance from closest telomere (kbp) Distance from closest telomere (kbp) Distance from closest telomere (kbp) F G H – 80

– – 60 – – – • • – • – – – • 40 • • • • • • • • – • – – •

– Promoters bound (% subtelomeric) • • – • • • – – – • – – – – 20 • –• • • • • • • • • –• • – – – – – I IV VIII XII XVI I IV VIII XII XVI 0 Chromosome Chromosome SBTF Cell cycle Other

Figure 3.1. Transcription Factors that Preferentially Bind Sequences at Subtelomeres. (A) Background Telomere Distance Profi le (TDP) for all yeast promoters. (B) Overview of TDPs for rich-medium promoter-binding profi les70 compared against the background distribution. Each dot represents data for one TF: its signifi cance score on the y-axis (the P-value of a one-sided Kolmogorov-Smirnov test) versus the percentage of bound promoters that are subtelomeric on the x- axis. Subtelomere binding transcription factors (SBTFs) having bi-modal TDPs more signifi cant than the P-value threshold of 0.001 are labeled and indicated in black. (C, D, E) Telomere Distance Profi les for (C) Gat3p – the statistically most signifi cant SBTF, (D) Msn4p – a SBTF that is a master regulator of stress response genes, and (E) Ace2p – a cell cycle regulator that is representative of TFs having a TDP similar to the background. Blue dashed lines correspond to the blue dashed line in (A). The red dashed line indicates the 25 kb cutoff distance used to categorize subtelomeric genes. (F, G) Plots showing the location of promoters bound on the 16 yeast chromosomes by (F) Gat3p and (G) Msn4p. Red ticks indicate binding events located within 25 kb of a telomere. Black ticks indicate non-telomeric binding events. Small black dots mark the centromere on each chromosome. (H) The average percent of subtelomeric promoters bound by three groups of TFs: SBTFs, cell cycle TFs (those annotated with the “cell cycle” GeneOntology term GO:0007049), and the remaining TFs. Error bars indicate standard deviation. 49

Among the TF binding profi les from yeast grown in rich medium, seventeen TFs had a TDP that was signifi cantly different from the background distribution according to a Kolmogorov-Smirnov test at P<0.001 (Figure 3.1B). This P-value threshold corresponds to a false discovery rate108 of approximately 1%, meaning that none of the TDPs identifi ed are expected to be false positives (Supplemental Figure 3.1). Fifteen of these signifi cant TDPs were distinctly bi-modal, with an unusually high number of target sequences located within 25 kb of the closest telomere (Figures 1C-D, F-G). This bi-modality was not simply due to low gene density 25 kb from the telomere, since the background distribution was not depleted for genes in this region (Figure 3.1A). Of note is that in S. cerevisiae the chromosomal region known as the subtelomere is postulated to extend roughly 25 kb inwards from each telomere109. The subtelomere is a patchwork of sequence blocks that are repeated within and between the ends of chromosomes, forming the transition between telomeric repeats and chromosome-specifi c sequences110. Thus, we named these 15 factors SBTFs, for Subtelomere Binding Transcription Factors. We next analyzed the 84 TF binding profi les that had been reported under various stress conditions such as rapamycin, butanol, or hydrogen peroxide70. Eleven SBTFs were identifi ed that showed a subtelomeric binding preference under stress. Conversely, a number of SBTFs that had been identifi ed in rich medium were found to lose their subtelomeric binding preferences in alternative conditions (Figure 3.2). In total, this raised the number to 22 SBTFs identifi ed (Figure 3.3A): seven TFs for which the subtelomeric binding preference was specifi c to a stress condition, six TFs for which it was specifi c to rich media, four TFs for which subtelomeric binding was observed under both stress and rich media, and fi ve TFs for which no stress binding data were available. It is perhaps intriguing that 11% of S. cerevisiae 50

A Examples of stress SBTFs

2 Uga3p 2 Gzf3p 2 Dal80p rich media rich media rich media No. of genes

0 0 0

4 5 5 +RAPA +RAPA +RAPA No. of genes

0 0 0 25 25 25 Distance from closest telomere (kbp) Distance from closest telomere (kbp) Distance from closest telomere (kbp)

B Examples of rich media SBTFs

12 Nrg1p 7 Msn4p 6 Gal4p rich media rich media rich media No. of genes

0 0 0 20 16 8 +H2O2 Hi +H2O2 Hi +galactose No. of genes

0 0 0 25 25 25 Distance from closest telomere (kbp) Distance from closest telomere (kbp) Distance from closest telomere (kbp)

Figure 3.2. Example SBTFs that Show Dynamic Binding Preferences as a Function of Growth or Stress Condition. (A) SBTFs with Telomere Distance Profi les (TDP) that are signifi cant only in a stress condition. TDPs are shown for each SBTF in two conditions: rich medium (top) and a stress condition (bottom). Genes considered subtelomeric are colored light red in rich-medium binding profi les or dark red in stress profi les. (B) SBTFs that show subtelomeric preference in rich medium. Stress conditions are abbreviated: RAPA (100 nM rapamycin), H2O2 Hi (4mM hydrogen peroxide), and galactose (2% in YEP medium). 51

A Stress Targets bound Targets bound P < 0.001 TF condition in stress condition in rich media (YPD) S R Aft2p H2O2Lo 117 ✓ Gzf3p RAPA ✓ Xbp1p H2O2Lo 78 ✓ Rox1p H2O2Hi ✓ Uga3p RAPA ✓

Dal80p RAPA ✓ Stress only Mal33p H2O2Hi ✓

Mig1p GAL ✓ Gal4p GAL ✓ 9 RAFF 219 ✓ Hap4p H2O2Lo 77 ✓ 9 SM 77 ✓ Nrg1p H2O2Hi 74 ✓ Msn4p Acid ✓ 9 H2O2Hi ✓ H2O2Lo 9 ✓ Rich media only 9 RAPA ✓ Pdr1p H2O2Lo 88 ✓ Yap5p H2O2Hi 86 ✓

YJL206C H2O2Hi ✓ ✓ 9 H2O2Lo ✓ ✓ Phd1p BUT90 115 ✓ ✓ Yap6p H2O2Hi 92 77 ✓ ✓

9 H2O2Lo 94 ✓ ✓ Stress & Nrg1p H2O2Lo ✓ ✓ Rich media

YPR196W no data ✓ Cup9p ✓ no data Subtelomeric targets All targets Rgm1p no data ✓ Dat1p no data ✓ data Gat3p no data ✓ No stress

60 40 20 0 20 40 60

B HXT9,12,13,15,16 FLO1, FLO10 HXT17 YRF1-1,2,3,5,6 SOR1, SOR2 FDH1 PAU1, 4,6 YRF1-4,7 HMLa1 ADH7 DSF1, FSP2 FDH2 YRF1-8 COS1,3,5,4,8 HMLa2` AAD3 FET4

Aft2 H2O2Lo Mal33 H2O2Hi Xbp1 H2O2Lo Yap6 H2O2Hi Rox1 H2O2Hi Dal80 RAPA S Uga3 RAPA Gzf3 RAPA Gal4 Hap4 Pdr1 Yap5 Gat3 R Dat1 Msn4 Rgm1 YPR196W Cup9 Mig1 Yap6 Nrg1 H2O2Lo Nrg1 SR Yap6 H2O2Lo Phd1 Phd1 BUT90 YJL206C YJL206C H2O2Lo YJL206C H2O2Hi

Figure 3.3. Subtelomeric Binding Preference is Dynamic and Distributed into Distinct Clusters. (A) Comparison of stress versus rich-medium SBTF promoter binding profi les. Dark and light horizontal red bars indicate the number of subtelomeric targets bound. Open bars indicate the total number of targets bound. Check marks on right indicate whether the SBTF shows a signifi cant (P<0.001) subtelomeric preference in the (S) stress condition or (R) rich medium. (B) Hierarchical clustering of the 125 subtelomeric genes (columns) bound by one or more SBTFs (rows). Blue indicates binding. Colored bars at right correspond to the dynamic binding behaviors in panel A: SBTF displays a subtelomeric preference in stress only (S; dark red), rich media only (R; light red), or both stress and rich media (SR; bright red). Stress conditions are abbreviated: Acid (succinic acid, pH 4), BUT90 (1% butanol), GAL (2% galactose), RAFF (2% raffi nose), H2O2Lo (0.4 mM hydrogen peroxide), H2O2Hi (4mM hydrogen peroxide), RAPA (100 nM rapamycin), SM (0.2 mg/ml sulfometuron methyl, an inhibitor of amino acid biosynthesis). 52 bound a completely different set of stress responsive promoters upstream of hexose transporters (HXT), fl occulation genes (FLO, FSP2), and a sorbitol dehydrogenase (SOR1). Notable exceptions to these trends included Mig1p, a SBTF only in rich media but which bound targets in the ‘stress and rich media’ cluster; Yap6p, which bound different sets of subtelomeric genes in low versus high levels of hydrogen peroxide; and three SBTFs (Aft2p, Mal33p, and Yjl206c) that did not strongly cluster with other factors.

SBTFs do not behave like known telomere-binding complexes or transcription factors but do interact with these proteins

The telomere, as opposed to the subtelomere, is the target of several extensively studied protein complexes. The function of these complexes includes telomere replication, chromosome end protection, and transcriptional silencing. To assess whether SBTFs have been linked to any of these functions, we examined protein complexes114 (Supplemental Figure 3.3), recent literature reviews99,115, and results from a genetic screen for telomere length mutants116. None indicated that SBTFs play known roles near the telomere (Supplemental Table 2.1). However, we also analyzed all of the protein interactions from the BioGRID database117 and did ﬁ nd that more than 80% of SBTFs (18 of 22) were linked via either physical or genetic interactions to genes with established functions at the telomere

(Figure 3.4; see Methods). We next compared the SBTFs to the most extensively studied telomere- binding transcription factor, Rap1p. Rap1p binds to repeated sequences (C1-3A) that are found at all chromosome ends118 and plays roles in telomere silencing and length control115. We note that in our computational screen Rap1p was identiﬁ ed among the top subtelomere binding factors (P=0.011; Table 2.1) but did not meet our stringent P-value cutoff. This is consistent with previous reports that found >90% of the target 53

Chromatin proteins HTA1

HTA2 HHT1 PKC1 Ubiquitination and Glucose repression protein degradation HMO1 NRG1

CYC8 RPT6 HAP4 RAD6 MIG1 GAL4 SNF2 RPT4

SWI/SNF chromatin remodeling complex

SIR4 SSN8 AFT2 GAL11 SSN3 MEC1 RAP1 MAL33 SIR silencing PGD1 XBP1 DUN1 complex SSN2 ROX1 SRB2 CUP9 SRB8 MOT3 SRB5 HSP82 STE12 ATG17

PHD1 GZF3 PNC1 YAP6 DAT1 Pseudohyphal PDR1 growth DAL80 URE2 MSN4 GAT3

YAP5 YJL206C UGA3 RNA Pol II holoenzyme RGM1 YPR196W Nitrogen catabolism and Mediator complex

Telomere function Interaction type Physical SBTF Silencing Genetic Length maintenance Co-localization Silencing & length Biochemical activity (e.g phosphorylation)

Figure 3.4. Interactions Between SBTFs and Telomere-related Genes. BioGRID117 interactions that connect SBTFs to any telomere-related gene listed in Supplemental Table 3.1. Also shown are all interactions between these telomere-related genes. Genes that function at the telomere, but do not interact with a SBTF based on data from BioGRID, are not shown (e.g. SIR2 or YKU70). Ovals indicate that a gene has been associated with the telomere in recent literature reviews99,115. Rectangles indicate annotation from the GO database or high-throughput experiments. Interaction types are described in detail in the Methods. 54

TFs (22/203) display a binding preference for subtelomeric genes, which themselves only comprise about 6% of the genome.

SBTFs regulate stress and carbon metabolism in three broad clusters

Roughly one third of the SBTFs we identifi ed had been previously associated with cellular stress responses (e.g., Msn4p, Pdr1p, Phd1p, Rox1p, Xbp1p, Yap5p, Yap6p — see Table 2.1). SBTFs also included factors mediating glucose and nitrogen repression (Mig1p, Dal80p, Gzf3p, Nrg1p, Uga3p) as well as growth on alternative carbon sources (Gal4p, Mal33p, Hap4p) and metal uptake (Aft2p, Cup9p). In contrast, TFs associated with most other cellular programs, such as the cell- cycle regulator Ace2p (Figure 3.1E), had TDPs that closely matched the background distribution (Figure 3.1H). Five SBTFs (Gat3p, Dat1p, Rgm1p, Yjl206c, Ypr196w) were poorly characterized or of unknown function. Several SBTFs such as Gal4p, Msn4p, and Pdr1p are among the most well studied yeast TFs. Their identifi cation as subtelomeric binding factors is, to our knowledge, novel, but it is in concordance with their known roles in regulating stress or metabolic genes110. Hierarchical clustering of SBTF binding profi les indicated that SBTFs bound at least three distinct classes of subtelomeric genes (Figure 3.3B and Supplemental Figure 3.2). Stress-only SBTFs largely targeted alcohol dehydrogenases (AAD3 and ADH7) along with YRF genes, a family of putative helicases located within the subtelomeric Y’ repeated element which may function in telomere maintenance when telomerase is absent111. Rich-media SBTFs bound YRF genes as well as members of the COS gene family,family, which are widely conserved and may function in salt resistance112 and the unfolded protein response113, but are otherwise generally uncharacterized. Intriguingly, SBTFs identifi ed in both stress and rich media conditions 55 sequences bound by Rap1p lie outside the subtelomere in the promoters of many essential genes. Nonetheless, we reasoned that if SBTFs are binding to repeated DNA elements in a manner analogous to Rap1p, then SBTF binding should be correlated with the presence of these elements. To check for coincidence with known repeated elements, we correlated SBTF binding with not only Rap1 binding sites but also the presence of all ten non-ORF sequence features annotated by the Saccharomyces Genome Database119, which include the X, Y’, long terminal repeat, and autonomic replicating sequence elements (Supplemental Table 3.2). The binding profi les for two SBTFs, Yap5p and Msn4p, were correlated with the Y’ element (Pearson P<0.05 with r = 0.75 and r = 0.73, respectively) but no other signifi cant correlations were found. Because subtelomeric DNA is repetitive, we also considered the possibility that a single subtelomeric sequence bound by a SBTF might generate false-positive binding hits due to microarray cross-hybridization to other subtelomeric sequences. We expect that the impact of this effect is likely to be small, since binding was not correlated with the presence of known repeats. We tested this possibility by applying TDP analyses to a fi ltered dataset in which binding targets with similar promoter sequences had been removed 70. Most SBTFs remained signifi cant in this data set (15 of 22). Furthermore, the fact that most SBTFs have subtelomeric binding patterns that change dynamically depending on physiological state (Figure 3.3A) does not support an explanation whereby subtelomeric binding is predominantly an artifact of microarray cross-hybridization.

Six SBTFs target heterochromatin domains that are derepressed during growth on alternative carbon sources

Next, we sought to explore whether SBTFs might be associated with Hda1p, the only known regulator that specifi cally targets the subtelomere. Hda1p is a histone 56 deacetylase that establishes HDA1 associated subtelomeric (HAST) domains that repress about 40% of all subtelomeric genes103. It is plausible that SBTFs could help establish, maintain, or relieve repression of genes in HAST domains. We examined the clusters of subtelomeric genes bound by SBTFs (Figure 3.3B) and found a signifi cant number of HAST genes in cluster ‘SR’, the stress responsive genes bound by Mig1p, Nrg1p, Phd1p, and Yap6p in hydrogen peroxide, butanol, and rich conditions (Figure 3.5A; binomial test, P<0.01). These same HAST genes were also bound by Xbp1p in hydrogen peroxide. Interestingly, the other SBTF in the SR cluster, Yjl206cp, also bound HAST genes, albeit a different set. We found that the subtelomeric genes bound by TFs in cluster SR were specifi cally upregulated in an hda1Δ deletion strain120 (Figures 5B-D; KS test, P<0.001), consistent with Hda1p’s function as a repressor. However, SBTFs in this cluster did not appear to be required for establishing or maintaining repression because individually deleting each TF121 does not affect the expression of many subtelomeric targets in rich medium (Supplemental Figure 3.4). Taken together, these fi ndings suggest that SBTFs in cluster SR perhaps do not function at the subtelomere by differential binding but instead are modulated by other regulatory mechanisms such as post-translational modifi cation. Alternatively, it is possible that they do function by differential binding but only in conditions other than those profi led to-date. Since genes in HAST domains function during growth on alternative carbon sources103, we tested the SBTF deletion strains yap6Δ, phd1Δ, nrg1Δ, and yjl206cΔ for growth defects on solid media supplemented with glucose, fructose, galactose, lactose, ethanol, maltose, or sucrose using a series of dilution assays. Although growth defects were not observed for any single mutant, a yap6Δ phd1Δ double mutant exhibited a clear growth defect in galactose, ethanol, and glycerol (Figure 3.5E). 57

A Number of HAST genes bound 6 60 10 40 4 Total 17 12 18 14 13 6 HAST 20 17 10 9 genes:

% targets in HAST 189 3.3

Yap6p Nrg1p Phd1p Mig1p Yjl206cp genome

Phd1p BUT90 Yap6p H2O2LoNrg1p H2O2Lo Yap6p H2O2HiXbp1p H2O2Lo Yjl206cp H2O2LoYjl206cp H2O2Hi B Yap6p ChIP C Phd1p ChIP D Yap1p ChIP

1.5

1.0 � vs. wild type )

0.5

0.0

-0.5 Expression ratio: log2 ( hda1

Unbound Bound Unbound Bound Unbound Bound Unbound Bound Unbound Bound Unbound Bound central subtelomeric central subtelomeric central subtelomeric

E SC glucose SC galactose

BY4741

gal4∆

yap6∆

phd1∆

yap6∆ phd1∆ SC ethanol SC glycerol BY4741

gal4∆

yap6∆

phd1∆

yap6∆ phd1∆

Figure 3.5. Binding Profi les, Expression, and Growth Phenotypes Link SBTFs to Hda1p. (A) SBTFs that have a signifi cant percentage of targets in HAST domains (P<0.01). Exact numbers of targets are above each bar. (B,C,D) Box-and-whisker plots of gene expression changes in an hda1Δ strain compared to wild type [data from 114] for targets bound by: (B) Yap6p (P=10-8). (C) Phd1p (P=0.00006). (D) Yap1p (statistically unaffected). (E) YAP6 and PHD1 display a condition-specifi c genetic interaction during growth on non-glucose carbon sources. Five-fold serial dilutions were grown on synthetic complete (SC) medium supplemented with 2% of one of the following carbon sources: glucose, galactose, ethanol, or glycerol. Controls are BY4741, the parent strain used to create gene deletions, and gal4Δ, that does not grow on galactose. 58

This interaction likely depends on subtelomeric genes, as Yap6p and Phd1p bind seven common subtelomeric targets including the hexose transporter-like genes HXT9, HXT15, and HXT16. The gene HXT15 was previously found to have higher expression levels during growth on ethanol and glycerol122. In summary, we have linked six SBTFs to the chromatin-modifying enzyme Hda1p, which mediates Sir-independent silencing at the subtelomere. Two of these SBTFs, Yap6p and Phd1p, jointly contribute to growth on alternative carbon sources, a role consistent with existing models of Hda1p function.

Dynamic subtelomeric binding is correlated with gene expression

Given that subtelomeric genes are typically repressed105, a SBTF that preferentially binds the subtelomere in a stress condition (cluster S in Figure 3.3B) might function as an activator if its targets are upregulated in the same stress. Alternatively, a SBTF that moves away from the subtelomere in response to stress (cluster R) might function as a repressor if its targets are subsequently upregulated in the same condition. To distinguish between these hypotheses, we analyzed published gene expression proﬁ les76 to identify SBTFs whose subtelomeric target genes had condition- speciﬁ c regulatory behaviors123,124.

We discovered a strong correlation between the binding and upregulation of targets of Aft2p in response to oxidative stress. This correlation was particularly striking for the 11 members of the subtelomeric PAU gene family that are bound by Aft2p under mild hydrogen peroxide treatment (Figure 3.6A), although the proﬁ les of some family members might be affected by microarray cross-hybridization with other PAU-family genes since these sequences are over 80% similar to PAU1 at the nucleotide level. PAU genes and Aft2p have each been implicated in oxidative stress 59

A PAU gene family B Nitrogen depletion

ChIP Percent (inferred) nucleotide ChIP similarity 2.5 mM DTT * Nitrogen depletion * to PAU1 0 5min 30min 1hr 3hr 0 30min 2hr 8hr 2day 5day Aft2p Aft2p H2O2Lo Gzf3p RAPA Uga3p RAPA Dal80p RAPA

PAU1 100 HMRA1 YGL261C 96 HMLALPHA1 YHL046C 97 HMRA2 YLL064C 97 HMLALPHA2 ADH7 YOL161C 96 AAD3 PAU6 97 YNR068C PAU4 96 YKL224C 86 -4 -2 2 4 83 PAU15 Fold change YOR394W 83 YPL282C 83 Binding Expression

C Oxidative stress

ChIP ChIP (inferred) (inferred) ChIP 1.5 mM * 0.32 mM H2O2 * 1 mM menadione * diamide 0 10min 30min 1hr 2hr 0 10min 30min 1hr 3hr 0 5min 30m 50m 90m Nrg1p Yap6p Rox1 Nrg1p H2O2Lo Yap6p H2O2Lo Yap6p H2O2Hi Rox1p H2O2Hi YML131W YMR315W YCR102C AAD3 ADH7 YLR460C RDS1 GIT1 YER184C YER185W YJL218W PHO89

D ‘Stress-dependent’ E ‘Poised for action’ SBTF only in stress condition SBTF in rich media SBTF X Subt gene

SBTF X stress #1 stress #2 Subt gene

e SBTF

r t

s SBTF X Subt gene SBTF ? Subt gene Subt gene

Figure 3.6. Dynamic Binding and Expression Profi les Suggest Models of SBTF Function and Link SBTFs to Poorly Characterized Genes at Subtelomeres. TF binding profi les were matched to expression profi les gathered under similar environmental perturbations. (A) Aft2p binds upstream of 12 subtelomeric PAU genes under oxidative stress conditions (blue boxes). Heatmap shows induction of the PAU genes in oxidative stress conditions. (B) In the presence of rapamycin, which simulates nitrogen depletion by antagonizing the TOR kinases 202, Gzf3p, Uga3p, and Dal80p bind upstream of genes that are induced under conditions of nitrogen limitation. (C) Genes targeted by Nrg1p, Yap6p, and/or Rox1p under hydrogen peroxide but not in untreated conditions are upregulated in response to three oxidative stress agents (hydrogen peroxide, menadione, and diamide). For A-C, the asterisk (*) indicates the approximate time that the binding data were collected. (D) SBTFs that display a preference for the subtelomere only in stress conditions may positively regulate gene expression. (E) SBTFs that bind the subtelomere in rich media conditions perhaps either contribute to repression or are ‘poised for action,’ subsequently functioning in different stress conditions (see text). 60 resistance125,126, but our results provide the fi rst evidence that Aft2p upregulates PAU genes. The other stress-only (S) and rich-media-only (R) SBTFs bound two overlapping sets of genes (Figure 3.3B). These included 30 genes similar to the YRF family that we found to be largely unresponsive in stress conditions (Supplemental Figure 3.5). However, excluding the YRF-like genes, three stress-dependent SBTFs (Gzf3p, Uga3p, Dal80p) displayed behaviors consistent with functions as activators in conditions of nitrogen depletion (Figure 3.6B). The expression analysis also identifi ed a set of subtelomeric genes that were targeted by Yap6p, Nrg1p, or Rox1p under hydrogen peroxide but not in untreated conditions and which were upregulated in response to a broad array of oxidative stress agents (Figure 3.6C). For these targets, these three SBTFs appear to behave like ‘stress only’ SBTFs, although Yap6p and Nrg1p had been classifi ed as ‘stress-and- rich-media’ factors based on their global binding patterns in Figure 3.3a. Intriguingly, many of the putative stress-regulated targets of these SBTFs are poorly characterized and unnamed, although YML131W is similar in sequence to oxidoreductases119. In contrast to the above evidence for SBTFs as transcriptional activators, we did not fi nd any strong evidence for SBTFs as repressors of gene expression.

Discussion

We have identifi ed 22 yeast transcription factors that display a clear binding preference for subtelomeric regions (Figure 3.1 and Table 3.1). Almost all of these factors physically or genetically interact with genes encoding proteins with known telomeric functions (Figure 3.4). The majority of binding patterns are dynamic, such that the factor is concentrated at the subtelomere only under certain conditions (Figures 3.2 and 3.3). This observation, combined with our analysis of stress-induced 61 expression profi les, suggests several models for how SBTFs may function. Most stress-only SBTFs (cluster S) appear to be activators of subtelomeric genes in response to stress (modeled in Figure 3.6D) and may relocalize to different target genes under other conditions. The role of rich-medium SBTFs (clusters R and SR) is less clear. We initially hypothesized that these might function as repressors of genes that are derepressed under specifi c stress conditions (modeled in Figure 3.6E). We did not fi nd such evidence of repressor function, although it remains possible that derepression occurs in conditions different than those for which data are currently available. In particular, many targets in the SR cluster function in alternative carbon metabolism, and we did fi nd a combination of SBTFs that bind this cluster and promote growth in non-glucose conditions (Figure 3.5). We also considered that a SBTF might be neither an activator nor a repressor. Instead, it may be sequestered to subtelomeres without effecting changes in subtelomeric gene expression as a means of holding it in reserve from other genomic locations. Additionally, there may be roles for SBTFs at the telomere that are dependent on sequestration as suggested for Rap1p127,128 and shown for the telomeric Ku proteins that become mobilized to sites of DNA damage to facilitate repair129. A growing body of evidence suggests that genome organization, nuclear architecture, and gene expression are interrelated101,130,131. Thus, an intriguing alternative possibility is that rich-media SBTFs may be residing in repressive regions within the nucleus. Foci at the nuclear periphery contain telomeres132 along with other regions of silent heterochromatin130,133. Transcription factors at the nuclear periphery are also located at sites of active transcription tethered to nuclear pore complexes101,130. In a process termed ‘reverse recruitment’ genes to be transcribed appear to be recruited to these sites of 62 transcription. Since Rap1p has been implicated in this process134, it is tempting to ascribe a similar role to SBTFs. Under this model, the condition-dependent binding seen in Figure 3.3 would be interpreted not as SBTFs themselves ‘moving towards’ or ‘away’ from the subtelomere, but rather as different subtelomeric genes moving in or out of sites of gene activation or repression at the nuclear periphery. As more binding profi les are generated in yeast and other species, it may be fruitful to screen those data for unexpected binding preferences at the subtelomere as well as more generally at other genomic regions. For instance, such a strategy has been used to show that Sgo1p and Rec8p, two cohesin proteins that function in chromosome segregation, localize to a 50 kb region around the centromere135. Our work demonstrates that many other such analyses may be productive, and it provides an example of transcription factor binding patterns that refl ect genome organization. Further work will be required to determine the extent to which condition- specifi c targeting of the subtelomere is a conserved regulatory strategy. Orthologs for all SBTFs have been identifi ed in one or more other yeast species, and seven SBTFs appear to be widely conserved across fungi, invertebrates, fi sh, and mammals (Supplemental Table 3.3). Moreover, SBTFs that cluster together by binding profi le (Figure 3.3B) also appear to have roughly similar patterns of conservation across species. Intriguingly, paralogs of most SBTFs were themselves not found to be SBTFs, with the exception of Dal80p / Gzf3p. This suggests there may be selective pressure that maintains the subtelomeric binding pattern of one paralog, but that redundancy is typically not necessary. It is tempting to speculate that SBTFs and their dynamic localization may contribute to the evolutionary plasticity that has previously only been attributed to the subtelomeric genes themselves110,136. 63

Methods

Computational screen for SBTFs

SBTFs were screened from published genome-wide transcription factor (TF) binding data70. A P-value threshold of 0.001 was used to identify the set of gene promoters putatively bound by each TF in a particular environmental condition (either rich medium or one of 12 additional stress or nutritional conditions). To compute the distance from each gene to the closest telomere, we ﬁ rst computed midg, the midpoint of the starting and ending chromosomal coordinates for gene g, which were downloaded from the Saccharomyces Genome Database (http:// www.yeastgenome.org). Then, the distance to the closest telomere is dg = min(midg,

LengthChrg – midg), where LengthChrg is the length of the chromosome on which g is located. The distribution of distances for all genes targeted by a TF in a particular binding experiment was defi ned to be the Telomere Distance Profi le (TDP). Using R (http://www.r-project.org), a one-sided Kolmogorov-Smirnov (KS) test was used to compare the TDP for each binding experiment to a background distribution of the TDP for all yeast genes. To estimate the false discovery rate (FDR), the set of all KS P-values was used as input to the Q-value software108. Statistically signifi cant TDPs were identifi ed at P ≤0.001, corresponding to a

FDR of approximately 1%. Subtelomeric genes were deﬁ ned to be those with dg ≤ 25000. Hierarchical clustering of SBTF binding proﬁ les was performed using Cluster 3.0137 and visualized using Java TreeView138. To simplify data visualization, clustering was limited to subtelomeric genes bound by at least one SBTF. 64

Analysis of protein complexes and interactions involving SBTFs

A comprehensive set of yeast protein-protein interactions was obtained from the BioGRID database117 version 2.0.40 (May 2008). Interactions that connected a SBTF to any known telomere-related gene were identifi ed, and then manually inspected using Cytoscape11. Physical interactions were classifi ed as those in BioGRID labeled: Affi nity Capture, Co-crystal Structure, Co-fractionation, Co- purifi cation, FRET, Far Western, Protein-peptide, Protein-RNA, Reconstituted Complex, or Two-hybrid. Genetic interactions were classifi ed as those in BioGRID labeled: Dosage Growth Defect, Dosage Lethality, Dosage Rescue, Phenotypic Enhancement, Phenotypic Suppression, Synthetic Growth Defect, Synthetic Lethality, or Synthetic Rescue.

A compendium of 547 non-overlapping yeast protein complexes114 was checked to see if any complex contained one or more SBTFs and any of the known telomere-related genes listed in Supplemental Table 3.1. Telomere-related genes were obtained from two sources: 1) any gene mentioned in recent reviews of yeast telomeres115 or the telomere position effect99, or discovered in genetic screens for telomere length mutants116, and 2) any gene annotated in the Gene Ontology (GO) database under the categories “telomere organization and biogenesis” (GO:0032200) or “chromatin-silencing-at-telomere” (GO:0006348). In total, 426 telomere-related genes were identiﬁ ed, 321 from GO analyses and an additional 105 from literature reviews.

Analysis of repetitive element binding by SBTFs

Yeast genome sequence features were downloaded from the Saccharomyces

Genome Database (http://www.yeastgenome.org). All features labeled Open Reading 65

Frame (ORF) were discarded, leaving ten different types of non-ORF features. Next, we determined the presence or absence of one or more occurrences of each feature on the 32 S. cerevisiae chromosome arms. We then determined whether each SBTF bound at least one promoter on the 32 chromosome arms. These analyses generated two sets of vectors consisting of 32 elements each: one set corresponding to SBTF binding and another set corresponding to sequence features. Using the Pearson correlation test in R, the similarity of each of the binding vectors was compared to each of the ten non-ORF feature vectors. P-values were adjusted for multiple hypotheses testing using the Bonferroni correction.

Screening SBTF binding proﬁ les for correlation with hda1Δ sensitivity

We used a method by Steinfeld and colleagues 139 to identify transcription factor (TF) and chromatin modifying enzyme (CM) pairs that function in concert. Briefl y, gene expression profi les of CM deletions are analyzed to determine whether the targets of a TF, identifi ed by ChIP-chip, are preferentially affected. The rationale being that if a TF and CM function together, then deletion of the CM should preferentially affect the genes targeted by that TF. We extended this approach by reasoning that if a TF and CM function together to specifi cally regulate the subtelomeric targets of a TF, then those subtelomeric targets should be preferentially expressed compared to the non-subtelomeric targets in a CM deletion. We applied this approach to hda1Δ expression data120 and the sets of subtelomeric and non-subtelomeric genes targeted by each SBTF. As in Steinfeld, we used the Kolmogorov-Smirnov test to assess the statistical signifi cance of the difference in expression between the bound vs. unbound subtelomeric targets for each SBTF. The signifi cance threshold was a Bonferroni-corrected P-value less than 0.001. 66

Dilution assays for growth of SBTF deletions on alternative carbon sources

Single deletion yeast strains were obtained from the Yeast Deletion Collection (Open Biosystems). Double deletions were a gift from S. Bandypadhay in the Ideker lab, and were constructed as described previously 140. Individual colonies growing on YPD agar plates were picked and cultured overnight to saturation in 2 mL YPD. Overnight cultures were diluted with sterile water in a 96-well microtiter plate in six 5-fold serial dilutions with a starting OD600 between 2 and 3. A 48-pin replica pinner was used to spot approximately 5 uL of each dilution onto solid agar plates of synthetic complete medium containing 2% of one of the following carbon sources: glucose, rafﬁ nose, glycerol, ethanol, galactose, fructose, maltose, or lactose. Plates were incubated at 30 C for at least three days. Initial screening results were conﬁ rmed with repeated experiments.

Inferring SBTF modes of action by integrated analysis of matched binding and expression proﬁ les

Gene expression experiments from Gasch and colleagues 76 were matched to TF binding experiments 70 performed under similar environmental perturbations. The set of gene targets bound by SBTFs in the presence of hydrogen peroxide (H2O2) was matched to expression perturbations caused by the oxidative agents menadione, DTT, diamide, or hydrogen peroxide. Binding profi les from cells treated with rapamycin were matched to expression profi les labeled ‘Nitrogen depletion’. Rich-media binding profi les were matched to expression profi les labeled ‘YPD’, ‘diauxic shift’ and ‘steady state expression’. Matched sets of binding and expression that passed both of the following criteria were identifi ed for in-depth manual inspection: 1) the expression of the subtelomeric targets bound by each SBTF was different than the unbound 67 subtelomeric genes, as assessed by the Kolmogorov-Smirnov test at P<0.05. 2) the SBTF bound the differentially expressed targets only in stress conditions, or the SBTF bound these genes only in rich medium. To identify YRF-like genes, the protein sequence encoded by YRF1-1 was used as an input to BLASTP, and matches were selected at an E-value cutoff of 10-10.

Identiﬁ cation of SBTF orthologs

Fungal orthologs of each S. cerevisiae gene were extracted from the ‘Pillars. tab’ fi le downloaded from the Yeast Gene Order Browser version 2.0141 (http://wolfe. gen.tcd.ie/ygob/data/Version1.0_Nature-2006/Pillars.tab). Orthologs in other species were extracted from raw datafi les downloaded from version 6.0 of InParanoid142 (http://inparanoid.sbc.su.se/download/old_versions/6.0/sqltables/). The reported number of orthologs for an ORF in another species is the total number of distinct orthologs for that ORF and for the S. cerevisiae paralogs of that ORF defi ned in the YGOB. 68

Table 3.1. Summary of Subtelomere Binding Transcription Factors # Number of genes bound that are in HAST domains or are members of the COS, PAU, or YRF subtelomeric multi-gene families. * Met18p and Rap1p did not meet the stringent SBTF P-value cutoff, but are included here for reference because they play known roles in telomere biology.

KS Total Total Genes bound # TF Cond P-value bound subt. HAST COS PAU YRF Saccharomyces Genome Database annotation Msn4p YPD 3.83E-08 58 25 (43%) 4 0 1 8 Transcriptional activator related to Msn2p; activated in stress conditions, which results in translocation from the cytoplasm to the nucleus; binds DNA at stress response elements of responsive genes, inducing gene expression Phd1p YPD 0.000332 67 15 (22%) 14 0 0 0 Transcriptional activator that enhances pseudohyphal growth; regulates BUT90 0.000185 115 26 (23%) 17 1 1 0 expression of FLO11, an adhesin required for pseudohyphal filament formation; similar to StuA, an A. nidulans developmental regulator; potential Cdc28p substrate Pdr1p YPD 2.84E-07 88 28 (32%) 3 1 0 8 Zinc cluster protein that is a master regulator involved in recruiting other zinc cluster proteins to pleiotropic drug response elements (PDREs) to fine tune the regulation of multidrug resistance genes

Rox1p YPD 0.014476 68 10 (15%) 9 0 0 0 Heme-dependent repressor of hypoxic genes; contains an HMG domain H2O2Hi 0.000004 58 22 (38%) 1 0 0 4 that is responsible for DNA bending activity Yap5p YPD 4.15E-19 86 46 (53%) 3 4 0 9 Basic leucine zipper (bZIP) transcription factor Yap6p YPD 0.000016 94 16 (17%) 18 0 0 0 Putative basic leucine zipper (bZIP) transcription factor; overexpression H2O2Lo 0.000596 61 13 (21%) 13 0 0 0 increases sodium and lithium tolerance H2O2Hi 9.33E-08 92 31 (34%) 10 0 0 4 Xbp1p YPD 1 0 Transcriptional repressor that binds to promoter sequences of the cyclin H2O2Lo 4.06E-08 78 23 (29%) 9 0 0 6 genes, CYS3, and SMF2; expression is induced by stress or starvation during mitosis, and late in meiosis; member of the Swi4p/Mbp1p family; potential Cdc28p substrate Mig1p YPD 0.000035 23 10 (43%) 10 0 0 0 Transcription factor involved in glucose repression; sequence specific DNA binding protein containing two Cys2His2 zinc finger motifs; regulated by the SNF1 kinase and the GLC7 phosphatase

Dal80p YPD 0.998252 13 0 (0%) 0 0 0 0 Negative regulator of genes in multiple nitrogen degradation pathways; RAPA 0.000728 41 13 (32%) 0 0 0 3 expression is regulated by nitrogen levels and by Gln3p; member of the GATA-binding family, forms homodimers and heterodimers with Deh1p

Gzf3p YPD 0.052803 7 2 (29%) 0 0 0 0 GATA zinc finger protein and Dal80p homolog that negatively regulates RAPA 5.85E-11 40 24 (60%) 0 0 0 6 nitrogen catabolic gene expression by competing with Gat1p for GATA site binding; function requires a repressive carbon source; dimerizes with Dal80p and binds to Tor1p Nrg1p YPD 0.000016 74 21 (28%) 17 0 0 0 Transcriptional repressor that recruits the Cyc8p-Tup1p complex to H2O2Lo 0.000017 48 15 (31%) 12 0 0 0 promoters; mediates glucose repression and negatively regulates a variety of processes including filamentous growth and alkaline pH response Uga3p YPD 0.678337 5 1 (20%) 1 0 0 0 Transcriptional activator necessary for gamma-aminobutyrate (GABA)- RAPA 2.25E-08 33 17 (52%) 1 0 0 5 dependent induction of GABA genes (such as UGA1, UGA2, UGA4); zinc- finger transcription factor of the Zn(2)-Cys(6) binuclear cluster domain type; localized to the nucleus Gal4p YPD 0.000342 26 11 (42%) 0 0 0 3 DNA-binding transcription factor required for the activation of the GAL genes in response to galactose; repressed by Gal80p and activated by Gal3p Mal33p YPD 1 0 MAL-activator protein, part of complex locus MAL3; nonfunctional in H2O2Hi 0.000158 39 7 (18%) 1 0 0 1 genomic reference strain S288C Hap4p YPD 0.00013 77 18 (23%) 4 0 0 6 Subunit of the heme-activated, glucose-repressed Hap2p/3p/4p/5p CCAAT-binding complex, a transcriptional activator and global regulator of respiratory gene expression; provides the principal activation function of the complex Aft2p YPD 0.566925 23 1 (4%) 0 0 0 0 Iron-regulated transcriptional activator; activates genes involved in H2O2Lo 0.000131 117 28 (24%) 9 0 12 3 intracellular iron use and required for iron homeostasis and resistance to oxidative stress; similar to Aft1p Cup9p YPD 0.000288 22 5 (23%) 3 0 0 0 Homeodomain-containing transcriptional repressor of PTR2, which encodes a major peptide transporter; imported peptides activate ubiquitin-dependent proteolysis, resulting in degradation of Cup9p and de-repression of PTR2 transcription Dat1p YPD 1.12E-08 24 15 (62%) 0 0 0 7 DNA binding protein that recognizes oligo(dA).oligo(dT) tracts; Arg side chain in its N-terminal pentad Gly-Arg-Lys-Pro-Gly repeat is required for DNA-binding; not essential for viability Gat3p YPD 4.36E-26 57 43 (75%) 4 5 0 8 Protein containing GATA family zinc finger motifs Rgm1p YPD 0.00001 8 7 (88%) 2 0 0 2 Putative transcriptional repressor with proline-rich zinc fingers; overproduction impairs cell growth YJL206C YPD 0.0001 29 9 (31%) 6 0 0 0 Putative protein of unknown function; similar to transcriptional H2O2Lo 0.000001 10 6 (60%) 6 0 0 0 regulators from the Zn[2]-Cys[6] binuclear cluster protein family; mRNA H2O2Hi 0.000971 13 4 (31%) 4 0 0 0 is weakly cell cycle regulated, peaking in S phase; induced rapidly upon MMS treatment YPR196W YPD 0.000458 4 4 (100%) 0 0 0 1 nuclear protein (putative) Met18p * YPD 0.001462 21 9 (43%) DNA repair and TFIIH regulator, required for both nucleotide excision repair (NER) and RNA polymerase II (RNAP II) transcription; involved in telomere maintenance Rap1p * YPD 0.010879 174 12 (7%) DNA-binding protein involved in either activation or repression of transcription, depending on binding site context; also binds telomere sequences and plays a role in telomeric position effect (silencing) and telomere structure 69

Supplemental Figures

A B

0.015 0.015

0.010 0.010 q−value q−value 0.005 0.005

0.000 0.000

0.0000 0.0005 0.0010 0.0015 0.0020 0 10 20 30 40 Significant tests P−value C

0.6

0.5

0.4 Expected number of false positives 0.3

0.2

0.1

0.0

0 10 20 30 40 Significant tests

Supplemental Figure 3.1. Estimating the error in identifying SBTFs. The Q-value software package108 was used to account for multiple-hypothesis testing by estimating the false discovery rate for the telomere distance profi le tests (see Figure 3.1 and text). (A) P-values plotted against their corresponding q-values. As indicated by the red dashed lines, the P-value threshold of 0.001 used in this study corresponds to a q-value (false discovery rate) of approximately 0.01, meaning that about 1% of the TF binding profi les with a signifi cant subtelomeric preference are expected to be false positives. (B) At P≤0.001, 31 tests are considered signifi cant. (C) Based on the q-value method, less than one (~0.3) of the 31 TF binding profi les with subtelomeric bias is expected to be a false positive. 70

Legend Promoter-binding SBTF = number of TFs that share targets interactions

Target rich-media only genes = number of TFs that bind stress only 21 18 16 13 9 5 2 1 0 both

Supplemental Figure 3.2. A map of the subtelomeric regulatory circuitry. All TF-promoter binding interactions from signiﬁ cant SBTF binding proﬁ les are visualized with Cytoscape11. See Legend for details. 71

A RPL9B YPR115W NOP13 XBP1 SNT2 YRM1 GZF3 PRP16 MIG1 FAB1 PIL1 YTA6 HMO1 CDC20 CUP9 NDC1 YPT11 MCM10 LSP1 HMO1,CST6: PSK2 Chromatin FOL2 associated proteins CST6 PKH1 YLR179C MCM10: YMR086W PDR1 replication and APP1 telomeric silencing AEP3

PRC1 ESC1 NRG1 MSC3

YLR407W SIA1 ROX1 ELF1 CRP1 CHD1 TAF2 PGA2 CKA1 ESC1: CKB2 SIR4-mediated CKA2 PEP4 silencing & telomere localization to REH1 nuclear periphery VHS3 HOT1 IDS2 RTN1 TAF8 CKB1 RIM13 YKL088W

B SIR silencing SAS silencing complex Ku complex GPG1 YGR052W heterodimer SAS2 SAS4

YLR278C SIR2 SIR4 YKU70 YKU80 BUD4 SIR3 MUC1 YIL152W

YBL044W GIR2 DMC1 SLA1 BZZ1 VRP1 MRE11 LAS17 RBG2 PRS5 XRS2 MRX complex MEC1 kinase complex CDC15 MEC1 CDC13 RAD50 complex CDC13

Gene annotation legend no telomere SBTF Telomere silencing (oval if confirmed in literature) annotation Telomere length (oval if confirmed in literature)

Silencing & length (oval if confirmed in literature)

Supplemental Figure 3.3. Analysis of protein complexes. (A) Protein complexes36 that contain SBTFs. (B) Examples of protein complexes identiﬁ ed in the same screen that contain telomere-related proteins identiﬁ ed in recent reviews of yeast telomeres99,115. 72

ssn6� haploid n=399 tup1� n=386 cdc73� n=103 sin3� n=224 yor080w� *3 n=226 top3��haploid n=51 sas4� n=20

P < 0.001 sas5� n=19 isw2� n=44 sir2� n=15 sir3� n=9 mrpl33� n=19 gal4� n=36 rgm1� n=42 mig1� n=29 pdr1� n=41 dat1� n=8 cup9� n=15 nrg1� n=19 hap4� ypr196w SBTFs � msn4� gat3� phd1� yjl206c� yap5� yap6�

0 10 20 30 40 50 60 70 Number of subtelomeric genes affected by deletion

Supplemental Figure 3.4. Screening expression profi les of singe gene deletions for enrichment of subtelomeric genes. Gene expression profi les of single gene deletions29,121 were analyzed with the same Telomere Distance Profi le analysis method used to identify SBTFs. The set of genes differentially expressed in each deletion mutant was identifi ed using a P-value cutoff of 0.001, as described in the original publications. Bars show the number of subtelomeric genes affected by each TF deletion. Grey indicates deletions that affect an unexpectedly large number of subtelomeric genes (KS-test, P<0.001). Black bars show data for the 15 rich-media SBTFs. No deletions of the stress condition SBTFs were available in these data sets. Numbers next to each bar indicate the total number of genes differentially expressed in a deletion. 73

HXT9,12,13,15,16 FLO1, FLO10 HXT17 YRF1-1,2,3,5,6 SOR1, SOR2 FDH1 PAU1, 4,6 YRF1-4,7 HMLa1 ADH7 DSF1, FSP2 FDH2 YRF1-8 COS1,3,5,4,8 HMLa2` AAD3 FET4

HAST domains COS YRF PAU COS4 COS5 YCL074W HMLALPHA RDS1 YMR315W ENB1 YIR035C YPL282C PAU1 YIR041W YIR040C PHO89 YNR068C YLL066C COS3 HMLALPHA YCL065W YRF1-6 YNL338W YDR543C BSC3 YLR463C YNL337W YRF1-1 YDR544C YLL065W YBL113C YBL112C YBL111C YBL109W ADH7 GIT1 YPL278C YPL277C YPL276W FDH2 YOR389W FDH1 HXT17 YNR071C YGL262W YPR204W YFL066C YHL050C YRF1-8 YOR394W YOL161C PAU6 YMR325W YMR324C PAU4 YGL260W YGL261C YLL064C YKL224C YKL223W YHL045W YHL046C YRF1-7 YIL175W YIL174W YPR203W YPR202W YML133C YRF1-4 YLR464W YLR462W YJL225C YIL177C COS1 YFL063W YFL064C YFL065C COS8 YEL076C- YEL076C YEL075C YEL074W YHL049C YOL166C YRF1-5 YRF1-3 AAD3 YRF1-2 YER189W HMRA1 HMRA2 YJL218W YLR460C YMR320W YCR102W- YCR102C YER185W YER184C YOL159C FLO1 FET4 YJL217W YOL157C FSP2 HXT9 YIL172C YIL171W HXT12 HXT16 SOR1 DSF1 HXT15 SOR2 FLO10 HXT13 YAL064W YNR070W YHL041W SNO3 SNZ3 STL1 YDR535C YML131W YIL176C VTH1 PGA3 YEL077C YBL108W

A 2 1 A

Heat Shock 005 minutes hs-2 29C to 33C - 5 minutes Heat Shock 030inutes hs-2 constant 0.32 mM H2O2 (40 min) rescan Heat Shock 015 minutes hs-2 Heat Shock 40 minutes hs-1 Heat Shock 05 minutes hs-1 Heat Shock 10 minutes hs-1 Heat Shock 15 minutes hs-1 Heat Shock 20 minutes hs-1 Heat Shock 30 minutes hs-1 heat shock 17 to 37, 20 minutes heat shock 21 to 37, 20 minutes heat shock 25 to 37, 20 minutes heat shock 29 to 37, 20 minutes Heat shock DBYyap1- 37degree heat - 20 min (redo) heat shock 33 to 37, 20 minutes constant 0.32 mM H2O2 (10 min) redo constant 0.32 mM H2O2 (20 min) redo constant 0.32 mM H2O2 (30 min) redo constant 0.32 mM H2O2 (50 min) redo constant 0.32 mM H2O2 (60 min) redo 1 mM Menadione (20 min) redo 1 mM Menadione (30 min) redo 1mM Menadione (40 min) redo 1 mM Menadione (50 min)redo 1 mM Menadione (105 min) redo 1 mM Menadione (80 min) redo 1 mM Menadione (160 min) redo 1.5 mM diamide (5 min) 1.5 mM diamide (10 min) 1.5 mM diamide (20 min) 1.5 mM diamide (30 min) 1.5 mM diamide (40 min) 1.5 mM diamide (50 min) 1.5 mM diamide (60 min) 1.5 mM diamide (90 min) DBYmsn2msn4 (good strain) + 0.32 mM H2O2 DBYmsn2/4 (real strain) + 0.32 mM H2O2 (20 min) DBY7286 + 0.3 mM H2O2 (20 min) H2O2 1 mM Menadione (120 min)redo 1M sorbitol - 120 min 1M sorbitol - 45 min DBYmsn2-4- 37degree heat - 20 min dtt 060 min dtt-2 2.5mM DTT 030 min dtt-1 DTT 2.5mM DTT 060 min dtt-1 2.5mM DTT 045 min dtt-1 2.5mM DTT 090 min dtt-1 2.5mM DTT 120 min dtt-1 2.5mM DTT 180 min dtt-1 dtt 120 min dtt-2 YPD stationary phase 2 h ypd-1 diamide YPD stationary phase 4 h ypd-1 YPD stationary phase 8 h ypd-1 DBY7286 37degree heat - 20 min DBYyap1- + 0.3 mM H2O2 (20 min) 1 mM Menadione (10 min)redo steady state 17 dec C ct-2 Menadione DBYyap1 + 37degree heat (repeat) DBYmsn2/4 (real strain) + 37degrees (20 min) 37 deg growth ct-1 aa starv 4 h aa starv 6 h 1M sorbitol - 5 min 1M sorbitol - 15 min 1M sorbitol - 30 min diauxic shift timecourse 18.5 h diauxic shift timecourse 20.5 h DBYyap1 + 0.32 mM H2O2 (20 min) steady state 36 dec C ct-2 diauxic shift timecourse11.5 aa starv 1 h 1M sorbitol - 60 min 1M sorbitol - 90 min dtt 480 min dtt-2 constant 0.32 mM H2O2 (100 min) redo constant 0.32 mM H2O2 (80 min) redo constant 0.32 mM H2O2 (120 min) redo constant 0.32 mM H2O2 (160 min) redo aa starv 0.5 h Nitrogen Depletion 1 h Nitrogen Depletion 30 min. Nitrogen Depletion 2 h dtt 240 min dtt-2 YPD 6 h ypd-2 Nitrogen Depletion 8 h Nitrogen Nitrogen Depletion 12 h YPD 1 d ypd-2 YPD 8 h ypd-2 YPD 10 h ypd-2 YPD 12 h ypd-2 YPD stationary phase 1 d ypd-1 YPD 2 d ypd-2 YPD 3 d ypd-2 Nitrogen Depletion 1 d Nitrogen Depletion 2 d Nitrogen Depletion 3 d YPD Nitrogen Depletion 5 d YPD stationary phase 12 h ypd-1 YPD stationary phase 2 d ypd-1 YPD stationary phase 3 d ypd-1 YPD stationary phase 5 d ypd-1 YPD stationary phase 7 d ypd-1 stationary phase YPD stationary phase 13 d ypd-1 YPD stationary phase 22 d ypd-1 YPD stationary phase 28 d ypd-1 YPD 5 d ypd-2 YP ethanol vs reference pool car-2 ethanol vs. reference pool car-1 17 deg growth ct-1 diauxic shift 21 deg growth ct-1 25 deg growth ct-1 steady state 15 dec C ct-2 Msn4 overexpression Heat Shock 060 minutes hs-2 Heat Shock 000 minutes hs-2 Heat Shock 000 minutes hs-2 29C +1M sorbitol to 33C + 1M sorbitol - 5 minutes 29C to 33C - 30 minutes 33C vs. 30C - 90 minutes 29C to 33C - 15 minutes 29C +1M sorbitol to 33C + 1M sorbitol - 15 minutes Heat Shock 60 minutes hs-1 Heat Shock 80 minutes hs-1 steady state 36 dec C ct-2 (repeat hyb) 37C to 25C shock - 60 min 37C to 25C shock - 90 min Heat Shock 000 minutes hs-2 37C to 25C shock - 15 min 37C to 25C shock - 30 min 37C to 25C shock - 45 min 29C +1M sorbitol to 33C + 1M sorbitol - 30 minutes 29C +1M sorbitol to 33C + *NO sorbitol - 5 minutes 29C +1M sorbitol to 33C + *NO sorbitol - 15 minutes 29C +1M sorbitol to 33C + *NO sorbitol - 30 minutes Hypo-osmotic shock - 15 min dtt 000 min dtt-2 dtt 015 min dtt-2 dtt 030 min dtt-2 2.5mM DTT 015 min dtt-1 2.5mM DTT 005 min dtt-1 diauxic shift timecourse 13.5 h diauxic shift timecourse 15.5 h Hypo-osmotic shock - 30 min Hypo-osmotic shock - 5 min Hypo-osmotic shock - 45 min Hypo-osmotic shock - 60 min YP raffinose vs reference pool car-2 Diauxic Shift Timecourse - 0 h diauxic shift timecourse 9.5 h steady-state 1M sorbitol YP sucrose vs reference pool car-2 YPD 2 h ypd-2 YPD 4 h ypd-2 YAP1 overexpression galactose vs. reference pool car-1 glucose vs. reference pool car-1 mannose vs. reference pool car-1 sucrose vs. reference pool car-1 raffinose vs. reference pool car-1 steady state 21 dec C ct-2 steady state 25 dec C ct-2 steady state 29 dec C ct-2 steady state 33 dec C ct-2 aa starv 2 h Nitrogen Depletion 4 h YP galactose vs reference pool car-2 YP fructose vs reference pool car-2 YP glucose vs reference pool car-2 YP mannose vs reference pool car-2 Msn2 overexpression (repeat) 29 deg growth ct-1

�ubtelomeric genes bound by SBTFs

Supplemental Figure 3.5. Expression responses of subtelomeric genes to environmental perturbations. Expression data are from Gasch and colleagues76. Subtelomeric genes (columns) are aligned with clustered binding patterns from Figure 3.3b. Unnamed stress responsive ORFs from Figure 3.6 (panels b and c) are indicated by maroon ticks and include YNR068C, YML131W, YMR315W, YCR102C, and YLR460C. 74

Supplemental Tables Supplemental Table 3.1. List of telomere-related genes curated from the literature.

�� 75

Supplemental Table 3.2. Correlation of binding proﬁ les with the presence of genomic features. 76

Supplemental Table 3.3. Orthologs of SBTFs in other species.

Yeast Plant Worm Fly Fish Mammal otal orthologs

SBTF Ohnolog # Binding cluster (see Fig 3b) T

YGOB ¶ InParanoid §

# Ohnologs are deﬁ ned to be paralogous genes (in this case in S. cerevisiae) that arose from a whole genome duplication event141 ¶ Ortholog data from the Yeast Gene Order Brower141 § Ortholog data from InParanoid142 version 6.0 77

Acknowledgements

We are grateful to S. Bandyopadhyay for providing the double deletion strains used in this project; S. Jacobson for assisting with dilution assays; S. Bandyopadhyay, D. Gottschling, S. Jacobson, J. Karlseder, T. Ravasi, K. Tan, B. Trask, C. Workman, and C.H. Yeang for critical comments on the manuscript. This work was funded by grant ES14811 to T. Ideker from the NIEHS. T. Ideker is a David and Lucille Packard Fellow. Chapter 3, in full, has been submitted for publication as it may appear in Genome Research, 2008, Mak, H. Craig; Pillus, Lorraine; Ideker, Trey. The dissertation author was the primary investigator and author of this paper. Chapter 4: Reﬁ nement of a transcriptional regulatory network using graphical probabilistic models

Abstract

As genome-scale measurements lead to increasingly complex models of gene regulation, systematic approaches are needed to validate and reﬁ ne these models. Towards this goal, we describe an automated procedure for prioritizing genetic perturbations to optimally discriminate among alternative models of a gene regulatory network. Using this procedure, we evaluate 38 candidate regulatory networks in yeast and perform four high priority gene knockout experiments. The reﬁ ned networks support previously-unknown regulatory mechanisms downstream of SOK2 and SWI4.

78 79

Introduction

Recent advances in genomics and computational biology are enabling construction of large-scale models of gene regulatory networks. High-throughput technologies such as automated sequencing143, gene expression arrays144, chromatin immunoprecipitation24, and yeast two-hybrid assays21 each probe different aspects of the gene regulatory system through genome-wide data sets. These data have spawned a variety of methods to infer the structure of gene regulatory networks or to study their high-level properties, as recently reviewed145. Regulatory network models generated thus far in E. coli and yeast have been most often validated against functional databases or prior literature146,147. In contrast, only a few studies have attempted to validate or refi ne models systematically. However, if we are to accurately model large gene networks in complex organisms including fl y, worm, mouse, and human, automated procedures will be essential for analyzing the network, choosing the best new experiments to test the model, conducting the experiments, and integrating the resulting data. The problem of choosing the best experiments to estimate a model, termed experimental design or active learning, has been a signifi cant area of research in statistics and machine learning148-150. Automating the experimental design process can greatly accelerate data collection and model building, leading to substantial savings in time, materials, and human effort. For these reasons, many industries such as electronic circuit fabrication and airplane manufacturing incorporate experimental design as an integral step in the design process151,152. A promising application of experimental design for biological systems was presented by King et al.153, who integrated computational modeling and experimental design to reconstruct a small, well-studied metabolic pathway. Whether automated experimental design can be useful in a large and poorly-characterized biological system with noisy data remains 80 an open question. Recently, we reported a procedure to infer gene regulatory network models by integrating gene expression profi les with high-throughput measurements of protein interactions35. Here, we extend this procedure to incorporate automated design of new experiments. First, we use the previously-described modeling procedure to generate a library of models corresponding to different gene regulatory systems in yeast. Many of these models contain transcriptional interactions for which the regulatory effects (inducer vs. repressor) are ambiguous and cannot be determined from publicly- available expression profi les. Next, to address these ambiguities we implement a score function that ranks possible genetic perturbation experiments based on their projected information content over the models. We perform four of the highest-ranking perturbations experimentally and integrate the data back into the model. The new data support two out of three novel regulatory pathways predicted to mediate expression changes downstream of the yeast transcriptional regulator SWI4.

Results

Summary of physical regulatory models

We applied a previously-described network modeling procedure35 to integrate three complementary sources of gene regulatory information in yeast: 5558 promoter- binding interactions for 106 transcription factors measured using the ChIP-chip approach24; the set of all 15,116 pair-wise protein-protein interactions recorded in the Database of Interacting Proteins as of April 2004154; and a panel of mRNA expression proﬁ les for 273 individual gene deletion experiments29. Software for performing the network modeling procedure is available as a plug-in to the Cytoscape package11 on our Supplemental Website (http://www.cellcircuits.org/Yeang2005/). For each gene deletion experiment, the modeling procedure identiﬁ ed the most 81

[a] ��

�� Perturbation/Expression �� original knockout �� affected �� new knockout ��

��

�� Protein-DNA Protein-protein

�� inducer activator �� repressor inhibitor �� [b1] [b2] ��

��

Figure 4.1. Wiring diagrams for example network models. Two examples are shown: Model 0, showing regulatory pathways that have unique functional annotations [a]; and Model 1, showing regulatory pathways downstream of SWI4 and SOK2 with ambiguous functional annotations (several would be consistent with the observed expression responses: two possibilities are shown in [b1] and [b2]). In the models, a connection from gene a to b represents the experimental observation that [1] the proteins encoded by a and b physically interact in a protein- protein interaction (dotted links); or [2] the protein encoded by a binds the promoter of b (solid links). Each gene is either a knockout cause (red nodes), a differentially-expressed effect (tan nodes), or a signal transducer that was chosen for follow-up perturbation (gray nodes). Functional annotations (edge colors) are uniquely determined [a] or else multiple annotations are possible [b1, b2] based on the available data. Diagram layout is performed automatically using the Cytoscape package11. 82 probable paths of protein-protein and promoter-binding interactions that connect the deleted gene (the perturbation) to genes that were differentially expressed in response to the deletion (the effects of perturbation). Thus, a path represented one possible physical explanation by which a deleted gene regulates a second gene downstream. Based on the expression data, each interaction on a path was annotated with (1) its probable direction of information fl ow and (2) its probable regulatory effect as an inducer or repressor. For example, the model in Figure 4.1[a] (top center) includes a path from GLN3 through GCN4 to a block of downstream affected genes. This model integrates evidence that: (1) Gln3p binds the promoter of GCN4 with high signifi cance in a ChIP-chip assay [3] (p ≤ 8·10-4); (2) Gcn4 binds the promoters of many genes in the ChIP-chip assay (RIB5, YJL200C, and others in the downstream block); and (3) a signifi cant number of genes in the block are up-regulated in a gln3Δ knockout but down-regulated in a gcn4Δ knockout29. Together, these integrated data confi rm Gcn4p as an activator of downstream genes155 and lead to a (novel) annotation that Gln3p likely regulates GCN4 via transcriptional repression.

In total, the modeling process generated 4836 paths, each explaining expression changes for a particular gene in one or more knockout experiments. Of the 965 interactions covered by paths, 194 had regulatory effects that were uniquely determined by the data, while regulatory effects of the remaining 771 interactions were ambiguous. For example, Figure 4.1b includes ambiguous interaction paths through SWI4, SOK2, and MSN4 explaining the observation that many genes for which the promoters are bound by MSN4 are up-regulated in a swi4Δ knockout. This observation can be explained by several alternative annotations: one scenario is that SWI4 activates SOK2 and SOK2 represses MSN4 (Figure 4.1b1), while another is that SWI4 represses SOK2 and SOK2 activates MSN4 (Figure 4.1b2). These regulatory 83 annotations could be uniquely determined by measuring the expression changes of genes downstream of MSN4 in the model in response to a sok2Δ deletion and an msn4Δ deletion (see below).

Paths with ambiguous interactions were partitioned into 37 independent network models (numbered 1-37), where each model contained a distinct region of the physical network (see Methods). The remaining non-ambiguous paths were grouped into a single model (referred to as Model 0). As shown in Table 4.1, 21 of all models (55%) contained pathways that had been well documented in the literature or were signifi cantly enriched for genes belonging to specifi c MIPS14 functional categories. Of 132 protein-DNA interactions incorporated into Model 0, we found that 50 had been confi rmed in classical (low-throughput) assays as reported in the Proteome BioKnowledge Library156. Moreover, the inferred regulatory roles (induction or repression) for 48 out of 50 of these interactions agreed with their experimentally- determined roles (96%, binomial p-value < 1.22•10-7). Wiring diagrams for Models 0 and 1 are given in Figure 4.1; diagrams for all other regulatory network models are provided in Supplemental Materials and on the Supplemental Website (http://www. cellcircuits.org/Yeang2005).

Experiment selection

As shown in Figure 4.2, we implemented an information-theoretic approach to discriminate among ambiguous model annotations using the fewest additional gene expression experiments. All non-lethal single gene knockout experiments were ranked by their projected information content based on the inferred models (see Methods). Table 4.2 reports the list of top-ranking experiments. This list coincides roughly with biological intuition, in the sense that informative target genes typically encode proteins 84

Alternate network models Candidate single 1 2 3 N gene knock-outs

B ... C D

E F G

Knock-out priority scoring: (B, D, C, E, G, F)

Run top priority microarray experiments: (B, D)

Re-assembly . . . . . Remaining . consistent models Validation

Figure 4.2. Schematic of the experimental design approach. The input to the approach is a set of alternative representations of a gene regulatory model, each of which is equally likely given current expression data. In the present work, the alternatives arise due to ambiguities in the regulatory roles of interactions in the model as inducers or repressors of downstream genes. Next, a scoring procedure is used to rank candidate perturbations according to their expected information gain over the model alternatives. High-ranking perturbations are applied to the system and characterized using gene expression microarrays. The resulting expression proﬁ les validate or invalidate particular connections in the model and reduce the set of model alternatives to those that are consistent with both old and new expression measurements. 85 that are network “hubs”, each having a large number of regulatory interactions with downstream genes in the models. However, as discussed later, knocking out “hubs” only is not as effective as using the information-theoretic criteria.

Among the highest priority experiments, Model 1 (Figure 4.1b) was the most often targeted, containing 3 of the top 10 highest scoring genetic perturbations: sok2Δ, yap6Δ, and msn4Δ. A fourth perturbation to Model 1, hap4Δ, was also highly ranked (rank 34). Therefore, Model 1 was chosen for further experimentation.

Model validation

Knockout strains corresponding to the high-ranking perturbations sok2Δ, yap6Δ, hap4Δ, and msn4Δ were grown in quadruplicate under conditions identical to those for the initial 273 knockouts by Hughes et al29. Gene expression proﬁ les were obtained for each knockout culture versus wild-type using yeast-genome microarrays. We sought to test the three regulatory cascades leading from SWI4 to SOK2 to either MSN4, HAP4, or YAP6 (Figure 4.1b). ToTo verify these cascades independently of the model, we analyzed the expression patterns of gene sets known to be directly regulated by MSN4, HAP4, or YAP6 (obtained from the Proteome BioKnowledge Library156; see Supplemental Table 4.4).

To normalize between our microarray procedures and those of Hughes et al., we also repeated the original swi4Δ expression profi le, and fi ltered the above sets to select only those genes with expression changes that were reproducible (i.e., same direction of change) between the Hughes et al. swi4Δ profi le and our new profi le. Expression changes were reproducible for 28 of 42 Msn4-regulated genes, 11 of 29 Hap4-regulated genes, and 64 of 119 Yap6-regulated genes. Expression similarity among the genes in each fi ltered set was captured formally in a measure called 86

“coherence”; details about the computation of expression coherence and the selection of the gene sets are described further in Methods and the Supplemental Website (http:// www.cellcircuits.org/Yeang2005).

As shown in Figure 4.3a, the gene set downstream of MSN4 showed coherent

Msn4-regulated genes [a] 2.0

1.0

0.0 swi4� sok2� hap4� msn4� yap6�

Coherence -1.0

-2.0

Hap4-regulated genes [b] 2.0

1.0

0.0 swi4� sok2� hap4� msn4� yap6�

Coherence -1.0

-2.6 -2.0

Yap6-regulated genes [c] 2.0

1.0

0.0 swi4� sok2� hap4� msn4� yap6�

Coherence -1.0

-2.0

Unrelated control (Msn1-regulated genes) [d] 2.0

1.0

0.0 swi4� sok2� hap4� msn4� yap6�

Coherence -1.0

-2.0

[e] Refined model ��

��

Figure 4.3. Validation and reﬁ nement of Swi4 transcriptional cascades. Yeast genome microarrays were used to explore three transcriptional cascades from Model 1 involving the transcriptional regulators SWI4, SOK2, and either [a] MSN4, [b] HAP4, or [c] YAP6. Bar charts show the expression coherence of genes regulated by MSN4, HAP4, or YAP6 in knockout strains swi4Δ, sok2Δ, msn4Δ, hap4Δ, and yap6Δ. Coherence scores more extreme than ±0.7 are signiﬁ cant (p < 0.01; dotted lines). Results are also shown for genes bound by MSN1 [d] as representative of an unrelated model not targeted by these perturbations. This analysis provides validation for the MSN4 and HAP4 pathways and disambiguates the role of each pathway interaction as activating (SWI4 interactions) or repressing (SOK2 interactions) downstream genes [e]. The YAP6 pathway hypothesis is not supported by this analysis. 87 up-regulation in the swi4Δ (p ≤ 10−4) and sok2Δ (p ≤ 10−4) knockouts, but down- regulation in the msn4Δ (p ≤ 8•10−4) knockout. This result supports the existence of a regulatory cascade leading from SWI4 to SOK2 to MSN4. Furthermore, in the context of the present regulatory cascade, MSN4 appears to be an inducer since its downstream gene set was down-regulated in the msn4Δ experiment. In contrast, SOK2 appears to be a repressor of MSN4 since a sok2Δ deletion experiment up-regulates the same set of genes. Finally, SWI4 appears to be an inducer of SOK2 since the swi4Δ knockout has the same effect as sok2Δ (i.e., up-regulation).

Results were qualitatively similar for the HAP4 pathway (Figure 4.3b). The gene set downstream of HAP4 was up-regulated in the swi4Δ (p ≤ 10−2) and sok2Δ (p ≤ 9•10−4) knockouts but down-regulated in hap4Δ (p ≤ 10−4). These results suggest that swi4Δ, sok2Δ, and hap4Δ deletions affect the set of genes immediately downstream of HAP4, supporting the SWI4-SOK2-HAP4 regulatory pathway hypothesis.

In contrast to the MSN4 and HAP4 pathways, the gene set downstream of YAP6 had insigniﬁ cant responses to all follow-up knockout experiments (Figure 4.3c). Thus, the existence of the SWI4-SOK2-YAP6 regulatory pathway was not supported by our validation experiments.

Automated model reﬁ nement

We used our modeling procedure to construct a new physical network model using the original 273 knockout gene expression experiments of Hughes et al. combined with the new sok2Δ, hap4Δ, msn4Δ, and yap6Δ proﬁ les. Overall, 60 protein-DNA interactions were disambiguated by our data: 50 interactions were resolved as deﬁ nite inducers or repressors, whereas ten interactions were removed 88 from the model because the expression of downstream genes did not change as a result of the knockout. In the updated Model 1, MSN4 and HAP4 were unambiguously annotated as inducers of downstream genes, SOK2 was annotated as a repressor of MSN4 and HAP4, and SWI4 was annotated as an inducer of SOK2 (Figure 4.3e). These results agree with our previous manually-derived annotations (see Model Validation above).

Learning curve analysis

We quantifi ed the effi ciency of our information-based approach by comparing it to two other methods of prioritizing gene knockout experiments: prioritizing hubs and prioritizing genes randomly. First, we generated a “reference” model by fi xing each ambiguous interaction in Models 1-37 to be an inducer or repressor. Assignments were chosen arbitrarily from the set of annotations that were consistent with the original knockout data. Next, we used each method (information, hub, or random) to iteratively “learn” these assignments. In each iteration, we selected the highest- priority knockout experiment, simulated the resulting expression changes (up/down) using the “reference” model, updated the inferred model, and recorded the number of ambiguous interactions that were resolved. This iterative learning procedure was repeated 100 times.

As shown in Figure 4.4, the mutual information criterion signiﬁ cantly outperformed hub-based and random selection. The learning curves also provide an estimate of the number of additional experiments needed to reduce model ambiguity below a given level. For example, using the information-based score, ten knockout experiments are needed to reduce the number of ambiguous interactions by 50%. In contrast, over 25 experiments are needed according to the hub-based method. Figure 89

4.4 suggests that performing 40 additional experiments selected using the information- based score will clarify the regulatory roles of about 70% of the ambiguous interactions. The learning rate of the ﬁ nal 30% becomes very slow because these interactions are isolated in the physical network, unconnected to others, and thus require separate knockouts to decipher each of them.

1000

900

800

700

600

500

400 information

# ambiguous interactions remaining hub 300 random

200 0 5 10 15 20 25 30 35 40 # simulated knockout experiments used to refine model

Figure 4.4. Simulated learning curves of three experimental design methods. Three different methods of selecting experiments are compared: mutual information scores (triangles), hub selection (circles), and random selection (squares). We performed 100 simulated trials and show the average number of ambiguous interactions remaining in the inferred model after each simulated knockout experiment. Vertical bars indicate the standard deviations for the random selection method. The standard deviations for the information and hub selection curves are less than ﬁ ve and are not shown for clarity.

Discussion

We have used global expression profi les to validate models of transcriptional regulation inferred from protein-protein interactions, genome-wide location analysis, and expression data. A previously described network inference algorithm35 identifi es probable paths of physical interactions connecting a gene knockout to genes that are differentially-expressed as a result of that knockout. The proposed validation strategy uses information gain as a criterion for choosing optimal knockouts to profi le using microarray experiments. This strategy agrees with intuition, in that optimal knockouts typically target intermediate genes along the pathways under consideration. 90

If an intermediate gene knockout fails to affect downstream genes in a pathway, that pathway is removed from the model.

The validated pathways point to a combination of previously documented and novel fi ndings. First, in agreement with previous literature, we confi rm that MSN4 and HAP4 are inducers157,158 and that SOK2 is a repressor159. For instance, SOK2 is known to act downstream of PKA to repress genes involved in stress response, glycogen storage, and pseudohyphal growth159. However, although SOK2 is thought to control these pathways via a transcriptional cascade, the components of this cascade have remained unclear. Here, we provide evidence for a model in which SOK2 acts as a negative regulator upstream of MSN4 and HAP4. Interestingly, MSN4 has been shown to activate stress response genes158, and HAP4 has been shown to activate genes involved in energy conservation and oxidative carbohydrate metabolism157. Thus, we have identifi ed a candidate model for the transcriptional cascade downstream of PKA signaling that mediates stress response. This model includes two novel regulatory pathways from SWI4 to SOK2 to MSN4 and from SWI4 to SOK2 to HAP4. The validation experiments do not support the third predicted pathway from SWI4 to SOK2 toYAP6.

In model simulations, choosing new gene knock out experiments with an information-theoretic approach signiﬁ cantly outperformed both random and hub-based selection. It also outperformed the observed experimental results: approximately 280 interactions were disambiguated after four simulated knockouts (Figure 4.4), whereas only 60 interactions were resolved due to the four actual knockouts sok2Δ, hap4Δ, msn4Δ, and yap6Δ. This difference in performance stems from key differences between the simulated and actual scenarios. In simulation, the four experiments are performed independently and iteratively, selecting the absolute highest ranking knockout each time. In the actual study, four high-ranking experiments (but not the 91 highest) are chosen to interrogate and maximally resolve a single pathway model, resulting in experiments that are highly co-dependent and performed simultaneously without intervening rounds of inference and experimental design. In addition, the simulation assumes that all interactions in the model are correct, along with one of the initial sets of inducer/repressor annotations. It therefore isolates the process of learning regulatory role annotations, whereas the actual procedure also serves to distinguish interactions as true- versus false-positives. Nevertheless, the simulation provides a useful comparison of experimental design methods relative to each other.

An important limitation of the single-gene knockout approach is that single perturbations do not identify pathway intermediates for which loss of function can be compensated by another gene. Furthermore, our approach may not identify regulatory pathways in which several transcription factors independently activate gene expression. Applying knockouts in combination may prove fruitful in these cases. For instance, approximately 4000 double knockouts have been reported in yeast that lead to synthetic lethality, i.e. a lethal phenotype that is not observed in either of the single knockouts individually160. These interactions suggest regulatory relationships which could be incorporated in future work.

Conclusions

Scientiﬁ c discovery is an iterative process of building models to explain experimental observations and validating models with new experiments161. Experimental design is the essential link between these two aspects. Here, we have explored a framework for modeling transcriptional networks, in which experimental design and validation are central features. This framework is based on computational analysis and expression microarrays, both of which are amenable to automation, suggesting a high-throughput strategy for mapping gene regulatory pathways. 92

Methods

Model building and inference

Experiment scoring

We calculated the expected information gain for each of the 4756 possible non-lethal single-gene deletion experiments that were not included in the set of 273 deletions used to generate our network models. Intuitively, information gain measures

(the logarithm of) the number of ambiguous annotations in the model that are likely to be determined after generating a yeast-genome expression profi le in response to a particular gene deletion under consideration. Each gene deletion experiment predicts a distinct expression profi le given a particular confi guration of model annotations. Experiments with high information gain are those for which the predicted expression profi les are highly variable over the set of possible annotations. In these cases, only one (or at most a few) of the predicted profi les will match the true observed profi le, 93 effi ciently constraining the space of possible model annotations.

The information gain discussed above arises from the expected value of information calculations in statistical decision theory149. Here we describe the score more directly in terms of reduction of model entropy. The entropy of a set of ambiguous model annotations is given by:

H (M )= −∑ P(M = m) log2 P(M = m) m The expected information gain is the difference between the entropies before and after a hypothetical experiment:

I (M ;Y e )= H (M ) − H (M |Y e ) = H (M )+ P M = m,Y e = y log P M = m |Y e = y ∑ ( ) 2 ( ) where Ye denotes the vector of predictedm,y expression changes for each gene in the model under experiment e. The conditional entropy H(M | Ye) requires us to consider all possible models and corresponding outcomes resulting from experiment e. Direct enumeration of all values of M and Ye is impractical; instead, we make several simplifying approximations as described on the Supplemental Website (http://www. cellcircuits.org/Yeang2005).

Expression proﬁ ling

Expression profi ling experiments were based on the wild-type diploid BY4743 and homozygous gene knock-out strains derived from this parent5 (Invitrogen), with cultures grown identically to Hughes et al.29. Labeled cDNA from each gene knockout strain was co-hybridized versus wild type cDNA in quadruplicate two-color microarray hybridizations. Total RNA was isolated by hot acid phenol extraction, purifi ed to mRNA (Ambion PolyAPure kits), and labeled with Cy3 or Cy5 by 94 direct incorporation (Amersham CyScribe First-Strand cDNA Labeling Kit). DNA microarrays were spotted from the Yeast Genome Oligo Set v1.1 (Qiagen) on Corning UltraGAPS slides using an OmniGrid® 100 robot (Genomic Solutions). Lyophilized Cy3- and Cy5-labeled samples were resuspended in 50μL buffer (5X SSC, 0.1% SDS, 1X Denhardt’s solution, 25% formamide) and co-hybridized at 42°C beneath a coverslip for 15 hours. Arrays were imaged at 10μm resolution using a ScanArray Lite instrument (PerkinElmer). Raw quantitated background intensities were smoothed using a 7x7 median fi lter, separately for the Cy3 and Cy5 channels, and data were corrected for cyanine-dye dependent bias using a Qspline normalization82. The VERA/ SAM package83 was utilized to assign a log-likelihood statistic λ with each gene, indicating its signifi cance of differential expression in each experiment. Microarray expression data are deposited in the ArrayExpress database162 under accession numbers A-MEXP-217 (Arrays) and E-MEXP-351 (Experiments).

Expression coherence

The expression coherence of a set of genes measures whether the expression levels of these genes behave similarly in a particular experiment. Each gene i in gene- deletion experiment e has an expression ratio rie (versus wild-type) and associated p- value pie of differential expression. First, we ﬁ lter out insigniﬁ cant expression changes with a p-value > 0.5. Then, we use the inverse Gaussian cumulative distribution function, Φ−1, to convert each remaining p-value into a z-score28,83:

−1 zie = Φ (1−pie)

Next, we compute a “signed z-score” by multiplying z by +1 if the expression level is increasing and by −1 if it is decreasing. The average signed z-score for a gene subset of size N is computed as: 95

1 N Ze = ∑ ∂(zie > 0)sgn(rie )zie N i=1 Gene sets with expression changes that are signiﬁ cant and in the same direction result in large Z-values. A distribution of Z values obtained from random gene sets of size N was used to determine a p-value for each expression coherence score. 96

Tables Table 4.1. Internal validation for 21 of the 38 inferred models. The number of genes and variants are shown for each model along with the results of our preliminary validations. Each variant corresponds to a distinct set of functional annotations on the interactions in the model (directions and regulatory effects: see text). For Model 0, the expression data implied a unique set of annotations; for all other models multiple sets of annotations were possible. Each model was validated if its pathways were (wholly or partially) cited in previous studies or its downstream genes were signiﬁ cantly enriched for MIPS functional categories (p≤0.05; hypergeometric test with Bonferroni correction).

Model #Genes #Variants Validated literature pathway Enriched MIPS functions 0 130 1 Kss1/Fus3-Ste12 cell fate (1.48×10-7); (mating response and metabolism (0.0067) ﬁ lamentous growth) 1 69 8 Sok2-Msn4 (PKA pathway) 2 63 16 Tup1-Hhf1 (histone regulation) protein synthesis (7.13×10-8) 3 44 2 Tup1/Ssn6-Nrg1 transport (1.05×10-5); (glucose metabolism) metabolism (5.41×10-4) 4 58 8 Tup1/Ssn6-α2/Mcm1 cell fate (1.12×10-5); (mating response) 5 58 4 Rpd3-Abf1 (histone modiﬁ cation) 6 44 2 Swi4-Ndd1-Ace2 (cell cycle) 7 26 4 cell cycle (0.035) 8 36 8 Slt2-Rlm1/Swi4 (PKC pathway) 10 45 16 Med2-Gal4/Gcn4 (general transcription) 15 13 4 Cmd1-Cna1-Skn7 (calcium signaling) 19 9 4 cell defense (6.33×10-6) 23 17 2 metabolism (1.49×10-6); energy (0.04) 26 8 8 cell defense (9.62×10-5) 29 9 4 Yap1-Cad1 (metal response) 30 12 4 Med2-Srb6-Gal4 (general transcription) 32 12 4 Med2-Gal11-Gal4 (general transcription) 33 12 4 Med2-Srb5-Gal4 (general transcription) 34 9 4 Ste12-Mcm1 cell fate (4.55×10-8); (mating response) homeostasis (0.0012); cell communication (0.0345) 36 7 4 metabolism (0.0258) 40 5 2 metabolism (0.0017) 97

Table 4.2. Top-ranking knock-out experiments proposed for model discrimination. Each proposed target gene is reported, along with its function, mutual information score, rank, and the model(s) it informs. All target genes are non-lethal in rich media. A ‘*’ indicates gene knockouts selected in this study.

Gene Function Score Downstream Rank genes HHF1 Histone 52.1429 74 1 2 regulator for meiosis and SOK2* 45.0279 64 2 1 PKA pathway CKA1 protein kinase of cell cycle 45.0075 64 3 5 A2 mating response 40.9023 58 4 4 YAP6* stress response regulator 35.1652 50 5 1, 3 regulator of glucose NRG1 31.6501 45 6 3 dependent genes FKH1 regulator of cell cycle 29.1194 41 7 2 FKH2 regulator of cell cycle 26.7131 38 8 7 protein kinase of cell wall SLT2 23.4727 31 9 8 integrity pathway MSN4* regulator of stress response 21.8224 31 10 1 … … … … … … regulator of cellular HAP4* 6.3310 9 34 1 respiration 98

Supplemental Tables

Supplemental Table 4.1. Internal validation for 17 out of 24 restricted network models.

Model # Genes Validated literature pathway Refs. Enriched MIPS Function

Kss1/Fus3-Ste12 cell fate (1.48x10-7); 0 109 (mating response and ﬁ lamentous 163 metabolism (0.0067) growth)

Sok2-Msn4 1 69 158 (PKA pathway)

Tup1/Ssn6-Nrg1 transport (1.05x10-5); 2 44 164 (glucose metabolism) metabolism (5.41x10-4)

Tup1/Ssn6-α2/Mcm1 3 58 165 cell fate (1.12x10-5) (mating response)

Swi4-Ndd1-Ace2 4 44 166 (cell cycle)

Slt2-Rlm1/Swi4 6 36 167 (PKC pathway)

Cmd1-Cna1-Skn7 7 13 168 (calcium signaling)

8 9 cell defense (6.33x10-6)

Med2-Gal11-Gal4 10 12 169 (general transcription)

12 7 metabolism (0.026)

13 17 metabolism (1.49x10-5)

Gcn4-Met4 14 6 metabolism (2.88x10-4) (methionine synthesis)

cell fate (4.55x10-8); Ste12-Mcm1 16 9 homeostasis (0.0012); (mating response) cell communication (0.0345)

Cka2-Abf1 17 13 (casein kinase pathway)

20 5 metabolism (0.0017)

Cka2-Abf1 22 8 (casein kinase pathway)

Cka2-Abf1 25 7 (casein kinase pathway) 99

Supplemental Table 4.2. Correlations between swi4Δ and gcn4Δ data from Rosetta and the new experiments.

swi4Δ total gcn4Δ total correlation -0.013 0.34 correlation p-value 0.84 < 10-4 rank correlation -0.058 0.145 rank correlation p-value 1.0 < 10-4 hyper-geometric p-value 0.188 4.88×10-21

swi4Δ subset gcn4Δ subset correlation 0.344 0.831 correlation p-value 8×10-5 < 10-5 rank correlation 0.353 0.705 rank correlation p-value 9×10-5 < 10-5 hyper-geometric p-value 1.05×10-5 < 1.8×10-13 100

Supplemental Table 4.3. Restricted subsets used to evaluate the reproducibility.

Swi4 subset

AAP1 LYS20 YAP6 YJL225C ATR1 MNN1 YBL029W YJR078W BAT2 MSN4 YBL112C YKL051W CCP1 NCE102 YBL113C YLR084C CLB1 PCL7 YDR451C YLR463C CLB2 PDR12 YDR545W YLR465C CLB6 PRY2 YEL045C YLR466W COX9 QCR2 YEL047C YLR467W CPA2 RNR1 YEL077C YML133C CSI2 RPI1 YER045C YNL339C CUP9 RPL19B YER189W YNR067C ECM13 SAT2 YER190W YOL011W ECM33 SCW10 YFR006W YOR248W ERG6 SGA1 YGR086C YOR315W EXG1 SNQ2 YGR153W YOX1 HAP1 SOK2 YGR296W YPL267W HAP4 SRL1 YHL049C YPL283C HSC82 SVS1 YHR048W YPR203W HTA1 SWI4 YIL056W YPS4 HTB1 TRP4 YIL177C ISU2 UTR2 YJL217W

Gcn4 subset

ADH5 HAD1 UGA3 APG1 HAL2 YER069W ARG1 HIS1 YER073W ARG11 HIS4 YGL184C ARG3 HOM3 YHM1 ARG7 LEU4 YHR122W ARG8 MET16 YHR162W ARO1 PCL5 YJL200C ARO3 RIB5 YLR152C CPA2 TRP2 YMC1 GCN4 TRP3 YOL119C 101

Supplemental Table 4.4. Gene sets for external validation.

Msn4 subset

ALD3 LAP4 ARA1 PGM2 CTT1 RAS2 YMR173W SOD2 DOT5 SOL4 GLK1 SPS100 GLO1 SSA4 GRE3 TKL2 GSY1 TPS1 HOR2 TTR1 HSP12 YDL124W HSP26 YKR011C HSP104 YMR315W HXK1 YNL200C

Hap4 subset

ATP3 COX13 ATP7 QCR7 ATP16 TUF1 COX4 QCR10 COX7 YLR294C COX12 102

Acknowlegements

We gratefully acknowledge Owen Ozier, Ryan Kelley, and Rowan Christmas for their valuable assistance with model visualization, and Julia Zeitlinger for commenting on our manuscript. CM, CW and SM were supported by NIGMS grant GM070743-01and NSF grant CCF-0425926. TI was supported by a David and Lucille Packard Fellowship award. CY and TJ were supported in part by NIH grant(s) GM68762 and GM69676.

Chapter 4, in full, is a reprint of material as it appeared on July 1 in Genome Biology, 2005, 6:R62. Yeang, Chen-Hsiang; Mak, H. Craig; McCuine, Scott; Workman, Christopher; Jaakkola, Tommi; Ideker, Trey. The dissertation author was one of two primary investigators and authors (with C.H. Yeang) of this paper. Chapter 5: Cataloging network models in a searchable online database

Abstract

CellCircuits (http://www.cellcircuits.org) is an open-access database of molecular network models, designed to bridge the gap between databases of individual pair-wise molecular interactions and databases of validated pathways. CellCircuits captures the output from an increasing number of approaches that screen molecular interaction networks to identify functional subnetworks, based on their correspondence with expression or phenotypic data, their internal structure, or their conservation across species. This initial release catalogs 2019 computationally- derived models drawn from eleven journal articles and spanning fi ve organisms (yeast, worm, fl y, Plasmodium falciparum, and human). Models are available either as images or in machine-readable formats and can be queried by the names of proteins they contain or by their enriched biological functions. We envision CellCircuits as a clearinghouse in which theorists may distribute or revise models in need of validation and experimentalists may search for models or specifi c hypotheses relevant to their interests. We demonstrate how such a repository of network models is a novel systems biology resource by performing several meta-analyses not currently possible with existing databases.

103 104

Introduction

Presently, a great deal of biological information is represented as interactions between molecules. This information includes physical interactions that occur among proteins, DNA, RNA, and small molecules (e.g.38,70,170); genetic interactions such as synthetic lethality, enhancement, or suppression171; and interactions due to co-expression172 or co-citation173. Modern analyses of interaction data typically accomplish two goals. The ﬁ rst goal is to clean the data, by ﬁ ltering erroneous interactions that can be associated with high-throughput screens (false positives, e.g.34,174) or by predicting new interactions that may have been previously missed (false negatives, e.g.175,176). The second goal is to organize the interactions into biological network models — that is, collections of interactions hypothesized to work together towards a particular cellular function or within a common pathway177-179. Interaction analysis is currently supported by two types of available databases (Figure 5.1). First, the raw material for analysis is provided by databases of molecular interactions including the Database of Interacting Proteins16, the Munich Center for Information on Protein Sequences14, the Biomolecular Interaction Network

Databases of pathways

D.I.P. ell Circuits CellMap

Databases of molecular interactions

Increasing certainty and biological relevance

Figure 5.1. The need for a new type of database. The CellCircuits database is positioned between raw molecular interaction databases (left) and databases of rigorously validated cellular pathways (right). Interaction database icons represent (clockwise from top left) the Database of Interacting Proteins (DIP16); the General Repository of Interaction Datasets (GRID117); Molecular INTeractions Database (MINT203); the IntAct molecular interactions database13; the interaction database at the Munich Information Center for Protein Sequences (MIPS14); and Biomolecular Interaction Network Database (BIND12). Pathway database icons represent Reactome50; Signal Transduction Knowledge Environment (STKE54); Gene Networks database (GeNet53); Biocarta (http://www.biocarta.com/genes); Kyoto Encyclopedia of Genes and Genomes (KEGG204); and CellMap (http://cellmap.org). 105

Database12, the BioGRID15, and IntAct13. Many of these databases provide confi dence scores with each measured and predicted interaction. Second, there are a growing number of so-called pathway databases, in which canonical diagrams of metabolic, signaling, or regulatory pathways have been hand-curated from review articles and textbooks. Metabolic pathways are the focus of Reactome50, MetaCyc51 and the Kyoto Encyclopedia of Genes and Genomes52, while databases such as BioCarta (http:// www.biocarta.com/genes) , CellMap (http://cellmap.org), the Signal Transduction Knowledge Environment54, GeNet53, and TransPATH180 are primarily concerned with signaling and transcription. All of these pathway databases are relevant to the second and perhaps ultimate goal of interaction analysis — models of well-defi ned and well- validated functional relationships among genes, proteins, and/or metabolites. Automatic inference of accurate and detailed molecular pathways, however, is well beyond the capability of current interaction analyses and integrative modeling approaches. While current approaches attempt to place interactions into subnetworks according to their putative function177-179, such subnetworks are hypothetical in nature and thus inappropriate for entry into any of the existing databases of canonical pathways. Rather, the subnetwork models produced by automated approaches are typically embedded in fi gures, tables, or supplementary information in the primary published literature. Although it is certainly possible to read about the models, there are several problems with this traditional method of dissemination. First, the size and number of models from even a single publication can be overwhelming, making models relevant to a particular gene or function diffi cult to locate. Second, in many cases, network modeling papers target bioinformatic, rather than biological or medical, audiences. As a result, the models remain largely inaccessible to those who have the most knowledge to interpret them and the most to gain from their successful interpretation. 106

Recent opinion articles181,182 have recognized a related problem for the case of protein functional predictions, calling for a clearinghouse of hypotheses generated by bioinformatics analyses and searchable by experimental biologists. In the same vein, the BioModels Database49 has recently been adopted as a working repository for simulations of kinetic quantitative systems based on ordinary differential equations. Subnetworks inferred from genome-scale data, however, do not fall into this category. Motivated by these considerations, we have designed CellCircuits as an open-access general repository of models distilled from protein networks. By aggregating models derived from many separate studies into a single resource, CellCircuits bridges the gap between databases of individual pairwise interactions and fully curated, biologically-validated pathway models. The CellCircuits database enables experimentalists to readily access and cross-reference models across multiple publications. It also enables the meta-analysis of the entire set of models to reveal inter-model relationships and to answer global questions: for instance, which models overlap in terms of the genes and/or cellular processes represented? How novel is a new result given the models that are already present in the database?

Results

A spectrum of network models

To date, interactions have been organized by searching for essentially three types of subnetworks: linear paths of interactions, interaction clusters, or parallel clusters. Representative models of each type are shown in Figure 5.2. Linear (or branching) paths of interactions have been used to represent biological pathways such as metabolic processes or regulatory cascades183-185 (Figure 5.2a). Clusters in an interaction network are regions of dense interconnections and are suggestive of functional protein complexes23,28,62,186,187 (Figure 5.2b). Parallel clusters are two (or 107 more) similar network clusters in which the proteins in one cluster are, in some way, associated with the proteins in the other cluster. Parallel clusters have been used to represent protein complexes conserved across species188-190 (Figure 5.2c), in which pairs of proteins spanning the two clusters are orthologs associated by sequence- similarity relationships. They have also been used to identify the physical basis for genetic interactions191 (Figure 5.2d), in which two protein interaction clusters are linked by many genetic interactions if the clusters perform redundant or synergistic cellular functions.

a d MATING_TYPE CTF18

MFA1

STE2 AGA2 MFA2 CTF8 DCC1 STE6 MFALPHA1

FAR1 TUP1 BAR1 MFALPHA2

GPA1 AGA1 FUS1 TEC1 STE12 STE4 STE50 MCM1 KSS1 SIN3 SWI1 STE3 MRE11 RAD50 FUS3 STE18 SST2 STE5 STE7 SNF2 SAG1

STE11 STE20

KAR3 XRS2

b YPL014W c

SIC1 CLN3 SWE1 CLB6

CLB2

CDC28 CKS1 CLB5

CLB1

CLN2

CLB4 CLB3 CLN1

Figure 5.2. Representative network models stored in CellCircuits. (a) A collection of linear regulatory pathways downstream of mating-type locus in yeast184 (b) An interaction cluster of co-expressed proteins suggestive of a functional complex23 (c) Parallel clusters conserved between Plasmodium falciparum and yeast190 (d) Parallel clusters that are highly connected by genetic interactions191.

Finally, integrating the interaction network with external data, such as gene expression profi les and other molecular states, has also been a key methodology used to identify signifi cant subnetworks. For instance, these approaches have been used to fi nd protein interaction clusters that exhibit coherent expression changes in response 108 to panels of perturbations28,62 or as a function of the cell cycle23. Other works192 have reported network “motifs”, defi ned as patterns of interactions that occur more often in the network than expected by chance. However, these approaches (by design) focus on general patterns rather than subnetworks of particular proteins. Therefore, they are not considered here.

Database coverage and assembly

This CellCircuits initial release (version 1.0) was designed as proof-of-principle of the value of a searchable database of network models. We focused on providing a clear database interface and representative, albeit incomplete, coverage of the types of network models possible. For version 1.0, the database includes models from eleven publications, spanning linear, clustered, or parallel subnetworks, with priority given to publications with models available in both graphical representations and machine-readable formats (Table 5.1). Graphical representations of network models are a particularly valuable method of disseminating interactions and/or pathways, in much the same way that DNA sequence logos193 are used to visualize position-specifi c score matrices of DNA binding motifs. Conversely, machine-readable formats, such as SBML48, BioPax47, or the Cytoscape SIF format11, greatly facilitate database entry, model curation, and subsequent computational analysis. Four publications provided models in both graphical and machine-readable formats185,189-191. For the remaining seven, models were manually curated from published fi gures23,28,62,183,184,186,188. Manual curation involved downloading fi gures containing each network model, and then transcribing the genes and interactions in the models into a machine- readable format. For most publications, one fi gure, or each subpanel in a fi gure, contained a single network model. However, in three publications23,184,188 the fi gures contained multiple, unconnected networks that were not divided by the authors into 109 separate subpanels. In these cases, each unconnected component was entered as one model in CellCircuits, and in one case, networks were further subdivided into smaller models if they contained several sparsely connected, but functionally annotated, clusters of proteins (see Supplemental Table 5.1). These curation activities resulted in a total of 2019 protein network models in the database. Models in the database include protein interactions from fi ve organisms: yeast (Saccharomyces cerevisiae; 91% of all models), fl y (Drosophila melanogaster; 58%), nematode worm (Caenorhabditis elegans; 27%), a malarial parasite (Plasmodium falciparum; 2%), and human (2%; these percentages total >100% due to cross-species comparisons covering multiple species in a single model). The models include up to four types of interactions (protein-protein, protein-DNA, genetic, and metabolic) as well as two types of external data (gene expression and gene deletion phenotypes).

Network model query

Models in the CellCircuits database are queried through a web-based interface. In the simplest use case, entering a standard gene name (e.g. RAD9) into the search fi eld will return all models containing that gene. Wild-card searches are permitted (e.g., RAD* will search for models containing any gene with a name that begins with the letters RAD. See Figure 5.3). All gene queries are also checked against a list of gene name synonyms, which are drawn from the latest release of the Gene Ontology (GO) database194. In addition, searches can be limited to models from specifi c publications or to models containing genes from specifi c organisms. Searches based on gene function are also supported. The CellCircuits database automatically scores all models for GO functional enrichment using the hypergeometric test (see Methods). Such tests had been originally applied in only 110 three of the 11 curated publications. The enrichment results are stored with each model in the database as meta-data, allowing users to search for models that are enriched for genes having a particular annotation. For example, some of the same models retrieved by searching for RAD9 can also be retrieved by searching for GO annotations associated with this gene. Queries may include exact GO ID numbers (e.g., GO:0006974) or partial or complete GO term names (e.g. “DNA damage” or “integrity checkpoint”; these must be enclosed in double quotes). More than one gene, GO annotation, or wild-card may be included in a query. If a model matches multiple search terms, it will be ranked higher in the results. All search results include graphical representations of the models, links to the original publication, the organism(s) modeled, the genes or GO annotations from the search query that were found in each model, and the hypergeometric p-value of enrichment for any GO annotations (Figure 5.3).

2 9

8 4

5 6

Figure 5.3. Web interface (http://www.cellcircuits.org). Results using RAD* and “DNA binding” as the search query (circle 1). A total of 274 subnetwork models are returned. The search output includes a graphical representation of the model (circle 7), the genes and GO terms from the model that match the query (circle 6), alternative gene names or synonyms matching the query (circle 9), the total number of matches (circle 8), enriched GO terms (circle 5), and a link to view similar models (circle 4). 111

Meta-analysis of models

Collecting published network models within a single database allowed us to survey the state of computational analysis of large interaction datasets. Scoring all models for GO functional enrichment (described in the previous section) is an example of such analyses. Another example, the observed sizes of models from all eleven publications, is shown in Figure 5.4a. On average, the 2019 models in the database

a b Kelley [41] Number 3000 30 of models Sharan [39] All others No. 20 400 of genes 10 0 300 300 50 150 250 Models spanned

200 200

100 100

0 0 0 10 20 30 40 50 0 10 20 30 40 50 Model size (No. of genes) Models spanned c de Lichtenberg (2005) [34] Begley (2002) [33]

Gandhi (2006) [38]

Haugen (2004) [35] Ideker (2002) [36] Kelley (2005) [41]

0.05 - 0.25

Yeang (2005) [32] 0.25 - 0.50 Sharan (2005) [39] S > 0.50

Bernard (2005) [30] Hartemink (2002) [31] S = Suthram (2006) [40] |A| + |B|

Figure 5.4. Meta-analysis of models. [a] Histogram of the number of genes or proteins per model. [b] Histogram of the number of genes (y-axis) that are contained in a given number of models (x-axis). The inset is an expanded view of the genes that span over 50 models. [c] Overlap between network modeling publications. Thicker lines represent greater similarity between the sets of models published in two publications (see legend). Similarity is measured by the number of distinct models that share one or more interactions (yeast interactions only) divided by the total number of models in both publications. Interactions are shared between almost every pair of publications, but for clarity, similarity scores less than 0.05 are not shown. 112 contained ~18 proteins and 36 interactions with 95% of models containing between fi ve and 30 proteins. However, this distribution was heavily infl uenced by two publications189,191 which together contributed over 90% of the models in the database. To assess the overlap between models, we examined the extent to which the same proteins appeared in multiple models (Figure 5.4b). Although a protein was shared by approximately nine models on average, the majority were found in only one or two models. Thirty-fi ve proteins appeared in over 100 models (<5% of all models in the database). Interestingly, among these were all six of the yeast ATPases in the 26S proteasome (RPT1-6), components of the yeast and worm 20S proteasome, and several yeast, worm, and fl y protein kinases. The pervasiveness of these proteins in models may refl ect their broad evolutionary conservation across species, a high degree of connectivity in the protein network, their popularity in the biological literature, or their functional roles in many distinct biological processes (i.e., pleiotropy). The results of our model overlap analyses are accessible through the web interface. Each model is annotated in the CellCircuits database with a list of similar models, defi ned as those that contain at least three of the same genes. Clicking the “View similar models” link in the search results will display these models (Figure 5.3, circle 4). Currently, only the number of shared genes is used to assess similarity between models. However, more complex measures could be envisioned, potentially making CellCircuits, itself, a resource for comparing several similar models (perhaps corresponding to the same biological process) and showing the differences between them. On a broader scale, we also assessed the extent to which publications covered overlapping regions of the protein interactome using a pairwise similarity score (see Methods). Results are shown in Figure 5.4c. Although our similarity score was permissive such that some overlap was expected between every pair of publications, 113 only fi ve of the 55 possible pairs showed over 25% similarity. Thus, it appears that the different modeling publications are, to some degree, capturing different regions of the protein interaction network (excluding189,191, see Figure 5.4c). Furthermore, in the future, this kind of meta-analysis could be used to determine how the results from new publications differ from existing models.

Discussion

In summary, CellCircuits v1.0 provides a clearinghouse in which hypothetical pathway models derived from large-scale protein networks may be easily accessed, queried, and exported for further study. The eleven publications included in this initial release were chosen to cover a broad range of network model types with a bias towards publications that provided models in both graphical and machine-readable format. Beyond this proof-of-principle, a signifi cant question is whether, or to what extent, all past and future network models might be incorporated. On one hand, the fi eld of network biology is still young such that the number of relevant previous publications is probably less than fi fty. On the other hand, the rapid adoption of systems and network approaches will make capturing information from all future works a daunting prospect if the models are not readily accessible. CellCircuits complements existing efforts that have begun to address this challenge, such as markup languages for describing models (BioPAX47 and SBML48) and the BioModels Database of quantitative, kinetic models49. Like biological sequence and microarray databases, we envision CellCircuits as a valuable resource for storing, accessing, and updating network models across the wider biological research community. 114

Methods

Data processing

A data processing pipeline was used to extract information from the textual representation of a model and store that information in a MySQL (http://www.mysql. org) relational database. The data processing pipeline required a digital image of each model and a text fi le containing the genes, proteins, metabolites, other small molecules, and interconnections represented in the model. In cases when a network model was published in graphical form only, the text fi le was manually transcribed (see Supplemental Table 5.1). To ensure that the CellCircuits database used a consistent set of gene identifi ers, we mapped each gene name found in the text fi le for a model to a GO gene id using tables from the GO database. Gene names found in a model but not in the GO database were automatically inserted into the appropriate database tables and fl agged as being externally added. Future curation efforts could be directed towards handling these genes missing from the GO database. After models were entered into the database, they were scored using the hypergeometric test for GO annotation enrichment.

Web interface

We used Perl CGI scripts (http://www.perl.org) in conjunction with the Apache web server (http://httpd.apache.org), mod_perl (http://perl.apache.org), and Perl DBI (http://dbi.perl.org) to serve HTML content, handle user input, and query the MySQL database. Script.aculo.us version 1.61 (http://script.aculo.us), an open source

JavaScript library, was used to generate visual effects on the web pages that display search results. 115

Scoring models for Gene Ontology (GO) annotation

Using the latest release of the GO database, models were scored for a statistically signifi cant number of genes in the model that were annotated with a particular GO term. We fi rst identifi ed the complete set of genes associated with each GO term. This set included the genes directly annotated with that term as well as those annotated with any of the term’s descendents in the GO hierarchy. Next, we used the hypergeometric distribution195,196 to test the genes in each model against the genes annotated with each of the GO terms. The resulting P-values were stored in the database.

Scoring similarity between publications

For each pair of publications we compared all models in one publication to all of the models in the other. To capture model similarity as sensitively as possible, we deﬁ ned two models to be similar if they shared at least one interaction. The similarity score of a pair of publications was deﬁ ned to be the number of distinct models that participated in any overlap divided by the total number of models in the pair. For example, consider publication A containing 2 models and publication B containing 6. If model 1 in A overlaps with models 1-5 in B, and model 2 in A only overlaps with model 1 in B, then the total number of distinct overlapping models is 7, and the similarity score between publications is 7/8. 116

Tables

Table 5.1. Sources of data.

Interactions† States‡ Network patterns §

Model source Organism(s)* Models** Genes Protein-protein Protein-DNA Synthetic lethal Protein-reaction-protein Gene expression Phenotypes Linear Parallel (genetic interactions) Parallel (sequence conservation) Clusters (interaction/state correlation) Bernard183 Y 1 17 39 62 x Hartemink184 Y 2 33 66 320 x Yeang185 Y 38 602 110 708 273 x Kelley191 Y 473 787 1246 95 1843 554 x Sharan189 YWF 1370 822 2805 x Suthram190 YP 32 40 68 x Gandhi188 YWFH 48 85 74 x de Lichtenberg23 Y 31 719 712 66 x Ideker28 Y 10 107 83 26 20 x Begley62 Y 6 130 129 12 26 x x Haugen186 Y 8 280 85 410 13 5 x x TOTALS: 2019 3622 5312 1356 1843 567 772

* Y = Yeast ; W = Worm ; F = Fly ; H = Human ; P = Plasmodium falciparum

** counts refer to total number of models across all organisms modeled

§ counts refer to number of distinct genes in yeast only across all models.

† counts refer to number of distinct interactions in yeast only across all models

‡ For gene expression, counts refer to number of proﬁ les used 117

Supplemental Tables

Supplemental Table 5.1. Figures curated.

Model Source Figures curated Curation method Begley62 5b (subpanels #1-#6) Each panel is one model (1:1) Ideker28 5 (subpanels #1a-e, #2-#7) 1:1 Haugen186 1a-f, 5a, 5b 1:1 Bernard183 3 1:1 Hartemink184 2 Each unconnected component is one model Gandhi188 Suppl. Figs 1 – 3 Each unconnected component is one model De Lichtenberg23 1 Each unconnected component is one model. And, the following annotated (yet sparsely connected) components were subdivided into separate models (APC, Cdc28-cyclin, MCM/ORC, Sister chromatid cohesion, one model with Tubulin related and SPB, one model with Nucleosome/bud formation, Histones, and DNA replication) 3a-c 1:1 118

Acknowledgements

We acknowledge funding from the National Science Foundation (NSF 0425926) and thank the members of the Ideker lab for testing and suggesting improvements to the web interface. Chapter 5, in full, is a reprint of material as it appeared on November 29 2006 in Nucleic Acids Research, 2006, vol. 35, D538–D545, Database issue 2007. Mak, H. Craig; Daly, Mike; Gruebel, Bianca; Ideker, Trey. The dissertation author was one of two primary investigators and authors (with M. Daly) of this paper. Chapter 6: Conclusions

Physiologically relevant interpretation of high-throughput data

In this dissertation, we sought to elucidate the structure and function of molecular interaction networks that determine gene expression. Large-scale studies of these networks have been made possible by advances in technology for gathering cellular data, such as protein interactions and gene expression profi les, on an unprecedented high-throughput level. Generating network models, however, is challenging because it is often unclear how to meaningfully interpret the wealth of data in ways that are physiologically relevant. We adopted a four-stage strategy for extracting meaning from high-throughput data: network assembly, network analysis, network refi nement, and network cataloging (Figure 6.1). Chapters 2 through 5 of this dissertation are each focused on one stage of this strategy, and our main fi ndings are summarized at the end of Chapter 1. Here, we discuss our work in the context of three themes: harnessing complementary relationships between data, controlling for missing and spurious data, and remaining grounded in biology.

a b Articulate General strategy relationships Regulatory modules (Ch. 2) � TF-motif associations (Ch. 2) Codify � Condition-specific remodeling (Ch. 2,3) Search A ble n m a Global properties (position effects; Ch. 3) e ly DNA damage s z s e response (Ch. 2) A Functional annotation (Ch. 2,4,5) SBTF network (Ch. 3) Regulatory effect (repressor / activator; Ch. 4) Deletion response (Ch. 4)

C a ta ne lo fi Experimental validation (Ch. 2,3,4) g Re CellCircuits database (Ch. 5) Automated strategy (Ch. 4)

d c

Figure 6.1. Contributions of the dissertation to each stage of the network lifecycle.

119 120

Harnessing complementary relationships

Since different experimental techniques measure complementary information about the inner-workings of a cell, we explored methods for data integration. We sought to harness the complementary relationships between data types to produce network models that more accurately represented a biological system. The raw data used for network assembly, analyses, and refi nement typically fell into two categories: physical and inferred. Physical interactions represent chemical affi nities between molecules, and include interactions between two proteins or interactions between a protein and DNA. Inferred (or functional) interactions represent functional dependencies that one gene or protein has on another, and are observed when a biological system is perturbed, perhaps by an environmental stimulus or by deleting a gene. In Chapters 2 and 4, we employed the following strategy (Figure 6.1a) to integrate data from gene deletion experiments with a noisy network of physical interactions to identify high confi dence regulatory pathways. First, we articulated the precise biological relationships between data types. Specifi cally, it has been observed that the majority of physical TF-promoter binding events do not coincide with changes in gene expression197. Instead, we proposed that only those physical interactions that link a transcription factor (TF) to a gene whose expression changes when that factor is deleted should be interpreted as a functional regulatory pathway. Thus, a pathway represents one possible physical explanation of an effect observed when a TF is deleted. Such a regulatory pathway might include protein kinases in a signaling cascade and/or transcription factors in a regulatory hierarchy. Second, we codifi ed these relationships by augmenting and implementing a previously described modeling procedure35 as a plugin for Cytoscape11, a software 121 package for network analyses and visualization. Third, we used our software to search for all instances of these relationships. In Chapter 4, we searched for regulatory pathways that function in cells growing in rich medium. In Chapter 2, we extended this methodology to search for pathways that are responsible for expression changes induced by MMS, a DNA damaging agent. We found 37 direct physical pathways in which an MMS-dependent functional interaction was explained by a TF binding to the promoter of a target. By searching for longer regulatory pathways that contained intermediate genes, we were able to explain 68 additional functional interactions, an almost 3-fold increase. In total, however, pathways through the physical network could only explain about 30% of the functional interactions (105 of 341). Why were the remaining 70% of functional interactions not explained? Clearly, the biological relationships that we articulated between physical and functional interactions were insuffi cient for explaining all of the observed gene expression changes. The next generation of transcriptional network models will need to incorporate data representing other regulatory mechanisms such as mRNA degradation, nuclear export, and protein translation130. Lastly, in Chapter 3, we integrated physical TF-promoter interactions with the genomic locations on the chromosome of the bound promoters. A few yeast TFs are known to bind near the telomeres at chromosome ends99,115. We screened a compendium of yeast TF binding profi les for additional transcription factors that also bound near chromosome ends. We discovered 22 factors, which we named SBTFs (Subtelomere Binding Transcription Factors). For many SBTFs, the subtelomeric binding pattern was dynamic, meaning that the TF appeared to relocate to or away from the subtelomere as physiological conditions shifted. Our study presented the fi rst systematic evidence that genome position 122

(i.e. the location of a target gene along the chromosome) may inﬂ uence patterns of transcription factor binding. This may be representative of a general principle governing the wiring of transcriptional networks. Our results support observations that genome position affects gene expression88-93, and provide possible insights into the regulatory network underlying these expression patterns.

Controlling for missing data and spurious data

Missing data includes false negatives (phenomena that could have been measured, but were not, perhaps due to errors or technical limitations) and incomplete coverage (phenomena that were excluded from being measured, perhaps due to the study design). Almost all high-throughput studies performed to-date have been limited by incomplete coverage, primarily due to fi nite time, dollar, and reagent resources. Noise and experimental biases in high-throughput methods also generate spurious data (false positives), measurements that appear accurate but are not. Recent studies of protein interaction screens38 and microarray expression profi les27,30,139 suggest that aggregating data from several studies may be a useful, and perhaps necessary, strategy to control for missing or spurious data during network assembly and analysis (Figure 6.1 a and b). In Chapter 2, we developed a statistical method (the truncated product method) that used TF-promoter binding interactions measured in more than one physiological condition to more accurately identify high- confi dence interactions. We also used a Bayesian probabilistic method (deletion- buffering analysis) for combining many expression profi les from deletion mutants to infer functional dependencies. In Chapter 3, we supported our discovery of SBTFs, which was based solely on TF-promoter binding data, with high-throughput protein interactions, gene expression profi les, and phenotyping studies. In Chapter 4, we described a principled method for identifying highly informative follow up 123 experiments and incorporating the results to refi ne a network (Figure 6.1 c).

Remaining grounded in biology

Bioinformatics and computational biology can be seductively powerful in their ability to ﬁ nd patterns in huge amounts of data. It is important to consider whether the patterns found will be of beneﬁ t to the practicing biologist. Here, we discuss our work in the context of potential pitfalls that may undermine the biological utility of computational analyses: excessive precision, challenges of validation, and limited dissemination.

Excessive precision refers to the danger inherent in the ability of computational methods to quantify a biological system with almost infi nitely fi ne-grained measurements that do not refl ect biology reality. For example, it is possible to use statistical methods to analyze a microarray expression profi le and rank every single gene from most to least expressed. Because of the inherent noisiness of the measurements, though, it is unlikely that the 3543th ranked gene, for instance, is actually expressed at a different level than the 2987th gene. Excessive precision is related to over-fi tting, which refers to statistical models that represent (or ‘fi t’) a set of observations too closely. As a result of both over-fi tting and excessive precision, the actual behaviors and variations in the underlying biological system are not accurately captured, thereby potentially leading to misleading conclusions. Perhaps the best counter to these pitfalls is experimental validation, which we attempted in Chapters 2, 3, and 4 and discuss in more detail below.

Challenges of validation. The accuracy and value of computational studies are sometimes regarded with skepticism by biologists, especially if the conclusions 124 generated have not been extensively experimentally confi rmed. Experimental validation is therefore an important component of computational modeling, but it is particularly challenging because of the overwhelming quantity of biological hypotheses that may be generated. A single computational study may identify hundreds of proteins warranting further research, each of which has a rich history of published literature, each of which may require specialized (and expensive) experimental assays and reagents, and each of which could lead to focused studies that might consume an individual investigator for weeks, months, or even years. Our validation strategy in Chapter 4 identifi es the theoretically optimal experiments that will provide the most useful data for validating and refi ning the network (Figure 6.1 c). However, follow-up experiments may still be time-consuming and expensive, and ideally, we would want to pursue all hypotheses generated by a computational analysis. Thus, going forward, it might be worthwhile to explore technology for automating, miniaturizing, parallelizing, or otherwise reducing the cost of follow-up experiments. It is an open question as to whether computational studies can be explicitly designed to generate results that are amenable to low-cost, rapid verifi cation.

Limited dissemination. The biological utility of network models is currently limited because they are typically embedded in fi gures, tables, or supplementary information in the primary published literature. As a result, the models remain largely inaccessible to the biologists who have the best knowledge to interpret them. To address these obstacles, we built the CellCircuits database of network models in Chapter 5. In its initial release, CellCircuits provided an online interface for searching and viewing over 1000 network models from eleven previously published studies. An online database makes models more accessible to the research community 125 and perhaps will encourage validation and as well as the assembly of the next generation of better networks (Figure 6.1d). CellCircuits also enabled us to analyze, for the fi rst time, a diverse collection of network models as a whole. Based on these meta-analyses, we identifi ed genes and pathways that rarely appear in network models and that therefore might be good targets for additional experimental or computational analyses. Just as existing databases of DNA sequences, protein structures, and microarray profi les have proven valuable, CellCircuits demonstrates the potential value of a database of network models.

Beyond wiring diagrams

Much of this dissertation has focused on building transcriptional networks to answer “What regulates gene X?” and “What does gene X regulate?” Knowing what is connected to what – the ‘wiring diagram’ of the cell – is important, but it provides little insight into how the system works and why is it wired that way. Indeed, network models promise to illuminate how complex behaviors emerge from collections of individual interactions198. To move beyond a gene- and protein-centric view, we propose that a future challenge of computational biology will be to formulate abstractions that represent instances of known biological patterns or mechanisms. For example, it is known than transcription factors activate or repress target genes via many regulatory mechanisms that include nuclear translocation, mRNA degradation, post-translational modifi cation, binding with co-factors, altered affi nity for as sequence motif, localization within the nucleus, and differential expression of the factor itself. Focused experimental studies have revealed specifi c cases where each regulatory mechanism is used, but few system-wide studies have explored the extent that each mechanism is used in an organism, how multiple mechanisms 126 concurrently affect the activity of a single regulatory protein, or whether different organisms have evolved to prefer some regulatory mechanisms over others. To state the question another way: which TFs regulate gene expression by which mechanisms, and why? Future research might fi rst devise methods for identifying TFs that use each mechanism. For instance, nuclear localization signals embedded within the amino acid sequences of TFs could be to classify factors that are translocated to the nucleus199. Gene promoters that contain overlapping, repeated, inverted, or co-occurring binding sites could be used to identify TFs that function together70. Protein sequence and structural similarity could be used to identify TFs that function as homo- or hetero-dimers19. Expression profi les could be used to identify TFs that are present in the cell at different times200. Protein modifi cation assays could be used to identify TFs that are regulated by phosphorylation, ubiquitination, or other post-translational modifi cations201. Algorithms could then be developed to map regulatory mechanisms to TFs, target genes, and therefore specifi c biological pathways. Such knowledge might allow us to globally characterize transcriptional networks so that, in addition to knowing that gene X regulates gene Y, we also know how, and thus perhaps why, that regulation occurs. 127

Final remarks

We anticipate that our ability to collect biological measurements will continue to grow in both the scale and breadth. Concurrently, the need for computational methods to interpret these data will grow as well. The individual studies performed in this dissertation have addressed specifi c stages in the network lifecycle. But we also recognize that the stages are interconnected. New networks are assembled as a result of analyses. A catalog of network models facilitates their refi nement. Refi nement and analysis are steps during which new data can be incorporated to assemble new networks. What underlies the interrelationships between these stages? Perhaps it is that biological discovery is an iterative, incremental process: advances towards solving disparate problems proceed in lockstep. Key pieces to one puzzle are provided by another. As such, we see value in future research that targets the individual challenges of network assembly, analysis, refi nement, and cataloging, as well as the boundaries between them, which are blurred and yet rich in potential for crafting ever more comprehensive models of life. References

1. Southern, E.M. Detection of speciﬁ c sequences among DNA fragments separated by gel electrophoresis. J Mol Biol 98, 503-17 (1975).

2. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-70 (1995).

3. Ren, B. et al. Genome-wide location and function of DNA binding proteins. Science 290, 2306-9 (2000).

4. Ptacek, J. et al. Global analysis of protein phosphorylation in yeast. Nature 438, 679-84 (2005).

5. Winzeler, E.A. et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6 (1999).

6. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-80 (2005).

7. Mitra, R.D. & Church, G.M. In situ localized ampliﬁ cation and contact replication of many individual DNA molecules. Nucleic Acids Res 27, e34 (1999).

8. Shendure, J.A., Porreca, G.J. & Church, G.M. Overview of DNA sequencing strategies. Curr Protoc Mol Biol Chapter 7, Unit 7 1 (2008).

9. Tong, A.H. et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294, 2364-8 (2001).

10. Breitkreutz, B.J., Stark, C. & Tyers, M. Osprey: a network visualization system. Genome Biol 4, R22 (2003).

11. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498-504 (2003).

12. Bader, G.D., Betel, D. & Hogue, C.W. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 31, 248-50 (2003).

13. Hermjakob, H. et al. IntAct: an open source molecular interaction database. Nucl. Acids Res. 32, D452-455 (2004).

14. Mewes, H.W. et al. MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res 34, D169-72 (2006).

128 129

15. Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucl. Acids Res. 34, D535-539 (2006).

16. Xenarios, I. et al. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 30, 303-5 (2002).

17. Boone, C., Bussey, H. & Andrews, B.J. Exploring genetic interactions and networks with yeast. Nat Rev Genet 8, 437-49 (2007).

18. Barabasi, A.L. & Oltvai, Z.N. Network biology: understanding the cell’s functional organization. Nat Rev Genet 5, 101-13 (2004).

19. Han, J.D. et al. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430, 88-93 (2004).

20. Kim, P.M., Lu, L.J., Xia, Y. & Gerstein, M.B. Relating three-dimensional structures to protein networks provides evolutionary insights. Science 314, 1938-41 (2006).

21. Uetz, P. et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623-7 (2000).

22. Levine, M. & Davidson, E.H. Gene regulatory networks for development. Proc Natl Acad Sci U S A 102, 4936-42 (2005).

23. de Lichtenberg, U., Jensen, L.J., Brunak, S. & Bork, P. Dynamic complex formation during the yeast cell cycle. Science 307, 724-7 (2005).

24. Lee, T.I. et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799-804 (2002).

25. Simon, I. et al. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 106, 697-708 (2001).

26. Shen-Orr, S.S., Milo, R., Mangan, S. & Alon, U. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31, 64-8 (2002).

27. Segal, E., Friedman, N., Koller, D. & Regev, A. A module map showing conditional activity of expression modules in cancer. Nat Genet 36, 1090-8 (2004).

28. Ideker, T., Ozier, O., Schwikowski, B. & Siegel, A.F. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18 Suppl 1, S233-40 (2002). 130

29. Hughes, T.R. et al. Functional discovery via a compendium of expression proﬁ les. Cell 102, 109-26 (2000).

30. Lamb, J. et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929-35 (2006).

31. Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E.S. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423, 241-54 (2003).

32. Maniatis, T. & Reed, R. An extensive network of coupling among gene expression machines. Nature 416, 499-506 (2002).

33. Kelley, B.P. et al. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res 32, W83-8 (2004).

34. von Mering, C. et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399-403 (2002).

35. Yeang, C.H., Ideker, T. & Jaakkola, T. Physical network models. J Comput Biol 11, 243-62 (2004).

36. Krogan, N.J. et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440, 637-643 (2006).

37. Collins, S.R. et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 6, 439-50 (2007).

38. Reguly, T. et al. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology 5, 11 (2006).

39. Duarte, N.C., Herrgard, M.J. & Palsson, B.O. Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genome-scale metabolic model. Genome Res 14, 1298-309 (2004).

40. Gat-Viks, I., Tanay, A. & Shamir, R. Modeling and analysis of heterogeneous regulation in biological networks. J Comput Biol 11, 1034-49 (2004).

41. Herrgard, M.J., Covert, M.W. & Palsson, B.O. Reconstruction of microbial transcriptional regulatory networks. Curr Opin Biotechnol 15, 70-7 (2004).

42. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. & Wheeler, D.L. GenBank. Nucleic Acids Res 36, D25-30 (2008).

43. Cochrane, G. et al. Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database. Nucleic Acids Res 36, D5-12 (2008). 131

44. Berman, H., Henrick, K. & Nakamura, H. Announcing the worldwide Protein Data Bank. Nat Struct Biol 10, 980 (2003).

45. Edgar, R. & Barrett, T. NCBI GEO standards and services for microarray data. Nat Biotechnol 24, 1471-2 (2006).

46. Parkinson, H. et al. ArrayExpress--a public database of microarray experiments and gene expression proﬁ les. Nucleic Acids Res 35, D747-50 (2007).

47. BioPAX working group. BioPAX—biological pathways exchange language. Level 1, Version 1.0 Documentation. (2004).

48. Hucka, M. et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524-531 (2003).

49. Le Novere, N. et al. BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucl. Acids Res. 34, D689-691 (2006).

50. Joshi-Tope, G. et al. Reactome: a knowledgebase of biological pathways. Nucl. Acids Res. 33, D428-432 (2005).

51. Caspi, R. et al. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucl. Acids Res. 34, D511-516 (2006).

52. Kanehisa, M. et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34, D354-7 (2006).

53. Ananko, E.A. et al. GeneNet in 2005. Nucleic Acids Res 33, D425-7 (2005).

54. Gough, N.R. & Ray, L.B. Mapping cellular signaling. Sci STKE 2002, EG8 (2002).

55. Drablos, F. et al. Alkylation damage in DNA and RNA--repair mechanisms and medical signiﬁ cance. DNA Repair (Amst) 3, 1389-407 (2004).

56. Friedberg, E.C. et al. DNA Repair and Mutagenesis, (American Society for Microbiology Press, Washington, D.C., 2005).

57. Rouse, J. & Jackson, S.P. Lcd1p recruits Mec1p to DNA lesions in vitro and in vivo. Mol Cell 9, 857-69 (2002).

58. Gasch, A.P. et al. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol Biol Cell 12, 2987- 3003 (2001). 132

59. Jelinsky, S.A. & Samson, L.D. Global response of Saccharomyces cerevisiae to an alkylating agent. Proc Natl Acad Sci U S A 96, 1486-91 (1999).

60. Jelinsky, S.A., Estep, P., Church, G.M. & Samson, L.D. Regulatory networks revealed by transcriptional proﬁ ling of damaged Saccharomyces cerevisiae cells: Rpn4 links base excision repair with proteasomes. Mol Cell Biol 20, 8157-67 (2000).

61. Begley, T.J., Rosenbach, A.S., Ideker, T. & Samson, L.D. Hot spots for modulating toxicity identiﬁ ed by genomic phenotyping and localization mapping. Mol Cell 16, 117-25 (2004).

62. Begley, T.J., Rosenbach, A.S., Ideker, T. & Samson, L.D. Damage Recovery Pathways in Saccharomyces cerevisiae Revealed by Genomic Phenotyping and Interactome Mapping. Mol Cancer Res 1, 103-12 (2002).

63. Chang, M., Bellaoui, M., Boone, C. & Brown, G.W. A genome-wide screen for methyl methanesulfonate-sensitive mutants reveals genes required for S phase progression in the presence of DNA damage. Proc Natl Acad Sci U S A 99, 16934-9 (2002).

64. Birrell, G.W. et al. Transcriptional response of Saccharomyces cerevisiae to DNA-damaging agents does not identify the genes that protect against these agents. Proc Natl Acad Sci U S A 99, 8778-83 (2002).

65. Ideker, T. et al. Integrated genomic and proteomic analysis of a systematically perturbed metabolic network. Science 292, 929-934 (2001).

66. Lee, T.I. et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799-804. (2002).

67. Habeler, G. et al. YPL.db: the Yeast Protein Localization database. Nucleic Acids Res 30, 80-3. (2002).

68. Mewes, H.W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 30, 31-4 (2002).

69. Friedberg, E.C., Siede, W. & Cooper, A.J. Cellular Responses to DNA Damage in Yeast. in The Molecular and Cellular Biology of the Yeast Saccharomyces, Vol. 1 (eds. Broach, J.R., Pringle, J. & Elizabeth, J.) 147-192 (Cold Spring Harbor Press, Cold Spring Harbor, 1991).

70. Harbison, C.T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104 (2004).

71. Zhu, J. & Zhang, M.Q. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics 15, 607-11 (1999). 133

72. Wingender, E. et al. The TRANSFAC system on gene expression regulation. Nucleic Acids Res 29, 281-3 (2001).

73. Workman, C.T. & Stormo, G.D. ANN-Spec: a method for discovering transcription factor binding sites with improved speciﬁ city. Pac Symp Biocomput, 467-78 (2000).

74. Santiago, T.C. & Mamoun, C.B. Genome expression analysis in yeast reveals novel transcriptional regulation by inositol and choline and new regulatory functions for Opi1p, Ino2p, and Ino4p. J Biol Chem 278, 38723-30 (2003).

75. Winzeler, E.A. et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6 (1999).

76. Gasch, A.P. et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57 (2000).

77. Huang, M., Zhou, Z. & Elledge, S.J. The DNA replication and damage checkpoint pathways induce transcription by inhibition of the Crt1 repressor. Cell 94, 595-605 (1998).

78. Chabes, A. et al. Survival of DNA damage in yeast directly depends on increased dNTP levels allowed by relaxed feedback inhibition of ribonucleotide reductase. Cell 112, 391-401 (2003).

79. Iyer, V.R. et al. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409, 533-8 (2001).

80. Koch, C., Moll, T., Neuberg, M., Ahorn, H. & Nasmyth, K. A role for the transcription factors Mbp1 and Swi4 in progression from G1 to S phase. Science 261, 1551-7 (1993).

81. Jia, Y., Rothermel, B., Thornton, J. & Butow, R.A. A basic helix-loop-helix- leucine zipper transcription complex in yeast functions in a signaling pathway from mitochondria to the nucleus. Mol Cell Biol 17, 1110-7 (1997).

82. Workman, C. et al. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol 3, research0048 (2002).

83. Ideker, T., Thorsson, V., Siegel, A.F. & Hood, L.E. Testing for differentially- expressed genes by maximum-likelihood analysis of microarray data. J Comput Biol 7, 805-17 (2000).

84. Zaykin, D.V., Zhivotovsky, L.A., Westfall, P.H. & Weir, B.S. Truncated product method for combining P-values. Genet Epidemiol 22, 170-85 (2002). 134

85. Stormo, G.D. DNA binding sites: representation and discovery. Bioinformatics 16, 16-23 (2000).

86. Bailey, T.L. & Gribskov, M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14, 48-54 (1998).

87. Kschischang, F., Frey, B. & Loeliger, H. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47, 498-519 (2001).

88. Hurst, L.D., Pal, C. & Lercher, M.J. The evolutionary dynamics of eukaryotic gene order. Nat Rev Genet 5, 299-310 (2004).

89. Kosak, S.T. & Groudine, M. Gene order and dynamic domains. Science 306, 644-7 (2004).

90. Cohen, B.A., Mitra, R.D., Hughes, J.D. & Church, G.M. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nat Genet 26, 183-6 (2000).

91. Pauli, F., Liu, Y., Kim, Y.A., Chen, P.J. & Kim, S.K. Chromosomal clustering and GATA transcriptional regulation of intestine-expressed genes in C. elegans. Development 133, 287-95 (2006).

92. Su, A.I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, 6062-7 (2004).

93. Versteeg, R. et al. The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res 13, 1998-2004 (2003).

94. Lee, J.M. & Sonnhammer, E.L. Genomic gene clustering analysis of pathways in eukaryotes. Genome Res 13, 875-82 (2003).

95. Cho, R.J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2, 65-73 (1998).

96. Kruglyak, S. & Tang, H. Regulation of adjacent yeast genes. Trends Genet 16, 109-11 (2000).

97. Caron, H. et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science 291, 1289-92 (2001).

98. Spellman, P.T. & Rubin, G.M. Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol 1, 5 (2002). 135

99. Mondoux, M.A. & Zakian, V.A. Teleomere Position Effect: Silencing Near the End. in Telomeres (eds. de Lange, T., Lundblad, V. & Blackburn, E.H.) 261-316 (Cold Spring Harbor Laboratory Press, 2006).

100. Grewal, S.I. & Jia, S. Heterochromatin revisited. Nat Rev Genet 8, 35-46 (2007).

101. Sexton, T., Schober, H., Fraser, P. & Gasser, S.M. Gene regulation through nuclear organization. Nat Struct Mol Biol 14, 1049-1055 (2007).

102. Gottschling, D.E., Aparicio, O.M., Billington, B.L. & Zakian, V.A. Position effect at S. cerevisiae telomeres: reversible repression of Pol II transcription. Cell 63, 751-62 (1990).

103. Robyr, D. et al. Microarray deacetylation maps determine genome-wide functions for yeast histone deacetylases. Cell 109, 437-46 (2002).

104. Rusche, L.N., Kirchmaier, A.L. & Rine, J. The establishment, inheritance, and function of silenced chromatin in Saccharomyces cerevisiae. Annu Rev Biochem 72, 481-516 (2003).

105. Wyrick, J.J. et al. Chromosomal landscape of nucleosome-dependent gene expression and silencing in yeast. Nature 402, 418-21 (1999).

106. Ai, W., Bertram, P.G., Tsang, C.K., Chan, T.F. & Zheng, X.F. Regulation of subtelomeric silencing during stress response. Mol Cell 10, 1295-305 (2002).

107. Workman, C.T. et al. A systems approach to mapping DNA damage response pathways. Science 312, 1054-9 (2006).

108. Storey, J.D. & Tibshirani, R. Statistical signiﬁ cance for genomewide studies. Proc Natl Acad Sci U S A 100, 9440-5 (2003).

109. Louis, E.J. The chromosome ends of Saccharomyces cerevisiae. Yeast 11, 1553- 73 (1995).

110. Mefford, H.C. & Trask, B.J. The complex structure and dynamic evolution of human subtelomeres. Nat Rev Genet 3, 91-102 (2002).

111. Yamada, M., Hayatsu, N., Matsuura, A. & Ishikawa, F. Y’-Help1, a DNA helicase encoded by the yeast subtelomeric Y’ element, is induced in survivors defective for telomerase. J Biol Chem 273, 33360-6 (1998).

112. Mitsui, K. et al. A novel membrane protein capable of binding the Na+/H+ antiporter (Nha1p) enhances the salinity-resistant cell growth of Saccharomyces cerevisiae. J Biol Chem 279, 12438-47 (2004). 136

113. Spode, I., Maiwald, D., Hollenberg, C.P. & Suckow, M. ATF/CREB sites present in sub-telomeric regions of Saccharomyces cerevisiae chromosomes are part of promoters and act as UAS/URS of highly conserved COS genes. J Mol Biol 319, 407-20 (2002).

114. Krogan, N.J. et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440, 637-43 (2006).

115. Lundblad, V. Budding Yeast Telomeres. in Telomeres (eds. de Lange, T., Lundblad, V. & Blackburn, E.H.) 345-386 (Cold Spring Harbor Laboratory Press, 2006).

116. Askree, S.H. et al. A genome-wide screen for Saccharomyces cerevisiae deletion mutants that affect telomere length. Proc Natl Acad Sci USA 101, 8658-63 (2004).

117. Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34, D535-9 (2006).

118. Konig, P., Giraldo, R., Chapman, L. & Rhodes, D. The crystal structure of the DNA-binding domain of yeast RAP1 in complex with telomeric DNA. Cell 85, 125-36 (1996).

119. Hong, E.L. et al. Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res 36, D577-81 (2008).

120. Bernstein, B.E. et al. Methylation of histone H3 Lys 4 in coding regions of active genes. Proc Natl Acad Sci USA 99, 8695-700 (2002).

121. Hu, Z., Killion, P.J. & Iyer, V.R. Genetic reconstruction of a functional transcriptional regulatory network. Nat Genet 39, 683-687 (2007).

122. Greatrix, B. & van Vuuren, H. Expression of the HXT13, HXT15 and HXT17 genes in Saccharomyces cerevisiae and stabilization of the HXT1 gene transcript by sugar-induced osmotic stress. Curr Genet 49, 205-17 (2006).

123. Lee, H.G. et al. High-resolution analysis of condition-speciﬁ c regulatory modules in Saccharomyces cerevisiae. Genome Biol 9, R2 (2008).

124. McCord, R.P., Berger, M.F., Philippakis, A.A. & Bulyk, M.L. Inferring condition-speciﬁ c transcription factor function from DNA binding and gene expression data. Mol Syst Biol 3, 100 (2007).

125. Blaiseau, P.L., Lesuisse, E. & Camadro, J.M. Aft2p, a novel iron-regulated transcription activator that modulates, with Aft1p, intracellular iron use and resistance to oxidative stress in yeast. J Biol Chem 276, 34221-6 (2001). 137

126. Rachidi, N., Martinez, M.J., Barre, P. & Blondin, B. Saccharomyces cerevisiae PAU genes are induced by anaerobiosis. Mol Microbiol 35, 1421-30 (2000).

127. Lieb, J.D., Liu, X., Botstein, D. & Brown, P.O. Promoter-speciﬁ c binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet 28, 327-34 (2001).

128. Marcand, S., Buck, S.W., Moretti, P., Gilson, E. & Shore, D. Silencing of genes at nontelomeric sites in yeast is controlled by sequestration of silencing factors at telomeres by Rap 1 protein. Genes Dev 10, 1297-309 (1996).

129. Bertuch, A.A. & Lundblad, V. Which end: dissecting Ku’s function at telomeres and double-strand breaks. Genes Dev 17, 2347-50 (2003).

130. Komili, S. & Silver, P.A. Coupling and coordination in gene expression processes: a systems biology view. Nat Rev Genet 9, 38-48 (2008).

131. Schneider, R. & Grosschedl, R. Dynamics and interplay of nuclear architecture, genome organization, and gene expression. Genes Dev 21, 3027-43 (2007).

132. Klein, F. et al. Localization of RAP1 and topoisomerase II in nuclei and meiotic chromosomes of yeast. J Cell Biol 117, 935-48 (1992).

133. Feuerbach, F. et al. Nuclear architecture and spatial positioning help establish transcriptional states of telomeres in yeast. Nat Cell Biol 4, 214-21 (2002).

134. Casolari, J.M. et al. Genome-wide localization of the nuclear transport machinery couples transcriptional status and nuclear organization. Cell 117, 427-39 (2004).

135. Kiburz, B.M. et al. The core centromere and Sgo1 establish a 50-kb cohesin- protected domain around centromeres during meiosis I. Genes Dev 19, 3017-30 (2005).

136. Louis, E.J. & Vershinin, A.V. Chromosome ends: different sequences may provide conserved functions. Bioessays 27, 685-97 (2005).

137. de Hoon, M.J., Imoto, S., Nolan, J. & Miyano, S. Open source clustering software. Bioinformatics 20, 1453-4 (2004).

138. Saldanha, A.J. Java Treeview--extensible visualization of microarray data. Bioinformatics 20, 3246-8 (2004).

139. Steinfeld, I., Shamir, R. & Kupiec, M. A genome-wide analysis in Saccharomyces cerevisiae demonstrates the inﬂ uence of chromatin modiﬁ ers on transcription. Nat Genet 39, 303-9 (2007). 138

140. Schuldiner, M., Collins, S.R., Weissman, J.S. & Krogan, N.J. Quantitative genetic analysis in Saccharomyces cerevisiae using epistatic miniarray proﬁ les (E-MAPs) and its application to chromatin functions. Methods 40, 344-52 (2006).

141. Byrne, K.P. & Wolfe, K.H. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res 15, 1456-61 (2005).

142. Remm, M., Storm, C.E. & Sonnhammer, E.L. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314, 1041-52 (2001).

143. Hood, L.E., Hunkapiller, M.W. & Smith, L.M. Automated DNA sequencing and analysis of the human genome. Genomics 1, 201-12 (1987).

144. Lockhart, D.J. & Winzeler, E.A. Genomics, gene expression and DNA arrays. Nature 405, 827-36 (2000).

145. de Jong, H. Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9, 67-103 (2002).

146. Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8. (1998).

147. Segal, E. et al. Module networks: identifying regulatory modules and their condition-speciﬁ c regulators from gene expression data. Nat Genet 34, 166-76 (2003).

148. Fedorov, F. Theory of Optimal Experimental Design, (Academic Press, New York, 1972).

149. Raiffa, H. & Shlaiffer, R. Applied Statistical Decision Theory, (MIT Press, Cambridge, MA, 1962).

150. Tong, S. & Koller, D. Active learning for parameter estimation in Bayesian networks. in Proceedings from the 13th Annual Conference on Neural Information Processing Systems 647-563 (Neural Information Processing Systems, Tübingen, 2000).

151. Abadir, M.S., Ferguson, J. & Kirkland, T.E. Logic design veriﬁ cation via test generation. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 7, 138-148 (1988).

152. Rea, C. & Settle, M.A. An automated test approach for US Air Force ﬁ ghter engines. Aerospace and Electronic Systems Magazine, IEEE 11, 24-28 (1996). 139

153. King, R.D. et al. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427, 247-52 (2004).

154. Deane, C.M., Salwinski, L., Xenarios, I. & Eisenberg, D. Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 1, 349-56 (2002).

155. Natarajan, K. et al. Transcriptional proﬁ ling shows that Gcn4p is a master regulator of gene expression during amino acid starvation in yeast. Mol Cell Biol 21, 4347-68 (2001).

156. Johnson, R.J. et al. Analysis of gene ontology features in microarray data using the Proteome BioKnowledge Library. In Silico Biol 5, 389-99 (2005).

157. Blom, J., De Mattos, M.J. & Grivell, L.A. Redirection of the respiro- fermentative ﬂ ux distribution in Saccharomyces cerevisiae by overexpression of the transcription factor Hap4p. Appl Environ Microbiol 66, 1970-3 (2000).

158. Smith, A., Ward, M.P. & Garrett, S. Yeast PKA represses Msn2p/Msn4p- dependent gene expression to regulate growth, stress response and glycogen accumulation. Embo J 17, 3556-64 (1998).

159. Ward, M.P., Gimeno, C.J., Fink, G.R. & Garrett, S. SOK2 may regulate cyclic AMP-dependent protein kinase-stimulated growth and pseudohyphal development by repressing transcription. Mol Cell Biol 15, 6854-63 (1995).

160. Tong, A.H. et al. Global mapping of the yeast genetic interaction network. Science 303, 808-13 (2004).

161. Popper, K. The Logic of Scientiﬁ c Discovery, (Basic Books, New York, 1959).

162. Brazma, A. et al. ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucl. Acids Res. 31, 68-71 (2003).

163. Elion, E.A., Satterberg, B. & Kranz, J.E. FUS3 phosphorylates multiple components of the mating signal transduction cascade: evidence for STE12 and FAR1. Mol Biol Cell 4, 495-510 (1993).

164. Park, S.H., Koh, S.S., Chun, J.H., Hwang, H.J. & Kang, H.S. Nrg1 is a transcriptional repressor for glucose repression of STA1 gene expression in Saccharomyces cerevisiae. Mol Cell Biol 19, 2044-50 (1999).

165. Wahi, M. & Johnson, A.D. Identiﬁ cation of genes required for alpha 2 repression in Saccharomyces cerevisiae. Genetics 140, 79-90 (1995). 140

166. Lew, D.J., Weinert, T. & Pringle, J.R. Cell cycle control in Saccharomyces cerevisiae. in The Molecular and Cellular Biology of the Yeast Saccharomyces: Cell Cycle and Cell Biology (eds. Pringle, J.R., Broach, J.R. & Jones, E.W.) 607-695 (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 1997).

167. Jung, U.S. & Levin, D.E. Genome-wide analysis of gene expression regulated by the yeast cell wall integrity signalling pathway. Mol Microbiol 34, 1049-57 (1999).

168. Williams, K.E. & Cyert, M.S. The eukaryotic response regulator Skn7p regulates calcineurin signaling through stabilization of Crz1p. Embo J 20, 3473- 83 (2001).

169. Park, J.M. et al. In vivo requirement of activator-speciﬁ c binding targets of mediator. Mol Cell Biol 20, 8709-19 (2000).

170. Cusick, M.E., Klitgord, N., Vidal, M. & Hill, D.E. Interactome: gateway into systems biology. Hum Mol Genet 14 Spec No. 2, R171-81 (2005).

171. Ooi, S.L. et al. Global synthetic-lethality analysis and yeast functional proﬁ ling. Trends in Genetics 22, 56-63 (2006).

172. Stuart, J.M., Segal, E., Koller, D. & Kim, S.K. A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules. Science 302, 249-255 (2003).

173. Krallinger, M. & Valencia, A. Text-mining and information-retrieval services for molecular biology. Genome Biol 6, 224 (2005).

174. Bader, J.S., Chaudhuri, A., Rothberg, J.M. & Chant, J. Gaining conﬁ dence in high-throughput protein interaction networks. Nat Biotechnol 22, 78-85 (2004).

175. Jansen, R. et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302, 449-53 (2003).

176. Lee, I., Date, S.V., Adai, A.T. & Marcotte, E.M. A probabilistic functional network of yeast genes. Science 306, 1555-8 (2004).

177. Joyce, A.R. & Palsson, B.O. The model organism as a system: integrating ‘omics’ data sets. Nat Rev Mol Cell Biol 7, 198-210 (2006).

178. Sharan, R. & Ideker, T. Modeling cellular machinery through biological network comparison. Nat Biotechnol 24, 427-33 (2006).

179. Vidal, M. Interactome modeling. FEBS Lett 579, 1834-8 (2005). 141

180. Krull, M. et al. TRANSPATH(R): an information resource for storing and visualizing signaling pathways and their pathological aberrations. Nucl. Acids Res. 34, D546-551 (2006).

181. Karp, P.D. Call for an enzyme genomics initiative. Genome Biol 5, 401 (2004).

182. Roberts, R.J. Identifying protein function--a call for community action. PLoS Biol 2, E42 (2004).

183. Bernard, A. & Hartemink, A.J. Informative structure priors: joint learning of dynamic regulatory networks from multiple types of data. Pac Symp Biocomput, 459-70 (2005).

184. Hartemink, A.J., Gifford, D.K., Jaakkola, T.S. & Young, R.A. Combining location and expression data for principled discovery of genetic regulatory network models. Pac Symp Biocomput, 437-49 (2002).

185. Yeang, C.H. et al. Validation and reﬁ nement of gene-regulatory pathways on a network of physical interactions. Genome Biol 6, R62 (2005).

186. Haugen, A.C. et al. Integrating phenotypic and expression proﬁ les to map arsenic-response networks. Genome Biol 5, R95 (2004).

187. Maciag, K. et al. Systems-level analyses identify extensive coupling among gene expression machines. Mol Syst Biol 2, E1-E14 (2006).

188. Gandhi, T.K. et al. Analysis of the human protein interactome and comparison with yeast, worm and ﬂ y interaction datasets. Nat Genet 38, 285-93 (2006).

189. Sharan, R. et al. Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci U S A 102, 1974-9 (2005).

190. Suthram, S., Sittler, T. & Ideker, T. The Plasmodium protein network diverges from those of other eukaryotes. Nature 438, 108-12 (2005).

191. Kelley, R. & Ideker, T. Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol 23, 561-6 (2005).

192. Milo, R. et al. Network motifs: simple building blocks of complex networks. Science 298, 824-7 (2002).

193. Schneider, T.D. & Stephens, R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097–6100 (1990).

194. Harris, M.A. et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32, D258-61 (2004). 142

195. Feller, W. An Introduction to Probability Theory and Its Application, 528 (John Wiley & Sons, Inc., 1968).

196. Boyle, E.I. et al. GO::TermFinder--open source software for accessing Gene Ontology information and ﬁ nding signiﬁ cantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20, 3710-3715 (2004).

197. Bar-Joseph, Z. et al. Computational discovery of gene modules and regulatory networks. Nat Biotechnol 21, 1337-42 (2003).

198. Henney, A. & Superti-Furga, G. A network solution. Nature 455, 730-1 (2008).

199. Nair, R., Carter, P. & Rost, B. NLSdb: database of nuclear localization signals. Nucleic Acids Res 31, 397-9 (2003).

200. Luscombe, N.M. et al. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 431, 308-12 (2004).

201. Kung, L.A. & Snyder, M. Proteome chips for whole-organism assays. Nat Rev Mol Cell Biol 7, 617-22 (2006).

202. Magasanik, B. & Kaiser, C.A. Nitrogen regulation in Saccharomyces cerevisiae. Gene 290, 1-18 (2002).

203. Zanzoni, A. et al. MINT: a Molecular INTeraction database. FEBS Letters 513, 135-140 (2002).

204. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucl. Acids Res. 28, 27-30 (2000).

205. Spellman, P.T. et al. Comprehensive identiﬁ cation of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97 (1998).