BIOINFORMATICS APPROACHES TOWARDS FACILITATING DRUG DEVELOPMENT

Anna Ying-Wah Lee

Doctor of Philosophy

School of Computer Science

McGill University

Montréal, Québec

August 2010

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Doctor of Philosophy.

© Anna Ying-Wah Lee, 2010

ACKNOWLEDGEMENTS

I would like to thank my supervisors Mike Hallett and Sarah Jenna, and my collaborator David Thomas for their valuable guidance and support. Their efforts have taught me a lot about research itself, in addition to science. Despite my generally quiet nature, I would like to thank them for sending me off to present at various meetings. Those meetings exposed me to a lot of great science and I greatly appreciated the opportunities to attend. I would also like to thank my labmates for making my graduate experience enjoyable both in and outside of the lab. Members of the David Thomas lab have also been helpful and kind to me over the years. I also thank my family and friends. My parents have been quite supportive considering that they wanted me to become a different kind of doctor. My high school friends have always supported me by encouraging me to stay in school. Finally, I thank André Courtemanche, Dr. and Mrs. Milton Leong and NSERC for providing me with fellowships.

ABSTRACT

Drug development is currently a time-consuming, costly and challenging process. The process typically starts with the identification of a therapeutic target for a given disease. A therapeutic target is some biological molecule and the binding of compounds to target molecules is expected to cause a desired therapeutic effect. That is, target-binding compounds have the potential to become drug candidates. However, there is a tendency for many drug candidates to fail during clinical trials, and consequently, very few candidates become approved new drugs. This trend suggests that the early stages of drug development should be improved to provide better drug candidates.

The reasons for which a drug candidate may fail during clinical trials include unacceptable toxicity and insufficient efficacy observed in humans. These reasons suggest that the assessments of a compound during the early stages of drug development often inaccurately predict the effect of the compound in humans. One of the main goals of systems biology is to accurately predict how a given biological system responds to perturbations, e.g. treatment with a compound. This suggests that systems biology can help address challenges in drug development. However, there are currently gaps in our knowledge of systems. Here we use machine learning techniques to exploit existing systems data towards filling in these gaps. In particular, we developed a method that uses the occurrences of motifs in protein sequences to predict kinase-substrate interactions. We also developed a method that uses expression, protein-protein interaction and phenotype data to predict genetic interactions. These predicted interactions can facilitate the identification of potential therapeutic targets. Ultimately, a better selection of therapeutic targets should lead to better drug candidates. We also address the challenge of developing combinatorial therapies. Despite the fact that combinatorial therapies are advantageous, the scale of the experiments required to search for desirable chemical combinations is currently prohibitive. We therefore developed a method that uses system response data to predict chemical synergies towards facilitating the development of combinatorial therapies.

Overall, this thesis shows how computational prediction in a systems biology framework can be used to facilitate and expedite the early stages of drug development.

ABRÉGÉ

Drug development is currently a costly, difficult and time-consuming process. The process generally begins with the identification of a therapeutic target for a specific disease. A therapeutic target is a biological molecule, and the binding of compounds to target molecules is expected to cause a therapeutic effect. Thus, compounds that bind to targets have the potential to become drug candidates. However, many drug candidates tend to fail during clinical trials, and consequently very few candidates become approved new drugs. This trend suggests that the early stages of drug development should be improved in order to provide better-quality drug candidates.

The reasons for which a drug candidate may fail during clinical trials include unacceptable toxicity and insufficient efficacy observed in humans. These reasons suggest that the assessments of a compound during the early stages of drug development poorly predict the effect of the compound in humans. One of the main goals of systems biology is to predict accurately how a biological system responds to perturbations, for example treatment with a compound. This suggests that systems biology can help address the challenges of drug development. However, there are currently gaps in our knowledge of systems. Here, we use machine learning techniques to exploit existing systems information to fill these gaps. In particular, we developed a method that uses occurrences of motifs in protein sequences to predict kinase-substrate interactions. We also developed a method that uses gene expression, protein-protein interactions and phenotype information to predict genetic interactions. These predicted interactions can facilitate the identification of potential therapeutic targets. Ultimately, a better selection of therapeutic targets should lead to better-quality drug candidates. We also addressed the challenge of developing combinatorial therapies. Although combinatorial therapies are advantageous, the scale of the experiments needed to search for desirable chemical combinations is currently prohibitive. We therefore developed a method that uses system response information to predict chemical synergies with a view to facilitating the development of combinatorial therapies.

Overall, this thesis shows how computational prediction within a systems biology framework can be used to facilitate and accelerate the early stages of drug development.

KEY TO ABBREVIATIONS

ADME: Absorption, Distribution, Metabolism and Excretion
AIC: Akaike’s information criterion
AlvC: alverine citrate
AMCA: actin-myosin contractile apparatus
AMD: amiodarone
ASPM: Abnormal Spindle-like, Microcephaly-associated
ATP: adenosine triphosphate
AU: approximately unbiased
AUC: area under the (ROC) curve
BBD: Bud6p-binding domain
BEN: benomyl
Blebb.: blebbistatin
CASP: caspofungin
CBP: carboplatin
CDK: cyclin-dependent kinase
CFU: colony forming units
CGC: Caenorhabditis Genetic Center
CLSI: Clinical and Laboratory Standards Institute
cmpd: compound
COOH: C-terminal tail region (of Bni1p)
CPT: camptothecin
CPZ: chlorpromazine
CsA: cyclosporin A
DAD: Dia-autoregulatory domain
DAPI: 4’,6-diamidino-2-phenylindole
DGC: dystrophin glycoprotein complex
DMD: Duchenne Muscular Dystrophy
DMI: desipramine
DNA: deoxyribonucleic acid
DOXY: doxycycline
Emo: accumulation of endomitotic oocytes
ER: endoplasmic reticulum
FA: formic acid
FCZ: fluconazole
FDA: Food and Drug Administration
FEN: fenpropimorph
FH: Formin-Homology domain
FICI: fractional inhibitory concentration index
FKBP: FK506 binding protein
GAP: GTPase activating protein
GBD: GTPase binding domain
GDI: guanine-nucleotide dissociation inhibitor
GDP: guanosine diphosphate
GEF: guanine-nucleotide exchange factor
GFP: green fluorescent protein
GIN: genetic interaction neighbourhood
GIST: gastrointestinal stromal tumour
GO: Gene Ontology
Gon: severe shortening of the gonads
GST: glutathione S-transferase
GTP: guanosine triphosphate
GW: genome-wide
HIV: human immunodeficiency virus
HOG: high-osmolarity glycerol
HMM: hidden Markov model
HPLC: high performance liquid chromatography
HygB: hygromycin B
IBU: ibuprofen
IPTG: Isopropyl β-D-1-thiogalactopyranoside
KAN: kanamycin
LatA: latrunculin A
LC-MS: liquid chromatography-mass spectrometry
MAPK: mitogen-activated protein kinase
MCC: minimum cytotoxic concentration
MFC: minimum fungicidal concentration
MIC: minimum inhibitory concentration
MLC: myosin light chain
MMS: methyl methanesulfonate
MPA: mycophenolic acid
mRNA: messenger ribonucleic acid
MRSP: mental retardation and synaptic plasticity
MYR: myriocin
NA: not available
NCRR: National Center for Research Resources
ND: not done
NGM: nematode growth media
NIH: National Institutes of Health
NRC: National Research Council of Canada
NYS: nystatin
PAK: p21 activated kinase
PCR: polymerase chain reaction
PhEP: phenotype, and PP interaction network
PIN: physical interaction neighbourhood
PLZF: promyelocytic leukaemia zinc finger
PP: protein-protein, e.g. PP interaction
PSSM: position-specific scoring matrix
Q-TOF: quadrupole time-of-flight
RNA: ribonucleic acid
RNAi: ribonucleic acid interference
ROC: receiver-operating characteristic curve
RSC: Remodel the Structure of Chromatin
SAGA: Spt-Ada-Gcn5-acetyltransferase complex
SBD: Spa2p-binding domain
SC: synthetic complete (media)
SD: synthetic dropout (media)
SDS-PAGE: sodium dodecyl sulfate polyacrylamide gel electrophoresis
SGD: Saccharomyces genome database
SGP: somatic gonadal precursor
SSU: U3-containing Small Subunit
Ste: sterility
SynMuvB: synthetic multivulval B
SVM: Support Vector Machine
TC: tetracycline
TUN: tunicamycin
WM: wortmannin
YPD: yeast peptone dextrose
ZS: Zhong and Sternberg, 2006

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
ABRÉGÉ
KEY TO ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Systems Biology and Drug Development
    1.1.1 The Parts: Genes
    1.1.2 Interactions within a System
    1.1.3 Interactions between Compounds and a System
    1.1.4 System Dynamics
    1.1.5 Conservation of System Behaviour
    1.1.6 Productivity in the Pharmaceutical Industry
  1.2 Machine Learning in Systems Biology
    1.2.1 Class Discovery
    1.2.2 Class Prediction
  1.3 Addressing Challenges in Drug Development
    1.3.1 Therapeutic Targets and Toxicity
    1.3.2 Therapeutic Targets for Monogenic Diseases
    1.3.3 Combinatorial Therapies
2 A Biochemical Genomics Screen for Substrates of Ste20p Kinase Enables the in silico Prediction of Novel Substrates
  2.1 Preface
    2.1.1 Contributions of Authors
  2.2 Abstract
  2.3 Introduction
  2.4 Results
    2.4.1 A biochemical genomics screen identifies in vitro substrates of the Ste20p kinase
    2.4.2 An in silico approach to the identification of Ste20p substrates
    2.4.3 Application of the predictor to the yeast proteome
    2.4.4 Genetic and physical networks suggest in vivo relevance of predicted substrates
    2.4.5 Polarisome components Bud6p and Bni1p are in vitro substrates of Ste20p
  2.5 Discussion
  2.6 Materials and Methods
    2.6.1 Materials
    2.6.2 Construction of plasmids
    2.6.3 Yeast strains and protein purifications
    2.6.4 Protein kinase assays
    2.6.5 Mass spectrometry
    2.6.6 Identification of predictive motifs
    2.6.7 The predictor of Ste20p substrates and its cross-validation
    2.6.8 Gene Ontology (GO) analysis
    2.6.9 Genetic and physical interaction network analysis
  2.7 Acknowledgements
3 Supplementary Information for Chapter 2
  3.1 Supplementary Text
4 Searching for Signaling Balance through the Identification of Genetic Interactors of the Rab Guanine-nucleotide Dissociation Inhibitor gdi-1
  4.1 Preface
    4.1.1 Contributions of Authors
  4.2 Abstract
  4.3 Introduction
  4.4 Results
    4.4.1 The predictor of genetic interactions in C. elegans
    4.4.2 The set of predicted genetic interactions exhibits improved coverage of genes conserved between human and C. elegans
    4.4.3 Validation of predicted genetic interactors of gdi-1
  4.5 Discussion
  4.6 Methods
    4.6.1 Construction of the learning set
    4.6.2 Datasets used to derive attributes
    4.6.3 Derivation of attributes for use in the logistic regression
    4.6.4 Model specification, training and cross-validation
    4.6.5 Predictions from other genetic interaction predictors
    4.6.6 Quantifying the information available for a gene
    4.6.7 Analysis of the predicted genetic interaction network with pathway annotations
    4.6.8 Nematode strains
    4.6.9 RNAi and drug treatment
    4.6.10 Measurement of sheath cell contraction
    4.6.11 Epistasis statistics
    4.6.12 Statistic for the suppression of muscle degeneration
  4.7 Acknowledgements
5 Supplementary Information for Chapter 4
  5.1 Supplementary Text: Supplementary Methods
    5.1.1 Human orthologues of genes with predicted genetic interactions
    5.1.2 Controls for the validation of predicted genetic interactions using RNAi and balanced heterozygote strains
    5.1.3 Epistasis statistics
6 Chemogenomic Profiling Predicts Antifungal Synergy
  6.1 Preface
    6.1.1 Contribution of Authors
  6.2 Abstract
  6.3 Introduction
  6.4 Results
    6.4.1 The collection and generation of chemogenomic profiles
    6.4.2 Components of the FCZ-Fungicidal Set
    6.4.3 Prediction of synergistic compounds
    6.4.4 Validation of predicted synergies in S. cerevisiae
    6.4.5 Validation of predicted synergies in C. albicans
    6.4.6 A comparison of predictors dependent on haploid- and/or diploid-based profiles
  6.5 Discussion
  6.6 Materials and Methods
    6.6.1 Strains & media
    6.6.2 Library screen to generate the lethality-based chemogenomic profile for fluconazole
    6.6.3 MIC assays, recovery assays and compound synergy tests in S. cerevisiae and C. albicans
    6.6.4 Complementation assay
    6.6.5 The chemogenomic profile collection
    6.6.6 Annotations of the FCZ-Fungicidal genes
    6.6.7 The gold standard set
    6.6.8 The measures of chemogenomic profile similarity
    6.6.9 Synergy prediction
    6.6.10 Permutation analysis
    6.6.11 Fitting to other models of synergy
  6.7 Acknowledgements
7 Supplementary Information for Chapter 6
  7.1 Supplementary Text
8 Discussion
  8.1 The Prediction of Kinase-Substrate Interactions
  8.2 The Prediction of Genetic Interactions in C. elegans
  8.3 The Prediction of Chemical Synergies
  8.4 Conclusions
REFERENCES

LIST OF TABLES

2–1 Hits from the in vitro screen for Ste20p substrates
2–2 Motifs used in the construction of the Ste20p substrate predictor
3–1 GO slim Cellular Component analysis of predicted Ste20p substrates (score ≥ 0.9)
3–2 GO slim Biological Process analysis of predicted Ste20p substrates (score ≥ 0.9)
3–3 GO slim Molecular Function analysis of predicted Ste20p substrates (score ≥ 0.9)
3–4 Predicted substrates (score ≥ 0.9) in the neighborhoods of STE20 genetic interactors
3–5 Predicted substrates (score ≥ 0.9) in the neighborhoods of Ste20p physical interactors
3–6 The number of STE20-linked Genetic Interaction Neighborhoods (GINs) and Ste20p-linked Physical Interaction Neighborhoods (PINs) in which each known Ste20p substrate (from the positive learning set) appears
5–1 Performance of genetic interaction predictors
5–2 Epistasis coefficients of experimentally tested genetic interactions
5–3 Epistasis P values of experimentally tested genetic interactions
5–4 AIC values of 63 logistic regression models that use different combinations of the gene pair attributes
5–5 Genotypes of C. elegans strains used in this study
6–1 Compound pairs that exhibit antifungal synergy
7–1 The 22 FCZ-Fungicidal genes
7–2 Phenotypes of the FCZ-Fungicidal strains
7–3 The gold standard set of positive and negative examples of compound pairs that exhibit antifungal synergy
7–4 Evaluation of compound pairs for antifungal synergy in S. cerevisiae with the Loewe additivity model
7–5 Dose-matrix responses to compound pairs in S. cerevisiae fitted to Bliss boosting and potentiation models
7–6 Evaluation of compound pairs for antifungal synergy in C. albicans with the Loewe additivity model
7–7 Dose-matrix responses to compound pairs in C. albicans fitted to Bliss boosting and potentiation models

LIST OF FIGURES

1–1 The stages of drug development
2–1 Examples for computing the total weight of a motif m for a given protein sequence t (w_{t,m})
2–2 Significantly over-represented (adjusted P ≤ 0.05) GO slim annotations among the predicted Ste20p substrates
2–3 Inferring the biological relevance of Ste20p predicted substrates via neighborhood analysis
2–4 Clustering profiles of overlap between the predicted substrates and the GINs of STE20 genetic interactors
2–5 Clustering profiles of overlap between the predicted substrates and the PINs of Ste20p physical interactors
2–6 Bni1p and Bud6p are phosphorylated by Ste20p in vitro
3–1 Receiver-Operating Characteristic (ROC) curves of Ste20p substrate predictors
3–2 The statistical significance of clusters based on the Genetic Interaction Neighborhood (GIN) analysis shown in Figure 2–4
3–3 The statistical significance of clusters based on the Physical Interaction Neighborhood (PIN) analysis shown in Figure 2–5
3–4 STE20 genetic neighborhood analysis suggests that several predicted substrates in the negative set may represent false negatives of the biochemical screen
4–1 Comparison of large-scale genetic interaction studies in C. elegans
4–2 Gene pair attributes used to predict genetic interactions
4–3 Assessment of the biological relevance of the predicted genetic interaction network with pathway annotations
4–4 Phenotypical characterization of gdi-1(RNAi)-treated animals
4–5 Validation of a subset of genetic interactions predicted for gdi-1
4–6 Epistasis between gdi-1 and its predicted genetic interactors and chemical suppressors
4–7 gdi-1 suppresses dys-1- and dyb-1-associated muscle degeneration
5–1 Receiver-operating characteristic curve of the genetic interaction predictor
5–2 Comparison of genome-wide genetic interactions predicted by different approaches
5–3 The relationship between the quantity of information available for a gene and the number of predicted genetic interactions
5–4 Different methods for estimating the P value associated with a Pearson correlation value measuring the coexpression of two genes in the Kim et al. dataset
5–5 The dependencies between the predictive gene pair attributes as defined by a learned Bayesian network
5–6 The interaction of gdi-1 with unbalanced heterozygotes of aspm-1(ok1208)
5–7 Validity of the normality assumption for the application of Student’s t-tests to phenotype measurement data
6–1 A method for identifying synergistic compounds with antifungal activity
6–2 The measures of chemogenomic profile similarity and antifungal synergy predictions based on these measures
6–3 Statistical evaluation of the gene-based chemogenomic profile similarity measure as a predictor of antifungal synergy
6–4 Dose-matrix responses to compound pairs that exhibit antifungal synergy
6–5 The dose-matrix recovery of clinical isolates from treatment with fluconazole and wortmannin
7–1 Comparisons of variants of the antifungal synergy predictor
7–2 Statistical evaluations of the gene-based chemogenomic profile similarity measure as a predictor of antifungal synergy with different sets of negative examples
7–3 Dose-matrix responses to compound pairs that exhibit antifungal synergy in S. cerevisiae
7–4 Dose-matrix responses to compound pairs that exhibit antifungal synergy in C. albicans

CHAPTER 1
Introduction

Drug development is currently a time-consuming, costly and challenging process. The process typically starts with the identification of a therapeutic target for a given disease (Figure 1–1, adapted from [266]). A therapeutic target is some biological molecule and the binding of compounds to target molecules is expected to cause a desired therapeutic effect. That is, target-binding compounds have the potential to become drug candidates, and thus the next stage of the process involves screening chemical libraries for such compounds. Identified compounds enter the optimization stage where the goal is to find their analogues that have drug-like properties (e.g. solubility), reduced toxicity and greater efficacy. The most promising analogues are drug candidates. The candidates enter clinical trials and successful candidates are subsequently reviewed for approval [e.g. in the United States by the Food and Drug Administration (FDA)] for market release. The whole development process for a single new drug takes 10 years and the estimated out-of-pocket cost is 310 million USD on average [2]. Despite the large investments in drug development, there is a tendency for many drug candidates to fail during clinical trials, and consequently, very few candidates become approved new drugs [150]. This trend suggests that the early stages of drug development should be improved to provide better drug candidates.

Figure 1–1: The stages of drug development. The process typically involves the identification of a therapeutic target, followed by the screening of chemical libraries to identify compounds that bind to the target, then optimization of identified compounds, and finally, the optimized compounds undergo clinical evaluation (i.e. they enter clinical trials).

The reasons for which a drug candidate may fail during clinical trials include unacceptable toxicity and insufficient efficacy observed in humans. These reasons suggest that the assessments of a compound during the early stages of drug development often inaccurately predict the effect of the compound in humans. Indeed, the Innovative Medicines Initiative, a public-private partnership in the European Union, aims to improve predictions of toxicity and efficacy towards alleviating research bottlenecks in the drug development process [130]. Before clinical trials, toxicity and efficacy may be measured from human tissue cultures treated with the compound. The experiments are conducted in vitro, i.e. not in a living organism, and are thus unable to detect effects of the compound that would result from the interplay between different tissues within a whole organism. Studies of whole animals are conducted to compensate for this shortcoming, but animal models do not mimic humans exactly, and thus the observed effects in animals are not always consistent with the effects in humans. The limitations of these in vitro and in vivo experiments suggest that a better estimation of the effects of a compound in a human system would enable a better judgment of whether or not the compound should be pursued as a drug candidate.

1.1 Systems Biology and Drug Development

One of the main goals of systems biology, i.e. the study of whole biological systems, is to accurately predict how a given biological system responds to perturbations [11]. If this goal were to be achieved for human systems, better predictions of the effects of a given compound might be obtainable, possibly improving the early stages of drug development as a result. This possible means of improvement is motivating the integration of systems biology with drug development [28]. While the integration is a work in progress, many studies have contributed to the groundwork [47]. That is, in any system, there are parts and interactions between the parts. Studies that identify and improve our understanding of these components of biological systems are establishing a foundation on which future studies can be built to address challenges in drug development.

1.1.1 The Parts: Genes

In recent decades, systems biology has reached new levels of comprehensiveness and resolution. That is, current technologies can generate “snapshots” of entire systems at the molecular level. For example, all human chromosomes are essentially deoxyribonucleic acid (DNA) molecules and these molecules were resolved into sequences of characters, where each character represents a DNA building block (e.g. [154, 259]). The results were used to predict the subsequences corresponding to genes. The set of human genes is essentially a catalogue of parts, yet under most conditions a cell only expresses a subset of all genes and other genes are expressed as needed [132]. When a cell requires the activity of a specific gene, an intermediate molecule called messenger ribonucleic acid (mRNA) is generated from the DNA sequence of the gene [63]. Typically, the mRNA is then translated into a protein, i.e. a molecule that folds into a three-dimensional structure and subsequently carries out specific cellular functions [63]. The advent of microarray technology and advances in the field of proteomics enabled the generation of snapshots of the expression levels of all genes in given cells by quantifying specific mRNAs and proteins (respectively) in cellular extracts [229, 201]. Importantly, one can compare snapshots of the system in different conditions, e.g. diseased versus healthy, or chemically perturbed versus unperturbed. Therefore, system-wide snapshots permit unbiased searches for genes that are relevant to particular conditions, thus circumventing the need to rely on the possibly limited existing knowledge about the conditions to suggest which genes to study.

A common strategy for investigating the role of a gene in a condition of interest involves (i) perturbing the organism such that the expression of the gene is reduced and (ii) comparing the phenotypes (i.e. qualities of a biological system) observed from the perturbed and wild-type organisms subjected to the condition of interest (e.g. [268]). However, the first step is not actively done in humans for ethical reasons. Except for clinical trials, studies of whole human organisms are observational. For example, patients already harbouring a disease-causing, loss-of-function mutation in a specific gene may be compared to individuals without a mutation in the gene. Due to the limitations of observational studies, hypotheses about the roles of human genes are also generated from human cell lines and model organisms.

Model organisms and evolutionary conservation

Model organisms are often studied in lieu of humans. Saccharomyces cerevisiae (baker’s yeast), Caenorhabditis elegans (a roundworm), Drosophila melanogaster (a fruit fly), mice and rats are examples of model organisms that are manipulated genetically to gain insight into humans [85]. Importantly, model organisms are studied not only due to the greater experimental flexibility that they afford, but also due to evolutionary conservation. A gene from one species is conserved in a second species if there is a gene in the second species with high sequence similarity to the initial gene, and both genes are thought to be derived from an ancestor common to both species. For example, the conservation of a human gene in mice implies that there is a version of the human gene in mice (called an orthologue). This suggests that hypotheses involving human genes can be generated using mice as models. In general, studies of model organisms aim to generate plausible hypotheses about organisms (with orthologous genes) that are less characterized and/or more difficult to study.

Genetic resources

Genetic manipulation of model organisms can also be challenging. A classic strategy for reducing the expression of a specific gene is to delete the gene from the underlying genome. Deletion strain libraries were constructed in S. cerevisiae such that each strain has a different gene deleted [268]. A more recently discovered strategy involves RNA interference (RNAi), that is, the delivery of specific double-stranded RNA molecules into cells can reduce the expression of the genes that they target [88]. RNAi libraries were constructed for C. elegans and a few other model organisms such that each library member reduces the expression of a different gene (e.g. [139, 222]). Both strategies tend to be more difficult to execute with mammalian model organisms (e.g. [277, 255]). For example, RNAi molecules are delivered to mammals via injection, infusion, lentiviral infection or inserting the RNA code into the underlying genome [255], whereas the molecules can simply be fed to C. elegans animals [248]. Taken together, the genetic resources of model organisms, especially those of S. cerevisiae and C. elegans, facilitate studies of the functions of genes.

1.1.2 Interactions within a System

The identification of all genes in various organisms was followed by efforts to identify all interactions between the genes or their encoded proteins (e.g. [254, 250]). Interactions are important because they show coordination and dependencies between genes that ultimately play a part in how the system responds to perturbations. However, even in a relatively simple organism such as S. cerevisiae, there are approximately 18 million gene pairs to test for an interaction. The advent of new high-throughput technologies permitted the comprehensive identification of interactions, though not without limitations or caveats.
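
The scale of this search space follows from counting unordered pairs of distinct genes, n(n − 1)/2. The short sketch below is only a rough check of the figure quoted above; it assumes about 6,000 genes in S. cerevisiae, a round number that is not stated in the text, and the function name is hypothetical.

```python
from math import comb


def count_gene_pairs(n_genes: int) -> int:
    """Number of unordered pairs of distinct genes: n * (n - 1) / 2."""
    return comb(n_genes, 2)


# Assumption: roughly 6,000 genes in S. cerevisiae (illustrative round number).
N_GENES_YEAST = 6000

n_pairs = count_gene_pairs(N_GENES_YEAST)
print(f"{n_pairs:,} gene pairs")  # 17,997,000 gene pairs, i.e. about 18 million
```

Because the count grows quadratically with the number of genes, exhaustive pairwise testing quickly becomes impractical without high-throughput methods.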

Protein-protein interactions

A protein-protein (PP) interaction, i.e. a physical interaction between two proteins, can serve many different purposes including the formation of protein complexes representing larger functional units, and the relaying of signals [114]. The yeast two-hybrid assay identifies PP interactions between proteins that are made to be overexpressed and/or that are not in their native environments [86]. While the assay has detected thousands of interactions, it is expected that some fraction of them are not physiologically relevant, e.g. if the proteins are never expressed together in the same native environment, and thus never have the opportunity to interact [113]. The tandem affinity purification with mass spectrometry technique involves extracting complexes from their native environments followed by the identification of the proteins in each complex (e.g. [121]). Interactions identified by this technique are mostly stable, and between proteins that are abundant in the cell. The protein-fragment complementation assay detects PP interactions between proteins from within their native environments, and is therefore capable of identifying less stable yet still physiologically relevant interactions [244]. Taken together, there are different high-throughput assays for identifying PP interactions and they are complementary.

Genetic interactions

A genetic interaction represents a relationship between the activities of two genes, such as antagonism or synergism. Specifically, a genetic interaction between two genes exists when the phenotypic effect of a perturbation (e.g. mutation, RNAi treatment, drug targeting) in one gene is dependent upon a perturbation in the other gene [78]. Thus, a genetic interaction between two genes does not necessarily imply that their encoded proteins exhibit a PP interaction, or that the encoded protein of one gene binds to the DNA of the other gene. Indeed, the genes may be expressed in different compartments of a cell, or even different tissues of an organism. In S. cerevisiae, the deletion strain libraries have been exploited by the synthetic genetic array method to identify genetic interactions with respect to a growth phenotype; 94% of all gene pairs were tested [250, 61, 251]. In C. elegans, the RNAi library was used together with several mutant strains to identify genetic interactions with respect to six different phenotypes; only gene pairs that include signalling genes were tested [163, 48]. Whole organisms of the species S. cerevisiae and C. elegans are small enough to fit into a well in a 96-well plate, thus enabling large-scale, high-throughput genetic experiments. The size of whole mammals and the effort required to maintain them are considerably less conducive to high-throughput experiments; not surprisingly, there have been no large-scale efforts to identify genetic interactions in whole mammals. However, one study used cell lines to infer human genetic interactions with respect to a viability phenotype; 99% of all gene pairs were tested [165]. Specifically, Lin et al. analyzed publicly available radiation hybrid datasets that show the effects of adding extra copies of DNA fragments (generated by random breaks in the genome induced by irradiation) on the viability of cells. While the study avoids the challenge of specific genetic manipulation, each DNA fragment does not necessarily correspond to a complete single gene, and this influences the accuracy with which one can infer an interaction between two specific genes. Taken together, genetic interactions have been mapped for a few organisms with varying degrees of comprehensiveness and confidence, and to date, very few phenotypes have been considered.

1.1.3 Interactions between Compounds and a System

Although a compound may have a known target within a system, the broader consequences of perturbing the system with the compound are of interest. These consequences may follow from the compound binding to its known target, or from the compound binding to unknown targets (e.g. adverse effects of a compound developed for atherosclerosis were detected and evidence suggests that these effects are unrelated to the primary target [260]). As mentioned above, comparing expression snapshots (or more formally, expression profiles) generated from perturbed and unperturbed systems permits unbiased searches for genes involved in the response to the perturbation. Such genes represent a starting point for studying in detail the consequences of a chemical perturbation. However, it is not necessarily the case that all genes involved in the response to the perturbation are differentially expressed (i.e. the activity of a gene might instead be altered by a chemical modification of the corresponding protein as shown in [186]), and thus some will not be detected by comparing expression profiles. Fortunately, these other genes can be detected with other types of system-wide profiles. For example, each strain of the yeast deletion library could be treated with the compound [205]. The resulting growth of each strain could then be measured and the collection of measurements would define a chemogenomic profile of the compound. Strains showing reduced growth in comparison to untreated cells suggest that the reduced expression of the corresponding genes (i.e. the genes that were deleted from the strains) confers sensitivity to the compound. The chemogenomic profile can highlight genes that have no previous association with the compound but that are nonetheless relevant to the response to the compound. In other organisms, chemogenomic profiles can be generated by using an RNAi library in place of a yeast deletion library. Taken together, genes that stand out from a chemically-induced system-wide profile link the system to the compound at the molecular level, and studying the functions and interactions of these genes can provide insight into how the system responds to the compound.
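The chemogenomic profile described above can be illustrated with a minimal sketch. It assumes that growth is summarized as a single number per deletion strain and that sensitivity is called with a simple ratio cutoff; the strain names, growth values, the 0.5 threshold, and the function names are all hypothetical and are not taken from the cited study.

```python
from typing import Dict, List


def chemogenomic_profile(treated: Dict[str, float],
                         untreated: Dict[str, float]) -> Dict[str, float]:
    """Return the treated/untreated growth ratio for each deletion strain."""
    profile = {}
    for strain, control_growth in untreated.items():
        # Skip strains without a treated measurement or without control growth.
        if strain in treated and control_growth > 0:
            profile[strain] = treated[strain] / control_growth
    return profile


def sensitive_strains(profile: Dict[str, float],
                      threshold: float = 0.5) -> List[str]:
    """Strains whose relative growth falls below the (assumed) threshold."""
    return sorted(s for s, ratio in profile.items() if ratio < threshold)


# Hypothetical growth measurements (arbitrary units).
untreated = {"geneA_del": 1.00, "geneB_del": 0.95, "geneC_del": 1.10}
treated = {"geneA_del": 0.30, "geneB_del": 0.90, "geneC_del": 0.44}

profile = chemogenomic_profile(treated, untreated)
hits = sensitive_strains(profile)
print(hits)
```

In this toy example the deletions of geneA and geneC would be flagged as conferring sensitivity to the compound. A real analysis would typically use replicate measurements and a statistical test rather than a fixed ratio cutoff.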

1.1.4 System Dynamics

The dynamic nature of biological systems is apparent through the varying activity levels of certain genes over time and the transience of certain interactions. In studying the molecular dynamics of a biological system, one of the goals is to obtain a better understanding of the roles of genes and interactions within the context of the system [149]. A biological system can be studied at the level of a molecular pathway, regulatory network, cell, tissue, organ, or a whole organism, and to date, most studies of molecular dynamics have been focused on the lowest level of organization: pathways [15]. Even at the lowest level, the dynamic aspect of interactions can result in switch-like behaviour, oscillations, and more [212, 131]. These dynamic relationships are encoded in mathematical models, and these models are used to drive simulations to predict the behaviour of the system in different conditions [149].

The dynamic relationships are typically inferred from the literature and/or measurements of the system parts (e.g. gene expression) taken from different starting conditions and over periods of time. In mathematical models, the relationships are often encoded as differential equations specifying the kinetics of the underlying biochemical reactions (e.g. [94, 81]). However, it is difficult to obtain accurate estimates of the model parameters since the generation of extensive experimental data is necessary for the task (e.g. [190, 8]). As an alternative, studies have used the known structure (i.e. the identified genes and interactions) of a system as the main source of data to gain insight into the dynamics of the system, since it has been observed that the structure dictates the dynamics to a great extent [147]. For example, one study used token accumulation and dissipation over time (in simulations) to characterize signal flow through a signalling network, where the nodes are signalling proteins and the edges are activating or inhibitory interactions [223]. Encouragingly, models of the heart were developed despite greater complexity at the level of organs (reviewed in [195]). Previously developed models of cell types within the heart were incorporated into models of the whole organ, and these models have already been used in drug development. The study of the dynamics of biological systems will likely continue to gain momentum as the structures of systems become better characterized.
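To make the idea of encoding kinetics as differential equations concrete, the sketch below simulates a deliberately simple model that is an assumption for illustration only: a single species produced at a constant rate and degraded in proportion to its abundance, dx/dt = k_syn - k_deg * x. The rate constants and the forward Euler integration scheme are hypothetical choices and are not drawn from the cited models.

```python
def simulate_expression(k_syn, k_deg, x0, dt, n_steps):
    """Forward Euler integration of dx/dt = k_syn - k_deg * x."""
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        # Advance one time step using the current rate of change.
        x += dt * (k_syn - k_deg * x)
        trajectory.append(x)
    return trajectory


# Hypothetical rate constants (arbitrary units).
k_syn, k_deg = 2.0, 0.5
traj = simulate_expression(k_syn, k_deg, x0=0.0, dt=0.01, n_steps=5000)

# Setting dx/dt = 0 gives the analytical steady state k_syn / k_deg.
steady_state = k_syn / k_deg
print(traj[-1], steady_state)
```

Even for this one-equation model, two parameters must be estimated from data, which hints at why parameter estimation becomes demanding for models with many interacting species.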

Overall, a biological system is greater than the sum of its parts. Accordingly, systems biology permits the generation of novel hypotheses for how a system would respond to a perturbation, many of which would not be found with single-gene studies.

1.1.5 Conservation of System Behaviour

The conservation of genes between two species does not necessarily imply that their interactions are also conserved, and consequently, nor would it imply that the system behaviour is conserved. It was reported that a significant but low percentage of genetic interactions is conserved between two yeast species, S. cerevisiae and Schizosaccharomyces pombe [75], suggesting that an even lower percentage would be conserved between more distant species. However, only genetic interactions defined by a growth phenotype were considered. Many different pathways can contribute to the growth of an organism and thus, the dependencies between these pathways can vary among distant species. Therefore, even a gene that is highly conserved between two distant species may not have many conserved interactions defined by growth. In contrast, a phenotype that is more specific to some conserved pathway or disease system may reveal a significant number of conserved interactions. The conservation of disease-related mechanisms between model organisms and human has been reported in several instances [64, 6, 1, 238], supporting the conservation of system interactions and behaviour in certain cases. Furthermore, it has been shown that the architecture of certain signalling pathways is conserved between lower and mammalian organisms [47]. Although different organisms may use a conserved module to regulate different processes, the conservation suggests that studying the module in lower organisms should help characterize the module in higher organisms [174]. Taken together, though there may not be broad conservation of interactions between model organisms and humans, the conservation of particular system modules may be sufficient for model organisms to provide useful insights into the behaviour of human systems.

C. elegans is a multicellular animal with a lifespan of approximately 2-3 weeks [218]. Although humans have clearly diverged from C. elegans in general, a large number of human genes associated with disease are conserved in C. elegans [238]. Furthermore, C. elegans studies have facilitated the dissection of the physiopathological mechanisms of genetic diseases including Duchenne Muscular Dystrophy (DMD), obesity, diabetes and Huntington's disease [144, 16]. These previous studies show that the conservation that exists between the two species is sufficient for providing insight into certain human diseases, based on observations in C. elegans. Thus, screens for therapeutic targets for human diseases have been conducted in C. elegans. For example, the biotechnology company Devgen conducted a genome-wide RNAi screen in a C. elegans model of type 2 diabetes, i.e. a strain with a loss-of-function mutation in daf-2, the C. elegans orthologue of the insulin-like growth factor receptor (reviewed in [16]). Briefly, the screen found that the reduced expression of a specific protein suppressed the daf-2 mutant phenotype. The orthologous protein was then deleted from a mouse model that can be induced to show early signs of diabetes. Mice lacking the protein, and thus its functions, were protected against diabetes. These results encourage further study of the human orthologue of the protein as a potential therapeutic target for type 2 diabetes since it may be possible to find a compound that inhibits the function of the protein by binding to it, alleviating symptoms of the disease as a result. In general, the Devgen study supports the search for therapeutic targets for human diseases by using C. elegans as a model.

S. cerevisiae is a unicellular fungus and in the domain of eukaryotes (i.e. organisms with cells that have a nucleus) like humans [236]. While humans have diverged even farther from S. cerevisiae in general, genes involved in basic cellular processes such as the cell cycle are conserved between the two species [128].
For example, the immunosuppressive drug rapamycin was shown to cause cell cycle arrest in yeast, where the FKBP-rapamycin complex targets the yeast proteins Tor1p and Tor2p [118]. The drug was then shown to cause cell cycle arrest in human cells, where the FKBP-rapamycin complex targets the human orthologue of the yeast targets, mTOR [43, 183]. Therefore, the level of conservation between the two species is sufficient to provide insight into how human cells respond to certain compounds, based on observations in yeast.

As a well-studied fungus, S. cerevisiae is also effective as a model for fungal pathogens. Most laboratory S. cerevisiae strains are non-pathogenic [236] and thus safer and easier to work with. For example, a low-cost and operationally simple assay was developed for the identification of compounds that cause the lysis of S. cerevisiae cells [151]. The assay was used to screen five chemical libraries, resulting in the identification of several compounds including drugs already used against fungal pathogens in the clinic. Thus, the S. cerevisiae assay successfully identified compounds that have antifungal activity against fungal pathogens, highlighting the utility of S. cerevisiae as a model fungus for drug development.

Therefore, previous studies suggest that C. elegans and S. cerevisiae are useful model organisms for the early stages of drug development.

1.1.6 Productivity in the Pharmaceutical Industry

Even though advances in systems biology are being exploited in drug development (as illustrated by the examples above), a sustained increase in the yearly number of new drugs has not yet followed. Specifically, the average yearly number of new drugs was 35 for 1996-2001 but only 20 for 2002-2007 [127]. A simple explanation is that the benefits of systems biology have not yet come to fruition in drug development due to the length of the process. Moreover, it was proposed that the decline in the number of new drugs is in part due to patent challenges in the United States (in accordance with the Drug Price Competition and Patent Term Restoration Act of 1984) [119]. Generic-drug-manufacturing companies have been challenging patents with increasing frequency, and as a result, decreasing the revenue of pharmaceutical companies. This has likely limited the development of new drugs, and instead, pharmaceutical companies further develop existing drugs. For example, in recent years, pharmaceutical companies have released with greater frequency reformulations (i.e. next generations) of existing drugs that represent low-risk investments. In addition, systems biology may have played a role in repurposing existing drugs, i.e. to treat diseases other than the ones for which they were originally approved. Re-using existing drugs can cut development costs since, for example, the drugs have already been assessed for toxicity to some extent. However, if an existing drug is approved for a different indication, it is not counted as a new drug (i.e. a new molecular entity), and the benefits of systems biology in drug development may be less obvious as a result. Taken together, additional reductions in development costs driven by systems biology may sway pharmaceutical companies to focus more heavily on innovation.

1.2 Machine Learning in Systems Biology

Machine learning has been used to address various challenges in systems biology including the prediction of gene functions, interactions, and the responses of a system to perturbations (e.g. [109, 280, 157]). Although many advances in biotechnology have allowed these challenges to be addressed experimentally, the current techniques have costs that tend to prohibit system-wide experiments, or they have technical limitations that leave gaps in our knowledge of systems (such as those described in section 1.1.2). Machine learning addresses the cost issue because the computational experiments tend to be much less costly compared to bench experiments, and the predictions can also be used to prioritize bench experiments. Furthermore, machine learning is used to address the technical limitations of biotechnologies by learning patterns in existing data to make predictions for the gaps in our knowledge. Therefore, machine learning is well-suited for addressing challenges in systems biology. Here, the focus is on machine learning strategies that organize items into classes: class discovery and class prediction.

1.2.1 Class Discovery

Class discovery organizes items into classes without requiring any known examples from each class [83]. Specifically, a class discovery problem involves a set of items, where each item is associated with a data point in d-dimensional space. A distance metric, that is, a metric quantifying the distance between any two given points, must also be supplied to the class discovery or clustering algorithm. The algorithm essentially groups the items into clusters such that each cluster contains items that are closer to one another than to the items in other clusters. Each cluster can then be interpreted as a separate class. Although the discovery of classes is guided by parameters that constrain the size, shape, and/or number of classes, the discovery is otherwise unbiased.

Hierarchical Clustering

Hierarchical clustering algorithms are either agglomerative or divisive, growing or splitting clusters of items, respectively, until exit conditions are met [115]. The growth or division decisions are guided by a linkage criterion, that is, a metric quantifying the distance between two sets of items/points, which relies on the distance metric for individual points. For example, each item starts in its own cluster with an agglomerative algorithm. The two clusters that are closest together according to the linkage criterion are agglomerated to form a new cluster, and then the two constituent clusters are removed from consideration. The agglomeration step is repeated until there is only one cluster left. All agglomeration events are recorded and represented in a dendrogram (i.e. a tree-like diagram) where each item is a leaf and two nodes have the same parent only if the corresponding clusters were agglomerated, effectively showing the cluster hierarchy.

Hierarchical clustering is often used in systems biology with possible benefits for drug development. For example, in one study each item represented a sample of yeast cells that either contained a mutation in a specific gene or were treated with a specific compound, and the associated data point was a gene expression profile consisting of an expression value for each yeast gene [129]. Hierarchical clustering revealed that mutants and compounds that perturb the same cellular process induce profiles that cluster together. This suggested that gene expression profile similarity can be used to gain insight into the mechanism of action of a given compound. Indeed, the profile induced by the anesthetic drug dyclonine was found to be most similar to the profile induced by a mutation in the ergosterol biosynthesis pathway. This suggested that dyclonine perturbs the ergosterol pathway, which was confirmed by analysis of the sterol content in dyclonine-treated cells. Taken together, the hierarchical clustering of gene expression profiles organized mutations and compounds into clusters based on the cellular processes that they perturb, and profile similarity provided insight into the mechanism of action of a compound.
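The agglomerative procedure described above can be sketched in a few lines. The example assumes Euclidean distance as the distance metric and single linkage (the smallest distance between any two points from the two clusters) as the linkage criterion; both are illustrative choices, as are the function names and the toy points, and the cited study may well have used different settings.

```python
from math import sqrt


def euclidean(p, q):
    """Distance metric between two individual points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(cluster_a, cluster_b, points):
    """Linkage criterion: smallest point-to-point distance between clusters."""
    return min(euclidean(points[i], points[j])
               for i in cluster_a for j in cluster_b)


def agglomerate(points):
    """Return the recorded merge events as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # each item starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under the linkage criterion.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Remove the two constituent clusters and add the new one.
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return merges


# Toy 2-dimensional data points (hypothetical).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerate(points)
for event in merges:
    print(event)
```

The list of merge events contains the information needed to draw the dendrogram: each event corresponds to an internal node whose children are the two agglomerated clusters.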

1.2.2 Class Prediction

Class prediction involves predicting which class a given item belongs to based on features that distinguish the classes of interest. Consider the simplest case, in which there is only one class of interest and the goal is to predict whether or not a given item belongs to the class. A training set consisting of examples of members and non-members of the class is required to train a classifier. Each item is described in terms of selected features (i.e. the inputs of the classifier), which are informative for predicting the class membership of the item to varying degrees. The complexity of a classifier, i.e. the number of classifier parameters, increases with the number of features. However, even with a fixed set of features, there are several different families of classifiers of varying complexity. Each family has specific parameters and constraints. Nevertheless, every family-specific learning method selects parameter values that minimize the class prediction errors for the training examples, given the constraints. Therefore, the classifier family, selected features and training set mainly define a classifier.

As alluded to above, feature selection influences the complexity of a classifier. In selecting the features that the classifier will use, one would ideally select those that are most useful for the classification problem at hand, excluding the features that would make the classifier unnecessarily complex. Feature selection may be guided by training variants of the classifier that use different features and subsequently measuring the performances (with respect to the training set) of the variants [95]. In pursuing the variant(s) with the best performance, the corresponding features would be selected. There can be many features to select from in systems biology problems. For example, the expression levels of genes have been used by classifiers to predict the prognosis of breast cancer patients (e.g. [87]). However, the expression levels of approximately 20K human genes were measured, and thus there were approximately 20K different features from which to select. Therefore, feature selection is often a challenge in class prediction problems in systems biology.

Another important aspect of class prediction is the bias-variance trade-off [98]. Compared to a complex classifier, a simple classifier is less likely to capture complex trends in the feature data that inform on class membership, and is thus more likely to have a large bias. A large bias indicates that the classifier consistently makes errors for certain items regardless of the training set used. A simple classifier is also less likely to be overfitted, i.e. to capture trends that are specific to the given training set yet irrelevant to the classification problem, and is thus less likely to vary greatly if a different training set were used. High variance would suggest that the classifier does not perform well on items that are not in the training set, and thus, it does not generalize well. Taken together, the complexity of a classifier is tied to the bias-variance trade-off.

Due to the complexity of biological systems, related prediction problems can also be complex. With such problems one might address the bias-variance trade-off by training a complex classifier with a relatively large training set. The more training items there are relative to the number of classifier parameters, the less likely it is for the learning method to overfit the classifier and yield highly variable classifiers as a result. However, large training sets are often unavailable in systems biology (as is the case for the prediction problems addressed in this thesis). Furthermore, there can be a lot of missing data and/or very noisy data [267], making it difficult to learn the true complexity in the data. Therefore, simple classifiers are often used in systems biology.

The decision to choose a particular family of classifiers may be guided by the relative performance of classifiers from different families and/or properties of the family that make it better suited for the problem at hand. Here, different families of classifiers are reviewed and examples of how they have been applied in systems biology and/or drug development are provided.

Logistic Models

A logistic model is a generalized linear model that can be applied to classification problems where there is only one class of interest [62, 29]. The model has the following form:

\[
\ln\left(\frac{p}{1-p}\right) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k \tag{1.1}
\]

where

$x_1, x_2, \ldots, x_k$ represent the $k$ feature values of a given item,

$w_0$ represents the learned logit of the probability, or log odds, that the given item is in the class of interest when $x_i = 0$ for $i = 1, \ldots, k$,

$w_1, w_2, \ldots, w_k$ represent the learned weights of the $k$ features, and

$p$ represents the probability that the given item is in the class of interest given the feature values of the item.

Thus, the desired output of the model, the estimate of $p$, is based on a weighted sum of the feature values. Each $w_i$ for $i = 1, \ldots, k$ is the estimated contribution of one unit value of feature $i$ to the log odds that the given item is in the class of interest. Given the learned weights and the feature values of the item, the estimate of $p$ can be obtained. Furthermore, the item can be predicted as a member of the class of interest if the estimate of $p$ surpasses a selected threshold.

Logistic models are particularly attractive for applications in systems biology because of their simplicity and transparency. With $k$ features, there are only $k+1$ model parameters to fit. In addition, the contribution of each feature to a prediction is explicitly specified by the weight of the feature. Furthermore, the established statistics for linear models can be used to evaluate the significance of the contribution of each feature and of the model as a whole, providing insight into the underlying biology. However, missing feature data must be filled in by some other method before logistic models can be applied. Moreover, a logistic model cannot capture dependencies between the features that would influence prediction, unless interaction terms (e.g. $x_i x_j$) are included in the model. Doing so would require another weight parameter for every term added, thereby reducing the simplicity of the model. Therefore, logistic models are advantageous for systems biology because of their transparency; however, their simplicity can be both advantageous and disadvantageous.
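As a small worked illustration of the logistic model in (1.1), the sketch below inverts the logit to obtain the estimate of p from a weighted sum of feature values, and then applies a threshold. The weights, feature values, threshold of 0.5, and function names are hypothetical; in practice the weights would be learned from a training set.

```python
from math import exp


def logistic_probability(weights, intercept, features):
    """Estimate p by inverting the logit: p = 1 / (1 + exp(-log_odds))."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-log_odds))


def predict_member(weights, intercept, features, threshold=0.5):
    """Predict membership if the estimate of p surpasses the threshold."""
    return logistic_probability(weights, intercept, features) > threshold


# Hypothetical learned parameters: w0 (intercept) and weights w1, w2.
w0 = -1.0
w = [2.0, -0.5]

p_zero = logistic_probability(w, w0, [0.0, 0.0])  # log odds equal w0
p_item = logistic_probability(w, w0, [1.0, 2.0])  # log odds = -1 + 2 - 1 = 0
print(p_zero, p_item, predict_member(w, w0, [2.0, 0.0]))
```

With all feature values at zero the log odds reduce to the intercept, matching the interpretation of the intercept given above, and an item whose log odds equal zero receives an estimate of p = 0.5.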

Logistic models are frequently used in medical research [194]. For example, a logistic model was developed to predict whether or not a given protein cavity is druggable [230]. In this study, a protein cavity is called druggable if it is capable of binding oral drug-like molecules. Regardless of the desired therapeutic effect, a drug-like molecule is expected to have good absorption, distribution, metabolism and excretion (commonly referred to as ADME) properties within an organism [166]. Such properties are determined by specific chemical properties of the molecule. This suggests that, in order for a particular protein cavity to bind to a drug-like molecule, the cavity should have properties that are compatible with the properties of the molecule. Thus, features describing the structural and chemical properties of a given protein cavity were used by the logistic model to predict the druggability of the cavity [230]. In addition to confirming the importance of certain cavity features in predicting druggability, this study identified the positive role of polar residues in cavities. Therefore, the transparency of the logistic model provided insight into the properties of a protein cavity that make it druggable.

Support Vector Machines

A support vector machine (SVM) classifies items by considering where the items, as points, are positioned in a geometrical space [258]. With each item described by a vector of k feature values, the goal of the learning procedure is to find the hyperplane in the k-dimensional feature space that best separates the known members and non-members of the class of interest. However, there may be classification problems where no such hyperplane exists in the k-dimensional feature space. In these cases, a kernel function (i.e. K(x, y) = φ(x) · φ(y), where x and y are points and φ is a transformation function) is used to map the item points into a higher dimensional space where it is possible to find a hyperplane that separates the known members from the non-members. Thus, if a given item lies on the same side of the hyperplane as the known members, the SVM predicts that the item is in the class of interest, and not in the class otherwise.

SVMs are attractive for systems biology because of their ability to deal with complex classification problems. Complex problems are often associated with nonlinear trends in the data. Rather than directly identify a nonlinear boundary between known members and non-members of the class of interest, the SVM learning algorithm exploits the linear separation that is achievable when the data have been mapped to a higher dimensional space. However, the learning procedure requires fitting the same number of parameters as there are items in the training set. Although the learning algorithm aims to make many of the parameters have values close to zero and thus minimize the complexity of the classifier, having such a large number of parameters to fit relative to the size of the training set makes SVMs susceptible to overfitting. Furthermore, the fitted parameters highlight the training items that are the most influential to classification predictions (i.e. the support vectors). However, depending on the choice of kernel function, it may not be possible to assess how each feature contributes to a prediction. Therefore, SVMs are advantageous in systems biology because they are well-suited for complex problems, although they are disadvantageous due to the risk of overfitting and their opaqueness.

Towards facilitating drug development, SVMs have been assessed for their utility in screening compounds in silico (e.g. [226]). One specific goal was to predict whether or not a given compound actively inhibits a specific target protein, using features that describe the parts of the compound that are expected to interact directly with the target [226]. SVMs were found to outperform three other families of classifiers at this task. These results suggest that SVMs in particular are useful for in silico compound screening.
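The kernel identity K(x, y) = φ(x) · φ(y) mentioned above can be checked numerically for one simple case. The sketch below assumes a homogeneous degree-2 polynomial kernel on 2-dimensional points, for which an explicit transformation φ into 3 dimensions is known; this kernel, the points, and the function names are illustrative assumptions and are not the kernel used in the cited work.

```python
from math import sqrt


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit map from 2-D to 3-D for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, sqrt(2.0) * x1 * x2, x2 * x2)


def kernel(x, y):
    """K(x, y) = (x . y)^2, computed without ever building phi(x)."""
    return dot(x, y) ** 2


# Two hypothetical 2-dimensional points.
x = (1.0, 2.0)
y = (3.0, -1.0)

explicit = dot(phi(x), phi(y))  # dot product in the higher dimensional space
implicit = kernel(x, y)         # same quantity via the kernel function
print(explicit, implicit)
```

Both computations give the same value, which is why a learning algorithm that only needs dot products can work in the higher dimensional space through the kernel function alone.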

Decision Trees and Random Forests

A decision tree is a rooted tree where each internal node is a so-called decision node [40]. Each decision node is associated with a logical test based on a feature value (e.g. featureValueA > 0). Descending from each decision node, there is a branch for each possible outcome of evaluating the test: typically a true branch and a false branch, although there may be a third branch for the case where the required feature value is missing. Each leaf node is associated with a class distribution and when a traversal ends at a particular leaf, the given item is predicted to be in the most frequent class of the leaf. Therefore, an item is classified by traversing the decision tree from the root to a leaf, where the traversal is dictated by the outcomes of evaluating tests with the feature values of the given item.

Decision trees are attractive for applications in systems biology for several reasons. First, it is easy to interpret how a decision tree arrives at a classification prediction (i.e. a series of tests), and this can contribute to our understanding of the underlying biology. Secondly, there are built-in mechanisms for coping with missing data. Lastly, decision trees can capture complex (e.g. nonlinear) patterns in the data, including dependencies between the features. However, decision tree learning algorithms are prone to overfitting, creating overly large trees that do not generalize well. Typically, a tree subsequently undergoes pruning, a process that aims to make the tree smaller without greatly reducing the predictive accuracy [213]. Thus, there are several advantages to using decision trees in systems biology; however, given the small training sets in the domain, the overfitting may be too disadvantageous.

Random forest classifiers aim to have the advantages of decision trees while addressing the issue of overfitting [39]. In fact, a random forest consists of multiple decision trees, and a given item is classified as the most frequent class amongst the predictions from the individual trees. Unlike standard decision trees, each tree in a random forest is trained on a random (i.e. bootstrap) sample of the training set. Furthermore, the test at each decision node involves a feature from a random subset of the features obtained during the learning procedure. Although each tree may be overfit, allowing a large number of trees to contribute to the final classification avoids overfitting overall, thus reducing problems with generalization. Therefore, random forests are also attractive for applications in systems biology.

Since a fundamental goal in systems biology and drug development is to predict how a biological system responds to a perturbation, random forests have been applied to related prediction problems. In particular, a random forest was trained to predict whether or not a given human gene is associated with a specific phenotype, and such an association suggests that the system would respond to a perturbation of the gene/protein by exhibiting the phenotype, or a related one [175]. In fact, a random forest was trained for each of 178 phenotypes, using features based on gene expression, PP interactions and gene function annotations. Taken together, the predictions suggest that 40.9% of all genes are associated with more than one phenotype. This result highlights the challenge of selecting therapeutic targets that would not be linked to many negative side effects, and also emphasizes that systems biology will likely help address this challenge.
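The traversal of a decision tree and the majority vote of a random forest can be sketched as follows. The trees here are written by hand purely for illustration (no learning, bootstrap sampling, or random feature selection is performed), each leaf stores only its most frequent class rather than a full class distribution, and all feature names, thresholds, and function names are hypothetical.

```python
from collections import Counter


def classify(tree, item):
    """Traverse from the root to a leaf, following the test outcomes."""
    node = tree
    while "label" not in node:  # internal (decision) node
        value = item.get(node["feature"])
        if value is None:
            node = node["missing"]  # third branch: feature value missing
        elif value > node["threshold"]:
            node = node["true"]
        else:
            node = node["false"]
    return node["label"]


def forest_vote(trees, item):
    """Predict the most frequent class among the individual tree predictions."""
    votes = Counter(classify(tree, item) for tree in trees)
    return votes.most_common(1)[0][0]


MEMBER, NON_MEMBER = {"label": "member"}, {"label": "non-member"}

# Hand-built hypothetical trees; a real forest would be learned from data.
tree_a = {"feature": "expression", "threshold": 0.0,
          "true": MEMBER, "false": NON_MEMBER, "missing": NON_MEMBER}
tree_b = {"feature": "n_interactions", "threshold": 3,
          "true": MEMBER, "false": NON_MEMBER, "missing": MEMBER}
tree_c = {"feature": "expression", "threshold": 1.5,
          "true": MEMBER, "missing": NON_MEMBER,
          "false": {"feature": "n_interactions", "threshold": 10,
                    "true": MEMBER, "false": NON_MEMBER,
                    "missing": NON_MEMBER}}
forest = [tree_a, tree_b, tree_c]

item = {"expression": 0.8, "n_interactions": 5}
item_missing = {"n_interactions": 1}  # expression value is missing

print(forest_vote(forest, item), forest_vote(forest, item_missing))
```

For the first item, two of the three trees vote for membership, so the forest predicts membership even though one tree disagrees; this is the sense in which the vote can dampen the idiosyncrasies of individual trees.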

Naive Bayes Classifiers and Bayesian Networks

When applied to a classification problem, a Bayesian network estimates the probability that a given item is in a particular class, given the feature values of the item [117]. A Bayesian network is a graphical model (i.e. a directed acyclic graph) that dictates how the desired probability is computed. Each node represents a random variable, and the network has a separate node for each feature variable $X_i$ and a node for the class variable $Y$. A directed edge exists between two nodes if there is a dependency between the corresponding variables. Specifically, an edge from $X_i$ to $X_j$ implies that the probability that $X_j = x_j$ depends on $X_i$. Also, the absence of an edge between two variables implies conditional independence. Furthermore, each node is associated with a probability distribution, where the probability of each value of the corresponding variable is conditional on the values of the parent node variables. These distributions enable the computation of the joint probability that $X_i = x_i$ for $i = 1, \ldots, k$, where $k$ is the number of features, conditional on the class value $y$. Thus, Bayesian networks classify an item by first using Bayes' theorem in the following form:

\[
P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)} \tag{1.2}
\]

where $X = (X_1, X_2, \ldots, X_k)$ is a vector of all feature variables, and $x = (x_1, x_2, \ldots, x_k)$ is a vector of specific feature values. In binary classification problems specifically, the theorem can be used as follows:

\[
P(Y = 1 \mid X = x) = \frac{P(X = x \mid Y = 1)\, P(Y = 1)}{P(X = x \mid Y = 1)\, P(Y = 1) + P(X = x \mid Y = 0)\, P(Y = 0)} \tag{1.3}
\]

where $Y = 1$ indicates that the item is in the class of interest, and $Y = 0$ indicates otherwise. A given item can then be predicted as a member of the class of interest if the computed $P(Y = 1 \mid X = x)$ surpasses a selected threshold.

Bayesian networks are attractive for applications in systems biology because of the clear probabilistic interpretation of how a prediction is made given the features of an item. Moreover, a Bayesian network can capture the dependencies between the features that may exist, and accounting for these dependencies should enable more accurate estimates of the probabilities that items are in the class of interest. When there is missing data, a probability distribution can still be estimated using the training examples that have values defined for the features of interest. However, for some problems in systems biology, the whole training set is already small, and subsets with the required data may be too small to accurately estimate a probability distribution. Similarly, if the structure of the Bayesian network is not already provided (e.g. based on prior knowledge), identifying the ideal structure will require a training set with sufficient data to accurately evaluate whether or not variables are conditionally independent. Furthermore, a Bayesian network requires the prior probability that any given item is in the class of interest, i.e. $P(Y = 1)$ in equation 1.3, in order to make predictions. The prior may represent the belief that an arbitrary item is in the class based on expert knowledge; however, this knowledge may not be available for the problem at hand. One could set the prior to the proportion of class members in the training set if the set is an unbiased sample from all items; however, this may not be the case. Taken together, Bayesian networks are advantageous in systems biology because they capture dependencies in the data while being interpretable; however, the accuracy with which a network captures the dependencies is limited if the training data is limited.

A naive Bayes classifier is a Bayesian network with a specific structure [116]. All the features are assumed to be conditionally independent given the class, and thus there are no edges between feature nodes. Since the probability distribution of each feature variable is only conditional on the class value, less data is required to estimate the probability distributions of naive Bayes classifiers than of Bayesian networks in general. However, the independence assumption of naive Bayes classifiers limits the accuracy of the predictions when there are in fact dependencies between the features. Therefore, naive Bayes classifiers are simple Bayesian networks that are less sensitive to missing data; however, the simplicity of these classifiers is disadvantageous when there are complex trends in the data.
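As a minimal sketch of how equation 1.3 is evaluated under the naive Bayes assumption, the following computes the posterior probability for binary features. The function name, the restriction to binary features and all probability values are illustrative assumptions rather than anything taken from a cited study.

```python
def naive_bayes_posterior(x, p_given_1, p_given_0, prior_1):
    """Return P(Y = 1 | X = x) for binary features, as in equation 1.3.

    p_given_1[i] is P(X_i = 1 | Y = 1) and p_given_0[i] is P(X_i = 1 | Y = 0).
    """
    like_1 = 1.0  # P(X = x | Y = 1), a product because features are independent given Y
    like_0 = 1.0  # P(X = x | Y = 0)
    for xi, p1, p0 in zip(x, p_given_1, p_given_0):
        like_1 *= p1 if xi else 1 - p1
        like_0 *= p0 if xi else 1 - p0
    numerator = like_1 * prior_1
    return numerator / (numerator + like_0 * (1 - prior_1))

# Hypothetical item with two binary features; only the first is present
x = [1, 0]
p_given_1 = [0.8, 0.5]
p_given_0 = [0.2, 0.5]

posterior_rare = naive_bayes_posterior(x, p_given_1, p_given_0, prior_1=0.1)
posterior_even = naive_bayes_posterior(x, p_given_1, p_given_0, prior_1=0.5)
```

With these numbers the same evidence yields a posterior of about 0.31 under a prior of 0.1 and 0.8 under a prior of 0.5, which illustrates why the choice of prior discussed above matters.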

Bayesian networks have been used to predict whether or not two given proteins are in the same protein complex, in S. cerevisiae [134]. One Bayesian network uses feature data generated by high-throughput experimental techniques for identifying complexes and PP interactions. The other Bayesian network, specifically a naive Bayes classifier, uses features derived from genomic datasets that are not directly related to complexes (e.g. gene coexpression). Although each genomic feature is weakly associated with complexes, the classifier that integrates the genomic features was found to predict membership in the same complex more accurately than the Bayesian network that uses the high-throughput data. Novel predictions of the naive Bayes classifier were validated experimentally [134]. Therefore, this study shows that integrating different types of information with a naive Bayes classifier can provide additional insight into systems biology.

Hidden Markov Models

When applied to a binary classification problem, a hidden Markov model (HMM) estimates the probability that the sequence of symbols associated with a given item is characteristic of items in the class of interest [21]. Unlike with the classifiers described above, the features of an item are encapsulated in a sequence of symbols, where each symbol belongs to a specific alphabet. An HMM consists of a set of states specified based on prior knowledge, and each state is associated with a set of transition probabilities specifying the probabilities of transitioning from the state to other states, or back to itself. A standard HMM is a first-order Markov model, and in such a model the probability of each transition only depends on the current state. Generally, in a kth-order Markov model, each transition depends on the k preceding states in a state sequence. Each state is also associated with a set of emission probabilities specifying the probabilities of emitting the symbols in the alphabet, when the state is reached. Finally, each state is associated with the probability that the sequence starts in the state. Together, the states and probabilities of an HMM capture state sequences that are characteristic of items in the class of interest. While every item is associated with a sequence of states, this sequence is not observed (i.e. it is hidden); instead we observe a sequence of symbols. However, one can compute the probability of generating a given sequence of symbols with an HMM of the class of interest [21]. The corresponding item can then be predicted as a member of the class of interest if the probability surpasses a selected threshold.

HMMs are attractive for applications in systems biology where features of biological sequences (e.g. DNA sequences) would facilitate classification. In particular, the states can represent specific features of biological sequences and the transition probabilities might capture the (potentially complex) structure underlying sequences of items in the class of interest. Thus, a learned HMM can provide insight into the sequence structure of items in the class of interest. In addition, including a symbol for missing data in the alphabet allows an HMM to handle missing data explicitly. However, the complexity of an HMM increases with the order of the Markov model, the specified number of states and the size of the specified alphabet. Although some probabilities may be known a priori, others must be learned from the training set. The larger the number of probabilities to learn, the larger the training set required for estimating the probabilities accurately. Furthermore, with HMMs it is assumed that the occurrence of a symbol (in a given sequence) is independent of the preceding symbols, given the current state. Thus, HMMs may not capture long-range dependencies in symbol sequences. Taken together, HMMs are advantageous in systems biology because they can model some of the complexity of biological sequences in a transparent manner; however, the accuracy with which the complexity is modelled is limited by the size of the training set.

HMMs have been commonly used for gene prediction [76]. Specifically, an HMM can be used to estimate the probability that a given DNA sequence contains a protein-coding gene.
For example, GENSCAN predicts such genes using an HMM with states representing functional units of a gene or genomic region, such as an exon, intron, intergenic region, etc. [46]. Thus, in addition to predicting genes, GENSCAN predicts the exon/intron structures of genes. GENSCAN has been widely used, and the predictor was subsequently extended to integrate data based on the similarity of the given sequence to sequences of known genes [276]. This integration was shown to result in improved predictive accuracy when such related sequences are available. Taken together, HMMs have played a major role in gene prediction, and therefore in the identification of the parts of biological systems.
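The probability of generating an observed symbol sequence with a first-order HMM, as described above, is commonly computed with the forward algorithm. The sketch below is a minimal illustration of that algorithm; the two state names and all probabilities are made-up toy values, not parameters of GENSCAN or of any published model.

```python
def forward_probability(symbols, states, start, trans, emit):
    """Probability that a first-order HMM generates a non-empty symbol sequence.

    start[s]    : probability that the state sequence starts in state s
    trans[r][s] : probability of transitioning from state r to state s
    emit[s][a]  : probability of emitting symbol a when in state s
    """
    # alpha[s] = probability of the symbols seen so far, with the hidden path ending in s
    alpha = {s: start[s] * emit[s][symbols[0]] for s in states}
    for sym in symbols[1:]:
        alpha = {
            s: emit[s][sym] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    # Sum over all possible final states, since the state sequence is hidden
    return sum(alpha.values())

# Toy model (hypothetical values)
states = ["exon", "intron"]
start = {"exon": 0.5, "intron": 0.5}
trans = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}
emit = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

p_single = forward_probability("A", states, start, trans, emit)
p_sequence = forward_probability("GCGCAT", states, start, trans, emit)
```

A classifier built on this quantity would compare it to a selected threshold. In practice, log probabilities or scaling are typically used to avoid numerical underflow on long sequences.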

Integration in Systems Biology

Overall, class prediction methods have been successfully applied to problems in systems biology. The systems biology knowledgebase contains various types of data that can be transformed into feature data for classifiers. Although there are established methods for coping with missing data (e.g. [228, 71]), some inherent to the classifier, the extent of missing data in systems biology motivates the integration of different types of data. Though each type of data on its own may only be weakly predictive, together the different types of data can lead to more accurate predictions. In the same vein, there is no perfect model for humans, but in some cases, one might be able to build a more complete picture of a human system by integrating data from various model organisms, rather than by relying exclusively on data from human cell cultures or a single model organism. Thus, several class prediction applications in systems biology integrate data from different organisms (e.g. [177, 280]). While the integration of different types of data can improve class predictions, understanding the way in which the data is integrated can provide insight into the underlying biology. Therefore, class prediction methods that integrate data in a transparent manner are particularly beneficial to systems biology.

1.3 Addressing Challenges in Drug Development

Our goal was to exploit the systems biology knowledgebase with machine learning techniques towards addressing challenges in drug development.

1.3.1 Therapeutic Targets and Toxicity

A major challenge in drug development is the identification of therapeutic targets. Although the binding of a compound to a therapeutic target is expected to cause therapeutic effects, ideally, the binding event would not also cause toxic side effects. Toxicity may be inherent in a target, rather than in any compound that binds to it. For example, if a putative target is a regulator of many different genes in a specific system, inhibiting the target with a drug could deregulate the activity of all dependent genes, even those that are unrelated to the disease of interest. Perturbations of unrelated gene activities represent side effects that may contribute to toxicity. Therefore, knowledge of the interactions between putative targets and other genes/proteins may help remove from consideration the targets that are prone to toxicity, thereby reducing time and money spent on dead ends. Alternatively, knowledge of the interactions of a target may focus drug development efforts on the part of the target molecule that is likely to be relevant to the disease of interest, yet irrelevant to other interactions (if such a part exists). Therefore, more comprehensive knowledge of the interactions between proteins/genes would facilitate the identification of therapeutic targets.

Proteins involved in signal transduction appear to be useful therapeutic targets since many approved drugs target these proteins [204]. For example, G-protein coupled receptors are targeted by approximately 27% of all drugs. These receptors are proteins that function at the cell membrane by receiving extracellular stimuli and subsequently initiating signaling within the cell to induce a response [9]. Vesicular trafficking is thought to regulate the level of receptor-mediated signal transduction [242]. That is, the formation of vesicles at the cell membrane causes receptors to be engulfed within the cell, rendering them unable to receive extracellular stimuli. Signaling proteins include small G proteins and their regulators [9]. A small G protein serves as a molecular switch that cycles between on and off states depending on whether the protein is bound to guanosine triphosphate (GTP) or guanosine diphosphate (GDP), respectively. Guanine-nucleotide dissociation inhibitors (GDIs) and GTPase activating proteins (GAPs) regulate these switches by promoting the off state, and guanine-nucleotide exchange factors (GEFs) regulate them by promoting the on state. Cellular signals are also relayed with protein phosphorylation, a reversible protein modification that involves the attachment of a phosphate group to the protein [9]. Kinases are proteins that carry out phosphorylation and their substrates are thus the specific proteins that they modify. The recent launch of several drugs inhibiting kinases suggests that these proteins still represent promising therapeutic targets [204].

The important cellular role of kinases can make them both advantageous and disadvantageous as therapeutic targets. Aberrant kinase activity is associated with a wide range of diseases [59], suggesting that kinases would be reasonable therapeutic targets. Indeed, SUTENT (sunitinib malate) is a drug for metastatic renal cell carcinoma and gastrointestinal stromal tumour (GIST) patients that targets multiple kinases [3]. In its last phase of clinical trials involving GIST patients, there were only 5% more patients with severe adverse effects (i.e. grade 3/4 toxicity) in the SUTENT group than in the placebo group. However, a broad range of adverse effects occurred with greater frequency in the SUTENT group. These results highlight the importance of identifying the substrates of a kinase so that one might be able to anticipate the downstream effects of targeting the kinase, and refine drug development efforts accordingly.

OBJECTIVE: The prediction of kinase-substrate interactions.

Kinase-substrate interactions are transient PP interactions that are influenced by various cellular factors. For example, the occurrence of a kinase-substrate interaction requires that the kinase and substrate are co-localized within the cell, and it may require the presence of co-factor proteins [84, 31]. As described in section 1.1.2, there is a high-throughput experimental method that is capable of systematically identifying transient PP interactions in vivo [244]. This method requires the modification of proteins to enable the detection of interactions; however, these modifications may prevent kinase-substrate interactions from occurring. Several high-throughput experimental methods have been developed to systematically identify kinase-substrate interactions specifically, based on the detection of phosphorylation events [241]. The in vivo methods of this type are advantageous because they provide additional insight by enabling the identification of the actual phosphorylation sites on the substrates (e.g. by mass spectrometric analysis of protein fragments). Although these methods can discern that a particular kinase plays a role in the phosphorylation of a particular substrate, it might be the case that it is a downstream kinase that actually phosphorylates the substrate. Thus, these in vivo methods are disadvantageous because they may not be able to unambiguously identify kinase-substrate interactions. In contrast, in vitro methods test each kinase separately and can thus unambiguously identify kinase-substrate interactions. However, since these methods do not test proteins in their endogenous environments, they can identify interactions that are not physiologically relevant.
Although in vitro methods do not identify the phosphorylation sites, the identified substrates of a kinase can be studied to search for common features that might characterize phosphorylation sites targeted by the kinase. Thus, several studies have applied machine learning techniques to this task, specifically by analyzing the protein sequences of the substrates of a particular kinase (e.g. [211, 180]). An amino acid or protein sequence, obtained by translating the DNA sequence of a gene using the genetic code, is a simple representation of the corresponding protein (i.e. the primary structure). However, a protein sequence can reveal properties (e.g. chemical properties) that govern the physical interactions of the protein (e.g. [96]). Taken together, it is challenging to identify physiologically relevant kinase-substrate interactions with current experimental techniques, yet the results of these techniques may be exploited by machine learning techniques to gain additional insight.

The characterization of phosphorylation sites can contribute to the identification of kinase-substrate interactions (e.g. [51, 188]). The sites at which substrates are phosphorylated by a particular kinase must have common features that make the substrates compatible with the active site of the kinase. Furthermore, the common features may enable the kinase to recognize its substrates in the mixture of proteins within a cell. Thus, by characterizing the phosphorylation sites targeted by a given kinase, one may be able to predict whether or not a given protein has a phosphorylation site of the kinase, and thus predict kinase-substrate interactions (e.g. [51, 188]). However, sequence motifs characterizing phosphorylation sites are highly degenerate [233]. As a result, many proteins have an occurrence of a motif simply by chance, and many predicted kinase-substrate interactions would therefore be false positives. Thus, the occurrence of a phosphorylation motif seems necessary but insufficient to accurately predict the substrates of a particular kinase.

Here we consider other common features of the substrates of a given kinase, in addition to phosphorylation site characteristics, towards the accurate prediction of kinase-substrate interactions. Although each feature may only be weakly associated with the substrates, the integration of these features may permit greater accuracy than any feature alone.
Chapter 2 describes our method for predicting kinase-substrate interactions with proof-of-principle results obtained from yeast. Supplementary information on this research can be found in chapter 3. This work may contribute to our knowledge of the roles of specific kinases in a biological system, thus enabling more informed evaluations of these proteins as therapeutic targets.

1.3.2 Therapeutic Targets for Monogenic Diseases

Trends of approved drugs highlight the challenge of identifying novel therapeutic targets. Currently, the total number of therapeutic targets linked to approved drugs is considerably smaller than the total number of drugs (i.e. approximately 350 therapeutic targets versus approximately 1,500 approved drugs) [204, 281]. In fact, over 50% of all drugs (approved as of December 2005) target class I G-protein coupled receptors, nuclear receptors, ligand-gated ion channels and voltage-gated ion channels [204]. The small number and lack of diversity of therapeutic targets suggest that the difficulty in identifying novel targets has driven drug development efforts towards revisiting validated targets. It is unlikely that these classic targets are applicable to all diseases, and there are many diseases (e.g. rare diseases) for which there are no therapeutic targets in the drug development pipeline. However, the identification of targets for currently untreated diseases typically requires a certain amount of groundwork, to obtain a better understanding of the molecular mechanisms of the diseases, before potential targets can be suggested. The time required to establish this groundwork may in part be responsible for the current lag in the release of new drugs.

Monogenic diseases are caused by mutations in a single gene and although each disease may be rare, millions of people worldwide suffer from this class of diseases [203]. Here we consider a strategy for identifying potential therapeutic targets for monogenic diseases in particular. Many biological mechanisms depend on a state of signaling homeostasis maintained by the appropriate integration of the synergistic and antagonistic activities of signaling genes [231]. Accordingly, the symptoms of numerous diseases result from genetic mutations that disrupt this homeostasis [192, 152, 245, 66].
However, disruptions caused by loss-of-function mutations in a particular gene may be compensated by concomitant perturbations in genes with antagonistic activities. As stated in section 1.1.2, antagonisms and synergisms between genes can be identified via genetic interactions. Thus, disease symptoms caused by mutations in a given gene may be compensated by perturbing genetic interactors of the gene. That is, the genetic interactors are potential therapeutic targets. Therefore, we consider the strategy of identifying genetic interactions for the identification of therapeutic targets, towards facilitating the development of drugs for monogenic diseases.

The strategy of identifying genetic interactions for the identification of therapeutic targets is potentially very powerful. As described in section 1.1.2, the genetic resources of various organisms enable genome-wide searches for the genetic interactors of a gene of interest (i.e. for this application, the gene mutated in the monogenic disease of interest). Such unbiased searches might identify good therapeutic targets that are unrelated to what is currently known about the disease of interest at the molecular level, and thus might otherwise take a long time to identify with a strategy of pursuing hypotheses stemming from the current knowledgebase. Therefore, the strategy of identifying genetic interactions for the identification of therapeutic targets is potentially more efficient than other strategies.

OBJECTIVE: The prediction of genetic interactions in C. elegans.

Despite several efforts to systematically identify genetic interactions in model organisms, the difficulty of the task with current technologies has delayed the comprehensive mapping of such interactions (see section 1.1.2). In particular, a comprehensive map has yet to be obtained for an animal, from experiments that involve organism-level phenotypes rather than potentially less relevant phenotypes observed from cell cultures. However, the difficulty of the required experiments has motivated the application of machine learning techniques to predict genetic interactions, since the predictions can help focus experimental efforts on the genes that are most likely to interact [55, 157, 280]. For example, a few tools predict genetic interactions in C. elegans, and the tools integrate features of gene pairs that are computed from existing systems-based datasets of this well-studied animal model. However, the systems-based datasets can be further integrated, and we thus explored whether additional integration improves the accuracy of prediction. In chapter 4 we describe our method for predicting genetic interactions in C. elegans towards the identification of therapeutic targets for human monogenic diseases. Chapter 5 contains supplementary information on this research.

1.3.3 Combinatorial Therapies

In recent decades, drug development has focused on a single-target, single-drug paradigm [28]. It was thought that the binding of a compound to a single therapeutic target with high specificity would have minimal toxic side effects. However, even a strong affinity between a compound and a single target does not always translate into therapeutic efficacy. Some diseases such as cancer may develop resistance to a given compound by acquiring a mutation in the associated target [284]. For example, the mutation may interfere with the binding of the compound to the target, or cause the overexpression of the target to the extent where a sufficient quantity of target molecules remains unhindered by compound binding. In other cases, the disease may be complex and thus require a more complex therapeutic involving multiple targets [284]. Therefore, in the development of drugs for certain diseases, the failures of compounds during clinical trials may be due to the fact that there is only one therapeutic target when multiple targets would be more appropriate.

Multi-target therapies have several advantages [284]. It is less likely for cells to acquire mutations in multiple targets simultaneously, and thus, multi-target therapies are less prone to the development of drug resistance. Although a multi-target therapy may consist of a single compound, alternatively, such a therapy may consist of multiple compounds and thus be referred to as a combinatorial therapy. One of the major benefits of combinatorial therapies is the potential for synergistic effects: that is, the overall therapeutic efficacy of a compound combination is greater than the sum of the effects of the compounds individually. In particular, synergy between the constituent compounds indicates that an equal or greater level of efficacy can be achieved with dosages that are lower than the dosage required of a single compound, and reduced toxicity may also be achieved as a result [90]. These advantages have driven drug discovery efforts towards the search for combinatorial therapies [202, 284, 90, 38], yet an exhaustive experiment-based search of combinatorial chemical space for synergies would be costly and inefficient.
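The definition of synergy given above (a combined effect greater than the sum of the individual effects) can be written down directly. The following is only a sketch of that additive criterion, with hypothetical effect values expressed as fractions of growth inhibition; other reference models, such as Bliss independence or Loewe additivity, define the expected combined effect differently.

```python
def is_synergistic(effect_a, effect_b, effect_combined):
    """Additive criterion: the combination exceeds the sum of the individual effects."""
    return effect_combined > effect_a + effect_b

# Hypothetical measurements (fraction of growth inhibited)
weak_pair = is_synergistic(0.10, 0.15, 0.60)      # 0.60 exceeds 0.10 + 0.15
additive_pair = is_synergistic(0.30, 0.30, 0.50)  # 0.50 does not exceed 0.30 + 0.30
```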

OBJECTIVE: The prediction of chemical synergies.

There are few methods to aid the discovery of chemical synergies for combinatorial therapies. The testing of hypotheses derived from expert knowledge of existing drugs is the main method by which current combinatorial therapies were discovered [284]. However, a molecular-level model of a human system may be used to predict the response of the system to multiple perturbations, and therefore predict the therapeutic benefit of treatment with a chemical combination (e.g. [190]). Generating such a model of a particular system requires knowledge of the relevant genes and the interactions between these genes, thus limiting the general applicability of this method for predicting chemical synergies. Although interaction maps are currently incomplete, available chemogenomic profiles capture the responses of yeast systems to various compounds, and we aimed to exploit these profiles. In chapter 6 we describe our method for predicting chemical synergies for combinatorial therapies with proof-of-principle results in the antifungal domain. Chapter 7 contains supplementary information on this research.

In summary, chapters 2 and 4 describe methods for guiding the identification of good therapeutic targets in general and for monogenic diseases, respectively, and chapter 6 describes our method for predicting chemical synergies for the development of combinatorial therapies. With these methods we aimed to further integrate systems biology into the drug development process with computational tools that address challenges of innovative drug development and also accelerate the process.

CHAPTER 2
A Biochemical Genomics Screen for Substrates of Ste20p Kinase Enables the in silico Prediction of Novel Substrates

Robert B Annan*1, Anna Y Lee*2,3, Ian D Reid2, Azin Sayad2, Malcolm Whiteway4, Michael Hallett2,3,5 and David Y Thomas1

*these authors contributed equally to this work

Originally published in PLoS ONE 4: e8279 (2009) under the terms of the Creative Commons Attribution-Generic 2.5 License.

2.1 Preface

The important role of kinases in several diseases suggests that they represent a good starting point in the search for therapeutic targets. One way to investigate the consequences of targeting a kinase is to identify its substrates. Although there are various experimental methods that can identify kinase-substrate interactions, each has its own shortcoming. A standard method for predicting the substrates of a given kinase first involves characterizing the phosphorylation sites on known substrates with a sequence motif, and then predicting that a given protein is a substrate if its sequence contains occurrences of the motif. However, phosphorylation site sequences can be highly variable, resulting in highly degenerate motifs, or no motif at all that can sufficiently distinguish substrates from non-substrates. That is, a phosphorylation sequence motif is inadequate for predicting the substrates of certain kinases. We thus developed a method for predicting the substrates of a given kinase that takes into consideration other sequence features that may help distinguish substrates from non-substrates. Our proof-of-principle work predicts the substrates of the yeast Ste20p kinase, a kinase that did not have a characterized phosphorylation motif prior to the start of our project.

1 Department of Biochemistry, McGill University, Montréal, Québec, Canada
2 McGill Centre for Bioinformatics, McGill University, Montréal, Québec, Canada
3 School of Computer Science, McGill University, Montréal, Québec, Canada
4 Biotechnology Research Institute, National Research Council, Montréal, Québec, Canada
5 Rosalind and Morris Goodman Cancer Centre, McGill University, Montréal, Québec, Canada

2.1.1 Contributions of Authors

Robert B Annan conducted the biochemical genomics screen to identify Ste20p substrates experimentally, made suggestions for the design of the predictor and the analysis of the resulting predictions, performed the experiments to validate predictions, and contributed to the writing of the manuscript. Anna Y Lee developed the predictor, analyzed the resulting predictions and contributed to the writing of the manuscript. Ian D Reid and Azin Sayad contributed to the design (of different aspects) of the predictor. Malcolm Whiteway and David Y Thomas contributed to the design of the biochemical experiments and made suggestions for the development of the predictor. Michael Hallett contributed to the design of the predictor and made suggestions for the analysis of the resulting predictions. All authors contributed to the revising of the manuscript.

2.2 Abstract

The Ste20/PAK family is involved in many cellular processes, including the regulation of actin-based cytoskeletal dynamics and the activation of MAPK signaling pathways. Despite its numerous roles, few of its substrates have been identified. To better characterize the roles of the yeast Ste20p kinase, we developed an in vitro biochemical genomics screen to identify its substrates. When applied to 539 purified yeast proteins, the screen reported 14 targets of Ste20p phosphorylation. We used the data resulting from our screen to build an in silico predictor to identify Ste20p substrates on a proteome-wide basis. Since kinase-substrate specificity is often mediated by additional binding events at sites distal to the phosphorylation site, the predictor uses the presence/absence of multiple sequence motifs to evaluate potential substrates. Statistical validation estimates a three-fold improvement in substrate recovery over random predictions, despite the lack of a single dominant motif that can characterize Ste20p phosphorylation. The set of predicted substrates significantly overrepresents elements of the genetic and physical interaction networks surrounding Ste20p, suggesting that some of the predicted substrates are in vivo targets. We validated this combined experimental and computational approach for identifying kinase substrates by confirming the in vitro phosphorylation of polarisome components Bni1p and Bud6p, thus suggesting a mechanism by which Ste20p effects polarized growth.

2.3 Introduction

Protein phosphorylation is a central post-translational modification in signal transduction; underscoring its importance is the observation that roughly 2% of eukaryotic genes encode kinases, and roughly one-third of all intracellular proteins may be phosphorylated on at least one residue [112, 58, 285]. However, given the large number of possible substrates for each of the many protein kinases, it is not surprising that the identification of kinase-substrate relationships remains a daunting challenge.

Our knowledge of kinase-substrate relationships has expanded through approaches that detect in vitro phosphorylation or in vivo phosphoproteins. In vitro methods include the use of purified kinases and substrates, kinases engineered to use ATP analogues, or phage display libraries [235, 253, 197, 240]; peptide microarrays have been used to perform in vitro screens for kinase substrates in a high-throughput manner [211, 283]. In vivo methods include the use of mass spectrometry to generate large-scale profiles of cellular phosphoproteins (reviewed in [169, 184]). Recent studies have combined these approaches with the examination of evolutionary conservation and interaction networks to better understand kinase-substrate relationships [23]. While such approaches have expanded our knowledge of kinase-substrate relationships, it is clear that many remain unidentified or uncharacterized by current methods.

Bioinformatics techniques are increasingly employed to help identify kinase-substrate relationships, usually by characterizing the phosphorylation motif (i.e. describing the sequence at the site of phosphorylation) associated with a given kinase. For example, the amino acid preferences at the phosphorylation sites of a given kinase can be determined with a peptide library screen (e.g. [217]) and encoded in a position-specific scoring matrix (PSSM). Alternatively, it may be possible to characterize the phosphorylation motif with a regular expression (e.g. [ST]P.[RK] for CDK). Thus, any protein can be assessed for the likelihood that it can be phosphorylated by a specific kinase, based on whether its sequence possesses a likely phosphorylation site of the kinase. Since phosphorylation motifs are often highly degenerate, approaches based on these motifs rely on sophisticated machine learning techniques to increase the accuracy of substrate prediction; nevertheless, these approaches have proven most effective for the few kinases with the least degenerate motifs [270, 126, 234, 275, 227]. Recent approaches take into consideration repeat occurrences of a phosphorylation motif within a protein sequence.
Modeling the propensity for such clusters of phosphorylation motifs results in improved accuracy for the prediction of CDK substrates, for instance [51, 188]. These studies raise the question of whether considering the co-occurrence of different motifs will also result in more accurate prediction of substrates for kinases with degenerate phosphorylation motifs.
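
As a minimal illustration of how a regular-expression motif of this kind can be applied to a sequence, the following Python sketch reports every occurrence of a motif, including overlapping ones. The function name find_motif and the example peptide are hypothetical and are not taken from this study; only the CDK pattern [ST]P.[RK] comes from the text above.

```python
import re
from typing import List


def find_motif(sequence: str, motif: str) -> List[int]:
    """Return the 0-based start positions of all motif occurrences.

    The motif is wrapped in a zero-width lookahead so that overlapping
    occurrences are reported as well.
    """
    pattern = re.compile(f"(?=({motif}))")
    return [match.start() for match in pattern.finditer(sequence)]


# Hypothetical peptide, used only to exercise the function.
peptide = "MASPEKTPLRSPAK"
cdk_sites = find_motif(peptide, "[ST]P.[RK]")
print(cdk_sites)  # start positions of SPEK, TPLR and SPAK
```

Counting occurrences per protein in this way gives the kind of input on which cluster-based approaches can build; the lookahead is a design choice that keeps overlapping occurrences from being silently dropped.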

Even in cases where phosphorylation motifs can be readily identified in putative substrates, the motifs do not generally provide sufficient information to unambiguously identify the physiologically relevant kinase. It has been recognized that sequence features that are independent of the actual phosphorylation site are often crucial for the phosphorylation of a substrate, including features that enable binding of the substrate to the regulatory domain of the kinase, binding of kinase and substrate to the same scaffold protein, or co-localization in the cell of kinase and substrate due to independent interactions (reviewed in [282, 84, 148, 208, 31]). Moreover, it has been recognized that kinases often bind substrates at a second site, distal to the active site, and that these docking interactions are largely responsible for kinase-substrate specificity [216]. These findings suggest that a predictor that takes into consideration such distal motifs, in addition to putative phosphorylation motifs, could produce more accurate predictions of kinase-substrate relationships.

The Saccharomyces cerevisiae Ste20p kinase (SGDID:S000000999) is the founding member and prototype of the Ste20/PAK family, a large family of kinases ubiquitous in the genomes of all eukaryotes (for reviews see [36, 69]). Ste20p was first described as an activator of the yeast pheromone response MAPK cascade, and was subsequently also shown to activate the MAPK cascades responsible for pseudohyphal growth and the high-osmolarity glycerol (HOG) response [214, 187, 155, 215, 273]. Ste20p also regulates other physiological processes, such as actin cytoskeleton organization and polarized morphogenesis [80, 123, 160], mitotic exit [56], and hydrogen peroxide-induced apoptosis [5]. Furthermore, Ste20p shares an undefined essential role with its homolog Cla4p (SGDID:S000005242), as ste20Δ cla4Δ mutants are not viable [65].
Despite the breadth of knowledge about the cellular roles of Ste20p, only a few of its substrates have been identified. In addition to its phosphorylation of Ste11p (SGDID:S000004354) in the activation of MAPK pathways [273], Ste20p phosphorylates type I myosins Myo3p (SGDID:S000001612) and Myo5p (SGDID:S000004715) to promote actin polarization [272, 156], Cdc10p (SGDID:S000000595), albeit less efficiently than Cla4p [261], and the histone H2B (SGDID:S000000098) [5]. Given the still-unidentified essential function it shares with Cla4p and the number of cellular processes in which it participates, it is likely that physiologically relevant substrates of Ste20p remain to be identified.

Our goal was to develop a method to facilitate the discovery of kinase-substrate relationships. Taken together, existing studies suggest that an in silico approach based on sequence motifs, characterizing phosphorylation sites and distal sites, may be useful for predicting the substrates of kinases that have not been well-characterized by existing methods. We used such an approach to identify substrates of the yeast Ste20p kinase.

2.4 Results

2.4.1 A biochemical genomics screen identifies in vitro substrates of the Ste20p kinase

To develop an in silico tool for identifying substrates of the yeast Ste20p kinase, our first step was to generate a learning set of positive and negative examples of substrates. We designed a biochemical genomics screen to identify the in vitro substrates of kinases. This approach was applied to 539 yeast proteins coded by essential genes to identify substrates of Ste20p (see Materials and Methods). Essential genes were chosen to identify potential substrates responsible for the shared essential function of STE20 and its homologue CLA4 [65]. Individual clones expressing GST-fusion constructs under the control of the inducible GAL1 promoter [283] of each of the designated proteins were grown under non-inducing conditions until mid-log phase. These were then combined in pools of eight, induced for three hours by the addition of galactose, and immobilized on glutathione-Sepharose beads. The combined pools of purified proteins bound to beads were incubated in each of two solutions: a solution containing the kinase domain of Ste20p, expressed and purified from E. coli, with necessary cofactors and γ-[32P]-ATP, and a control solution lacking Ste20p kinase. After thirty minutes, the samples were boiled in sample loading buffer and separated by SDS-PAGE, with subsequent visualization of phosphorylation by autoradiography. Where phosphorylation was observed, stepwise deconvolution confirmed the phosphorylation and identified the phosphorylated proteins.

Among the 539 proteins screened, 14 (2.6%) were reproducible in vitro substrates of Ste20p (Table 2–1). These were subsequently screened in vitro with Cla4p; as shown in Table 2–1, 10 of the 14 Ste20p substrates were also phosphorylated by Cla4p. Since Ste20p and Cla4p are known to share an uncharacterized essential function in yeast, suggesting that they share common targets, it is not surprising that several substrates are phosphorylated by both kinases. Nonetheless, Ste20p also exhibits specificity, as four of the 14 Ste20p substrates were not phosphorylated by Cla4p.

2.4.2 An in silico approach to the identification of Ste20p substrates

Our goal was to build a Ste20p substrate predictor that considers sequence features of phosphorylation sites and distal sites. In a naive Bayes classifier, we thus integrated previously-defined PSSMs characterizing the phosphorylation sites of human Ste20p-related kinases [217] with motifs that characterize the substrates of Ste20p identified in our biochemical screen. For the latter component, we identified short sequence motifs, characterized with regular expressions, that are enriched in the set of Ste20p in vitro substrates (i.e. the positive learning set) relative to the screened set of proteins that were found not to be substrates (i.e. the negative learning set). Five substrates from the literature that had not been included in our initial screen (Htb2p, Myo5p, Myo3p, Ste11p, and Cdc10p [261, 5, 272, 273]) were added to the positive learning set. We enumerated the motifs that occur in multiple members of the positive learning set (via the pattern identification algorithm Teiresias [220]), specifically within regions of the protein sequences that were predicted to be exposed (according to ACCpro 4.0 [53]), and these motifs were evaluated for inclusion in a Ste20p substrate predictor.

Table 2–1: Hits from the in vitro screen for Ste20p substrates. *Indicates whether the gene product is phosphorylated by the kinase of the column (Y = yes, N = no).

Gene   Ste20p*  Cla4p*  Function/Process                                                Predictor Score
ALY2   Y        Y       interacts with CDK Pcl7p, unknown function                      1.00
BMS1   Y        N       GTPase involved in ribosome biogenesis                          1.00
CDC3   Y        Y       Septin                                                          1.00
COG4   Y        Y       member of the Golgi complex involved in transport               0.99
PEM1   Y        N       phosphoacetyl-glucosamine mutase                                0.99
RAD53  Y        Y       DNA damage checkpoint kinase                                    1.00
RPT5   Y        Y       26S proteasome regulatory subunit                               1.00
RSC6   Y        N       component of the RSC chromatin remodeling complex               0.28
RSC8   Y        Y       component of the RSC chromatin remodeling complex               0.03
SGV1   Y        Y       nuclear cyclin-dependent kinase                                 1.00
SPB1   Y        Y       methyltransferase involved in ribosome biogenesis               1.00
SPT16  Y        Y       component of FACT complex involved in transcription elongation  0.03
UTP5   Y        Y       member of the SSU processome involved in ribosome biogenesis    0.03
UTP7   Y        N       member of the SSU processome involved in ribosome biogenesis    1.00

Functionally important motifs are expected to be evolutionarily conserved. To bias our approach towards motifs that appear functionally important, we assigned each motif occurrence a weight ranging from 0 to 1 reflecting the degree of conservation across Saccharomyces species (w_{t,m,i} in Figure 2–1; see Materials and Methods). The total weight of a motif for a given protein (w_{t,m} = Σ_i w_{t,m,i} in Figure 2–1) is thus an estimate of the number of functionally important motif occurrences in the protein sequence. For each motif, we used its corresponding weights across the positive and negative learning sets to calculate a selectivity ratio that measures how frequently the motif occurs in the positive learning set as compared to how frequently it occurs in the negative learning set (see Materials and Methods). As such, a selectivity ratio greater than one indicates that the motif is more prevalent in the positive set than in the negative set.

Figure 2–1: Examples for computing the total weight of a motif m for a given protein sequence t (w_{t,m}). The sequence for Htb2p has occurrences of two different motifs used by the predictor: A...P[AG] and A[KR]H. There are three occurrences of the first motif in the sequence, and these occurrences overlap. The weight incorporates the conservation of the motif occurrences in other Saccharomyces species. Htb2p has an identified orthologue in only one other species in this genus: S. mikatae. Abbreviations: S. cer = S. cerevisiae; S. mik = S. mikatae.

The 23 motifs with a selectivity ratio of at least 10 (Table 2–2) were integrated into the predictor such that occurrences of any of the motifs within a given amino acid sequence contribute to the belief that the sequence encodes a Ste20p substrate. Essentially, the predictor takes any peptide/protein sequence and returns the posterior probability that it represents a Ste20p substrate. We experimented with different parameter values to balance the trade-off between the sensitivity and specificity of the predictor (see Materials and Methods).

The accuracy of the predictor was tested in silico via a modified version of leave-one-out cross-validation (see Materials and Methods). The predictor is approximately as accurate as a variant of the predictor that only uses the 23 motifs identified in this study (see Figure 3–1 and Supplementary Text). For simplicity, we thus used the variant as our definitive Ste20p substrate predictor in subsequent analyses. The predictor has an estimated false positive rate of 11% and false negative rate of 74% if a protein is predicted as a substrate with a posterior probability of at least 0.9 (Figure 3–1). Moreover, the frequency with which we expect to identify true substrates within a set of predictions is 8% (i.e. the positive predictive value). This is roughly a three-fold enrichment over the frequency of experimentally verified Ste20p substrates among the initial selection of proteins in this study, and a five-fold enrichment over the frequency observed for an in vitro screen from a previous study [283].
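
The way in which motif occurrences can contribute to a posterior probability may be sketched with a naive Bayes calculation over the presence or absence of each motif. This is an illustrative sketch only: the function name, the prior and the per-motif probabilities below are hypothetical values chosen for the example, and the actual parameterization of the predictor is the one described in Materials and Methods.

```python
import math
from typing import Dict, Set


def posterior_substrate(
    present: Set[str],
    p_given_substrate: Dict[str, float],
    p_given_nonsubstrate: Dict[str, float],
    prior: float,
) -> float:
    """Posterior P(substrate | motif presence/absence) under naive Bayes.

    Each motif is treated as an independent binary feature. Sums of
    logarithms are used instead of products for numerical stability.
    """
    log_pos = math.log(prior)
    log_neg = math.log(1.0 - prior)
    for motif, p_pos in p_given_substrate.items():
        p_neg = p_given_nonsubstrate[motif]
        if motif in present:
            log_pos += math.log(p_pos)
            log_neg += math.log(p_neg)
        else:
            log_pos += math.log(1.0 - p_pos)
            log_neg += math.log(1.0 - p_neg)
    # Normalize the two joint log-probabilities into a posterior.
    shift = max(log_pos, log_neg)
    numerator = math.exp(log_pos - shift)
    return numerator / (numerator + math.exp(log_neg - shift))


# Hypothetical parameters (NOT the values estimated in this study).
p_sub = {"AQR": 0.40, "RDA": 0.30}
p_non = {"AQR": 0.03, "RDA": 0.03}
print(posterior_substrate({"AQR", "RDA"}, p_sub, p_non, prior=0.03))
print(posterior_substrate(set(), p_sub, p_non, prior=0.03))
```

With these made-up numbers, a sequence carrying both motifs receives a much higher posterior than one carrying neither, which is the qualitative behaviour described above: no single motif decides the outcome, but several selective motifs together can.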

Table 2–2: Motifs used in the construction of the Ste20p substrate predictor. *The frequency of motif occurrences in the positive/negative learning set, where each occurrence is weighted by its conservation across Saccharomyces species.

Motif       Frequency in    Frequency in    Selectivity
            Positive Set*   Negative Set*   Ratio
K.H.V       0.2421          0.0141          17.18
KG..R       0.3596          0.0217          16.56
H[AG]..R    0.2158          0.0136          15.92
[ST]V.H     0.2982          0.0192          15.55
A...PG      0.3842          0.0282          13.64
AQR         0.4211          0.0310          13.58
[KR]...HR   0.2711          0.0208          13.04
K..HS       0.3298          0.0261          12.65
N.[KR]..H   0.3956          0.0322          12.29
P.G.Q       0.2149          0.0181          11.86
Q.DP        0.3289          0.0277          11.86
A..PP       0.2939          0.0248          11.85
E.C..[KR]   0.2237          0.0199          11.22
PG...S      0.3667          0.0332          11.05
G.NF        0.2404          0.0218          11.02
PT..Y       0.2982          0.0275          10.86
I..T.H      0.1711          0.0158          10.84
R.S..H      0.2807          0.0261          10.76
A[KR]H      0.2263          0.0212          10.67
G.K.P       0.2035          0.0197          10.32
A...P[AG]   0.5860          0.0569          10.29
K[AG]..R    0.5877          0.0572          10.27
RDA         0.2895          0.0283          10.23
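
The selectivity ratios in Table 2–2 appear to equal the ratio of the two weighted frequencies, which the short Python sketch below checks for a few rows. The helper names are hypothetical, and the exact definition (given in Materials and Methods) may include details not shown here; small discrepancies are expected because the tabulated frequencies are rounded.

```python
from typing import Iterable


def total_weight(occurrence_weights: Iterable[float]) -> float:
    """w_{t,m}: sum of the conservation weights (each between 0 and 1)
    of all occurrences of motif m in protein t."""
    return sum(occurrence_weights)


def selectivity_ratio(freq_pos: float, freq_neg: float) -> float:
    """Weighted frequency in the positive set divided by that in the
    negative set; values above 1 favour the positive set."""
    return freq_pos / freq_neg


# (frequency in positive set, frequency in negative set, reported ratio),
# copied from three rows of Table 2-2.
TABLE_ROWS = {
    "K.H.V": (0.2421, 0.0141, 17.18),
    "AQR": (0.4211, 0.0310, 13.58),
    "RDA": (0.2895, 0.0283, 10.23),
}

for motif, (f_pos, f_neg, reported) in TABLE_ROWS.items():
    recomputed = selectivity_ratio(f_pos, f_neg)
    print(f"{motif}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```

The inclusion threshold used in the text corresponds to keeping only motifs whose ratio is at least 10.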

2.4.3 Application of the predictor to the yeast proteome

We applied the predictor to the yeast proteome (6,696 proteins considered) and each yeast protein was ascribed a posterior probability that it is an in vitro substrate of Ste20p (Table 3-S1). In total, 753 proteins (11.3%) were assigned a probability greater than 0.9, and 5,050 proteins (75.4%) a probability below 0.05. Amongst the 14 substrates identified in the initial biochemical genomics screen, ten scored with probabilities above 0.9 and three scored below 0.05 (Table 3-S1). Previously described Ste20p substrates Ste11p, Myo3p, and Myo5p were all assigned a probability of 1.0, Cdc10p was assigned a probability of 0.86, and the histone Htb2p was assigned a probability of 0.035.

We performed pathway analysis on the predicted Ste20p substrates using the Gene Ontology (GO) [17]. The significantly overrepresented categories (adjusted P ≤ 0.05) amongst the predicted substrates (score ≥ 0.9) are shown in Figure 2–2, Table 3–1, Table 3–2 and Table 3–3. Encouragingly, the cellular components and biological processes that are overrepresented overlap with the established role of Ste20p in budding and morphogenesis at sites of polarized growth, including the bud tip. Furthermore, the role of Ste20p as a component of several signaling cascades is reflected in the overrepresentation of proteins related to protein kinase activity amongst the predicted substrates. Thus, the biological relevance of our predictor is supported by the presence of a significantly high number of predicted substrates in processes/pathways related to established roles of Ste20p.

2.4.4 Genetic and physical networks suggest in vivo relevance of predicted substrates

We reasoned that, since a kinase and a given substrate act in the same pathway, genes that genetically interact with STE20 may also interact with the genes encoding in vivo substrates of Ste20p. Analogously, we also reasoned that proteins which form physical interactions with Ste20p binding partners (i.e. proteins in the neighborhoods of Ste20p physical interactors) are more likely to be accessible as substrates for Ste20p given that kinases and their cognate substrates often assemble in macromolecular complexes. To this end, we investigated whether any overlap exists between the network neighborhoods of STE20 genetic and physical interactors and the set of predicted substrates (Figure 2–3A).

First, the predicted substrates were examined in the context of genetic interactions. A genetic interaction reflects a functional relationship between two genes determined by the level of some phenotype observed in the double mutant compared to the levels observed in the single mutants. We defined a network wherein each yeast gene is represented by a node and an edge is created between two genes if they are known to genetically interact (see Materials and Methods). For any node, the Genetic Interaction Neighborhood (GIN) is the set of nodes connected to it by an edge. The GIN of every gene in the network was tested for significant overlap with the set of predicted Ste20p substrates (Figure 2–3A). Indeed, the GINs of genes that genetically interact with STE20 tend to overlap more significantly with the predicted substrates compared to the GINs of all other genes in the network (Figure 2–3B; P = 5.45 × 10^−4, Mann-Whitney test, see Materials and Methods). Of the 42 published STE20 genetic interactors, five of the corresponding GINs significantly overlap the set of predicted substrates (adjusted P ≤ 0.05, hypergeometric test; Table 3–4). Given the expectation that in vivo substrates of Ste20p should share Ste20p’s genetic interactions, the observed overlap between the predicted substrate set and the GINs of Ste20p’s genetic interactors suggests that the in silico predictor exhibits in vivo relevance.

We next performed an analogous analysis based on physical interactions.
Here we say a physical interaction exists between two proteins if they have been shown to physically interact directly or indirectly as part of the same complex. We thus defined a network where an edge was placed between two proteins if they are known to physically interact (see Materials and Methods). For any node, the Physical Interaction Neighborhood (PIN) is the set of nodes connected to it by an edge. The Saccharomyces Genome Database [209] contains 96 published physical interactors for Ste20p, although substrates identified in previous high-throughput studies (such as [211]) were excluded to avoid bias or redundancy. The physical interactors examined include, for example, the scaffold proteins for Ste20p-related signaling complexes such as Bem1p (SGDID:S000000404) and Cdc24p (SGDID:S000000039).

Although only one of the Ste20p PINs (i.e. the neighborhood of Cdc28p (SGDID:S000000364)) significantly overlaps the set of predicted substrates (adjusted P ≤ 0.05, Fisher’s exact test; Table 3–5), the 29 PINs tend to have lower overlap P values than the PINs of the other proteins in the network (Figure 2–3C; P = 8.29 × 10^−4, Mann-Whitney test, see Materials and Methods). In fact, as is the case with genetic interactions, the neighborhoods of the majority of proteins have no overlap with the set of predicted substrates, whereas 27 out of the 29 Ste20p PINs have some overlap (i.e. a significant number with P = 1.03 × 10^−4, hypergeometric test). By combining the in silico predictor with GIN and PIN analyses, it becomes possible to focus on physiologically relevant potential substrates based on their additional biological connections to Ste20p.

In order to better characterize the physiological relevance of the neighborhood approach, we clustered the genetic and physical interactors of STE20 with respect to the predicted substrates in their neighborhoods (Figures 2–4 and 2–5, see Figures 3–2 and 3–3 for the statistical significances of the clusters). In other words, STE20 interactors with many common predicted substrates in their respective neighborhoods are likely to co-cluster.
The clustering of the genetic interactors of STE20 depicted in Figure 2–4 identifies one large cluster that includes genes coding for proteins involved in cell-cycle progression and polarized growth such as CDC28, SWE1 (SGDID:S000003723), CLA4, CDC42 (SGDID:S000004219), and RAS2 (SGDID:S000005042) [Approximately Unbiased (AU) score = 94 as shown in Figure 3–2A, see Materials and Methods]. Moreover, most of the genes in this cluster interact with a set of predicted substrates that also include polarity- and cell-cycle-associated genes such as CDC5 (SGDID:S000004603), LTE1 (SGDID:S000000022), AXL2 (SGDID:S000001402), and MSB1 (SGDID:S000005714) (highlighted in Figure 2–4). Also included in this list are three of the four known components of the polarisome (BNI1 (SGDID:S000005215), SPA2 (SGDID:S000003944), and BUD6 (SGDID:S000004311)), whose activation has been linked genetically to STE20 [104].

This analysis of the Ste20p physical interactors also highlights a tight clustering of proteins related to polarized growth (Figure 2–5, Figure 3–3). Cdc42p, its guanine nucleotide exchange factor (GEF) Cdc24p, and the scaffold Bem1p are Ste20p interactors that co-cluster. These overlap with a significant cluster of predicted substrates (AU score = 94 as shown in Figure 3–3B) including the polarity proteins Boi1p (SGDID:S00000018) and Boi2p (SGDID:S000000916), suggesting that these may serve as physiologically relevant substrates of Ste20p. Polarisome components Bud6p and Spa2p cluster together with the kinase Ptk2p (SGDID:S000003820), although Bni1p is clustered with another set of actin-associated proteins including Bbc1p (SGDID:S000003557) and Las17p (SGDID:S000005707). Thus, by combining the predicted biochemical relationships between Ste20p kinase and potential substrates with the known relationships of genetic and physical interactors of STE20, it becomes possible to identify novel roles for Ste20p phosphorylation in vivo.
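
The neighborhood overlap test used in this section can be illustrated with an upper-tail hypergeometric probability. The sketch below is a simplified stand-in under stated assumptions: the function names are hypothetical, the universe is taken to be all proteins considered, and no multiple-testing adjustment is applied, whereas the P values reported above are adjusted (see Materials and Methods for the procedure actually used).

```python
from math import comb
from typing import Set


def hypergeom_upper_tail(k: int, universe: int, marked: int, drawn: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(universe, marked, drawn).

    universe: total number of genes/proteins
    marked:   number of predicted substrates
    drawn:    size of the interaction neighborhood
    """
    total = comb(universe, drawn)
    tail = sum(
        comb(marked, i) * comb(universe - marked, drawn - i)
        for i in range(k, min(marked, drawn) + 1)
    )
    return tail / total


def neighborhood_overlap_p(neighborhood: Set[str], predicted: Set[str], universe: int) -> float:
    """P value for the overlap between one neighborhood and the predicted set."""
    overlap = len(neighborhood & predicted)
    return hypergeom_upper_tail(overlap, universe, len(predicted), len(neighborhood))


# 6,696 proteins and 753 predicted substrates are taken from the text;
# the neighborhood size (20) and overlap (8) are hypothetical.
print(hypergeom_upper_tail(8, 6696, 753, 20))
```

Computing such a P value for every neighborhood yields the per-gene overlap significances that can then be compared between STE20 interactors and all other genes, for example with a rank-based test as in the text.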

2.4.5 Polarisome components Bud6p and Bni1p are in vitro substrates of Ste20p

We sought to validate our approach to Ste20p substrate prediction by assaying several high-scoring proteins that are also present in STE20 interactor neighborhoods.

The neighborhood cluster analysis described above identified a number of putative substrates involved in polarized growth. Ste20p participates with Cdc42p in the establishment of polarized growth at directed sites in response to intrinsic budding cues and extrinsic signals such as mating pheromones or altered nutrient conditions [137]. In these processes, Ste20p has been genetically linked to the polarisome, a 12S macromolecular complex that controls polarized growth and morphogenesis [104, 210].

Examination of the set of predicted Ste20p substrates revealed that three of the four polarisome components (Bni1p, Bud6p, and Spa2p) were predicted with high probability (>0.98) to be Ste20p substrates (Table 3-S1). Furthermore, these three were also identified numerous times when cross-referenced against STE20 interactor GINs and PINs; BNI1 is a member of 16 neighborhoods, SPA2 is a member of 15, and BUD6 is a member of 10. We verified the predictions by performing in vitro kinase assays with the polarisome proteins to determine whether they could serve as substrates of Ste20p. The candidate substrates were exogenously expressed as GST-fusions, and assays were performed essentially as in the biochemical screen (see Materials and Methods). As shown in Figure 2–6, Bni1p and Bud6p are both phosphorylated by Ste20p in vitro. Spa2p was not phosphorylated by Ste20p (data not shown). Thus, both Bni1p and Bud6p were predicted and experimentally confirmed to be Ste20p substrates.

To gain greater insight into the consequences of Ste20p phosphorylation, we identified phosphorylation sites in Bni1p and Bud6p. Bni1p is a large protein (1,953 residues) which was previously shown to be a phosphoprotein in vivo whose phosphorylation is reduced in a ste20Δ mutant [104]. It comprises an N-terminal Cdc42p binding domain and three C-terminal regions characteristic of the formin-family of proteins, which together constitute the actin-assembly machinery.
These C-terminal regions include the Formin-homology 1 and 2 (FH1 and FH2) domains, and a C-terminal tail region (COOH) that includes two domains: the Bud6p-binding domain (BBD) and a cis-inhibitory Dia-autoregulatory domain (DAD) (Figure 2–6A). Given the functional importance of the C-terminal domains, we expressed subclones composed of the FH1, FH2, and COOH domains as GST-fusion proteins and repeated the in vitro Ste20p kinase assays. As seen in Figure 2–6B, phosphorylation was observed in the constructs containing the COOH region, but was not observed in constructs in which it is absent. Thus, Ste20p phosphorylation in vitro occurs within the region of Bni1p responsible for binding Bud6p.

Next, we sought to validate that Bud6p is phosphorylated in vitro by Ste20p (Figure 2–6C). While the domain organization of Bud6p is not as well-defined as for Bni1p, it has been determined that the C-terminal region (residues 519–788) is involved in dimerization as well as binding Bni1p and actin, whereas the N-terminal region (residues 1–166) is required for proper Bud6p localization [136]. We thus subcloned Bud6p to determine which regions are phosphorylated by Ste20p in vitro. As seen in Figure 2–6C, weak phosphorylation is observed in the N-terminal fragment, a stronger signal is observed in the uncharacterized middle region, and no phosphorylation is observed in the region involved in dimerization or binding to actin and Bni1p. Thus, while Ste20p phosphorylates Bni1p in the region responsible for Bud6p binding, it does not phosphorylate Bud6p in the region responsible for binding Bni1p.

We used mass spectrometry to identify in vivo phosphorylation sites for Bud6p. Using standard procedures (see [219] and Materials and Methods), we identified in vivo phosphorylation of a TAP-tagged Bud6p fusion protein on two residues: serine 327 and serine 342. These two residues are found in the middle fragment, which was phosphorylated by Ste20p in vitro (Figure 2–6C). Thus, Bud6p is a phosphoprotein in vivo and the phosphorylation on residues Ser327 and Ser342 correlates with the phosphorylation of the same region by Ste20p in vitro.
Given that the region of Bni1p which is phosphorylated by Ste20p is required for viability in the absence of CLA4 [104], we asked whether the same is true for Bud6p. Expression of a bud6 construct with the region containing both phosphorylation sites deleted (bud6Δ272–411) retains the ability to rescue the lethality of a bud6Δ cla4Δ strain and results in a morphology similar to a cla4Δ mutant (data not shown). While the in vivo relevance of Bud6p phosphorylation remains to be determined, the substrate predictor correctly suggested that direct phosphorylation of polarisome proteins by Ste20p occurs in vitro, and thus presents opportunities for directed investigation of the activation mechanism of this complex.

2.5 Discussion

We developed a strategy to aid the discovery of substrates for any given kinase and demonstrated its utility with the yeast Ste20p kinase. In particular, we tested 539 proteins in a biochemical screen for in vitro substrates of Ste20p. The results were used to generate an in silico predictor of Ste20p substrates. Cross-referencing the predicted substrates against known genetic and physical interactions highlighted polarisome components as likely targets of in vivo phosphorylation by Ste20p. Of these components, Bni1p and Bud6p were shown to be phosphorylated by Ste20p in vitro. The phosphorylation was mapped to a region of Bni1p which is genetically associated with STE20, and which binds to Bud6p. In vivo phosphorylation sites were identified on Bud6p, but the physiological relevance of these sites remains unclear.

The decision to screen proteins coded by essential genes was made to address the shared essential function of STE20 and CLA4 [65]. We reasoned that if the phosphorylation of a single substrate is responsible for the synthetic lethality of the ste20Δ cla4Δ mutant, then the gene coding for that substrate may also be essential. We thus screened roughly half the complement of yeast essential genes, selected on the basis of GO annotations that we assumed might overlap with known functions or localizations of Ste20p and Cla4p. As shown in Table 2–1, 10 of 14 substrates phosphorylated by Ste20p were also phosphorylated by Cla4p. It remains to be determined whether any of these substrates may be responsible for the shared essential function of Ste20p and Cla4p. While the identification and mutational analysis of in vivo phosphorylation sites on these substrates should yield insight into this question, our results also suggest that the phosphorylation of Bni1p is related to this essential function.

Our approach to kinase-substrate identification can complement and support various other methods currently used to identify kinase substrates. For instance, an in vitro screening approach using protein microarrays [211] has identified 70 substrates for Ste20p, 11 of which were also tested in our screen. We confirmed a 36% recovery of chip-identified substrates in our screen, which is generally consistent with observed differences between related high-throughput assays and with the observation that different approaches to identify physical, genetic, and biochemical interactions are complementary to each other [262].

This earlier study also employed a pattern-searching algorithm to identify predictive phosphorylation motifs for each of the tested kinases. It succeeded in identifying phosphorylation motifs for 11 of the 87 kinases tested, suggesting that these are kinases with strict phosphorylation site requirements [211]. Though that study did not identify a Ste20p phosphorylation site motif, our multi-motif predictor of Ste20p substrates exhibits an estimated three-fold improvement, over screening randomly selected proteins, in the rate of substrate identification. This substrate enrichment is especially significant since the predictor is based on a screen of only 539 proteins. Thus, our approach improves upon previous efforts to capture sequence motifs that predict substrate status.
This improvement is likely due to the fact that our predictive approach does not rely on strict phosphorylation site requirements, but rather a set of sequence motifs that are not restricted to describing the phosphorylation site and can therefore predict substrates by other means. We reasoned that, while no individual sequence feature may be sufficient to identify Ste20p substrates, the combination of relevant motifs may allow for substrate prediction with a higher degree of accuracy.

Our approach identified 23 motifs that were then used in a naive Bayes classifier to assign a posterior probability of being a Ste20p substrate to each member of the proteome. Amino acid preferences at the phosphorylation sites of human Ste20p-related kinases have been specified via position-specific scoring matrices (PSSMs) [217], and some of the motifs identified in this study capture the predominant preference for basic amino acids at positions N-terminal to the sites (i.e. K..HS and R.S..H). Moreover, we showed that integrating the PSSMs with our motifs does not improve the accuracy of our Ste20p substrate predictor. This suggests that our motifs sufficiently capture the preferences specified by the PSSMs that are useful for predicting Ste20p substrates. It remains to be determined how these motifs participate in specifying Ste20p phosphorylation, whether via cis effects through a docking interaction with Ste20p, or trans effects through binding with a Ste20p-associated scaffold such as Bem1p, Ste5p (SGDID:S000002510), or Far1p (SGDID:S000003693). Though the molecular functions of the motifs are not yet clear, our analyses showed significant associations between the set of predicted substrates and genes/proteins that are already (directly or indirectly) associated with STE20. Thus, although the predictor is based on the in vitro biochemistry and primary structure of proteins, employing multiple evolutionarily-conserved selective motifs seems to result in biologically relevant substrate predictions.
While the predictor remains a tool for identifying a biochemical relationship between a kinase and its potential substrates, providing biological context, for example by pathway analysis, suggests hypotheses that physiologically relate predicted substrates to kinases. While pathway analysis reveals that the predicted Ste20p substrate set appears consistent with known roles of Ste20p, the analysis also suggests a potential role for Ste20p in vesicle-mediated transport, which is supported by the observation that the human Ste20p-related kinase Pak1 plays a role in regulating vesicular-based transport in human fibroblasts [73]. Likewise, a role for Ste20p in carbohydrate metabolic processes is supported by the observation that Pak1 phosphorylates and activates phosphoglucomutase-1 (PGM; Ensembl:ENSG00000079739) [110].
The biological or in vivo relevance of the predicted substrates was also evaluated using the genetic and physical interaction networks surrounding Ste20p. Combining genetic and physical interaction data with biochemical data has been shown to be an effective means of evaluating and assigning biological relevance to observed phosphorylation. For instance, the NetworKin methodology employs several types of data in order to assign thousands of identified in vivo phosphorylation sites to the roughly 500 human kinases [167]. In that framework, genetic and physical interactions are used to evaluate possible kinase-substrate relationships. Here, we employ a similar approach to evaluate substrates that have been predicted computationally. The analysis focused on genes/proteins that are closely linked to STE20 in the genetic and physical interaction networks, reasoning that these are the most likely to represent strong candidates for biologically relevant associations with the kinase. Indeed, STE20 interactor GINs tend to overlap more significantly with the predicted substrate set than the GINs of other genes in the network. Analogously, Ste20p interactor PINs tend to overlap more significantly than other PINs with the predicted substrate set. The interaction neighborhood analyses therefore support the hypothesis that the predicted substrates represent candidates for in vivo phosphorylation by Ste20p.
Large-scale screens and bioinformatics analyses are a great source of novel biological hypotheses, and our analyses led to the hypothesis that Ste20p phosphorylates components of the polarisome complex. Here, we confirmed that Bni1p and Bud6p are phosphorylated by Ste20p in vitro, but that the other two components are not. Spa2p was assigned a high posterior probability of being a Ste20p substrate and is also present in many Ste20p interactor neighborhoods. It may therefore represent a false prediction. However, it is also possible that there are additional requirements for Spa2p phosphorylation by Ste20p that are not present in the in vitro assay. Nonetheless, our predictive method resulted in the identification of two novel substrates for Ste20p phosphorylation.
The Ste20p phosphorylation of Bni1p occurs in the C-terminal Bud6p-binding domain (BBD). It has been previously shown that expression of a bni1 construct lacking this BBD region is unable to rescue the synthetic lethality of a bni1Δ cla4Δ mutant, and therefore the transformed mutants exhibit a terminal phenotype similar to that of ste20Δ cla4Δ mutants [104]. The phenotypic similarity between strains lacking the region of Bni1p phosphorylated by Ste20p and those lacking Ste20p altogether suggests that phosphorylation of this region occurs in vivo and is required in the absence of CLA4. As its name implies, this region also binds the other polarisome substrate of Ste20p, Bud6p, indicating the potential for sophisticated regulatory coordination. Though the in vivo relevance of Ste20p phosphorylation and the coordination of the phosphorylation of Bni1p and Bud6p in the regulation of polarized growth remain to be determined, our predictive approach to substrate identification has provided a framework for further investigation.
Phosphorylation is a key modification involved in signal transduction, and thus participates in many of the dynamic processes of the cell. Despite this importance, identifying the detailed architecture of phosphorylation networks remains a challenge. Here, we have described a combination of biochemical genomics and bioinformatics to identify potential new substrates for the Ste20p kinase in yeast. We have confirmed experimentally that our approach predicts valid in vitro Ste20p substrates, and leads to greater insight into the functions of this kinase.

2.6 Materials and Methods

2.6.1 Materials

Restriction endonucleases and DNA-modifying enzymes were obtained from New England Biolabs and GE Healthcare. Protease inhibitor tablets and reduced glutathione were obtained from Roche. Glutathione-Sepharose 4B beads were purchased from GE Healthcare. Radioisotopes were purchased from GE Healthcare and Perkin Elmer, and film for autoradiography was BioMax MS from Kodak. Acid-washed glass beads (450-600 µm), protease inhibitors, sorbitol, and trypsin were purchased from Sigma. The yeast GST-6xHIS Open Reading Frame collection was purchased from Open Biosystems.

2.6.2 Construction of plasmids

Yeast expression GST-fusion proteins were obtained from Open Biosystems [283]. The GST-Bni1 constructs were kindly provided by C. Boone (University of Toronto). The Bud6p fragments were expressed in E. coli. Relevant fragments were amplified from genomic DNA [12] using the following oligonucleotides:
pRA210 expresses full-length BUD6 and was constructed using 5’-GAGACCCGGGGAATGAAGATGGCCGTGGATGACC-3’ and 5’-GAGACTCGAGTTAAGTAAACCCCGGCCCAAAATATGC-3’.
pRA211 expresses BUD6(1–272) and was constructed using 5’-GAGACCCGGGGAATGAAGATGGCCGTGGATGACC-3’ and 5’-GAGACTCGAGTTAAGCTTCTGTTGTAGACTGATTTGTC-3’.
pRA212 expresses BUD6(272–519) and was constructed using 5’-GAGACCCGGGAGCTGCTGCGGCTGCCGGCCTCATGAC-3’ and 5’-GAGACTCGAGTTACCTATTAATATTATGCACTTGTTT-3’.
pRA213 expresses BUD6(519–788) and was constructed using 5’-GAGACCCGGGAAACAAGTGCATAATATTAATAGG-3’ and 5’-GAGACTCGAGTTAAGTAAACCCCGGCCCAAAATATGC-3’.
The oligonucleotides contain SmaI (CCCGGG) or XhoI (CTCGAG) sites. The PCR products were inserted into the vector pGEX-5T by cutting both with SmaI and XhoI followed by ligation. The resultant plasmids were confirmed by sequencing.

2.6.3 Yeast strains and protein purifications

Yeast media, culture conditions, and manipulations were as described [221]. Transformation of yeast with plasmid DNA was achieved with lithium acetate and standard protocols [221]. Growth and induction of yeast strains for the biochemical screen were essentially as described [283]. Cell patches were inoculated in SD (2%) -ura medium, grown overnight, washed, reinoculated in raffinose (4%) -ura, and grown to an absorbance at 600 nm of 0.8. Cultures were pooled by combining 5 ml of each and were then induced with 4% (final concentration) galactose for three hours. GST-fusion proteins were isolated on glutathione-Sepharose beads as previously described [271]. Isolated proteins conjugated to beads were dried and kept at 4 °C no longer than overnight.
The biochemical screen followed an iterative process, with first-round pools comprising eight fusion proteins, followed by fractionation of positive pools by halves until single positives were identified.
Expression of full-length BUD6 and associated fragments was in E. coli strain BL21, which was induced with 0.4 mM IPTG for three hours. Fusion proteins were obtained as described [239].
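The iterative pooling scheme of the biochemical screen can be sketched as a halving search. This is only an illustration under stated assumptions, not the laboratory protocol: the names `screen`, `deconvolve` and `is_positive` are hypothetical, and the sketch assumes an idealized assay in which a pool scores positive exactly when it contains at least one substrate.

```python
from typing import Callable, List, Sequence


def deconvolve(pool: Sequence[str],
               is_positive: Callable[[Sequence[str]], bool]) -> List[str]:
    """Split a positive pool by halves until single positives remain."""
    if not pool or not is_positive(pool):
        return []                      # negative pools are not pursued further
    if len(pool) == 1:
        return [pool[0]]               # a single positive has been isolated
    mid = len(pool) // 2
    return (deconvolve(pool[:mid], is_positive)
            + deconvolve(pool[mid:], is_positive))


def screen(library: Sequence[str],
           is_positive: Callable[[Sequence[str]], bool],
           pool_size: int = 8) -> List[str]:
    """First round: pools of `pool_size` proteins; positives are then halved."""
    hits: List[str] = []
    for start in range(0, len(library), pool_size):
        hits.extend(deconvolve(library[start:start + pool_size], is_positive))
    return hits


# Hypothetical example: 20 fusion proteins, three of which are substrates.
library = [f"P{i}" for i in range(20)]
substrates = {"P3", "P9", "P17"}
hits = screen(library, lambda pool: any(p in substrates for p in pool))
```

In this idealized setting a pool of eight containing one substrate is resolved with three further rounds of assays, which is the appeal of pooling when positives are rare.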

2.6.4 Protein kinase assays

The biochemical screen was designed to screen roughly 10% of the yeast proteome. We reasoned that a substrate responsible for the shared essential function of Ste20p and Cla4p might itself be essential. We thus created a library of 539 essential proteins based on their GO functional and localization annotations [207], reasoning that these would likely still exhibit biochemical diversity characteristic of the protein population as a whole. GO terms used for the selection included broad categories such as signal transduction, protein translation or degradation, and cell cycle progression, among others.
Kinase assays were as described [273]. Dried beads with GST-fusion proteins bound were resuspended in kinase buffer supplemented with 2 µM ATP and 1 µl [γ-32P]-ATP (4,500 Ci/mmol, 10 µCi/µl) and were split in two aliquots, one of which received 1 µl of recombinant GST-Ste20p and the other received an equal volume of protein storage buffer. Reaction mixtures were incubated for 30 minutes and then boiled for 5 minutes after the addition of Laemmli buffer. Samples were separated by SDS-PAGE, dried, and visualized by autoradiography.

2.6.5 Mass spectrometry

HPLC grade water and acetonitrile were purchased from Fisher Scientific (Whitby, ON, Canada). Formic acid (FA) and ammonium bicarbonate were obtained from EM Science (Mississauga, ON, Canada). Fused silica capillaries were purchased from Polymicro Technologies (Phoenix, AZ). Jupiter C18, 5 µm particle material was from Phenomenex (Torrance, CA).
All LC-MS analyses were performed using a Nano-Acquity Q-TOF Premier (Waters, Milford, MA) with a home-made C18 pre-column (5 mm x 300 µm i.d., Jupiter 3 µm C18) and an analytical column (10 cm x 150 µm i.d., Jupiter 3 µm C18). Sample injection was 10 µl, and tryptic digests were first loaded on the pre-column at a flow rate of 4 ml/min and subsequently eluted onto the analytical column using a gradient from 10% to 60% aqueous acetonitrile (0.2% formic acid) over 56 minutes.

2.6.6 Identification of predictive motifs

S. cerevisiae protein sequences were obtained from the Saccharomyces Genome Database (SGD) in August 2008 [209]. The sequence of each of the 19 known substrates (14 from the in vitro screen and five from the literature, together forming the positive learning set) was scanned with a six-amino-acid sliding window to identify sequence fragments in which at least three of the residues are predicted to be exposed according to ACCpro 4.0 [53]. The identified fragments (with overlapping fragments merged into single fragments) were then used as input to the Teiresias algorithm [220] in order to identify motifs characterized with regular expressions. The algorithm parameters were set so that identified motifs must contain at least three literal (i.e. non-wildcard) residues and that any three consecutive literals span at most six amino acids. Each motif $m$ was evaluated with respect to the positive and negative (non-substrates of the in vitro screen) learning sets by computing its selectivity ratio

\[ s_m = \frac{\sum_{j} w_{j,m} / N_{\mathrm{pos}}}{\sum_{k} w_{k,m} / N_{\mathrm{neg}}} \]

where $N_{\mathrm{pos}} = 19$ is the size of the positive set, $N_{\mathrm{neg}} = 525$ is the size of the negative set, $w_{j,m}$ is the total weight of motif $m$ for substrate $j$ for $j = 1 \ldots N_{\mathrm{pos}}$, and $w_{k,m}$ is the total weight of motif $m$ for non-substrate $k$ for $k = 1 \ldots N_{\mathrm{neg}}$. The total weights were computed as shown in Figure 2–1. The multiple sequence alignments were constructed with ClustalW2 [54] together with Saccharomyces sequences and orthology mappings obtained elsewhere [209, 142]. An alignment may include two sequences for proteins in S. bayanus and/or S. mikatae since for each of these species there were two different research groups that generated sequences. In such an alignment, the conservation score considers whether the motif occurrence was conserved in any of the sequences for a given organism and therefore does not double-count. The weights of overlapping motif occurrences were adjusted so that the overlapping region contributes to the weight of only one of the motif occurrences (Figure 2–1).
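Two steps of this subsection, the exposed-fragment scan and the selectivity ratio, can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the per-residue exposure calls are taken as a ready-made list of booleans (standing in for ACCpro output), the per-protein total weights are taken as given (their computation follows Figure 2–1 and is not reproduced), and returning infinity when a motif has zero weight in the negative set is a choice made here, not something stated in the text.

```python
from typing import List, Sequence, Tuple


def exposed_fragments(exposed: Sequence[bool], window: int = 6,
                      min_exposed: int = 3) -> List[Tuple[int, int]]:
    """Return merged half-open (start, end) spans of qualifying windows.

    A window qualifies if at least `min_exposed` of its residues are exposed;
    overlapping qualifying windows are merged into a single fragment.
    """
    spans: List[Tuple[int, int]] = []
    for start in range(len(exposed) - window + 1):
        if sum(exposed[start:start + window]) >= min_exposed:
            end = start + window
            if spans and start < spans[-1][1]:      # overlaps previous fragment
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


def selectivity_ratio(pos_weights: Sequence[float],
                      neg_weights: Sequence[float]) -> float:
    """Mean total weight in the positive set over that in the negative set."""
    pos_mean = sum(pos_weights) / len(pos_weights)
    neg_mean = sum(neg_weights) / len(neg_weights)
    # Assumption: a motif never seen in the negative set gets an infinite ratio.
    return float("inf") if neg_mean == 0 else pos_mean / neg_mean
```

For the predictor described in this chapter, the two weight lists would have lengths 19 and 525, and only motifs with a ratio of at least 10 would be retained.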

2.6.7 The predictor of Ste20p substrates and its cross-validation

The predictor is implemented as a naive Bayes classifier, and thus computes the posterior probability (i.e. prediction score) that a given sequence $t$ encodes a Ste20p substrate as follows:

\[ \mathrm{score}_t = \left[ 1 + s_1^{-w_{t,1}} \, s_2^{-w_{t,2}} \cdots s_n^{-w_{t,n}} \cdot \frac{P(!S)}{P(S)} \right]^{-1} \tag{2.1} \]

where $P(S) = 14/539$ is the substrate identification rate of the in vitro screen (i.e. the prior probability), $P(!S) = 1 - P(S)$ is the non-substrate identification rate of the in vitro screen,

$s_m$ represents the selectivity ratio for motif $m$,

$w_{t,m}$ represents the total weight of motif $m$ in sequence $t$, and

$n = 23$ is the number of motifs used by the predictor (each with $s_m \geq 10$).
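Equation 2.1 can be evaluated directly from the selectivity ratios and motif weights. The sketch below is one way to do so, under assumptions: the function name `substrate_score` is hypothetical, the computation is carried out in log space for numerical stability (a choice made here), and the selectivity ratios are assumed to be positive and finite.

```python
import math
from typing import Sequence


def substrate_score(selectivity: Sequence[float], weights: Sequence[float],
                    prior: float = 14 / 539) -> float:
    """Posterior probability of Equation 2.1.

    selectivity[m] is s_m and weights[m] is w_{t,m} for the same motif m.
    """
    # log of [ prod_m s_m^(-w_tm) * P(!S)/P(S) ], the odds against a substrate
    log_odds_against = math.log((1 - prior) / prior)
    for s_m, w_tm in zip(selectivity, weights):
        log_odds_against -= w_tm * math.log(s_m)
    if log_odds_against > 700:      # exp would overflow; posterior is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(log_odds_against))
```

With no motif occurrences the score reduces to the prior, 14/539. Because P(!S)/P(S) = 525/14 = 37.5, a single unit-weight occurrence of a motif with a selectivity ratio of 10 gives 1/(1 + 3.75), or about 0.21, so several occurrences of selective motifs are needed before a sequence approaches a score of 0.9.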

The $w_{t,m}$ values used in Equation 2.1 are computed as in Figure 2–1 except that conservation is not considered and consequently, $w_{t,m,i} = u_{t,m,i}$ for the $i$th occurrence of motif $m$ in sequence $t$. Not considering conservation during prediction allows for added flexibility. For example, it may not be possible to map the sequence of a synthetic peptide to a region of the genome that can be assessed for conservation across species. However, the peptide can still be assessed for the likelihood that it is a Ste20p substrate based on the presence/absence of predictive motifs in its sequence.
Rather than using standard leave-one-out cross-validation where in each iteration, a different (positive or negative) learning example is not used to generate the predictor, we only left out positive examples. A random selection of 100 proteins from the negative set (~20%) was set aside for testing the predictors resulting from the different iterations. We cross-validated in this way due to the computationally intensive process of deriving selectivity ratios for the thousands of motifs discovered by Teiresias during each iteration.

We experimented with different parameter values and used the overall performance, estimated with the area under the receiver operating characteristic curve (AUC), of the resulting predictors to guide the selection of optimal parameter values for the final predictor. For example, the selectivity ratio threshold controls the number of motifs that are used by the predictor, since the smaller the threshold, the more motifs will have selectivity ratios above the threshold, and all motifs that pass the threshold are used. While using more motifs increases the possibility of false positives (i.e. a reduction in specificity), since it becomes more likely for any sequence to contain an occurrence of any predictive motif by chance, the sensitivity of the predictor improves (data not shown). For different parameter values, we estimated the AUC of the resulting predictors using the variation of leave-one-out cross-validation described above. The parameter values that result in larger AUCs are better. The optimal parameter values used to build the final predictor are described above.
To integrate the amino acid preferences at the phosphorylation sites of Ste20p-related kinases, we obtained the PSSMs for Pak1, Pak2 and Pak4 [217]. The PSSMs were modified to reflect the background frequencies of amino acids in the S. cerevisiae proteome (computed from our collection of protein sequences). Specifically, each PSSM is a 20×10 matrix with entries defined as $x_{ij} = \log(p_{ij}/b_i)$ where $p_{ij}$ is the probability of observing amino acid $i$ at position $j$ (in a 10-residue subsequence with the putative phosphorylated residue in the centre), and $b_i$ is the background frequency of amino acid $i$. For any given serine or threonine (S/T) in a sequence $t$, the PSSM score is defined as the sum of the $x_{ij}$ values corresponding to the observed amino acids flanking the S/T. We define an S/T with a PSSM score greater than or equal to a selected threshold as a likely phosphorylation site of the corresponding kinase.
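The PSSM scoring step can be illustrated with a short sketch. Several details here are assumptions rather than the method as implemented: the ten scored positions are taken to be the five residues on each side of the S/T, sites too close to a terminus to have full flanks are skipped, the natural logarithm is used, and the names `log_odds_pssm` and `likely_sites` are hypothetical. All probabilities are assumed to be strictly positive.

```python
import math
from typing import Dict, List, Sequence, Tuple

FLANK = 5  # assumed: five scored positions on each side of the S/T


def log_odds_pssm(probs: Sequence[Dict[str, float]],
                  background: Dict[str, float]) -> List[Dict[str, float]]:
    """Convert per-position probabilities p_ij into x_ij = log(p_ij / b_i)."""
    return [{aa: math.log(p / background[aa]) for aa, p in column.items()}
            for column in probs]


def likely_sites(seq: str, pssm: Sequence[Dict[str, float]],
                 threshold: float) -> List[Tuple[int, float]]:
    """Return (position, score) for each S/T scoring at or above threshold."""
    sites = []
    for pos, residue in enumerate(seq):
        if residue not in "ST":
            continue
        if pos < FLANK or pos + FLANK >= len(seq):
            continue                    # assumption: skip sites lacking full flanks
        flank = seq[pos - FLANK:pos] + seq[pos + 1:pos + 1 + FLANK]
        score = sum(column[aa] for column, aa in zip(pssm, flank))
        if score >= threshold:
            sites.append((pos, score))
    return sites
```

The number of sites returned for a sequence corresponds to the simplified PSSM weight used at prediction time, while the conservation-weighted version used for selectivity ratios additionally needs the aligned orthologous sequences.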

The selectivity ratio for PSSM $m$ with a given score threshold is defined as

\[ s_m = \frac{\sum_{j} w_{j,m} / N_{\mathrm{pos}}}{\sum_{k} w_{k,m} / N_{\mathrm{neg}}} \]

analogous to the selectivity ratio for a regular-expression-based motif. However, here the total weight of PSSM $m$ in a sequence $t$ is defined as $w_{t,m} = \sum_{i} c_{t,m,i}$, where $c_{t,m,i}$ represents the conservation score of the $i$th likely phosphorylation site in $t$ according to $m$, and is defined as the fraction of Saccharomyces orthologues that also have a likely phosphorylation site according to $m$ at the position aligned to the likely site in S. cerevisiae. For each PSSM, we computed selectivity ratios using a range of score thresholds and selected the threshold that produced the largest selectivity ratio.

The selected threshold for a PSSM $m$ and the corresponding $s_m$ are used to integrate $m$ into the naive Bayes classifier (Equation 2.1). As with a regular-expression-based motif, the $w_{t,m}$ values used in Equation 2.1 are simplified; here, $w_{t,m}$ is equal to the number of likely phosphorylation sites in sequence $t$ according to PSSM $m$ (with the selected threshold).

2.6.8 Gene Ontology (GO) analysis

GO slim annotations from all three ontologies were obtained from SGD in August 2008 [209]. The significance of the over-representation of GO category gene sets amongst the predicted substrates was computed using the hypergeometric test in the context of all annotated protein-coding genes. For each ontology, Benjamini and Hochberg multiple-test correction [27] was performed across all categories exhibiting a non-zero overlap with the predicted substrates to obtain adjusted P values.
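The two statistical steps of this analysis, an upper-tail hypergeometric test followed by Benjamini and Hochberg adjustment, can be sketched with the standard library alone. The function names are hypothetical and the sketch is illustrative; it is not the code used to produce Tables 3–1 to 3–3.

```python
from math import comb
from typing import List, Sequence


def hypergeom_pvalue(population: int, category: int,
                     sample: int, overlap: int) -> float:
    """P(X >= overlap) when `sample` genes are drawn without replacement from
    `population` genes, of which `category` carry the annotation."""
    upper = min(category, sample)
    tail = sum(comb(category, k) * comb(population - category, sample - k)
               for k in range(overlap, upper + 1))
    return tail / comb(population, sample)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):            # walk from largest to smallest P
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Following the text, the list passed to `benjamini_hochberg` would contain only the categories with a non-zero overlap with the predicted substrates, computed separately for each ontology.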

2.6.9 Genetic and physical interaction network analysis

All S. cerevisiae genetic and physical interactions were obtained from BIOGRID v2.0.40 [41]. The networks were reduced to protein-coding genes and self-interactions were removed. To avoid bias and redundancy, Ste20p-substrate interactions from [211] were omitted from the physical network. The neighborhood of a gene/protein is defined as the set of genes/proteins that interact with it in the network. The significance of the overlap between a neighborhood and the set of predicted substrates was computed using the hypergeometric test in the context of all protein-coding genes. For each network, Benjamini and Hochberg multiple-test correction [27] was performed across all neighborhoods exhibiting a non-zero overlap with the predicted substrates to obtain adjusted P values.
If a gene/protein has been investigated in multiple interaction studies, it is likely to have more identified interactions compared to a less-studied gene/protein. Consequently, the interaction neighborhood of a frequently studied gene/protein is more likely to significantly overlap with the set of predicted substrates. To correct for this artifact of frequent study, we counted the number of times a gene/protein was used as a bait in interaction screens (b). We then considered a linear model that uses b to predict the multiple-test corrected P value for overlap. The model was trained using data for genes/proteins whose neighborhoods exhibit a non-zero overlap with the set of predicted substrates. The residuals of the model were taken as P values adjusted for frequent study (with negative residuals forced to zero).
The one-sided Mann-Whitney test was used to determine whether STE20 interactor neighborhoods tend to have lower P values compared to the neighborhoods of other genes in the network. The same statistical test was performed to determine whether the negative set proteins that are predicted as Ste20p substrates tend to be in more STE20 GINs compared to all negative set proteins (see Figure 3–4 and Supplementary Text). We also used GIN and PIN analyses to investigate known false negatives of the predictor (see Table 3–6 and Supplementary Text).
For Figures 2–4 and 2–5, we focused on predicted substrates that are present in at least one STE20 interactor GIN or PIN, respectively. The overlap profiles were clustered using the Ward agglomerative method and the binary distance metric (see hclust documentation in the R statistical computing framework [246]).
The significance of branch points in the resulting dendrograms was measured using multiscale bootstrap resampling (see the documentation for the pvclust R package [246]). The Approximately Unbiased (AU) score for a branch point is the percentage of resamples in which the branch point occurs so that larger percentages represent more significant branching.
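The correction for frequent study described in this subsection can be sketched as a simple regression on bait counts. The text specifies only "a linear model", so ordinary least squares with an intercept is an assumption here, as is the hypothetical function name `residual_adjusted`.

```python
from typing import List, Sequence


def residual_adjusted(bait_counts: Sequence[float],
                      pvalues: Sequence[float]) -> List[float]:
    """Regress adjusted P values on bait counts b and return the residuals,
    with negative residuals forced to zero."""
    n = len(bait_counts)
    mean_b = sum(bait_counts) / n
    mean_p = sum(pvalues) / n
    sxx = sum((b - mean_b) ** 2 for b in bait_counts)
    sxy = sum((b - mean_b) * (p - mean_p)
              for b, p in zip(bait_counts, pvalues))
    slope = sxy / sxx if sxx else 0.0       # flat fit if all counts are equal
    intercept = mean_p - slope * mean_b
    return [max(0.0, p - (intercept + slope * b))
            for b, p in zip(bait_counts, pvalues)]
```

The inputs would be restricted to genes/proteins whose neighborhoods have a non-zero overlap with the predicted substrates, matching the training set described above.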

2.7 Acknowledgements

We gratefully thank Dr. C. Boone for plasmids. This work is NRC publication 50677.

Figure 2–2: Significantly over-represented (adjusted P ≤ 0.05) GO slim annotations among the predicted Ste20p substrates. (A) Cellular components. (B) Biological processes. (C) Molecular functions.

Figure 2–3: Inferring the biological relevance of Ste20p predicted substrates via neighborhood analysis. (A) Depiction of the test for the statistical significance of the overlap between the set of predicted substrates and the interaction neighborhood (blue portion) of a given gene/protein. Here the given gene is known to interact with STE20. (B) Neighborhood analysis in the context of the genetic network. Comparing the distributions of adjusted P values shows that the predicted substrates tend to overlap more significantly with the neighborhoods of STE20 interactors versus those of all genes in the network. (C) Neighborhood analysis in the context of the physical network. A similar trend is apparent here but the P values are less extreme for Ste20p physical interactors. Insets for (B) and (C) depict the distributions at higher resolution where P ≤ 0.1.

Figure 2–4: Clustering profiles of overlap between the predicted substrates and the GINs of STE20 genetic interactors. A solid black cell indicates the presence of a predicted substrate (row) in a GIN (column). A cluster of predicted substrate profiles is shown at higher resolution. SPA2 and BNI1 form a significant subcluster (AU score = 97, Figure 3–2B) and they are involved with polarized growth.

Figure 2–5: Clustering profiles of overlap between the predicted substrates and the PINs of Ste20p physical interactors. A solid black cell indicates the presence of a predicted substrate (row) in a PIN (column). Several predicted substrates implicated in polarized growth are clustered together (AU score = 94, Figure 3–3B). Highlighted in red is a subcluster of predicted substrates that are present in the PINs of Ste20p physical interactors that are also involved with polarity (Cdc42p, Cdc24p, Bem1p).

Figure 2–6: Bni1p and Bud6p are phosphorylated by Ste20p in vitro. (A) A schematic representation of the functional regions of Bni1p. These are the Formin-Homology domains (FH1, FH2, and FH3), GTPase binding domain (GBD), Spa2p-binding domain (SBD), Dia-autoregulatory domain (DAD), and Bud6p-binding domain (BBD). The region C-terminal to the FH2 domain, which contains part of the BBD, is referred to as the COOH region in the text (figure adapted from [82]). (B) Ste20p only phosphorylates Bni1p constructs containing the COOH region. Constructs composed of different combinations of the FH1 and FH2 domains and the COOH region were purified and equal concentrations of each were assayed in in vitro kinase assays with Ste20p and [γ-32P]-ATP, then visualized by SDS-PAGE and autoradiography. The three constructs containing the COOH region are phosphorylated (with the position of the labeled peptides in their respective lanes indicated by arrows at left), whereas the constructs without the COOH region are not. (C) Ste20p phosphorylates the central region of Bud6p. In the left panel, full-length Bud6p is phosphorylated by Ste20p. In the right panel, the middle fragment of Bud6p exhibits strong evidence of phosphorylation and the N-terminal fragment exhibits weak evidence of phosphorylation. No signal is detected for the C-terminal fragment.

CHAPTER 3
Supplementary Information for Chapter 2

This chapter contains all the supplementary information for the manuscript in chapter 2, except for Table 3-S1. Table 3-S1 contains the substrate prediction scores for all ~6000 yeast proteins and is not intended for the printed page. However, the table is available for download (i.e. no subscription required) at www.plosone.org, PLoS ONE 4: e8279 (2009).

3.1 Supplementary Text

Rennefahrt and colleagues determined amino acid preferences at the phosphorylation sites of the human Ste20p-related kinases Pak1 (Ensembl:ENSG00000149269), Pak2 (Ensembl:ENSG00000180370) and Pak4 (Ensembl:ENSG00000130669) [217]. For each kinase, the preferences are encoded in a position-specific scoring matrix (PSSM) that can be used to score each serine and threonine in any given sequence. All serines and threonines with a PSSM score that exceeds a selected threshold are deemed likely phosphorylation sites of the kinase. We computed a selectivity ratio for each PSSM (similar to the selectivity ratios of the motifs identified in this study, see Materials and Methods in chapter 2) such that a PSSM with a selectivity ratio greater than one indicates that likely phosphorylation sites are more prevalent in the positive set than in the negative set. The selectivity ratios for all the PAK PSSMs are less than the ratio threshold (i.e. 10) we used to select regular-expression-based motifs for the predictor (the selectivity ratios for the Pak1, Pak2, and Pak4 PSSMs are 1.66, 2.30 and 7.88, respectively). A classifier that integrates the motifs identified in this study with the PSSM with the largest selectivity ratio (i.e. the Pak4 PSSM) is approximately as accurate as the classifier that only uses our motifs (Figure 3–1, see Materials and Methods in chapter 2). A naive Bayes classifier that integrates our motifs with all the PSSMs also has comparable accuracy (Figure 3–1). Taken together, our results show that integrating the PAK PSSMs does not improve the accuracy of our Ste20p substrate predictor.
We also employed GIN and PIN analyses to investigate the known false negatives and false positives generated by the predictor. Five of the 19 Ste20p substrates identified in our initial screen or from the literature were poorly scored (score < 0.3) by the predictor (Table 3–6). Four of these are members of the GINs or PINs of STE20 interactors, suggesting that they retain in vivo relevance but do not possess the sequence features used by the predictor to characterize Ste20p substrates. These represent likely false negatives of the predictor. Furthermore, among the set of proteins that were not phosphorylated in our screen, 34 proteins were predicted as Ste20p substrates (Figure 3–4). These proteins tend to be found surprisingly often in the GINs of STE20 interactors compared to all the proteins from the negative learning set (P = 3.17 × 10^-5, Mann-Whitney test). These proteins therefore represent likely false negatives of the in vitro screen, yet are correctly predicted substrates.

Table 3–1: GO slim Cellular Component analysis of predicted Ste20p substrates (score ≥ 0.9).

GO Slim Term                          GO Slim Term Size  Overlap Size  P value     Adjusted P value
cellular bud                          160                47            3.7100e-11  9.2800e-10
site of polarized growth              162                45            7.4700e-10  9.3400e-09
cell cortex                           105                33            5.0400e-09  4.2000e-08
plasma membrane                       265                53            3.6800e-06  2.3000e-05
mitochondrion                         1051               155           5.1800e-06  2.5900e-05
cytoskeleton                          203                43            6.9200e-06  2.8800e-05
cytoplasm                             2834               345           4.7600e-04  1.6990e-03
vacuole                               202                34            4.6840e-03  1.4638e-02
nucleus                               1794               221           5.7730e-03  1.6035e-02
Golgi apparatus                       215                34            1.2190e-02  3.0476e-02
other                                 384                54            2.0105e-02  4.5693e-02
cytoplasmic membrane-bounded vesicle  103                18            2.4088e-02  4.6322e-02
[term name missing in source]         249                37            2.3283e-02  4.6322e-02
membrane                              911                115           2.6463e-02  4.7256e-02

Table 3–2: GO slim Biological Process analysis of predicted Ste20p substrates (score ≥ 0.9).

GO Slim Term                        GO Slim Term Size  Overlap Size  P value  Adjusted P value
transcription                       585                105           0.00019  0.00721
anatomical structure morphogenesis  156                35            0.00073  0.01379
cell budding                        84                 21            0.00207  0.02622
carbohydrate metabolic process      189                38            0.00368  0.03498
vesicle-mediated transport          340                61            0.00477  0.03626
cytokinesis                         116                25            0.00698  0.04424

Table 3–3: GO slim Molecular Function analysis of predicted Ste20p substrates (score ≥ 0.9).

GO Slim Term             GO Slim Term Size  Overlap Size  P value  Adjusted P value
protein kinase activity  129                36            0.00001  0.00027
lipid binding            53                 19            0.00004  0.00042
hydrolase activity       725                131           0.00013  0.00099

Table 3–4: Predicted substrates (score ≥ 0.9) in the neighborhoods of STE20 genetic interactors. See Figure 2–3A for an illustration of an interaction neighborhood. *Adjusted P values were not computed for genetic interactors with neighborhoods that do not overlap with the predicted substrates.

STE20 Genetic Interactor  Neighborhood Size  Overlap Size  P value      Adjusted P value
CDC42                     84                 26            0.000000839  0.001982
SWE1                      45                 16            0.000015800  0.006404
GIC2                      24                 10            0.000138000  0.031707
BUB2                      86                 21            0.000423000  0.042764
HSP82                     1729               233           0.000507000  0.042764
RAS2                      136                28            0.001025000  0.060549
CLA4                      197                36            0.002091000  0.080261
GIC1                      28                 9             0.002646000  0.090666
SLG1                      36                 10            0.005113000  0.118427
TAF9                      141                25            0.013734000  0.173212
STE4                      30                 8             0.015272000  0.178672
KES1                      25                 7             0.017428000  0.195614
BEM1                      50                 11            0.020991000  0.206127
CDC28                     61                 12            0.037011000  0.243204
STE12                     44                 9             0.053209000  0.294332
WHI2                      72                 13            0.056568000  0.305749
PMI40                     4                  2             0.065104000  0.311286
BEM3                      14                 4             0.063669000  0.311286
SEC14                     61                 11            0.076094000  0.343404
HSL7                      11                 3             0.118118000  0.356772
CDC34                     45                 8             0.127147000  0.379673
SIC1                      150                21            0.171830000  0.442463
RGA1                      13                 3             0.173463000  0.442463
SHO1                      7                  2             0.181741000  0.444381
AKR1                      8                  2             0.225106000  0.460346
KIN4                      8                  2             0.225106000  0.460346
STE5                      16                 3             0.266171000  0.511751
PUP3                      3                  1             0.301282000  0.514553
CLB2                      79                 11            0.272907000  0.514553
LTE1                      269                33            0.325320000  0.543426
RGA2                      4                  1             0.380007000  0.554402
SSK1                      11                 2             0.356320000  0.554402
STE11                     45                 6             0.396462000  0.568879
SPO12                     21                 3             0.426868000  0.603028
TEM1                      21                 3             0.426868000  0.603028
NCP1                      5                  1             0.449874000  0.603408
UBA4                      13                 2             0.439535000  0.603408
OCH1                      8                  1             0.615726000  0.706234
URM1                      12                 1             0.761898000  0.805036
MSB2                      7                  0             1.000000000  N/A*
STE3                      5                  0             1.000000000  N/A*
SKM1                      2                  0             1.000000000  N/A*

Table 3–5: Predicted substrates (score ≥ 0.9) in the neighborhoods of Ste20p physical interactors. See Figure 2–3A for an illustration of an interaction neighborhood. *Adjusted P values were not computed for physical interactors with neighborhoods that do not overlap with the predicted substrates.

Ste20p Physical Interactor  Neighborhood Size  Overlap Size  P value      Adjusted P value
Cdc28p                      241                67            4.81000e-13  1.44e-09
Bem1p                       26                 10            3.04000e-04  0.053593
Htb2p                       81                 18            3.30800e-03  0.153509
Myo3p                       30                 9             4.46000e-03  0.189745
Bmh1p                       51                 12            9.55000e-03  0.258371
Cdc42p                      29                 8             1.23880e-02  0.258371
Slt2p                       48                 11            1.56180e-02  0.290825
Cbk1p                       25                 7             1.74280e-02  0.311015
Asc1p                       50                 11            2.09910e-02  0.328345
Cdc24p                      23                 6             3.77300e-02  0.405232
Cbr1p                       18                 5             4.40050e-02  0.448729
Hsl7p                       8                  3             5.16460e-02  0.463581
Ste11p                      25                 6             5.46100e-02  0.465113
Ubc6p                       14                 3             2.03407e-01  0.630751
Bmh2p                       67                 10            2.17466e-01  0.641066
Prp21p                      42                 6             3.33462e-01  0.702544
Nup53p                      26                 4             3.35328e-01  0.702638
Ste4p                       26                 4             3.35328e-01  0.702638
Nbp2p                       4                  1             3.80007e-01  0.705863
Htb1p                       67                 9             3.39866e-01  0.705863
Ncp1p                       13                 2             4.39535e-01  0.731411
Cln2p                       46                 6             4.17478e-01  0.731411
Bem4p                       16                 2             5.52262e-01  0.78538
Boi2p                       10                 1             6.97505e-01  0.846607
Rad1p                       31                 3             6.94673e-01  0.846607
Erb1p                       84                 8             7.44227e-01  0.874634
Boi1p                       13                 1             7.88761e-01  0.889489
Erg4p                       2                  0             1.00000e+00  N/A*
Bud8p                       4                  0             1.00000e+00  N/A*

Table 3–6: The number of STE20-linked Genetic Interaction Neighborhoods (GINs) and Ste20p-linked Physical Interaction Neighborhoods (PINs) in which each known Ste20p substrate (from the positive learning set) appears. See Figure 2–3A for an illustration of an interaction neighborhood.

Gene   Predictor Score  # Overlapping STE20-linked GINs (out of 42)  # Overlapping Ste20p-linked PINs (out of 29)
ALY2   1.00             1                                            1
BMS1   1.00             0                                            0
CDC3   1.00             0                                            1
MYO3   1.00             1                                            1
MYO5   1.00             2                                            1
RAD53  1.00             3                                            1
RPT5   1.00             0                                            0
SGV1   1.00             1                                            1
SPB1   1.00             0                                            0
STE11  1.00             12                                           1
UTP7   1.00             0                                            1
COG4   0.99             0                                            0
PCM1   0.99             1                                            0
CDC10  0.86             2                                            0
RSC6   0.28             3                                            0
HTB2   0.03             0                                            1
RSC8   0.03             1                                            0
SPT16  0.03             0                                            3
UTP5   0.03             0                                            0

Figure 3–1: Receiver-Operating Characteristic (ROC) curves of Ste20p substrate predictors. The ROC curves were estimated with a modified version of leave-one-out cross-validation (see Materials and Methods in chapter 2). All predictors are naive Bayes classifiers that integrate the motifs identified in this study and/or position-specific scoring matrices (PSSMs) that specify the amino acid preferences at the phosphorylation sites of specific Ste20p-related kinases. In the key, “all PAK PSSMs” refers to the Pak1, Pak2 and Pak4 PSSMs. The estimated true and false positive rates of the predictor that only integrates our motifs, used with a threshold of 0.9, are indicated by the dotted red line.

Figure 3–2: The statistical significance of clusters based on the Genetic Interaction Neighborhood (GIN) analysis shown in Figure 2–4. Each branch point is labeled with an Approximately Unbiased (AU) score (see Materials and Methods in chapter 2) such that a score ≥ 95 corresponds to a P value ≤ 0.05 indicating the significance of the cluster. (A) Dendrogram of STE20 genetic interactors clustered by the overlap of their respective GINs with the set of predicted substrates. The box highlights a cluster containing genes associated with cell-cycle progression and polarized growth. (B) Dendrogram of predicted substrates clustered by their overlap with STE20-linked GINs.

Figure 3–3: The statistical significance of clusters based on the Physical Interaction Neighborhood (PIN) analysis shown in Figure 2–5. Each branch point is labeled with an Approximately Unbiased (AU) score (see Materials and Methods in chapter 2) such that a score ≥ 95 corresponds to a P value ≤ 0.05 indicating the significance of the cluster. (A) Dendrogram of Ste20p physical interactors clustered by the overlap of their respective PINs with the set of predicted substrates. (B) Dendrogram of predicted substrates clustered by their overlap with Ste20p-linked PINs. The box highlights a cluster of proteins involved with polarity.

Figure 3–4: STE20 genetic neighborhood analysis suggests that several predicted substrates in the negative set may represent false negatives of the biochemical screen. There are 34 negative set proteins that are predicted as substrates and some are present in the neighborhoods of STE20 genetic interactors (i.e. a table cell is red if the predicted substrate of the column is present in the genetic neighborhood of the gene of the row, white otherwise). In general, the negative proteins predicted as substrates are present in more neighborhoods compared to all proteins in the negative set (P = 3.17 × 10^-5, Mann-Whitney test). See Figure 2–3A for an illustration of an interaction neighborhood.

CHAPTER 4
Searching for Signaling Balance through the Identification of Genetic Interactors of the Rab Guanine-nucleotide Dissociation Inhibitor gdi-1

Anna Y Lee*1,2, Richard Perreault*3, Sharon Harel3, Elodie L Boulier3, Matthew Suderman1, Michael Hallett1,2,4 and Sarah Jenna3,5,6

*These authors contributed equally to this work.

1 McGill Centre for Bioinformatics, McGill University, Montréal, Québec, Canada
2 School of Computer Science, McGill University, Montréal, Québec, Canada
3 Department of Chemistry, Université du Québec à Montréal, Montréal, Québec, Canada
4 Rosalind and Morris Goodman Cancer Centre, McGill University, Montréal, Québec, Canada
5 Pharmaqam, Université du Québec à Montréal, Montréal, Québec, Canada
6 Biomed, Université du Québec à Montréal, Montréal, Québec, Canada

Originally published in PLoS ONE 5: e10624 (2010) under the terms of the Creative Commons Attribution-Generic 2.5 License.

4.1 Preface

Kinase-substrate interactions and other interactions capture how one protein/gene can influence another protein/gene. For example, a kinase might enhance or repress the activity of one of its substrates through the act of phosphorylation. Furthermore, the kinase might enhance or repress the activity of a protein that is downstream of the substrate, without ever coming into physical contact with it. These potentially non-physical relationships between genes that ultimately regulate some phenotype are captured by genetic interactions. These interactions are important for drug development because they may reveal relationships between genes that regulate disease phenotypes, and may therefore suggest approaches to alleviating these phenotypes. To date, genetic interactions have been most comprehensively mapped in yeast. Although there were experimental and computational efforts to map genetic interactions in the animal model C. elegans prior to the start of this project, these efforts were not comprehensive. Therefore, we developed a method to predict genetic interactions in C. elegans comprehensively, and demonstrated the validity of our method through the interactions of the orthologue of a gene associated with mental retardation, gdi-1.

4.1.1 Contributions of Authors

Anna Y Lee developed the predictor, analysed the resulting predictor and predictions, and contributed to the writing of the manuscript. Richard Perreault performed genetics experiments to validate the predictions and contributed to the writing of the manuscript. Sharon Harel and Elodie L Boulier performed experiments to validate the predictions. Matthew Suderman contributed to the development of the predictor. Michael Hallett contributed to the design of the predictor and made suggestions for the analysis of the resulting predictor and predictions. Sarah Jenna contributed to the development of the predictor, made suggestions for the analysis of the resulting predictor and predictions, designed and performed genetics experiments, and contributed to the writing of the manuscript. All authors contributed to the revising of the manuscript.

4.2 Abstract

Background
The symptoms of numerous diseases result from genetic mutations that disrupt the homeostasis maintained by the appropriate integration of signaling gene activities. The relationships between signaling genes suggest avenues through which homeostasis can be restored and disease symptoms subsequently reduced. Specifically, disease symptoms caused by loss-of-function mutations in a particular gene may be reduced by concomitant perturbations in genes with antagonistic activities.

Methodology/Principal Findings
Here we use network-neighborhood analyses to predict genetic interactions in Caenorhabditis elegans towards mapping antagonisms and synergisms between genes in an animal model. Most of the predicted interactions are novel, and the experimental validation establishes that our approach provides a gain in accuracy compared to previous efforts. In particular, we identified genetic interactors of gdi-1, the orthologue of GDI1, a gene associated with mental retardation in humans. Interestingly, some gdi-1 interactors have human orthologues with known neurological functions, and upon validation of the interactions in mammalian systems, these orthologues would be potential therapeutic targets for GDI1-associated neurological disorders. We also observed the conservation of a gdi-1 interaction between different cellular systems in C. elegans, suggesting the involvement of GDI1 in human muscle degeneration.

Conclusions/Significance
We developed a novel predictor of genetic interactions that may have the ability to significantly streamline the identification of therapeutic targets for monogenic disorders involving genes conserved between human and C. elegans.

4.3 Introduction

Many biological mechanisms depend on a state of signaling homeostasis maintained by the appropriate integration of the synergistic and antagonistic activities of signaling genes [231]. Accordingly, the symptoms of numerous diseases result from genetic mutations that disrupt this homeostasis [192, 152, 245, 66]. The relationships between signaling genes suggest avenues through which homeostasis can be restored and disease symptoms subsequently reduced. Specifically, disruptions caused by loss-of-function mutations in a particular gene may be compensated by concomitant perturbations in genes with antagonistic activities. Antagonisms and synergisms between genes can be identified via genetic interactions. A genetic interaction between two genes exists when the phenotypic effect of a perturbation (e.g. mutation, RNAi treatment, drug targeting) in one gene is dependent upon a perturbation in the other gene. Thus, disease symptoms caused by mutations in a given gene may be compensated by perturbing genetic interactors of the gene. That is, the genetic interactors are potential therapeutic targets. Therefore, the identification of genetic interactions is an important step towards the development of treatments for monogenic disorders.

The nematode Caenorhabditis elegans is an ideal animal model for identifying genetic interactions due to its genetic tractability. Furthermore, the high degree of conservation of molecular pathways related to human diseases has facilitated the dissection of physiopathological mechanisms of genetic disorders including Duchenne Muscular Dystrophy (DMD; OMIM: 310200), lysosomal storage disorders, obesity, diabetes and Huntington's disease [144, 70, 16]. Although the extent to which genetic interactions are conserved between C. elegans and human is unknown, previous studies encourage the use of C. elegans towards the identification of therapeutic targets for human diseases. For example, a genome-wide RNAi suppressor screen in a C. elegans model of type 2 diabetes, i.e. a strain with a loss-of-function mutation in the C. elegans insulin-like growth factor receptor daf-2, led to the identification of a kinase that exhibits antagonistic activity towards daf-2.
Interestingly, mice with the kinase knocked out appeared to be protected against diabetes, suggesting that the antagonistic interaction identified in C. elegans led to the identification of a potential therapeutic target for a human disease [16]. The application of systematic screens for other diseases hinges on the development of high-throughput techniques enabling the quantification of relevant phenotypes. However, the development of such quantitative techniques is in general time-consuming and may be extremely challenging. An alternative approach involves the in silico prediction of genetic interactions [157, 280, 55]. Interestingly, the rate at which genetic interactions are identified with prediction-driven screens appears to be significantly greater than the rate for systematic experimental screens (Figure 4–1). This suggests that in silico prediction represents an efficient approach to identifying genetic interactions.

All existing in silico approaches for predicting genetic interactions use several types of data including gene expression measurements and protein-protein (PP) interactions. Lee and colleagues developed a method for predicting whether two given genes have a shared function [157]. The method is based on the weighted integration of gene pair data and was trained with pairs of genes that share functional annotations as positive learning examples. The predictions can be used in turn to infer genetic interactions, since pairs of genes that share function tend to exhibit synergistic interactions. Moreover, known antagonists of a given gene can be used as so-called seeds to search for other antagonists of the gene; specifically, genes predicted to share function with the seeds are inferred to be antagonists as well. However, many genes that share a function do not synergistically interact with each other nor do they antagonize the same gene(s), and therefore the accuracy of this approach for predicting genetic interactions may be limited (see the validation success rate in Figure 4–1). Zhong and Sternberg developed a method to directly predict genetic interactions [280]. This method is also based on the weighted integration of gene pair data, but was trained with known genetic and PP interactions as positive learning examples.
However, the method predicts a set of genetic interactions that involves only a small portion of all C. elegans genes (~8% of the genome, see Figure 4–1). This may be due to the amount of data specific to a given gene pair that is required to make a prediction, since such data is scarce for many gene pairs. Chipman and Singh developed an approach for predicting synergistic interactions only [55]. This approach uses information gained from the contexts of genes in a biological network that integrates several types of data (e.g. an edge exists between two genes if they encode proteins that exhibit a PP interaction), specifically by using the proximity between genes in the network. While this approach appears extremely powerful based on the in silico validation results, it remains to be determined how well this approach performs according to experimental validation.

Since all experimentally validated approaches for predicting genetic interactions currently suffer from limited accuracy or predict genetic interaction sets with limited genome coverage, we developed a novel in silico approach that uses statistical analyses of gene/protein neighborhoods in biological networks (Figure 4–2). Unlike previous approaches, the prediction of a genetic interaction between two given genes is aided by analyses that detect common features of the neighborhoods of the genes, or their encoded proteins (e.g. common PP interactors of the proteins). Furthermore, our approach does not require ‘seeds’ for every gene of interest to predict novel antagonistic interactions, unlike the Lee et al. approach [157], and while our approach appears comparable to the Zhong and Sternberg approach [280] in terms of specificity, our set of predicted genetic interactions has greater genome coverage (Figure 4–1).

The overall aim of this study was to identify genetic interactions in C. elegans that warrant further study in mammals towards the identification of promising therapeutic targets for genetic diseases.
We thus used our approach to identify genetic interactors of the Rab-specific guanine-nucleotide dissociation inhibitor gdi-1 (WormBase: WBGene00001558) which shares 80% protein sequence identity with GDI1 (Ensembl: ENSG00000203879; Blast E-value: 2.10 × 10^-158), a gene associated with non-syndromic forms of mental retardation in humans (OMIM: 300104) [66]. GDI1 encodes GDIα, a major regulator of Rab GTPase activity during endocytosis and exocytosis [66, 68]. This protein is thus a critical regulator of cell signaling events. The validation of predicted genetic interactions of gdi-1 identified several antagonists. If these genetic interactions are conserved in the relevant human disease system, they would suggest therapeutic targets for GDI1-associated cognitive disorders. In addition, our results suggest the conservation of a subset of genetic interactions across different cellular systems in C. elegans, and the involvement of GDI1 in human myopathies resulting from mutations in components of the Dystrophin Glycoprotein Complex (DGC).

4.4 Results

4.4.1 The predictor of genetic interactions in C. elegans

We developed a predictor of genetic interactions using a learning set that contains positive and negative examples of interactions from the literature (see Table 5-S1 for the manually-curated interactions) and gene pairs randomly selected from the C. elegans genome, respectively (see Methods). Since it is estimated that the vast majority of gene pairs do not genetically interact [251], a set of randomly selected gene pairs is expected to be enriched with true negative examples.

Our predictor uses gene expression measurements, RNAi knockdown phenotype observations and PP interactions from multiple species to measure the likelihood of a genetic interaction. The gene expression measurements were obtained from DNA microarray results [145], and the phenotype observations were obtained from genome-wide RNAi experiment results (see Methods). A multi-species PP interaction network was constructed with C. elegans, Drosophila melanogaster, Homo sapiens and Saccharomyces cerevisiae PP interactions identified by yeast two-hybrid (obtained from BioGRID and [92, 243]). PP interactions from species other than C. elegans were incorporated using InParanoid orthology maps [198]. For any two given genes, we considered a measure of their coexpression (Exp), a measure of their phenotype similarity (Ph) and an indicator of a PP interaction between their encoded proteins or orthologues (I) as gene pair attributes that might help determine whether the given genes genetically interact. Variants of the Exp, Ph and I attributes have been used by existing predictors of genetic interactions [280, 157].

Importantly, our approach is the first to use particular features of biological networks in order to improve the accuracy of the prediction of genetic interactions. For example, considering the multi-species PP interaction network, we define the neighborhood of a protein as the set of proteins that exhibit a PP interaction with it (possibly via orthology). Although two given proteins may not be known to exhibit a PP interaction, their neighborhoods may contain a surprising number of common PP interactors (Figure 4–2B). We defined a gene pair attribute based on the encoded proteins of the two genes of interest, measuring the significance of their number of common PP interactors (CI). The set of gene pairs that encode proteins with significantly many common PP interactors (CI ≤ 0.05) is enriched with gene pairs that are known to genetically interact (P = 6.67 × 10^-39, hypergeometric test).

We also investigated whether a biological network that integrates observations of phenotype similarity, coexpression and PP interaction can improve the prediction of genetic interactions. We defined a novel biological network called the PhEP network, where nodes represent genes and the genes are labeled with their RNAi knockdown phenotypes. Two genes are connected by an edge if they are significantly coexpressed and/or if they encode or have orthologous proteins that exhibit a PP interaction (see Methods). Although a phenotype observation may be absent for the gene itself, such observations may be available for several of its neighbors (i.e.
the genes connected to the gene of interest by one edge) in the PhEP network. We therefore defined a gene pair attribute indicating the enrichment of genes associated with some phenotype in the neighborhoods of both genes of interest, in the PhEP network (N, see Figure 4–2C). We demonstrated that the set of gene pairs with such neighborhood characteristics is enriched with gene pairs that are known to genetically interact (P = 1.20 × 10^-245, hypergeometric test). By the same line of reasoning, we defined a variant of this gene pair attribute, NPh, indicating that the two genes of interest are annotated with the phenotype that is enriched in both of their neighborhoods in the PhEP network (see Figure 4–2D). Again, the set of gene pairs with such neighborhood characteristics is enriched with gene pairs that genetically interact (P = 6.26 × 10^-113, hypergeometric test). Taken together, we showed that our network-based attributes (CI, N and NPh) of gene pairs are significantly associated with genetic interactions, suggesting that these attributes may facilitate the accurate prediction of genetic interactions.

Ultimately, the Exp, Ph, I, CI, N, and NPh attribute values of a given gene pair are integrated by a logistic regression model that outputs a prediction score between 0 and 1 representing the likelihood of a genetic interaction between the two genes (see Methods). We performed leave-one-out cross-validation to evaluate the predictor at different score thresholds (Figure 5–1). We determined that a conservative threshold of 0.975 induces error rates comparable to those achieved by the Zhong and Sternberg (ZS) genetic interaction predictor (Table 5–1). However, this threshold also induces a set of predicted genetic interactions that is 98% novel when compared to the prediction sets of previous studies [157, 280], and roughly three-fold more genes are present in our set compared to the ZS set. Thus, under conditions where our predictor and the ZS predictor have comparable accuracy estimates, our set of predicted interactions exhibits greater genome coverage.
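The integration step can be illustrated with a minimal sketch of a logistic model that maps the six gene pair attributes to a score between 0 and 1. The intercept, the weights, the encoding of each attribute as a value between 0 and 1, and the function names are all hypothetical; the fitted coefficients are not given in this chapter.

```python
from math import exp

# Hypothetical intercept and weights (assumptions of this sketch only).
INTERCEPT = -6.0
WEIGHTS = {"Exp": 1.5, "Ph": 1.0, "I": 2.0, "CI": 2.5, "N": 2.0, "NPh": 1.5}

def prediction_score(attributes):
    """Logistic function of a weighted sum of gene pair attributes.
    Attributes missing from the dict count as 0 (an assumption)."""
    z = INTERCEPT + sum(w * attributes.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + exp(-z))

def predicted_interaction(attributes, threshold):
    """Call an interaction when the score reaches the chosen threshold."""
    return prediction_score(attributes) >= threshold

# A pair supported by every attribute versus a pair supported by coexpression only.
strong_pair = {name: 1.0 for name in WEIGHTS}
weak_pair = {"Exp": 1.0}
print(prediction_score(strong_pair), predicted_interaction(strong_pair, 0.975))
print(prediction_score(weak_pair), predicted_interaction(weak_pair, 0.975))
```

Raising the threshold trades coverage for specificity, which is the trade-off explored with the cross-validated error rates.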
We chose 0.85 as our definitive threshold since it yields an estimated false positive rate (Table 5–1) close to the expected rate of finding a genetic interaction at random (0.5%) [251], coinciding with our negative learning set of random gene pairs. At this threshold, the estimated true positive rate is 10.8% (Table 5–1). Although our predictor misses many true positive interactions, over 800K genetic interactions are predicted and again, 98% of them are novel (Figure 5–2A). In particular, our predictor proposes more interactions per gene on average compared to the ZS predictor (Figure 5–2B). In addition, roughly four-fold more genes are present in our set of predicted interactions compared to the ZS set (Figure 5–2B). Thus, when the predictor has an estimated false positive rate that is appropriately low, the corresponding set of predicted interactions also exhibits a large increase in genome coverage compared to the ZS set. Genome-wide genetic interactions predicted by our method are available online (http://www.mcb.mcgill.ca/~anna/gInterWorm/search.php).

The biological relevance of the predicted genetic interaction network was assessed in silico using pathway annotations (Table 5-S3 and [200]). Previous studies show that genetic interactions occur within and between pathways, although between-pathway interactions are more prevalent amongst interactions identified in large-scale studies [163, 48, 61]. Therefore, we investigated the connectivity of pairs of genes annotated to the same pathway, in the predicted network (see Methods). We found that a significant fraction of these pathway gene pairs are directly connected (P = 10^-5, Figure 4–3A), indicating predicted interactions within pathways. We also found that a significant fraction of pathway gene pairs are connected through shared neighbors (P = 10^-5, Figure 4–3B), and in most cases, at least one of the shared neighbors is not in the same pathway as the pair (98% and 99% of the cases for all and just signaling pathways, respectively). These cases indicate predicted interactions that likely occur between pathways, or within a pathway if the shared neighbor is an unknown member of the pathway of the pair.
Interestingly, we predict significantly many genetic interactions within and between pathways mapped from human to C. elegans, as we do for pathways derived directly from C. elegans (compare “all pathways” to “signaling pathways” in Figure 4–3). Taken together, the connectivity of pathway genes in the predicted network is consistent with connectivity observations based on genetic interactions identified experimentally [163, 48, 61], even for pathway genes mapped from human, and thus supports the validity of our predictor.
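The two connectivity measures used in the pathway analysis can be sketched as follows: the fraction of same-pathway gene pairs that are directly connected in a network, and the fraction connected through at least one shared neighbor. The function names, gene names and toy network are hypothetical. The significance assessment (presumably against randomized data, as described in the Methods) is not reproduced here.

```python
from itertools import combinations

def neighbors(edges):
    """Adjacency sets for an undirected network given as (gene, gene) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def pathway_connectivity(edges, pathways):
    """Return (fraction directly connected, fraction with a shared neighbor)
    over all distinct pairs of genes annotated to the same pathway."""
    adj = neighbors(edges)
    pairs = set()
    for genes in pathways.values():
        pairs.update(frozenset(p) for p in combinations(sorted(genes), 2))
    direct = shared = 0
    for pair in pairs:
        a, b = tuple(pair)
        if b in adj.get(a, set()):
            direct += 1
        if adj.get(a, set()) & adj.get(b, set()):
            shared += 1
    n = len(pairs)
    return (direct / n, shared / n) if n else (0.0, 0.0)

# Toy predicted network: g1, g2, g3 share a pathway; x lies outside it.
toy_edges = [("g1", "g2"), ("g1", "x"), ("g3", "x")]
toy_pathways = {"toy pathway": {"g1", "g2", "g3"}}
print(pathway_connectivity(toy_edges, toy_pathways))
```

In the toy network, g1 and g3 are linked only through x, a gene outside their pathway, which corresponds to the case the text interprets as a likely between-pathway interaction.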

4.4.2 The set of predicted genetic interactions exhibits improved coverage of genes conserved between human and C. elegans

We investigated whether more genes conserved between human and C. elegans are present in our set of predicted genetic interactions when compared to other prediction sets. If our prediction set is restricted to genes with human orthologues (see Supplementary Text), it is still true that a large fraction of the set is novel (Figure 5–2A). We thus examined the level of characterization of human genes with C. elegans orthologues present in prediction sets. Our analysis shows that in silico methods tend to predict genetic interactions involving well-characterized genes more often than poorly-characterized genes (Figure 5–2C). All human genes with C. elegans orthologues only present in the ZS prediction set have a high level of characterization (gene characterization index > 5 [143]). Interestingly, 25% of human genes with C. elegans orthologues only present in our prediction set do not have a high level of characterization. Taken together, our approach predicts a large number of novel genetic interactions for genes conserved between C. elegans and human, and also predicts interactions for genes orthologous to poorly-characterized human genes that have no predicted interactions by other approaches.

In order to better understand why our method predicts genetic interactions that are mostly novel, we investigated the genes with human orthologues associated with mental retardation and synaptic plasticity (MRSP) that we curated from the literature (Table 5-S4). Over two-fold more MRSP genes are present in our set of predicted genetic interactions compared to the ZS set (89% and 40% of the genes, respectively). In examining the MRSP genes that are present in our set only, we found that these genes are generally associated with more information with our approach than with the ZS approach (see Figure 5–3A and the Methods). In particular, the additional information comes from our novel network-based attributes (e.g. the CI and N attributes). The values of these attributes are computable for nearly all MRSP genes, but the values of most ZS attributes are computable only for a smaller subset of the genes (Figure 5–3B). These results suggest that the network-based attributes facilitate the prediction of novel interactions.

A large number of human genes associated with disease are conserved in C. elegans [238]. For example, GDI1, a human gene associated with mental retardation [66], has high sequence similarity (Blast E-value: 2.10 × 10^-158) to gdi-1, a C. elegans gene that has yet to be functionally characterized. Since GDI1 is involved in neurotransmission and has been associated with cognitive deficiency in human [66, 68], it is functionally related to our set of human MRSP genes (Table 5-S4). We thus investigated whether the relationship between GDI1 and MRSP genes is conserved between human and C. elegans. Interestingly, our method predicts that gdi-1 genetically interacts more frequently with MRSP genes than with other genes (P = 1.1 × 10^-5, two proportion test), and it also shares genetic interaction partners more frequently with MRSP genes than with other genes (P = 1.1 × 10^-50, two proportion test). These results provide statistically significant evidence that the interactions between GDI1 and its potential neurological partners are conserved in C. elegans.
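A two proportion comparison of this kind can be sketched as a pooled two-sample z-test. The exact variant used in the study (one- or two-sided, with or without continuity correction) is not stated here, so the two-sided pooled form below is an assumption, as are the function name and the counts in the example.

```python
from math import erfc, sqrt

def two_proportion_z_test(hits1, n1, hits2, n2):
    """Two-sided z-test for equality of two proportions, pooled variance.
    Undefined (division by zero) when the pooled proportion is 0 or 1."""
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided standard normal tail
    return z, p_value

# Hypothetical counts: gdi-1 is predicted to interact with 30 of 100 MRSP
# genes and with 1500 of 19000 other genes.
z, p_value = two_proportion_z_test(30, 100, 1500, 19000)
print(f"z = {z:.2f}, P = {p_value:.3g}")
```

For a one-sided alternative ("more frequently"), the two-sided P value would be halved when z is positive.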

4.4.3 Validation of predicted genetic interactors of gdi-1

We identified phenotypes that result from treating C. elegans animals with gdi-1(RNAi). These phenotypes include sterility (Ste, Figure 4–4C), a gonad morphogenesis defect characterized by a shortening of gonads (Gon, Figure 4–4A,C), an ovulation defect characterized by an accumulation of endomitotic oocytes (Emo, Figure 4–4B,C), and a severe reduction of sheath cell contraction (Figure 4–4D). We showed that gdi-1 controls ovulation and gonad morphogenesis processes by modulating somatic gonad cell functions. That is, rrf-1(pk1417) (WormBase: WBGene00004508) animals, which are resistant to RNAi in somatic cells, expressed significantly reduced levels of the phenotypes when subjected to gdi-1(RNAi) compared to wild-type and mutant animals resistant to RNAi in germinal cells (see gdi-1(RNAi) and ppw-1(pk2505) (WormBase: WBGene00004508); gdi-1(RNAi) respectively in Figure 4–4C). These results suggest that gdi-1 is a critical regulator of signaling pathways controlling reproductive functions in C. elegans.

To experimentally validate our predictions, we examined 18 strains containing mutations in 12 genes predicted to genetically interact with gdi-1. Ste, Emo and Gon phenotypes were measured for mutant and wild-type animals submitted to RNAi against gdi-1 or the negative control, egfp (Figure 4–5A). Epistasis analyses of these measurements were performed using three commonly used statistical models [171, 60] to identify significant genetic interactions (see Tables 5–2 and 5–3 for the estimated epistasis coefficients and P values, respectively). We also applied a statistical test that measures the suppression of gdi-1(RNAi)-induced phenotypes (see Methods). Our results show only partial agreement between the different models of epistasis (Figure 4–6A). The most stringent requirement (i.e. significant interaction by all applicable tests, P ≤ 0.05) resulted in a validation success rate of 42%, while more permissive analysis (i.e. significant interaction by at least one test) increased the success rate to 67%. This represents an 84- or 134-fold improvement over the expected success rate from random genetic screening. Although validation success rates depend on the selected bait gene(s) (e.g.
gdi-1 in our study), our success rates surpass those reported for existing methods [157, 280] (Figure 4–1A), thus suggesting that our method represents an important improvement in predictive accuracy.
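The fold improvements quoted above follow from dividing each validation success rate by the expected rate of finding a genetic interaction at random (0.5%, as cited earlier in the chapter). The short calculation below reproduces the arithmetic. The count of five interactions validated by all tests is stated in the text; the count of eight under the permissive criterion is inferred from 67% of 12 and is an assumption, as are the function names.

```python
RANDOM_RATE_PERCENT = 0.5  # expected success rate of random screening, in percent
TESTED = 12                # predicted gdi-1 interactors tested

def success_rate_percent(validated, tested=TESTED):
    """Validation success rate, rounded to a whole percent."""
    return round(100 * validated / tested)

def fold_over_random(rate_percent, baseline=RANDOM_RATE_PERCENT):
    """How many times higher the success rate is than random screening."""
    return rate_percent / baseline

stringent = success_rate_percent(5)   # significant by all applicable tests
permissive = success_rate_percent(8)  # significant by at least one test (inferred count)
print(stringent, fold_over_random(stringent))    # 42 84.0
print(permissive, fold_over_random(permissive))  # 67 134.0
```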

Of the 12 putative genetic interactions tested, five were successfully validated according to all statistical tests used to analyse our results. All five interactions are antagonistic, and the genes that antagonize gdi-1 are the following: unc-96 (WormBase: WBGene00006825), encoding a paramyosin-binding protein [176]; unc-89 (WormBase: WBGene00006820), encoding a titin-like myosin light chain (MLC)-specific kinase [26]; tra-4 (WormBase: WBGene00018740), a close orthologue of the human proto-oncoprotein and transcriptional repressor PLZF (Ensembl: ENSG00000109906) [108]; aspm-1 (WormBase: WBGene00008107), the closest orthologue of the mammalian ASPM (Ensembl: ENSG00000066279), a gene associated with mitotic spindle assembly and microcephaly [256, 37]; and dyb-1 (WormBase: WBGene00001115), the closest orthologue of dystrobrevin (Ensembl: ENSG00000134769), a component of the DGC in human [100].

Notably, we showed genetic interactions between gdi-1 and regulators of the actin-myosin contractile apparatus. Indeed, gdi-1-associated phenotypes were reduced by a mutation in the MLC-specific kinase (MLCK) unc-89 while gdi-1(RNAi) phenocopies a mutation in the MLC-specific phosphatase mel-11 (WormBase: WBGene00003196, [269]; Figure 4–5A). This suggests that gdi-1 antagonizes MLC phosphorylation and consequently, contraction through the actin-myosin apparatus during gonad morphogenesis. Consistent with these results, gdi-1-associated phenotypes were reduced by a chemical inhibitor of MLCK (ML-7) and a chemical inhibitor of myosin II ATPase activity (blebbistatin) (Figures 4–5B and 4–6).

We also identified a genetic interaction between gdi-1 and dyb-1 that affects gonad morphogenesis (Figures 4–5A and 4–6). The latter gene is a close orthologue of dystrobrevin, a component of the DGC that when altered leads to myopathies [10].
Moreover, dystrobrevin is a functional partner of dystrophin (Ensembl: ENSG00000198947), a protein that is associated with DMD and mild cognitive deficiencies in human [10]. C. elegans is a model organism used to dissect the molecular mechanism of myopathy associated with mutations in DGC components, dyb-1 and dys-1 (WormBase: WBGene00001131, the orthologue of dystrophin) [106]. As shown previously, mutations in dyb-1 and dys-1 produce a progressive myopathy when combined with a weak allele of hlh-1 (WormBase: WBGene00001948; compare panels A and B of Figure 4–7) [34, 101]. We showed that gdi-1(RNAi) treatment significantly reduces muscle degeneration in dyb-1(cx36);hlh-1(cc561) and dys-1(cx18);hlh-1(cc561) mutants (Figure 4–7C). Therefore, we demonstrated that the antagonism between gdi-1 and dyb-1 is conserved in different cellular systems in C. elegans.

Taken together, our experimental results identified genes that antagonize gdi-1 activity during gonad morphogenesis, ovulation (Figure 4–6B), and muscle degeneration.

4.5 Discussion

We present a prediction-based approach to identifying genetic interactions in C. elegans. The approach predicts many novel interactions, including interactions for poorly-characterized genes. Our validation results for gdi-1 suggest that our predictions identify true interactions with a success rate far beyond random genetic screening (i.e. at least 84-fold greater than the rate of identifying true interactions by chance), and that our approach has improved accuracy compared to previous approaches. Moreover, we identified five genes that antagonize gdi-1 activity during gonad morphogenesis and/or ovulation, including genes associated with phosphorylated MLCs and dyb-1. Interestingly, we also showed that the antagonism between gdi-1 and dyb-1 influences muscle cell morphology.

Our predictor integrates novel attributes based on network analysis. We showed that each network-based attribute identifies gene pairs that are enriched for true genetic interactions. The common interactors (CI) attribute is based on the common PP interactors of the proteins encoded by the two genes of interest, in a multi-species PP interaction network. Two given proteins that have surprisingly many common PP interactors may be members of the same complex, thereby increasing the likelihood that their encoding genes genetically interact, since members of the same complex tend to genetically interact [251, 141]. Moreover, the N and NPh attributes are based on shared phenotypes in a so-called PhEP network constructed with RNAi knockdown phenotype, gene expression and PP interaction data. When a specific phenotype is associated with surprisingly many neighbors of a given gene in the PhEP network, it may follow that the gene modulates this phenotype. Thus, if the neighborhoods of two given genes are characterized by the same phenotype(s), the genes may modulate the same phenotype(s), thereby increasing the likelihood that they genetically interact. Furthermore, the network-based attributes provide additional information for less-studied genes, such as genes that may not have been assayed individually (e.g. for phenotype observations) or with other genes systematically (e.g. for PP interactions). For example, no phenotypes have been observed for unc-89 and it has not been tested for a PP interaction with gdi-1. However, the CI and N attributes support a genetic interaction between unc-89 and gdi-1, which we confirmed experimentally. This suggests that the network-based attributes facilitate the accurate prediction of genetic interactions.

Our analyses suggest that in silico approaches tend to predict genetic interactions involving well-characterized genes more often than poorly-characterized genes. The Zhong and Sternberg (ZS) approach explicitly restricts the predictions to genes that satisfy a minimum information requirement (i.e. a gene must be associated with information from at least one attribute that is not the C. elegans gene expression attribute) [280]. Only ~50% of all genes satisfy the requirement. As a result, only ~25% of all gene pairs are tested in silico.
In our approach, we do not impose a minimum information requirement. Moreover, we gained information for 80% of all gene pairs e by integrating our network-based attributes. These features of our approach may be

100 responsible for the large number of novel predicted genetic interactions. All of the experimentally validated interactions are antagonistic. This suggests that our learning set contains a strong signal for antagonistic interactions and that our ap- proach captures this signal. If this is the case, our approach may be advantageous for predicting antagonistic interactions. Consequently, our approach may also be advan- tageous for proposing antagonisms that warrant further study in mammals towards the identification of therapeutic targets for monogenic disorders. Because of its involvement in vesicular trafficking in mammals, GDI1 may be a crit- ical regulator of several signaling pathways controlling functions such as synaptic plasticity, learning and memory acquisition [66, 125, 32]. Interestingly, the signaling pathways involving ephrins, integrins and inositol-triphosphate that control gonad morphogenesis and ovulation in C. elegans are highly similar to the pathways con- trolling synaptic plasticity in human [57, 19, 179, 274, 237, 13]. Supporting this observation, the anti-epileptic drug valproate, which targets components of these sig- naling pathways in human, has been shown to cause severe alteration of sheath cell contraction and ovulation processes in C. elegans [249]. Moreover, our data suggest that gdi-1, like valproate [249], controls ovulation processes by modulating somatic go- nad cell functions. As documented by the Gilbert and Bolker study [103], a conserved signaling pathway can control different cellular processes in different organisms; for example, ovulation in C. elegans versus synaptic plasticity in human. However, a signaling pathway that is conserved across different cellular systems and species may have also acquired some context-specific signaling components. We therefore do not expect all signaling pathway observations in one context to apply to another context. 
However, a number of genes identified as genetic interactors of gdi-1, using Ste, Gon and Emo as phenotypical readouts in nematodes, have high sequence similarity to genes with neurological functions in humans. These observations support the search for genetic interactors of gdi-1 with a role in controlling gonad morphogenesis and ovulation in C. elegans to suggest likely genetic interactors of GDI1 controlling cognitive abilities in humans. Nevertheless, this strategy for identifying genetic interactions relevant to cognition requires extensive validation in higher organisms such as mouse.

One of the genetic interactions that we uncovered is between gdi-1 and aspm-1. In both C. elegans and mammals, aspm-1 controls mitotic spindle positioning and consequently, the ratio of symmetric to asymmetric cell divisions [89, 256]. While control of asymmetric division of somatic gonadal precursor cells (SGPs) is required for the proper morphogenesis of gonads in C. elegans [181], it is still unknown whether the modulation of asymmetric division in these cells is at the origin of the interaction between gdi-1 and aspm-1. aspm-1 is an orthologue of ASPM, a gene involved in brain development in humans. In particular, ASPM is involved in the control of neuronal progenitor proliferation and is associated with microcephaly [89]. Since GDI1 is expressed in both proliferative and differentiated neurons during brain development [66], it would be interesting to test whether GDI1 genetically interacts with ASPM in mammalian brains and consequently, to test whether the simultaneous perturbation of both genes would result in a reduction of cognitive disabilities associated with mutations in either ASPM or GDI1 alone.

The molecular origin of the genetic interaction observed between gdi-1 and tra-4 is also unknown. The transcriptional repressor tra-4 was shown to promote female development by repressing male-specific genes in C. elegans [108]. This gene was also characterized as a SynMuvB gene because it was shown to negatively regulate let-60 (WormBase: WBGene00002335)/Ras-mediated vulval development in nematodes [108]. Interestingly, several SynMuvB genes have been shown to control somatic gonad development [24, 25].
Further studies will be required to assess the function of tra-4 during somatic gonad development and its potential interaction with SynMuvB genes in the cellular context of that process.

We also showed genetic interactions between gdi-1 and regulators of the actin-myosin contractile apparatus. Indeed, gdi-1-associated phenotypes were reduced by mutations in the MLCK unc-89 and its functional partner unc-96 [176]. Interestingly, unc-96 is required for the proper distribution of unc-89 at the M-line in body-wall muscle sarcomeres [176]. Our data suggest a partnership between unc-96 and unc-89 that promotes the contraction of the actin-myosin contractile apparatus in gonad somatic cells, in a pathway antagonistic to gdi-1. This hypothesis is also supported by the significant reduction of gdi-1(RNAi)-induced phenotypes in animals treated with the MLCK and myosin II inhibitors ML-7 and blebbistatin, respectively.

Interestingly, MLC phosphorylation and myosin II function have been shown to control synaptogenesis, dendritic spine morphology and synaptic plasticity in mammals [224, 278]. Moreover, the inhibition of MLCK function in the lateral amygdala of the mouse brain has been shown to enhance auditory fear conditioning (i.e. learning and memory) and to facilitate synaptic plasticity [153]. Because mutating GDI1 in mice has the opposite effect [67], it is of interest to assess whether phosphorylated MLC and GDI1 have antagonistic functions in neurological mechanisms that enable learning and memory acquisition in mammals. If antagonism is present, inhibiting phosphorylated MLC is a potential therapeutic strategy to reduce the symptoms associated with GDI1 mutations in humans.

We also demonstrated that the activity of gdi-1 is antagonistic with the activity of the dystrobrevin orthologue, dyb-1, during gonad morphogenesis in C. elegans. Dystroglycan is another component of the DGC and its orthologue dgn-1 has been previously shown to control gonad morphogenesis in C. elegans [138]. Our data suggest that dyb-1 also contributes to this developmental process.
Interestingly, the antagonism between dyb-1 and gdi-1 is consistent with the likely antagonism in mammals where dystrobrevin acts as a regulator of cell signaling through the inhibition of receptors and membrane recycling [7], and GDI1 potentially promotes these cycling events by regulating RAB4 and RAB5 [32]. We also showed that this antagonism is conserved in different cellular systems in C. elegans since gdi-1(RNAi) treatment significantly reduced muscle degeneration in both dyb-1;hlh-1 and dys-1;hlh-1 animals. Mechanisms of muscle degeneration resulting from functional alterations of dystrobrevin or dystrophin are still poorly understood in mammals [33]. While C. elegans is an animal model of choice to dissect the pathological mechanisms associated with myopathies [102], the antagonisms observed between the GDI1, dystrobrevin and dystrophin orthologues in C. elegans should be confirmed in DMD mammalian models (e.g. the mdx mouse) before considering GDI1 as a promising therapeutic target for DMD. Furthermore, since DGC components and GDI1 are expressed at the synapses of hippocampal neurons [34, 32], it would be extremely interesting to test whether perturbations of dystrobrevin function may reduce cognitive disabilities associated with mutations in GDI1 in mammals.

In summary, we developed a bioinformatics tool to predict genetic interactions in C. elegans towards the identification of therapeutic targets to address monogenic disorders associated with disruptions in signaling homeostasis. Our tool uses network-based attributes and our validation suggests that it predicts interactions more comprehensively and with improved accuracy compared to other tools. In addition, we experimentally confirmed the interactions that were predicted between gdi-1 and several genes involved in neurological functions in humans. Notably, we established that perturbation of aspm-1, tra-4, unc-89, unc-96, or dyb-1 reduces the signaling imbalance resulting from a reduction of gdi-1 expression. We also showed that a reduction of gdi-1 expression significantly reduces muscular dystrophy in nematode DMD models.
Further studies using relevant mammalian models are required to assess whether ASPM, MLC phosphorylation machinery and dystrobrevin would be potent therapeutic targets for cognitive disabilities associated with mutations in GDI1. Similarly, further studies in mammalian models would be required to assess whether GDI1 would be a potent therapeutic target for DMD. In conclusion, we have developed a valuable tool that facilitates the mapping of genetic interactions in C. elegans. Since the conservation of pathogenic mechanisms and genetic interactions between distant species is still under intense debate, experimental validation in mammals of genetic interactions identified in C. elegans is required to evaluate the potential of our method to significantly streamline the therapy development process for monogenic disorders that involve genes and signaling pathways conserved between humans and C. elegans.

4.6 Methods

The development and subsequent analysis of the genetic interaction predictor were completed in the R v2.6 statistical computing environment (http://www.r-project.org, [246]).

4.6.1 Construction of the learning set

A learning set, consisting of a positive and a negative subset, was constructed for the training of the predictor of genetic interactions. The positive learning set consists of 1,522 genetic interactions identified by automated [189] or manual curation of the literature (see Table 5-S1). The negative learning set should consist of pairs of non-interacting genes. Since the vast majority of gene pairs are believed not to genetically interact [251], we built our negative learning set from ~14,000 randomly selected gene pairs from the set of all genes mapped to a genomic location (WormBase release WS180, http://www.wormbase.org/). The approximate 1:10 ratio of positive to negative interactions was established to guarantee a learning set with a thorough sampling of all gene pair combinations (~386 million in total).
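The random selection of negative pairs described above can be sketched as follows. This is an illustrative Python sketch, not the R code used for the thesis; the function name `sample_negative_pairs` and its parameters are hypothetical, and the exclusion of known positive pairs from the negative subset is an assumption that the text does not state explicitly.

```python
import random


def sample_negative_pairs(genes, positives, n_pairs, seed=0):
    """Draw n_pairs distinct unordered gene pairs at random.

    Assumption (not stated in the text): pairs already present in the
    positive learning set are skipped.
    """
    rng = random.Random(seed)
    genes = sorted(set(genes))
    gene_set = set(genes)
    known = {frozenset(pair) for pair in positives}
    # Count the candidate pairs that remain once known positives are excluded.
    excluded = sum(1 for pair in known if len(pair) == 2 and pair <= gene_set)
    available = len(genes) * (len(genes) - 1) // 2 - excluded
    if n_pairs > available:
        raise ValueError("not enough candidate pairs to sample from")
    negatives = set()
    while len(negatives) < n_pairs:
        pair = frozenset(rng.sample(genes, 2))  # two distinct genes
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)
```

With 1,522 positives and roughly 14,000 sampled pairs, this reproduces the approximate 1:10 ratio described above; rejection sampling is adequate here because the sample is tiny relative to the number of possible pairs.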

4.6.2 Datasets used to derive attributes

The gene expression data were obtained from [145]. We obtained all RNAi knockdown phenotype data in WormBase release WS141 and removed seven uninformative or redundant types, such as “wildtype”, “unclassified”, “not embryonic” and “complex phenotype.” Protein-protein (PP) interactions were obtained from all C. elegans, Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens yeast two-hybrid datasets stored in BioGRID v2.0.37 (http://www.thebiogrid.org/) and from two additional yeast two-hybrid datasets [92, 243] that are absent from this database. We focused on yeast two-hybrid datasets because the technique detects an interaction with minimal influence from endogenous environments, e.g. a fly cell. We assume that two proteins do not exhibit a PP interaction if both proteins were assayed and no interaction was found. To create a multi-species PP interaction network, we used the orthology mappings generated by InParanoid v1.35 [198] (non-default parameters: score cutoff 10, in-paralog confidence cutoff 0.025, sequence overlap cutoff 0.2) when run with protein sequences obtained from the InParanoid dataset of June, 2006 (http://inparanoid.sbc.su.se/cgi-bin/index.cgi). Comparisons with hand-curated orthologies for a subset of genes indicated that our parameter settings produced orthology mappings with minimal false positive results (data not shown). The names of the genes/proteins described in the datasets were updated to the names used in WormBase release WS180.

4.6.3 Derivation of attributes for use in the logistic regression

The co-expression attribute Exp(g, g') is the P value derived for the Pearson correlation of genes g and g' across all microarray hybridizations (conditions) relative to the empirically estimated probability distribution of correlation for all gene pairs (i.e. a fitted normal). Figure 5–4 establishes the need for this estimation due to the lack of fit to standard models of a correlation distribution. Correlations greater than 0.35 are statistically significant (P ≤ 0.05) according to the estimated distribution. The co-phenotype attribute Ph(g, g') measures the statistical significance of the number of shared phenotypes between the two genes via a standard Fisher’s exact test (N = the number of phenotypes observed for at least two genes).

We defined the multi-species PP interaction network such that nodes represent C. elegans proteins and an edge exists between two proteins if they, or their orthologous proteins in a species considered here, exhibit a PP interaction according to the PP interaction dataset. The binary interaction attribute I(g, g') indicates whether the proteins encoded by g and g' exhibit a PP interaction in our multi-species PP interaction network (Figure 4–2A). Similarly, the common interactors attribute, CI(g, g'), considers the statistical significance of the observed number of common PP interactors of the proteins encoded by g and g', in the multi-species PP interaction network (Figure 4–2B). Specifically, CI(g, g') is assigned a P value derived from a one-tailed Fisher’s exact test (N = the number of genes encoding proteins that are in the multi-species PP interaction network).

We defined a biological network called the PhEP network, where two genes g and g' are connected by an edge if and only if the Pearson correlation of their gene expression exceeds 0.35, their gene products exhibit a PP interaction, or their orthologues (in any species considered here) exhibit a PP interaction (Figure 4–2C,D). For a given gene, we measured how surprising it is to witness the observed number of its neighbors (i.e. genes connected to it by one edge) in the PhEP network labeled with a specific phenotype identified by RNAi in C. elegans. This was measured using a one-tailed Fisher’s exact test (N = the number of genes with some assigned phenotype).
If the derived P value is less than or equal to 0.05 for both g and g' with respect to the same phenotype (Figure 4–2C), we assign a value of 1 to a categorical variable N(g, g'), and 0 otherwise. Similarly, if g and g' exhibit a phenotype that is also enriched in both their neighborhoods in the PhEP network (Figure 4–2D), we assign a value of 1 to a categorical variable NPh(g, g'), and 0 otherwise.

Missing values for any of the derived attributes (due to missing values in the underlying datasets) were replaced with the expected value (i.e. the sample mean) of the attribute before training.
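As an illustration of the one-tailed Fisher’s exact tests used for the CI, N and NPh attributes, the following Python sketch computes the upper-tail hypergeometric probability for the number of common interactors of two proteins. It is a minimal sketch under stated assumptions: the function names and the dictionary-of-sets network representation are hypothetical, and the exact contingency table used in the thesis may differ in detail (for instance, in whether the two proteins of interest are themselves counted).

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n items are drawn from N, of which K are 'successes'.

    This equals the P value of a one-tailed Fisher's exact test for
    over-representation.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def common_interactors_pvalue(network, g1, g2):
    """Significance of the overlap between the interactor sets of g1 and g2.

    `network` maps every protein to the set of its interactors; N is taken
    to be the number of proteins in the network, as in the text.
    """
    a, b = network[g1], network[g2]
    return hypergeom_upper_tail(len(a & b), len(a), len(b), len(network))
```

A small P value indicates that the two proteins share more interactors than expected by chance given the sizes of their neighborhoods.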

4.6.4 Model specification, training and cross-validation

The logistic regression model is of the form:

\[
\ln\left(\frac{p(g, g')}{1 - p(g, g')}\right) = c_0 + c_{\mathrm{Exp}}\,\mathrm{Exp}(g, g') + c_{\mathrm{Ph}}\,\mathrm{Ph}(g, g') + c_{\mathrm{I}}\,\mathrm{I}(g, g') + c_{\mathrm{CI}}\,\mathrm{CI}(g, g') + c_{\mathrm{N}}\,\mathrm{N}(g, g') + c_{\mathrm{NPh}}\,\mathrm{NPh}(g, g') \tag{4.1}
\]

where $p(g, g')$ is the probability of a genetic interaction between genes $g$ and $g'$, $c_0$ is the learned intercept term of the model, $c_{\mathrm{Exp}}$, $c_{\mathrm{Ph}}$, $c_{\mathrm{I}}$, $c_{\mathrm{CI}}$, $c_{\mathrm{N}}$, $c_{\mathrm{NPh}}$ are the learned coefficients for the attributes, and Exp(g, g'), Ph(g, g'), I(g, g'), CI(g, g'), N(g, g') and NPh(g, g') are the attribute values for $g$ and $g'$.

To select the optimal logistic regression model in the context of our learning set and attributes, we assessed models defined by different attribute combinations and trained with different positive:negative weight ratios. Specifically, we trained models using each of the following weight ratios: 1:1, 1:2, 1:5, 1:10 and 1:100. If negative examples are weighted more heavily, prediction errors on these examples result in greater penalties, and model coefficients are fitted accordingly. Using each weight ratio, we trained the models defined by all non-empty subsets of the attributes (in total, $2^6 - 1 = 63$ models), with each of five different folds of the learning set to avoid learning set bias. In training each model with the iterative weighted least squares algorithm [191], we assume that the initial fit estimated from the weighted data is reasonably close to the optimal fit, and thus assume that the algorithm converges to the optimal fit (with the default tolerance and at most 50 iterations). For each fold, we define the optimal model as the model that yielded the lowest Akaike’s Information Criterion (AIC), a measure that considers both fit to the data and complexity of the model. Any weight ratio that did not yield the same optimal model for all five folds was eliminated from consideration. For each remaining weight ratio, we computed the mean AIC of the optimal model (across the folds). The 1:2 weight ratio yielded the lowest mean AIC and we thus selected this ratio and the corresponding optimal model to define our genetic interaction predictor. Therefore, within the scope of logistic regression models defined by our attributes and trained with our learning set and tested weight ratios, the full model that uses all six attributes was found to be optimal based on our convergence assumptions and the AIC (Table 5–4). Leave-one-out cross-validation of the full model was performed to obtain true and false positive rates for unseen data (Figure 5–1). The final predictor was trained on the full learning set using the tuned weighting and all six attributes. If a pair of genes has a prediction score ≥ 0.85, the two genes are predicted to genetically interact.

Logistic regression is a technique that does not take into account the obvious dependencies between the attributes. To test the strength of dependencies between attributes we experimented with graphical models, specifically by using a software package for learning Bayesian networks (i.e. the deal package v1.2-30) [49]. The learning set used to train the logistic model was also used to train a Bayesian network. The resulting network exhibits several dependencies between the attributes (Figure 5–5), many of which are expected since some attributes are derived from the same underlying datasets. Although predictive accuracy might be improved if these attribute dependencies were accounted for, doing so would require a more sophisticated predictive model that relies on an abundance of data to accurately quantify the dependencies.
Due to the paucity of attribute data for some genes (e.g. a gene may only have data for the Exp and N attributes), such a predictive model trained with the current datasets would not necessarily be advantageous over a simpler model (such as a logistic regression model).
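To make the model specification concrete, the following Python sketch enumerates the 63 candidate attribute subsets and evaluates Equation 4.1 for one gene pair. It is illustrative only: the function names are hypothetical, no coefficient values are taken from the thesis, and the model fitting itself (performed in R with iterative weighted least squares) is not reproduced here.

```python
import math
from itertools import combinations

ATTRIBUTES = ("Exp", "Ph", "I", "CI", "N", "NPh")


def attribute_subsets(attributes=ATTRIBUTES):
    """All non-empty subsets of the attributes (2^6 - 1 = 63 candidate models)."""
    return [subset
            for size in range(1, len(attributes) + 1)
            for subset in combinations(attributes, size)]


def interaction_probability(values, coefficients, intercept):
    """Invert the logit in Equation 4.1 to obtain p(g, g').

    `coefficients` maps attribute names to learned weights (hypothetical
    values here); `values` holds the attribute values for one gene pair.
    """
    z = intercept + sum(coefficients[name] * values[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


def predicts_interaction(probability, threshold=0.85):
    """Apply the prediction score cutoff stated in the text."""
    return probability >= threshold
```

For example, `len(attribute_subsets())` returns 63, matching the number of models trained per weight ratio and fold.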

4.6.5 Predictions from other genetic interaction predictors

The functional interactions predicted by the Lee et al. method were obtained from the WormNet v1 core set [157]. The genetic interactions predicted by the Zhong and Sternberg method were downloaded in June, 2006 [280]. The names of the genes in these prediction datasets were updated to the names used in WormBase release WS180.

4.6.6 Quantifying the information available for a gene

In quantifying the information available for a gene, we took into account the usefulness of particular types of data for the prediction of genetic interactions. Specifically, if there is sufficient data to compute the value of a predictive attribute (e.g. Exp) for any pair involving a particular gene, the usefulness of the value is quantified by the magnitude of the weight of the attribute in the predictive model (e.g. $|c_{\mathrm{Exp}}|$). The total quantity of information available for a gene is thus defined as the sum of the magnitudes of weights corresponding to attributes for which values can be computed. The quantities were scaled to be in [0,1] via division by the maximum quantity achievable. The resulting relative quantities allow for comparisons between predictors that use different attributes (see Figure 5–3).
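The definition above amounts to a weighted count of the attributes that can be computed for a gene. A minimal Python sketch follows, assuming that the maximum achievable quantity is the sum of the magnitudes of all attribute weights; the function name and the example weights are hypothetical and are not the fitted coefficients.

```python
def information_quantity(available, weights):
    """Relative quantity of information available for a gene, in [0, 1].

    `available` lists the attributes whose values can be computed for the
    gene; `weights` maps every attribute of the predictor to its coefficient.
    """
    maximum = sum(abs(w) for w in weights.values())
    present = sum(abs(weights[name]) for name in set(available) if name in weights)
    return present / maximum


# Hypothetical weights for a three-attribute predictor.
example_weights = {"Exp": -2.0, "I": 1.0, "N": 1.0}
example_score = information_quantity(["Exp"], example_weights)
```

Because each predictor is scaled by its own maximum, the resulting values are comparable across predictors that use different attribute sets.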

4.6.7 Analysis of the predicted genetic interaction network with pathway annotations

The biological validity of the predicted genetic interaction network was assessed in silico by computing the shortest path distance between genes annotated to the same pathway. We defined the predicted network such that a node exists for each C. elegans gene and an edge exists between two genes if they are predicted to genetically interact. We also defined 100K randomized networks such that each randomized network is identical to the predicted network, except that the nodes are assigned a random permutation of the gene labels. C. elegans pathway annotations derived from human were obtained from KEGG release 44 (http://www.genome.jp/kegg/) and signaling pathway annotations derived directly from C. elegans were obtained from [218] (Table 5-S3). Using the predicted network and each randomized network, the shortest path distance (i.e. the minimum number of edges to traverse in a given network to get from one gene to the other) was computed for every pairing of genes annotated to the same pathway. For each network, we subsequently computed $d_i$, the number of pathway gene pairs with shortest path distance = $i$, for $i = 1, 2$. $d_1$ represents the number of within-pathway interactions based on the given set of pathway annotations (Figure 4–3A). $d_2$ represents the number of pathway gene pairs that are not connected directly, but share ≥1 neighbor in the network, suggesting within- or between-pathway interactions (Figure 4–3B). Let $d_{i,\mathrm{pred}}$ represent $d_i$ of the predicted network. The significance of $d_{i,\mathrm{pred}}$ was estimated with a permutation P value = $(x + 1)/(N + 1)$ [182], where $x$ is the number of randomized networks with $d_i \geq d_{i,\mathrm{pred}}$, and $N$ is the total number of randomized networks.

We further examined pathway gene pairs with shortest path distance = 2 in the predicted network. Specifically, we computed the percentage of these pairs that satisfy the following criterion: the given pair has ≥1 shared neighbor that is not annotated to any of the pathways associated with either member of the pair. The pairs that satisfy this criterion likely exhibit predicted within-pathway interactions with an unknown member of the pathway of the pair, or predicted between-pathway interactions.
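The label-permutation test described above can be sketched in Python as follows. This is an illustrative sketch rather than the analysis code: the function names are hypothetical, the network is assumed to be an undirected adjacency dictionary in which every gene (including isolated genes) appears as a key, and a small default number of randomizations is used in place of the 100K used in the study.

```python
import random


def count_d1_d2(adjacency, pathway_pairs):
    """Count same-pathway gene pairs at shortest path distance 1 and 2."""
    d1 = d2 = 0
    for a, b in pathway_pairs:
        neighbors_a = adjacency.get(a, set())
        neighbors_b = adjacency.get(b, set())
        if b in neighbors_a:
            d1 += 1                      # directly connected
        elif neighbors_a & neighbors_b:
            d2 += 1                      # not connected, but share >= 1 neighbor
    return d1, d2


def relabel(adjacency, rng):
    """Same topology, with gene labels randomly permuted over the nodes."""
    nodes = sorted(adjacency)
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(nodes, shuffled))
    return {mapping[u]: {mapping[v] for v in nbrs} for u, nbrs in adjacency.items()}


def permutation_pvalues(adjacency, pathway_pairs, n_random=1000, seed=0):
    """Permutation P values (x + 1) / (N + 1) for d1 and d2."""
    rng = random.Random(seed)
    observed = count_d1_d2(adjacency, pathway_pairs)
    exceed = [0, 0]
    for _ in range(n_random):
        randomized = count_d1_d2(relabel(adjacency, rng), pathway_pairs)
        for i in range(2):
            if randomized[i] >= observed[i]:
                exceed[i] += 1
    return tuple((x + 1) / (n_random + 1) for x in exceed)
```

Only distances 1 and 2 are needed, so neighbor-set intersections suffice and a full shortest-path search is unnecessary.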

4.6.8 Nematode strains

Nematodes were grown on nematode growth media (NGM; Brenner, 1974) at 20°C. Bristol strain N2 animals were used as wild-type animals. Nematode strains containing the following alleles were retrieved from the Caenorhabditis Genetics Center (CGC), which is funded by the NIH National Center for Research Resources (NCRR): rrf-1 (pk1417), ppw-1 (pk2505), dyb-1 (cx36), unc-89 (e1460), unc-89 (st85), unc-89 (ok1116), unc-89 (ok1659), unc-96 (su151), tra-4 (ok1636), mel-11 (sb56), trp-2 (gk298) (WormBase: WBGene00006615), smo-1 (ok359) (WormBase: WBGene00004888), aspm-1 (ok1208), lin-36 (n766) (WormBase: WBGene00003021), F42G8.10 (ok1199) (WormBase: WBGene00018361), tag-163 (ok644) (WormBase: WBGene00006508), F54D5.4 (ok2046) (WormBase: WBGene00010050), dys-1 (cx18), hlh-1 (cc561) (see Table 5–5).

4.6.9 RNAi and drug treatment

Blebbistatin (100 µM) and ML-7 (50 µM) were incorporated in NGM agar before plate pouring. The drug-containing plates were used throughout RNAi treatment. The pL4440-dest-gdi-1 construct, used to submit animals to RNAi against gdi-1, was kindly provided by Dr Marc Vidal, Dana-Farber Cancer Institute. The pL4440-dest-egfp construct was generated as described previously [50]. These constructs were transformed into HT115 (DE3) strains [247] and the animals were submitted to RNAi treatment as previously described [135]. To score the sterility phenotype (Ste), synchronized L1 larvae were fed RNAi-expressing bacteria for 72 h at 18°C. Three young adults were then transferred to fresh plates seeded with RNAi-expressing bacteria and they were allowed to lay eggs for 48 h at 18°C. The progeny were counted and sterility was measured as detailed in the Epistasis statistics section. The penetrances of the endomitotic oocyte (Emo) and gonad morphogenesis defect (Gon) phenotypes were scored after DAPI staining the RNAi-treated animals fixed with methanol. Emo and Gon phenotypes were scored by fluorescence microscopy using a Leica DM5500 microscope equipped with a 63X oil-immersion objective and using regular sets of filters for excitation at an ultra-violet wavelength. An animal was considered as expressing the Emo phenotype if at least one endomitotic oocyte was present in the gonad. An animal was considered as expressing the Gon phenotype if its gonad was significantly shorter than gonads observed in N2 animals. The position of the gonad turn with respect to the anterior and posterior intestine nuclei was used to measure the relative length of a gonad. Muscle degeneration was observed in methanol-fixed nematodes upon polarized-light illumination, using a Leica DM5500 microscope equipped with a 100X oil-immersion objective. Only the centermost 20 cells of the two muscle quadrants facing the objective were observed to quantify the abnormal cells. Fluorescence microscopy pictures were captured using the Leica DFC350FX R2 camera and the Leica AF6000 software series. Polarized light microscopy pictures were captured from a Zeiss Axioimager Z1 equipped with a 63X oil-immersion objective and an Axiocam HRM camera controlled by the Axiovision software v4.5. The potential modulation of RNAi efficiency in the different backgrounds tested, and the relative contribution of balancers to identified genetic interactions, were examined to confirm the validity of our results (see Supplementary Text and Figure 5–6).

4.6.10 Measurement of sheath cell contraction

Sheath cell contraction rates were scored in anesthetized animals (0.1% tricaine and 0.01% tetramisole in M9 buffer) as previously described [173]. Basal contractions were estimated by monitoring lateral sheath displacement [178] upon DIC illumination at room temperature, using a Leica DM5500 microscope equipped with a 63X oil-immersion objective.

4.6.11 Epistasis statistics

Let $\phi_x \in [0,1]$ represent the level of a particular phenotype expressed by genetic population $x$. Conversely, let $F_x = 1 - \phi_x$ represent the “fitness” of $x$ with respect to the phenotype, e.g. a maximal value of 1 indicates that the phenotypic defects are absent in all animals of type $x$. Let $\phi_m$, $\phi_{\text{gdi-1}}$, $\phi_{m/\text{gdi-1}}$ and $\phi_{\text{wt}}$ represent the level of the phenotype expressed by animals with mutation(s) in (predicted interactor) gene m, wild-type animals submitted to gdi-1(RNAi), m-mutant animals submitted to gdi-1(RNAi) and wild-type animals, respectively. Three different models were used to quantify epistatic effects through an epistasis coefficient $\epsilon$. The models use values that have been normalized to wild-type levels, i.e. $\phi'_x = \phi_x - \phi_{\text{wt}}$ and $F'_x = F_x / F_{\text{wt}}$.

Under the minimum model [250]:

\[
\begin{aligned}
\epsilon &= (F_{m/\text{gdi-1}} - F_{\text{wt}}) - \min(F_m - F_{\text{wt}},\; F_{\text{gdi-1}} - F_{\text{wt}}) \\
&= \max(\phi'_m,\; \phi'_{\text{gdi-1}}) - \phi'_{m/\text{gdi-1}}
\end{aligned}
\tag{4.2}
\]

Under the additive model [60]:

\[
\epsilon = (\phi'_m + \phi'_{\text{gdi-1}}) - \phi'_{m/\text{gdi-1}} \tag{4.3}
\]

Under the multiplicative model [60]:

\[
\epsilon = F'_{m/\text{gdi-1}} - F'_m \cdot F'_{\text{gdi-1}} \tag{4.4}
\]

Within each model, a function of the phenotypic level expressed by the doubly-altered population (e.g. $\phi'_{m/\text{gdi-1}}$) is compared to some expectation of the level, given what is known about the populations with the single-gene perturbations. This expectation is computed by a model-specific function, $f(\phi_m, \phi_{\text{gdi-1}}, \phi_{\text{wt}}) \in [0,1]$. For example, under the additive model, $f(\phi_m, \phi_{\text{gdi-1}}, \phi_{\text{wt}}) = \phi'_m + \phi'_{\text{gdi-1}}$ if $\phi'_m + \phi'_{\text{gdi-1}} \leq 1$, otherwise $f(\phi_m, \phi_{\text{gdi-1}}, \phi_{\text{wt}}) = 1$.

If $\epsilon < 0$, there is a synergistic interaction between m and gdi-1. If $\epsilon > 0$, there is an antagonistic interaction. We also identified genes that, when mutated, specifically suppress the phenotypic effects of gdi-1(RNAi) (observed in wild-type animals). This was achieved by statistically testing if $\phi'_{\text{gdi-1}} - \phi'_{m/\text{gdi-1}} > 0$. See Supplementary Text for details regarding all statistical tests performed, including details about our normality assumption (Figure 5–7).

The Ste level expressed by a genetic population $x$ was defined as $\phi_x = 1 - B_x / B_{\text{wt}}$, where $B_x$ and $B_{\text{wt}}$ are the brood size measurements for $x$ and wild-type animals, respectively. The Gon and Emo levels were defined as $\phi_x = n_x / n_{x,\text{total}}$, where $n_x$ is the number of $x$ animals observed to have the phenotype and $n_{x,\text{total}}$ is the total number of $x$ animals examined.
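The three epistasis coefficients can be computed from the four phenotype levels as in the following Python sketch. It is illustrative only and covers the point estimates, not the statistical tests described in the Supplementary Text; the function name is hypothetical, and the cap of the additive expectation at 1 follows the description of the model-specific expectation function given above.

```python
def epistasis_coefficients(phi_m, phi_gdi, phi_double, phi_wt):
    """Epistasis coefficients under the minimum, additive and multiplicative models.

    Inputs are phenotype levels in [0, 1] for the m mutant, gdi-1(RNAi),
    the doubly-altered population and wild type. Requires phi_wt < 1.
    """
    # Normalize phenotype levels to wild type: phi'_x = phi_x - phi_wt.
    pm, pg, pd = phi_m - phi_wt, phi_gdi - phi_wt, phi_double - phi_wt
    # Normalize fitness to wild type: F'_x = F_x / F_wt with F_x = 1 - phi_x.
    f_wt = 1.0 - phi_wt
    fm, fg, fd = (1.0 - phi_m) / f_wt, (1.0 - phi_gdi) / f_wt, (1.0 - phi_double) / f_wt
    return {
        "minimum": max(pm, pg) - pd,
        "additive": min(pm + pg, 1.0) - pd,   # expectation capped at 1
        "multiplicative": fd - fm * fg,
    }


def interaction_type(epsilon):
    """Positive coefficients are antagonistic, negative ones synergistic."""
    if epsilon > 0:
        return "antagonistic"
    if epsilon < 0:
        return "synergistic"
    return "none"
```

For instance, with levels 0.1 (m mutant), 0.8 (gdi-1(RNAi)), 0.3 (double) and 0 (wild type), all three coefficients are positive, which corresponds to an antagonistic interaction under every model.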

4.6.12 Statistic for the suppression of muscle degeneration

Let $\phi_x = y_x / n_x$ represent the level of muscle degeneration expressed by an animal in genetic population $x$, where $y_x$ is the number of abnormal muscle cells and $n_x$ is the total number of muscle cells observed in the animal. In each independent experiment, at least 20 animals were observed for each genetic population. We statistically tested the hypothesis that $\phi_{m/\text{gdi-1}} < \phi_m$, i.e. gdi-1(RNAi) treatment suppresses the muscle degeneration observed in m-mutant animals. Specifically, the hypothesis was tested using the Mann-Whitney test and a P value was obtained for each independent experiment. The P values were combined to compute an overall P value using the weighted-Z method [265] (N = 3). The weight of each independent experiment was the total number of animals observed (i.e. the number of gdi-1(RNAi)-treated m-mutant animals observed plus the number of control-treated m-mutant animals observed).
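The combination of per-experiment P values can be sketched in Python as below, assuming the usual form of the weighted-Z method, in which each one-sided P value is converted to a standard normal deviate and the deviates are combined as a weighted sum divided by the square root of the sum of squared weights. The function name and the example numbers are hypothetical and are not data from this study.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(p_values, weights):
    """Combine one-sided P values (each strictly between 0 and 1)."""
    normal = NormalDist()
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    combined = sum(w * z for w, z in zip(weights, z_scores))
    combined /= sqrt(sum(w * w for w in weights))
    return 1.0 - normal.cdf(combined)


# Hypothetical example: three experiments weighted by the animals observed.
overall_p = weighted_z([0.04, 0.10, 0.03], [40, 45, 60])
```

Weighting by the number of animals observed gives larger experiments more influence on the overall P value, as described above.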

4.7 Acknowledgements

We thank Marc Vidal (Dana-Farber Institute) for the pL4440-gdi-1 construct; Nathalie Dourdin, Alexandre Arnold and Borhane Annabi for critical reading of the manuscript; and Sara J. Calafell Gosline for technical assistance.

Figure 4–1: Comparison of large-scale genetic interaction studies in C. elegans. The studies are compared in terms of the percentage of genes with identified/predicted interactions and the success rate of experimental validation (i.e. the fraction of tested gene pairs that exhibit a genetic interaction). Systematic experimental screens test a limited number of gene pairs due to the labor-intensive experimental procedures. Moreover, these screens identify a small number of interactions relative to the number of tested gene pairs since genetic interactions appear to be rare. Prediction-based methods can assess all pairs of genes in silico, and consequently, the percentage of genes with predicted interactions tends to be larger than the percentage of genes with interactions identified by a systematic experimental screen. Moreover, predictions focus experimental efforts on gene pairs that are likely to exhibit a genetic interaction. Accordingly, the success rates of prediction-driven screens tend to be greater than the rates for systematic experimental screens. The success rate of our study shown here is conservative since it was computed based on the following definition: a gene pair exhibits an interaction only if the interaction is statistically significant according to all considered epistasis models (see Methods).

Figure 4–2: Gene pair attributes used to predict genetic interactions. The two genes/proteins of interest are highlighted with thick grey rings. (A) I, the presence or absence of a protein-protein (PP) interaction between the proteins encoded by the genes of interest, or their orthologues. (B) CI, a measure of the significance of the overlap between the PP interaction neighborhoods of the proteins encoded by the genes of interest (i.e. overlap of the red and blue regions). The PP interaction neighborhood of a given protein is the set of all proteins that exhibit a PP interaction with the given protein (according to the multi-species PP interaction network). (C) N, an indicator for whether the neighborhoods of the genes of interest are enriched with the same phenotype. Here we define the neighborhood of a given gene as the set of genes that show significant coexpression (P ≤ 0.05, see Methods) with the given gene and/or encode proteins that exhibit a PP interaction with the product of the given gene (according to the multi-species PP interaction network). Both neighborhoods shown here are enriched with a particular phenotype. (D) NPh, an indicator like N with the additional requirement that the genes of interest themselves must also exhibit the phenotype enriched in their neighborhoods.

Figure 4–3: Assessment of the biological relevance of the predicted genetic interaction network with pathway annotations. Here we show scenarios where a pair of genes annotated to the same pathway (A) is directly connected or (B) shares ≥1 neighbor in a genetic interaction network, where the genes in the pair of interest are highlighted with thick grey rings. In (A), the genes exhibit a within-pathway genetic interaction based on the given set of pathway annotations. In (B), the genes belonging to the same pathway (e.g. pathway A) both interact with a gene that may be an unknown member of the same pathway, or may belong to a different pathway (e.g. pathway B), and thus exhibit potential within- and between-pathway interactions, respectively. Below, the frequencies at which each scenario occurs in the predicted network and in randomized networks are shown with respect to all pathways and to signaling pathways only (see Methods). The “all pathways” and signaling pathway annotations were derived from human and C. elegans, respectively. For each set of pathway annotations, the median, first and third quartile frequencies of each scenario were computed across N = 100K randomized networks; the bar length depicts the median and the error bars depict the first and third quartiles. Both scenarios occur more frequently than what is expected by chance, for both sets of pathway annotations.

Figure 4–4 (preceding page): Phenotypical characterization of gdi-1(RNAi)-treated animals. (A,B) DAPI staining of egfp(RNAi)- (labeled wt) and gdi-1(RNAi)-treated wild-type animals. (A) Gonad morphogenesis defects (Gon) characterized by short gonads (*) are observed in gdi-1(RNAi)-treated animals. Scale bar, 200 µm. (B) Accumulation of endomitotic oocytes (Emo, arrowheads) in the proximal gonad of gdi-1(RNAi)-treated animals. Arrows indicate the spermathecae. Scale bar, 25 µm. (C) Sterility (Ste), Gon and Emo phenotypes were measured in wild-type, rrf-1(pk1417) and ppw-1(pk2505) animals submitted to gdi-1(RNAi) (N = 3). The mean expressivity/penetrance of each phenotype is shown with error bars representing ± one standard error. A (*) indicates a statistically significant reduction of the phenotypes (P ≤ 0.05, Student’s t-test) compared to wild-type animals treated with gdi-1(RNAi). (D) Distributions of the sheath cell contraction frequency for egfp(RNAi)- (labeled wt) and gdi-1(RNAi)-treated animals.

Figure 4–5: Validation of a subset of genetic interactions predicted for gdi-1. Ste, Gon and Emo phenotypes were measured in animals submitted to RNAi against egfp (grey) or gdi-1 (green). The mean difference in the expressivity/penetrance of each phenotype in perturbed (mutant or chemically-treated) versus wild-type animals (denoted φ[x]−φ[y], for animals of type x and y) is shown with error bars representing ± one standard error, n ≥ 3. For Ste, the Z-score of the difference in expressivity is plotted (see Supplementary Text). (∆) and (*) indicate statistically significant differences for animals treated with egfp(RNAi) and gdi-1(RNAi), respectively (P ≤ 0.05, see Methods). NA: not available. (A) Differences in phenotype expressivity induced by mutations in genes predicted to interact with gdi-1. (B) Differences in phenotype expressivity induced by chemical treatment. Blebbistatin (Blebb.) is a myosin ATPase inhibitor and ML-7 is a specific inhibitor of myosin light chain kinase.

Figure 4–6 (preceding page): Epistasis between gdi-1 and its predicted genetic interactors and chemical suppressors. (A) The minimum (M), additive (+) and multiplicative (*) statistical models of epistasis were used in the analysis. A statistical test for the specific suppression (S) of gdi-1(RNAi)-induced defects was also used (see Methods for details). Significant synergistic and antagonistic interactions are illustrated with shades of red and blue, respectively (P ≤ 0.05). Darker shades indicate significant interactions with P ≤ 0.01. The absence of a statistically significant interaction is indicated by a white entry. NA: not available. (B) Schematic representation of gdi-1 interactors. Blue lines represent antagonistic interactions with gdi-1. The dashed red line indicates phenocopy between mel-11 and gdi-1. unc-96 (paramyosin-binding protein), unc-89 [myosin light chain (MLC)-kinase], and mel-11 (MLC-phosphatase) are regulators of the actin-myosin contractile apparatus (AMCA, represented in grey) [176, 269]. unc-54 and myo-1 are type II myosin heavy chains. tra-4 encodes a PLZF-like transcription factor [108]. aspm-1 (orthologue of mammalian ASPM) and dyb-1 (orthologue of a component of the dystrophin glycoprotein complex, DGC) have been associated with mitotic spindle assembly and DGC function in humans, respectively [10]. ML-7 and blebbistatin (Blebb.) are specific inhibitors of MLC-kinase and myosin II ATPase activities, respectively.

Figure 4–7 (preceding page): gdi-1 suppresses dys-1- and dyb-1-associated muscle degeneration. Body-wall muscle fibers observed using polarized light microscopy in (A) wild-type and (B) dys-1(cx18);hlh-1(cc561) animals. The arrow indicates an abnormal/degenerated muscle cell. Scale bar, 200 µm. (C) Muscle degeneration was assessed in wild-type (wt), dys-1(cx18);hlh-1(cc561) and dyb-1(cx36);hlh-1(cc561) animals submitted to RNAi against egfp (grey) or gdi-1 (green). The percentage of abnormal muscle cells in a methanol-fixed animal, estimated with polarized light microscopy, was used to quantify muscle degeneration in the animal. Boxplots of these percentages are shown. The total number of animals assessed across three independent experiments is shown above each boxplot in parentheses. The percentage of abnormal muscle cells is significantly reduced in gdi-1(RNAi)-treated versus egfp(RNAi)-treated mutant animals as indicated by the P values shown at the top (see Methods).

CHAPTER 5
Supplementary Information for Chapter 4

This chapter contains all the supplementary information for the manuscript in chapter 4, except for the following:
• Table 5-S1: Genetic interactions hand-curated from the literature.
• Table 5-S3: Signaling pathway genes curated from the C. elegans literature.
• Table 5-S4: Curated set of mental retardation and synaptic plasticity genes and their C. elegans orthologues.
The above tables are not intended for the printed page; however, they are available for download (i.e. no subscription required) at www.plosone.org, PLoS ONE 5: e10624 (2010).

5.1 Supplementary Text: Supplementary Methods

5.1.1 Human orthologues of genes with predicted genetic interactions

The set of predicted genetic interactions was assessed in terms of usefulness for improving the characterization of human genes. HomoloGene orthology (release 56, downloaded from ftp://ftp.ncbi.nih.gov/pub/HomoloGene/) was used to map C. elegans genes to human genes. Gene characterization indices of human genes were obtained from a previous study [143]. Figure 5–2C summarizes the results of this assessment.

5.1.2 Controls for the validation of predicted genetic interactions using RNAi and balanced heterozygote strains

In order to test whether the observed reduction of gdi-1(RNAi)-induced phenotypes in various mutant backgrounds was due to an overall reduction of RNAi efficiency (instead of an interaction), mutant animals expressing YOLK::GFP proteins were generated by mating mutant hermaphrodites with sqt-1(sc103) II; bIs1 X males. Efficiency of egfp(RNAi) treatment was estimated by counting animals without observable GFP emission using fluorescence microscopy. No difference in egfp(RNAi) efficiency was observed in dyb-1(cx36), unc-89(e1460), unc-89(st85), unc-89(ok1116), unc-89(ok1659), unc-96(su151), tra-4(ok1636) and aspm-1(ok1208) backgrounds (data not shown). This confirmed the genetic interactions between these genes and gdi-1.

Interactions with gdi-1 were assessed through RNAi treatment of homozygotes for all alleles tested except for aspm-1(ok1208). Instead, aspm-1(ok1208) heterozygotes were balanced with a hT2 chromosomal translocation between chromosomes I and III. In order to assess whether gdi-1 also genetically interacts with unbalanced heterozygotes of aspm-1(ok1208), VC761 (aspm-1(ok1208) I / hT2[bli-4(e937); let-?(q782) qIs48]) animals were mated with N2 animals. L1 larvae without GFP expression in the pharynx (corresponding to unbalanced aspm-1(ok1208)) were isolated from the F1 population and submitted to RNAi against gdi-1 or egfp. The Emo phenotype was then scored as detailed in the Methods. As shown in Figure 5–6, the gdi-1(RNAi)-induced Emo phenotype was significantly reduced in unbalanced aspm-1(ok1208) heterozygotes. This confirms the interaction between gdi-1 and aspm-1 (Figures 4–5A and 4–6).

5.1.3 Epistasis statistics

The statistical significance of genetic interactions with respect to the Gon and Emo phenotypes was determined as follows. Consider the putative interaction between a gene m and gdi-1. For some genes, multiple alleles were tested via different strains (e.g. unc-89). In these cases, epistasis statistics were computed for each individual strain. Due to day effects, the epistasis coefficient $\varepsilon_i$ was computed independently for each day i on which the experiments were replicated. Let $w_i$ represent the total number of animals examined on day i (with respect to gene m and gdi-1). The weighted mean and variance of $\varepsilon$ were computed using the $w_i$ values as weights. Finally, these weighted statistics of $\varepsilon$ were used to compute a t-statistic for testing the null hypothesis that $\varepsilon = 0$ (i.e. a two-sided test) with (N − 1) degrees of freedom, where N is the number of days on which the experiments were replicated.

Similarly, the test for gdi-1(RNAi) suppressors was performed by computing a t-statistic based on the weighted mean and variance of $d = \varphi'_{\text{gdi-1}} - \varphi'_{m/\text{gdi-1}}$. Since the goal was to test the null hypothesis that $d \le 0$, a one-sided test was used to determine statistical significance. Figure 4–5 illustrates values of (−d) so that a reduction in phenotypic level is associated with a negative value (green bars). Analogous statistics were computed and plotted for the egfp(RNAi) observations (Figure 4–5, grey bars) for which we defined $d = \varphi'_{\text{egfp}} - \varphi'_{m/\text{egfp}}$.

For the Ste phenotype, each genetic population was quantified on different days yet also replicated within each day. The same-day measurements were used to estimate the variance of each population and paired measurements across days were used to estimate the covariance of different populations. For day i, $\varepsilon_i$ was computed from the mean brood sizes of the relevant genetic populations. The standard error of $\varepsilon_i$ was estimated using a formula derived from error propagation rules [30]. For cases where the formula yielded an invalid standard error value (e.g. by taking the square root of a negative number), the delta method [199] was used to estimate the error.

If this failed as well, across-day variance for each population was used in place of the within-day variance as input to the delta method. A t-statistic for $\varepsilon_i$ was then computed and used to test the null hypotheses $\varepsilon_i < 0$ and $\varepsilon_i > 0$ (i.e. one-sided tests) with $\nu_i = N_{i,1} + N_{i,2} + \dots + N_{i,k} - k$ degrees of freedom, where $N_{i,j}$ is the number of replicates on day i for population j, and k is the number of populations relevant to the computation of $\varepsilon_i$. For each null hypothesis, the derived P values, $P_i$, were combined to compute an overall P value using the weighted-Z method [265] with the $\nu_i$ values used as weights. Consequently, we obtained one P value for the significance of an antagonistic interaction ($\varepsilon_i > 0$), and a second for the significance of a synergistic interaction ($\varepsilon_i < 0$).

We opted for two one-tailed tests instead of a single two-tailed test for Ste since combining P values that each test the (two-sided) null hypothesis that $\varepsilon_i = 0$ can be misleading. For example, if $P_1$ is low because $\varepsilon_1 > 0$ and $P_2$ is low because $\varepsilon_2 < 0$, the combined P value will likely be low as well even though the across-day average $\varepsilon$ is likely close to zero. As a result, the significance of the interaction would be overestimated. In contrast, $P_i$ values from one-sided tests of the same null hypothesis (e.g. $\varepsilon < 0$) can only be low if the respective $\varepsilon_i$ values are extreme in a pre-specified direction (e.g. $\varepsilon > 0$). Therefore, the combined P value will reflect the overall significance of $\varepsilon$ being extreme in that specific direction (e.g. $\varepsilon > 0$; antagonistic).

Similarly, the test for gdi-1(RNAi) suppressors in terms of Ste was performed by computing a t-statistic for $d = \varphi'_{\text{gdi-1}} - \varphi'_{m/\text{gdi-1}}$. Specifically, $d_i$ was computed for each day i on which the experiment was repeated. The standard error of $d_i$ was estimated with the same techniques used to estimate the standard error of $\varepsilon_i$. A t-statistic for $d_i$ was then computed and used to test the null hypothesis $d_i < 0$ (i.e. a one-sided test) with $\nu_i = N_{i,1} + N_{i,2} + \dots + N_{i,k} - k$ degrees of freedom, where $N_{i,j}$ is the number of replicates on day i for population j, and k is the number of populations relevant to the computation of $d_i$. Again, the derived $P_i$ were combined to compute an overall P value using the weighted-Z method [265] with the $\nu_i$ values used as weights. Figure 4–5 illustrates values of the weighted Z-score so that a reduction in phenotypic level is associated with a negative value (green bars), and the error bars correspond to the standard error of the Z-score. Analogous statistics were computed and plotted for the egfp(RNAi) observations (Figure 4–5, grey bars) for which we defined $d = \varphi'_{\text{egfp}} - \varphi'_{m/\text{egfp}}$.

Since the statistical tests were repeated for different genes/strains, the resulting P values were adjusted for multiple comparisons using the Benjamini and Hochberg method [27]. For all tests, a threshold of (adjusted P) ≤ 0.05 was used to determine statistical significance. All epistasis values and corresponding adjusted P values are provided in Tables 5–2 and 5–3, respectively.

The t-distribution was used throughout under the assumption that the errors (i.e. deviations of the measurements from the true values) are normally distributed. The validity of this assumption was assessed by considering the gdi-1(RNAi)-treated animals since there are roughly 30 measurements per phenotype for this genetic population whereas other populations have far fewer. To make the measurements comparable, we weighted the wild-type-normalized values to reflect the confidence in the measurement. The weight of a measurement was derived from the sample size used to obtain it (e.g. the number of worms examined for the phenotype on a particular day). Specifically, we defined the weight as (sample size)/(total sample size across days). Figure 5–7 illustrates the histogram of the final scaled measurements along with a fitted normal distribution for each phenotype. Although the true distributions are bounded below by zero, their shapes are approximately normal as required. The bound discrepancy simply leads to more conservative P values. The t-test is also appropriate since the data are often obtained from few samples and the t-distribution is explicitly parameterized by sample size (via degrees of freedom) accordingly.
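The weighted t-statistic used for the Gon and Emo phenotypes can be sketched in a few lines of Python. This is a hedged illustration rather than the analysis code of this study: the function name `weighted_t_statistic` and the example values are hypothetical, and the particular variance convention (weighted variance with an N/(N − 1) correction, standard error equal to the square root of variance/N) is an assumption, since several definitions of a weighted variance exist. With equal weights, the sketch reduces to the ordinary one-sample t-statistic.

```python
from math import sqrt


def weighted_t_statistic(values, weights):
    """Return (t, degrees_of_freedom) for testing whether the weighted mean is zero.

    `values` holds one coefficient per day and `weights` holds the number of
    animals examined on that day.
    """
    n = len(values)
    if n != len(weights) or n < 2:
        raise ValueError("need at least two days, with one weight per day")
    total_weight = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total_weight
    # Weighted variance with an n / (n - 1) correction (an assumed convention).
    variance = (
        sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total_weight
    ) * n / (n - 1)
    if variance == 0.0:
        raise ValueError("zero variance: the t-statistic is undefined")
    standard_error = sqrt(variance / n)
    return mean / standard_error, n - 1


# Hypothetical per-day coefficients and animal counts for one strain.
t_value, dof = weighted_t_statistic([0.21, 0.35, 0.28], [40, 55, 32])
```

The returned t-statistic would then be compared against a t-distribution with the returned degrees of freedom, using a two-sided test for the epistasis coefficient and a one-sided test for the suppressor statistic d.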

Figure 5–1: Receiver-operating-characteristic curve of the genetic interaction predictor. The error rates were estimated with leave-one-out cross-validation. The threshold associated with each point (i.e. a pair of rates) is indicated in red text. Only the portion of the curve with the smallest false positive rates is shown since, in practice, having fewer false positives instead of greater sensitivity is more important for laborious experimental validation.

Figure 5–2: Comparison of genome-wide genetic interactions predicted by different approaches. (A) Venn diagrams of predicted interactions from Zhong and Sternberg [280], Lee et al. [157] and this study. Left, interactions between any C. elegans genes. Right, interactions between C. elegans genes with human orthologues. Our approach predicts many novel interactions and about 85% of them are between C. elegans genes without human orthologues. (B) Comparison of the mean number of predicted interactions per gene and the percentage of genes with predicted interactions (i.e. the percentage of the genome covered by the set of predicted interactions), between two studies. The comparisons are made in the context of mental retardation and synaptic plasticity (MRSP) genes only and in the genome-wide context (GW). (C) Comparison of the number of human genes whose C. elegans orthologues have predicted interactions, stratified by gene characterization index (see Supplementary Text). Our approach predicts novel interactions for genes orthologous to poorly characterized human genes.

Table 5–1: Performance of genetic interaction predictors.

Predictor                     True Positive Rate   False Positive Rate   Genome Coverage (c)   Threshold
this study (a)                10.78%               0.41%                 35.75%                0.850
this study (a)                2.46%                0.02%                 21.73%                0.975
Zhong & Sternberg, 2006 (b)   2.83%                0.01%                 8.10%                 0.900

(a) rates are based on cross-validation
(b) rates are with respect to the positive and negative learning sets in this study
(c) percentage of all C. elegans genes with at least one predicted interaction
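For readers who want to see how a (true positive rate, false positive rate) pair such as those in Table 5–1 follows from a score threshold, the sketch below computes both rates from predictor scores and known labels. The function name, the example scores and the use of "score ≥ threshold" as the calling rule are illustrative assumptions; the sketch is not the evaluation code of this study.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true_positive_rate, false_positive_rate) at a score threshold.

    A gene pair is called an interaction when its score is >= threshold.
    `labels` holds True for known interactions and False for known negatives.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    true_pos = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    false_pos = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return true_pos / positives, false_pos / negatives


# Hypothetical held-out scores: three known interactions, three negatives.
scores = [0.99, 0.90, 0.60, 0.20, 0.95, 0.10]
labels = [True, True, True, False, False, False]
tpr, fpr = rates_at_threshold(scores, labels, threshold=0.85)
```

Sweeping the threshold over a range of values and plotting the resulting pairs traces out a receiver-operating-characteristic curve; raising the threshold lowers both rates.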

Table 5–2: Epistasis coefficients of experimentally tested genetic interactions. Abbreviations: Ste, sterility; Gon, severe shortening of gonads; Emo, accumulation of endomitotic oocytes; M minimum model; + additive model; * multiplicative model; S suppressor test; NA not available

                                         Ste                             Gon                             Emo
Gene          Allele             M       +       *       S       M       +       *       S       M       +       *       S
F42G8.10      ok1199        -0.005   0.017   0.009  -0.005   0.141   0.144   0.144   0.101   0.014   0.063   0.041   0.021
F54D5.4       ok2046         0.012   0.035   0.010   0.012  -0.034  -0.031  -0.023  -0.033   0.026   0.066   0.077   0.036
aspm-1        ok1208 +/-     0.139   0.158   0.151   0.139   0.539   0.547   0.586   0.531   0.936   0.969   0.940   0.933
dyb-1         cx36           0.151   0.148   0.150   0.151   0.412   0.395   0.421   0.429   0.248   0.238   0.238   0.228
lin-36        n766           0.008   0.120   0.011   0.008   0.135   0.136   0.130   0.122  -0.013   0.005   0.001   0.026
mel-11        sb56 +/-      -0.022   0.062  -0.018  -0.022  -0.195  -0.072  -0.162  -0.194  -0.099  -0.105  -0.100  -0.096
mel-11        sb56 -/-       0.000   0.000   0.000  -0.044  -0.085   0.018  -0.054  -0.383  -0.057   0.025  -0.034  -0.102
smo-1         ok359 +/-     -0.015   0.007  -0.004  -0.015  -0.062  -0.037  -0.042  -0.043   0.107   0.128   0.128   0.103
smo-1         ok359 -/-         NA      NA      NA      NA  -0.094   0.033  -0.001  -0.243      NA      NA      NA      NA
tag-163       ok644          0.004   0.068   0.005   0.004   0.021   0.047   0.036   0.016  -0.117  -0.063  -0.106  -0.112
tra-4         ok1636 +/-     0.072   0.147   0.073   0.072   0.182   0.391   0.261   0.182   0.446   0.463   0.462   0.441
tra-4         ok1636 -/-     0.063   0.097   0.087   0.063   0.413   0.634   0.522   0.420   0.844   0.840   0.843   0.841
trp-2         gk298          0.058   0.032   0.060   0.058  -0.251  -0.161  -0.248  -0.243   0.173   0.190   0.190   0.195
unc-89        e1460          0.025   0.258   0.030   0.025   0.282   0.311   0.308   0.314   0.437   0.579   0.552   0.490
unc-89        ok1116         0.022   0.137   0.036   0.022   0.083   0.115   0.112   0.098   0.496   0.480   0.514   0.486
unc-89        ok1659        -0.005   0.162  -0.002  -0.005  -0.016   0.076  -0.002  -0.012  -0.056   0.056  -0.017  -0.068
unc-89        st85          -0.005   0.421   0.002  -0.005   0.229   0.226   0.242   0.237   0.482   0.443   0.454   0.599
unc-96        su151         -0.011   0.000  -0.001  -0.021   0.376   0.463   0.404   0.379   0.710   0.694   0.723   0.710
blebbistatin  10µM           0.001   0.025   0.002   0.001   0.079   0.099   0.081   0.080   0.074   0.105   0.071   0.073
blebbistatin  50µM           0.030   0.161   0.032   0.030   0.047   0.047   0.047   0.045   0.246   0.267   0.242   0.240
blebbistatin  100µM          0.001   0.016   0.001   0.001   0.244   0.316   0.262   0.244   0.217   0.269   0.233   0.220
ML-7          2µM            0.017   0.021   0.017   0.017   0.049   0.055   0.048   0.049   0.015   0.014   0.014   0.015
ML-7          10µM          -0.004   0.055  -0.002  -0.004   0.238   0.239   0.239   0.238   0.052   0.132   0.070   0.052
ML-7          50µM          -0.002   0.099   0.005  -0.002   0.431   0.480   0.436   0.383   0.340   0.378   0.353   0.341

Figure 5–3 (preceding page): The relationship between the quantity of information available for a gene and the number of predicted genetic interactions. The quantity of information available for a gene is a measure that takes into account the fact that some gene pair attributes are more informative than others for predicting genetic interactions. See the Methods for the computation of the total quantity of information for each gene. MRSP: mental retardation and synaptic plasticity; ZS: Zhong and Sternberg [280]. (A) The total quantity of information available for MRSP genes with the ZS approach and with our approach. The three sets of boxplots correspond to MRSP genes with predicted interactions in this study only, in the ZS study only and in neither study, respectively. (B) Types and total quantity of information available for MRSP genes with the ZS approach and with our approach. Each column corresponds to a gene and a black entry indicates that there is information for the gene of the type specified (to the left) by the row (except for the row labeled “Total quantity of information”). The ZS approach separates the information from three organisms: Saccharomyces cerevisiae (Sc), Drosophila melanogaster (Dm) and Caenorhabditis elegans (Ce). The information types (i.e. attributes) of this study are described in the Results and Methods. For each approach, there is also a row indicating the total quantity of information (scaled between 0 and 1), where white and black indicate zero and maximal information, respectively. The heatmap in the middle illustrates the number of interactions predicted for each gene by the different approaches, where a greater intensity of red corresponds to a greater number. gdi-1 is highlighted in green.

Table 5–3: Epistasis P values of experimentally tested genetic interactions. Abbreviations: Ste, sterility; Gon, severe shortening of gonads; Emo, accumulation of endomitotic oocytes; M minimum model; + additive model; * multiplicative model; S suppressor test; NA not available

                                         Ste                             Gon                             Emo
Gene          Allele             M       +       *       S       M       +       *       S       M       +       *       S
F42G8.10      ok1199       6.9e-01   0.092 2.8e-02 1.0e+00   0.440   0.390   0.430   0.340   0.960   0.760   0.870   0.560
F54D5.4       ok2046       5.3e-01   0.190 6.4e-01 5.3e-01   0.500   0.480   0.650   0.950   0.960   0.700   0.700   0.530
aspm-1        ok1208 +/-   4.0e-03   0.004 0.0e+00 4.0e-03   0.100   0.110   0.073   0.052   0.006   0.006   0.005   0.003
dyb-1         cx36         2.9e-05   0.002 1.8e-06 2.9e-05   0.012   0.053   0.014   0.006   0.140   0.120   0.150   0.080
lin-36        n766         9.8e-01   0.018 7.9e-01 1.0e+00   0.500   0.480   0.480   0.330   0.960   0.950   0.990   0.530
mel-11        sb56 +/-     5.0e-03   0.390 1.3e-01 1.0e+00   0.012   0.380   0.014   1.000   0.240   0.068   0.190   0.970
mel-11        sb56 -/-     9.8e-01   0.500 7.9e-01 1.0e+00   0.500   0.380   0.450   1.000   0.520   0.550   0.510   0.970
smo-1         ok359 +/-    4.0e-01   0.500 9.8e-01 1.0e+00   0.590   0.620   0.680   0.850   0.300   0.220   0.290   0.170
smo-1         ok359 -/-         NA      NA      NA      NA   0.160   0.480   0.980   1.000      NA      NA      NA      NA
tag-163       ok644        9.8e-01   0.500 7.9e-01 1.0e+00   0.620   0.130   0.440   0.480   0.280   0.570   0.290   0.970
tra-4         ok1636 +/-   2.1e-06   0.001 8.5e-07 2.1e-06   0.150   0.110   0.042   0.094   0.018   0.030   0.017   0.010
tra-4         ok1636 -/-   6.0e-03   0.500 2.0e-02 6.0e-03   0.150   0.053   0.089   0.095   0.006   0.002   0.005   0.003
trp-2         gk298        9.8e-01   0.430 6.4e-01 1.0e+00   0.120   0.270   0.130   1.000   0.520   0.530   0.510   0.300
unc-89        e1460        1.0e-03   0.005 4.1e-04 1.0e-03   0.150   0.150   0.150   0.095   0.300   0.220   0.290   0.160
unc-89        ok1116       1.0e+00   0.190 7.9e-01 1.0e+00   0.500   0.270   0.430   0.320   0.075   0.051   0.077   0.037
unc-89        ok1659       6.3e-01   0.290 1.0e+00 1.0e+00   0.830   0.480   0.980   0.750   0.680   0.400   0.870   0.900
unc-89        st85         6.9e-01   0.007 6.4e-01 1.0e+00   0.150   0.190   0.150   0.099   0.960   0.930   0.950   0.720
unc-96        su151        4.0e-01   0.500 1.3e-01 1.0e+00   0.100   0.130   0.089   0.052   0.017   0.030   0.011   0.008
blebbistatin  10µM         9.8e-01   0.420 7.6e-01 1.0e+00   0.150   0.130   0.130   0.095   0.710   0.550   0.700   0.450
blebbistatin  50µM         1.3e-02   0.001 2.0e-02 1.3e-02   0.500   0.480   0.480   0.200   0.300   0.220   0.290   0.170
blebbistatin  100µM        9.8e-01   0.350 7.9e-01 1.0e+00   0.045   0.130   0.072   0.034   0.170   0.092   0.160   0.080
ML-7          2µM          8.9e-01   0.390 6.4e-01 8.9e-01   0.320   0.390   0.380   0.170   0.710   0.700   0.700   0.450
ML-7          10µM         6.3e-01   0.001 1.0e+00 1.0e+00   0.120   0.130   0.099   0.072   0.430   0.140   0.290   0.260
ML-7          50µM         6.5e-01   0.001 4.8e-01 1.0e+00   0.120   0.110   0.089   0.046   0.010   0.051   0.008   0.005
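The P values tabulated above are adjusted for multiple comparisons with the Benjamini and Hochberg method (see Supplementary Text). As a reminder of what that adjustment does, the following is a short Python sketch of the standard step-up procedure. The function name and example values are hypothetical, and the sketch is not the code used in this study; entries marked NA would need to be removed before calling it.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down to the smallest so that the
    # adjusted values never increase as the raw P values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P values from four tests.
adjusted_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p <= 0.05 for p in adjusted_example]
```

An adjusted P value is never smaller than the corresponding raw P value, so the adjustment can only make an interaction harder to call significant at the 0.05 threshold.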

Figure 5–4: Different methods for estimating the P value associated with a Pearson correlation value measuring the coexpression of two genes in the Kim et al. dataset [145]. The grey bars indicate the empirical P values associated with bins of correlation values. The t-distribution (blue line) and Fisher’s Z transform (red line) methods do not produce P values that match the empirical trend closely. In contrast, the fitted normal distribution approximates the empirical distribution well (green line).
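To illustrate how the estimates compared in Figure 5–4 differ, the sketch below computes an upper-tail P value for a Pearson correlation in three ways: from Fisher's Z transform, from a normal distribution fitted to a background set of correlation values, and empirically from that same background. All names and numbers are hypothetical, and the choice of an upper-tail (one-sided) P value is an assumption; the exact procedure used in this study is described in the Methods.

```python
from math import atanh, sqrt
from statistics import NormalDist


def fisher_z_p_value(r, n_samples):
    """Upper-tail P value for correlation r (-1 < r < 1) via Fisher's Z transform.

    Under the null hypothesis of zero correlation, atanh(r) * sqrt(n - 3) is
    approximately standard normal; n_samples must exceed 3.
    """
    z = atanh(r) * sqrt(n_samples - 3)
    return 1.0 - NormalDist().cdf(z)


def fitted_normal_p_value(r, background):
    """Upper-tail P value from a normal distribution fitted to background values."""
    fitted = NormalDist.from_samples(background)
    return 1.0 - fitted.cdf(r)


def empirical_p_value(r, background):
    """Fraction of background correlation values that are at least as large as r."""
    return sum(1 for b in background if b >= r) / len(background)


# Hypothetical background of correlation values between random gene pairs.
background = [-0.30, -0.12, -0.05, 0.02, 0.08, 0.15, 0.22, 0.41]
p_fisher = fisher_z_p_value(0.35, n_samples=50)
p_fitted = fitted_normal_p_value(0.35, background)
p_empirical = empirical_p_value(0.35, background)
```

The Fisher transform depends only on the correlation value and the number of samples, whereas the fitted-normal and empirical estimates depend on how correlation values are actually distributed across gene pairs in the dataset, which is why they can disagree.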

Figure 5–5: The dependencies between the predictive gene pair attributes as defined by a learned Bayesian network. See the Methods for how the Bayesian network was derived.

Figure 5–6: The interaction of gdi-1 with unbalanced heterozygotes of aspm-1(ok1208). The mean penetrance/expressivity of the Emo phenotype in wild-type (wt) or unbalanced aspm-1(ok1208) heterozygotes (aspm-1(ok1208) +/-), submitted to either egfp or gdi-1 RNAi, is shown. The error bars correspond to ± one standard error over three independent experiments. (*) indicates a statistical difference between wt and aspm-1(ok1208) +/- animals submitted to gdi-1(RNAi) (P ≤ 0.05, see Methods).

Figure 5–7: Validity of the normality assumption for the application of Student’s t-tests to phenotype measurement data. The bars represent the empirical distribution of scaled phenotype values induced by gdi-1(RNAi) treatment (see Supplementary Text). Each red line is a fitted normal distribution.

Table 5–4: AIC values of 63 logistic regression models that use different combinations of the gene pair attributes. Each row corresponds to a different model and the use of an attribute in a model is indicated by an ‘x’ in the appropriate column. Each model was evaluated with five different folds of the learning set, using the 1:2 weight ratio. For each fold, the minimum AIC value is indicated in bold.

Exp Ph I CI N NPh Fold.1 Fold.2 Fold.3 Fold.4 Fold.5 x x x x x x 2213.41 2238.10 2215.91 2213.11 2220.02 x x x x x 2221.50 2247.04 2223.77 2218.65 2229.51 x x x x x 2388.08 2417.61 2395.18 2403.88 2398.94 x x x x 2420.72 2451.18 2428.98 2435.76 2432.46 x x x x x 2239.17 2260.41 2241.57 2234.44 2244.08 x x x x 2247.68 2264.82 2244.11 2239.41 2246.76 x x x x 2419.92 2446.76 2426.51 2430.27 2428.42 x x x 2454.40 2478.95 2464.21 2464.70 2464.60 x x x x x 2225.21 2248.71 2233.80 2228.89 2234.04 x x x x 2233.49 2257.37 2238.65 2233.02 2243.11 x x x x 2399.29 2428.01 2412.12 2417.74 2411.85 x x x 2431.97 2460.76 2445.87 2444.90 2444.75 x x x x 2254.18 2270.52 2255.28 2250.04 2257.08 x x x 2262.74 2283.03 2266.35 2258.10 2268.74 x x x 2437.63 2462.35 2447.75 2454.79 2447.44 x x 2471.40 2494.48 2484.75 2484.15 2482.01 x x x x x 2216.19 2242.25 2216.42 2214.21 2222.41 x x x x 2222.25 2247.97 2224.70 2219.05 2229.86 x x x x 2391.10 2421.86 2396.96 2404.69 2401.75 x x x 2419.77 2443.27 2426.13 2428.65 2427.76 x x x x 2240.40 2260.55 2242.88 2235.31 2244.28 x x x 2247.38 2263.35 2243.34 2238.49 2244.55 x x x 2420.73 2451.48 2430.61 2435.87 2434.10 x x 2452.73 2478.32 2462.16 2462.70 2462.40 x x x x 2227.67 2252.02 2233.81 2229.42 2235.44 x x x 2233.94 2258.85 2239.17 2233.46 2243.83 x x x 2401.93 2431.91 2413.28 2418.03 2413.88 x x 2430.80 2458.22 2445.85 2443.02 2444.41 x x x 2254.89 2272.92 2256.62 2251.03 2257.83 x x 2261.99 2282.73 2265.49 2256.63 2267.68 x x 2437.33 2461.85 2446.59 2453.84 2445.45 x 2469.45 2493.43 2482.48 2481.89 2479.54 x x x x x 2248.42 2269.32 2246.99 2240.20 2252.56 x x x x 2261.56 2274.59 2259.36 2254.10 2259.03 x x x x 2526.31 2531.71 2517.91 2528.10 2520.00 x x x 2598.87 2602.77 2594.57 2592.12 2596.40 x x x x 2274.38 2283.59 2271.09 2263.99 2270.23 x x x 2288.80 2297.33 2283.15 2277.03 2284.13 x x x 2557.85 2558.52 2548.25 2562.15 2549.54 x x 2636.10 2634.44 2631.53 2632.30 2630.76 x x x x 2258.58 2273.45 2256.63 2251.03 2258.56 x x x 2271.78 
2283.33 2267.41 2261.66 2269.71 x x x 2536.19 2542.27 2533.08 2539.89 2533.36 x x 2608.93 2614.75 2610.37 2605.80 2611.43 x x x 2287.18 2294.28 2279.33 2271.33 2282.24 x x 2301.47 2308.95 2298.01 2289.30 2298.20 x x 2574.00 2573.43 2566.78 2579.27 2567.62 x 2651.61 2651.66 2651.31 2651.33 2651.16 x x x x 2251.56 2269.59 2248.50 2241.60 2253.88 x x x 2261.66 2275.52 2259.57 2254.88 2259.18 x x x 2531.67 2537.50 2522.43 2530.52 2524.37 x x 2596.99 2600.94 2592.54 2590.09 2594.34 x x x 2276.28 2286.40 2272.74 2265.39 2271.81 x x 2288.03 2297.06 2281.23 2272.85 2282.55 x x 2561.09 2563.47 2550.62 2563.22 2551.65 x 2634.10 2632.48 2629.64 2630.40 2628.94 x x x 2261.36 2273.82 2257.68 2251.20 2258.91 x x 2271.64 2283.25 2266.70 2260.86 2269.58 x x 2541.06 2547.60 2536.83 2541.70 2536.94 x 2606.99 2612.85 2608.31 2603.80 2609.40 x x 2288.60 2297.31 2282.29 2276.84 2282.66 x 2300.35 2308.61 2295.78 2287.31 2295.83 x 2576.07 2577.43 2567.95 2579.48 2568.70
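The count of 63 models in Table 5–4 corresponds to every non-empty combination of the six gene pair attributes (2^6 − 1 = 63). The Python sketch below enumerates those combinations and selects the one with the smallest AIC, using the standard definition AIC = 2k − 2 ln(L) for a model with k parameters and maximized likelihood L. The helper names and the AIC values in the example are hypothetical placeholders and are not taken from the table.

```python
from itertools import combinations

ATTRIBUTES = ("Exp", "Ph", "I", "CI", "N", "NPh")


def attribute_subsets(attributes=ATTRIBUTES):
    """Yield every non-empty combination of the given attributes."""
    for size in range(1, len(attributes) + 1):
        yield from combinations(attributes, size)


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L); smaller values are better."""
    return 2 * n_parameters - 2 * log_likelihood


def best_model(aic_by_subset):
    """Return the attribute subset with the smallest AIC value."""
    return min(aic_by_subset, key=aic_by_subset.get)


all_subsets = list(attribute_subsets())

# Hypothetical AIC values for three candidate models on one fold.
example_aic = {
    ("Exp",): 2650.0,
    ("Exp", "Ph"): 2300.0,
    ATTRIBUTES: 2215.0,
}
selected = best_model(example_aic)
```

Because the AIC penalizes each additional parameter, a larger model is preferred only when its gain in likelihood outweighs the penalty, which is the comparison the table makes across folds.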

Table 5–5: Genotypes of C. elegans strains used in this study.

Strain    Genotype
LS505     dyb-1(cx36) I.
LS590     dyb-1(cx36) I; hlh-1(cc561) II.
LS587     dys-1(cx18) I; hlh-1(cc561) II.
VC761     aspm-1(ok1208) I/hT2[bli-4(e937) let-?(q782) qIs48] (I;III).
VC1176    +/szT1[lon-2(e678)] I; F53B3.1(ok1636)/szT1 X.
CB1640    unc-89(e1460) I.
RW85      unc-89(st85) I.
VC1192    unc-89(ok1659) I.
RB1186    unc-89(ok1116) I.
HR483     mel-11(sb56) unc-4(e120)/mnC1 dpy-10(e128) unc-52(e444) II.
VC186     smo-1(ok359)/szT1[lon-2(e678)] I; +/szT1 X.
VC504     tag-163(ok644) I.
VC602     trp-2(gk298) III.
RB1656    F54D5.4(ok2046) II.
RB1223    F42G8.10(ok1199) IV/nT1[qIs51] (IV;V).
HE151     unc-96(su151) X.
NL2098    rrf-1(pk1417) I.
NL2550    ppw-1(pk2505) I.
MT6034    lin-36(n766) III.

CHAPTER 6
Chemogenomic Profiling Predicts Antifungal Synergy

Gregor Jansen*1, Anna Y Lee*2,3, Elias Epp4,5, Amélie Fredette1, Jamie Surprenant1, Doreen Harcus4, Michelle Scott2, Tamiko Nishimura1, Malcolm Whiteway4,5, Michael Hallett2,3,6 and David Y Thomas1

*these authors contributed equally to this work

Duplicated by permission from Macmillan Publishers Ltd.: Mol Syst Biol 5: 338 (2009), under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

6.1 Preface

Although the drug development process typically starts with the identification of a therapeutic target, an alternative approach has been gaining momentum. The alternative approach begins with the search for compounds that induce a desired response from a specific biological system (e.g. a disease model). Subsequent experiments may be performed to resolve the targets of the identified compounds, towards facilitating the optimization stage. Here we consider this response-driven approach for the development of combinatorial therapies. Although testing combinations of compounds to see if they can induce a desired response is conceptually simple, the scale of the experiments is prohibitive. Moreover, the search for synergistic combinations exacerbates the scale issue. Therefore, we developed a method to predict chemical synergies with proof-of-principle results in the antifungal domain.

1 Department of Biochemistry, McGill University, Montréal, Québec, Canada
2 McGill Centre for Bioinformatics, McGill University, Montréal, Québec, Canada
3 School of Computer Science, McGill University, Montréal, Québec, Canada
4 Genetics Group, Biotechnology Research Institute, National Research Council, Montréal, Québec, Canada
5 Department of Biology, Montréal, Québec, Canada
6 Rosalind and Morris Goodman Cancer Centre, McGill University, Montréal, Québec, Canada

6.1.1 Contribution of Authors

Gregor Jansen designed and performed fungal experiments, and contributed to the design of the predictor and to the writing of the manuscript. Anna Y Lee analyzed the FCZ-Fungicidal chemogenomic profile with respect to existing data, developed all variants of the predictor, analyzed the predictors and the final predictions, and contributed to the writing of the manuscript. Elias Epp performed the experiments involving the Candida pathogens. Amélie Fredette performed synergy experiments in S. cerevisiae. Jamie Surprenant and Doreen Harcus performed experiments involving the FCZ-Fungicidal strains. Michelle Scott contributed to the analysis of the FCZ-Fungicidal chemogenomic profile. Tamiko Nishimura performed experiments involving the FCZ-Fungicidal strains and synergy experiments in S. cerevisiae. Malcolm Whiteway and David Y Thomas contributed to the design of the fungal experiments. Michael Hallett contributed to the design of the predictor, made suggestions for the analysis of the resulting predictions, and contributed to the writing of the manuscript. All authors contributed to the revising of the manuscript.

6.2 Abstract

Chemotherapies, HIV infections, and treatments to block organ transplant rejection are creating a population of immunocompromised individuals at serious risk of systemic fungal infections. Since single-agent therapies are susceptible to failure due to either inherent or acquired resistance, alternative therapeutic approaches such as multi-agent therapies are needed. We have developed a bioinformatics-driven approach that efficiently predicts compound synergy for such combinatorial therapies. The approach uses chemogenomic profiles to identify compound profiles that have a statistically significant degree of similarity to a fluconazole profile. The compounds identified were then experimentally verified to be synergistic with fluconazole and with each other, in both Saccharomyces cerevisiae and the fungal pathogen Candida albicans. Our method is therefore capable of accurately predicting compound synergy to aid the development of combinatorial antifungal therapies.

6.3 Introduction

Drugs that act against individual molecular targets are often insufficient to combat fungal infections, multigenic diseases such as cancer, and multiple cell or tissue type diseases including immune and inflammatory disorders [264, 202, 225, 284]. Combinatorial therapies that impact multiple targets simultaneously are less prone to the development of drug resistance, and increase therapeutic efficacy [107, 284]. One of the major benefits of combinatorial therapies is the potential for synergistic effects: that is, the overall therapeutic benefit of the drug combination is greater than the sum of the effects of the individual agents. In particular, synergies between the constituent compounds can provide broader pharmacological windows and reduced toxicity [90]. These advantages have driven drug discovery efforts towards the search for multi-agent therapies [202, 90, 284, 38].

Despite the obvious benefits, there are many challenges associated with the identification of multi-agent therapies. A sensitive, but low-throughput test for synergy is the dose-matrix response assay; in its simplest form it tests serial dilutions of two compounds in all possible permutations. The results from this assay can be analyzed with respect to different models for quantifying synergy. Each model defines a baseline efficacy level for the compounds, when used in combination at concentrations X and Y, describing the expected level if the compounds are not synergistic. The Loewe additivity model defines the baseline as the level that would be expected if a compound were in fact combined with itself [168]. The Bliss boosting model, an extension of the Bliss independence model [35], defines the baseline level as $I_{\mathrm{Mult}} = I_X + I_Y - I_X I_Y$, where $I_X$ and $I_Y$ are the efficacy levels of the compounds in isolation at concentrations X and Y, respectively [162]. Alternatively, the potentiation model defines the baseline level as $I_{\mathrm{Pot}} = \max(I_X, I_Y)$ [162]. The utility of any of these models depends on the comprehensiveness of the dose-matrix response data.

Large-scale searches have demonstrated that high-throughput screens of thousands of compounds can be straightforward [279], but usually these screens can only test a small fraction of the exponential number of chemical combinations available. Moreover, simplified dose-matrix assays are commonly used by these approaches, but the simplification may result in a failure to test the compound concentrations at which synergies occur, and therefore result in reduced synergy detection. Several lines of research address these problems [38, 279]. Some efforts reduce the scale issue by screening only combinations that include a particular compound of interest (i.e. by fixing one component). Other approaches tackle challenges later in the therapy development pipeline using in silico approaches to predict how two compounds act on pathways to achieve additive or synergistic effects [162].

Although improvements in the scale and sensitivity of synergy identification techniques promise a greater exploration of combinatorial chemical space, it is unlikely that experimental techniques will be sufficient to completely survey this vast space in a cost-effective and timely fashion. Consequently, there is a clear need for an approach that winnows this space to a manageably large set of combinations that is enriched for synergistic combinations. The combinations in the set could then be rigorously tested experimentally. A suggested experimental approach to finding this set entails an iterative “maximal damage” search [4, 161]. In each iteration of the search, the most effective combination from the previous iteration is tested with all other compounds separately to identify a combination that is more effective. However, this directed strategy starts with the most effective compound and will thus miss potentiating synergies between compounds that incur minimal damage separately. In contrast, an accurate in silico approach would alleviate this challenge of synergy identification by enabling comprehensive and efficient exploration of the combinatorial space. Such a strategy could employ data from single compound treatments to effectively predict which combinations are most likely to behave synergistically. There are several approaches in the literature, including that from Nelander and colleagues [190], that attempt to use data from perturbation screens and prior knowledge regarding the targets of compounds to model the effects of these compounds when they are used alone or in combination. This approach is currently limited to compounds with known targets, but such an approach could potentially be extended to predict synergistic compound pairs [190].

The use of chemogenomic profiles offers promise for characterizing the global cellular response to an arbitrary compound, for predicting the mode of action of a compound, and for inferring the functions of genes. Here we focus on chemogenomic profiles generated for Saccharomyces cerevisiae where each member of the yeast gene deletion library is grown in the presence of a particular compound, and the resultant growth fitness is recorded. Strains with reduced fitness in comparison to untreated or wild type cells suggest that the loss of particular genes confers sensitivity to the compound. For example, a set of genes involved in multi-drug resistance was identified by finding commonalities between yeast chemogenomic profiles of a chemically diverse panel of compounds [205, 120]. It has also been established that similarity between chemogenomic profiles often implies a similarity in the mode of action of the corresponding compounds [206]. In other words, two compounds that induce sensitivity in many of the same gene deletion strains may target similar cellular pathways. Conversely, strains that behave similarly across a panel of compounds may indicate that the corresponding genes are functionally related [45, 159, 111]. The ability of chemogenomic profiles to predict similarities in cellular response, mode of action and gene function poses the question as to whether they can also be used to predict synergy. This aspect has not been investigated to date and would provide a simple approach to synergy prediction that does not require the prior knowledge of the targets of compounds and extensive modelling of previous approaches [190].

We introduce here a combined experimental and bioinformatics approach to identify antifungal synergies. In particular, for each compound of interest, we obtain a chemogenomic profile, which we define as a set of genes whose deletions confer sensitivity to a given compound. The next step is to computationally measure the similarity between pairs of profiles. We establish that compound pairs that have correspondingly similar profiles are more likely to be synergistic when compared with randomly chosen compounds. This approach exploits the fact that chemogenomic profiles make compounds instantly comparable in silico: whereas exhaustive screening of only pairwise combinations already necessitates a quadratic number of dose-matrix assays, the computational method requires only a linear number of chemogenomic profiles and a small number of subsequent validation assays relative to the total number of possible combinations. Our approach is thus a practical way to comprehensively search the vast chemical space for synergistic compounds.

We validate this method by assessing the antifungal activity of compound combinations in S. cerevisiae and in the fungal pathogen Candida albicans. Infections by Candida species are an increasing problem, especially in patients who are immunocompromised [107]. We show that our approach successfully predicts antifungal synergies that occur in S. cerevisiae and in C. albicans.
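
The Bliss boosting and potentiation baselines introduced earlier in this section can be illustrated with a minimal Python sketch. It assumes that efficacy levels are expressed as fractions between 0 and 1 (an assumption of this sketch, since the formula for I_Mult is only meaningful on that scale); the function names and the example values are hypothetical and are not taken from the study.

```python
def bliss_boosting_baseline(i_x: float, i_y: float) -> float:
    """Baseline I_Mult = I_X + I_Y - I_X * I_Y (efficacies assumed to lie in [0, 1])."""
    return i_x + i_y - i_x * i_y


def potentiation_baseline(i_x: float, i_y: float) -> float:
    """Baseline I_Pot = max(I_X, I_Y)."""
    return max(i_x, i_y)


def excess_over_baseline(observed: float, baseline: float) -> float:
    """Positive values mean the combination exceeds the non-synergy baseline."""
    return observed - baseline


# Hypothetical efficacy levels at concentrations X and Y (illustration only).
i_x, i_y, i_observed = 0.30, 0.20, 0.75

i_mult = bliss_boosting_baseline(i_x, i_y)
i_pot = potentiation_baseline(i_x, i_y)
print(f"I_Mult = {i_mult:.2f}, excess = {excess_over_baseline(i_observed, i_mult):.2f}")
print(f"I_Pot  = {i_pot:.2f}, excess = {excess_over_baseline(i_observed, i_pot):.2f}")
```

Because I_Mult is never smaller than I_Pot when both efficacies lie in [0, 1], a combination that exceeds the Bliss boosting baseline also exceeds the potentiation baseline, but not necessarily the reverse.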

6.4 Results

Our goal was the identification of compound pairs that exhibit antifungal synergy. There are two types of antifungal synergy: the constituent compounds act synergistically to kill fungal cells (cytotoxic synergy), or to arrest growth only (fungistatic synergy). Although a combination may be fungistatic against one fungal species, it might exhibit more potent synergy against others. Therefore, it may be useful to further investigate whether combinations that are fungistatic against particular fungal species can be developed into antifungal therapies against other fungal species.

6.4.1 The collection and generation of chemogenomic profiles

S. cerevisiae, with its accessible genetic resources, was used as the model for fungal pathogens. The first step in our method was to collect from the literature the results of ~1,300 genome-wide sensitivity and lethality screens generated with a broad range of compounds [205, 52, 72, 14, 22, 77, 164, 257, 99, 206, 170, 120, 124]. This set forms our chemogenomic profile collection (Figure 6–1A, Table 7-S1). Although the screens were conducted differently (e.g. with diploids versus haploids; competitive versus non-competitive growth), the results of each screen permit the identification of a set of strains that are hypersensitive to the compound, which in turn defines a set of hypersensitive genes. We focus on this hypersensitive gene set format of a chemogenomic profile in our analyses.

Fluconazole, a widely used fungistatic drug with favourable pharmacokinetic and toxicological properties [105], would be an ideal constituent compound of a combinatorial antifungal therapy. We thus generated a de novo profile for fluconazole using the yeast haploid deletion strain collection [268]. As a control, we used the hypomorphic strain for the essential gene ERG11 [42, 232]. Fluconazole directly targets Erg11p and thus specifically inhibits its enzymatic activity in the biosynthesis pathway for ergosterol, an essential sterol in yeast [264]. As expected, fluconazole was lethal to the erg11 strain since inhibition of the already limited cellular amount of Erg11p likely decreased its activity to fatal levels. Although previous studies have identified strains that are sensitive to fluconazole [205], we re-screened the drug to focus on deletions that are lethal in the presence of fluconazole. The results define a set of genes that we call FCZ-Fungicidal (Table 7–1). We next validated the profile by determining the minimum inhibitory concentration (MIC) and minimum fungicidal concentration (MFC) for each strain (Figure 6–1B, Table 7–2). These values represent dosages where FCZ-Fungicidal strains are unable to recover after exposure to fluconazole, unlike wild type cells.

To exclude the possibility that any secondary mutations present in the deletion strains were responsible for the FCZ-Fungicidal phenotype, we complemented the FCZ-Fungicidal strains with plasmid-borne copies of their respective deleted genes to demonstrate the reversibility of the phenotype. The presence of the overexpressed gene enabled the transformants to survive lethal concentrations of fluconazole above their MFCs (data not shown) without conferring resistance to fluconazole beyond levels observed for the wild type. Therefore, the complementation results confirm, for every FCZ-Fungicidal strain, that the gene deletion is responsible for the FCZ-Fungicidal phenotype.

6.4.2 Components of the FCZ-Fungicidal Set

From the screen of 4,997 deletion strains, 21 strains were unable to recover after exposure to fluconazole, in addition to the erg11 hypomorphic strain (Table 7–1). Members of the SAGA histone acetyltransferase complex and genes with general RNA polymerase II transcription factor activity (e.g. members of the mediator complex) are significantly over-represented in the FCZ-Fungicidal set (adjusted $P = 3.7 \times 10^{-6}$ and 0.01, respectively; see Materials and Methods). Members of the vacuolar membrane H+-ATPase complex and cytoskeleton genes are also over-represented in the set (adjusted $P = 3.7 \times 10^{-6}$ and 0.03, respectively; see Materials and Methods). Taken together, genes involved with transcriptional regulation, vacuole function and cell structure are significantly associated with sensitivity to fluconazole.

6.4.3 Prediction of synergistic compounds

We assessed whether any given compound pair with a high level of similarity between its chemogenomic profiles is likely to exhibit antifungal synergy. A gold standard set of positive and negative examples of antifungal synergy was assembled for this purpose (Table 7–3). Specifically, the positive and negative examples are synergistic compound pairs curated from the literature and pairs that we showed are not synergistic in S. cerevisiae using a dose-matrix response assay (see Materials and Methods, Table 7–4), respectively. Moreover, the gold standard set is limited to compound pairs where each constituent compound is associated with at least one chemogenomic profile in our collection. Although it would be interesting to investigate the potential differences in the accuracy of a synergy predictor built exclusively from either diploid- or haploid-based profiles, there are too few gold standard examples that are associated with both types of profiles to enable such a comparison (i.e. four positive examples). Similarly, there are too few examples to compare the accuracy of predictors built exclusively from either profiles generated from competitive or non-competitive growth assays (the literature contains only one positive example). Therefore, the gold standard set, together with our complete chemogenomic profile collection, was used to evaluate three different pairwise measures of chemogenomic profile similarity for their ability to predict antifungal synergy.

Previous studies suggest that the vast majority of compound pairs do not exhibit antifungal synergy. For example, Borisy and colleagues tested 560 reference-listed compounds (i.e. known to have some bioactivity) in pairwise combination with fluconazole using a dose-matrix proliferation assay in fluconazole-resistant C. albicans [38]. They described one synergistic combination, although they also confirmed 20 combinations as potentially synergistic because each of these combinations shows an effect that is greater than the baseline level defined by the highest single agent model, i.e. the larger of the effects produced by the constituent agents when they are applied singly. Overall, their results suggest that 0.2%-3.6% of the tested combinations exhibit antifungal synergy (and the limited number of antifungal synergies reported in the literature in general suggests that synergy is even rarer in other chemical libraries). The scarcity of synergy would suggest that the evaluation of a synergy predictor should place great emphasis on the identification of true synergies.

It is standard practice to evaluate a predictor by estimating its receiver-operating characteristic (ROC) curve. However, applying this type of evaluation to a synergy predictor would equally emphasize the identification of true synergies and false positives. Furthermore, the estimated rarity of antifungal synergy implies that a ROC curve would be estimated with a very small fraction of all negative examples of synergy in the chemical space covered by our chemogenomic profile collection. That is, only 30 out of the estimated ~175,000 negative examples are known in our study, where the total number of negative examples is based on the estimated frequency of antifungal synergy, 3.6% [38]. A ROC curve estimated with the small negative gold standard set would likely hide the utility of the synergy predictor simply because our sample of negative examples is not sufficiently representative of the complete negative set. Therefore, instead of estimating ROC curves, we evaluated each synergy predictor by estimating to what degree its predictions are enriched for true synergies. That is, we computed a prediction score for every positive and negative example in our gold standard set and estimated to what degree the subset predicted to be synergistic is enriched with positive examples (with a hypergeometric test). This type of evaluation places greater value on the identification of true synergies, as desired. We also estimated the true synergy enrichment of predictions made using random permutations of our chemogenomic profile collection (see Materials and Methods). The enrichment estimates from the permutations establish a baseline enrichment distribution. The significance of the true synergy enrichment from the observed data was computed relative to this baseline distribution. Significant enrichment would suggest that testing a set of compound pairs that are predicted to be synergistic via profile similarity is expected to yield significantly more true synergies than testing an equal number of randomly selected compound pairs.

The first measure of chemogenomic profile similarity that we assessed quantifies the significance of the overlap between two hypersensitive gene sets (see the example in Figure 6–2A). In particular, when x genes are observed in both hypersensitive gene sets, the measure is the probability of obtaining x or more genes in the overlap by chance (i.e. a P value from a hypergeometric distribution). With this gene-based profile similarity measure, a compound pair is predicted to be synergistic if its P value is less than or equal to a given threshold. We evaluated this profile similarity measure as a synergy predictor at different thresholds (Figure 6–3A). The predictions defined by the threshold $P \le 10^{-6.5}$ exhibit a true synergy enrichment that represents a significant improvement over the expected baseline level (P = 0.0236; see Figure 6–3B). Furthermore, this threshold produces the most significant improvement and is thus optimal for defining synergy predictions. Taken together, these results suggest that chemogenomic profile similarity predicts antifungal synergy.
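
The gene-based similarity measure described above can be sketched in Python as an upper-tail hypergeometric probability. This is a minimal illustration rather than the code used in the study: the choice of background population (here, the number of genes screened) is an assumption, and the gene names and set sizes are hypothetical.

```python
from math import comb


def hypergeom_tail(k: int, population: int, successes: int, draws: int) -> float:
    """Return P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favourable / comb(population, draws)


def overlap_p_value(genes_a, genes_b, n_background: int) -> float:
    """Probability of seeing at least the observed overlap between two gene sets by chance."""
    a, b = set(genes_a), set(genes_b)
    return hypergeom_tail(len(a & b), n_background, len(a), len(b))


def is_predicted_synergistic(p_value: float, threshold: float = 10 ** -6.5) -> bool:
    """Apply the P <= 10^-6.5 threshold quoted in the text."""
    return p_value <= threshold


# Hypothetical hypersensitive gene sets drawn from an assumed background of 20 genes.
profile_a = {"g1", "g2", "g3", "g4"}
profile_b = {"g3", "g4", "g5", "g6"}

p = overlap_p_value(profile_a, profile_b, n_background=20)
print(f"overlap P value = {p:.4f}, predicted synergistic: {is_predicted_synergistic(p)}")
```

The same tail function could also express the enrichment evaluation described earlier, by calling `hypergeom_tail` with the number of positive examples among the predictions, the size of the gold standard set, the number of positive examples, and the number of predictions. The permutation baseline is not reproduced here because its details are given in Materials and Methods.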
We also assessed a chemogenomic profile similarity measure that accounts for additional commonality that is observable by viewing the profiles at the level of protein complexes (Figure 6–2B). That is, although one subunit of a protein complex may be associated with sensitivity to one compound and a different subunit associated with a second compound, it is nonetheless interesting that the same complex is associated with sensitivity to either compound. Each profile was converted into a complex-based profile defined by a list of 0s and 1s indicating the absence or presence (respectively) of each complex, and also each non-complex gene, in the hypersensitive gene set. The similarity between two such profiles was measured via weighted Pearson correlation. A protein complex with many subunits is weighted less because it is less rare for that complex, via any one of its subunits, to be included in any given hypersensitive gene set. As with the gene-based profile similarity measure, a compound pair is predicted to be synergistic if the similarity of the corresponding profiles is greater than or equal to an optimal threshold. The enrichment of the predictions with true synergies is more significant for the complex-based measure than for the gene-based measure, relative to the expected baseline levels (P = 0.0092 and 0.0378, respectively; see Figure 7–1A). Taken together, these results suggest that the complex-based profile similarity measure can predict synergy more effectively than the simpler gene-based measure.

Lastly, we assessed a profile similarity measure that exploits the detailed quantitative data available for a subset of our chemogenomic profile collection. Namely, for some profiles each gene is associated with a $\log_2$ ratio that reflects the growth of untreated versus chemically treated cells of the relevant deletion strain [120, 124, 206]. We thus considered the correlation across these $\log_2$ ratios as a measure of profile similarity. Again, a compound pair is predicted to be synergistic if the similarity of the corresponding profiles is greater than or equal to an optimal threshold.
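
The two correlation-based measures described above can be sketched together, since an ordinary Pearson correlation is the special case of the weighted form with equal weights. The sketch below is illustrative only. The text states that complexes with many subunits are weighted less but does not give the weighting formula in this section, so the weight of 1 / (number of subunits) used here is an assumption, and the complex and gene names are hypothetical.

```python
from math import sqrt


def complex_profile(hypersensitive, complexes, single_genes):
    """Binary vector: one entry per complex, then one entry per non-complex gene."""
    features = [int(bool(hypersensitive & members)) for members in complexes.values()]
    features += [int(gene in hypersensitive) for gene in single_genes]
    return features


def feature_weights(complexes, single_genes):
    """Assumed weighting: 1 / subunit count for complexes, 1 for non-complex genes."""
    return [1.0 / len(members) for members in complexes.values()] + [1.0] * len(single_genes)


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation; equal weights give the ordinary coefficient."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


# Hypothetical complexes and non-complex genes (illustration only).
complexes = {"complexA": {"a1", "a2", "a3"}, "complexB": {"b1", "b2"}}
single_genes = ["g1", "g2"]

# The two profiles hit different subunits of complexA and share only gene g1.
profile_1 = {"a1", "g1"}
profile_2 = {"a2", "g1"}

x = complex_profile(profile_1, complexes, single_genes)
y = complex_profile(profile_2, complexes, single_genes)
r = weighted_pearson(x, y, feature_weights(complexes, single_genes))
print(x, y, round(r, 3))
```

In this toy case the gene-level overlap is a single gene, yet the complex-level vectors are identical, which is the kind of additional commonality the complex-based measure is meant to capture.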
Unlike the measures of profile similarity based on hypersensitive gene sets, the enrichment of the predictions with true synergies is not significant for the $\log_2$ ratio-based measure, relative to the expected baseline level (P = 0.3109; see Figure 7–1A). Therefore, we focus on the gene-based profile similarity measure as a predictor of synergy (using the threshold $P \le 10^{-6.5}$) due to its simplicity and the significant enrichment of its predictions with true synergies (Figure 6–1A).

Consistent with the evidence that antifungal synergy is rare, the majority of compound pairs in the chemical space covered by our chemogenomic profile collection are not predicted to be synergistic (see the x-axis of Figure 7–2A). In addition, the estimated accuracy of the predictor (0.745) is significantly above the expected baseline level (P = 0.018; see Figure 7–2B), despite the fact that the estimate is likely based on a small fraction of all negative examples. If instead we over-estimate and assume that all compound pairs in our chemical space are negative examples (i.e. ~182,000 instead of the estimated ~175,000 examples, where the total number of negative examples is based on the estimated frequency of antifungal synergy, 3.6% [38]), our estimates would be based on a more representative set of negative examples (see Figure 7–2A for the ROC curve). At the selected threshold, the estimated true positive rate is ~67% and, using the overlarge negative set, the estimated false positive rate and accuracy are ~5% and ~95%, respectively. Furthermore, the level at which the predictions are enriched with true synergies would increase if the number of negative examples in the gold standard set were to increase and if all new examples were predicted as true negatives (Figure 7–2C). Taken together, we have shown statistically that the predictor is surprisingly accurate and the estimate of its accuracy will increase as the community develops a more representative gold standard set. Therefore, we have shown that our predictor is useful for the efficient identification of antifungal synergies.

The next step in the synergy identification method for our fluconazole example requires measuring the similarity between the FCZ-Fungicidal profile and each member of the chemogenomic profile collection.
The FCZ-Fungicidal profile is significantly similar to ten profiles ($P \le 10^{-6.5}$) and these other profiles are associated with eight different compounds (Table 7-S6). Consequently, eight compounds are predicted to be synergistic with fluconazole through the FCZ-Fungicidal profile.

6.4.4 Validation of predicted synergies in S. cerevisiae

Of the eight compounds predicted to be synergistic with fluconazole through the FCZ-Fungicidal profile, seven were tested for fungistatic and cytotoxic synergy with fluconazole in S. cerevisiae: latrunculin A, caspofungin, tunicamycin, cyclosporin A, FK506, alverine citrate and wortmannin. Fenpropimorph is predicted to be synergistic with fluconazole through a different fluconazole profile ($P = 9.20 \times 10^{-45}$). Therefore, we also included fenpropimorph in our synergy tests. In total, eight predicted fluconazole combinations were experimentally tested for synergy.

We experimentally examined each compound combination using a dose-matrix response assay that measures the growth of treated cells. The results were used to quantify growth arrest synergy via the Loewe additivity model [168, 20] (see Materials and Methods, Table 7–4). The dose-matrix response data were also fitted to Bliss boosting and potentiation models of synergy [162] (see Materials and Methods, Table 7–5). There is partial agreement between the results from the Loewe additivity model and the other models. However, we chose to identify synergies relative to the additive compound-with-itself baseline since it is the most conservative of the tested models. Each dose matrix of treated cells was also spotted on YPD to examine the recovery of the cells post-treatment (Figure 6–4). The absence of visible colonies after 24 hours suggests that the treatment has some cytotoxic effects. Synergy in terms of this cytotoxic phenotype was also quantified with the Loewe additivity model (see Materials and Methods). The cytotoxicity of compound combinations at particular concentrations was confirmed by the large reduction in the number of colony forming units of treated versus untreated cells (data not shown). Compound combinations that exhibit growth arrest but not cytotoxic synergy are referred to as exhibiting fungistatic synergy. Figures 6–4C and 6–4D show examples of fungistatic and cytotoxic synergy (respectively) in S. cerevisiae.

Table 6–1: Compound pairs that exhibit antifungal synergy. (a) Previously shown to be synergistic. Abbreviations: AlvC, alverine citrate; CASP, caspofungin; CsA, cyclosporin A; FCZ, fluconazole; FEN, fenpropimorph; LatA, latrunculin A; TUN, tunicamycin; WM, wortmannin.

Compound pair      S. cerevisiae   C. albicans
AlvC + FCZ         fungistatic     cytotoxic
CASP + FCZ         -               cytotoxic
CsA + FEN (a)      fungistatic     cytotoxic
CsA + FCZ (a)      cytotoxic       cytotoxic
CsA + TUN          cytotoxic       cytotoxic
FEN + FK506 (a)    fungistatic     -
FEN + FCZ (a)      fungistatic     fungistatic
FEN + WM           cytotoxic       -
FK506 + FCZ        fungistatic     cytotoxic
FK506 + TUN        cytotoxic       cytotoxic
FK506 + WM         cytotoxic       -
FCZ + LatA         -               cytotoxic
FCZ + WM           cytotoxic       cytotoxic

Five compounds were validated as synergistic with fluconazole, including fenpropimorph (Figure 7–3A and Table 7–4). Furthermore, we noticed that many of the compounds predicted to be synergistic with fluconazole are also predicted to be synergistic with each other (Figure 6–2C). The same observation can be made based on predictions with the complex-based profile similarity measure (Figure 6–2D). We thus extended our validation efforts to include ten pairings of the predicted fluconazole partners. These pairings include two that are not predicted to be synergistic: cyclosporin A + fenpropimorph and fenpropimorph + wortmannin. Of the 18 experimentally tested combinations in total, eleven showed a synergistic relationship with six and five demonstrating fungistatic and cytotoxic effects, respectively (Table 6–1, Figure 7–3 and Table 7–4). The two synergies involving fenpropimorph listed above are false negatives, although they are consistent with the observation that compounds that are synergistic with fluconazole tend to be synergistic with each other. Taken together, the results indicate a validation success rate of 56% in S. cerevisiae.
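
The Loewe additivity quantification used in this section can be illustrated with a short Python sketch. Under Loewe additivity, a combination of doses (d_X, d_Y) that reaches a given effect satisfies d_X / D_X + d_Y / D_Y = 1, where D_X and D_Y are the single-agent doses reaching the same effect; an index below 1 falls below the additive compound-with-itself baseline. The exact procedure used in the study is described in Materials and Methods and is not reproduced here. The dose grid, the growth-arrest calls, and the convention that the first row and column hold the single-agent series are all assumptions of this sketch.

```python
def loewe_index(dose_x, dose_y, alone_x, alone_y):
    """Loewe combination index: 1 means additive, below 1 means more than additive."""
    return dose_x / alone_x + dose_y / alone_y


def minimal_dose_alone(doses, inhibited_alone):
    """Smallest non-zero single-agent dose that reaches the effect, or None."""
    hits = [d for d, hit in zip(doses, inhibited_alone) if hit and d > 0]
    return min(hits) if hits else None


def best_loewe_index(doses_x, doses_y, inhibited):
    """Lowest index over all inhibitory combinations in a dose matrix.

    inhibited[i][j] is True when growth is arrested at doses_x[i] and doses_y[j].
    Assumes doses_x[0] == 0 and doses_y[0] == 0, so that row 0 and column 0
    hold the single-agent series.
    """
    alone_x = minimal_dose_alone(doses_x, [row[0] for row in inhibited])
    alone_y = minimal_dose_alone(doses_y, inhibited[0])
    if alone_x is None or alone_y is None:
        raise ValueError("a single agent never reaches the effect in this matrix")
    indices = [
        loewe_index(dx, dy, alone_x, alone_y)
        for i, dx in enumerate(doses_x)
        for j, dy in enumerate(doses_y)
        if dx > 0 and dy > 0 and inhibited[i][j]
    ]
    return min(indices) if indices else None


# Hypothetical two-fold dilution series and growth-arrest calls (illustration only).
doses_x = [0, 1, 2, 4, 8]
doses_y = [0, 1, 2, 4, 8]
T, F = True, False
inhibited = [
    [F, F, F, F, T],  # compound X absent
    [F, F, F, T, T],
    [F, F, T, T, T],
    [F, T, T, T, T],
    [T, T, T, T, T],  # compound X at its single-agent inhibitory dose
]

print(best_loewe_index(doses_x, doses_y, inhibited))
```

How far below 1 the index must fall before a pair is called synergistic is a separate choice; this sketch reports the index only and does not assume a cutoff.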

6.4.5 Validation of predicted synergies in C. albicans

We sought to identify synergies in C. albicans that establish potential multi-agent therapies, after validating our approach in the model S. cerevisiae. Four compound pairs that we identified as synergistic in S. cerevisiae are already described in the literature as synergistic in Candida species, suggesting that our approach may successfully identify synergies in C. albicans (Table 6–1). Using the dose-matrix response assay, the 18 combinations tested in S. cerevisiae were tested in C. albicans, resulting in the identification of ten synergistic combinations in the fungal pathogen, nine of which are cytotoxic (see Figure 6–4E,F for examples of fungistatic and cytotoxic synergy, respectively, and see Figure 7–4 for all synergies identified in C. albicans). As before, we used the Loewe additivity model to quantify synergy (Table 7–6), although we also fitted the dose-matrix response data to other synergy models (Table 7–7). Table 6–1 lists the complete set of synergistic compound pairs that we identified. We showed that eight synergistic combinations identified in S. cerevisiae are also synergistic in C. albicans, and we identified two additional synergies in the fungal pathogen that could not be identified in S. cerevisiae (caspofungin + fluconazole and fluconazole + latrunculin A). Taken together, the validation success rate for the predictor of antifungal synergy is 69%. This implies that our method identifies true synergies at a rate that is ~20-fold better than the estimated rate for testing randomly selected compound pairs.

Finally, we tested one of the novel synergistic combinations in fluconazole-resistant clinical isolates of C. albicans. These strains acquired fluconazole resistance by mutations that either lead to the upregulation of the target of fluconazole (ERG11, strain S2 [79]), or increased expression of a multi-drug efflux pump (MDR1, strain G5 [185]). We chose to test fluconazole (FDA-approved) in combination with wortmannin, analogues of which are in Phase I clinical trials [196]. Even when applied at concentrations ~1000-fold higher than the MIC in corresponding wild type strains, fluconazole has no readily detectable effect on cell growth in the clinical isolates. However, the combination of fluconazole and wortmannin exhibits a strong cytotoxic effect (Figure 6–5), suggesting potential clinical relevance.
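
The validation rates quoted in Sections 6.4.4 and 6.4.5 can be cross-checked against Table 6–1 with a short Python tally. The table entries below are transcribed from Table 6–1. The denominator is an interpretation rather than something stated explicitly: it assumes that the success rates are computed over the 16 tested pairs that were predicted to be synergistic (18 tested pairs minus the two pairs that were not predicted), and that the random-testing rate is the 3.6% upper estimate cited from [38].

```python
# (S. cerevisiae, C. albicans) outcomes transcribed from Table 6-1; None means no synergy.
TABLE_6_1 = {
    ("AlvC", "FCZ"): ("fungistatic", "cytotoxic"),
    ("CASP", "FCZ"): (None, "cytotoxic"),
    ("CsA", "FEN"): ("fungistatic", "cytotoxic"),
    ("CsA", "FCZ"): ("cytotoxic", "cytotoxic"),
    ("CsA", "TUN"): ("cytotoxic", "cytotoxic"),
    ("FEN", "FK506"): ("fungistatic", None),
    ("FEN", "FCZ"): ("fungistatic", "fungistatic"),
    ("FEN", "WM"): ("cytotoxic", None),
    ("FK506", "FCZ"): ("fungistatic", "cytotoxic"),
    ("FK506", "TUN"): ("cytotoxic", "cytotoxic"),
    ("FK506", "WM"): ("cytotoxic", None),
    ("FCZ", "LatA"): (None, "cytotoxic"),
    ("FCZ", "WM"): ("cytotoxic", "cytotoxic"),
}

# Pairs that were tested although they were not predicted to be synergistic.
NOT_PREDICTED = {("CsA", "FEN"), ("FEN", "WM")}
N_TESTED = 18
N_PREDICTED = N_TESTED - len(NOT_PREDICTED)  # assumed denominator: 16

in_yeast = {pair for pair, (sc, _) in TABLE_6_1.items() if sc}
in_candida = {pair for pair, (_, ca) in TABLE_6_1.items() if ca}

# Validated predictions exclude the two synergies that were not predicted.
rate_yeast = len(in_yeast - NOT_PREDICTED) / N_PREDICTED
rate_either = len(set(TABLE_6_1) - NOT_PREDICTED) / N_PREDICTED

RANDOM_RATE = 0.036  # upper estimate of the synergy frequency cited from [38]
fold_improvement = rate_either / RANDOM_RATE

print(f"synergies in S. cerevisiae: {len(in_yeast)}, in C. albicans: {len(in_candida)}")
print(f"shared by both species: {len(in_yeast & in_candida)}")
print(f"rate in S. cerevisiae: {rate_yeast:.1%}, in either species: {rate_either:.1%}")
print(f"fold improvement over random testing: {fold_improvement:.1f}")
```

Under these assumptions the tally gives rates that round to the 56% and 69% quoted in the text, and a fold improvement of roughly 19, which is in line with the approximate 20-fold figure.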

6.4.6 A comparison of predictors dependent on haploid- and/or diploid- based profiles

After adding our novel synergistic compound pairs to the gold standard set of antifungal synergies, we revisited the question of whether the type of chemogenomic profiles used by the synergy predictor influences the enrichment of its predictions with true synergies. We therefore measured the significance of the enrichment associated with predictors dependent on haploid-based profiles only, diploid-based profiles only, and both haploid- and diploid-based profiles. However, this comparison could only be made using profiles generated from a competitive growth assay (i.e. the assay can use haploids or diploids) since there is an insufficient number of gold standard examples to also make the comparison in the context of profiles generated from a non-competitive growth assay. Although augmenting the gold standard set with validated synergies from this study may bias the enrichment values, all three predictors were subjected to the same bias since they were all evaluated with the same gold standard set, and here we are only interested in comparing the predictors relative to each other. As before, optimal prediction thresholds were selected for each of the three variants of the synergy predictor in the comparison. Our results suggest that the variant that exclusively uses haploid-based profiles produces predictions that are enriched with true synergies most significantly relative to the expected baseline enrichment level, followed by the variant that uses both haploid- and diploid-based profiles and the variant that exclusively uses diploid-based profiles (P = 0.0372, 0.0572, and 0.0794, respectively; see Figure 7–1B). However, our collection of haploid-based profiles contains data for only 20 compounds. Therefore, it is currently best to use all types of chemogenomic profiles with our approach for better coverage of chemical space, and thus, for enabling the potential identification of more synergistic combinations.

Our results show that chemogenomic profile similarity predicts antifungal synergy. The similarity values for all pairings of the chemogenomic profiles in our collection are contained in Table 7-S10. These data can be used immediately to identify compound pairs that are likely to exhibit antifungal synergy, and thus should stimulate the search for effective combinatorial therapies.

6.5 Discussion

We have developed a bioinformatics-driven approach using chemogenomic profiles to predict compound pairs that exhibit antifungal synergy. First, we collected sensitivity-based chemogenomic profiles from the literature and generated a profile in S. cerevisiae for the widely used fungistatic drug fluconazole. We then showed statistical evidence supporting the use of our gene-based measure of profile similarity for predicting synergistic compound pairs. Our predictions of synergistic compound pairs validated with a high success rate. Overall, the results confirm that chemogenomic profile similarity can predict antifungal synergies.

Chemogenomic profiles can be generated in several ways. As more profiles of different types become available, it would be interesting to further investigate the relative utility of each type for the prediction of synergy. Our collection includes profiles based on the competitive or non-competitive growth of diploids or haploids. Despite the heterogeneity of our collection, we used it in its entirety for better coverage of chemical space when predicting antifungal synergies. For example, if we had limited the chemogenomic profile collection to haploid-based profiles, the cytotoxic synergy involving latrunculin A would not have been predicted because a haploid-based profile was not generated for this compound. Despite the expected differences in the profiles simply due to the different ways in which they were generated, our approach was able to identify synergies based on the similarity between profiles generated with different methods (e.g. the FCZ-Fungicidal profile derived from the non-competitive growth of haploids and the latrunculin A profile derived from the competitive growth of diploids). Therefore, different types of profiles may lead to false negatives; however, our approach generates predictions that are enriched with true antifungal synergies more significantly than expected by chance.

A chemogenomic profile encodes the genes involved in resistance to a particular compound [159, 206]. If the profiles of two compounds are similar, there is likely some underlying drug resistance machinery to which both apply stress. Cells treated with both compounds concurrently simply may not be able to mount an effective response to the challenge, and the compounds thus exhibit antifungal synergy. Previous studies have identified drug resistance machinery [205]. Interestingly, the FCZ-Fungicidal set (Table 7–1) includes the pleiotropic drug pump PDR5, genes that regulate the transcription of this pump as members of the SAGA and mediator complexes [93], and genes with vacuolar functionality. In short, the FCZ-Fungicidal set includes genes that have previously been associated with drug resistance. Our method therefore exploits the drug response machinery identified by chemogenomic profiling to predict synergy.

The FCZ-Fungicidal set also includes genes associated with the cytoskeleton or cell wall, two of which (BEM2, SLT2) are synthetically lethal with the target of fluconazole, ERG11 [205].
It is possible that these genes become vital for maintaining the structural integrity of the cell to compensate for the instability that may result from reduced ergosterol production. This is a possible explanation for why these genes are associated with resistance to fluconazole. By similar reasoning, we would expect these genes to be associated with resistance to latrunculin A, a compound that disrupts the actin cytoskeleton [18]. Indeed, the hypersensitive gene set of latrunculin A overlaps significantly with the FCZ-Fungicidal set, and the overlap includes genes associated with the cell wall or cytoskeleton (Table 7-S6). Latrunculin A was thus predicted as synergistic with fluconazole and this synergy was subsequently shown in C. albicans. Therefore, the FCZ-Fungicidal genes provide mechanisms to generate synergy.

Previous work suggests that compounds with similar chemogenomic profiles have similar modes of action [205]. However, in both S. cerevisiae and C. albicans we identified synergy between fluconazole and cyclosporin A, which target ergosterol biosynthesis [264] and calcineurin [263], respectively. While the fluconazole and cyclosporin A profiles have distinguishing features (as expected due to the distinct targets of the compounds), our method uses a statistic that recognizes the profile similarities as significant, given what is possible by chance. That is, an overlap of five genes between the hypersensitive gene sets is in fact highly significant given that there are approximately 5,000 possible genes for each set (with approximately 20 genes). As a predictor, our gene-based measure of profile similarity is thus useful for identifying synergies that might be unexpected given what is already known about the participating compounds.

The results establish that our method predicts synergy well in S. cerevisiae. It can also predict synergy in C. albicans based on chemogenomic profiles in S. cerevisiae. Despite differences in regulatory circuitry that have been observed between the fungal species [172, 122, 252], the majority of the synergies identified in C. albicans were transferred directly from S. cerevisiae. This suggests that predicted synergies could be tested in C. albicans immediately, without first testing predicted combinations in S. cerevisiae to filter out unlikely candidates. We have therefore shown that our method effectively uses S. cerevisiae resources to identify antifungal synergies in C. albicans. Furthermore, our method predicts synergies previously shown in other fungal pathogens (see Table 7–3 for references) and it would thus be interesting to further investigate whether our method can predict broad spectrum antifungal combinations that exhibit synergy.

We also statistically evaluated an alternative profile similarity measure, based on the correlation of log2ratios that quantify the growth of untreated versus treated cells, as a predictor of antifungal synergy. We found that this measure predicts synergy markedly worse than the validated gene-based measure (Figure 7–1A). Interestingly, the gene-based measure compares log2ratio profiles by first converting them into hypersensitive gene sets. This suggests that the quantitative profile data that are useful for predicting synergy are effectively summarized by a hypersensitive gene set.

Our method for predicting antifungal synergy clearly requires chemogenomic profiles for compounds. Although the construction of the chemogenomic profile for a compound is a significant task, the profile would be a beneficial resource in general because it is a multi-valued description of the bioactivity of a compound and can be used in all future studies. In fact, the number of published chemogenomic profiles is increasing [161] and as a result, the scope of our synergy prediction method is expanding.

Moreover, an alternative profile similarity measure was defined to enable analysis at the protein complex level. The statistical evaluation of this measure as a predictor of antifungal synergy (using the gold standard set) suggests that the measure actually predicts synergy better than the validated gene-based approach (Figure 7–1A), although this may be an artefact of the small size of the gold standard set. Nevertheless, with a variant of the complex-based measure, it may be feasible to predict synergy using chemogenomic profiles built solely from strains pertaining to key members of protein complexes, thereby reducing the scale of the screening task. This would represent another important advance in our methodology.

In addition, our method is efficient because it is capable of reducing a huge set of all possible compound pairs down to a set of manageable size for thorough synergy testing in fungi, and it may also be applicable to other organisms. Our results show that compounds that are synergistic with fluconazole tend to be synergistic with each other, suggesting that our method is also able to identify compound synergy-clusters. Overall, the net gain from our method versus traditional screens is greater since costs are reduced and sensitivity is increased.

Importantly, our method identified novel drug relationships including the cytotoxic synergy between fluconazole and wortmannin in S. cerevisiae, C. albicans and drug-resistant clinical isolates of C. albicans. Fluconazole is an FDA-approved drug and wortmannin analogues are in phase I clinical trials [196]. The method has thus uncovered a new synergistic combination that can be pursued as a viable therapy.

Combinatorial therapies have been widely used in different medical scenarios [284, 140]. However, to discover new combinations using the vast number of compounds available (>10 million compounds available; http://www.emolecules.com), screening strategies must be adapted to address the scale of the discovery task. We have developed a powerful tool for rapid synergy discovery that represents a promising step towards realizing the potential of combinatorial therapies. We have validated this approach with antifungal combinations and pointed out a potential path to attack the persisting problem of drug-resistant C. albicans strains in the clinic. It would thus be interesting to investigate whether our approach can be used to streamline the combinatorial therapy development process in other therapeutic situations.

6.6 Materials and Methods

6.6.1 Strains & media

The S. cerevisiae haploid strain BY4741 (MATa his3∆1 leu2∆0 met15∆0 ura3∆0) and the complete yeast deletion array collection in the BY4741 background were obtained from the American Type Culture Collection. S. cerevisiae was cultured in rich media (YPD), synthetic complete media (SC), or synthetic drop-out media (SD-ura); for solid media, 2% agar was added. The C. albicans wild type strain SC5314, as well as the fluconazole- and multidrug-resistant strains S2 [79] and G5 [185], respectively, were cultured in YPDU media (YPD supplemented with 50 mg/l of uridine); for solid media, 2% agar was added. Amiodarone, benomyl, camptothecin, carboplatin, fenpropimorph, FK506, fluconazole, mycophenolic acid, myriocin, tunicamycin, and wortmannin were dissolved in DMSO; chlorpromazine, desipramine, doxycycline, MMS and nystatin were dissolved in water; cyclosporin A was dissolved in ethanol. Fluconazole was a gift from Pfizer Limited (Sandwich, Kent, UK) and all other compounds were purchased from Sigma.

6.6.2 Library screen to generate the lethality-based chemogenomic profile for fluconazole

96-well plates containing the American Type Culture Collection S. cerevisiae deletion strains were replicated with a 96-pin replicator (Boekel) to single-well Omnitray plates (Nalgene Nunc) containing YPD-agar and geneticin (200 µg/ml) and, simultaneously, to plates containing YPD-agar and fluconazole (85 µg/ml). Plates were incubated at 30 °C for 48 hours. Following incubation, cells on the YPD control and fluconazole plates were replicated to fresh YPD plates (without fluconazole) and incubated at 30 °C for 48 hours. Plates were scored for deletion strains that were unable to grow after exposure to fluconazole. Strains that fit this criterion were subsequently retested in minimum inhibitory concentration (MIC) and recovery assays (see below). The deletion strains that were unable to recover from fluconazole at concentrations that did not affect wild type cells were assigned a score of 1 in the chemogenomic profile and all other strains were assigned 0. Moreover, the genes associated with the deletion strains with score = 1 define our hypersensitive gene set for fluconazole (i.e. the FCZ-Fungicidal set).

6.6.3 MIC assays, recovery assays and compound synergy tests in S. cerevisiae and C. albicans

Antifungal sensitivity testing was done with a modified version of the CLSI (formerly NCCLS) procedure [NCCLS. Reference Method for Broth Antifungal Susceptibility Testing of Yeasts: Approved Standard-Second Edition. NCCLS document M27-A2]. Briefly, overnight cultures of wild type and deletion strains were diluted to an OD600 of 0.0005 for S. cerevisiae and an OD600 of 0.001 for C. albicans. Volumes of 50 µl of culture were inoculated into 96-well flat-bottom plates containing 50 µl of synthetic complete (SC) media with increasing concentrations of compound (in 2-fold serial dilutions). The cultures were grown without shaking at 30 °C for 24 hours and OD600 measurements were taken on a Tecan Safire microplate monochromator reader (Tecan, Austria, GmbH). The MIC was determined by the first well with a growth reduction of at least 95% in the presence of a compound compared to untreated cells. Cells were then spotted (2 µl) onto YPD plates and incubated at 30 °C for 48 hours to assess the extent to which cells recover from the treatments. The minimum fungicidal concentration (MFC) of a given strain was determined from these recovery assays (Tables 7–4 and 7–5).

Compound synergy interactions were assessed by growth in a dose-matrix titration assay. Volumes of 50 µl of each compound were 2-fold serially-diluted in SC media and dispensed into 96-well flat-bottom plates, either across the columns of the plate (compound A) or down the rows of the plate (compound B). Wells were then inoculated with 50 µl of wild type yeast prepared as in the MIC assay. Plates were incubated at 30 °C without shaking and OD600 measurements were taken after 24 hours. MICs were determined for the compounds alone and in combination by the first well with ≥ 95% decrease in absorbance compared to the control. The growth arrest synergy of a compound pair was quantified with respect to the Loewe additivity model [168] via the fractional inhibitory concentration index

$$\mathrm{FICI}_{\mathrm{growthArrest}} = \frac{\mathrm{MIC}_{A,\mathrm{inCombo}}}{\mathrm{MIC}_{A,\mathrm{alone}}} + \frac{\mathrm{MIC}_{B,\mathrm{inCombo}}}{\mathrm{MIC}_{B,\mathrm{alone}}}.$$

A compound pair is classified as synergistic if its FICI is ≤ 0.5, the standard threshold [168, 20]. Cells were also spotted (2 µl) onto fresh YPD plates and incubated at 30 °C for 24 hours to test for cytotoxic synergy. The minimum cytotoxic concentration (MCC) was defined for a compound alone and in combination as the lowest concentration that did not result in visible colonies on the plate. The cytotoxicity was confirmed by measuring colony forming units (cfu) after compound treatment. Cells were treated with both compounds at their MCCs, then plated on YPD plates and cfus were counted after 48 hours incubation at 30 °C. The cytotoxic synergy of a compound pair was quantified with

$$\mathrm{FICI}_{\mathrm{cytotoxic}} = \frac{\mathrm{MCC}_{A,\mathrm{inCombo}}}{\mathrm{MCC}_{A,\mathrm{alone}}} + \frac{\mathrm{MCC}_{B,\mathrm{inCombo}}}{\mathrm{MCC}_{B,\mathrm{alone}}}.$$

A compound pair was classified as exhibiting fungistatic synergy if we identified the pair as synergistic with respect to the growth arrest phenotype only.
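The MIC and FICI calculations described above can be illustrated with a minimal Python sketch. The function names, the input layout (parallel lists of concentrations and OD600 readings) and the absence of background subtraction are assumptions made for illustration; this is not the code used in the study.

```python
def mic(concentrations, od_values, od_untreated, reduction=0.95):
    """Return the lowest concentration whose OD600 is reduced by at least
    `reduction` relative to untreated cells, or None if no well qualifies.
    Assumption: OD values are already background-corrected."""
    for conc, od in sorted(zip(concentrations, od_values)):
        if od <= (1.0 - reduction) * od_untreated:
            return conc
    return None


def fici(a_in_combo, a_alone, b_in_combo, b_alone):
    """Fractional inhibitory concentration index for compounds A and B.
    The same formula applies to MCC values for the cytotoxic variant."""
    return a_in_combo / a_alone + b_in_combo / b_alone


def is_synergistic(fici_value, threshold=0.5):
    """Classify a pair as synergistic at the FICI <= 0.5 threshold."""
    return fici_value <= threshold


# Hypothetical 2-fold dilution series for a single compound.
example_mic = mic([1, 2, 4, 8], [0.90, 0.50, 0.04, 0.01], od_untreated=1.0)
example_fici = fici(a_in_combo=1, a_alone=4, b_in_combo=2, b_alone=8)
```

With these made-up readings, the MIC is the third well of the series, and a pair whose MICs both drop four-fold in combination sits exactly at the synergy threshold.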

6.6.4 Complementation assay

To demonstrate that fluconazole sensitivity was dependent on the particular gene deletion and not on an acquired secondary mutation, the deletion strains were transformed with plasmids carrying their respective deleted genes expressed from the galactose inducible GAL1 promoter [97, 133]. The transformants were incubated in two sequential overnight cultures. Cells were diluted and treated with fluconazole as described above and then incubated at 30 °C for 24 hours. After incubation, 2 µl of cultures were spotted onto fresh YPD plates, incubated at 30 °C for 2 days, and scored for growth. The complementation test was performed under both inducing (SC- 4% galactose) and non-inducing (SC- 2% glucose) conditions.

6.6.5 The chemogenomic profile collection

We collected the results of compound sensitivity screens described in the literature. The different screens used different schemes to score each deletion strain based on its observed level of sensitivity to a given compound (and the complete set of strain scores defines the chemogenomic profile of the compound). For each chemogenomic profile, we identified strains that were scored as moderately to highly sensitive to the compound by noting the strains with scores that surpass the threshold specified in Table 7-S1. The genes associated with these strains define the hypersensitive gene set of the compound.

6.6.6 Annotations of the FCZ-Fungicidal genes

Descriptions of the FCZ-Fungicidal genes were downloaded from the Saccharomyces Genome Database (SGD, ftp://ftp.yeastgenome.org/yeast/). Gene Ontology (GO)-based gene annotations [17] were used to test whether particular biological processes, molecular functions and cellular components are significantly over-represented in the FCZ-Fungicidal gene set. For each GO gene set, a P value was obtained from a hypergeometric test performed within the scope of the set of genes associated with strains that were used in the FCZ-Fungicidal screen. The P values were adjusted for multiple comparisons using the Benjamini & Hochberg method [27].
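As an illustration of the multiple-comparison step named above, the following Python sketch implements the standard Benjamini & Hochberg step-up adjustment. It is a generic implementation written for this example (the function name and the toy P values are hypothetical), not the code used in the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values, in the input order.

    Each P value is scaled by m / rank (rank 1 = smallest), and a running
    minimum taken from the largest P value downwards keeps the adjusted
    values monotone.
    """
    m = len(pvalues)
    # Indices sorted from the largest to the smallest P value.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P values for four GO gene sets.
example_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A GO gene set would then be reported as over-represented when its adjusted value falls below the chosen significance level.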

6.6.7 The gold standard set

A gold standard set of positive and negative examples of antifungal synergy was assembled to evaluate the synergy predictors (Table 7–3). Specifically, the 21 positive examples are synergistic compound pairs curated from the literature. The 30 negative examples are pairs that we showed are not synergistic in S. cerevisiae using a dose-matrix response assay (Table 7–4).

6.6.8 The measures of chemogenomic profile similarity

Consider profiles A and B and their associated hypersensitive gene sets $G_A$ and $G_B$, respectively. Let $U_A$ and $U_B$ represent the sets of all genes associated with strains that were screened to generate profiles A and B, respectively. $U_A$ and $U_B$ may differ if, for example, one screen involved essential genes (through heterozygous strains) and the other did not. We define $U = U_A \cap U_B$ as the scope of the statistical test that measures the similarity between profiles A and B, and therefore compute $G'_A = G_A \cap U$ and $G'_B = G_B \cap U$.

The gene-based profile similarity measure compares the hypersensitive gene sets associated with the profiles. The measure is defined as the P value obtained from a hypergeometric test that quantifies the significance of $|G'_A \cap G'_B|$ (i.e. the probability of obtaining an equal or larger number by chance), given $|G'_A|$, $|G'_B|$ and $|U|$.

The complex-based profile similarity measure compares hypersensitive gene sets that have been transformed into complex-based profiles (Figure 6–2B). Mappings of genes to GO-defined protein complexes were downloaded from SGD. Let $C_i$ represent the set of genes associated with complex $i$. The GO hierarchy subdivides some complexes into their constituent domains. In these cases, we treat each domain as a separate complex, and the genes associated with the domains are removed from the gene set of the parent complex (to avoid redundancy). For protein complex $i$, we compute $C'_i = C_i \cap U$. If $G' \cap C'_i \neq \emptyset$, we say that complex $i$ is present in $G'$, otherwise it is absent. We generate a complex-based profile $x_A$, defined as a vector of 0s and 1s indicating the absence or presence (respectively) of each complex, and also each non-complex gene, in $G'_A$. Similarly, $x_B$ was generated with $G'_B$. The similarity between $x_A$ and $x_B$ is measured via weighted Pearson correlation. Each non-complex gene is assigned a full weight of 1 and complex $i$ is assigned a weight equal to $1/|C'_i|$. That is, a protein complex with many subunits is weighted less because it is less rare (and thus less significant) for that complex, via any one of its subunits, to be present in any given $G'$.
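The gene-based measure defined earlier in this section can be sketched in Python as follows. The function names and the representation of gene sets as Python sets are assumptions made for illustration; the upper-tail hypergeometric probability is computed directly from binomial coefficients so that the sketch needs only the standard library.

```python
from math import comb


def hypergeom_upper_tail(k, n_a, n_b, n_u):
    """P(overlap >= k) when a set of n_b genes is drawn at random from a
    universe of n_u genes that contains n_a genes of the other set."""
    max_overlap = min(n_a, n_b)
    tail = sum(comb(n_a, i) * comb(n_u - n_a, n_b - i)
               for i in range(k, max_overlap + 1))
    return tail / comb(n_u, n_b)


def gene_based_similarity(g_a, g_b, u_a, u_b):
    """Restrict both hypersensitive gene sets to U = U_A & U_B and return
    the hypergeometric P value of their overlap (lower = more similar)."""
    u = u_a & u_b
    g_a_restricted = g_a & u
    g_b_restricted = g_b & u
    overlap = len(g_a_restricted & g_b_restricted)
    return hypergeom_upper_tail(overlap, len(g_a_restricted),
                                len(g_b_restricted), len(u))


# Toy example with hypothetical gene identifiers 0..9.
universe = set(range(10))
example_p = gene_based_similarity({0, 1, 2}, {2, 3, 4}, universe, universe)
```

In the toy example the two three-gene sets share one gene out of ten, which is unremarkable; overlaps drawn from much larger universes yield far smaller P values.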

The log2ratio-based profile similarity measure compares profiles that have log2ratio sensitivity scores. The log2ratios quantify the growth of untreated cells versus treated cells. We define $y_A$ as the vector of log2ratios of profile A, with one value specified for each gene in $U$. Similarly, we define $y_B$ for profile B, retaining the same gene order that is used in $y_A$. The similarity between $y_A$ and $y_B$ is measured with Pearson correlation.
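The two correlation-based measures above can be sketched with a single helper. The text does not spell out the weighted Pearson formula, so the sketch assumes the common definition based on weighted means, variances and covariance; the function names and toy vectors are hypothetical.

```python
from math import sqrt


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of vectors x and y with weights w.
    Assumption: standard weighted mean/covariance definition.
    Raises ZeroDivisionError if either vector has zero weighted variance."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    return cov / sqrt(var_x * var_y)


def pearson(x, y):
    """Unweighted Pearson correlation (e.g. for log2ratio vectors)."""
    return weighted_pearson(x, y, [1.0] * len(x))


# Toy complex-based profiles: two non-complex genes (weight 1) followed by
# two complexes with two subunits each (weight 1/2).
x_a = [1, 0, 1, 0]
x_b = [1, 0, 0, 1]
weights = [1.0, 1.0, 0.5, 0.5]
example_r = weighted_pearson(x_a, x_b, weights)
```

Because the two profiles agree on the full-weight entries and disagree only on the down-weighted complexes, the weighted correlation stays positive where an unweighted correlation would be zero.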

6.6.9 Synergy prediction

Our chemogenomic profile collection may contain several different profiles associated with a single compound. For example, these profiles may have been generated with different assays and/or different concentrations of the compound. For compounds A and B, every profile for A is compared to every profile for B. The similarity value for the compound pair is defined as the best similarity value obtained from all the pairwise profile comparisons. For the gene-based profile similarity measure, the best is the lowest P value. If the similarity value of a compound pair is less than or equal to some threshold, the pair is predicted to be synergistic. For the complex-based and log2ratio-based measures, the best similarity value is the highest correlation value. For these measures, a compound pair is predicted to be synergistic if its similarity value is greater than or equal to some threshold.
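The prediction rule described above reduces to taking the best value over all pairwise profile comparisons and comparing it with a threshold. A minimal Python sketch follows; the function names and the `lower_is_better` flag are illustrative assumptions, and `similarity` stands for any of the three measures.

```python
def best_similarity(profiles_a, profiles_b, similarity, lower_is_better):
    """Compare every profile of compound A with every profile of compound B
    and return the best value (lowest P value or highest correlation)."""
    values = [similarity(pa, pb) for pa in profiles_a for pb in profiles_b]
    return min(values) if lower_is_better else max(values)


def predict_synergy(value, threshold, lower_is_better):
    """Gene-based measure: synergy if value <= threshold.
    Correlation-based measures: synergy if value >= threshold."""
    return value <= threshold if lower_is_better else value >= threshold


# Toy stand-in for a similarity measure: profiles are plain numbers here and
# the "P value" is simply their absolute difference (purely illustrative).
toy_measure = lambda pa, pb: abs(pa - pb)
example_best = best_similarity([0.1, 0.9], [0.85, 0.4], toy_measure,
                               lower_is_better=True)
example_call = predict_synergy(example_best, threshold=0.1,
                               lower_is_better=True)
```

Taking the best rather than the average comparison means a single well-matched pair of profiles is enough to nominate a compound pair for testing.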

For the comparison of the three profile similarity measures as predictors of synergy (Figure 7–1A), a reduced gold standard set was used to evaluate each predictor (16 positive and 24 negative examples). Each compound pair in the reduced set is associated with log2ratio profiles since the log2ratio-based measure requires these types of profiles (see Supplementary Text).

Similarly, for the comparison of the gene-based measure predictors dependent on haploid-based profiles only, diploid-based profiles only, and both types of profiles (Figure 7–1B), a different gold standard set was used to evaluate these predictors (10 positive and 16 negative examples). Each compound pair in the set is associated with both haploid-based and diploid-based profiles, and all the profiles were generated from a competitive growth assay. To avoid an extremely small number of positive examples, the set includes synergies validated in this study (see Supplementary Text). Even so, there is an insufficient number of gold standard examples to also make the comparison in the context of profiles generated from a non-competitive growth assay.

We evaluated each variant of the synergy predictor based on the significance of the enrichment of its predictions with true positives/synergies, relative to the expected baseline level (see Permutation analysis below). The optimal profile similarity threshold for defining the predictions of each variant is therefore the threshold that results in the most significant enrichment.

6.6.10 Permutation analysis

The chemogenomic profile labels were permuted 5,000 times in order to estimate the baseline levels of different statistics, for each variant of the synergy predictor. For each type of profile (the type of each profile is specified in Table 7-S1), the labels were randomly permuted amongst all profiles of that type. This preserves any systematic differences between profiles of different types. However, we excluded permutations where at least one profile label is assigned to a profile corresponding to the same compound. For example, this could potentially occur when there are multiple profiles generated with different concentrations of the same compound. Additional restrictions were applied, depending on the type of analysis. For the comparison between the three profile similarity measures, permutations were only done across log2ratio profiles. For the comparison of the predictors dependent on haploid-based profiles only, diploid-based profiles only, and both types of profiles, permutations were only done across profiles generated from a competitive growth assay. In addition, the permutations were only done across haploid- and diploid-based profiles for the predictors exclusively dependent on haploid- and diploid-based profiles, respectively.

With each permutation, synergy predictions were made using a given variant of the predictor at different thresholds. We computed a statistic quantifying the enrichment of the predictions (defined by the optimal threshold) with true synergies: the P value obtained from a hypergeometric test, which equals the probability of obtaining an equal or larger number of positive examples predicted to be synergistic by chance, given the numbers of positive examples, negative examples and predicted synergies in the given gold standard set. For the final predictor, the predictions were also used to compute the accuracy at the optimal threshold.

For each statistic $z$ (e.g. the enrichment P value), a permutation distribution of the baseline value was obtained by collecting the computed values from all 5,000 permutations. Moreover, the significance of the value computed with the observed/real data ($z_{\mathrm{obs}}$) relative to the expected baseline value was quantified as $P = (x + 1)/(n + 1)$, where $x$ is the number of permutations with a $z$ value better than or equal to $z_{\mathrm{obs}}$ and $n = 5{,}000$, the number of permutations in this case [182].
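The enrichment statistic and the permutation P value described above can be sketched in Python as follows. The function names are hypothetical, the label-permutation step itself is omitted, and "better" is taken to mean "smaller" for an enrichment P value; these are assumptions for illustration rather than a description of the original R code.

```python
from math import comb


def enrichment_p(true_pos, n_pos, n_neg, n_pred):
    """Hypergeometric probability of finding at least `true_pos` positive
    examples among `n_pred` predicted synergies by chance, given a gold
    standard set with n_pos positive and n_neg negative examples."""
    upper = min(n_pos, n_pred)
    tail = sum(comb(n_pos, i) * comb(n_neg, n_pred - i)
               for i in range(true_pos, upper + 1))
    return tail / comb(n_pos + n_neg, n_pred)


def permutation_p(z_obs, z_permuted, lower_is_better=True):
    """P = (x + 1) / (n + 1), where x counts the permutations whose
    statistic is better than or equal to the observed value."""
    if lower_is_better:
        x = sum(1 for z in z_permuted if z <= z_obs)
    else:
        x = sum(1 for z in z_permuted if z >= z_obs)
    return (x + 1) / (len(z_permuted) + 1)


# Toy example: four permuted enrichment P values, two of them at least as
# good as the observed value of 0.01.
example_perm_p = permutation_p(0.01, [0.5, 0.2, 0.01, 0.005])
```

Adding one to both the numerator and the denominator ensures that the reported P value can never be exactly zero, however extreme the observed statistic.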

6.6.11 Fitting to other models of synergy

For each compound pair that was experimentally tested for synergy, Bliss boosting and potentiation models of synergy were fit to the dose-matrix response data [162]. First, the OD600 values were used to compute a corresponding matrix of % inhibition values ($I$) relative to untreated cells. Model fits to the inhibition data were then obtained as previously described [162]. The sum-of-squared fit errors, $SS = \sum (I_{\mathrm{observed}} - I_{\mathrm{fit}})^2$, was computed for each model. The best fit model was defined as the first consistent model, with the Bliss boosting model considered before the potentiation model because it is less complex. We define consistent as $|SS - SS_{\min}| < SS_{\min}$, where $SS_{\min}$ is the minimum $SS$ of the two models.

A Bliss boosting surface is defined by

$$I_{\mathrm{Bliss}} = I_X + I_Y + (\beta - E_{\min}) \cdot \frac{I_X I_Y}{E_X E_Y},$$

where $I_{\mathrm{Bliss}}$ is the Bliss boosting inhibition level when both compounds are used in combination, with the first and second compounds used at concentrations $X$ and $Y$, respectively. $I_X$ and $I_Y$ are the inhibition levels when the first and second compounds are used alone at concentrations $X$ and $Y$, respectively. $E_X$ and $E_Y$ are the maximum inhibition levels achievable by the first and second compounds, respectively, and $E_{\min} = \min(E_X, E_Y)$. $\beta$ is the fitted parameter and it represents the amount of boosting above $\max(E_X, E_Y)$. Reference values of $\beta$ indicate cancelling, suppressive, masking, multiplicative, and saturating levels of Bliss boosting. The selected Bliss boosting level of a compound pair is defined as the first consistent level (in the order shown above), where consistent is defined as $|\Delta\beta - \Delta\beta_{\min}| < \Delta\beta_{\min}$ with $\Delta\beta = |\beta - \beta_{\mathrm{ref}}|$ for some reference level $\beta_{\mathrm{ref}}$, and $\Delta\beta_{\min}$ is the minimum $\Delta\beta$ across all reference levels.

A potentiation model surface is defined by $I_{\mathrm{Potent}} = \max(I_X(C), I_Y)$, where $I_X(C)$ is the inhibition level when the potentiated compound is used alone, at a shifted concentration $C$. We have that

$$C = X\left[1 + (Y/Y_{\mathrm{pot}})^{|p|}\right]^{\mathrm{sign}(p)},$$

where $Y_{\mathrm{pot}}$ and $p$ are fitted parameters, representing the concentration of the potentiated compound above which potentiation occurs and the potentiation slope, respectively [162]. $p = 0$, $p > 0$ and $p < 0$ indicate no potentiation, synergy and antagonism respectively. For each compound pair, the inhibition matrix was fitted to this model twice; the first time assuming that the first compound is potentiated, the second time assuming that the second compound is potentiated. Of the two fits, we report the one with the lower $SS$ (Tables 7–5 and 7–7).
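The model formulas and the "first consistent model" rule of this section can be sketched in Python as below. The sketch only evaluates the surfaces and the selection rule; the parameter fitting itself follows [162] and is not reproduced. Function names are hypothetical, and inhibition is assumed to be expressed on a single consistent scale (fractions here).

```python
def bliss_boosting(i_x, i_y, e_x, e_y, beta):
    """Bliss boosting inhibition for single-agent inhibitions i_x, i_y with
    maximum achievable inhibitions e_x, e_y and boosting parameter beta."""
    e_min = min(e_x, e_y)
    return i_x + i_y + (beta - e_min) * (i_x * i_y) / (e_x * e_y)


def potentiation_shift(x, y, y_pot, p):
    """Shifted concentration C = X * [1 + (Y / Y_pot)^|p|]^sign(p)."""
    sign = (p > 0) - (p < 0)
    return x * (1.0 + (y / y_pot) ** abs(p)) ** sign


def sum_squared_error(observed, fitted):
    """SS between observed and fitted inhibition values (flat sequences)."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted))


def first_consistent(ss_by_model):
    """Given (name, SS) pairs ordered from least to most complex, return the
    first model with |SS - SS_min| < SS_min, or None if no model qualifies
    (which happens only when SS_min is 0)."""
    ss_min = min(ss for _, ss in ss_by_model)
    for name, ss in ss_by_model:
        if abs(ss - ss_min) < ss_min:
            return name
    return None


# With beta = 0 and full efficacy, the surface reduces to I_X + I_Y - I_X*I_Y.
example_bliss = bliss_boosting(0.5, 0.5, 1.0, 1.0, beta=0.0)
example_model = first_consistent([("bliss", 3.0), ("potentiation", 2.0)])
```

In the usage example the simpler Bliss boosting model is selected even though its fit error is larger, because the difference from the minimum error is smaller than the minimum error itself.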

All computational analyses were performed in the R statistical software framework [246].

6.7 Acknowledgements

We thank Dr. Robert Annan for his comments and suggestions regarding the writing of the manuscript. We thank Dr. Charles Boone for providing additional chemogenomic profile data that was generated in [206]. This is National Research Council of Canada publication NRC 495413. The authors declare that they have no competing financial interests. This work was supported by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research (scholarship to AYL and grants to MW, MH, and DYT).

Figure 6–1: A method for identifying synergistic compounds with antifungal activity. (A) A schematic illustrating the steps of the method. (B) Validation of the first step: recovery after fluconazole treatment for defining the chemogenomic profile of the drug. Strains were treated with increasing amounts of fluconazole (0–128 µg/ml) for 24 hours before spotting aliquots on YPD and incubating at 30 °C for two days.

[Figure 6–2, panels A–D. Panel A labels: caspofungin and FCZ-Fungicidal hypersensitive gene sets within 4,967 genes; Venn diagram counts 38, 7 and 14; significance of overlap P = 3.26 × 10^-10. Panel B labels: complex-based profile, weights, weighted correlation.]

Figure 6–2: The measures of chemogenomic profile similarity and antifungal synergy predictions based on these measures. (A) An example of how the gene-based measure quantifies profile similarity with the hypergeometric test. (B) The complex-based measure. This measure compares complex-based profiles derived from hypersensitive gene sets. The similarity between the complex-based profiles is measured with weighted Pearson correlation. A protein complex is weighted less if it has many subunits, and all genes that are not annotated to any complex (i.e. non-complex genes) are assigned maximal weight. (C) A heatmap of the similarity values of select compound pairs, using the gene-based measure. The intensity of purple for a pair corresponds to the degree of similarity. All compounds in the heatmap are predicted as synergistic with fluconazole (using a threshold of $P \le 10^{-6.5}$), except for camptothecin (included for contrast). (D) As in (C), except that the similarity values were computed using the complex-based measure. Abbreviations: AlvC, alverine citrate; CASP, caspofungin; CsA, cyclosporin A; FCZ, fluconazole; FEN, fenpropimorph; LatA, latrunculin A; TUN, tunicamycin; WM, wortmannin.

[Figure 6–3, panels A and B. Panel B is annotated with P = 0.0236.]

Figure 6–3: Statistical evaluation of the gene-based chemogenomic profile similarity measure as a predictor of antifungal synergy. The evaluation is based on the enrichment of the predictions with true positives/synergies, and higher values in this figure indicate greater enrichment. Baseline enrichment values were estimated using random permutations of the data (N = 5,000 permutations, see Materials and Methods). The significance of the enrichment estimated from the observed data is computed relative to the baseline enrichment values. (A) The true synergy enrichment estimated from observed and randomly permuted data, at different similarity thresholds. The triangles indicate the enrichment values estimated from the observed data, at different thresholds. For each threshold, the median (o) and interquartile range (whiskers) of the enrichment values computed from the different permutations are also shown. (B) The significance of the true synergy enrichment associated with the predictor used with a threshold of $10^{-6.5}$, relative to the permutation distribution of the baseline enrichment. The red line indicates the enrichment estimated from the observed data and its P value of significance is also shown. The threshold of $10^{-6.5}$ results in the most significant enrichment and is therefore considered the optimal profile similarity threshold for defining the synergy predictions.

[Figure 6–4 image: dose-matrix panels; axes are compound concentrations. (A) Cyclosporin A and FK506. (B) Cyclosporin A and FK506. (C) Alverine citrate and fluconazole. (D) Tunicamycin and cyclosporin A. (E) Fluconazole and fenpropimorph. (F) Tunicamycin and FK506.]

Figure 6–4: Dose-matrix responses to compound pairs that exhibit antifungal synergy. (C-F) Growth inhibition levels of cells grown in the presence of the compounds are shown on the left. The recovery of cells post-treatment is shown on the right. (A, B) The recovery of cells treated with a compound pair that is not synergistic in S. cerevisiae and C. albicans, respectively. (C, D) Compound pairs that exhibit fungistatic and fungicidal synergy in S. cerevisiae, respectively. (E, F) Compound pairs that exhibit fungistatic and fungicidal synergy in C. albicans, respectively.

[Figure 6–5 image: dose-matrix panels (A) and (B); axes are fluconazole and wortmannin concentrations, each with a DMSO solvent control.]

Figure 6–5: The dose-matrix recovery from treatment with fluconazole and wortmannin, a compound pair exhibiting antifungal synergy in (A) the multidrug-resistant clinical C. albicans isolate G5 and (B) the fluconazole-resistant clinical C. albicans isolate S2. Solvent controls are shown on the left.

CHAPTER 7
Supplementary Information for Chapter 6

This chapter contains all the supplementary information for the manuscript in chapter 6, except for the following:

• Table 7-S1: Chemogenomic profiles used in this study.
• Table 7-S6: Chemogenomic profiles that overlap with the FCZ-Fungicidal profile and the significances of the overlaps reported with the gene-based profile similarity P values.
• Table 7-S10: The similarity values for all pairings of the chemogenomic profiles collected in this study.

The above tables are not intended for the printed page; however, they are available for download (i.e. no subscription required) at www.nature.com/msb, Mol Syst Biol 5: 338 (2009).

7.1 Supplementary Text

The comparison of the three different profile similarity measures as predictors of antifungal synergy was conducted relative to a reduced gold standard set. Since the log2ratio-based measure can only be applied to profiles with relative growth data in the form of log2ratios, each compound pair in the reduced set is associated with log2ratio profiles. All profile similarity measures could thus be applied to all examples in the reduced gold standard set, enabling a fair comparison of the three different measures. However, this reduced set significantly under-represents the false negatives that occur when the gene-based measure is applied to the complete gold standard set (P = 0.025, hypergeometric test). For example, there is only one profile for cyclosporin A and it is not associated with log2ratios. Consequently, the two examples involving cyclosporin A that were false negatives in the earlier analysis were excluded from the gold standard set used in the profile similarity measure comparison analysis. It is therefore not surprising that the predictions of the gene-based measure (using the optimal threshold) are more enriched for true synergies when evaluated with the reduced versus the complete gold standard set (compare the position of the red line in Figure 7–1A and Figure 6–3B). However, all three measures were evaluated with the same reduced gold standard set and thus all evaluations were subjected to the same bias. Therefore, it is unlikely that the bias impacted the comparison of the measures relative to each other.

Similarly, for the comparison of the gene-based measure predictors dependent on haploid-based profiles only, diploid-based profiles only, and both types of profiles, an altered gold standard set was used to enable a fair comparison that is potentially more robust than a comparison made without the additional examples in the set. Although bias may have been introduced by using the altered gold standard set, all evaluations were subjected to the same bias, and thus it is unlikely that the bias impacted the comparison of the three predictors relative to each other.

Table 7–1: The 22 FCZ-Fungicidal genes. (a) A hypomorphic strain instead of a deletion strain for this essential gene was tested. Abbreviation: ER, endoplasmic reticulum.

Systematic name | Common name | Gene description | Localization
YDR448W | ADA2 | subunit of the ADA & SAGA transcriptional adaptor/histone acetyltransferase complexes | nuclear lumen
YGR252W | GCN5 | histone acetyltransferase, catalytic subunit of the ADA and SAGA histone acetyltransferase complexes | nuclear lumen
YDR176W | NGG1 | subunit of the ADA, SLIK & SAGA transcriptional adaptor/histone acetyltransferase complexes | nuclear lumen
YOL148C | SPT20 | subunit of the SAGA transcriptional adaptor/histone acetyltransferase complex, maintains complex integrity | nuclear lumen
YDL185W | TFP1 | subunit of the vacuolar H+-ATPase V1 domain | vacuolar membrane
YOR332W | VMA4 | subunit of the vacuolar H+-ATPase V1 domain | vacuolar membrane
YHR039C-A | VMA10 | subunit of the vacuolar H+-ATPase V1 domain | vacuolar membrane
YPR036W | VMA13 | subunit of the vacuolar H+-ATPase V1 domain | vacuolar membrane
YHR060W | VMA22 | functions in the assembly of the vacuolar H+-ATPase | ER membrane
YKL119C | VPH2 | functions in the assembly of the vacuolar H+-ATPase | ER membrane
YER155C | BEM2 | Rho GTPase activating protein involved in the control of cytoskeleton organization & cellular morphogenesis | mitochondrion
YDL226C | GCS1 | ADP-ribosylation factor GTPase activating protein involved in ER-Golgi transport & actin filament reorganization | ER-Golgi, cytoskeleton
YDR129C | SAC6 | fimbrin, actin-bundling protein involved in organization & maintenance of the actin cytoskeleton | actin cytoskeleton
YHR030C | SLT2 | serine/threonine MAP kinase involved in regulating the maintenance of cell wall integrity & cell cycle progression | bud tip, nucleus
YCR081W | SRB8 | subunit of the RNA polymerase II Mediator complex & RNA polymerase II holoenzyme | nuclear lumen
YPL042C | SSN3 | cyclin-dependent protein kinase, subunit of RNA polymerase II holoenzyme | nuclear lumen
YOR153W | PDR5 | ABC transporter, actively exports various drugs, also involved in steroid transport & cation resistance | plasma membrane, mitochondrion
YDR074W | TPS2 | phosphatase subunit of the trehalose-6-phosphate synthase/phosphatase complex | mitochondrion
YDR532C | KRE28 | telomere maintenance | cytoskeleton
YDL116W | NUP84 | subunit of the nuclear pore complex, nuclear mRNA export | nuclear pore
YHR025W | THR1 | homoserine kinase, required for threonine biosynthesis | unknown
YHR007C | ERG11 (a) | lanosterol 14-α-demethylase of the ergosterol pathway | ER

Table 7–2: Phenotypes of the FCZ-Fungicidal strains. (a) Plasmid-borne copies of the respective genes restore the wild type MFC. (b) Minimum inhibitory concentration of fluconazole that results in 95% inhibition. (c) Minimum fungicidal concentration of fluconazole. (d) Hypomorphic strain. Abbreviations: nd, not done; na, not applicable.

Systematic name | ∆Strain | Complementation (a) | MIC95 (µg/ml) (b) | MFC (µg/ml) (c)
YDR448W | ADA2 | + | 16.0 | 32
YGR252W | GCN5 | + | 16.0 | 32
YDR176W | NGG1 | + | 16.0 | 64
YOL148C | SPT20 | + | 8.0 | 32
YDL185W | TFP1 | + | 8.0 | 32
YOR332W | VMA4 | + | 8.0 | 32
YHR039C-A | VMA10 | + | 8.0 | 32
YPR036W | VMA13 | + | 16.0 | 32
YHR060W | VMA22 | + | 8.0 | 32
YKL119C | VPH2 | + | 4.0 | 32
YER155C | BEM2 | + | 16.0 | 64
YDL226C | GCS1 | + | 4.0 | 32
YDR129C | SAC6 | nd | 8.0 | 64
YHR030C | SLT2 | + | 4.0 | 32
YCR081W | SRB8 | + | 8.0 | 64
YPL042C | SSN3 | + | 8.0 | 64
YOR153W | PDR5 | + | 0.5 | 2
YDR074W | TPS2 | + | 16.0 | 32
YDR532C | KRE28 | + | 16.0 | 64
YDL116W | NUP84 | + | 8.0 | 64
YHR025W | THR1 | + | 16.0 | 32
YHR007C | ERG11 (d) | + | 2.0 | 8
wild type | BY4741 | na | 16.0 | >512

Table 7–3: The gold standard set of positive and negative examples of compound pairs that exhibit antifungal synergy. (a) PubMed IDs provided for other studies.

Compound 1 | Compound 2 | Type | Source (a)
Amitriptyline | Fluconazole | positive | 10952582
Amphotericin B | Caspofungin | positive | 9021188
Amphotericin B | Flucytosine | positive | 7047097
Amphotericin B | Itraconazole | positive | 1309690
Amphotericin B | Ketoconazole | positive | 6295266
Amphotericin B | Terbinafine | positive | 9511038
Amphotericin B | Tetracycline | positive | 18310042
Calyculin A | Cantharidin | positive | 18622389
Chlorpromazine | Fluconazole | positive | 10952582
Clomipramine | Fluconazole | positive | 10952582
Cyclosporin A | Fenpropimorph | positive | 12604527; this study
Cyclosporin A | Fluconazole | positive | 10952582; this study
Cyclosporin A | Terbinafine | positive | 12604527
Fenpropimorph | FK506 | positive | 12604527; this study
FK506 | Fluconazole | positive | 10052898; 10952582; this study
FK506 | Terbinafine | positive | 12604527
Fluconazole | Flucytosine | positive | 9145879
Fluconazole | Terbinafine | positive | 17332758
Flucytosine | Itraconazole | positive | 10459811
Itraconazole | Terbinafine | positive | 9511038
Miconazole | Terbinafine | positive | 17332758
Amiodarone | Camptothecin | negative | this study
Amiodarone | Cyclosporin A | negative | this study
Amiodarone | FK506 | negative | this study
Amiodarone | Fluconazole | negative | this study
Amiodarone | Myriocin | negative | this study
Amiodarone | Tunicamycin | negative | this study
Benomyl | Desipramine | negative | this study
Camptothecin | Cyclosporin A | negative | this study
Camptothecin | FK506 | negative | this study
Camptothecin | Fluconazole | negative | this study
Carboplatin | FK506 | negative | this study
Carboplatin | Tunicamycin | negative | this study
Chlorpromazine | FK506 | negative | this study
Chlorpromazine | MMS | negative | this study
Chlorpromazine | Mycophenolic acid | negative | this study
Chlorpromazine | Nystatin | negative | this study
Chlorpromazine | Wortmannin | negative | this study
Cyclosporin A | FK506 | negative | this study
Cyclosporin A | Wortmannin | negative | this study
Desipramine | Fluconazole | negative | this study
Desipramine | MMS | negative | this study
Desipramine | Tunicamycin | negative | this study
Doxycycline | Fluconazole | negative | this study
Fenpropimorph | Tunicamycin | negative | this study
Fluconazole | Hygromycin B | negative | this study
Fluconazole | Ibuprofen | negative | this study
Fluconazole | Kanamycin | negative | this study
Fluconazole | Tetracycline | negative | this study
Fluconazole | Tunicamycin | negative | this study
Tunicamycin | Wortmannin | negative | this study

Table 7–4: Evaluation of compound pairs for antifungal synergy in S. cerevisiae with the Loewe additivity model. (a) Units of concentrations: µg/ml - BEN, CASP, CPT, CsA, DOXY, FCZ, FK506, IBU, MPA, MYR, NYS, TC, TUN, WM; µM - AlvC, AMD, CBP, CPZ, DMI, FEN, HygB, LatA; mg/ml - KAN; % - MMS. Abbreviations: AMD, amiodarone; AlvC, alverine citrate; BEN, benomyl; CASP, caspofungin; CBP, carboplatin; cmpd, compound; CPT, camptothecin; CPZ, chlorpromazine; CsA, cyclosporin A; DMI, desipramine; DOXY, doxycycline; FCZ, fluconazole; FEN, fenpropimorph; FICI, fractional inhibitory concentration index; HygB, hygromycin B; IBU, ibuprofen; KAN, kanamycin; LatA, latrunculin A; MIC and MCC, minimum inhibitory and cytotoxic concentrations, respectively; MMS, methyl methanesulfonate; MPA, mycophenolic acid; MYR, myriocin; NYS, nystatin; TC, tetracycline; TUN, tunicamycin; WM, wortmannin.

Compound pair | MIC95 cmpd1 alone (a) | MIC95 cmpd2 alone (a) | MIC95 cmpd1 combo (a) | MIC95 cmpd2 combo (a) | FICI (growth arrest) | MCC cmpd1 alone (a) | MCC cmpd2 alone (a) | MCC cmpd1 combo (a) | MCC cmpd2 combo (a) | FICI (cytotoxic) | Synergy type
AlvC + FCZ | ≥1000 | 16 | 250 | 2 | 0.4 | - | - | - | - | - | fungistatic
AMD + CPT | 18.75 | 60 | 18.75 | 3.75 | 1.1 | - | - | - | - | - | -
AMD + CsA | 18.75 | ≥200 | 18.75 | 3.125 | 1 | - | - | - | - | - | -
AMD + FK506 | 18.75 | 128 | 9.38 | 8 | 0.6 | - | - | - | - | - | -
AMD + FCZ | 18.75 | 16 | 1.1719 | 16 | 1.1 | - | - | - | - | - | -
AMD + MYR | 18.75 | 16 | 9.375 | 8 | 1 | - | - | - | - | - | -
AMD + TUN | 18.75 | 4 | 9.375 | 2 | 1 | - | - | - | - | - | -
BEN + DMI | 2.5 | 924 | 2.5 | 57.75 | 1.1 | - | - | - | - | - | -
CPT + CsA | ≥60 | ≥200 | - | - | - | - | - | - | - | - | -
CPT + FK506 | ≥60 | 128 | - | - | - | - | - | - | - | - | -
CPT + FCZ | ≥60 | 16 | 3.75 | 16 | 1.1 | - | - | - | - | - | -
CBP + FK506 | ≥150 | 128 | 75 | 64 | 1 | - | - | - | - | - | -
CBP + TUN | ≥150 | 8 | 4.69 | 8 | 1 | - | - | - | - | - | -
CASP + FCZ | 1.92 | 16 | 0.12 | 16 | 1.1 | - | - | - | - | - | -
CPZ + FK506 | 112.8 | 128 | 14.1 | 64 | 0.6 | - | - | - | - | - | -
CPZ + MMS | 112.8 | 0.0016 | 7.05 | 0.0016 | 1.1 | - | - | - | - | - | -
CPZ + MPA | 112.8 | 30 | 28.2 | 15 | 0.8 | - | - | - | - | - | -
CPZ + NYS | 112.8 | 2 | 112.8 | 0.5 | 1.3 | - | - | - | - | - | -
CPZ + WM | 112.8 | 0.8 | 28.2 | 0.4 | 0.8 | - | - | - | - | - | -
CsA + FEN | ≥200 | 4 | 3.2 | 2 | 0.5 | - | - | - | - | - | fungistatic
CsA + FK506 | ≥200 | ≥256 | - | - | - | - | - | - | - | - | -
CsA + FCZ | ≥200 | 16 | 3.125 | 16 | 1 | ≥500 | ≥256 | 12.5 | 64 | 0.3 | cytotoxic
CsA + TUN | ≥200 | 4 | 6.25 | 0.5 | 0.2 | ≥500 | 8 | 12.5 | 4 | 0.5 | cytotoxic
CsA + WM | ≥200 | 0.8 | 6.25 | 0.8 | 1 | - | - | - | - | - | -
DMI + FCZ | 924 | 16 | 115.5 | 8 | 0.6 | - | - | - | - | - | -
DMI + MMS | 924 | 0.0016 | 924 | 0.0001 | 1.1 | - | - | - | - | - | -
DMI + TUN | 924 | 4 | 462 | 2 | 1 | - | - | - | - | - | -
DOXY + FCZ | 400 | 16 | 200 | 4 | 0.8 | - | - | - | - | - | -
FEN + FK506 | 4 | ≥256 | 2 | 2 | 0.5 | - | - | - | - | - | fungistatic
FEN + FCZ | 4 | 32 | 1 | 8 | 0.5 | - | - | - | - | - | fungistatic
FEN + TUN | 2 | 8 | 1 | 4 | 1 | - | - | - | - | - | -
FEN + WM | 2 | ≥1.6 | 0.5 | 0.4 | 0.5 | ≥64 | ≥1.6 | 2 | 0.2 | 0.2 | cytotoxic
FK506 + FCZ | 128 | 16 | 4 | 4 | 0.3 | - | - | - | - | - | fungistatic
FK506 + TUN | 128 | 4 | 1 | 2 | 0.5 | ≥500 | 8 | 2 | 4 | 0.5 | cytotoxic
FK506 + WM | 128 | 0.8 | 8 | 0.2 | 0.3 | ≥500 | ≥1.6 | 4 | 0.4 | 0.3 | cytotoxic
FCZ + HygB | 16 | 6 | 16 | 0.375 | 1.1 | - | - | - | - | - | -
FCZ + IBU | 16 | 200 | 16 | 6.25 | 1 | - | - | - | - | - | -
FCZ + KAN | 16 | 3 | 8 | 0.375 | 0.6 | - | - | - | - | - | -
FCZ + LatA | 16 | 3.12 | 2 | 3.12 | 1.1 | ≥256 | 6.24 | 16 | 3.12 | 0.6 | -
FCZ + TC | 16 | 1600 | 16 | 50 | 1 | - | - | - | - | - | -
FCZ + TUN | 16 | 4 | 1 | 2 | 0.6 | - | - | - | - | - | -
FCZ + WM | 16 | 0.8 | 16 | 0.05 | 1.1 | ≥256 | ≥1.6 | 32 | 0.1 | 0.2 | cytotoxic
TUN + WM | 8 | 0.8 | 0.5 | 0.8 | 1.1 | - | - | - | - | - | -
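The FICI columns of Table 7–4 follow the standard definition of the fractional inhibitory concentration index: the sum, over the two compounds, of the concentration needed in combination divided by the concentration needed alone. A minimal Python sketch of that arithmetic is given below; the function and variable names are hypothetical, and a bound such as "≥1000" is entered as 1000, which makes the computed index an upper bound on the true value.

```python
def fici(conc_a_alone, conc_b_alone, conc_a_combo, conc_b_combo):
    """Fractional inhibitory concentration index for one compound pair.

    Each argument is an inhibitory (or cytotoxic) concentration; the two
    values for a given compound must share the same units.
    """
    return conc_a_combo / conc_a_alone + conc_b_combo / conc_b_alone


# MIC95 values taken from two rows of Table 7-4 (S. cerevisiae).
alvc_fcz = fici(1000, 16, 250, 2)   # AlvC + FCZ
fk506_fcz = fici(128, 16, 4, 4)     # FK506 + FCZ
print(round(alvc_fcz, 1), round(fk506_fcz, 1))  # prints "0.4 0.3"
```

Both results agree with the tabulated FICI (growth arrest) values for these pairs. An index of about 0.5 or lower is a commonly used cutoff for synergy and is consistent with the rows labelled as synergistic in the table, but the exact decision rule is the one given in the Methods of chapter 6.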

Table 7–5: Dose-matrix responses to compound pairs in S. cerevisiae fitted to Bliss boosting and potentiation models. (a) Each Bliss level is associated with a reference β value and the value that is closest to the fitted β is shown here. (b) Units of Ypot: µg/ml - BEN, CASP, CPT, CsA, DOXY, FCZ, FK506, IBU, MPA, MYR, NYS, TC, TUN, WM; µM - AlvC, AMD, CBP, CPZ, DMI, FEN, HygB, LatA; mg/ml - KAN; % - MMS. (c) The first consistent model, see Methods in chapter 6. Abbreviations: AMD, amiodarone; AlvC, alverine citrate; BEN, benomyl; CASP, caspofungin; CBP, carboplatin; CPT, camptothecin; CPZ, chlorpromazine; CsA, cyclosporin A; DMI, desipramine; DOXY, doxycycline; FCZ, fluconazole; FEN, fenpropimorph; FICI, fractional inhibitory concentration index; HygB, hygromycin B; IBU, ibuprofen; KAN, kanamycin; LatA, latrunculin A; MMS, methyl methanesulfonate; MPA, mycophenolic acid; MYR, myriocin; SS, sum of squared fit errors; NYS, nystatin; TC, tetracycline; TUN, tunicamycin; WM, wortmannin.

Compound pair | Bliss boosting: β | Bliss boosting: reference β (a) | Bliss boosting: Bliss level | Bliss boosting: SS | Potentiation: p | Potentiation: Ypot (b) | Potentiation: potentiated compound | Potentiation: SS | Best fit (c)
AlvC + FCZ | 0.056 | 0 | masking | 4.413 | 1.502 | 65.546 | FCZ | 0.022 | potentiation
AMD + CPT | -0.094 | 0 | masking | 0.3 | 19.687 | 9.051 | CPT | 0.32 | Bliss boosting
AMD + CsA | -0.004 | 0 | masking | 0.187 | 1 | 1.172 | CsA | 0.182 | Bliss boosting
AMD + FK506 | 0.032 | 0 | masking | 1.776 | 0.146 | 2.617 | AMD | 0.018 | potentiation
AMD + FCZ | -0.001 | 0 | masking | 0.019 | -0.232 | 1.77E+10 | FCZ | 0.017 | Bliss boosting
AMD + MYR | -0.009 | -0.014 | suppressing | 1.554 | 0.578 | 76.435 | AMD | 1.241 | Bliss boosting
AMD + TUN | -0.009 | -0.014 | suppressing | 0.378 | 2.285 | 9.506 | TUN | 0.028 | potentiation
BEN + DMI | -0.028 | 0 | suppressing | 0.009 | 6.43 | 1.89E+04 | DMI | 0.011 | Bliss boosting
CPT + CsA | 0.019 | 0.017 | multiplicative | 0.035 | 0 | 22.375 | CPT | 0.036 | Bliss boosting
CPT + FK506 | 0.011 | 0 | masking | 0.027 | 1 | 8 | CPT | 0.062 | Bliss boosting
CPT + FCZ | 0 | 0 | masking | 0.006 | 0.864 | 7.39E+08 | FCZ | 0.062 | Bliss boosting
CBP + FK506 | 0.414 | 0 | masking | 0.128 | 1.126 | 67.37 | FK506 | 0.056 | potentiation
CBP + TUN | 0 | 0 | masking | 0.063 | 1.535 | 230.292 | TUN | 0.043 | Bliss boosting
CASP + FCZ | 0.054 | 0 | masking | 0.081 | 1.464 | 5.895 | FCZ | 0.077 | Bliss boosting
CPZ + FK506 | 0.105 | 0 | masking | 1.211 | 1.872 | 19.171 | FK506 | 0.107 | potentiation
CPZ + MMS | -0.005 | -0.001 | suppressing | 0.226 | 0.116 | 0.199 | CPZ | 0.051 | potentiation
CPZ + MPA | -0.003 | 0 | suppressing | 0.037 | 0.572 | 25.549 | CPZ | 0.04 | Bliss boosting
CPZ + NYS | -0.015 | 0 | suppressing | 1.735 | 0 | 21.699 | NYS | 1.204 | Bliss boosting
CPZ + WM | 0.001 | 0 | masking | 5.482 | 0.997 | 14.554 | WM | 0.367 | potentiation
CsA + FEN | 0.055 | 0 | masking | 0.388 | 0.186 | 54.676 | FEN | 0.355 | Bliss boosting
CsA + FK506 | -0.069 | 0 | masking | 0.364 | -0.016 | 1.38E+43 | FK506 | 0.364 | Bliss boosting
CsA + FCZ | -0.029 | 0 | masking | 0.108 | -0.309 | 14158.014 | FCZ | 0.05 | potentiation
CsA + TUN | 0.136 | 0 | masking | 12.67 | 2.787 | 1.943 | TUN | 0.029 | potentiation
CsA + WM | 0.019 | 0 | masking | 0.282 | 0.77 | 3090.591 | WM | 0.073 | potentiation
DMI + FCZ | 0.036 | -0.009 | suppressing | 1.599 | 1.365 | 163.024 | FCZ | 0.026 | potentiation
DMI + MMS | -0.012 | -0.001 | suppressing | 0.294 | 0.146 | 0.277 | DMI | 0.034 | potentiation
DMI + TUN | -0.011 | -0.007 | suppressing | 2.068 | 0.952 | 343.285 | TUN | 0.079 | potentiation
DOXY + FCZ | -0.002 | 0 | masking | 0.766 | 1.667 | 100.836 | FCZ | 0.016 | potentiation
FEN + FK506 | 0.091 | 0.046 | multiplicative | 0.134 | 1.347 | 1.467 | FEN | 0.139 | Bliss boosting
FEN + FCZ | -0.002 | 0 | suppressing | 0.246 | 20.248 | 4.284 | FEN | 0.144 | Bliss boosting
FEN + TUN | -0.001 | 0 | suppressing | 0.016 | 1.854 | 2.936 | FEN | 0.012 | Bliss boosting
FEN + WM | -0.014 | 0 | masking | 2.435 | 1.844 | 0.322 | WM | 0.03 | potentiation
FK506 + FCZ | 0.093 | 0 | masking | 5.443 | 0.999 | 2.754 | FCZ | 0.04 | potentiation
FK506 + TUN | -0.049 | 0 | masking | 2.969 | 1.139 | 1.648 | TUN | 0.208 | potentiation
FK506 + WM | -0.05 | 0 | masking | 10.052 | 0.686 | 1.618 | WM | 0.599 | potentiation
FCZ + HygB | 0.033 | 0.035 | saturating | 0.121 | -0.079 | 1.53E+11 | FCZ | 1.685 | Bliss boosting
FCZ + IBU | -0.189 | -0.013 | suppressing | 0.641 | -0.036 | 86.999 | FCZ | 0.03 | potentiation
FCZ + KAN | 0.016 | -0.008 | suppressing | 0.209 | 1.421 | 0.864 | FCZ | 0.022 | potentiation
FCZ + LatA | 0.002 | 0 | masking | 0.127 | 1.94 | 12.372 | LatA | 0.086 | Bliss boosting
FCZ + TC | -0.079 | 0.033 | suppressing | 0.076 | -0.627 | 3362.32 | FCZ | 0.031 | potentiation
FCZ + TUN | 0 | 0 | masking | 0.64 | 0.134 | 65.1 | TUN | 0.095 | potentiation
FCZ + WM | 0.004 | 0 | masking | 0.105 | 20.11 | 0.426 | FCZ | 0.038 | potentiation
TUN + WM | -0.026 | 0 | suppressing | 0.067 | 9.81 | 227.407 | WM | 0.041 | Bliss boosting

Table 7–6: Evaluation of compound pairs for antifungal synergy in C. albicans with the Loewe additivity model. (a) Units of concentrations: µg/ml - CASP, CsA, FCZ, FK506, TUN, WM; µM - AlvC, FEN, LatA. (b) This experiment estimates the MIC95 for FCZ alone as 1.6 µg/ml; however, at 0.8 µg/ml there was nearly 95% inhibition. If 0.8 µg/ml is taken as a more conservative estimate of the MIC95 for FCZ alone, the resulting FICI suggests that the FCZ + TUN pair is not synergistic. (c) Tested in a multidrug-resistant clinical isolate (G5). Abbreviations: AlvC, alverine citrate; CASP, caspofungin; cmpd, compound; CsA, cyclosporin A; FCZ, fluconazole; FEN, fenpropimorph; FICI, fractional inhibitory concentration index; LatA, latrunculin A; MIC and MCC, minimum inhibitory and cytotoxic concentrations, respectively; TUN, tunicamycin; WM, wortmannin.

Compound pair | MIC95 cmpd1 alone (a) | MIC95 cmpd2 alone (a) | MIC95 cmpd1 combo (a) | MIC95 cmpd2 combo (a) | FICI (growth arrest) | MCC cmpd1 alone (a) | MCC cmpd2 alone (a) | MCC cmpd1 combo (a) | MCC cmpd2 combo (a) | FICI (cytotoxic) | Synergy type
AlvC + FCZ | ≥2000 | 4 | 15.625 | 4 | 1 | 2000 | ≥64 | 500 | 8 | 0.4 | cytotoxic
CASP + FCZ | 0.125 | 2 | 0.0625 | 1 | 1 | 0.125 | ≥64 | 0.03125 | 2 | 0.3 | cytotoxic
CsA + FEN | ≥200 | 0.8 | 0.39 | 0.4 | 0.5 | ≥200 | 0.8 | 1.5625 | 0.2 | 0.3 | cytotoxic
CsA + FK506 | ≥200 | ≥800 | - | - | - | - | - | - | - | - | -
CsA + FCZ | ≥200 | 1.5625 | 0.78125 | 0.8 | 0.5 | ≥200 | ≥64 | 3.125 | 1.5625 | 0 | cytotoxic
CsA + TUN | ≥200 | 12.5 | 6.25 | 0.78125 | 0.1 | ≥200 | ≥12.5 | 3.125 | 1.5625 | 0.1 | cytotoxic
CsA + WM | ≥200 | 100 | 0.2 | 100 | 1 | - | - | - | - | - | -
FEN + FK506 | 0.2 | ≥800 | 0.2 | 1.5625 | 1 | - | - | - | - | - | -
FEN + FCZ | 0.2 | 1.6 | 0.025 | 0.4 | 0.4 | - | - | - | - | - | fungistatic
FEN + TUN | 0.2 | 12.5 | 0.1 | 6.25 | 1 | - | - | - | - | - | -
FEN + WM | 0.2 | 100 | 0.1 | 25 | 0.8 | 0.8 | ≥100 | 1.6 | 6.25 | 2.1 | -
FK506 + FCZ | ≥800 | 1.5625 | 0.05 | 1.5625 | 1 | ≥200 | ≥64 | 0.025 | 1.5625 | 0 | cytotoxic
FK506 + TUN | ≥800 | 3.125 | 0.1 | 0.39 | 0.1 | 200 | 12.5 | 0.1 | 0.78125 | 0.1 | cytotoxic
FK506 + WM | ≥800 | 100 | 400 | 50 | 1 | - | - | - | - | - | -
FCZ + LatA | 2 | ≥1000 | 2 | 7.8125 | 1 | ≥64 | ≥1000 | 4 | 15.625 | 0.1 | cytotoxic
FCZ + TUN | 1.6 | 12.5 | 0.8 | 0.39 | 0.5 (b) | - | - | - | - | - | -
FCZ + WM | 1.5625 | ≥400 | 0.78125 | 200 | 1 | ≥64 | ≥100 | 6.25 | 12.5 | 0.2 | cytotoxic
TUN + WM | 12.5 | 100 | 6.25 | 50 | 1 | - | - | - | - | - | -
FCZ + WM (c) | ≥2048 | ≥400 | 128 | 12.5 | 0.1 | ≥2048 | ≥400 | 128 | 25 | 0.1 | cytotoxic

Table 7–7: Dose-matrix responses to compound pairs in C. albicans fitted to Bliss boosting and potentiation models. (a) Each Bliss level is associated with a reference β value and the value that is closest to the fitted β is shown here. (b) Units of Ypot: µg/ml - CASP, CsA, FCZ, FK506, TUN, WM; µM - AlvC, FEN, LatA. (c) The first consistent model, see Methods in chapter 6. (d) Response of a multidrug-resistant clinical isolate (G5). Abbreviations: AlvC, alverine citrate; CASP, caspofungin; CsA, cyclosporin A; FCZ, fluconazole; FEN, fenpropimorph; LatA, latrunculin A; SS, sum of squared fit errors; TUN, tunicamycin; WM, wortmannin.

Compound pair | Bliss boosting: β | Bliss boosting: reference β (a) | Bliss boosting: Bliss level | Bliss boosting: SS | Potentiation: p | Potentiation: Ypot (b) | Potentiation: potentiated compound | Potentiation: SS | Best fit (c)
AlvC + FCZ | -0.163 | 0 | masking | 6.033 | -0.001 | 0 | FCZ | 0.837 | Potentiation
CASP + FCZ | -0.006 | -0.006 | suppressing | 0.117 | 0.003 | 2.83E+242 | FCZ | 0.087 | Bliss
CsA + FEN | 0.147 | 0 | masking | 0.392 | 0.063 | 8.02E+12 | FEN | 0.293 | Bliss
CsA + FK506 | -0.019 | 0 | masking | 0.126 | -3.912 | 41.366 | FK506 | 0.032 | Potentiation
CsA + FCZ | -0.009 | 0 | masking | 0.534 | 0.624 | 7.51 | CsA | 0.312 | Bliss
CsA + TUN | -0.048 | 0 | masking | 10.923 | 0 | 18.705 | TUN | 5.698 | Bliss
CsA + WM | 0.063 | 0.039 | saturating | 0.267 | 0.017 | 5.93E+91 | WM | 0.128 | Potentiation
FEN + FK506 | -0.351 | -0.995 | cancelling | 3.333 | -0.463 | 27.093 | FEN | 0.282 | Potentiation
FEN + FCZ | -0.006 | 0 | masking | 6.663 | 1.511 | 0.018 | FCZ | 0.086 | Potentiation
FEN + TUN | 0.021 | -0.012 | suppressing | 0.825 | 0.554 | 57.928 | FEN | 0.349 | Potentiation
FEN + WM | 0.009 | 0 | masking | 1.616 | 1.088 | 0.03 | WM | 0.06 | Potentiation
FK506 + FCZ | -0.012 | 0 | masking | 0.65 | 2.144 | 1.782 | FK506 | 0.654 | Bliss
FK506 + TUN | 0.188 | 0 | masking | 10.495 | 0.067 | 0 | TUN | 2.1 | Potentiation
FK506 + WM | 0.122 | 0 | masking | 1.365 | 1.5 | 13.382 | FK506 | 0.382 | Potentiation
FCZ + LatA | 0.003 | 0 | masking | 0.017 | 3.521 | 895.271 | FCZ | 0.017 | Bliss
FCZ + TUN | -0.001 | 0 | masking | 0.582 | 3.342 | 5.37E+04 | TUN | 0.519 | Bliss
FCZ + WM | 0.037 | 0.038 | multiplicative | 1.035 | 1.625 | 119.371 | FCZ | 0.682 | Bliss
TUN + WM | 0.031 | 0 | masking | 0.771 | 2.184 | 6.092 | WM | 0.258 | Potentiation
FCZ + WM (d) | 0.105 | 0 | masking | 4.428 | 3.749 | 71.723 | WM | 1.06 | Potentiation
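As background for the Bliss boosting fits in Tables 7–5 and 7–7, the sketch below computes the plain Bliss independence reference surface and a sum of squared errors (the SS quantity) between an observed dose matrix and that surface. It is only a sketch under stated assumptions: the β-parameterized Bliss boosting model and the potentiation model are those described in the Methods of chapter 6 and are not reproduced here, and the function names and toy inhibition values are hypothetical.

```python
def bliss_expected(inhibition_a, inhibition_b):
    """Combined fractional inhibition expected if the two compounds act
    independently (Bliss independence); inputs are fractions in [0, 1]."""
    return inhibition_a + inhibition_b - inhibition_a * inhibition_b


def bliss_surface(single_a, single_b):
    """Expected dose matrix: rows follow the single-agent inhibitions of
    compound A, columns follow those of compound B."""
    return [[bliss_expected(a, b) for b in single_b] for a in single_a]


def sum_squared_error(observed, predicted):
    """Sum of squared differences between two dose matrices of equal shape."""
    return sum(
        (o - p) ** 2
        for row_o, row_p in zip(observed, predicted)
        for o, p in zip(row_o, row_p)
    )


# Toy 2 x 2 dose matrix (hypothetical fractional inhibitions).
reference = bliss_surface([0.0, 0.5], [0.0, 0.2])
observed = [[0.0, 0.2], [0.5, 0.8]]
excess_ss = sum_squared_error(observed, reference)
```

An observed inhibition above the reference surface, as in the high-dose corner of the toy matrix, is the kind of excess that a synergy model has to account for.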

[Figure 7–1 image. Panel A titles: complex-based, gene-based, log2ratio-based correlation; P values displayed: 0.0092, 0.0378, 0.3109. Panel B titles: haploids, diploids, haploids & diploids; P values displayed: 0.0372, 0.0794, 0.0572.]

Figure 7–1: Comparisons of variants of the antifungal synergy predictor. The significance of the true synergy enrichment associated with each variant (used with its optimal threshold, see Materials and Methods in chapter 6) relative to the permutation distribution of the baseline enrichment is shown. Specifically, the permutation distribution of a statistic quantifying the enrichment of the predictions with true synergies, where higher values indicate greater enrichment, is shown (based on 5,000 permutations, see Materials and Methods in chapter 6). The red line indicates the enrichment estimated from the observed data and its P value of significance is also shown. (A) Comparison of the three different measures of profile similarity as predictors of synergy. An altered gold standard set was used to evaluate these predictors for this comparison (see Materials and Methods in chapter 6). (B) Comparison of gene-based measure predictors dependent on haploid-based profiles only, diploid-based profiles only, and both types of profiles. An altered gold standard set was used to evaluate these predictors for this comparison (see Materials and Methods in chapter 6).

[Figure 7–2 image: panels A, B and C; P = 0.0180 is displayed in the figure.]

Figure 7–2: Statistical evaluations of the gene-based chemogenomic profile similarity measure as a predictor of antifungal synergy with different sets of negative examples. Estimates of the frequency of antifungal synergy from previous studies suggest that there are approximately 175,000 negative examples of synergy in the chemical space covered by our chemogenomic profile collection. (A) The prediction rate versus the true positive rate at different similarity thresholds. The prediction rate is the proportion of compound pairs in our chemical space predicted as synergistic (the total number of pairs is indicated in the x-axis label). The number of gold standard examples used for the estimations of the true positive rates is indicated in the y-axis label. This plot shows the ROC curve estimated from an overlarge gold standard set that contains all compound pairs in the chemical space as negative examples. The rates for a threshold of 10^-6.5 are highlighted with the dashed lines. (B) The significance of the accuracy of the predictor used with a threshold of 10^-6.5, relative to the permutation distribution of the baseline accuracy (based on 5,000 permutations, see Materials and Methods in chapter 6). The red line indicates the accuracy estimated from the observed data and its P value of significance is also shown. The estimates are based on confirmed positive and negative examples of synergy and the number of gold standard examples used for the estimations is indicated in the x-axis label. (C) The level of true synergy enrichment increases as the number of true negatives increases. Specifically, the level at which the predictions are enriched with true synergies becomes greater as the number of correctly predicted negative examples increases (used with a threshold of 10^-6.5).
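The two rates plotted in panel (A) of Figure 7–2 can be computed as in the following minimal Python sketch. The data structures, compound labels and P values are hypothetical; the sketch assumes that every gold standard positive also has a similarity P value in the collection.

```python
def rates_at_threshold(similarity_pvalues, gold_positives, threshold):
    """Return (prediction rate, true positive rate) at one threshold.

    similarity_pvalues maps each compound pair to its profile similarity
    P value; a pair is predicted as synergistic when P <= threshold.
    """
    predicted = {pair for pair, p in similarity_pvalues.items() if p <= threshold}
    prediction_rate = len(predicted) / len(similarity_pvalues)
    true_positive_rate = len(predicted & gold_positives) / len(gold_positives)
    return prediction_rate, true_positive_rate


# Toy chemical space of four pairs, two of which are known synergies.
toy_pvalues = {
    ("A", "B"): 1e-8,
    ("A", "C"): 1e-3,
    ("B", "C"): 1e-7,
    ("A", "D"): 0.5,
}
toy_gold = {("A", "B"), ("A", "C")}
toy_rates = rates_at_threshold(toy_pvalues, toy_gold, threshold=10 ** -6.5)
```

Evaluating the function over a range of thresholds traces out a curve of the kind shown in panel (A).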

[Figure 7–3 image: dose-matrix panels (A) and (B); axes are concentrations of fluconazole, wortmannin, cyclosporin A, FK506, tunicamycin and fenpropimorph.]

Figure 7–3: Dose-matrix responses to compound pairs that exhibit antifungal synergy in S. cerevisiae. For pairs that exhibit fungistatic synergy, the levels of inhibited growth (in culture) of treated cells relative to untreated cells are shown with heatmaps. For pairs that exhibit synergy through cytotoxic effects, the levels of recovery/growth of aliquots of treated cells spotted on YPD are shown. All concentrations are in µg/ml, except for alverine citrate and fenpropimorph (µM). (A) Synergistic pairs that include fluconazole. (B) Synergistic pairs that do not include fluconazole.

[Figure 7–4 image: dose-matrix panels (A) and (B); axes are concentrations of fluconazole, cyclosporin A, caspofungin, alverine citrate, FK506, latrunculin A, wortmannin, tunicamycin and fenpropimorph.]

Figure 7–4: Dose-matrix responses to compound pairs that exhibit antifungal synergy in C. albicans. For the pair that exhibits fungistatic synergy, the levels of inhibited growth (in culture) of treated cells relative to untreated cells are shown with a heatmap. For pairs that exhibit synergy through cytotoxic effects, the levels of recovery/growth of aliquots of treated cells spotted on YPD are shown. All concentrations are in µg/ml, except for alverine citrate and fenpropimorph (µM). (A) Synergistic pairs that include fluconazole. (B) Synergistic pairs that do not include fluconazole.

CHAPTER 8
Discussion

As the advances in systems biology continue to improve our understanding of biological systems, it is fitting that the field should continue to be integrated into drug development. Despite the rapid development of tools for identifying the components of biological systems at molecular resolution, a complete catalogue of interactions has yet to be acquired for even simple model organisms such as S. cerevisiae (see section 1.1.2). Here we discuss two main approaches to facilitating the early stages of drug development given the incomplete systems biology knowledgebase. Both approaches aim to improve the efficiency of the drug development process by incorporating computational prediction, not to replace biological experiments, but to prioritize them.

The first approach involves predicting interactions towards filling in the gaps in the systems biology knowledgebase. In particular, we predicted two types of interactions that are relevant to the identification of therapeutic targets: (i) kinase-substrate interactions, and (ii) genetic interactions. The second approach involves predicting chemical synergies towards facilitating the development of combinatorial therapies. With this approach, predictions are made based on profiles that capture the responses of a system to single compounds, even though the mechanisms with which the system formulates the responses may be unresolved. Overall, we used computational prediction in a systems biology framework to facilitate and expedite the early stages of drug development.

8.1 The Prediction of Kinase-Substrate Interactions

The potential role of kinases as therapeutic targets motivates the identification of all substrates of any given kinase. The substrates provide a key to elucidating the consequences of the activity of a kinase at the molecular level, and by extension, the consequences of modulating the activity of the kinase with a compound. Although there are several experimental techniques for identifying kinase-substrate interactions [241], it is currently a challenge to unambiguously identify interactions that are relevant in vivo. However, we wish to exploit the data obtained from current techniques to facilitate the identification of therapeutic targets. We therefore developed a method to predict kinase-substrate interactions, and obtained proof-of-principle results in S. cerevisiae.

In developing our method, we identified specific features of substrates that contribute to the prediction of kinase-substrate interactions. We focused on features that can be simply characterized as protein sequence motifs. It is thought that more functionally important regions in the genome tend to be more highly conserved. Given the important regulatory role of kinases, we hypothesized that features governing kinase-substrate interactions would be conserved, and we thus selected motifs that tend to be conserved (across closely related species). It was subsequently shown that known phosphorylation sites are more highly conserved than the sequences surrounding the sites, supporting our choice of motifs [193].

More than just the phosphorylation site on a substrate governs a kinase-substrate interaction within the cell [282, 84, 148, 208, 31]. Rather than exclusively using motifs that characterize the phosphorylation site, we use motifs that may occur anywhere on a substrate and contribute to the context that enables interaction with the kinase. Thus, by integrating these motifs into our predictor, we use contextual information to facilitate prediction. However, sequence motifs are limited by the extent to which they can capture structural features of a substrate, apparent in the three-dimensional context, that govern an interaction with a specific kinase. Although the three-dimensional structures of the majority of human proteins have yet to be resolved experimentally [91], protein structure prediction is continually improving.

Thus, the prediction of kinase-substrate interactions would likely be improved by considering the predicted structures rather than the sequences of proteins. In particular, one could identify structural motifs instead of sequence motifs that characterize substrates of a specific kinase, and subsequently search for these motifs in any given protein to predict whether or not it is a substrate of the kinase. Furthermore, the identified structural motifs would likely provide detailed insight into the contextual factors that enable a kinase-substrate interaction. Therefore, considering proteins in three dimensions as they occur in cells will likely improve our understanding of kinase-substrate interactions.

Predicting all substrates of a kinase using protein structures may contribute to drug development in multiple ways. As mentioned above, the predicted substrates may suggest detrimental downstream consequences of inhibiting the kinase, which in turn suggests that the kinase would not be a good therapeutic target. However, if the kinase is related to the disease of interest, this implies that at least one of its substrates is also related to the disease, and perhaps more closely. Thus, predicting substrates may suggest alternative therapeutic targets. Furthermore, the locations of the motif occurrences in a substrate may suggest specific regions of the protein that should be targeted by a compound to induce therapeutic effects (i.e. as a result of disrupting the kinase-substrate interaction). By examining the chemical and structural properties of a target region, it is possible to identify constraints on the type of compound that can bind there. Thus, selecting a target region of a protein structure may guide drug design. Similarly, having such a region also enables in silico compound screening (i.e. docking screens). Specifically, computational simulations can be used to predict how well any given compound binds to the target region. Taken together, our framework for predicting kinase-substrate interactions with protein structures potentially has many uses in drug development.
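
As a minimal illustration of the motif-occurrence idea discussed in this section, the following Python sketch counts occurrences of sequence motifs anywhere in candidate substrate sequences and assembles the counts into feature vectors. The motifs (written here as regular expressions), the protein names and the sequences are hypothetical placeholders chosen for illustration; they are not the conserved motifs or the data used in this thesis, and the classifier that would consume the features is omitted.

```python
import re

# Hypothetical motifs written as regular expressions (illustrative only).
MOTIFS = {
    "S/T-P": r"[ST]P",          # serine/threonine followed by proline
    "R-x-x-S/T": r"R..[ST]",    # basic residue three positions upstream
    "acidic-run": r"[DE]{3}",   # run of three acidic residues
}


def count_occurrences(pattern, sequence):
    """Count possibly overlapping matches of pattern in sequence."""
    # A zero-width lookahead lets matches overlap, e.g. 'EEEE' holds two 'EEE'.
    return len(re.findall(f"(?=(?:{pattern}))", sequence))


def motif_features(sequence, motifs=MOTIFS):
    """Return motif-occurrence counts for one protein, in motif order."""
    return [count_occurrences(p, sequence) for p in motifs.values()]


# Toy sequences (hypothetical, not real proteins).
proteins = {
    "protA": "MSPRKLSAEEEEDTPK",
    "protB": "MKVLAGGNQ",
}
features = {name: motif_features(seq) for name, seq in proteins.items()}
```

Because the counts are taken over the whole sequence rather than only around a phosphorylation site, each feature vector reflects the broader sequence context of a candidate substrate, which is the point made above.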

8.2 The Prediction of Genetic Interactions in C. elegans

Genetic interactions are relevant to drug development because they can facilitate the identification of therapeutic targets, particularly for monogenic diseases. Each interaction describes how two genes modulate a phenotype, which could be a disease symptom. In the case of an antagonistic interaction, perturbing one gene would cause the phenotype whereas perturbing the other gene would suppress the phenotype. Given a monogenic disease, a mutation in one gene causes disease symptoms and it is expected that targeting an antagonistic gene/protein would suppress the symptoms. Thus, a strategy for identifying therapeutic targets for a given monogenic disease is to identify the genetic interactors of the mutated gene.

To use this strategy, the disease must at least be characterized to the point where the mutated gene has been identified. Details of the physiopathological mechanism are not required, and the strategy can thus be used for less-characterized diseases. However, the strategy does require a method for identifying genetic interactions. Although there are experimental methods to test for these interactions in model organisms or human cell lines, these methods are impractical for testing all gene pairs, or they are limited to interactions with respect to one particular phenotype (as described in section 1.1.2). Therefore, we developed a method to predict genetic interactions in the model animal C. elegans.

Our method uses types of data that are also used by other methods for predicting genetic interactions in C. elegans. However, we integrated the data in novel ways (i.e. with novel features). The majority of our predicted interactions are also novel and a sizable fraction of the experimentally tested predictions was shown to be correct. These results highlight the utility of integration when the predictive datasets have a large amount of missing data.

Our cross-validation results show that there is still much room for improvement in terms of the sensitivity of our method. One likely explanation for this is the generality of our method. That is, while a significant number of genetic interactions may have common characteristics, different classes of genetic interactions may also have very different and perhaps even opposite characteristics. This suggests that a method that predicts genetic interactions based on generic and also class-specific characteristics may have improved sensitivity. Random forests would be well suited for this task since different classes could be identified through different branches. However, the current extent of missing data makes it difficult to identify characteristics of classes that significantly distinguish gene pairs of each class from pairs that do not interact. Ongoing experiments in the community will reduce the amount of missing data. In the meantime, the replacement of missing data might be improved by using an imputation method that exploits additional data that are correlated with the feature data (but perhaps only weakly correlated with the incidence of a genetic interaction). For example, two proteins must be in the same cellular compartment in order to exhibit a PP interaction, thus localization data could be used to help fill in the missing PP interaction data. Taken together, a method that considers the heterogeneity of genetic interactions would likely have improved sensitivity, if the amount of missing data were reduced and/or more accurate replacement values were obtained.

The class of a genetic interaction may capture a specific relationship (e.g. antagonism) between the interacting genes with respect to some phenotype; however, the relationship may in fact exist with respect to multiple phenotypes. For example, our results showed that two specific genes genetically interact with respect to defective reproduction and abnormal muscle cell phenotypes (as shown in Figures 4–6 and 4–7). The phenotypes implicate the biological processes that are influenced by an interaction, and thus a catalogue of the phenotypes associated with a given interaction is useful for furthering our understanding of biological systems. Obtaining such a catalogue experimentally requires testing gene pairs with respect to multiple phenotypes.

Furthermore, high-throughput methods to quantify each phenotype are required for comprehensive experimental testing, yet the development of such methods is nontrivial for many phenotypes (e.g. behavioural phenotypes). Predicting the phenotypes associated with predicted genetic interactions might circumvent this bottleneck and facilitate the identification of the interactions that are most relevant to a given disease.

The extent to which genetic interactions are conserved between evolutionarily distant species is still an open question. Although a gene and its function may be conserved, disrupting the gene in different organisms (i.e. different contexts) may result in different phenotypes. However, parallel phenotypes have been identified [174]. This suggests that studying genes and their relationship with phenotype A in organism A, e.g. an organism that is easy to manipulate, can generate hypotheses about the orthologous genes and their relationship to phenotype B in organism B, e.g. to a disease phenotype in humans. This raises the question as to whether genetic interactions with respect to phenotype A in organism A can generate hypotheses about interactions with respect to a phenotype B in organism B. Indeed, yeast genetic interactions have recently contributed to the prediction of C. elegans genetic interactions [158]. Taken together, these results encourage the study of C. elegans genetic interactions to infer human genetic interactions.

Large-scale studies of genetic interactions have focused on interactions between two genes. However, an interaction between two genes could be conditional on a third gene. In general, there may be genetic interactions that involve k genes. Just as two-gene interactions may identify therapeutic targets against monogenic diseases, k-gene interactions may identify therapeutic targets for diseases that involve mutations in multiple genes, i.e. multigenic diseases. Moreover, k-gene interactions may identify multiple targets for a single disease, thereby facilitating the development of multi-target or combinatorial therapies. As drug development efforts tackle more complex diseases, there is a pressing need for methods to facilitate the identification of therapeutic targets for multigenic diseases. It would therefore be interesting to investigate whether or not two-gene genetic interactions can be used to predict k-gene interactions.
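
Returning to the imputation suggestion made earlier in this section (using localization data to help fill in missing PP interaction data), the idea can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the gene names, the compartments, the function impute_pp and the values of prior and colocalized_rate are all hypothetical, and the rule shown is a simplification rather than the replacement scheme used in this thesis.

```python
def impute_pp(pair, pp_data, localization, prior=0.01, colocalized_rate=0.05):
    """Return a PP interaction value for a gene pair, imputing it if missing.

    pp_data maps frozenset({a, b}) -> 1.0 (interaction) or 0.0 (none).
    localization maps gene -> set of cellular compartments.
    prior and colocalized_rate are assumed (hypothetical) replacement values.
    """
    key = frozenset(pair)
    if key in pp_data:                 # observed value: nothing to impute
        return pp_data[key]
    a, b = pair
    loc_a, loc_b = localization.get(a), localization.get(b)
    if loc_a is None or loc_b is None:
        return prior                   # no localization evidence either way
    if loc_a & loc_b:
        return colocalized_rate        # shared compartment: interaction possible
    return 0.0                         # no shared compartment: treated as absent


# Toy data (hypothetical genes and compartments).
pp_data = {frozenset({"gene-1", "gene-2"}): 1.0}
localization = {
    "gene-1": {"nucleus"},
    "gene-2": {"nucleus", "cytoplasm"},
    "gene-3": {"mitochondrion"},
    "gene-5": {"cytoplasm"},
}
```

In practice localization data are themselves incomplete, so a replacement value of exactly 0.0 for pairs without a shared compartment is probably too strong; a small non-zero value may be a safer choice.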

8.3 The Prediction of Chemical Synergies

As our understanding of diseases improves, it has become apparent that, for many diseases, including prevalent cancers and heart diseases, the symptoms manifest due to perturbations of multiple genes. This suggests that a single therapeutic target is insufficient to address a complex disease, and thus, compounds may fail during clinical trials because they only have one target. Therefore, more focus is being placed on the development of multi-target/combinatorial therapies [284, 38].

Due to the challenge of identifying multiple targets towards the development of a combinatorial therapy for a given disease, development efforts have shifted towards an alternative approach that involves starting with the desired response from a biological system (e.g. a cell or a C. elegans disease model). Specifically, combinations of compounds are experimentally screened to identify those that induce a reduction in disease symptoms, i.e. the desired response [38, 28]. Such a screen may identify compounds with novel mechanisms for achieving the desired response, making the approach a particularly attractive alternative to target-driven development when our understanding of the disease is limited. However, the combinatorial nature of the screen increases the scale exponentially compared to single compound screens. Moreover, synergistic compounds are particularly attractive for combinatorial therapies, yet the experiments for identifying these compounds further increase the scale of the screen. Since chemical libraries can contain thousands to millions of compounds, the costs of such screens with current technologies become prohibitively large. Therefore, we developed a method to predict chemical synergies with proof-of-principle results in the antifungal domain.

To our knowledge, our method is the first computational approach to predicting synergy that does not require the targets of each compound of interest to be known. Since only a small fraction of the compounds in chemical libraries have known targets, our method can potentially assess a much larger number of compounds than other methods. Instead of exploiting target information, our method exploits the system responses to each compound captured by chemogenomic profiles. The generation of profiles can be seen as a limiting factor; however, the rate at which profiles can be generated for compounds is still much faster than the rate at which their targets can be identified (i.e. hundreds of profiles are published at a time [124, 120]). Taken together, our method for predicting chemical synergy can explore chemical space much more efficiently than other methods.

The main idea behind our method is that chemogenomic profile similarity predicts that the corresponding compounds are synergistic. However, we expect two profiles of the same compound to be highly similar, yet it would not be reasonable to predict that a compound is synergistic with itself. Indeed, our analysis suggests that both very low and very high similarity requirements do not yield predictions that are significantly enriched for true synergistic pairs (Figure 6–3A). These results suggest that having both a lower and an upper bound on profile similarity would yield more accurate predictions (compared to just a lower bound). However, chemical synergies seem to be rare if we assume that the percentage of synergistic pairs derived from the entire chemical space is less than the estimated 3.6% observed from a subset of this space that has already shown bioactivity [38], and thus there are few known synergistic pairs to help guide the selection of the bounds. Any attempt to select the bounds with the currently known synergies may result in overfitting. Therefore, our method for predicting chemical synergies may be improved by using two profile similarity bounds, although this improvement requires more known synergies.
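
The two-bound idea described above can be sketched in Python as follows. The sketch assumes that each chemogenomic profile is a list of per-strain fitness scores and uses Pearson correlation as the similarity measure, which may differ from the measure used in this thesis. The profiles, the compound names and the bound values 0.3 and 0.9 are hypothetical and chosen purely for illustration; as noted above, reliable bounds cannot yet be selected from the known synergies.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def predict_synergy(profiles, lower, upper):
    """Return compound pairs whose profile similarity lies in [lower, upper]."""
    predicted = []
    for a, b in combinations(sorted(profiles), 2):
        if lower <= pearson(profiles[a], profiles[b]) <= upper:
            predicted.append((a, b))
    return predicted


# Toy profiles (hypothetical fitness scores across five deletion strains).
profiles = {
    "cmpdA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "cmpdB": [1.0, 2.1, 2.9, 4.2, 5.0],   # near-identical to cmpdA
    "cmpdC": [2.0, 1.0, 4.0, 3.0, 5.0],   # moderately similar to cmpdA
    "cmpdD": [5.0, 4.0, 3.0, 2.0, 1.0],   # anti-correlated with cmpdA
}
pairs = predict_synergy(profiles, lower=0.3, upper=0.9)
```

In this toy example the near-identical pair (cmpdA, cmpdB) is excluded by the upper bound, mimicking the case of two profiles of the same compound, while the anti-correlated pairs are excluded by the lower bound.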

Since we only demonstrated the success of our method in one medical domain, it would be of great interest to assess the usefulness of our method in other domains. Chemical synergies with respect to cellular growth inhibition are sought after in both the antifungal and the anticancer domains. Thus, the success of our method in the antifungal domain suggests that our method would also be successful in the anticancer domain. It is less clear whether or not our method would be useful for identifying synergies with respect to other phenotypes. To address this question, one could collect known synergies with respect to other phenotypes from the literature and, as before, measure the significance with which the predictions are enriched with these synergies. Overall, the potential applicability of our method in other medical domains implies that our method could have a broad influence on drug development.

We have shown that it is possible to predict synergies between two compounds without information about their targets. Synergies between more than two compounds may target more than two genes, and thus be more appropriate for complex diseases. Therefore, it would be interesting to explore the possibility of predicting synergy between k compounds without information about their targets.

Following the identification of a synergistic combination, the next major challenge is to uncover the molecular mechanism by which the combination induces the desired response. Doing so is important for drug development because the mechanism provides insight into the toxicity level of the combination in a human, and into how the compounds could be optimized. Although uncovering the mechanism of even a single compound can be difficult, systems-based techniques have facilitated this task [74, 124]. A systems perspective is especially critical for elucidating the mechanism of a synergistic combination, since the different chemical perturbations are linked through the system architecture such that the effect of the combination is greater than the sum of the effects of the individual perturbations. Therefore, mapping the response of a system to a synergistic combination onto the system architecture will likely facilitate the uncovering of the mechanism of the combination.
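
As a brief aside on the enrichment assessment proposed earlier in this section for other phenotypes, one common way to measure the significance with which a set of predictions is enriched for known synergies is a one-sided hypergeometric test. The Python sketch below illustrates that test under the assumption that every tested pair is labelled as synergistic or not; it is not necessarily the test used in this thesis, and the function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_pairs, n_synergistic, n_predicted, n_hits):
    """Probability of at least n_hits known synergies among the predictions.

    n_pairs:        number of compound pairs tested experimentally
    n_synergistic:  number of those pairs known to be synergistic
    n_predicted:    number of tested pairs predicted to be synergistic
    n_hits:         number of predicted pairs that are known synergies
    """
    total = comb(n_pairs, n_predicted)
    upper = min(n_synergistic, n_predicted)
    tail = sum(
        comb(n_synergistic, k) * comb(n_pairs - n_synergistic, n_predicted - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 100 tested pairs, 5 synergistic, 10 predicted, 3 hits.
p = enrichment_p_value(100, 5, 10, 3)
```

A small value of p indicates that the predictions contain more known synergies than would be expected if the same number of pairs were drawn at random from the tested pairs.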

8.4 Conclusions

Systems biology provides drug development with detailed yet broad perspectives of biological systems. Such a perspective may permit a better understanding of how a system would respond to a potential drug, and also permit the discovery of novel therapeutic avenues. However, our knowledge of systems is incomplete, particularly in terms of the interactions within a system. Current experimental methods cannot practically address an entire system, either because of the scale of the task or because of special cases that are difficult to identify. Prior to the development of more adequate experimental methods, machine learning can exploit the available data to fill in the gaps with educated guesses and in doing so, learn trends underlying the system. However, in addition to being incomplete, system-based datasets also tend to be noisy. A particular type of data may exhibit a trend, but when several types exhibit the same trend in parallel, we have more confidence that the trend is real, rather than an artefact of the data. Therefore, integration is an important aspect of machine learning in systems biology.

Computational tools have now been developed for several early stages of drug development. We developed methods to facilitate the identification of therapeutic targets. Docking is an active area of research that allows individual compounds to be screened in silico [146]. We also developed a method to predict chemical synergies in order to screen chemical combinations in silico. Furthermore, chemical informatics is used to optimize compounds [44].

The computational tools may have a great financial impact on drug development. First, computational predictions are used towards focusing experimental efforts so that fewer experiments need to be conducted, thereby reducing experiment costs. In addition, by mining systems data, the computational tools may lead to better drug candidates so that costly clinical trials are not even attempted for poor candidates. Running computational tools also tends to be quicker than running experiments for the same purpose, and thus the tools may accelerate drug development. Taken together, computational tools may reduce the costs and time required by drug development.

In conclusion, we have shown that systems data can be exploited by machine learning techniques to facilitate drug development. The benefits of a systems perspective together with computational tools may encourage the private sector to approach less common diseases, and allow the public sector to play a greater role in the development of drugs for all diseases. Therefore, systems biology together with computation helps address the challenges of drug development and may have a large impact on development efforts in general.

REFERENCES

[1] A. A. Aboobaker and M. L. Blaxter. Medical significance of caenorhabditis elegans. Ann Med, 32(1):23–30, 2000.
[2] C. P. Adams and V. V. Brantner. Estimating the cost of new drug development: is it really 802 million dollars? Health Aff (Millwood), 25(2):420–8, 2006.
[3] V. R. Adams and M. Leggas. Sunitinib malate for the treatment of metastatic renal cell carcinoma and gastrointestinal stromal tumors. Clin Ther, 29(7):1338–53, 2007.
[4] V. Agoston, P. Csermely, and S. Pongor. Multiple weak hits confuse complex systems: a transcriptional regulatory network as an example. Phys Rev E Stat Nonlin Soft Matter Phys, 71(5 Pt 1):051909, 2005.
[5] S. H. Ahn, W. L. Cheung, J. Y. Hsu, R. L. Diaz, M. M. Smith, and C. D. Allis. Sterile 20 kinase phosphorylates histone h2b at serine 10 during hydrogen peroxide-induced apoptosis in s. cerevisiae. Cell, 120(1):25–36, 2005.
[6] J. Ahringer. Turn to the worm! Curr Opin Genet Dev, 7(3):410–5, 1997.
[7] M. Akaaboune, R. M. Grady, S. Turney, J. R. Sanes, and J. W. Lichtman. Neurotransmitter receptor dynamics studied in vivo by reversible photo-unbinding of fluorescent ligands. Neuron, 34(6):865–76, 2002.
[8] J. G. Albeck, J. M. Burke, S. L. Spencer, D. A. Lauffenburger, and P. K. Sorger. Modeling a snap-action, variable-delay switch controlling extrinsic cell death. PLoS Biol, 6(12):2831–52, 2008.
[9] Bruce Alberts. Molecular biology of the cell, chapter 15 Cell Communication, pages 831–906. Garland Science, New York, 4th edition, 2002.
[10] D. E. Albrecht and S. C. Froehner. Syntrophins and dystrobrevins: defining the dystrophin scaffold at synapses. Neurosignals, 11(3):123–9, 2002.
[11] S. R. Allerheiligen. Next-generation model-based drug discovery and development: quantitative and systems pharmacology. Clin Pharmacol Ther, 88(1):135–7, 2010.
[12] D. C. Amberg, J. E. Zahner, J. W. Mulholland, J. R. Pringle, and D. Botstein. Aip3p/bud6p, a yeast actin-interacting protein that is involved in morphogenesis and the selection of bipolar budding sites. Mol Biol Cell, 8(4):729–53, 1997.

[13] J. Aoto and L. Chen. Bidirectional ephrin/eph signaling in synaptic functions. Brain Res, 1184:72–80, 2007.
[14] Mustapha Aouida, Nicolas Pagé, Anick Leduc, Matthias Peter, and Dindial Ramotar. A genome-wide screen in saccharomyces cerevisiae reveals altered transport as a mechanism of resistance to the anticancer drug bleomycin. Cancer Res, 64(3):1102–1109, 2004.
[15] D. K. Arrell and A. Terzic. Network systems biology for drug discovery. Clin Pharmacol Ther, 88(1):120–5, 2010.
[16] M. Artal-Sanz, L. de Jong, and N. Tavernarakis. Caenorhabditis elegans: a versatile platform for drug discovery. Biotechnol J, 1(12):1405–18, 2006.
[17] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet, 25(1):25–9, 2000.
[18] K. R. Ayscough, J. Stryker, N. Pokala, M. Sanders, P. Crews, and D. G. Drubin. High rates of actin filament turnover in budding yeast and roles for actin in establishment and maintenance of cell polarity revealed using the actin inhibitor latrunculin-a. J Cell Biol, 137(2):399–416, 1997.
[19] J. G. Barbara. Ip3-dependent calcium-induced calcium release mediates bidirectional calcium waves in neurones: functional implications for synaptic plasticity. Biochim Biophys Acta, 1600(1-2):12–8, 2002.
[20] F. Barchiesi, L. F. Di Francesco, P. Compagnucci, D. Arzeni, A. Giacometti, and G. Scalise. In-vitro interaction of terbinafine with amphotericin b, fluconazole and itraconazole against clinical isolates of candida albicans. J Antimicrob Chemother, 41(1):59–65, 1998.
[21] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state markov chains. Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[22] Thomas J. Begley, Ari S. Rosenbach, Trey Ideker, and Leona D. Samson. Hot spots for modulating toxicity identified by genomic phenotyping and localization mapping. Mol Cell, 16(1):117–125, 2004.
[23] P. Beltrao, J. C. Trinidad, D. Fiedler, A. Roguev, W. A. Lim, K. M. Shokat, A. L. Burlingame, and N. J. Krogan. Evolution of phosphoregulation: comparison of phosphorylation patterns across yeast species. PLoS Biol, 7(6):e1000134, 2009.
[24] A. M. Bender, N. V. Kirienko, S. K. Olson, J. D. Esko, and D. S. Fay. lin-35/rb and the corest ortholog spr-1 coordinately regulate vulval morphogenesis and gonad development in c. elegans. Dev Biol, 302(2):448–62, 2007.
[25] A. M. Bender, O. Wells, and D. S. Fay. lin-35/rb and xnp-1/atr-x function redundantly to control somatic gonad development in c. elegans. Dev Biol, 273(2):335–49, 2004.
[26] G. M. Benian, T. L. Tinley, X. Tang, and M. Borodovsky. The caenorhabditis elegans gene unc-89, required for muscle m-line assembly, encodes a giant modular protein composed of ig and signal transduction domains. J Cell Biol, 132(5):835–48, 1996.
[27] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57:289–300, 1995.
[28] J. M. Berg, M. E. Rogers, and P. M. Lyster. Systems biology and pharmacology. Clin Pharmacol Ther, 88(1):17–9, 2010.
[29] J. Berkson. Maximum likelihood and minimum chi2 estimates of the logistic function. J Am Stat Assoc, 50(269):130–162, 1955.
[30] P. R. Bevington. Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York, 1969.
[31] R. P. Bhattacharyya, A. Remenyi, B. J. Yeh, and W. A. Lim. Domains, motifs, and scaffolds: The role of modular interactions in the evolution and wiring of cell signaling circuits. Annu Rev Biochem, 2006.
[32] V. Bianchi, P. Farisello, P. Baldelli, V. Meskenaite, M. Milanese, M. Vecellio, S. Muhlemann, H. P. Lipp, G. Bonanno, F. Benfenati, D. Toniolo, and P. D’Adamo. Cognitive impairment in gdi1-deficient mice is associated with altered synaptic vesicle pools and short-term synaptic plasticity, and can be corrected by appropriate learning training. Hum Mol Genet, 18(1):105–17, 2009.
[33] D. J. Blake. Dystrobrevin dynamics in muscle-cell signalling: a possible target for therapeutic intervention in duchenne muscular dystrophy? Neuromuscul Disord, 12 Suppl 1:S110–7, 2002.
[34] D. J. Blake, R. Nawrotzki, N. Y. Loh, D. C. Gorecki, and K. E. Davies. beta-dystrobrevin, a member of the dystrophin-related protein family. Proc Natl Acad Sci U S A, 95(1):241–6, 1998.
[35] C. I. Bliss. The toxicity of poisons applied jointly. Ann Appl Biol, 26:585–615, 1939.
[36] G. M. Bokoch. Biology of the p21-activated kinases. Annu Rev Biochem, 72:743–81, 2003.
[37] J. Bond and C. G. Woods. Cytoskeletal genes regulating brain size. Curr Opin Cell Biol, 18(1):95–101, 2006.
[38] A. A. Borisy, P. J. Elliott, N. W. Hurst, M. S. Lee, J. Lehar, E. R. Price, G. Serbedzija, G. R. Zimmermann, M. A. Foley, B. R. Stockwell, and C. T. Keith. Systematic discovery of multicomponent therapeutics. Proc Natl Acad Sci U S A, 100(13):7977–82, 2003.
[39] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[40] Leo Breiman. Classification and regression trees. The Wadsworth statistics/probability series. Wadsworth International Group, Belmont, Calif., 1984.
[41] B. J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D. H. Lackner, J. Bahler, V. Wood, K. Dolinski, and M. Tyers. The biogrid interaction database: 2008 update. Nucleic Acids Res, 36(Database issue):D637–40, 2008.
[42] D. K. Breslow, D. M. Cameron, S. R. Collins, M. Schuldiner, J. Stewart-Ornstein, H. W. Newman, S. Braun, H. D. Madhani, N. J. Krogan, and J. S. Weissman. A comprehensive strategy enabling high-resolution functional analysis of the yeast genome. Nat Methods, 5(8):711–8, 2008.
[43] E. J. Brown, M. W. Albers, T. B. Shin, K. Ichikawa, C. T. Keith, W. S. Lane, and S. L. Schreiber. A mammalian protein targeted by g1-arresting rapamycin-receptor complex. Nature, 369(6483):756–8, 1994.
[44] F. Brown. Editorial opinion: chemoinformatics - a ten year update. Curr Opin Drug Discov Devel, 8(3):298–302, 2005.
[45] J. A. Brown, G. Sherlock, C. L. Myers, N. M. Burrows, C. Deng, H. I. Wu, K. E. McCann, O. G. Troyanskaya, and J. M. Brown. Global analysis of gene function in yeast by quantitative phenotypic profiling. Mol Syst Biol, 2:2006.0001, 2006.
[46] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic dna. J Mol Biol, 268(1):78–94, 1997.
[47] E. C. Butcher, E. L. Berg, and E. J. Kunkel. Systems biology in drug discovery. Nat Biotechnol, 22(10):1253–9, 2004.
[48] A. B. Byrne, M. T. Weirauch, V. Wong, M. Koeva, S. J. Dixon, J. M. Stuart, and P. J. Roy. A global analysis of genetic interactions in caenorhabditis elegans. J Biol, 6(3):8, 2007.
[49] S. G. Bøttcher and C. Dethlefsen. Deal: a package for learning bayesian networks. J Stat Software, 8:1–40, 2003.
[50] M. E. Caruso, S. Jenna, M. Bouchecareilh, D. L. Baillie, D. Boismenu, D. Halawani, M. Latterich, and E. Chevet. Gtpase-mediated regulation of the unfolded protein response in caenorhabditis elegans is dependent on the aaa+ atpase cdc-48. Mol Cell Biol, 28(13):4261–74, 2008.
[51] E. J. Chang, R. Begum, B. T. Chait, and T. Gaasterland. Prediction of cyclin-dependent kinase phosphorylation substrates. PLoS ONE, 2(7):e656, 2007.
[52] Michael Chang, Mohammed Bellaoui, Charles Boone, and Grant W. Brown. A genome-wide screen for methyl methanesulfonate-sensitive mutants reveals genes required for s phase progression in the presence of dna damage. Proc Natl Acad Sci U S A, 99(26):16934–16939, 2002.
[53] J. Cheng, A. Z. Randall, M. J. Sweredoski, and P. Baldi. Scratch: a protein structure and structural feature prediction server. Nucleic Acids Res, 33(Web Server issue):W72–6, 2005.
[54] R. Chenna, H. Sugawara, T. Koike, R. Lopez, T. J. Gibson, D. G. Higgins, and J. D. Thompson. Multiple sequence alignment with the clustal series of programs. Nucleic Acids Res, 31(13):3497–500, 2003. [55] K. C. Chipman and A. K. Singh. Predicting genetic interactions with random walks on biological networks. BMC Bioinformatics, 10(1):17, 2009. [56] E. Chiroli, R. Fraschini, A. Beretta, M. Tonelli, G. Lucchini, and S. Piatti. Budding yeast pak kinases regulate mitotic exit by two different mechanisms. J Cell Biol, 160(6):857–74, 2003. [57] T. R. Clandinin, J. A. DeModena, and P. W. Sternberg. Inositol trisphosphate mediates a ras-independent response to let-23 receptor tyrosine kinase activation in c. elegans. Cell, 92(4):523–33, 1998. [58] P. Cohen. The regulation of protein function by multisite phosphorylation–a 25 year update. Trends Biochem Sci, 25(12):596–601, 2000. [59] P. Cohen. Protein kinases–the major drug targets of the twenty-first century? Nat Rev Drug Discov, 1(4):309–15, 2002. [60] H. J. Cordell. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Hum Mol Genet, 11(20):2463–8, 2002. [61] M. Costanzo, A. Baryshnikova, J. Bellay, Y. Kim, E. D. Spear, C. S. Sevier, H. Ding, J. L. Koh, K. Toufighi, S. Mostafavi, J. Prinz, R. P. St Onge, B. Van- derSluis, T. Makhnevych, F. J. Vizeacoumar, S. Alizadeh, S. Bahr, R. L. Brost, Y. Chen, M. Cokol, R. Deshpande, Z. Li, Z. Y. Lin, W. Liang, M. Marback, J. Paw, B. J. San Luis, E. Shuteriqi, A. H. Tong, N. van Dyk, I. M. Wallace, J. A. Whitney, M. T. Weirauch, G. Zhong, H. Zhu, W. A. Houry, M. Brudno, S. Ragibizadeh, B. Papp, C. Pal, F. P. Roth, G. Giaever, C. Nislow, O. G. 207 Troyanskaya, H. Bussey, G. D. Bader, A. C. Gingras, Q. D. Morris, P. M. Kim, C. A. Kaiser, C. L. Myers, B. J. Andrews, and C. Boone. The genetic landscape of a cell. Science, 327(5964):425–31, 2010. [62] D. R. Cox. The regression-analysis of binary sequences. 
Journal of the Royal Statistical Society Series B-Statistical Methodology, 20(2):215–242, 1958.
[63] F. Crick. Central dogma of molecular biology. Nature, 227(5258):561–3, 1970.
[64] E. Culetto and D. B. Sattelle. A role for caenorhabditis elegans in understanding the function and interactions of human disease genes. Hum Mol Genet, 9(6):869–77, 2000.
[65] F. Cvrckova, C. De Virgilio, E. Manser, J. R. Pringle, and K. Nasmyth. Ste20-like protein kinases are required for normal localization of cell growth and for cytokinesis in budding yeast. Genes Dev, 9(15):1817–30, 1995.
[66] P. D’Adamo, A. Menegon, C. Lo Nigro, M. Grasso, M. Gulisano, F. Tamanini, T. Bienvenu, A. K. Gedeon, B. Oostra, S. K. Wu, A. Tandon, F. Valtorta, W. E. Balch, J. Chelly, and D. Toniolo. Mutations in gdi1 are responsible for x-linked non-specific mental retardation. Nat Genet, 19(2):134–9, 1998.
[67] P. D’Adamo, H. Welzl, S. Papadimitriou, M. Raffaele di Barletta, C. Tiveron, L. Tatangelo, L. Pozzi, P. F. Chapman, S. G. Knevett, M. F. Ramsay, F. Valtorta, C. Leoni, A. Menegon, D. P. Wolfer, H. P. Lipp, and D. Toniolo. Deletion of the mental retardation gene gdi1 impairs associative memory and alters social behavior in mice. Hum Mol Genet, 11(21):2567–80, 2002.
[68] P. D’Adamo, D. P. Wolfer, C. Kopp, I. Tobler, D. Toniolo, and H. P. Lipp. Mice deficient for the synaptic vesicle protein rab3a show impaired spatial reversal learning and increased explorative activity but none of the behavioral changes shown by mice deficient for the rab3a regulator gdi1. Eur J Neurosci, 19(7):1895–905, 2004.
[69] I. Dan, N. M. Watanabe, and A. Kusumi. The ste20 group kinases as regulators of map kinase cascades. Trends Cell Biol, 11(5):220–30, 2001.
[70] G. de Voer, D. Peters, and P. E. Taschner. Caenorhabditis elegans as a model for lysosomal storage disorders. Biochim Biophys Acta, 1782(7-8):433–46, 2008.
[71] A. P. Dempster, N. M. Laird, and D. B. Rubin.
Maximum likelihood from incomplete data via em algorithm. Journal of the Royal Statistical Society Series B-Methodological, 39(1):1–38, 1977.
[72] Christine Desmoucelles, Benoit Pinson, Christelle Saint-Marc, and Bertrand Daignan-Fornier. Screening the yeast “disruptome” for mutants affecting resistance to the immunosuppressive drug, mycophenolic acid. J Biol Chem, 277(30):27036–27044, 2002.
[73] S. Dharmawardhane, A. Schurmann, M. A. Sells, J. Chernoff, S. L. Schmid, and G. M. Bokoch. Regulation of macropinocytosis by p21-activated kinase-1. Mol Biol Cell, 11(10):3341–52, 2000.
[74] D. di Bernardo, M. J. Thompson, T. S. Gardner, S. E. Chobot, E. L. Eastwood, A. P. Wojtovich, S. J. Elliott, S. E. Schaus, and J. J. Collins. Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol, 23(3):377–83, 2005.
[75] S. J. Dixon, Y. Fedyshyn, J. L. Koh, T. S. Prasad, C. Chahwan, G. Chua, K. Toufighi, A. Baryshnikova, J. Hayles, K. L. Hoe, D. U. Kim, H. O. Park, C. L. Myers, A. Pandey, D. Durocher, B. J. Andrews, and C. Boone. Significant conservation of synthetic lethal genetic interaction networks between distantly related eukaryotes. Proc Natl Acad Sci U S A, 105(43):16653–8, 2008.
[76] J. H. Do and D. K. Choi. Computational approaches to gene prediction. J Microbiol, 44(2):137–44, 2006.
[77] Russell K. Dorer, Sheng Zhong, John A. Tallarico, Wing Hung Wong, Timothy J. Mitchison, and Andrew W. Murray. A small-molecule inhibitor of mps1 blocks the spindle-checkpoint response to a lack of tension on mitotic chromosomes. Curr Biol, 15(11):1070–1076, 2005.
[78] B. L. Drees, V. Thorsson, G. W. Carter, A. W. Rives, M. Z. Raymond, I. Avila-Campillo, P. Shannon, and T. Galitski. Derivation of genetic interaction networks from quantitative phenotype data. Genome Biol, 6(4):R38, 2005.
[79] N. Dunkel, T. T. Liu, K. S. Barker, R. Homayouni, J. Morschhauser, and P. D. Rogers.
A gain-of-function mutation in the transcription factor upc2p causes upregulation of ergosterol biosynthesis genes and increased fluconazole resistance in a clinical candida albicans isolate. Eukaryot Cell, 7(7):1180–90, 2008.
[80] J. J. Eby, S. P. Holly, F. van Drogen, A. V. Grishin, M. Peter, D. G. Drubin, and K. J. Blumer. Actin cytoskeleton organization regulated by the pak family of protein kinases. Curr Biol, 8(17):967–70, 1998.
[81] M. B. Elowitz and S. Leibler. A synthetic oscillatory network of transcriptional regulators. Nature, 403(6767):335–8, 2000.
[82] M. Evangelista, K. Blundell, M. S. Longtine, C. J. Chow, N. Adames, J. R. Pringle, M. Peter, and C. Boone. Bni1p, a yeast formin linking cdc42p and the actin cytoskeleton during polarized morphogenesis. Science, 276(5309):118–22, 1997.
[83] Brian Everitt, Sabine Landau, and Morven Leese. Cluster analysis. Arnold; Oxford University Press, London; New York, 4th edition, 2001.
[84] M. C. Faux and J. D. Scott. More on target with protein phosphorylation: conferring specificity by location. Trends Biochem Sci, 21(8):312–5, 1996.
[85] S. Fields and M. Johnston. Cell biology. whither model organism research? Science, 307(5717):1885–6, 2005.
[86] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340(6230):245–6, 1989.
[87] G. Finak, N. Bertos, F. Pepin, S. Sadekova, M. Souleimanova, H. Zhao, H. Chen, G. Omeroglu, S. Meterissian, A. Omeroglu, M. Hallett, and M. Park. Stromal gene expression predicts clinical outcome in breast cancer. Nat Med, 14(5):518–27, 2008.
[88] A. Fire, S. Xu, M. K. Montgomery, S. A. Kostas, S. E. Driver, and C. C. Mello. Potent and specific genetic interference by double-stranded rna in caenorhabditis elegans. Nature, 391(6669):806–11, 1998.
[89] J. L. Fish, C. Dehay, H. Kennedy, and W. B. Huttner. Making bigger brains-the evolution of neural-progenitor-cell division. J Cell Sci, 121(Pt 17):2783–93, 2008.
[90] J. B. Fitzgerald, B.
Schoeberl, U. B. Nielsen, and P. K. Sorger. Systems biology and combination therapy in the quest for clinical efficacy. Nat Chem Biol, 2(9):458–66, 2006.
[91] Research Collaboratory for Structural Bioinformatics. Protein data bank, August 17, 2010.
[92] E. Formstecher, S. Aresta, V. Collura, A. Hamburger, A. Meil, A. Trehin, C. Reverdy, V. Betin, S. Maire, C. Brun, B. Jacq, M. Arpin, Y. Bellaiche, S. Bellusci, P. Benaroch, M. Bornens, R. Chanet, P. Chavrier, O. Delattre, V. Doye, R. Fehon, G. Faye, T. Galli, J. A. Girault, B. Goud, J. de Gunzburg, L. Johannes, M. P. Junier, V. Mirouse, A. Mukherjee, D. Papadopoulo, F. Perez, A. Plessis, C. Rosse, S. Saule, D. Stoppa-Lyonnet, A. Vincent, M. White, P. Legrain, J. Wojcik, J. Camonis, and L. Daviet. Protein interaction mapping: a drosophila case study. Genome Res, 15(3):376–84, 2005.
[93] Chen Gao, Luming Wang, Elena Milgrom, and W. C. Winston Shen. On the mechanism of constitutive pdr1 activator-mediated pdr5 transcription in saccharomyces cerevisiae: evidence for enhanced recruitment of coactivators and altered nucleosome structures. J Biol Chem, 279(41):42677–42686, 2004.
[94] T. S. Gardner, C. R. Cantor, and J. J. Collins. Construction of a genetic toggle switch in escherichia coli. Nature, 403(6767):339–42, 2000.
[95] M. J. Garside. The best sub-set in multiple-regression analysis. The Royal Statistical Society Series C-Applied Statistics, 14(2-3):196–200, 1965.
[96] E. Gasteiger, C. Hoogland, A. Gattiker, S. Duvaud, M. R. Wilkins, R. D. Appel, and A. Bairoch. The proteomics protocols handbook, chapter 52 Protein Identification and Analysis Tools on the ExPASy Server, pages 571–607. Humana Press, Totowa, N.J., 2005.
[97] D. M. Gelperin, M. A. White, M. L. Wilkinson, Y. Kon, L. A. Kung, K. J. Wise, N. Lopez-Hoyo, L. Jiang, S. Piccirillo, H. Yu, M. Gerstein, M. E. Dumont, E. M. Phizicky, M. Snyder, and E. J. Grayhack. Biochemical and genetic analysis of the yeast proteome with a movable orf collection.
Genes Dev, 19(23):2816–26, 2005.
[98] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias variance dilemma. Neural Computation, 4(1):1–58, 1992.
[99] Guri Giaever, Patrick Flaherty, Jochen Kumm, Michael Proctor, Corey Nislow, Daniel F. Jaramillo, Angela M. Chu, Michael I. Jordan, Adam P. Arkin, and Ronald W. Davis. Chemogenomic profiling: identifying the functional interactions of small molecules in yeast. Proc Natl Acad Sci U S A, 101(3):793–798, 2004.
[100] K. Gieseler, C. Bessou, and L. Segalat. Dystrobrevin- and dystrophin-like mutants display similar phenotypes in the nematode caenorhabditis elegans. Neurogenetics, 2(2):87–90, 1999.
[101] K. Gieseler, K. Grisoni, M. C. Mariol, and L. Segalat. Overexpression of dystrobrevin delays locomotion defects and muscle degeneration in a dystrophin-deficient caenorhabditis elegans. Neuromuscul Disord, 12(4):371–7, 2002.
[102] K. Gieseler, K. Grisoni, and L. Segalat. Genetic suppression of phenotypes arising from mutations in dystrophin-related genes in caenorhabditis elegans. Curr Biol, 10(18):1092–7, 2000.
[103] S. F. Gilbert and J. A. Bolker. Homologies of process and modular elements of embryonic construction. J Exp Zool, 291(1):1–12, 2001.
[104] A. S. Goehring, D. A. Mitchell, A. H. Tong, M. E. Keniry, C. Boone, and G. F. Sprague Jr. Synthetic lethal analysis implicates ste20p, a p21-activated protein kinase, in polarisome activation. Mol Biol Cell, 14(4):1501–16, 2003.
[105] S. M. Grant and S. P. Clissold. Fluconazole. a review of its pharmacodynamic and pharmacokinetic properties, and therapeutic potential in superficial and systemic mycoses. Drugs, 39(6):877–916, 1990.
[106] K. Grisoni, E. Martin, K. Gieseler, M. C. Mariol, and L. Segalat. Genetic evidence for a dystrophin-glycoprotein complex (dgc) in caenorhabditis elegans. Gene, 294(1-2):77–86, 2002.
[107] Andreas H. Groll and Thomas J. Walsh. Antifungal chemotherapy: advances and perspectives.
Swiss Med Wkly, 132(23-24):303–311, 2002.
[108] P. Grote and B. Conradt. The plzf-like protein tra-4 cooperates with the gli-like transcription factor tra-1 to promote female development in c. elegans. Dev Cell, 11(4):561–73, 2006.
[109] Y. Guan, C. L. Myers, D. C. Hess, Z. Barutcuoglu, A. A. Caudy, and O. G. Troyanskaya. Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol, 9 Suppl 1:S3, 2008.
[110] A. Gururaj, C. J. Barnes, R. K. Vadlamudi, and R. Kumar. Regulation of phosphoglucomutase 1 phosphorylation and activity by a signaling kinase. Oncogene, 23(49):8118–27, 2004.
[111] S. J. Haggarty, P. A. Clemons, and S. L. Schreiber. Chemical genomic profiling of biological networks using graph theory and combinations of small molecule perturbations. J Am Chem Soc, 125:10543–10545, 2003.
[112] S. K. Hanks. Genomic analysis of the eukaryotic protein kinase superfamily: a perspective. Genome Biol, 4(5):111, 2003.
[113] G. T. Hart, A. K. Ramani, and E. M. Marcotte. How complete are current yeast and human protein-interaction networks? Genome Biol, 7(11):120, 2006.
[114] L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray. From molecular to modular cell biology. Nature, 402(6761 Suppl):C47–52, 1999.
[115] Trevor Hastie, Robert Tibshirani, and J. H. Friedman. The elements of statistical learning: data mining, inference, and prediction: with 200 full-color illustrations, chapter 14.3.12 Hierarchical Clustering, pages 472–480. Springer, New York, NY, 2001.
[116] Trevor Hastie, Robert Tibshirani, and J. H. Friedman. The elements of statistical learning: data mining, inference, and prediction: with 200 full-color illustrations, chapter 6.6.3 The Naive Bayes Classifier, pages 184–185. Springer, New York, NY, 2001.
[117] D. Heckerman. A tutorial on learning with bayesian networks. Learning in Graphical Models, 89:301–354 630, 1998.
[118] J. Heitman, N. R. Movva, and M. N. Hall.
Targets for cell cycle arrest by the immunosuppressant rapamycin in yeast. Science, 253(5022):905–9, 1991.
[119] M. J. Higgins and S. J. Graham. Intellectual property. balancing innovation and access: patent challenges tip the scales. Science, 326(5951):370–1, 2009.
[120] M. E. Hillenmeyer, E. Fung, J. Wildenhain, S. E. Pierce, S. Hoon, W. Lee, M. Proctor, R. P. St Onge, M. Tyers, D. Koller, R. B. Altman, R. W. Davis, C. Nislow, and G. Giaever. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science, 320(5874):362–5, 2008.
[121] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sorensen, J. Matthiesen, R. C. Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180–3, 2002.
[122] H. Hogues, H. Lavoie, A. Sellam, M. Mangos, T. Roemer, E. Purisima, A. Nantel, and M. Whiteway. Transcription factor substitution during the evolution of fungal ribosome regulation. Mol Cell, 29(5):552–62, 2008.
[123] S. P. Holly and K. J. Blumer. Pak-family kinases regulate cell and actin polarization throughout the cell cycle of saccharomyces cerevisiae. J Cell Biol, 147(4):845–56, 1999.
[124] S. Hoon, A. M. Smith, I. M. Wallace, S. Suresh, M. Miranda, E. Fung, M. Proctor, K. M. Shokat, C. Zhang, R. W. Davis, G. Giaever, R. P. St Onge, and C. Nislow. An integrated platform of genomic assays reveals small-molecule bioactivities. Nat Chem Biol, 4(8):498–506, 2008.
[125] C. C. Huang, J.
L. You, M. Y. Wu, and K. S. Hsu. Rap1-induced p38 mitogen-activated protein kinase activation facilitates ampa receptor trafficking via the gdi.rab5 complex. potential role in (s)-3,5-dihydroxyphenylglycene-induced long term depression. J Biol Chem, 279(13):12286–92, 2004.
[126] H. D. Huang, T. Y. Lee, S. W. Tzeng, L. C. Wu, J. T. Horng, A. P. Tsou, and K. T. Huang. Incorporating hidden markov models for identifying protein kinase-specific phosphorylation sites. J Comput Chem, 26(10):1032–41, 2005.
[127] B. Hughes. 2007 fda drug approvals: a year of flux. Nat Rev Drug Discov, 7(2):107–9, 2008.
[128] T. R. Hughes. Yeast and drug discovery. Funct Integr Genomics, 2(4-5):199–211, 2002.
[129] T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend. Functional discovery via a compendium of expression profiles. Cell, 102(1):109–26, 2000.
[130] Innovative Medicines Initiative. The innovative medicines initiative (imi) research agenda, 2008.
[131] M. Ishiura, S. Kutsuna, S. Aoki, H. Iwasaki, C. R. Andersson, A. Tanabe, S. S. Golden, C. H. Johnson, and T. Kondo. Expression of a gene cluster kaiabc as a circadian feedback process in cyanobacteria. Science, 281(5382):1519–23, 1998.
[132] F. Jacob and J. Monod. Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol, 3:318–56, 1961.
[133] G. Jansen, C. Wu, B. Schade, D. Y. Thomas, and M. Whiteway. Drag and drop cloning in yeast. Gene, 344:43–51, 2005.
[134] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N. J. Krogan, S. Chung, A. Emili, M. Snyder, J. F. Greenblatt, and M. Gerstein. A bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644):449–53, 2003.
[135] S. Jenna, M. E. Caruso, A. Emadali, D. T. Nguyen, M. Dominguez, S.
Li, R. Roy, J. Reboul, M. Vidal, G. N. Tzimas, R. Bosse, and E. Chevet. Regulation of membrane trafficking by a novel cdc42-related protein in caenorhabditis elegans epithelial cells. Mol Biol Cell, 16(4):1629–39, 2005.
[136] H. Jin and D. C. Amberg. The secretory pathway mediates localization of the cell polarity regulator aip3p/bud6p. Mol Biol Cell, 11(2):647–61, 2000.
[137] D. I. Johnson. Cdc42: An essential rho-type gtpase controlling eukaryotic cell polarity. Microbiol Mol Biol Rev, 63(1):54–105, 1999.
[138] R. P. Johnson, S. H. Kang, and J. M. Kramer. C. elegans dystroglycan dgn-1 functions in epithelia and neurons, but not muscle, and independently of dystrophin. Development, 133(10):1911–21, 2006.
[139] R. S. Kamath, A. G. Fraser, Y. Dong, G. Poulin, R. Durbin, M. Gotta, A. Kanapin, N. Le Bot, S. Moreno, M. Sohrmann, D. P. Welchman, P. Zipperlen, and J. Ahringer. Systematic functional analysis of the caenorhabditis elegans genome using rnai. Nature, 421(6920):231–7, 2003.
[140] C. T. Keith, A. A. Borisy, and B. R. Stockwell. Multicomponent therapeutics for networked systems. Nat Rev Drug Discov, 4(1):71–8, 2005.
[141] R. Kelley and T. Ideker. Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol, 23(5):561–6, 2005.
[142] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423(6937):241–54, 2003.
[143] D. Kemmer, R. M. Podowski, D. Yusuf, J. Brumm, W. Cheung, C. Wahlestedt, B. Lenhard, and W. W. Wasserman. Gene characterization index: assessing the depth of gene annotation. PLoS ONE, 3(1):e1440, 2008.
[144] H. Kim, M. J. Rogers, J. E. Richmond, and S. L. McIntire. Snf-6 is an acetylcholine transporter interacting with the dystrophin complex in caenorhabditis elegans. Nature, 430(7002):891–6, 2004.
[145] S. K. Kim, J. Lund, M. Kiraly, K. Duke, M. Jiang, J. M. Stuart, A. Eizinger, B. N. Wylie, and G. S.
Davidson. A gene expression map for caenorhabditis elegans. Science, 293(5537):2087–92, 2001.
[146] D. B. Kitchen, H. Decornez, J. R. Furr, and J. Bajorath. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov, 3(11):935–49, 2004.
[147] K. Klemm and S. Bornholdt. Topology of biological networks and reliability of information processing. Proc Natl Acad Sci U S A, 102(51):18414–9, 2005.
[148] B. Kobe, T. Kampmann, J. K. Forwood, P. Listwan, and R. I. Brinkworth. Substrate specificity of protein kinases and computational prediction of substrates. Biochim Biophys Acta, 1754(1-2):200–9, 2005.
[149] P. Kohl, E. J. Crampin, T. A. Quinn, and D. Noble. Systems biology: an approach. Clin Pharmacol Ther, 88(1):25–33, 2010.
[150] I. Kola and J. Landis. Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov, 3(8):711–5, 2004.
[151] D. J. Krysan and L. Didone. A high-throughput screening assay for small molecules that disrupt yeast cell integrity. J Biomol Screen, 13(7):657–64, 2008.
[152] K. Kutsche, H. Yntema, A. Brandt, I. Jantke, H. G. Nothwang, U. Orth, M. G. Boavida, D. David, J. Chelly, J. P. Fryns, C. Moraine, H. H. Ropers, B. C. Hamel, H. van Bokhoven, and A. Gal. Mutations in arhgef6, encoding a guanine nucleotide exchange factor for rho gtpases, in patients with x-linked mental retardation. Nat Genet, 26(2):247–50, 2000.
[153] R. Lamprecht, D. S. Margulies, C. R. Farb, M. Hou, L. R. Johnson, and J. E. LeDoux. Myosin light chain kinase regulates synaptic plasticity and fear learning in the lateral amygdala. Neuroscience, 139(3):821–9, 2006.
[154] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C.
Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson, L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.
[155] E. Leberer, D. Dignard, D. Harcus, D. Y. Thomas, and M. Whiteway. The protein kinase homologue ste20p is required to link the yeast pheromone response g-protein beta gamma subunits to downstream signalling components. EMBO J, 11(13):4815–24, 1992.
[156] T. Lechler, A. Shevchenko, and R. Li. Direct involvement of yeast type i myosins in cdc42-dependent actin polymerization. J Cell Biol, 148(2):363–73, 2000.
[157] I. Lee, B. Lehner, C. Crombie, W. Wong, A. G. Fraser, and E. M. Marcotte. A single gene network accurately predicts phenotypic effects of gene perturbation in caenorhabditis elegans. Nat Genet, 40(2):181–8, 2008.
[158] I. Lee, B. Lehner, T. Vavouri, J. Shin, A. G. Fraser, and E. M. Marcotte. Predicting genetic modifier loci using functional gene networks. Genome Res, 20(8):1143–53, 2010.
[159] W. Lee, R. P. St Onge, M. Proctor, P. Flaherty, M. I. Jordan, A. P. Arkin, R. W. Davis, C. Nislow, and G. Giaever.
Genome-wide requirements for resistance to functionally distinct dna-damaging agents. PLoS Genet, 1(2):e24, 2005.
[160] T. Leeuw, A. Fourest-Lieuvin, C. Wu, J. Chenevert, K. Clark, M. Whiteway, D. Y. Thomas, and E. Leberer. Pheromone response in yeast: association of bem1p with proteins of the map kinase cascade and actin. Science, 270(5239):1210–3, 1995.
[161] J. Lehar, B. R. Stockwell, G. Giaever, and C. Nislow. Combination chemical genetics. Nat Chem Biol, 4(11):674–81, 2008.
[162] J. Lehar, G. R. Zimmermann, A. S. Krueger, R. A. Molnar, J. T. Ledell, A. M. Heilbut, G. F. Short 3rd, L. C. Giusti, G. P. Nolan, O. A. Magid, M. S. Lee, A. A. Borisy, B. R. Stockwell, and C. T. Keith. Chemical combination effects predict connectivity in biological systems. Mol Syst Biol, 3:80, 2007.
[163] B. Lehner, C. Crombie, J. Tischler, A. Fortunato, and A. G. Fraser. Systematic mapping of genetic interactions in caenorhabditis elegans identifies common modifiers of diverse signaling pathways. Nat Genet, 38(8):896–903, 2006.
[164] Guillaume Lesage, Anne-Marie Sdicu, Patrice Ménard, Jesse Shapiro, Shamiza Hussein, and Howard Bussey. Analysis of beta-1,3-glucan assembly in saccharomyces cerevisiae using a synthetic interaction network and altered sensitivity to caspofungin. Genetics, 167(1):35–49, 2004.
[165] A. Lin, R. T. Wang, S. Ahn, C. C. Park, and D. J. Smith. A genome-wide map of human genetic interactions inferred from radiation hybrid genotypes. Genome Res, 2010.
[166] J. Lin, D. C. Sahakian, S. M. de Morais, J. J. Xu, R. J. Polzer, and S. M. Winter. The role of absorption, distribution, metabolism, excretion and toxicity in drug discovery. Curr Top Med Chem, 3(10):1125–54, 2003.
[167] R. Linding, L. J. Jensen, G. J. Ostheimer, M. A. van Vugt, C. Jorgensen, I. M. Miron, F. Diella, K. Colwill, L. Taylor, K. Elder, P. Metalnikov, V. Nguyen, A. Pasculescu, J. Jin, J. G. Park, L. D. Samson, J. R. Woodgett, R. B. Russell, P. Bork, M. B. Yaffe, and T. Pawson.
Systematic discovery of in vivo phosphorylation networks. Cell, 129(7):1415–26, 2007.
[168] S. Loewe. The problem of synergism and antagonism of combined drugs. Arzneimittelforschung, 3:285–290, 1953.
[169] K. M. Loyet, J. T. Stults, and D. Arnott. Mass spectrometric contributions to the practice of phosphorylation site mapping through 2003: a literature review. Mol Cell Proteomics, 4(3):235–45, 2005.
[170] P. Y. Lum, C. D. Armour, S. B. Stepaniants, G. Cavet, M. K. Wolf, J. S. Butler, J. C. Hinshaw, P. Garnier, G. D. Prestwich, A. Leonardson, P. Garrett-Engele, C. M. Rush, M. Bard, G. Schimmack, J. W. Phillips, C. J. Roberts, and D. D. Shoemaker. Discovering modes of action for therapeutic compounds using a genome-wide screen of yeast heterozygotes. Cell, 116(1):121–37, 2004.
[171] R. Mani, R. P. St Onge, J. L. Hartman 4th, G. Giaever, and F. P. Roth. Defining genetic interaction. Proc Natl Acad Sci U S A, 105(9):3461–6, 2008.
[172] M. Martchenko, A. Levitin, H. Hogues, A. Nantel, and M. Whiteway. Transcriptional rewiring of fungal galactose-metabolism circuitry. Curr Biol, 17(12):1007–13, 2007.
[173] J. McCarter, B. Bartlett, T. Dang, and T. Schedl. On the control of oocyte meiotic maturation and ovulation in caenorhabditis elegans. Dev Biol, 205(1):111–28, 1999.
[174] K. L. McGary, T. J. Park, J. O. Woods, H. J. Cha, J. B. Wallingford, and E. M. Marcotte. Systematic discovery of nonobvious human disease models through orthologous phenotypes. Proc Natl Acad Sci U S A, 107(14):6544–9, 2010.
[175] M. R. Mehan, J. Nunez-Iglesias, C. Dai, M. S. Waterman, and X. J. Zhou. An integrative modular approach to systematically predict gene-phenotype associations. BMC Bioinformatics, 11 Suppl 1:S62, 2010.
[176] K. B. Mercer, R. K. Miller, T. L. Tinley, S. Sheth, H. Qadota, and G. M. Benian.
Caenorhabditis elegans unc-96 is a new component of m-lines that interacts with unc-98 and paramyosin and is required in adult muscle for assembly and/or maintenance of thick filaments. Mol Biol Cell, 17(9):3832–47, 2006.
[177] I. M. Meyer and R. Durbin. Gene structure conservation aids similarity based gene prediction. Nucleic Acids Res, 32(2):776–83, 2004.
[178] M. A. Miller, V. Q. Nguyen, M. H. Lee, M. Kosinski, T. Schedl, R. M. Caprioli, and D. Greenstein. A sperm cytoskeletal protein that signals oocyte meiotic maturation and ovulation. Science, 291(5511):2144–7, 2001.
[179] M. A. Miller, P. J. Ruest, M. Kosinski, S. K. Hanks, and D. Greenstein. An eph receptor sperm-sensing control mechanism for oocyte meiotic maturation in caenorhabditis elegans. Genes Dev, 17(2):187–200, 2003.
[180] M. L. Miller and N. Blom. Kinase-specific prediction of protein phosphorylation sites. Methods Mol Biol, 527:299–310, x, 2009.
[181] J. Miskowski, Y. Li, and J. Kimble. The sys-1 gene and sexual dimorphism during gonadogenesis in caenorhabditis elegans. Dev Biol, 230(1):61–73, 2001.
[182] DS Moore, GP McCabe, and BA Craig. Bootstrap Methods and Permutation Tests, pages 16:1–60. W.H. Freeman and Company, New York, 2009.
[183] P. A. Moore, C. A. Rosen, and K. C. Carter. Assignment of the human fkbp12-rapamycin-associated protein (frap) gene to chromosome 1p36 by fluorescence in situ hybridization. Genomics, 33(2):331–2, 1996.
[184] S. Morandell, T. Stasyk, K. Grosstessner-Hain, E. Roitinger, K. Mechtler, G. K. Bonn, and L. A. Huber. Phosphoproteomics strategies for the functional analysis of signal transduction. Proteomics, 6(14):4047–56, 2006.
[185] J. Morschhauser, K. S. Barker, T. T. Liu, J. Blaß-Warmuth, R. Homayouni, and P. D. Rogers. The transcription factor mrr1p controls expression of the mdr1 efflux pump and mediates multidrug resistance in candida albicans. PLoS Pathog, 3(11):e164, 2007.
[186] A. Mortier, J. Van Damme, and P. Proost.
Regulation of chemokine activity by posttranslational modification. Pharmacol Ther, 120(2):197–217, 2008.
[187] H. U. Mosch, R. L. Roberts, and G. R. Fink. Ras2 signals via the cdc42/ste20/mitogen-activated protein kinase module to induce filamentous growth in saccharomyces cerevisiae. Proc Natl Acad Sci U S A, 93(11):5352–6, 1996.
[188] A. M. Moses, J. K. Heriche, and R. Durbin. Clustering of phosphorylation site recognition motifs can be exploited to predict the targets of cyclin-dependent kinase. Genome Biol, 8(2):R23, 2007.
[189] H. M. Muller, E. E. Kenny, and P. W. Sternberg. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol, 2(11):e309, 2004.
[190] S. Nelander, W. Wang, B. Nilsson, Q. B. She, C. Pratilas, N. Rosen, P. Gennemark, and C. Sander. Models from experiments: combinatorial drug perturbations of cancer cells. Mol Syst Biol, 4:216, 2008.
[191] J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society Series a-General, 135(3):370–384, 1972.
[192] S. E. Newey, V. Velamoor, E. E. Govek, and L. Van Aelst. Rho gtpases, dendritic structure, and mental retardation. J Neurobiol, 64(1):58–74, 2005.
[193] A. N. Nguyen Ba and A. M. Moses. Evolution of characterized phosphorylation sites in budding yeast. Mol Biol Evol, 2010.
[194] T. G. Nick and K. M. Campbell. Logistic regression. Methods Mol Biol, 404:273–301, 2007.
[195] D. Noble. Modelling the heart: insights, failures and progress. Bioessays, 24(12):1155–63, 2002.
[196] M. E. Noble, J. A. Endicott, and L. N. Johnson. Protein kinase inhibitors: insights into drug design from structure. Science, 303(5665):1800–5, 2004.
[197] T. Obata, M. B. Yaffe, G. G. Leparc, E. T. Piro, H. Maegawa, A. Kashiwagi, R. Kikkawa, and L. C. Cantley. Peptide and protein library screening defines optimal substrate motifs for akt/pkb. J Biol Chem, 275(46):36108–15, 2000.
[198] K. P. O’Brien, M. Remm, and E. L.
Sonnhammer. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res, 33(Database issue):D476–80, 2005.
[199] G. W. Oehlert. A note on the delta method. American Stat, 46(1):27–29, 1992.
[200] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 27(1):29–34, 1999.
[201] S. E. Ong and M. Mann. Mass spectrometry-based proteomics turns quantitative. Nat Chem Biol, 1(5):252–62, 2005.
[202] C. Onyewu and J. Heitman. Unique applications of novel antifungal drug combinations. Anti-Infective Agents in Medicinal Chemistry, 6:3–15, 2007.
[203] World Health Organization. Genes and human disease: Monogenic diseases. http://www.who.int/genomics/public/geneticdiseases/en/index2.html, 2010.
[204] J. P. Overington, B. Al-Lazikani, and A. L. Hopkins. How many drug targets are there? Nat Rev Drug Discov, 5(12):993–6, 2006.
[205] A. B. Parsons, R. L. Brost, H. Ding, Z. Li, C. Zhang, B. Sheikh, G. W. Brown, P. M. Kane, T. R. Hughes, and C. Boone. Integration of chemical-genetic and genetic interaction data links bioactive compounds to cellular target pathways. Nat Biotechnol, 22(1):62–69, 2004.
[206] A. B. Parsons, A. Lopez, I. E. Givoni, D. E. Williams, C. A. Gray, J. Porter, G. Chua, R. Sopko, R. L. Brost, C. H. Ho, J. Wang, T. Ketela, C. Brenner, J. A. Brill, G. E. Fernandez, T. C. Lorenz, G. S. Payne, S. Ishihara, Y. Ohya, B. Andrews, T. R. Hughes, B. J. Frey, T. R. Graham, R. J. Andersen, and C. Boone. Exploring the mode-of-action of bioactive compounds by chemical-genetic profiling in yeast. Cell, 126(3):611–625, 2006.
[207] W. E. Payne and J. I. Garrels. Yeast protein database (ypd): a database for the complete proteome of saccharomyces cerevisiae. Nucleic Acids Res, 25(1):57–62, 1997.
[208] P. Pellicena and J. Kuriyan. Protein-protein interactions in the allosteric regulation of protein kinases. Curr Opin Struct Biol, 16(6):702–9, 2006.
[209] SGD project.
Saccharomyces genome database, April 2008.
[210] D. Pruyne and A. Bretscher. Polarization of cell growth in yeast. J Cell Sci, 113(Pt 4):571–85, 2000.
[211] J. Ptacek, G. Devgan, G. Michaud, H. Zhu, X. Zhu, J. Fasolo, H. Guo, G. Jona, A. Breitkreutz, R. Sopko, R. R. McCartney, M. C. Schmidt, N. Rachidi, S. J. Lee, A. S. Mah, L. Meng, M. J. Stark, D. F. Stern, C. De Virgilio, M. Tyers, B. Andrews, M. Gerstein, B. Schweitzer, P. F. Predki, and M. Snyder. Global analysis of protein phosphorylation in yeast. Nature, 438(7068):679–84, 2005.
[212] Mark Ptashne. A genetic switch: phage [lambda] and higher organisms. Cell Press: Blackwell Scientific Publications, Cambridge, Mass., 2nd edition, 1992.
[213] J. R. Quinlan. C4.5: programs for machine learning. The Morgan Kaufmann series in machine learning. Morgan Kaufmann Publishers, San Mateo, Calif., 1993.
[214] D. C. Raitt, F. Posas, and H. Saito. Yeast cdc42 gtpase and ste20 pak-like kinase regulate sho1-dependent activation of the hog1 mapk pathway. Embo J, 19(17):4623–31, 2000.
[215] S. W. Ramer and R. W. Davis. A dominant truncation allele identifies a gene, ste20, that encodes a putative protein kinase necessary for mating in saccharomyces cerevisiae. Proc Natl Acad Sci U S A, 90(2):452–6, 1993.
[216] A. Remenyi, M. C. Good, R. P. Bhattacharyya, and W. A. Lim. The role of docking interactions in mediating signaling input, output, and discrimination in the yeast mapk network. Mol Cell, 20(6):951–62, 2005.
[217] U. E. Rennefahrt, S. W. Deacon, S. A. Parker, K. Devarajan, A. Beeser, J. Chernoff, S. Knapp, B. E. Turk, and J. R. Peterson. Specificity profiling of pak kinases allows identification of novel phosphorylation sites. J Biol Chem, 282(21):15667–78, 2007.
[218] D. L. Riddle, T. Blumenthal, B. J. Meyer, and J. R. Priess, editors. C. elegans II. Cold Spring Harbor Laboratory Press, Plainview, NY, 1997.
[219] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Seraphin.
A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol, 17(10):1030–2, 1999. [220] I. Rigoutsos and A. Floratos. Combinatorial pattern discovery in biological sequences: The teiresias algorithm. Bioinformatics, 14(1):55–67, 1998. [221] Mark David Rose, Fred Marshall Winston, Philip Hieter, Fred Sherman, Ger- ald R. Fink, Jim Hicks, and Cold Spring Harbor Laboratory. Methods in yeast genetics : a laboratory course manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, 1990. [222] J. F. Rual, J. Ceron, J. Koreth, T. Hao, A. S. Nicot, T. Hirozane-Kishikawa, J. Vandenhaute, S. H. Orkin, D. E. Hill, S. van den Heuvel, and M. Vidal. Toward improving caenorhabditis elegans phenome mapping with an orfeome- based rnai library. Genome Res, 14(10B):2162–8, 2004. [223] D. Ruths, M. Muller, J. T. Tseng, L. Nakhleh, and P. T. Ram. The signaling petri net-based simulator: a non-parametric strategy for characterizing the dy- namics of cell-specific signaling networks. PLoS Comput Biol, 4(2):e1000005, 2008. [224] J. Ryu, L. Liu, T. P. Wong, D. C. Wu, A. Burette, R. Weinberg, Y. T. Wang, and M. Sheng. A critical role for myosin iib in dendritic spine morphology and synaptic function. Neuron, 49(2):175–82, 2006. [225] F. Sams-Dodd. Target-based drug discovery: is something wrong? Drug Discov Today, 10(2):139–47, 2005. 221 [226] T. Sato, T. Honma, and S. Yokoyama. Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening. J Chem Inf Model, 50(1):170–85, 2010. [227] N. F. Saunders, R. I. Brinkworth, T. Huber, B. E. Kemp, and B. Kobe. Predikin and predikindb: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites. BMC Bioinformatics, 9:245, 2008. [228] J. L. Schafer. Multiple imputation: a primer. Statistical Methods in Medical Research, 8(1):3–15, 1999. [229] M. Schena, D. Shalon, R. W. Davis, and P. O. 
Brown. Quantitative monitoring of gene expression patterns with a complementary dna microarray. Science, 270(5235):467–70, 1995. [230] P. Schmidtke and X. Barril. Understanding and predicting druggability. a high-throughput method for detection of drug binding sites. J Med Chem, 53(15):5858–67, 2010. [231] V. Schubert and C. G. Dotti. Transmitting on actin: synaptic control of den- dritic architecture. J Cell Sci, 120(Pt 2):205–12, 2007. [232] Maya Schuldiner, Sean R. Collins, Natalie J. Thompson, Vladimir Denic, Arunashree Bhamidipati, Thanuja Punna, Jan Ihmels, Brenda Andrews, Charles Boone, Jack F. Greenblatt, Jonathan S. Weissman, and Nevan J. Kro- gan. Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell, 123(3):507–519, 2005. [233] D. Schwartz and S. P. Gygi. An iterative statistical approach to the iden- tification of protein phosphorylation motifs from large-scale data sets. Nat Biotechnol, 23(11):1391–8, 2005. [234] P. Senawongse, A. R. Dalby, and Z. R. Yang. Predicting the phosphorylation sites using hidden markov models and machine learning methods. J Chem Inf Model, 45(4):1147–52, 2005. [235] K. Shah and K. M. Shokat. A chemical genetic approach for the identification of direct substrates of protein kinases. Methods Mol Biol, 233:253–71, 2003. [236] F. Sherman. Getting started with yeast. Methods Enzymol, 350:3–41, 2002. [237] Y. Shi and I. M. Ethell. Integrins control dendritic spine plasticity in hippocam- pal neurons through nmda receptor and ca2+/calmodulin-dependent protein kinase ii-mediated actin reorganization. J Neurosci, 26(6):1813–22, 2006. [238] G. A. Silverman, C. J. Luke, S. R. Bhatia, O. S. Long, A. C. Vetica, D. H. Perlmutter, and S. C. Pak. Modeling molecular and cellular aspects of human 222 disease using the nematode caenorhabditis elegans. Pediatr Res, 65(1):10–8, 2009. [239] D. B. Smith and K. S. Johnson. 
Single-step purification of polypeptides ex- pressed in escherichia coli as fusions with glutathione s-transferase. Gene, 67(1):31–40, 1988. [240] Z. Songyang, S. Blechner, N. Hoagland, M. F. Hoekstra, H. Piwnica-Worms, and L. C. Cantley. Use of an oriented peptide library to determine the optimal substrates of protein kinases. Curr Biol, 4(11):973–82, 1994. [241] R. Sopko and B. J. Andrews. Linking the kinome and phosphorylome–a compre- hensive review of approaches to find kinase targets. Mol Biosyst, 4(9):920–33, 2008. [242] A. Sorkin and M. Von Zastrow. Signal transduction and endocytosis: close encounters of many kinds. Nat Rev Mol Cell Biol, 3(8):600–14, 2002. [243] U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, H. Goehler, M. Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, J. Timm, S. Mintzlaff, C. Abraham, N. Bock, S. Kietzmann, A. Goedde, E. Toksoz, A. Droege, S. Kro- bitsch, B. Korn, W. Birchmeier, H. Lehrach, and E. E. Wanker. A human protein-protein interaction network: a resource for annotating the proteome. Cell, 122(6):957–68, 2005. [244] K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. Serna Molina, I. Shames, Y. Malitskaya, J. Vogel, H. Bussey, and S. W. Michnick. An in vivo map of the yeast protein interactome. Science, 320(5882):1465–70, 2008. [245] S. F. Tavazoie, V. A. Alvarez, D. A. Ridenour, D. J. Kwiatkowski, and B. L. Sabatini. Regulation of neuronal morphology and function by the tumor sup- pressors tsc1 and tsc2. Nat Neurosci, 8(12):1727–34, 2005. [246] R Development Core Team. R: A language and environment for statistical computing, 2007. [247] L. Timmons, D. L. Court, and A. Fire. Ingestion of bacterially expressed dsrnas can produce specific and potent genetic interference in caenorhabditis elegans. Gene, 263(1-2):103–12, 2001. [248] L. Timmons and A. Fire. Specific interference by ingested dsrna. Nature, 395(6705):854, 1998. [249] S. M. Tokuoka, A. Saiardi, and S. J. Nurrish. 
The mood stabilizer valproate inhibits both inositol- and diacylglycerol-signaling pathways in caenorhabditis elegans. Mol Biol Cell, 19(5):2241–50, 2008. [250] A. H. Tong, M. Evangelista, A. B. Parsons, H. Xu, G. D. Bader, N. Page, M. Robinson, S. Raghibizadeh, C. W. Hogue, H. Bussey, B. Andrews, M. Tyers, 223 and C. Boone. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science, 294(5550):2364–8, 2001. [251] A. H. Tong, G. Lesage, G. D. Bader, H. Ding, H. Xu, X. Xin, J. Young, G. F. Berriz, R. L. Brost, M. Chang, Y. Chen, X. Cheng, G. Chua, H. Friesen, D. S. Goldberg, J. Haynes, C. Humphries, G. He, S. Hussein, L. Ke, N. Krogan, Z. Li, J. N. Levinson, H. Lu, P. Menard, C. Munyana, A. B. Parsons, O. Ryan, R. Tonikian, T. Roberts, A. M. Sdicu, J. Shapiro, B. Sheikh, B. Suter, S. L. Wong, L. V. Zhang, H. Zhu, C. G. Burd, S. Munro, C. Sander, J. Rine, J. Green- blatt, M. Peter, A. Bretscher, G. Bell, F. P. Roth, G. W. Brown, B. Andrews, H. Bussey, and C. Boone. Global mapping of the yeast genetic interaction network. Science, 303(5659):808–13, 2004. [252] B. B. Tuch, D. J. Galgoczy, A. D. Hernday, H. Li, and A. D. Johnson. The evolution of combinatorial gene regulation in fungi. PLoS Biol, 6(2):e38, 2008. [253] J. A. Ubersax, E. L. Woodbury, P. N. Quang, M. Paraz, J. D. Blethrow, K. Shah, K. M. Shokat, and D. O. Morgan. Targets of the cyclin-dependent kinase cdk1. Nature, 425(6960):859–64, 2003. [254] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. John- ston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in saccharomyces cerevisiae. Nature, 403(6770):623–7, 2000. [255] A. K. Vaishnaw, J. Gollob, C. Gamba-Vitalo, R. Hutabarat, D. Sah, R. Meyers, T. de Fougerolles, and J. Maraganore. A status report on rnai therapeutics. 
Silence, 1(1):14, 2010. [256] M. van der Voet, C. W. Berends, A. Perreault, T. Nguyen-Ngoc, P. Gonczy, M. Vidal, M. Boxem, and S. van den Heuvel. Numa-related lin-5, aspm-1, calmodulin and dynein promote meiotic spindle rotation independently of cor- tical lin-5/gpr/galpha. Nat Cell Biol, 11(3):269–77, 2009. [257] F. van Voorst, J. Houghton-Larsen, L. Jnson, M.C. Kielland-Brandt, and A. Brandt. Genome-wide identification of genes required for growth of sac- charomyces cerevisiae under ethanol stress. Yeast, 23:351–9, 2006. [258] Vladimir Naumovich Vapnik. The nature of statistical learning theory. Springer, New York, 1995. [259] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Ama- natides, R. M. Ballew, D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman, 224 M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Flo- rea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Bran- don, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng, V. Di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Wang, A. Wang, X. Wang, J. Wang, M. Wei, R. Wides, C. Xiao, C. Yan, et al. The sequence of the human genome. Science, 291(5507):1304–51, 2001. [260] M. Vergeer and E. S. Stroes. 
The pharmacology and off-target effects of some cholesterol ester transfer protein inhibitors. Am J Cardiol, 104(10 Suppl):32E– 8E, 2009. [261] M. Versele and J. Thorner. Septin collar formation in budding yeast requires gtp binding and direct phosphorylation by the pak, cla4. J Cell Biol, 164(5):701–15, 2004. [262] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, 2002. [263] P. Wang and J. Heitman. The cyclophilins. Genome Biol, 6(7):226, 2005. [264] T.C. White, K.A. Marr, and R.A. Bowden. Clinical, cellular, and molecu- lar factors that contribute to antifungal drug resistance. Clin Microbiol Rev, 11(2):382–402, 1998. [265] M. C. Whitlock. Combining probability from independent tests: the weighted z-method is superior to fisher’s approach. J Evol Biol, 18(5):1368–73, 2005. [266] L. C. Wienkers and T. G. Heath. Predicting in vivo drug interactions from in vitro drug discovery data. Nat Rev Drug Discov, 4(10):825–33, 2005. [267] D. J. Wilkinson. Bayesian methods in bioinformatics and computational sys- tems biology. Brief Bioinform, 8(2):109–16, 2007. [268] E. A. Winzeler, D. D. Shoemaker, A. Astromoff, H. Liang, K. Anderson, B. An- dre, R. Bangham, R. Benito, J. D. Boeke, H. Bussey, A. M. Chu, C. Con- nelly, K. Davis, F. Dietrich, S. W. Dow, M. El Bakkoury, F. Foury, S. H. Friend, E. Gentalen, G. Giaever, J. H. Hegemann, T. Jones, M. Laub, H. Liao, N. Liebundguth, D. J. Lockhart, A. Lucau-Danila, M. Lussier, N. M’Rabet, P. Menard, M. Mittmann, C. Pai, C. Rebischung, J. L. Revuelta, L. Riles, C. J. Roberts, P. Ross-MacDonald, B. Scherens, M. Snyder, S. Sookhai-Mahadeo, R. K. Storms, S. Veronneau, M. Voet, G. Volckaert, T. R. Ward, R. Wysocki, G. S. Yen, K. Yu, K. Zimmermann, P. Philippsen, M. Johnston, and R. W. 225 Davis. Functional characterization of the s. cerevisiae genome by gene deletion and parallel analysis. 
Science, 285(5429):901–906, 1999. [269] A. Wissmann, J. Ingles, and P. E. Mains. The caenorhabditis elegans mel-11 myosin phosphatase regulatory subunit affects tissue contraction in the somatic gonad and the embryonic epidermis and genetically interacts with the rac sig- naling pathway. Dev Biol, 209(1):111–27, 1999. [270] Y. H. Wong, T. Y. Lee, H. K. Liang, C. M. Huang, T. Y. Wang, Y. H. Yang, C. H. Chu, H. D. Huang, M. T. Ko, and J. K. Hwang. Kinasephos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on se- quences and coupling patterns. Nucleic Acids Res, 35(Web Server issue):W588– 94, 2007. [271] C. Wu, E. Leberer, D. Y. Thomas, and M. Whiteway. Functional charac- terization of the interaction of ste50p with ste11p mapkkk in saccharomyces cerevisiae. Mol Biol Cell, 10(7):2425–40, 1999. [272] C. Wu, V. Lytvyn, D. Y. Thomas, and E. Leberer. The phosphorylation site for ste20p-like protein kinases is essential for the function of myosin-i in yeast. J Biol Chem, 272(49):30623–6, 1997. [273] C. Wu, M. Whiteway, D. Y. Thomas, and E. Leberer. Molecular characteri- zation of ste20p, a potential mitogen-activated protein or extracellular signal- regulated kinase kinase (mek) kinase kinase from saccharomyces cerevisiae. J Biol Chem, 270(27):15984–92, 1995. [274] X. Xu, D. Lee, H. Y. Shih, S. Seo, J. Ahn, and M. Lee. Linking integrin to ip(3) signaling is important for ovulation in caenorhabditis elegans. FEBS Lett, 579(2):549–53, 2005. [275] M. B. Yaffe, G. G. Leparc, J. Lai, T. Obata, S. Volinia, and L. C. Cantley. A motif-based profile scanning approach for genome-wide prediction of signaling pathways. Nat Biotechnol, 19(4):348–53, 2001. [276] R. F. Yeh, L. P. Lim, and C. B. Burge. Computational inference of homologous gene structures in the human genome. Genome Res, 11(5):803–16, 2001. [277] Y. Zan, J. D. Haag, K. S. Chen, L. A. Shepel, D. Wigington, Y. R. Wang, R. Hu, C. C. Lopez-Guajardo, H. L. Brose, K. I. Porter, R. A. 
Leonard, A. A. Hitt, S. L. Schommer, A. F. Elegbede, and M. N. Gould. Production of knockout rats using enu mutagenesis and a yeast-based screening assay. Nat Biotechnol, 21(6):645–51, 2003. [278] H. Zhang, D. J. Webb, H. Asmussen, S. Niu, and A. F. Horwitz. A git1/pix/rac/pak signaling module regulates spine morphogenesis and synapse formation through mlc. J Neurosci, 25(13):3379–88, 2005. 226 [279] L. Zhang, K. Yan, Y. Zhang, R. Huang, J. Bian, C. Zheng, H. Sun, Z. Chen, N. Sun, R. An, F. Min, W. Zhao, Y. Zhuo, J. You, Y. Song, Z. Yu, Z. Liu, K. Yang, H. Gao, H. Dai, X. Zhang, J. Wang, C. Fu, G. Pei, J. Liu, S. Zhang, M. Goodfellow, Y. Jiang, J. Kuai, G. Zhou, and X. Chen. High-throughput synergy screening identifies microbial metabolites as combination agents for the treatment of fungal infections. Proc Natl Acad Sci U S A, 104(11):4606–11, 2007. [280] W. Zhong and P. W. Sternberg. Genome-wide prediction of c. elegans genetic interactions. Science, 311(5766):1481–4, 2006. [281] F. Zhu, B. Han, P. Kumar, X. Liu, X. Ma, X. Wei, L. Huang, Y. Guo, L. Han, C. Zheng, and Y. Chen. Update of ttd: Therapeutic target database. Nucleic Acids Res, 38(Database issue):D787–91, 2010. [282] G. Zhu, Y. Liu, and S. Shaw. Protein kinase specificity. a strategic collabora- tion between kinase peptide specificity and substrate recruitment. Cell Cycle, 4(1):52–6, 2005. [283] H. Zhu, M. Bilgin, R. Bangham, D. Hall, A. Casamayor, P. Bertone, N. Lan, R. Jansen, S. Bidlingmaier, T. Houfek, T. Mitchell, P. Miller, R. A. Dean, M. Gerstein, and M. Snyder. Global analysis of protein activities using proteome chips. Science, 293(5537):2101–5, 2001. [284] G. R. Zimmermann, J. Lehar, and C. T. Keith. Multi-target therapeutics: when the whole is greater than the sum of the parts. Drug Discov Today, 12(1-2):34– 42, 2007. [285] S. Zolnierowicz and M. Bollen. Protein phosphorylation and protein phos- phatases. de panne, belgium, september 19-24, 1999. Embo J, 19(4):483–8, 2000.