Functional Analysis of Short Linear Motifs in Intrinsically Disordered Regions

by

Mitchell Li Cheong Man

A thesis submitted in conformity with the requirements for the degree of Master of Science

Department of Cell and Systems Biology
University of Toronto

© Copyright by Mitchell Li Cheong Man 2017


Abstract

Short linear motifs (SLiMs) are regulatory binding sites that are involved in signalling and protein regulation. SLiMs are often found in intrinsically disordered regions (IDRs), which are rapidly evolving and lack stable tertiary conformations. Despite their prevalence throughout the proteome, many SLiMs remain unknown, and understanding how they cooperate and how they evolved is an ongoing endeavour. The goal of this thesis is to address the properties of SLiMs and how they evolved, using bioinformatics methods and evolutionary models. To further examine the role of SLiMs in signal transduction, I show that deletion of predicted SLiMs (pSLiMs) has a broad range of quantitative effects on signalling pathway output. Next, to explore what properties are important in substrate recognition, I show that the combination and order of motifs can predict target specificity. Lastly, using a comparative phylogenetic approach to investigate the evolution of motifs, I provide evidence that phosphorylation and docking sites coevolved.

Acknowledgements

I would first and foremost like to thank Alan for giving me the opportunity to join his lab. Thank you for pushing me to pursue bioinformatics and for always supporting me, whether with advice on my project or otherwise. You have helped me to gain a much deeper understanding of and appreciation for science. I would also like to thank my committee supervisors for their support and guidance throughout my project. A great thank you to Belinda Chang for her advice to pursue covariation of motifs, and to Julie Forman-Kay for her expertise in the field of intrinsically disordered regions. Additionally, I would like to thank Nick Provart for agreeing to evaluate this thesis.

Thank you to the Moses lab, both past and present (Alex L, Alex N, Bob, Caressa, Gavin, Ian, Liz, Muluye, Nirvana, Purnima, Selma and Taraneh). Your shared knowledge, acumen, thoughtfulness, sincerity and understanding for science and for your peers were great to be a part of, and something I hope to emulate in my future endeavours. A special thanks to Bob for allowing me to join you on the pSLiM journey.

Finally, a big thanks to my family and friends for their continued support outside the lab. Adrian, thanks for your constant and unwavering counsel. Haeri, thank you for showing me the way forward, motivating me, and being my scientific sounding board when I needed it most. You inspired me to pursue this thesis, and then helped me every step of the way to achieve it.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Organization of thesis
1. Introduction
1.1 Intrinsically disordered regions
1.2 Short linear motifs
1.2.1 Post-translational modifications
1.2.2 Docking motifs
1.2.3 Degradation motifs
1.2.4 Localization signals
1.3 Computational analysis of SLiMs
1.4 Cooperativity of multiple SLiMs
1.5 SLiM evolution
1.6 Research Objectives
2. Results
2.1 Quantifying the effect of predicted SLiM deletions
2.1.1 Effect of mutations in known SLiMs on HOG signaling
2.1.2 Effect of mutations in pSLiMs on HOG signaling
2.2 Testing how recognition of SLiM combinations generalize across the proteome for target specificity
2.2.1 Enrichment analysis of SLiM combinations to discriminate Clb5 specific targets
2.2.2 Evolutionary analysis of SLiM combinations to discriminate Clb5 specific targets
2.2.3 Enrichment analysis of SLiM combinations to discriminate Clb5 specific targets using a Hidden Markov Model framework
2.3 Detecting coevolution of SLiMs
2.3.1 Coevolution of phosphorylation and docking sites in Cbk1 targets
3. Discussion
3.1 Quantifying the effect of pSLiM deletions
3.2 Recognition of SLiM combinations for target specificity
3.3 Detecting coevolution of SLiMs
4. Future Directions
4.1 Exploring pSLiM deletions in other signalling pathways
4.2 Confirming Clb5 specific target predictions
4.3 Extending the coevolution analysis to another model
5. Conclusions
6. Materials and Methods
6.1 pSLiM deletion analysis
6.1.1 Yeast strains and mutagenesis
6.1.2 Pathway induction and flow cytometry assay
6.1.3 Histogram data analysis
6.1.4 Calculating fraction of “on” cells and change in mean GFP intensity of “on” cells
6.1.5 Statistical analysis of mutant strains
6.2 Clb5-Cdk1 motif combination analysis
6.2.1 Defining Cdk1 (Clb5 specific and non-Clb5 specific) and non-Cdk1 protein datasets
6.2.2 Identifying occurrences of matches to combinations of Cks1-Cdk1-Clb5 motifs
6.2.3 Ortholog assignment and multiple sequence alignments of related yeast species
6.2.4 Constructing an HMM model of the Cks1-Cdk1-Clb5 motif ordering
6.2.5 Parameter estimation of emission probabilities
6.2.6 Comparison of substrates that match rules in each subset for enrichment analysis
6.3 Cbk1 phosphorylation and docking site correlated evolution analysis
6.3.1 Defining Cbk1 substrate dataset
6.3.2 Ortholog assignment and multiple sequence alignments of related yeast species for the 10 known/likely substrates
6.3.3 Constructing Phylogenetic trees
6.3.4 Data filtering of ortholog assignment and manually defining rapidly evolving regions
6.3.5 Identifying Cbk1 phosphorylation and docking sites
6.3.6 Correlated evolution of phosphorylation and docking sites
6.3.7 Likelihood ratio test (LRT) used to assess the fit of the dependent model to the independent model
6.3.8 Simulation of Ace2 protein evolution
7. References
8. Appendices

List of Tables

Table 1: Summary of targets in Clb5 specific dataset.

Appendix Table 1. Predicted short linear motifs (pSLiMs) from PhyloHMM in the Hypotonic stress pathway of S. cerevisiae.

Appendix Table 2. Predicted short linear motifs (pSLiMs) from SLiMPrints (Davey et al. 2012b) in the canonical Human Wnt signaling pathway.

Appendix Table 3. T-tests comparing the mean fraction of "on" cells and the mean change in mean GFP intensity of "on" cells for each predicted SLiM to wild-type.

Appendix Table 4. List of species used for analysis of each Cbk1 target (predicted and known from Gógl et al. 2015).

List of Figures

Figure 1: Schematic diagram of the Sho1-dependent branch of the yeast high osmolarity glycerol (HOG) pathway as a model system for analysis of short linear motif function in signal transduction.

Figure 2: Quantifying the effect of SLiM deletions on the HOG signalling pathway output.

Figure 3. Novel SLiM deletions show significant effect on HOG signaling pathway output.

Figure 4. Novel SLiM deletions show small but significant effects on HOG signaling pathway output.

Figure 5. Schematic of the Clb5-Cdk1-Cks1 complex as a model of SLiM combinations for cyclin specific target recognition.

Figure 6. Clb5 specific targets compared to non-Clb5 specific targets show significant difference in enrichment of the Clb5 specific proposed combination and order of motifs.

Figure 7. Clb5 specific targets compared to non-Clb5 specific targets show no significant difference in conservation of the Clb5 specific proposed combination and order of motifs among orthologs.

Figure 8. Sequence logo representing the Clb5 specific sequence HMM profile.

Figure 9. Clb5 specific targets compared to non-Clb5 specific targets show no significant difference in enrichment of the Clb5 specific sequence HMM profile.

Figure 10. Analysis of Ace2 shows evidence of correlated evolution for Cbk1 phosphorylation sites and docking motifs.

Figure 11. Analysis of Ssd1 shows no evidence of correlated evolution for Cbk1 phosphorylation sites and docking motifs.

Figure 12. Two of the ten known or predicted Cbk1 substrates show evidence of correlated evolution for Cbk1 phosphorylation sites and docking motifs.

Appendix Figure 1. Discrepancy in disorder predictions for false-negatives in Clb5 specific targets, Fin1, Spc110, and Cnn1.

Appendix Figure 2. Comparison of Clb5 specific targets that lack the Clb5 sequence pattern (3) to the rest of the Clb5 specific targets (16).

Appendix Figure 3. Loss of correlated evolution signal due to errors in disorder predictors.

Appendix Figure 4. Simulation of Ace2 ortholog sequence evolution shows no evidence for correlated evolution.

Appendix Figure 5. Cbk1 targets contain non-binary occurrences of phosphorylation and docking sites.

Appendix Figure 6. Detecting correlated evolution in Cbk1 substrates before filtering poor ortholog assignments results in loss of signal in Dsf2.

List of Abbreviations

APC  Anaphase Promoting Complex
CASP  Critical Assessment of Structure Prediction
CD  Circular Dichroism
Cdk1  Cyclin-Dependent Kinase 1
ChIP  Chromatin immunoprecipitation
DNA  Deoxyribonucleic Acid
GFP  Green Fluorescent Protein
GST  Glutathione S-transferase
HMM  Hidden Markov Model
HOG  High Osmolarity Glycerol
IDR  Intrinsically Disordered Region
KID  Kinase Interaction Database
LRT  Likelihood Ratio Test
MAFFT  Multiple Alignment using Fast Fourier Transform
MAPK  Mitogen Activated Protein Kinase
MAPKK  Mitogen Activated Protein Kinase Kinase
MEN  Mitotic Exit Network
ML  Maximum Likelihood
MSA  Multiple Sequence Alignment
NLS  Nuclear Localization Signal
NMR  Nuclear Magnetic Resonance
ORF  Open Reading Frame
PAML  Phylogenetic Analysis by Maximum Likelihood
PCA  Protein fragment Complementation Assay
PONDR  Predictor of Natural Disordered Regions
PrDOS  Protein Disorder Prediction Server
pSLiM  predicted Short Linear Motif
PTM  Post-Translational Modification
RNA  Ribonucleic Acid
SAXS  Small Angle X-ray Scattering
SGD  Saccharomyces Genome Database
SLiM  Short Linear Motif
TFBS  Transcription Factor Binding Site
YGOB  Yeast Gene Order Browser

Organization of Thesis

The work presented in this thesis is split into three smaller projects, as opposed to one main project.

The first project, quantifying the effect of pSLiM deletions on HOG signalling, is part of a manuscript in preparation with Bob Strome (first author), Ian Hsu, Taraneh Zarin, Alex Nguyen Ba, and Alan Moses. My contribution to the project was quantifying the effects of each pSLiM deletion. Bob Strome performed all wet-lab experiments discussed in this thesis.

The list of potential Clb5 targets from the second project, exploring how combinations of SLiMs can predict target specificity, will be sent to Mart Loog to test and validate the predictions.

The data from the third project, coevolution of motifs, will be submitted for publication shortly.

1. Introduction

Proteins are major functional components of the cell. They participate in many interactions with multiple partners in a temporal and spatial dependent manner, forming complex networks. Complex protein interaction networks can be broken down temporally into sequential individual protein-protein interactions and subsequently the region(s) that mediate these interactions. Understanding which regions of protein are essential in interactions has been a long-standing interest in biology.

It has been previously established that protein-protein interactions are mediated by globular domains; these are large regions of protein, typically 50 to 200 amino acids in length, that form stable tertiary structures (Dyson and Wright 2005). Globular domains are highly conserved throughout evolution and are involved in high affinity interactions due to their large interaction surfaces (Van Roey et al. 2014). Although globular domains account for many of the interactions we observe, they alone cannot explain the complexity of protein networks. In fact, binding partners for two-thirds of all globular domain families still remain unknown (Stein et al. 2011). The majority of protein-protein interactions have been attributed to globular domains because the methods previously used to study protein structure and function selected for stable structured proteins. These methods relied on detecting a protein activity of interest, followed by purifying the corresponding protein by fractionation, gel filtration or chromatography (Dyson and Wright 2005).

1.1 Intrinsically disordered regions

More recently, genetic and biochemical methods have been used to study protein structure and function, where a function of interest is mapped to the corresponding gene, followed by expression, purification and structural determination of the protein product (Dyson and Wright 2005). This has led to the study of intrinsically disordered regions (IDRs). These are regions that do not adopt stable tertiary conformations but instead fluctuate among an ensemble of conformations (Bhowmick et al. 2016). The application of structural biology methods such as circular dichroism (CD), nuclear magnetic resonance (NMR), and more recently small angle X-ray scattering (SAXS) has led to experimental validation of IDRs (Habchi et al. 2014; Kikhney and Svergun 2015). There are database resources of proteins with experimentally validated IDRs, such as DISPROT (Vucetic et al. 2005).

Analysis of experimentally verified IDRs at the sequence level has led to the identification of characteristic signatures and patterns of IDRs. IDRs show an enrichment of low complexity regions (repeat regions), and a composition bias with a low content of bulky hydrophobic amino acids (such as V, L, I, M, F, W, Y) and a high content of polar and/or charged amino acids (Q, S, P, E, K, and sometimes G and A) (Romero et al. 2001; Vucetic et al. 2003). Using these sequence signatures, there have been a number of computational efforts to predict IDRs from amino acid sequences, such as PONDR and DISOPRED (Romero et al. 1997; Ward et al. 2004). From these predictions, it has been suggested that IDRs arise frequently, increasing in occurrence with organism complexity (Ward et al. 2004). Additionally, IDRs have been implicated in protein regulation (Van Roey et al. 2014). Given that organism complexity cannot be explained by genome or proteome size, it has been proposed that the observed complexity is likely explained at the regulatory level (King and Wilson 1975). Taken together, this suggests IDRs may play a key role in determining organismal complexity through their role in protein regulation.
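The composition bias described above can be illustrated with a minimal Python sketch. The two residue sets are taken from the lists in the text; the function name and the example sequence are hypothetical and chosen only for illustration. This is not how PONDR or DISOPRED work, since those are trained predictors; it only shows the kind of signal they build on.

```python
# Residue groups taken from the text above. G and A are described there as only
# "sometimes" enriched in IDRs, so they are left out of this sketch.
HYDROPHOBIC = set("VLIMFWY")
POLAR_CHARGED = set("QSPEK")


def composition_bias(sequence):
    """Return (bulky hydrophobic fraction, polar/charged fraction) of a sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    hydrophobic = sum(residue in HYDROPHOBIC for residue in seq) / len(seq)
    polar = sum(residue in POLAR_CHARGED for residue in seq) / len(seq)
    return hydrophobic, polar


# Hypothetical low-complexity stretch, not a real protein region.
hydrophobic, polar = composition_bias("SPSQEKSPSKQEEPSL")
print(f"hydrophobic={hydrophobic:.2f} polar/charged={polar:.2f}")
```

A region with a low first value and a high second value would match the disorder-like signature described in the text, although a real predictor would weigh many more features.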

1.2 Short linear motifs

An increasing proportion of known protein-protein interactions are mediated by IDRs (Tompa et al. 2015; Wright and Dyson 2015). Comparative proteomics studies have used evolutionary conservation to identify short functional elements in these IDRs (Neduva and Russell 2005), and suggest that many remain undiscovered (Nguyen Ba et al. 2012). The short conserved elements identified are often short linear motifs (SLiMs), which are ubiquitous and functionally diverse elements (3-10 amino acids) that are often (~80%) located within IDRs (Van Roey et al. 2014). Unlike that of large, well-defined globular domains, the functional significance of these motifs is inherently difficult to determine due to their small size and relatively low binding affinities (Davey et al. 2012a). However, one of the best-understood functions of SLiMs is to mediate signaling interactions, both through post-translational modifications and peptide-domain interactions (Tompa et al. 2015; Wright and Dyson 2015; Dunker et al. 2001; Gsponer and Babu 2009; Xie et al. 2007). SLiMs are involved in regulating protein signalling, complex formation, stability and localization by acting as sites for post-translational modifications (PTMs), or binding sites for docking, degradation, and localization (Van Roey et al. 2014). Given that IDRs comprise approximately one-third of the human proteome (Dunker et al. 2008; Pentony and Jones 2010), this suggests that these interaction sites, frequently found in rapidly evolving regions, are a common way for cells to regulate proteins and signalling (Ward et al. 2004; Tompa et al. 2014; Davey et al. 2015).

1.2.1 Post-translational modification sites

Post-translational modifications are covalent additions of chemical groups to specific amino acids of a protein after the protein has been translated (Walsh et al. 2005). These modifications are important for regulating protein function depending on both the type of modification and the specific substrate being modified. One of the most well-studied PTMs is the addition of phosphate, usually to either Serine, Threonine, or Tyrosine residues (Walsh et al. 2005).

Often the modified residue is embedded within a SLiM pattern that allows for greater specificity and recognition by kinases, the modifying enzymes that catalyze the addition of the phosphate group (Van Roey et al. 2014). For example, cyclin-dependent kinase 1 (Cdk1), a master cell-cycle regulator, recognizes the minimal consensus sequence of a Serine followed by a Proline (S-P), where Serine is the residue being modified (Enserink and Kolodner 2010).

Phosphorylation sites act as switches where the addition of a phosphate group to a protein substrate by a kinase facilitates subsequent substrate recognition and interaction with other protein partners. The phosphorylation is a reversible reaction, where the removal of the phosphate group is catalyzed by phosphatases, another group of modifying enzymes (Ubersax and Ferrell 2007). This dynamic switching between phosphorylated and dephosphorylated states allows the cell to regulate when a protein is on or active, and when it is off or inactive.

1.2.2 Docking motifs

Docking motifs are sites that recruit enzymes and thereby increase substrate specificity. In the case of modifying enzymes, the docking motif helps recruit the modifying enzyme to its substrate, where it can then recognize and modify the SLiM containing the PTM site (Van Roey et al. 2014). For example, the Cy docking motif is a pattern characterized by an Arginine or Lysine, followed by any amino acid and then a Leucine ([RK]-X-L). It is located on Cdk1 substrates and interacts with the hydrophobic patch of the Clb5 cyclin subunit of the Cdk1 complex (Adams et al. 1996). This interaction is thought to help stabilize the substrate interaction with the Cdk1 complex, increasing phosphorylation efficiency as well as cyclin specificity, given that Cdk1 binds a number of different cyclins depending on the stage of the cell cycle (Reményi et al. 2006; Ubersax and Ferrell 2007; Van Roey et al. 2014).

1.2.3 Degradation motifs

Degradation motifs are important regulatory elements that are recognized in a temporal and spatial dependent manner by the proteasome machinery, which leads to targeted poly-ubiquitination and subsequent degradation of the target protein (Dyson and Wright 2005; Van Roey et al. 2014). For example, D-boxes are short degradation sites defined by an Arginine, followed by any two amino acids and a Leucine (R-X-X-L), that are found in S-phase active proteins. They are recognized by the Cdc20 bound Anaphase Promoting Complex (APC), a ubiquitin ligase, leading to degradation and exit out of S-phase, allowing cells to enter the G2-phase of the cell cycle (Glotzer et al. 1991; Morgan 2013). Similarly, KEN boxes are short degradation sites, in this case defined by a Lysine-Glutamic acid-Asparagine (K-E-N) pattern, that occur in late mitotic proteins and are recognized by the Cdh1 bound APC, which leads to their targeted degradation and subsequent exit from mitosis (Pfleger and Kirschner 2000; Morgan 2013).

1.2.4 Localization motifs

Localization motifs are sequences recognized by cellular transport machinery that aid in the translocation of the localization motif-containing protein between different sub-cellular compartments (Van Roey et al. 2014). One of the most well-studied examples is nuclear localization signals (NLS), sequence patterns enriched in basic amino acid residues that are recognized by the Importin complex, which results in transport of the NLS containing protein from the cytoplasm to the nucleus (Lange et al. 2007). These can occur as a single monopartite pattern, or a bipartite pattern with a spacer sequence between the two individual parts of the pattern (Lange et al. 2007; Poon and Jans 2005).

1.3 Computational analysis of SLiMs

With the availability of sequence data in the post-genomic era, exploring what information is encoded at the sequence level has led to the discovery of regulatory binding sites. This has been demonstrated in the study of cis-regulatory elements and transcription factor binding sites (TFBS) at the DNA sequence level (Bussemaker et al. 2001; Moses et al. 2004). At the protein level, exploring protein interactions through DNA sequence analysis has become an ongoing endeavour. Bioinformatic approaches have been very useful in predicting both known SLiMs and potentially novel SLiMs (Neduva et al. 2005; Linding et al. 2007; Nguyen Ba et al. 2012). Based on the observation that SLiMs are enriched in IDRs, the proteome can be searched for over-representation of these sequence patterns (Van Roey et al. 2014). Because these sequence patterns are short, it is a challenge to differentiate functional SLiMs from random matches to the sequence pattern in the proteome. In addition, the function of a SLiM depends on spatial and temporal protein expression, since a SLiM-containing protein must be present alongside its corresponding SLiM-recognizing protein partner(s) in order to interact. Although there have been a number of efforts to predict SLiMs, it is still unknown whether all SLiMs have been identified and how to assign function to these predictions.
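As a minimal sketch of what searching a sequence for such patterns involves, the motif definitions given in Section 1.2 can be written as regular expressions. The dictionary, the function name and the example peptide below are hypothetical and serve only as illustration; this is a plain pattern scan, not the conservation-based prediction method used in this thesis.

```python
import re

# Patterns transcribed from the motif descriptions in Section 1.2.
# "X" (any amino acid) becomes "." in regular-expression syntax.
MOTIF_PATTERNS = {
    "Cdk1 minimal site (S-P)": r"SP",
    "Cy docking motif ([RK]-X-L)": r"[RK].L",
    "D-box (R-X-X-L)": r"R..L",
    "KEN box (K-E-N)": r"KEN",
}


def find_motifs(sequence, patterns=MOTIF_PATTERNS):
    """Return {motif name: [0-based start positions]}, overlapping matches included."""
    hits = {}
    for name, pattern in patterns.items():
        # A zero-width lookahead reports matches that overlap each other.
        lookahead = re.compile(f"(?=({pattern}))")
        hits[name] = [match.start() for match in lookahead.finditer(sequence)]
    return hits


# Hypothetical peptide, not a real protein sequence.
example = "MSPRKLAAKENRSPLRAALSP"
for name, positions in find_motifs(example).items():
    print(name, positions)
```

Even this 21-residue made-up peptide matches every pattern at least once, which illustrates why matches to such short patterns cannot be taken as functional without further evidence such as conservation or location within an IDR.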

1.4 Cooperativity of multiple SLiMs

The short sequence length of SLiMs allows for a given protein to contain multiple SLiMs (Van Roey et al. 2014). Multiple SLiMs can be recognized in combination, either as multiple copies of the same SLiM that act as a threshold for a signaling switch, such as a cluster of phosphorylation sites, or as multiple SLiMs that interact weakly individually but more strongly in combination for macromolecular assembly (Wright and Dyson 2015; Diella et al. 2008; Pierce et al. 2015; Guharoy et al. 2016). Although recognition of SLiM combinations has been shown in some cases, exploration of how this system of cooperativity generalizes throughout the proteome is still needed.

1.5 SLiM evolution

Understanding the evolution of SLiMs has been an ongoing challenge. Unlike large globular domains, which occur in homologous proteins through divergent evolution, SLiMs are degenerate, which allows them to be gained and lost through convergent evolution in unrelated proteins (Van Roey et al. 2014; Davey et al. 2015). The appearance and disappearance of SLiMs can occur rapidly, given that only a few point substitutions are needed to create a novel SLiM or to eliminate a deleterious or disadvantageous interaction (Van Roey et al. 2014; Davey et al. 2015). Despite their rapid evolution, some instances of SLiMs show evolutionary conservation, which has led to de novo prediction of SLiMs (Nguyen Ba et al. 2012; Davey et al. 2012b). More recently, it has been suggested that coevolution of kinases and substrates could explain increased signaling complexity based on the co-occurrence of phosphorylation sites and docking motifs (Gógl et al. 2015).

Coevolution, first observed by Darwin (Darwin 1862), has been defined as reciprocal changes between species (Janzen 1980), with many initial studies focusing on phenotypic traits in plant-pollinator interactions (Dobzhansky et al. 1950; Wallace 1953; Ehrlich and Raven 1964). Since then, coevolution has been applied across multiple disciplines to understand the observed biological diversity (Nuismer et al. 2010; Ridenhour 2014). More recently, coevolution has been studied at the molecular level in both RNA and protein structure (Chao et al. 2008; de Juan et al. 2013). Although coevolution has been explored at the protein sequence level through co-variation of single residues or groups of residues (de Juan et al. 2013), it has not been applied to co-occurring SLiMs.

1.6 Research Objectives

The functions and properties of SLiMs and how they have evolved have been the subject of extensive study in a few well-known proteins, but how far this understanding extends beyond these well-studied cases remains unknown.

Using bioinformatics methods and evolutionary models to address this, my goal is to explore how generalizable the function of SLiMs is to the proteome by investigating:

1. What percentage of predicted SLiMs are functional, and what effect(s) do they have?

Hypothesis 1: Deletions of predicted SLiMs will have quantitative effects on signal pathway output. Testing a sample of predicted SLiMs will allow me to estimate the percentage that are functional.

2. To what extent are SLiMs recognized in combination and to what extent can these combinations predict protein interacting partners?

Hypothesis 2: Docking sites in combination with phosphorylation sites will predict kinase protein targets.

3. Do SLiMs coevolve, and how can we detect this?

Hypothesis 3: Detecting correlated evolution will show that docking sites coevolve with phosphorylation sites.

2. Results

2.1 Quantifying the effect of predicted SLiM deletions

To estimate how many SLiMs have yet to be discovered, we used a bioinformatic tool previously developed in our lab that predicts SLiMs based on conservation (Nguyen Ba et al. 2012). We applied this tool to proteins involved in the well-characterized High Osmolarity Glycerol (HOG) pathway, the osmostress pathway in yeast (Hohmann 2002). The HOG pathway is a highly conserved MAPK pathway that responds to external osmostress by altering metabolism, cell cycle progression and gene expression to re-establish cell homeostasis (Brewster et al. 1993; Hohmann 2002; O’Rourke et al. 2002; Ruis and Schüller 1995; Westfall et al. 2004). The pathway has two partially redundant branches (the Sln1-dependent and Sho1-dependent branches) (Chen and Thorner 2007) that converge at the MAPKK (Pbs2) (Hohmann 2002; O’Rourke et al. 2002; Ruis and Schüller 1995; Klipp et al. 2005). When the pathway is activated by increased salt stress, Pbs2 phosphorylates Hog1, the MAPK, which then translocates to the nucleus, initiating transcription of up to 600 genes (Hohmann 2002; O’Rourke et al. 2002).

We decided to focus on the Sho1-dependent branch by introducing mutations (to the Ssk2 and Ssk22 MAPKKKs) that silenced the Sln1-dependent branch of the HOG pathway (O’Rourke and Herskowitz 2004; Tatebayashi et al. 2007). We interrogated the remaining Sho1 branch of the HOG pathway for SLiMs and found 29 predicted SLiMs (pSLiMs) occurring within 14 IDRs (a total of 147 residues, or 1.9% of amino acids in the proteins of the Sho1 branch, Figure 1).

Figure 1: Schematic diagram of the Sho1-dependent branch of the yeast high osmolarity glycerol (HOG) pathway as a model system for analysis of short linear motif function in signal transduction. Upon exposure to salt stress, a number of HOG components interact, leading to the MAPK kinase (MAPKK) Pbs2p initiating a phosphorylation cascade (grey arrows) that culminates in phosphorylation of the MAPK Hog1p. Activated Hog1p-P then accumulates inside the nucleus and initiates a complex transcriptional response in order to re-establish homeostasis within the cell. A Hog1-responsive promoter driving yeGFP reporter expression was used to quantify pathway activity for a series of deletions in predicted short linear motifs in pathway components (shown in pink). Figure adapted from a manuscript in preparation.

Although the HOG pathway has been characterized at the molecular level (Maeda et al. 1995; Marles et al. 2004), which has also led to quantitative analysis and modeling (Hohmann 2002; O’Rourke et al. 2002; Klipp et al. 2005), we found 24 novel pSLiMs with no function characterized in the literature. Surprised to find such a high number of novel predictions in a well-known pathway, we confirmed similar findings in two other well-known pathways: 36 pSLiMs in 11 proteins in the cell wall integrity pathway in yeast (Chen and Thorner 2007) (Appendix Table 1), and 23 pSLiMs in 14 proteins in the human canonical Wnt regulatory pathway (Kanehisa et al. 2015) (Davey et al. 2012b, Appendix Table 2). This suggests that the large number of novel pSLiMs identified in the HOG pathway is not a feature specific to the osmostress pathway, but more likely a general feature of signalling pathways.

We decided to test 14 unknown SLiMs, with 4 known SLiMs as controls. For each SLiM, we made a strain in which the predicted SLiM was deleted, and measured signal pathway output with a single-cell reporter assay in which a GFP reporter is driven by the promoter of Stl1, a known downstream target gene of the HOG pathway that is activated under salt stress (Cormack et al. 1997). Each experiment consisted of the wild-type reporter strain, a positive control (Pbs2 ∆96-99, the SH3 binding site/SLiM, a known mutation that inactivates the pathway), and a number of deletion strains being tested. Each strain was tested in at least 3 independent experiments (between-experiment replicates), and each experiment had 2-4 replicates per strain (within-experiment replicates).

To quantify the effect of each SLiM deletion I needed a way of comparing the signal pathway output to wild-type. To this end, I developed a data analysis pipeline that performed a standard histogram subtraction algorithm to calculate the difference in the distribution of induced samples and uninduced samples (Overton 1988). I defined this difference as the fraction of “on” cells that responded to the salt signal (Figure 2). I then defined the amount of response to the signal as the difference in mean GFP intensity of the fraction of “on” cells compared to the uninduced population (Figure 2).
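The idea of histogram subtraction can be sketched in a few lines of Python. This is a simplified bin-by-bin subtraction under stated assumptions, not necessarily the exact procedure of Overton (1988) or of the pipeline used here: both histograms are assumed to share the same bins, each is normalized to its own cell count, and only the excess of induced over uninduced cells in each bin is kept. The function name and the example counts are hypothetical.

```python
def fraction_on(induced_counts, uninduced_counts):
    """Simplified bin-by-bin histogram subtraction.

    Both arguments are cell counts per GFP-intensity bin, using the same bins.
    Returns (fraction of "on" cells, list of per-bin "on" fractions).
    """
    if len(induced_counts) != len(uninduced_counts):
        raise ValueError("histograms must share the same bins")
    induced_total = sum(induced_counts)
    uninduced_total = sum(uninduced_counts)
    if induced_total == 0 or uninduced_total == 0:
        raise ValueError("histograms must not be empty")
    # Normalize so that samples with different cell numbers are comparable.
    induced = [count / induced_total for count in induced_counts]
    uninduced = [count / uninduced_total for count in uninduced_counts]
    # Keep only the excess of induced over uninduced cells in each bin.
    on_per_bin = [max(i - u, 0.0) for i, u in zip(induced, uninduced)]
    return sum(on_per_bin), on_per_bin


# Hypothetical counts over five intensity bins, ordered from low to high GFP.
uninduced = [50, 40, 10, 0, 0]
induced = [10, 10, 10, 40, 30]
fraction, on_bins = fraction_on(induced, uninduced)
print(round(fraction, 2))
```

In this made-up example the induced sample has shifted most of its cells into the two brightest bins, so those bins carry the entire "on" fraction.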

$$\bar{x}_{\mathrm{on}} = \frac{\sum_{i=1}^{n} (N_i \times M_i)}{\sum_{i=1}^{n} N_i}$$

Equation 1. $\bar{x}_{\mathrm{on}}$ is the mean GFP intensity of “on” cells, where $i$ is a bin containing “on” cells, with the number of “on” cells, $N_i$, and the bin median, $M_i$, for each bin $i$, where $n$ is the total number of bins with “on” cells.
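Equation 1 is a weighted mean, which can be sketched in Python as follows. The function name, argument names and example values are hypothetical; the only assumption carried over from the text is that every “on” cell in a bin is assigned that bin's median intensity.

```python
def mean_on_intensity(on_counts, bin_medians):
    """Mean GFP intensity of "on" cells: bin medians weighted by "on" cells per bin."""
    if len(on_counts) != len(bin_medians):
        raise ValueError("need one median per bin")
    total_on = sum(on_counts)
    if total_on == 0:
        raise ValueError('no "on" cells to average over')
    weighted_sum = sum(count * median for count, median in zip(on_counts, bin_medians))
    return weighted_sum / total_on


# Hypothetical bins: 40 "on" cells with bin median 3.0 and 30 with bin median 3.5.
print(mean_on_intensity([40, 30], [3.0, 3.5]))
```

Because the weights are cell counts, a bin holding few “on” cells moves the mean very little, even if its median intensity is extreme.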

Figure 2: Quantifying the effect of SLiM deletions on the HOG signalling pathway output. HOG pathway specific output is measured with an STL1 promoter driving a GFP reporter. The population of “on” cells that respond to salt signal is defined as the difference in the distribution of induced and uninduced cells. Determination of the mean values for the “on” and “off” (uninduced) populations allows calculation of the change in (∆) GFP mean intensity, which is used for subsequent analysis of novel motif deletions. Figure adapted from a manuscript in preparation.

Given that the histogram subtraction calculates the number of “on” cells that have a GFP intensity within each bin, but the corresponding GFP intensity of each individual “on” cell is not known, I estimated the mean GFP intensity of “on” cells by approximating each cell to have a GFP intensity equal to its corresponding bin median (Equation 1). Using these two metrics, I compared the fraction of “on” cells and the change in mean GFP intensity of “on” cells, for each predicted SLiM deletion compared to wild-type.

[Figure 3 plot. Horizontal axis: fraction of “on” cells. Vertical axis: change in mean GFP intensity of “on” cells (log). Categories: inactivating; partially inactivating; wild-type and small or no effect. Legend: wild-type; Hot1 Δ373-85 (known); Pbs2 Δ91-100 (known); Pbs2 Δ269-74 (novel); Sch9 Δ279-95 (novel).]

Figure 3. Novel SLiM deletions show significant effects on HOG signaling pathway output. Each symbol represents the data extracted from an individual replicate of a flow cytometry experiment for a single motif deletion. The horizontal axis represents the fraction of “on” cells, and the vertical axis represents the change in mean GFP intensity of “on” cells for each motif deletion. Deletions are classified, based on changes from wild-type, into three categories: inactivating, partially inactivating, or small or no effect (shown by grayscale colour). The partially inactivating category contains mutations HOT1∆373-85 and SCH9∆279-95, and the inactivating category contains PBS2∆91-100 and PBS2∆269-74. Two of the four mutations that fall into the inactivating and partially inactivating categories are novel (SCH9∆279-95 and PBS2∆269-74). Histograms shown above are representative of the distribution of each respective category (indicated by braces). Means for the corresponding mutations are shown as clear symbols, with the standard deviations shown as error bars. Figure from manuscript in preparation.

I found that the deletions showed a broad range of quantitative effects, which I classified as inactivating (fraction of “on” cells < 25%, Figure 3), partially inactivating (25% < fraction of “on” cells < 75%, Figure 3), or small or no effect from wild-type (fraction of “on” cells < 10% difference compared to wild-type, Figure 3 and Figure 4). To account for day-to-day variation between experimental replicates, strains were compared to the mean values for wild-type replicates that were performed in the same experiment (on the same day).
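As a rough illustration of the same-day comparison and the three categories, the sketch below expresses each strain's fraction of “on” cells relative to the mean of the wild-type replicates from the same experiment and then applies the thresholds. Treating the thresholds as wild-type-relative fractions is an assumption, as are the function names; values that fall between the stated ranges are left unclassified because the text does not assign them.

```python
from statistics import mean


def relative_fraction_on(strain_fractions, wt_fractions_same_day):
    """Express each replicate relative to the same-day wild-type mean."""
    wt_mean = mean(wt_fractions_same_day)
    return [f / wt_mean for f in strain_fractions]


def classify(rel_fraction):
    """Map a wild-type-relative fraction of "on" cells to a category."""
    if rel_fraction < 0.25:
        return "inactivating"
    if rel_fraction < 0.75:
        return "partially inactivating"
    if abs(rel_fraction - 1.0) < 0.10:
        return "small or no effect"
    # Assumption: the text does not assign values outside the ranges above.
    return "unclassified"


# Hypothetical replicate: 40% "on" cells against a same-day wild-type of 80%.
rel = relative_fraction_on([0.4], [0.8, 0.8])
category = classify(rel[0])
```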

[Figure 4 plot. Vertical axis: mean wild-type centered change in mean GFP intensity of “on” cells (log). Horizontal axis: strains (WT reporter and deletion strains in NUP133, OPY2, PBS2, SCH9, SHO1, SIN3, STE11, STE20 and STE50). Asterisks mark statistically significant deletions.]

Figure 4. Novel SLiM deletions show small but significant effects on HOG signaling pathway output. Each bar represents the mean of all replicates of a single motif. The wild-type strain and all mutant strains that showed a small effect (fraction of “on” cells < 10% difference from wild-type) are listed along the horizontal axis, and the vertical axis represents the mean wild-type centered change in mean GFP intensity of “on” cells for each strain. Five deletions (OPY2 ∆233-45, PBS2 ∆292-8, SCH9 ∆178-89, STE20 ∆66-71 and STE50 ∆341-6) were statistically significant (indicated by star symbols). Four of the five deletions are novel (indicated by boxes around star symbols), and the remaining deletion is previously known (OPY2 ∆233-45). Means for the corresponding deletions are shown as clear symbols, with three standard errors shown as error bars. Figure from manuscript in preparation.

2.1.1 Effect of mutations in known SLiMs on HOG signaling

Three of the four known SLiMs showed reduced reporter activity: these SLiMs were found in Pbs2 (our positive control) (Maeda et al. 1995; Marles et al. 2004), Hot1 (a transcription factor downstream of Hog1) (Burns and Wente 2014; Krantz et al. 2006) and Opy2 (a membrane anchor protein for the adaptor, Ste50p) (Yamamoto et al. 2010).

Deletion of the known SH3-domain binding site in Pbs2 was classified as inactivating, where the signal response was virtually abolished (Figure 3). To confirm that the observed effect on signal pathway output was a result of deleting an interaction site and not due to a global effect on Pbs2 (such as protein misfolding), we made a Pbs2 strain with two point mutations (residues P96 and P99) at highly conserved positions known to affect binding (Maeda et al. 1995). The point mutant Pbs2 strain showed similar results to the deletion (Appendix Table 3), confirming that the effect on signal pathway output was a result of the deleted SLiM interaction site. Deletion of the known Hog1 binding site in Hot1 (Alepuz et al. 2003) was classified as partially inactivating, with an observed bimodal response where some of the cells showed no difference from the uninduced state, and other cells showed a weak pathway response (Figure 2 and Figure 3). This result is similar to previous studies that have also implicated Hot1 in a bimodal response in pathway activation (Burns and Wente 2014). Deletion of a known Ste50 binding site in Opy2 showed a small but significant difference in pathway response from wild-type (P<0.05, Appendix Table 3; Figure 4). The fourth known SLiM tested, the Bem1 binding site in Ste20 (Takaku et al. 2010), showed no significant quantitative effect on pathway output relative to wild-type when deleted (Appendix Table 3).

2.1.2 Effect of mutations in pSLiMs on HOG signaling

Of the 14 novel pSLiMs tested, six showed significant quantitative effects on pathway output. In selecting pSLiMs to test, preference was given to those longer than four residues that occur in kinases and other signalling components. Of the six, two showed a highly significant change in the fraction of “on” cells (PBS2 Δ269-74, 12% of wild-type, P < 10⁻⁵, N = 4; and SCH9 Δ279-95, 71% of wild-type, P < 10⁻⁸, N = 11; Figure 3). The remaining four showed small but significant effects on either the fraction of “on” cells or the wild-type-centered change in pathway activity (all P<0.001; Figure 4) when compared to wild-type.

These varied effects include nearly complete abolishment of the pathway activity, subtle quantitative effects, as well as bimodal responses. This indicates that the novel motifs show a varied functional significance similar to the previously known motifs. Taken together, the deletions of both known SLiMs and pSLiMs in the HOG signalling pathway show a broad range of quantitative effects on signalling.

2.2 Testing how recognition of SLiM combinations generalizes across the proteome for target specificity

To understand how often SLiMs are recognized in combination, I decided to test a recent model that suggests that cyclin-dependent kinase 1 (Cdk1), a master regulator of cell-cycle progression, uses a combination of motifs as a mechanism of regulating phosphorylation (Valk et al. 2014). Cdk1 phosphorylates hundreds of targets, and it needs to recognize and bind specific substrates in both a temporally and spatially dependent manner (Ubersax and Ferrell 2007). Previously it has been shown that Cdk1 substrates tend to have clusters of phosphorylation sites, that conservation of phosphorylation sites can be used to predict Cdk1 substrates, and that local conservation of phosphorylation clusters can be used to better predict kinase targets (Moses et al. 2007; Nguyen Ba and Moses 2010; Lai et al. 2012). More recently it has been suggested that the presence of docking sites with phosphorylation site clusters can predict kinase targets (Kõivomägi et al. 2011a). This recent model is of particular interest because it provides a hypothesis for how Cdk1 can differentiate targets, based on detailed characterization of the multiple SLiMs required for recognition.

In this model, based on in vitro kinase assays and in vivo cell viability assays of Sic1 (Kõivomägi et al. 2011a; Kõivomägi et al. 2013; McGrath et al. 2013), the S-phase cell-cycle inhibitor in yeast, phosphorylation of a Cdk1-cyclin specific target depends on the presence of a phosphorylated site, a phosphorylation site, and a docking motif, with minimum spacing between these regulatory elements (Kõivomägi et al. 2013; Valk et al. 2014). The phosphorylated site is recognized by the phosphoadaptor Cks1, which is thought to enhance subsequent phosphorylation of the Cdk1 phosphorylation site, and the docking motif is recognized by the cyclin subunit, which is thought to confer cell-cycle specificity (Valk et al. 2014) (Figure 5). The minimum spacer length was determined by varying the spacing between motifs and observing how this affects Cdk1’s ability to phosphorylate its targets (Kõivomägi et al. 2013). Additionally, the ordering of these motifs, where the Cks1 phosphorylated site is N-terminal to the Cdk1 phosphorylation site and the Clb5 docking motif is C-terminal to the Cdk1 phosphorylation site, is important for Cdk1’s ability to phosphorylate its targets (Figure 5) (Kõivomägi et al. 2013). Taken together, the presence and ordering of these three motifs with minimum spacing is proposed to reflect the structural constraints imposed by the Cks1-Cdk1-Clb5 complex.

Figure 5. Schematic of the Clb5-Cdk1-Cks1 complex as a model of SLiM combinations for cyclin specific target recognition. A Clb5 specific target is shown with the Cks1 phosphorylated site represented as an oval, the Cdk1 phosphorylation site represented as a trapezoid and the Clb5 docking site represented as a rectangle, with the minimum spacer lengths between these SLiMs indicated by the braces. The strong sequence pattern of each site is shown within each respective symbol and weak sequence patterns are bolded and underlined, where square brackets ( [ ] ) indicate an amino acid position that can match more than one amino acid. X represents any amino acid and Φ represents any hydrophobic amino acid, with Φ1 = [MPVLIFWYQ] and Φ2 = [MPVLIFY] (Dinkel et al. 2012). An amino acid followed by “{a,b}” represents repeated matches, where the amino acid can match a minimum of “a” times and a maximum of “b” times.

2.2.1 Enrichment analysis of SLiM combinations to discriminate Clb5 specific targets

If this mechanism is a way for Cdk1 to regulate phosphorylation in a cyclin-specific manner, then this combination of SLiMs at minimum spacing with the correct order should discriminate cyclin-specific targets from other Cdk1 targets. To test this, I first defined a non-Clb5 specific dataset (proteins that are Cdk1 targets but are not Clb5 specific) using the Kinase Interaction Database (KID) (Sharifpoor et al. 2011) and a non-Cdk1 dataset (which I also refer to as the ‘rest of the proteome’) (SGD Project 2011) to use as backgrounds (see Methods). I then defined a distinct subset of the Cdk1 targets called the Clb5 specific dataset, where Clb5 is the S phase cyclin in yeast, based on substrates identified from an in vitro kinase study and a more recent in vitro/in vivo phosphoproteomic study (see Methods) (Loog and Morgan 2005; Li et al. 2014). To determine which feature is best able to differentiate Clb5 specific targets, I used regular expressions that defined i) the combination of SLiMs, ii) the combination in the proposed order, and iii) the combination in the proposed order with minimum spacing between motifs (Figure 5) to search within the IDRs of the Clb5 specific, non-Clb5 specific, and non-Cdk1 datasets. Regular expressions are commonly used to represent SLiMs due to their simplicity, where a pattern matches in an all-or-none manner (Schneider 2002). For the Cks1, Cdk1 and Clb5 motifs, I tested all combinations of the strong (Figure 5) and weak (Figure 5, indicated in bold and underlined) regular expressions. Of all the combinations of three regular expressions tested, the presence and ordering of the weak Cks1, weak Cdk1, and strong Clb5 motifs was most significantly enriched in the Clb5 specific dataset in comparison to the non-Clb5 specific dataset (13 out of 19 Clb5 specific targets vs. 63 out of 165 non-Clb5 specific targets, P < 0.012, one-sided Fisher’s test) (Figure 6, Table 1). This suggests that the combination of these SLiMs in the proposed order is a way Cdk1 discriminates Clb5 specific targets.
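The three pattern types (presence, presence in order, and presence in order with minimum spacing) can be sketched with ordinary regular expressions. The patterns below are illustrative stand-ins rather than the strong and weak patterns of Figure 5: only the docking motif follows the text (R or K, any residue, L), the Cdk1 site uses the minimal [ST]P consensus, and the Cks1 pattern and both spacer lengths are hypothetical.

```python
import re
from itertools import permutations

CKS1 = "TP"      # assumed stand-in for the Cks1 phosphorylated site
CDK1 = "[ST]P"   # minimal Cdk1 consensus phosphorylation site
CLB5 = "[RK].L"  # docking motif as described in the text (R or K, X, L)


def motif_features(idr, gap1=12, gap2=16):
    """Test one IDR sequence for the three sequence-pattern definitions.

    gap1 and gap2 are hypothetical minimum spacer lengths (in residues).
    """
    motifs = (CKS1, CDK1, CLB5)
    # match: all three motifs present as distinct sites, in any order.
    match = any(re.search(".*".join(p), idr) for p in permutations(motifs))
    # match-order: Cks1 site, then Cdk1 site, then docking motif (N to C).
    order = re.search(".*".join(motifs), idr) is not None
    # match-order-distance: same order with a minimum spacer between motifs.
    spaced = f"{CKS1}.{{{gap1},}}{CDK1}.{{{gap2},}}{CLB5}"
    distance = re.search(spaced, idr) is not None
    return {"match": match, "match-order": order, "match-order-distance": distance}


# Hypothetical sequences: motifs in order but closely spaced, and in reverse order.
close = motif_features("AATPAASPAAKALAA")
reverse = motif_features("KALAASPAATP")
```

Because each pattern either matches or does not, this reproduces the all-or-none behaviour of regular expressions noted above; it says nothing about how well a site matches.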

Figure 6. Clb5 specific targets compared to non-Clb5 specific targets show significant difference in enrichment of the Clb5 specific proposed combination and order of motifs. The barplots represent the fraction of proteins in each dataset where the sequence pattern matched, with the explicit counts shown above each barplot. The datasets are indicated in the legend - Clb5 specific shown in black, non-Clb5 specific shown in white and non-Cdk1 shown in light grey. The three sequence patterns tested: match (presence of all three SLiMs), match-order (presence of all three SLiMs in proposed order) and match-order-distance (presence of all three SLiMs in the proposed order with the minimum spacer length) are indicated on the x-axis. P-values were calculated with Fisher’s exact test (one-sided), with the braces indicating comparison of the Clb5 specific dataset to the non-Clb5 specific dataset.

Despite the ordering of motifs showing a significant enrichment in the Clb5 specific dataset, there were still six Clb5 specific targets that did not contain the three motifs in the proposed order. Three of these six targets (Fin1, Spc110, and Cnn1) did in fact match this pattern, but the matches were not all found in the predicted IDRs (using DISOPRED3) (Table 1). Comparing the IDRs predicted by DISOPRED3 to a database of nine disorder predictors (Oates et al. 2013) indicated discrepancies between the predictors. The regions where the motifs occur overlap with the regions on which the predictors disagree, which suggests that these are likely motifs in IDRs whose boundaries are difficult to predict (Appendix Figure 1). The remaining three proteins (Cdc19, Dpb2, and Bfr1) contained few predicted IDRs (1.2%, 21.3%, and 28.7%, respectively), although they did have predicted domains (SUPERFAMILY, Gough et al. 2001, Oates et al. 2013). These three targets show a different average percentage disordered (17.1% vs. 54.4%, P = 0.013, t-test) and a different distribution of phosphorylation sites and docking motifs (P = 0.013 for phosphorylation sites, P = 0.009 for docking sites, t-test, Appendix Figure 2) when compared to the other 16 targets in the Clb5 specific dataset.

Table 1: Summary of targets in Clb5 specific dataset. The top 13 targets matched the Clb5 specific sequence pattern (Cks1, Cdk1 and Clb5 motifs in N-to-C order). The three targets that did not match, highlighted in the grey box, are potential false negatives due to error in the disorder predictor (DISOPRED3). The remaining three targets that did not match, at the bottom of the table, are potential non-multisite targets (see text above and Discussion). Clb5 specific substrates were based on an in vitro kinase study (Loog and Morgan 2005) and an in vitro/in vivo phosphoproteomic study (Li et al. 2014) (see Methods).

For the 63 targets in the non-Clb5 specific dataset that contained the three motifs in the proposed order, I ranked the targets by KID score and looked for evidence in the literature that supports Cdk1-Clb5 specific phosphorylation. I found that 12 proteins (Swe1, Kar9, Ndd1, Fkh2, Yta7, Pah1, Sld3, Kar3, Dbf20, Smc4, Bud4 and Fpk1) showed evidence for Clb5 phosphorylation.

Swe1 is the G2/M inhibitor analogous to Sic1 for the S phase. It is stably expressed during G1 and peaks at the end of S phase upon targeted degradation in the G2/M phase (Sia et al. 1998). Swe1 is phosphorylated in vitro by Clb5-Cdk1 (Ptacek et al. 2005), has been implicated in DNA damage response (Liu and Wang 2006), and is potentially phosphorylated at the bud neck during mid S-phase by Clb5-Cdk1 (Lee et al. 2005).

Kar9 mediates cytoplasmic microtubule/actin interactions for proper spindle assembly for metaphase at the bud neck (Kammerer et al. 2010). Kar9 phosphorylation is highest during metaphase in mitosis and during S phase (Leisner et al. 2008). Clb5-Cdk1 phosphorylation of Kar9 during the G/S phase is important for asymmetric localization of Kar9 to one of the spindle pole bodies (SPBs) (Moore et al. 2006, Moore and Miller 2007) and for proper SPB inheritance for bud cells (Hotz et al. 2012). Kar9 was also identified with the highest affinity for Clb5 in a recent in vivo protein-fragment complementation assay (PCA) (Ear et al. 2013). The known phosphorylation sites are the same sites that match the regular expression pattern.

Ndd1 is the transcriptional activator of the CLB2 mitotic gene cluster, and has previously been shown to be phosphorylated by Clb2-Cdk1 functioning in a positive feedback loop to activate expression of the CLB2 gene cluster (Darieva et al. 2003, Reynolds et al. 2003, Edenberg et al. 2015). Ndd1 is synthesized during late G1/S phase (Loy et al. 1999, Koranda et al. 2000), with the mechanism of repressing Ndd1 still unknown. Ndd1 has been linked to DNA damage, which inhibits turnover of Ndd1 at the DNA damage checkpoint (Edenberg et al. 2015). Although how Ndd1 is repressed during G/S phase is unknown, it has been suggested that priming phosphorylation (phosphorylation of a site that promotes phosphorylation of another site) may occur to initially overcome the repression and activate the positive feedback loop, where Clb2-Cdk1 subsequently phosphorylates Ndd1 (Wittenberg and Reed 2005).

Fkh2 is part of the forkhead family of transcription factors important for cell cycle control, cell death and proliferative responses and differentiation (Sherriff et al. 2007, Edenberg et al. 2015). Fkh2 has been shown to be phosphorylated by Clb5-Cdk1 at the C-terminal end, between the S and G2 phase (Pic-Taylor et al. 2004).

Yta7 is a chromatin boundary protein, where deletion causes decreased transcription of HTA1 (a histone gene) (Tackett et al. 2005). The region surrounding the AAA-ATPase domain is hyperphosphorylated by Clb5-Cdk1 (Kurat et al. 2011). Phosphorylation of Yta7 causes dissociation from the HTA1 chromatin, thereby allowing transcription of HTA1 and other histone genes during S phase. Histone synthesis is one of the key steps that occur in S phase in parallel with DNA replication and is important for preparation for mitosis (Gunjan et al. 2005).

Pah1 is a phosphatase that regulates nuclear/ER membrane growth and phospholipid synthesis, and is phosphorylated at S602, T723, and S744 by Cdk1-cyclin B (recombinant from human) (Choi et al. 2011, Hsieh et al. 2015). It has been previously identified as a target of Clb5 specific phosphorylation from in vitro protein chip (protein microarray) analysis and an in vivo PCA (Ptacek et al. 2005, Ear et al. 2013).

Sld3 is a paralog of Sld2, which forms a co-complex with Dpb11 for initiation of DNA replication. It was first found to be hypophosphorylated in S phase (Nakajima and Masukata 2002). Later, with phosphorylation sites identified at T600 and S622, Sld3 was demonstrated to be a Clb5 specific target, both in an in vitro kinase assay using purified GST (glutathione S-transferase)-tagged protein and through phosphomimetic mutants of Sld2 co-expressed with a Sld3-Dpb11 fusion, which bypassed the necessity for the Clb5 cyclin (Tanaka et al. 2007, Zegerman and Diffley 2007).

Kar3 is a kinesin involved in the transport of newly captured chromosomes along microtubules (Tanaka et al. 2005), and has been shown to interact with Clb5 and Cdk1 in two-hybrid interaction experiments (Maekawa et al. 2003). Recently Kar3 has been shown to be phosphorylated in S phase (using truncated Kar3), which regulates the interaction with Bim1 and is postulated to be a reversible mechanism for the minus-end movement of the Kar3 complex. Five sites matching Cdk1 consensus phosphorylation sites were mapped by mass spectrometry to the N-terminus (T7, S18, T19, S21, T39, and S58) (Molodtsov et al. 2016). Kar3 has also been linked to kinetochore-microtubule interactions after stressful DNA replication in S-phase (Liu et al. 2011).

Dbf20 is a kinase important for the mitotic exit network (MEN). Dbf20 is expressed at low levels independent of cell cycle regulation (Toyn et al. 1991). Dbf20 is a phosphoprotein activated by Cdc15, but it is possible that Cdk1 regulates phosphorylation of independent sites (other than those recognized by Cdc15) in order to keep it inactive and unable to translocate to the bud neck (where it is found when active, phosphorylating its downstream targets in the MEN) (Hwa Lim et al. 2003).

Smc4, a protein involved in nuclear segregation and chromosome condensation, is phosphorylated in early mitosis, and deletion of Clb5 decreases this phosphorylation (Robellet et al. 2015). Furthermore, mutation of the N-terminal Clb5 docking sites also resulted in decreased phosphorylation of Smc4 (tested in an N-terminal peptide, residues 1-142, with the C-terminal residues 143-1418 removed and the RxL motifs in the peptide replaced with alanine). Previous work had identified the Cdk1 phosphorylation site at S128 by mass spectrometry (Kao et al. 2014).

Bud4 is involved in bud-site selection and septin organization, and Fpk1 is a protein with unknown function that also localizes to the cortical site of bud emergence. Both of these substrates were previously identified as being phosphorylated by Clb5-Cdk1 in protein chip in vitro kinase assays (Ptacek et al. 2005).

2.2.2 Evolutionary analysis of SLiM combinations to discriminate Clb5 specific targets

If using combinations of SLiMs in a specific order is an important recognition mechanism that Cdk1 employs to regulate phosphorylation, then it should be conserved over evolution. Given that the ordering and presence of motifs gave the most significant enrichment in Clb5 specific targets, I used the same regular expression to search the yeast orthologs (see Methods) of each of the 13 Clb5 specific and the 63 non-Clb5 specific proteins that matched in S. cerevisiae. By comparing the average fraction of orthologs that matched the sequence pattern in the 13 Clb5 specific targets to that in the 63 non-Clb5 specific targets, I tested conservation of the combination and ordering of SLiMs between these sets. This comparison showed no significant difference in the average fraction of orthologs between the two sets (0.482 for Clb5 specific targets vs. 0.459 for non-Clb5 specific targets, P = 0.393, t-test) (Figure 7A), although there was a significant difference when compared to the rest of the proteome (0.482 for Clb5 specific targets vs. 0.300 for non-Cdk1 targets, P < 0.005, t-test) (Figure 7A). To control for the error in disorder predictions that I observed in the enrichment analysis, and to avoid the possibility of missing motifs because of the sequence region searched, I also compared the average fraction of orthologs between datasets that matched the combination and ordering of SLiMs based on the entire sequence. Although there is a larger average fraction of orthologs where this pattern is conserved in all datasets, there is still no significant difference between the Clb5 specific and non-Clb5 specific datasets (0.892 for Clb5 specific targets vs. 0.853 for non-Clb5 specific targets, P = 0.202, t-test) (Figure 7B).
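The conservation measure used here, the fraction of a protein's orthologs that match the pattern, and the comparison of its average between two sets can be sketched as follows. The unequal-variance (Welch) form of the t statistic is an assumption, since the text does not state which t-test variant was used, and the function names are hypothetical; converting the statistic to a P-value would additionally require the t distribution.

```python
import re
from math import sqrt
from statistics import mean, variance


def conserved_fraction(pattern, ortholog_seqs):
    """Fraction of ortholog sequences in which the pattern is found."""
    hits = sum(1 for seq in ortholog_seqs if re.search(pattern, seq))
    return hits / len(ortholog_seqs)


def welch_t(a, b):
    """Welch's t statistic for two lists of per-protein fractions."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical orthologs of one protein, searched with the docking motif alone.
fraction = conserved_fraction("[RK].L", ["AKALA", "AAAA", "RGL", "KKK"])
```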


Figure 7. Clb5 specific targets compared to non-Clb5 specific targets show no significant difference in conservation of the Clb5 specific proposed combination and order of motifs among orthologs. A) Fraction of orthologs that contain the Clb5 specific sequence pattern within IDRs. B) Fraction of orthologs that contain the Clb5 specific sequence pattern, searched in the entire sequence. The barplots represent the average fraction of orthologs where the Clb5 specific sequence pattern was conserved for each dataset indicated on the x-axis. The sequence pattern includes the presence of the three SLiMs in the proposed order (N-terminal - Cks1 phosphorylated site – Cdk1 phosphorylation site – Clb5 docking site - C terminal). The error bars indicate standard deviation. P-values were calculated using a t-test, with the smaller brace indicating comparison of the Clb5 specific dataset to the non-Clb5 specific dataset, and the larger brace indicating comparison of the Clb5 specific dataset to the rest of the proteome.

2.2.3 Enrichment analysis of SLiM combinations to discriminate Clb5 specific targets using a Hidden Markov Model framework

In order to address whether the lack of significant enrichment amongst the Clb5 specific targets was due to the simple representation of SLiMs as regular expressions, I decided to employ a Hidden Markov Model (HMM). HMMs are probabilistic models that, unlike regular expressions, describe the probability of observing each amino acid at every position of the SLiM (see Methods) (Durbin et al. 1998; Eddy 1998). This can be advantageous when considering positions that allow a subset of amino acids to occur. For example, consider the Cy docking motif, which is defined by an arginine or lysine, followed by any amino acid, followed by a leucine (R or K-X-L, Figure 5). A regular expression treats an arginine and a lysine as equally acceptable at the first position. In comparison, an HMM estimates separate probabilities of observing an arginine or a lysine at the first position of the Clb5 motif, based on the training dataset that was used to construct the HMM (Figure 8).
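The contrast with a regular expression can be illustrated with position-specific probabilities estimated from aligned examples. The sketch below is only the emission part of such a model (a position-specific scoring matrix), not a full profile HMM with insert and gap states; the training sites, the pseudocount, and the uniform background are all hypothetical.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_probabilities(aligned_sites, pseudocount=1.0):
    """Estimate per-position amino acid probabilities from aligned sites."""
    total = len(aligned_sites) + pseudocount * len(AMINO_ACIDS)
    table = []
    for pos in range(len(aligned_sites[0])):
        counts = Counter(site[pos] for site in aligned_sites)
        table.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return table


def log_odds(site, table, background=1 / 20):
    """Sum of per-position log2 odds against a uniform background."""
    return sum(log2(table[i][aa] / background) for i, aa in enumerate(site))


# Hypothetical training set in which arginine is more common than lysine.
table = position_probabilities(["RSL", "RAL", "RPL", "KTL"])
score_r = log_odds("RAL", table)
score_k = log_odds("KAL", table)
```

Both "RAL" and "KAL" match the regular expression [RK].L equally, whereas here the arginine-containing site scores higher because arginine was seen more often at the first position of the training sites.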

Figure 8. Sequence logo representing the Clb5 specific sequence HMM profile. The Cks1 motif is shown spanning position 1 to 4, the Cdk1 motif is shown spanning position 5 and 6, and the Clb5 docking motif is shown spanning position 7 to 11 (indicated with braces), where positions are indicated along the top. The letters represent amino acids, with the heights corresponding to the relative frequency of observing a given amino acid at that position based on the training set used to construct the model, plotted in bits on the y-axis. The red lines between the motifs, at position 4/5 and position 6/7, represent gap insertions with the gap length distribution based on the training dataset used to construct the model.

Similarly, the HMM will also allow variability of the minimum spacer lengths between SLiMs, based on the training dataset that was used to construct the HMM. In this case, I used the 13 substrates from the Clb5 specific dataset that matched the combination of SLiMs in the correct order based on the results of my previous enrichment analysis. I manually aligned the 13 protein sequences, and used hmmbuild (HMMER version 3.0, 2010) to construct my HMM profile (see Methods) (Figure 8).

Using this profile, I searched the proteome looking for recall amongst each dataset (Clb5 specific, non-Clb5 specific, and non-Cdk1) to test for enrichment of Clb5 specific targets (Figure 9). The HMM did not significantly enrich for Clb5 specific targets (6 out of 19 Clb5 specific targets vs. 24 out of 165 non-Clb5 specific targets, P = 0.0646, one-sided Fisher’s test) (Figure 9). Furthermore, a smaller fraction of the Clb5 specific targets was identified when compared to the simpler regular expression analysis (6/19 vs. 13/19). This suggests that using a more complex model that explicitly accounts for the variability of both the amino acids and the spacer lengths between motifs does not improve the discrimination of Clb5 specific targets from non-Clb5 specific targets.
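For reference, a one-sided Fisher's exact test on counts of this form can be computed directly from the hypergeometric distribution. The sketch below uses only the Python standard library and is not the software used for the analysis; the function name is hypothetical, while the counts are those reported in the text.

```python
from math import comb


def fisher_one_sided(hits_a, n_a, hits_b, n_b):
    """One-sided Fisher's exact test for enrichment of hits in set A.

    Returns P(X >= hits_a), where X is the number of hits that fall in
    set A if the hits_a + hits_b hits were spread at random over both sets.
    """
    total = n_a + n_b
    hits = hits_a + hits_b
    upper = min(n_a, hits)
    tail = sum(
        comb(hits, k) * comb(total - hits, n_a - k)
        for k in range(hits_a, upper + 1)
    )
    return tail / comb(total, n_a)


# Counts from the text: HMM profile (6/19 vs. 24/165) and the
# regular expression analysis (13/19 vs. 63/165).
p_hmm = fisher_one_sided(6, 19, 24, 165)
p_regex = fisher_one_sided(13, 19, 63, 165)
```

The two results can be checked against the reported P = 0.0646 and P < 0.012.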

Of the 86 substrates that matched the profile-HMM, 45 contained all match sites in IDRs. Among the top 30 predicted targets of the 86 substrates that matched the profile-HMM, 17 targets contained all match positions in predicted IDRs. Five (Sld2, Sic1, Orc2, Far1, Orc1) were targets from my Clb5 specific dataset. Nine (Ash1, Plm2, Ace2, Whi5, Ste20, Yhp1, Stb1, Bem3, Swi5) were targets from the non-Clb5 dataset, where three of the nine (Ash1, Plm2, Ace2) matched the regular expression model. The remaining three hits (Zrg8, Sli15, Wsc4) were targets from the rest of the proteome that also matched the regular expression model. In parallel, I also explored a motif discovery approach (GLAM2) that explicitly accounts for gaps and has been demonstrated to be effective for re-discovering known short protein motifs (Frith et al. 2008), but this was unsuccessful.


Figure 9. Clb5 specific targets compared to non-Clb5 specific targets show no significant difference in enrichment of the Clb5 specific sequence HMM profile. The barplots represent the fraction of proteins in each dataset where the HMM profile matched, with the explicit counts shown above each barplot. The datasets are indicated in the legend - Clb5 specific shown in black, non-Clb5 specific shown in white and non-Cdk1 shown in light grey. P-value was calculated with Fisher’s exact test, with the braces indicating comparison of the Clb5 specific dataset to the non-Clb5 specific dataset.

Although the evolutionary analysis did not show evidence for conservation of the Clb5 specific sequence pattern, enrichment analysis from the regular expression model suggests Cdk1 uses this sequence pattern to differentiate Clb5 specific targets. In addition, for a number of targets containing this pattern in the non-Clb5 dataset and predicted from the HMM profile, careful reading of the literature suggests these are likely Clb5 specific targets.

2.3 Detecting coevolution of SLiMs

In order to detect whether SLiMs coevolve, I decided to test a recent model that suggests that Cbk1, a kinase important for mitotic exit, coevolved with its substrates through sequential acquisition of phosphorylation sites (H-X-R-X-X-[S or T]) and docking motifs (Y-X-F-P) (Gógl et al. 2015). Cbk1 is part of the NDR family of kinases in fungal species. In budding yeast, Cbk1 bound to its co-activator, Mob2, forms a co-complex that controls destruction of the septum that forms between the mother and daughter cell during cytokinesis (Weiss 2012). The Cbk1-Mob2 co-complex recruits Ace2, a transcription factor that activates transcription of hydrolases that trigger the degradation of the septum (Doolin et al. 2001). The hypothesis that the fungal NDR kinase evolved a hydrophobic region that recognizes the docking motif was based on the observation that docking motifs were acquired after phosphorylation sites in Cbk1 substrates (both known and predicted) (Gógl et al. 2015). It is then postulated that this placed a selective pressure on Cbk1 substrates to evolve the docking motif (Gógl et al. 2015). If this is indicative of coevolution of kinases and substrates, then it follows that these motifs coevolved as well.

Given that SLiMs are involved in regulating protein function and that the presence or absence of a SLiM can directly affect a protein’s function, I consider them a quantitative molecular trait. I can then apply comparative phylogenetic methods that have been classically used to study phenotypic traits (Pagel 1999) to test for coevolution of SLiMs. Two coevolving traits show correlated evolution, based on the assumption that functionally linked traits are gained and lost together (Barker and Pagel 2005). To test whether Cbk1 phosphorylation sites and docking motifs coevolved, given that they are functionally linked for Cbk1 substrate recognition, I employed a Discrete model (Pagel 1994) to test for correlated evolution between the two motifs. This comparative phylogenetic analysis uses binary traits to test the null hypothesis, in this case that Cbk1 phosphorylation sites and docking motifs evolved independently along a phylogeny. The alternative hypothesis is that Cbk1 phosphorylation sites and docking motifs evolved dependently, showing correlated evolution. Evidence for correlated evolution is detected when the model describing dependent evolution fits the trait data significantly better than the model describing independent evolution. This is assessed using a likelihood ratio test (LRT) (see Methods).
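As a minimal sketch of the likelihood ratio test described above, the statistic and its chi-square P-value can be computed as follows. This assumes the four degrees of freedom quoted with the results; the function names and the example log-likelihoods are hypothetical and are not taken from the analysis itself. The closed form used for the chi-square tail is valid only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires even, positive df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(loglik_independent, loglik_dependent, df=4):
    """Return (LRT statistic, P-value) for two nested models."""
    stat = 2.0 * (loglik_dependent - loglik_independent)
    return stat, chi2_sf_even_df(stat, df)


# Hypothetical log-likelihoods, for illustration only.
stat, p = likelihood_ratio_test(-30.0, -24.0)
print(f"LRT = {stat:.2f}, P = {p:.4f}")
```

With four degrees of freedom, a statistic of roughly 9.49 corresponds to P = 0.05, which is the kind of cut-off a significance line on an LRT plot would represent.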

2.3.1 Coevolution of phosphorylation and docking sites in Cbk1 targets

Using a dataset of 10 Cbk1 substrates (known and predicted) from a recent study (Gógl et al. 2015), I found the number of Cbk1 phosphorylation and docking motifs occurring in IDRs for each S. cerevisiae substrate and their respective yeast orthologs, from YGOB (Byrne and Wolfe 2005) and the Fungal Repository (Wapinski et al. 2007) (see Methods). Subsequently, each ortholog sequence was compared to the Saccharomyces cerevisiae sequence by BLAST (Altschul et al. 1997) to confirm the ortholog assignment. To avoid the potential error in using predicted IDRs (as demonstrated in the Clb5 model; see Appendix Figure 3 for the coevolution analysis of Ace2 comparing two disorder predictors, DISOPRED3 and PrDOS), I manually defined the IDRs as rapidly evolving regions based on the multiple sequence alignment (MSA) of ortholog sequences. Using the MSA to identify conserved regions likely to be protein domains, in combination with predicted protein domains (SUPERFAMILY, Gough et al. 2001, Oates et al. 2013) to check for agreement, I defined the surrounding rapidly evolving regions as the regions where I expect regulatory motifs to occur. Based on the occurrence of motifs in rapidly evolving regions, I converted the trait values to binary traits, either the absence or presence of the motif (for both phosphorylation sites and docking motifs).
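The motif search and binary conversion described above can be sketched as follows. This is an illustration only: the regular expressions are my transcription of the two consensus patterns (X written as "."), and the toy sequence, the region boundaries and the function names are invented for the example, not taken from the analysis.

```python
import re

# Consensus motifs as regular expressions; X (any residue) becomes ".".
# Wrapping each pattern in a lookahead lets overlapping matches be counted.
PHOSPHO_SITE = re.compile(r"(?=(H.R..[ST]))")
DOCKING_MOTIF = re.compile(r"(?=(Y.FP))")


def count_motifs_in_regions(sequence, pattern, regions):
    """Count motif matches lying entirely within any (start, end) region.

    Regions are 0-based, end-exclusive intervals on the ungapped sequence.
    """
    count = 0
    for match in pattern.finditer(sequence):
        start = match.start(1)
        end = start + len(match.group(1))
        if any(r_start <= start and end <= r_end for r_start, r_end in regions):
            count += 1
    return count


def to_binary_trait(count):
    """Presence (1) or absence (0) of a motif."""
    return 1 if count > 0 else 0


# Toy sequence and rapidly evolving region, invented for illustration.
toy_sequence = "MAHARSSSGGYAFPLLLLHKRAATKK"
toy_regions = [(0, 16)]
traits = {
    "phospho": to_binary_trait(
        count_motifs_in_regions(toy_sequence, PHOSPHO_SITE, toy_regions)),
    "docking": to_binary_trait(
        count_motifs_in_regions(toy_sequence, DOCKING_MOTIF, toy_regions)),
}
print(traits)
```

In the toy sequence a second phosphorylation consensus match falls outside the defined region and is therefore not counted, which mirrors the restriction of the search to rapidly evolving regions.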

I first focused on the two known Cbk1 substrates, Ace2 and Ssd1. Ace2, a transcription factor involved in activating expression of septum destroying hydrolases (Doolin et al. 2001), contains a conserved zinc finger DNA binding domain at the C-terminus, with the occurrence of both motifs in the N-terminus (Figure 10A). Ace2 contained trait values that appeared binary (either having one or none), where the traits were represented by presence/absence of motifs (Figure 10B). Ace2 showed correlated evolution, indicating the Cbk1 phosphorylation site and docking motif coevolved (P < 0.03, LRT, with 4 degrees of freedom) (Figure 12). To compare the signal I observed in Ace2 to a neutral expectation, I simulated molecular evolution in Ace2 that explicitly accounts for the rapid evolution of IDRs (Nguyen Ba et al. 2012, Nguyen Ba et al. 2014) (see Methods). Filtering out the conserved region in the simulated Ace2 sequence, and performing the same correlated evolution analysis, showed no signal for coevolution of Cbk1 phosphorylation and docking sites (simulated Ace2 - P = 0.596 vs. real Ace2 - P = 0.023, LRT, with 4 degrees of freedom, Appendix Figure 4).

Figure 10. Analysis of Ace2 shows evidence of correlated evolution for Cbk1 phosphorylation sites and docking motifs. A) Schematic of Ace2 and ortholog sequences. The conserved zinc-finger DNA binding domain is shown in black rectangles, Cbk1 phosphorylation sites are shown in black vertical lines, and Cbk1 docking sites are shown in grey vertical lines. Sequence schematic depicted is to scale, with 100 amino acids shown at the bottom for reference. B) Phylogenetic tree inferred from multiple sequence alignment (MSA) of Ace2 protein sequence with 30 fungal orthologs. Binary trait values for each species are indicated in the table (right of the phylogenetic tree), where a 1 represents presence of motif and a 0 represents absence of motif. Cbk1 phosphorylation trait values are based on inclusion of the known site in Ace2 and the corresponding site based on the MSA for orthologous species.

In comparison, Ssd1, an RNA-binding protein that regulates translation of polar growth factors (Kurischko et al. 2011), shows a much different distribution of motifs (Figure 11A, B). Given that all ortholog sequences contained multiple occurrences of the Cbk1 phosphorylation site, I decided to use a threshold inferred from the motif distributions to make this trait binary (phosphorylation sites >= 4 is set to 1, phosphorylation sites < 4 is set to 0; Figure 11A, B). Analysis of the binary converted traits (Figure 11C) for Ssd1 showed no evidence for correlated evolution of the Cbk1 phosphorylation and docking site (P < 0.26, LRT, with 4 degrees of freedom) (Figure 12).
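The threshold conversion described above amounts to a simple rule per species. A minimal sketch, with hypothetical species names and counts (only the threshold of 4 comes from the text):

```python
def binarize_counts(counts_by_species, threshold):
    """Map each species' motif count to 1 (>= threshold) or 0 (< threshold)."""
    return {species: int(count >= threshold)
            for species, count in counts_by_species.items()}


# Hypothetical phosphorylation site counts, for illustration only.
toy_counts = {"species_A": 6, "species_B": 4, "species_C": 3, "species_D": 1}
binary = binarize_counts(toy_counts, threshold=4)
print(binary)
```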


Figure 11. Analysis of Ssd1 shows no evidence of correlated evolution for Cbk1 phosphorylation sites and docking motifs. A) Distribution of Cbk1 phosphorylation sites shown on the bar plot on the left in black, and Cbk1 docking sites shown on the bar plot on the right in grey. Red dashed lines indicate the threshold value used to make the trait binary. B) Schematic of Ssd1 and ortholog sequences. Conserved regions are shown in black rectangles, Cbk1 phosphorylation sites are shown in black vertical lines, and Cbk1 docking sites are shown in grey vertical lines. Sequence schematic depicted is to scale, with 100 amino acids shown at the bottom for reference. C) Phylogenetic tree inferred from multiple sequence alignment (MSA) of Ssd1 protein sequence with 32 fungal orthologs. Binary trait values for each species are indicated in the table (right of the phylogenetic tree), where a 1 represents occurrence of motif greater than a threshold value and a 0 represents motif occurrence less than threshold value. Cbk1 phosphorylation trait values were determined based on phosphorylation site threshold inferred from the distribution (shown in A).

Similarly, for the 8 remaining predicted Cbk1 substrates, Cbk1 phosphorylation and docking site thresholds were determined based on their respective distributions (Appendix Figure 5) after assessing all ortholog assignments (Appendix Table 4). One of the predicted substrates, Dsf2, showed the strongest overall evidence for correlated evolution of Cbk1 phosphorylation and docking sites (P < 0.004, LRT, with 4 degrees of freedom) (Figure 12). This result suggests the acquisition of the second phosphorylation site coevolved with the acquisition of the docking site.

Figure 12. Two of the ten known or predicted Cbk1 substrates show evidence of correlated evolution for Cbk1 phosphorylation sites and docking motifs. The barplots represent the log LRT scores comparing Cbk1 phosphorylation site and docking motif trait data fit to a model of dependent evolution versus independent evolution. Known substrates are indicated in parentheses. The known Cbk1 substrate Ace2 and predicted substrate Dsf2 show significant LRT scores. The red horizontal line indicates an LRT score that is significant at a P = 0.05, with four degrees of freedom, based on the difference in parameters of the dependent and independent model. The table below shows the threshold used to make the traits (number of phosphorylation and docking sites) binary.

From the coevolution analysis of both known and predicted Cbk1 targets, Ace2 and Dsf2 showed evidence of correlated evolution for Cbk1 phosphorylation and docking sites. Since only 2 of 10 targets tested showed evidence for correlated evolution from the comparative phylogenetic analysis, this suggests Cbk1 phosphorylation and docking sites did not coevolve in all Cbk1 targets.

3. DISCUSSION

3.1 Quantifying the effect of pSLiM deletions

Deletions of predicted SLiMs showed a broad range of quantitative effects on signalling output, which may be indicative of the varying strength of individual binding interactions that SLiMs mediate. Quantifying the effects of weak individual binding interactions can be challenging, given that these small effects must be discerned from experimental noise such as day-to-day variation. Having both between and within experimental replicates is important, since it allowed us to differentiate between SLiMs with weak effects on signal output and day-to-day variation (Figure 4).

The fact that we could only detect quantitative effects for three of the four known SLiM deletions we tested suggests i) our assay cannot detect all effects of SLiM deletions and ii) further testing of SLiM deletions under different conditions or reporter assays is needed. For example, the Bem1 binding site in Ste20 has been functionally linked to the yeast-mating pathway (Tataku et al. 2010). Although the deletion of this SLiM showed no significant effect on HOG pathway output, it is likely there would be a detectable effect on the yeast-mating pathway. Despite this, the results from the three known SLiM deletions that showed an effect on signalling output suggest our single cell reporter assay is sensitive enough to detect small quantitative effects. This is most evident in the case of the Ste50 binding site in Opy2. This SLiM is one of three motifs known to cooperatively bind in order to facilitate the complete Ste50-Opy2 interaction (Yamamoto et al. 2010).

Interestingly, quantifying deletions of predicted SLiMs also revealed bimodal pathway responses, for both the known short motif in Hot1 (Alepuz et al. 2003; Burns and Wente 2014) and the novel short motif in Sch9. Both Hot1 and Sch9 are involved in directly regulating several transcription factors, suggesting that SLiMs may be important interaction sites that mediate bimodal responses in signalling pathways. To our knowledge, our results for the effects of mutations in short motifs in Sch9 and Hot1 are the first demonstrations that short motifs in IDRs can modulate the bimodal response of a signaling pathway (Pelet et al. 2011; Neuert et al. 2013).

Given that 9 of 29 (~30%) SLiM deletions showed a quantitative effect on pathway output, and if we conservatively assume that the 10 SLiMs we did not test in this study would not have shown any effects, this sets an estimated lower bound on the fraction of functional SLiMs. If we extrapolate the finding that 6 of 14 (~40%) unknown predicted SLiM deletions showed effects on signalling output in the HOG pathway to potential novel SLiMs involved in signalling that have yet to be discovered, an estimated 40% of predicted SLiMs could be functional. Taken together, this gives an initial estimate of approximately 30-40% of predicted SLiMs being functional. Although this is an approximation, we believe it is likely an underestimate of the fraction of SLiMs that could be functional. Since only one concentration of salt stress was explored for the HOG pathway, it is possible that some of the motifs we tested that showed no change in signal pathway output may show detectable effects at other levels of salt stress. In addition, the HOG pathway shares components with multiple other pathways, such as the mating pathway, so some of the motifs that showed no change in pathway output may show detectable effects under different stresses or assays. Taken together, this suggests that there are still many short motifs to be discovered in well studied signaling pathways (Tompa et al. 2014) and that comparative proteomics methods will be powerful tools to focus discovery efforts (Nguyen Ba et al. 2012; Davey et al. 2012b; Beltrao and Serrano 2005; Budovskaya et al. 2005).

3.2 Recognition of SLiM combinations for target specificity

The ability of the sequence pattern search to differentiate Clb5 specific substrates suggests that this model can be used to predict cyclin-specific targets of Cdk1. Given that 13 of 19 targets contained the combination of motifs in the proposed order, in addition to the three targets that were missed as false negatives (due to error in the disorder predictor), the enrichment analysis is likely an underestimate of the true enrichment of this sequence pattern in Clb5 targets. The remaining three substrates (Bfr1, Cdc19, and Dpb2) that did not contain the Clb5 sequence pattern showed a difference in distribution of phosphorylation and docking sites compared to the other 16 substrates (P = 0.013 for phosphorylation sites, P = 0.009 for docking sites, t-test, Appendix Figure 2), suggesting they are not multi-phosphorylation site targets. Upon closer inspection, despite Bfr1 being phosphorylated both in vitro and in vivo (Li et al. 2014), it only contains one Cdk1 phosphorylation site. In the same screen, Cdc19 was shown to be phosphorylated in vitro but not in vivo (Li et al. 2014). Although Cdc19 was also demonstrated to interact with Clb5-Cdk1 in a protein-fragment complementation assay (PCA) (Ear et al. 2013), the authors ruled out the interaction as a false positive. It is possible Cdk1-Clb5 recognizes these three substrates through a different mechanism or interaction, given that they are not multi-phosphorylation site targets. Recently it was reported that aside from the well-known Clb5 docking motif, the N-terminus of Clb5 also contributes to target specificity (DeCesare and Stuart, 2012). In addition to the underestimated enrichment of the Clb5 sequence pattern due to false negatives and noise in the Clb5 specific dataset, 12 of the 63 non-Clb5 specific targets that contained the Clb5 sequence pattern showed evidence in the literature for Clb5-specific phosphorylation.

This illustrates two of the limitations that I encountered during this study: the use of KID to define my datasets and the error in disorder predictors. Because of how I defined my Cdk1 dataset, it misses substrates that have been reported recently in the literature. Although the use of the KID to define a list of high confidence likely-Cdk1 targets limits the possibility of including false positive Cdk1 substrates, it does exclude some true Cdk1 targets that may also be Clb5-specific. For example, Fun30 (Chen et al. 2016), Ase1 (Khmelinskii et al. 2009), and Bir1 (Widlund et al. 2006) are all likely Cdk1 targets with evidence for Clb5-specific phosphorylation that did not pass the KID threshold that I used to define Cdk1 targets. This limitation is a general challenge of manual curation for databases, which can contain errors (for example, the entry for Cdc19, a false positive, still lists the PCA study as evidence for Cdk1 interaction). This illustrates the importance of careful validation of targets when defining substrate datasets based on databases.

The biggest technical limitation that I encountered during this study is the use of the disorder predictor for mapping motifs to IDRs. This is evident in the false negatives that I found in my Clb5 specific dataset (Fin1, Spc110, Cnn1), and in additional substrates in my other datasets with evidence in the literature (for example, Spc42 in the non-Clb5 dataset and Fun30 in the non-Cdk1 dataset). Similarly, the error in the disorder predictor could explain the lack of signal in the conservation analysis of the Clb5 sequence pattern, due to missing true occurrences of the Cks1, Cdk1 and Clb5 motifs. Although I have only used one disorder predictor, DISOPRED3, I believe this global analysis has shown the inherent error associated with using disorder predictors, and bioinformatics predictors in general. Although one possible explanation could be that the error I see is specific to DISOPRED3, this is unlikely as all disorder predictors have error (CASP, Monastyrskyy et al. 2014), with no perfect predictor available. In addition, DISOPRED3, aside from PrDOS, scored the highest overall across all categories tested in the most recent benchmarking of disorder predictors (CASP, Monastyrskyy et al. 2014). PrDOS is also susceptible to the same error, as shown in my coevolution analysis of Cbk1 phosphorylation and docking sites in Ace2 (see below). This suggests that the error in the disorder predictions I observed was minimal compared to what other competitive predictors would give. It also highlights the need for a better disorder predictor, but more importantly underscores the caution that needs to be taken when using predictors for global analysis.

Aside from the error that is inherent to disorder predictors, there was also error in my method for defining motif matches in IDRs. My method of pattern matching was conservative, given that I required the entire motif to be within the boundaries of the predicted IDRs. I can address this limitation by relaxing the stringency of the motif pattern match. For example, I can allow one amino acid to occur in ordered regions in order to relax any mismatches that may occur near the boundaries of the predicted IDRs.
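The relaxed matching rule proposed above can be sketched as a tolerance on how many motif residues may fall outside the predicted IDR. The function name, the coordinate convention and the example coordinates are assumptions made for illustration; setting `max_outside=0` corresponds to the conservative rule that requires the entire motif to lie within the IDR.

```python
def motif_within_idr(motif_start, motif_end, idr_start, idr_end, max_outside=0):
    """Return True if at most `max_outside` motif residues fall outside the IDR.

    All coordinates are 0-based and end-exclusive.
    """
    overlap = max(0, min(motif_end, idr_end) - max(motif_start, idr_start))
    residues_outside = (motif_end - motif_start) - overlap
    return overlap > 0 and residues_outside <= max_outside


# A six-residue motif whose last residue lies just past a predicted IDR boundary.
strict = motif_within_idr(95, 101, 40, 100, max_outside=0)
relaxed = motif_within_idr(95, 101, 40, 100, max_outside=1)
print(strict, relaxed)
```

The strict call rejects the boundary-straddling motif while the relaxed call accepts it, which is the situation described for motif residues falling just outside the predicted bounds.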

Lastly, I encountered the biological complexity of the Cdk1 network to be a challenge. From this analysis and previous studies (Kõivomägi et al. 2011a, Archambault et al. 2005, Loog and Morgan 2005), it is apparent that there are many Cdk1 targets that are active during a number of the stages in the cell cycle, and therefore are recognized by the G1, S, and M phase cyclins (Cln2/3, Clb5/6 and Clb2). Although the combination and order of these motifs shows evidence for Clb5-specificity, it does not allow for distinction from G1 and M phase targets. This illustrates a violation of the assumption I made when defining my datasets, where I defined a Cdk1 target as either a Clb5 specific or a non-Clb5 specific target. In order to address this limitation, I can split Clb5 specific targets into sets only specific to the S cyclin, shared with the G1 cyclin, and shared with the M cyclin, using co-occurrence of specific motifs that are recognized by these cyclins. For example, it has recently been proposed that the G1 cyclin Cln2 has an analogous system of Cks1/Cdk1/Cln2 docking site (LP motif) recognition (Bhaduri et al. 2011, Kõivomägi et al. 2011b). Similarly for the M cyclin Clb2, one of the current models of how Cdk1 recognizes its substrates is based on decreasing specificity for its target and increasing affinity through the presence of the full Cdk1 consensus sites found in later M phase targets (Kõivomägi et al. 2011b).

Given that Cdk1 phosphorylation regulates localization and stability of its targets, substrates show cell cycle regulated patterns of expression at the transcription and protein level. With the availability of full genome/proteome data from microarray (DeRisi et al. 1997, Spellman et al. 1998, Eisen et al. 1998) and GFP imaging experiments (Huh et al. 2003, Handfield et al. 2013, Koh et al. 2015), the integration of expression data would extend this analysis beyond the information encoded at the sequence level. It is important to note, however, that this metadata approach is not straightforward: there is no clear expectation of when a cyclin specific target will be expressed, aside from the restriction that it needs to be expressed at the same time that the cyclin is expressed. This makes it challenging to separate Cdk1-cyclin specific targets, given that the cyclins are expressed for more than one stage (Clb5 is expressed from the end of G1 throughout S phase, decreases in G2 and has a second peak in expression during M phase) (Schwob and Nasmyth 1993, Tyers et al. 1993, Mathiasen and Lisby 2014).

3.3 Detecting coevolution of SLiMs

The finding that two of ten substrates showed evidence for coevolution of Cbk1 phosphorylation and docking sites suggests this is a possible evolutionary mechanism for how motifs fix over evolution in rapidly evolving IDRs. Although only two of ten substrates gave a signal for coevolution, it is important to note that this is limited by the implementation of a discrete model of correlated evolution that requires binary traits. With the exception of Ace2, the rest of the Cbk1 substrates do not have binary traits, as demonstrated by the necessity for manual thresholds (Figure 12). This limitation could be addressed by analyzing these targets using a continuous model of correlated evolution. I am interested in exploring coevolution of motifs in a continuous correlated evolution framework, for which software already exists (Pagel 1997, 1999).

I encountered two major limitations during the course of this study. The first, similar to the Clb5 study, was the error in the predicted IDRs. Analysis of Ace2 using the two disorder predictors (DISOPRED3 and PrDOS) that scored the best overall in a recent benchmarking of disorder predictors (CASP, Monastyrskyy et al. 2014) showed no evidence for correlated evolution. Mapping of the Cbk1 phosphorylation site to the predicted IDRs using DISOPRED3 showed the last residue (the experimentally verified Serine that gets phosphorylated by Cbk1 (Mazanka et al. 2008)) was outside the predicted bounds (Appendix Figure 3A). Similarly, the second predictor, PrDOS, mapped the Cbk1 docking site to partly outside the predicted bounds (Appendix Figure 3A). This again highlights the error associated with using disorder predictors. In this case, use of two independent disorder predictors gave different errors in missing motifs that both resulted in failure to detect coevolution (Ace2 PrDOS - P = 0.402 vs. Ace2 DISOPRED3 - P = 0.249, LRT, with 4 degrees of freedom, Appendix Figure 3B). The error in the disorder predictions forced me to manually define these regions based on alignment of the ortholog sequences. This was reasonable for the small set of known and predicted substrates (N = 10) involved in this study, but it is not a feasible method for investigating kinase models with larger substrate lists. This limitation also does not allow for a genome-wide search for potential candidate substrates outside of the set I tested from the recent Cbk1 study (Gógl et al. 2015).

Secondly, assessing data quality in order to filter out any noise in the ortholog assignment was a challenge. Inclusion of incorrect orthologs due to errors in ortholog assignment (likely from poor sequence quality from sequencing errors) led to cases with both a larger (Ace2) and a smaller (Dsf2) signal for coevolution (Appendix Figure 6). Given that detecting coevolution depends on observing changes that occurred during evolution, there is a balance between including enough orthologs to capture changes in traits over evolution and avoiding orthologs with poor sequence quality that introduce noise, in order to assess whether coevolution is occurring. Analogous to the limitation in using disorder predictors, I addressed this issue by manually assessing each ortholog assignment to confirm its inclusion in the coevolution analysis. Although this allows for careful consideration of data quality, manually assessing each ortholog assignment may not be feasible for coevolution analysis of kinase models with much larger substrate lists. This limitation not only highlights the sensitivity of this coevolutionary analysis to ortholog assignments, but also the difficulties associated with ortholog assignment. Although it is important to validate the ortholog assignment for comparative phylogenetic analysis, perhaps a periodic re-assessment and update of ortholog assignments is needed for ortholog databases.

4. FUTURE DIRECTIONS

4.1 Exploring pSLiM deletions in other signalling pathways

Although I have quantified the effects of pSLiM deletions in the HOG pathway, it would be interesting to apply an analogous single cell reporter assay framework to a pathway that responds to a different inducer. For example, the yeast-mating pathway could be tested with the addition of pheromone, to see whether there are similar bimodal responses in downstream components that directly regulate TFs. In this case the analogous components would be downstream of Fus3, the MAPK of the yeast-mating pathway.

Additionally, it would be interesting to test other downstream components in the HOG signalling pathway that were not included in this initial study. Testing multiple pSLiM deletions in addition to single deletions would also be useful, as many SLiMs act as weak interaction sites individually but show cooperative binding with other motifs (Pierce et al. 2015). In the multiple deletion framework, a non-additive effect on pathway output would be indicative of cooperative binding; this could be tested using a well-known cooperative binding model, such as the Bem1 binding site in Ste20. Lastly, in order to better understand the mechanism behind the bimodal response, it would be of interest to characterize the binding interactions of Sch9.

4.2 Confirming Clb5 specific target predictions

Similar strategies that have been recently employed to establish in vivo and in vitro cyclin specific phosphorylation can be applied to test the two substrates from the non-Clb5 dataset, Bud4 and Fpk4, that contained the Cks1, Cdk1 and Clb5 motifs in the proposed order (Chen et al. 2016, Molodtsov 2016 Cell). Mass-spectrometry (MS) can be used to confirm the Cks1 and Cdk1 motifs are phosphorylated in vivo. Arresting cells in different stages of the cell cycle, followed by MS, will indicate cell cycle specific phosphorylation. In parallel, truncated mutants where the region containing the Cks1, Cdk1 and Clb5 motifs is removed can assess the necessity of this peptide region for Clb5 specific recognition. In vitro kinase assays of the truncated mutants with the purified Cdk1-Clb5 complex will assess the Clb5-specific phosphorylation. Lastly, using single Alanine substitutions at key residues for each of the motifs individually, followed by in vitro Clb-Cdk1 kinase assays, will assess the necessity of each of these motifs for Clb5-specific phosphorylation. In parallel, combinations of Alanine substitutions in the three motifs, followed by measuring the rate of phosphorylation by Clb5-Cdk1, may indicate how these motifs cooperate for Clb5 specific phosphorylation.

4.3 Extending the coevolution analysis to another model

Although coevolution of motifs has been demonstrated in at least two substrates for the Cbk1 model (a number limited by the Discrete correlated evolution approach), it would be interesting to see whether there are other examples of motif coevolution. For example, a recent study suggests docking motifs are used not only by kinases but also by their respective phosphatase counterparts for substrate specificity recognition, based on the observation that the kinase/phosphatase pairs recognize the same substrates over evolution. This suggests there is an added constraint for these motifs to be conserved over evolution (Goldman et al. 2014, Kõivomägi and Skotheim 2014). In this study, the yeast-mating pathway MAPK Fus3 and Calcineurin, a Ca2+/Calmodulin regulated phosphatase, recognize the same docking motif. Coevolution analysis of the shared docking motif and phosphorylation site of Fus3 and Calcineurin may provide evidence for how these motifs were gained over evolution in the shared targets. If these motifs coevolved, applying this coevolution analysis genome-wide may also identify potential novel targets in the Fus3/Calcineurin network.

Extending the coevolution analysis of motifs beyond two models (Cbk1 and Fus3/Calcineurin) and beyond phosphorylation and docking sites, a systematic approach can be taken to assess the generality of motif coevolution. The switches.ELM database (Van Roey et al. 2013) contains literature curated motifs that act as molecular switches. Coevolution analysis can be applied in an unbiased approach, by taking all of the yeast motifs in the switches.ELM database (to minimize errors in ortholog assignment that are more prevalent in higher eukaryotes), and searching for multiple motifs that occur in the same target. This approach may indicate how often motifs coevolve together on a proteome-wide scale.

5. CONCLUSION

In conclusion, I have shown that deletion of pSLiMs has a broad range of quantitative effects on signalling and, based on our study of the HOG pathway, we estimate 40% of pSLiMs could be functional. In addition, I have shown that combinations of SLiMs, specifically phosphorylation and docking sites, and their sequential N-C terminal order can predict Clb5 target specificity. I have identified a number of Cdk1 targets as potential Clb5 specific candidates that can be experimentally verified. Lastly, using comparative phylogenetic methods, I have found 2 cases of phosphorylation and docking sites showing correlated evolution, which suggests these SLiMs coevolved. To our knowledge, this is the first demonstration of coevolution of SLiMs.

6. Materials and Methods

6.1 pSLiM deletion analysis

6.1.1 Yeast strains and mutagenesis

HOG pathway flow cytometry assays were all performed in BY4741 (S288C-derivative strain: MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0) lacking the Sln1-dependent branch of the osmoresponse pathway (ssk2Δ, ssk22Δ) (Schaber et al. 2012; Hayashi and Maeda 2006). The GFP-based reporter for the flow cytometry assay contained 800 bp of the native Stl1 promoter controlling expression of the yeast-enhanced yEGFP (Cormack et al. 1997), integrated into the HO locus (SSK2 Δ::HisMX3 SSK22 Δ::0; HO Δ::STL1pr-GFP,LEU2). All subsequent deletions/mutations of the genomic sequence were performed using the Delitto Perfetto method for seamless mutagenesis of the yeast genome (Storici et al. 2006). The positive control strain for the flow cytometry assay was created by deleting the conserved Sho1 SH3 domain-binding site (Δ91-99) in Pbs2 (Maeda et al. 1995). A secondary, more conservative control strain was created using double point mutations (P96A and P99A; Maeda et al. 1995) of residues essential to the proline-rich SH3 domain-binding site (PXXP).

6.1.2 Pathway induction and flow cytometry assay

HOG pathway induction experiments were performed as follows: Replicate overnight cultures inoculated from single colonies were diluted 10-fold into the appropriate selective media and grown to mid-log phase (4 hours at 30˚C). Uninduced samples were taken at this point, and cells were pelleted and resuspended in an equivalent volume of fixative (1X PBS, pH 7.0; 0.5% paraformaldehyde). Cultures were then induced by addition of NaCl to a final concentration of 0.4M and incubated at 30˚C for an additional 60 minutes. Induced samples were then taken and treated as above. All fixed samples were stored at 4˚C pending flow cytometry analysis (within 48 hours).

Reporter activation in fixed cells from all induction experiments was subsequently measured by flow cytometry. Flow cytometry analysis was performed on a BD FACSAria IIU High Speed Cell Sorter, incorporating three air-cooled lasers at 488, 633, and 407 nm wavelengths, and equipped with BD FACSDiva™ software (v. 6.1.3). The instrument was calibrated prior to each experiment using a blend of size-calibrated fluorescent beads. The GFP fluorescence was excited at 488 nm and collected through a 530/30 nm bandpass filter. All samples were further diluted by a factor of 10, and a total of 50,000 events (cells) were counted for every sample within gated SSC and FSC populations that excluded doublets and small debris. Typical acquisition rates were 700-900 events/second. Initial flow cytometry data was analyzed using FlowJo software (v 10, OSX).

6.1.3 Histogram data analysis

Analysis of the flow cytometry data was performed using in-house Perl scripts. Each experimental dataset contained a wild-type reporter strain, a PBS2Δ91-100 mutant strain (a known mutation that inactivates the Hog1 signalling pathway) (Maeda et al. 1995), and novel pSLiM mutant strains. GFP intensity values were measured for uninduced and induced samples for all strains, from replicate experiments. For each replicate, the highest 0.1% of GFP intensity values were excluded, as these were likely FACS artefacts. All data were log-transformed; cells with a GFP intensity value <= 0, which cannot be log-transformed, were also excluded. All experiments were further trimmed to have an equal cell count (that of the experiment with the smallest cell count, 49,216 cells), and histograms with 100 bins were created for each replicate, with the bin range determined by the overall minimum and maximum log GFP intensity values from all experiments. I repeated the data analysis using 50 or 200 bins, and found similar results.
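The filtering and binning steps described above can be sketched as follows. This is a minimal illustration in Python rather than the in-house Perl scripts; the function names, the use of the natural logarithm, and the choice to keep the first `n_cells` values when equalizing cell counts are assumptions, since the text does not specify them.

```python
import math

def preprocess(intensities, n_cells, top_fraction=0.001):
    """Return log-transformed GFP values after the three filtering steps."""
    # 1. Exclude the highest 0.1% of intensity values (likely artefacts).
    n_drop = int(len(intensities) * top_fraction)
    by_value = sorted(range(len(intensities)), key=lambda i: intensities[i])
    dropped = set(by_value[len(by_value) - n_drop:])
    kept = [v for i, v in enumerate(intensities) if i not in dropped]
    # 2. Log-transform; values <= 0 have no logarithm and are discarded.
    logged = [math.log(v) for v in kept if v > 0]
    # 3. Trim to a common cell count (assumption: keep the first n_cells).
    return logged[:n_cells]

def histogram(values, lo, hi, n_bins=100):
    """Count values in n_bins equal-width bins spanning [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that v == hi falls in the last bin.
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    return counts
```

Passing the same `lo` and `hi` (the overall minimum and maximum across all experiments) to every call keeps the bins aligned between samples, which the bin-by-bin subtraction in the next section relies on.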

6.1.4 Calculating fraction of “on” cells and change in mean GFP intensity of “on” cells

I applied the Overton subtraction method (Overton 1988) to perform histogram subtractions of uninduced samples from induced samples to determine the fraction of cells that respond to salt stress (“on” cells). I calculated the difference in cell count between the induced and uninduced samples in each bin of the histogram, and summed all positive differences to calculate the fraction of “on” cells. Earlier positive cell counts and later negative cell counts from the histogram subtraction are removed from the fraction of “on” cells by calculating the total sum of differences over all bins and including only positive differences where the total sum of differences is positive.
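The core bin-by-bin subtraction can be illustrated with a short sketch. It implements only the first step described above (summing the positive induced-minus-uninduced differences); the additional correction based on the total sum of differences is deliberately left out, and the function name and the normalization by the induced cell count are assumptions made for illustration.

```python
def fraction_on(uninduced_counts, induced_counts):
    """Per-bin 'on' cell counts and the overall fraction of 'on' cells.

    Both histograms must use identical bins and, as in the text,
    contain the same total number of cells.
    """
    if len(uninduced_counts) != len(induced_counts):
        raise ValueError("histograms must share the same bins")
    total = sum(induced_counts)
    if total == 0:
        raise ValueError("induced histogram is empty")
    # Difference in cell count for each bin (induced minus uninduced).
    diffs = [ind - un for un, ind in zip(uninduced_counts, induced_counts)]
    # Only bins where the induced sample gained cells contribute.
    on_per_bin = [d if d > 0 else 0 for d in diffs]
    return on_per_bin, sum(on_per_bin) / total
```

For two five-bin toy histograms, `fraction_on([50, 30, 20, 0, 0], [20, 20, 20, 25, 15])` assigns 25 and 15 "on" cells to the two brightest bins.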

For each replicate the change in mean GFP intensity was calculated by comparing the mean GFP intensity of the uninduced sample to the mean GFP intensity of the “on” cells. The histogram subtraction calculates the number of “on” cells that have a GFP intensity within each bin, but the corresponding GFP intensity of each individual “on” cell is not known. I therefore estimated the mean GFP intensity of the “on” cells by approximating each cell to have a GFP intensity equal to its corresponding bin median, as given by the equation below:

$$\mu_{on} = \frac{\sum_{i=1}^{n} (c_i \times m_i)}{\sum_{i=1}^{n} c_i}$$

$\mu_{on}$ is the mean GFP intensity of “on” cells, where $i$ is a bin containing “on” cells, with the number of “on” cells, $c_i$, and the bin median, $m_i$, for each bin $i$, and where $n$ is the total number of bins with “on” cells.
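The count-weighted mean defined here can be computed directly from the per-bin "on" cell counts. The sketch below is illustrative only; it assumes that the "bin median" is the midpoint of each equal-width bin, and the function and argument names are hypothetical.

```python
def mean_on_intensity(on_counts, bin_edges):
    """Mean log GFP intensity of 'on' cells, weighting each bin by its count.

    on_counts: number of 'on' cells per bin.
    bin_edges: bin boundaries, one more entry than on_counts.
    """
    if len(bin_edges) != len(on_counts) + 1:
        raise ValueError("need exactly one more bin edge than bins")
    total = sum(on_counts)
    if total == 0:
        raise ValueError("no 'on' cells to average")
    weighted = 0.0
    for i, count in enumerate(on_counts):
        # Assumption: the bin median is taken as the bin midpoint.
        midpoint = (bin_edges[i] + bin_edges[i + 1]) / 2.0
        weighted += count * midpoint
    return weighted / total
```

Bins without "on" cells have a count of zero and therefore drop out of both sums, so restricting the sums to bins that contain "on" cells gives the same value.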

6.1.5 Statistical analysis of mutant strains

For each strain, t-tests were used to compare both the fraction of “on” cells and the change in mean GFP intensity of “on” cells to wild-type. Before testing, each replicate was wild-type centered using the mean of the wild-type replicates performed in the same experiment. I used a false discovery rate of 0.01 to determine the p-value threshold for mutations significantly different from wild-type.

Only the mean change in GFP fluorescence for each small or no effect mutation was compared to the wild-type (given that all of these strains were classified as small or no effect with <10% difference in fraction of “on” cells from wild-type).

6.2 Clb5-Cdk1 motif combination analysis

6.2.1 Defining Cdk1 (Clb5-specific and non Clb5-specific) and non-Cdk1 protein datasets

To test whether the spacer lengths between, and combinations of, Cks1-Cdk1-Clb5 motifs could discriminate Clb5-specific substrates, I sought to define a “positive” set of Clb5-specific substrates and a “negative” set of non-Clb5-specific substrates that could be used for comparison. I first defined the Cdk1 dataset using the Kinase Interaction Database (KID), a literature-curated list of kinase-substrate interactions in which proteins are scored based on both high-throughput (HTP) and low-throughput (LTP) experimental evidence (Sharifpoor et al. 2011). A threshold score of 4.52 (equivalent to a P-value of 0.05) was used as the cut-off, yielding 184 proteins in the “Cdk1 dataset”.

Next I defined a “Clb5-specific dataset” based on two cyclin specificity studies: an in vitro kinase assay comparing Clb5- and Clb2-specific phosphorylation (Loog and Morgan 2005) and a combined in vitro/in vivo mass spectrometry study (Li et al. 2014). Using the proteins identified in these two experiments, I split the “Cdk1 dataset” into a “Clb5-specific dataset” of 19 proteins and a “non-Clb5-specific dataset” of 165 proteins.

The remaining 5,733 of the 5,917 non-dubious S. cerevisiae proteins in the Saccharomyces Genome Database (SGD) (SGD Project 2011), which did not pass the 4.52 KID threshold, made up the “non-Cdk1 dataset”.

6.2.2 Identifying occurrences of matches to combination of Cks1-Cdk1-Clb5 motifs

All S. cerevisiae sequences were obtained from the 5,917 non-dubious open reading frames (ORFs) from the Saccharomyces Genome Database (SGD Project 2011). The Cks1 phosphoadaptor motif ( [MPVLIFWYQ]-X-T*-P ), Cdk1 phosphorylation motif ( [ST]*-P ), and Clb5 docking motif ( [RK]-X-L-X{0,1}-[FYLIVMP] ) were represented by their consensus sequences, written as regular expressions (shown in parentheses). Regular expressions are machine-readable patterns that can be used to search a sequence and identify matches (Schneider 2002). Here the patterns are applied to protein sequences in which each amino acid is represented by its IUPAC one-letter code (X represents any amino acid), square brackets ( “[ ]” ) enclose the set of amino acids that can be matched at that position, and curly braces ( “{ }” ) give the number of times the preceding position may occur (for example, X{0,1} matches zero or one residue of any type). The regular expressions for the Cks1, Cdk1, and Clb5 motifs were obtained from the Eukaryotic Linear Motif (ELM) database (Dinkel et al. 2012). IDRs were predicted for each S. cerevisiae sequence using DISOPRED3 (Jones and Cozzetto 2015). The combination of motifs (represented by their regular expressions) was mapped to the IDRs of each protein sequence using Perl scripts.
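As an illustration of how such patterns can be mapped to disordered regions, the sketch below translates the three consensus motifs into Python regular expressions and reports matches that lie entirely within a set of IDR intervals. It is not the Perl code used in this work. Two assumptions are made: the asterisk in the motif notation is taken to mark the phosphorylated residue and is therefore dropped from the pattern, and IDRs are supplied as 0-based half-open (start, end) intervals. The function name and the test sequence are hypothetical.

```python
import re

# Consensus motifs rewritten in Python regex syntax ("X" becomes ".").
MOTIFS = {
    "Cks1": r"[MPVLIFWYQ].TP",
    "Cdk1": r"[ST]P",
    "Clb5": r"[RK].L.{0,1}[FYLIVMP]",
}

def find_motifs(sequence, idr_regions):
    """Return {motif name: [(start, end), ...]} for matches inside IDRs.

    idr_regions is a list of (start, end) intervals, 0-based, end-exclusive.
    """
    hits = {name: [] for name in MOTIFS}
    for name, pattern in MOTIFS.items():
        # A lookahead with a capture group reports overlapping matches,
        # one per start position.
        lookahead = "(?=(" + pattern + "))"
        for match in re.finditer(lookahead, sequence):
            start, end = match.start(1), match.end(1)
            # Keep the match only if it lies entirely within one IDR.
            if any(lo <= start and end <= hi for lo, hi in idr_regions):
                hits[name].append((start, end))
    return hits
```

For example, `find_motifs("AAPATPGGSPGGKALFGG", [(0, 11)])` would report the Cks1 and Cdk1 matches in the first eleven residues but omit the Clb5 docking match, which falls outside the supplied interval.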

6.2.3 Ortholog assignment and multiple sequence alignments of related yeast species

The multiple sequence alignments for the proteins used in calculating the fraction of orthologs in which the combination of motifs is conserved were obtained from a previous study (Nguyen Ba et al. 2014). The multiple sequence alignments are based on ortholog assignments for the 19 yeast species in the Yeast Gene Order Browser (YGOB) (Byrne and Wolfe 2005) and an additional six Candida species from the Fungal Orthogroup Repository (Wapinski et al. 2007).

6.2.4 Constructing an HMM model of the Cks1-Cdk1-Clb5 motif ordering

To test the ordering of the Cks1-Cdk1-Clb5 motifs in a probabilistic framework that accounts for variation in amino acid identity as well as sequence length, I used HMMER (HMMER version 3.0. 2010) to build a hidden Markov model (HMM). I manually aligned the 13 substrates from the Clb5-specific dataset that matched the ordering of the motifs, anchoring the alignment on the three motifs with random gaps inserted in between. I used the hmmbuild program from HMMER to construct a profile HMM from this alignment. An HMM differs from a regular expression in that a match to the motif pattern is no longer assessed as a binary outcome (a sequence either matches or does not match the pattern); instead, the probability of observing each amino acid at each position is modeled explicitly.

6.2.5 Parameter estimation of emission probabilities

hmmbuild must first define which columns in the alignment are match states. Model construction is performed using maximum a posteriori (MAP) estimation to choose which columns are marked as match states; columns that are not marked are treated as insert states. It is important to note that this method is a dynamic programming approach, and that at least two alternative methods exist: a manual approach, in which the user specifies which columns are match states, and a heuristic approach, in which match states are assigned by a rule (e.g. by default, columns in which >50% of positions are gaps could be defined as insert states).

Estimating 20 parameters (the probability of observing each amino acid) for each match state from an alignment of only 13 sequences requires pseudocounts to account for amino acids that are possible but unobserved. Rather than simply adding a count of 1 to each amino acid frequency (so that every amino acid is “observed” and receives a non-zero probability), hmmbuild uses prior information in the form of a mixture of Dirichlet distributions, as this has been shown to work well even for as few as 13 sequences.
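The effect of pseudocounts on a single match-state column can be illustrated with a simplified sketch. It uses a single Dirichlet prior rather than the mixture of Dirichlet distributions used by HMMER, so it is not HMMER's actual estimator; with the default pseudocount of 1 per amino acid (an assumption chosen for illustration) it reduces to the add-one scheme mentioned above. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probabilities(column, alpha=None):
    """Posterior-mean emission probabilities for one alignment column.

    column: string of residues in the column (gap characters are ignored).
    alpha:  dict of Dirichlet pseudocounts per amino acid; defaults to 1.0
            each, which is equivalent to add-one smoothing.
    """
    if alpha is None:
        alpha = {aa: 1.0 for aa in AMINO_ACIDS}
    # Keep only the 20 standard residues; gaps and other symbols are skipped.
    residues = [r for r in column.upper() if r in AMINO_ACIDS]
    counts = {aa: residues.count(aa) for aa in AMINO_ACIDS}
    denom = len(residues) + sum(alpha.values())
    # Posterior mean under a Dirichlet prior: (count + alpha) / (N + sum alpha).
    return {aa: (counts[aa] + alpha[aa]) / denom for aa in AMINO_ACIDS}
```

With only 13 observed residues, the 20 pseudocounts make up most of the denominator, which is why the choice of prior matters for small alignments.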

6.2.6 Comparison of substrates that match rules in each subset for enrichment analysis

Fisher’s exact test was used to compare the number of matches in the Clb5-specific dataset to the non-Clb5-specific and the non-Cdk1 datasets. Comparisons were made for matches to the simple regular expression model (see Figure 6) and the profile HMM model (see Figure 9).
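For a 2x2 table of matching and non-matching proteins in two datasets, a two-sided Fisher's exact test can be computed from the hypergeometric distribution as sketched below. This is a generic standard-library implementation, not the software used for the analysis, and the function name is hypothetical. The two-sided p-value is taken as the sum of the probabilities of all tables no more likely than the observed one, which is one common convention.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two datasets; columns are proteins that do and do not
    match the motif rule.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x matches in the first dataset.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables at most as probable as the observed one; the small
    # tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts for illustration: 13 of 19 proteins match in one
# dataset versus 10 of 165 in the other.
p_value = fisher_exact_two_sided(13, 6, 10, 155)
```

The same function applies to the comparison against the non-Cdk1 dataset by replacing the second row of the table.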

6.3 Cbk1 phosphorylation and docking site correlated evolution analysis

6.3.1 Defining Cbk1 substrate dataset

I obtained the list of ten substrates identified in Gógl et al. 2015, containing two known substrates, Ace2 and Ssd1. The remaining eight substrates were predicted to be likely Cbk1 targets based on the presence and conservation of Cbk1 phosphorylation and docking sites using a previously developed bioinformatics tool (Lai et al. 2012).

6.3.2 Ortholog assignment and multiple sequence alignments of related yeast species for the 10 known/likely Cbk1 substrates

Ortholog assignments and sequences were again obtained from a previous study (Nguyen Ba et al. 2014). In addition, sequences from seven further species (Lodderomyces elongisporus, Yarrowia lipolytica, Aspergillus nidulans, Neurospora crassa, Schizosaccharomyces japonicus, Schizosaccharomyces octosporus, and Schizosaccharomyces pombe) were included from the Fungal Orthogroup Repository to account for divergence of molecular traits occurring over a longer evolutionary distance. All sequences were obtained from YGOB (Byrne and Wolfe 2005) and the Fungal Orthogroup Repository (Wapinski et al. 2007). Where multiple orthologs of an S. cerevisiae protein were found in the same species, the sequence with the best BLAST score was selected (Altschul et al. 1997). For each Cbk1 substrate, sequences were aligned using MAFFT with default parameters (Katoh et al. 2002).

6.3.3 Constructing phylogenetic trees

Phylogenetic trees were constructed using PAML (Yang 2007) with default parameters, the WAG model of amino acid substitution (Whelan and Goldman 2001), and a starting topology obtained from the Fungal Orthogroup Repository, from which tips were dropped for any species missing for a given Cbk1 substrate.

6.3.4 Data filtering of ortholog assignment and manually defining rapidly evolving regions

Each ortholog sequence was individually compared to the S. cerevisiae sequence using BLAST (Altschul et al. 1997) in order to verify ortholog assignment. In addition, each sequence was manually assessed based on the multiple sequence alignment (MSA) in combination with domain predictions from SUPERFAMILY (Gough et al. 2001; Oates et al. 2013) to ensure the presence of conserved regions. The conserved regions were filtered out, and the surrounding regions were defined as the rapidly evolving regions in which I expect to find regulatory motifs.

6.3.5 Identifying Cbk1 phosphorylation and docking sites

Phosphorylation and docking sites were identified using regular expressions (as in the Cdk1 analysis) to map occurrences of the Cbk1 phosphorylation sites (H-X-K-X-X-[ST]) and docking sites ([YF]-X-F-P) to the manually defined rapidly evolving regions. Additionally, analysis of the known substrate Ace2 was also conducted using two independent disorder predictors, DISOPRED3 (Jones and Cozzetto 2015) and PrDOS (Ishida and Kinoshita 2007) (Appendix Figure 3). Occurrences of motifs were mapped to each sequence and identified using Perl scripts.

6.3.6 Correlated evolution of docking motif and phosphorylation sites

To model the evolution of molecular traits I used the Discrete package from Mark Pagel’s BayesTraits program (Pagel 1994). Traits were converted to binary values (0 = absence of trait, 1 = presence of trait) based on their distribution for each substrate (Appendix Figure 5), given that the Discrete package is designed to analyze correlated evolution of two discrete binary traits.

Both the data and the phylogenetic tree were then fit to an independent model (four parameters) of evolution, in which the evolution of one trait does not depend on the presence or absence of the other. This was compared to a fit of the same data and tree to a dependent model (eight parameters) of evolution, in which the evolution of one trait does depend on the presence or absence of the other trait. For each model, the maximum likelihood (ML) of observing the trait data is calculated, given the phylogenetic tree and the selected model (independent or dependent) of evolution. Both of these models are described by transition rate parameters.

6.3.7 Likelihood ratio test (LRT) used to assess the fit of the dependent model to the independent model

The maximum likelihood of the dependent model is compared to the maximum likelihood of the independent model using a standard likelihood ratio test, in which the difference in fit is assessed against a Chi-square distribution with degrees of freedom equal to the difference in the number of parameters. In this case the dependent model (eight parameters) and the independent model (four parameters) differ by four degrees of freedom.
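The test statistic and p-value for this comparison can be computed as sketched below. The sketch assumes the usual form of the statistic, twice the difference in log-likelihood between the dependent and independent models, and uses the closed-form upper tail of the Chi-square distribution that holds for an even number of degrees of freedom. The function name is hypothetical, and any log-likelihood values passed to it here are illustrative rather than results from this work.

```python
import math

def lrt_p_value(lnl_independent, lnl_dependent, df=4):
    """Likelihood ratio statistic and Chi-square p-value (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    # LR = 2 * (lnL of the richer model - lnL of the simpler model).
    stat = 2.0 * (lnl_dependent - lnl_independent)
    if stat <= 0:
        return stat, 1.0
    half = stat / 2.0
    # Upper tail for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
    tail = math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))
    return stat, tail
```

With four degrees of freedom, a statistic of roughly 9.49 corresponds to a p-value of 0.05, so the dependent model is preferred at that level only when its log-likelihood exceeds that of the independent model by about 4.74 units.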

6.3.8 Simulation of Ace2 protein evolution

To simulate Ace2 protein evolution I used previously developed software that explicitly accounts for the rapid rate of evolution observed in IDRs (Nguyen Ba et al. 2014). Using the manually defined rapidly evolving regions (defined from the Ace2 alignment) to define the IDRs and the starting topology obtained from the Fungal Orthogroup Repository, I ran 1000 independent simulations evolving Ace2 protein sequences. In order to get a fair comparison of the simulated and real Ace2 sequences, I chose the simulation whose total number of Cbk1 phosphorylation and docking sites was most similar to that observed in the real Ace2 sequences. Given that the protein evolution software defines two distinct rates of evolution (for disordered and ordered regions), each Ace2 sequence retained the conserved zinc finger DNA-binding domain (manually confirmed by alignment). The coevolution analysis was then run in exactly the same way, by obtaining the occurrences of Cbk1 phosphorylation and docking sites in the rapidly evolving regions surrounding the conserved DNA-binding domain.

7. REFERENCES

Adams PD, Sellers WR, Sharma SK, Wu AD, Nalin CM, Kaelin WG Jr. 1996. Identification of a cyclin-cdk2 recognition motif present in substrates and p21-like cyclin-dependent kinase inhibitors. Mol. Cell Biol. 16:6623-6633.

Alepuz PM, de Nadal E, Zapater M, Ammerer G, Posas F. 2003. Osmostress-induced transcription by Hot1 depends on Hog1-mediated recruitment of the RNA Pol II. EMBO J. 22:2433-2442.

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Barker D, Pagel M. 2005. Predicting functional gene links from phylogenetic-statistical analyses of whole genomes. PLoS Comput. Biol. 1:e3.

Beltrao P, Serrano L. 2005. Comparative genomics and disorder prediction identify biologically relevant SH3 protein interactions. PLoS Comput. Biol. 1:e26.

Bhaduri S, Pryciak PM. 2011. Cyclin-specific docking motifs promote phosphorylation of yeast signaling proteins by G1/S Cdk complexes. Curr. Biol. 21:1615-1623.

Bhowmick A, et al. 2016. Finding our way in the dark proteome. J. Am. Chem. Soc. 138:9730-9742.

Brewster JL, de Valoir T, Dwyer ND, Winter E, Gustin MC. 1993. An osmosensing signal transduction pathway in yeast. Science 259:1760-1763.

Budovskaya YV, Stephan JS, Deminoff SJ, Herman PK. 2005. An evolutionary proteomics approach identifies substrates of the cAMP-dependent protein kinase. Proc. Natl. Acad. Sci. USA. 102:13933–13938.

Burns LT, Wente SR. 2014. Casein kinase II regulation of the Hot1 transcription factor promotes stochastic gene expression. J. Biol. Chem. 289:17668-17679.

Bussemaker HJ, Li H, Siggia ED. 2001. Regulatory element detection using correlation with expression. Nat. Genet. 27:167-174.

Byrne KP, Wolfe KH. 2005. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res. 15:1456-1461.

Chao JA, Patskovsky Y, Almo SC, Singer RH. 2008. Structural basis for the coevolution of a viral RNA-protein complex. Nat. Struct. Mol. Biol. 15:103-105.

Chen RE, Thorner J. 2007. Function and regulation in MAPK signaling pathways: Lessons learned from the yeast Saccharomyces cerevisiae. Biochim. Biophys. Acta BBA - Mol. Cell Res. 1773:1311-1340.

Chen X, et al. 2016. Enrichment of Cdk1-cyclins at DNA double-strand breaks stimulates Fun30 phosphorylation and DNA end resection. Nucleic Acids Res. 44:2742-2753.

Choi HS, et al. 2011. Phosphorylation of phosphatidate phosphatase regulates its membrane association and physiological functions in Saccharomyces cerevisiae: identification of SER(602), THR(723), and SER(744) as the sites phosphorylated by CDC28 (CDK1)-encoded cyclin-dependent kinase. J. Biol. Chem. 286:1486-1498.

Cormack BP, et al. 1997. Yeast-enhanced green fluorescent protein (yEGFP): a reporter of gene expression in Candida albicans. Microbiology 143:303-311.

Darieva Z, et al. 2003. Cell cycle-regulated transcription through the FHA domain of Fkh2p and the coactivator Ndd1p. Curr. Biol. 13:1740-1745.

Darwin CR. 1862. On the various contrivances by which British and foreign orchids are fertilised by insects, and on the good effects of intercrossing. London: John Murray.

Davey NE, et al. 2012a. Attributes of short linear motifs. Mol. Biosyst. 8:268-281.

Davey NE, et al. 2012b. SLiMPrints: conservation-based discovery of functional motif fingerprints in intrinsically disordered protein regions. Nucleic Acids Res. 40:10628-10641.

Davey NE, Cyert MS, Moses AM. 2015. Short linear motifs – ex nihilo evolution of protein regulation. Cell Commun. Signal. 13:43.

DeCesare JM, Stuart DT. 2012. Among B-type cyclins only CLB5 and CLB6 promote premeiotic S phase in Saccharomyces cerevisiae. Genetics 190:1001-1016.

de Juan D, Pazos F, Valencia A. 2013. Emerging methods in protein co-evolution. Nat. Rev. Genet. 14:249-261.

DeRisi JL, Iyer VR, Brown PO. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278:680-686.

Diella F, et al. 2008. Understanding eukaryotic linear motifs and their role in cell signaling and regulation. Front. Biosci. J. Virtual Libr. 13:6580-6603.

Dinkel H, et al. 2012. ELM—the database of eukaryotic linear motifs. Nucleic Acids Res. 40:D242-D251.

Dobzhansky T. 1950. Genetics of natural populations. XIX. Origin of heterosis through natural selection in populations of Drosophila pseudoobscura. Genetics 35:288-302.

Doolin MT, Johnson AL, Johnston LH, Butler G. 2001. Overlapping and distinct roles of the duplicated yeast transcription factors Ace2p and Swi5p. Mol. Microbiol. 40:422-432.

Dunker AK, et al. 2001. Intrinsically disordered protein. J. Mol. Graph. Model. 19:26-59.

Dunker AK, Silman I, Uversky VN, Sussman JL. 2008. Function and structure of inherently disordered proteins. Curr. Opin. Struct. Biol. 18:756-764.

Durbin R, Eddy SR, Krogh A, Mitchison G. 1998. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Twelfth. Cambridge: Cambridge University Press.

Dyson HJ, Wright PE. 2005. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol. 6:197-208.

Ear PH, et al. 2013. Dissection of Cdk1-cyclin complexes in vivo. Proc. Natl. Acad. Sci. USA. 110:15716-15721.

Eddy SR. 1998. Profile hidden Markov models. Bioinforma. Oxf. Engl. 14:755–763.

Edenberg ER, Mark KG, Toczyski DP. 2015. Ndd1 turnover by SCF(Grr1) is inhibited by the DNA damage checkpoint in Saccharomyces cerevisiae. PLoS Genet. 11:e1005162.

Ehrlich P, Raven P. 1964. Butterflies and plants: a study in coevolution. Evolution 18:586-608.

Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA. 95:14863-14868.

Enserink JM, Kolodner RD. 2010. An overview of Cdk1-controlled targets and processes. Cell Div. 5:11.

Frith MC, Saunders NF, Kobe B, Bailey TL. 2008. Discovering sequence motifs with arbitrary insertions and deletions. PLoS Comput. Biol. 4:e1000071.

Glotzer M, Murray AW, Kirschner MW. 1991. Cyclin is degraded by the ubiquitin pathway. Nature 349:132-138.

Gógl G, et al. 2015. The Structure of an NDR/LATS Kinase–Mob Complex Reveals a Novel Kinase–Coactivator System and Substrate Docking Mechanism. PLoS Biol. 13:e1002146.

Goldman A, et al. 2014. The calcineurin signaling network evolves via conserved kinase-phosphatase modules that transcend substrate identity. Mol. Cell. 55:422-435.

Gough J, Karplus K, Hughey R, Chothia C. 2001. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313:903-919.

Gsponer J, Madan Babu MM. 2009. The rules of disorder or why disorder rules. Prog. Biophys. Mol. Biol. 99:94-103.

Guharoy M, Bhowmick P, Sallam M, Tompa P. 2016. Tripartite degrons confer diversity and specificity on regulated protein degradation in the ubiquitin-proteasome system. Nat. Commun. 7:10239.

Gunjan A, Paik J, Verreault A. 2005. Regulation of histone synthesis and nucleosome assembly. Biochimie. 87:625-635.

Habchi J, Tompa P, Longhi S, Uversky VN. 2014. Introducing protein intrinsic disorder. Chem. Rev. 114:6561-6588.

Hayashi M, Maeda T. 2006. Activation of the HOG pathway upon cold stress in Saccharomyces cerevisiae. J. Biochem. (Tokyo). 139:797-803.

HMMER version 3.0. 2010. HMMER software suite. Available from: http://hmmer.org

Hohmann S. 2002. Osmotic stress signaling and osmoadaptation in yeasts. Microbiol. Mol. Biol. Rev. 66:300-372.

Hotz M, Lengefeld J, Barral Y. 2012. The MEN mediates the effects of the spindle assembly checkpoint on Kar9-dependent spindle pole body inheritance in budding yeast. Cell Cycle 11:3109-3116.

Hsieh LS, Su WM, Han GS, Carman GM. 2015. Phosphorylation regulates the ubiquitin-independent degradation of yeast Pah1 phosphatidate phosphatase by the 20S proteasome. J. Biol. Chem. 290:11467-11478.

Hwa Lim H, Yeong FM, Surana U. 2003. Inactivation of mitotic kinase triggers translocation of MEN components to mother-daughter neck in yeast. Mol. Biol. Cell. 14:4734-4743.

Janzen DH. 1980. When is it Coevolution? Evolution 34:611-612.

Jones DT, Cozzetto D. 2015. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31:857-863.

Kammerer D, Stevermann L, Liakopoulos D. 2010. Ubiquitylation regulates interactions of astral microtubules with the cleavage apparatus. Curr. Biol. 20:1233-1243.

Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. 2016. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44:D457-D462.

Kao L, et al. 2014. Global analysis of cdc14 dephosphorylation sites reveals essential regulatory role in mitosis and cytokinesis. Mol. Cell. Proteomics 13:594-605.

Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066.

Khmelinskii A, Roostalu J, Roque H, Antony C, Schiebel E. 2009. Phosphorylation-dependent protein interactions at the spindle midzone mediate cell cycle regulation of spindle elongation. Dev. Cell. 17:244-256.

Kikhney AG, Svergun DI. 2015. A practical guide to small angle X-ray scattering (SAXS) of flexible and intrinsically disordered proteins. FEBS Lett. 589:2570-2577.

King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107-116.

Klipp E, Nordlander B, Krüger R, Gennemark P, Hohmann S. 2005. Integrative model of the response of yeast to osmotic shock. Nat. Biotechnol. 23:975-982.

Koh JL, et al. 2015. CYCLoPs: A Comprehensive Database Constructed from Automated Analysis of Protein Abundance and Subcellular Localization Patterns in Saccharomyces cerevisiae. G3. 5:1223-1232.

Kõivomägi M, et al. 2011a. Cascades of multisite phosphorylation control Sic1 destruction at the onset of S phase. Nature 480:128-131.

Kõivomägi M, et al. 2011b. Dynamics of Cdk1 substrate specificity during the cell cycle. Mol. Cell. 42:610-623.

Kõivomägi M, et al. 2013. Multisite phosphorylation networks as signal processors for Cdk1. Nat. Struct. Mol. Biol. 20:1415-1424.

Kõivomägi M, Skotheim JM. 2014. Docking interactions: cell-cycle regulation and beyond. Curr. Biol. 24:R647-649.

Koranda M, Schleiffer A, Endler L, Ammerer G. 2000. Forkhead-like transcription factors recruit Ndd1 to the chromatin of G2/M-specific promoters. Nature 406:94-98.

Krantz M, Becit E, Hohmann S. 2006. Comparative analysis of HOG pathway proteins to generate hypotheses for functional analysis. Curr. Genet. 49:152-165.

Kurat CF, et al. 2011. Restriction of histone gene transcription to S phase by phosphorylation of a chromatin boundary protein. Genes Dev. 25:2489-2501.

Kurischko C, Kim HK, Kuravi VK, Pratzka J, Luca FC. 2011. The yeast Cbk1 kinase regulates mRNA localization via the mRNA-binding protein Ssd1. J. Cell. Biol. 192:583-598.

Lai AC, Nguyen Ba AN, Moses AM. 2012. Predicting kinase substrates using conservation of local motif density. Bioinformatics 28:962-969.

Lange A, et al. 2007. Classical Nuclear Localization Signals: Definition, Function, and Interaction with Importin α. J. Biol. Chem. 282:5101-5105.

Lee KS, Asano S, Park JE, Sakchaisri K, Erikson RL. 2005. Monitoring the cell cycle by multi-kinase-dependent regulation of Swe1/Wee1 in budding yeast. Cell Cycle 4:1346-1349.

Leisner C, et al. 2008. Regulation of mitotic spindle asymmetry by SUMO and the spindle-assembly checkpoint in yeast. Curr. Biol. 18:1249-1255.

Li Y, Cross FR, Chait BT. 2014. Method for identifying phosphorylated substrates of specific cyclin/cyclin-dependent kinase complexes. Proc. Natl. Acad. Sci. USA. 111:11323-11328.

Linding R, Jensen LJ, Ostheimer GJ, et al. 2007. Systematic discovery of in vivo phosphorylation networks. Cell 129:1415–1426.

Liu H, Jin F, Liang F, Tian X, Wang Y. 2011. The Cik1/Kar3 motor complex is required for the proper kinetochore-microtubule interaction after stressful DNA replication. Genetics 187:397-407.

Liu H, Wang Y. 2006. The function and regulation of budding yeast Swe1 in response to interrupted DNA synthesis. Mol. Biol. Cell. 17:2746-2756.

Loog M, Morgan DO. 2005. Cyclin specificity in the phosphorylation of cyclin-dependent kinase substrates. Nature 434:104-108.

Loy CY, Lydall D, Surana U. 1998. NDD1, a high-dosage suppressor of cdc28-1N, is essential for expression of a subset of late-S-phase-specific genes in Saccharomyces cerevisiae. Mol. Cell. Biol. 19:3312-3327.

Maeda T, Takekawa M, Saito H. 1995. Activation of yeast PBS2 MAPKK by MAPKKKs or by binding of an SH3-containing osmosensor. Science 269:554-558.

Maekawa H, Usui T, Knop M, Schiebel E. 2003. Yeast Cdk1 translocates to the plus end of cytoplasmic microtubules to regulate bud cortex interactions. EMBO J. 22:438-449.

Marles JA, Dahesh S, Haynes J, Andrews BJ, Davidson AR. 2004. Protein-protein interaction affinity plays a crucial role in controlling the Sho1p-mediated signal transduction pathway in yeast. Mol. Cell. 14:813-823.

Mathiasen DP, Lisby M. 2014. Cell cycle regulation of homologous recombination in Saccharomyces cerevisiae. FEMS Microbiol. Rev. 38:172-184.

McGrath DA, et al. 2013. Cks confers specificity to phosphorylation-dependent CDK signaling pathways. Nat. Struct. Mol. Biol. 20:1407-1414.

Molodtsov MI, et al. 2016. A Force-Induced Directional Switch of a Molecular Motor Enables Parallel Microtubule Bundle Formation. Cell 167:539-552.

Monastyrskyy B, Kryshtafovych A, Moult J, Tramontano A, Fidelis K. 2014. Assessment of protein disorder region predictions in CASP10. Proteins 82:127-137.

Moore JK, D’Silva S, Miller RK. 2006. The CLIP-170 homologue Bik1p promotes the phosphorylation and asymmetric localization of Kar9p. Mol. Biol. Cell. 17:178-191.

Moore JK, Miller RK. 2007. The cyclin-dependent kinase Cdc28p regulates multiple aspects of Kar9p function in yeast. Mol. Biol. Cell. 18:1187-1202.

Morgan DO. 2013. The D box meets its match. Mol. Cell. 50:609-610.

Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. 2004. MONKEY: Identifying conserved transcription-factor binding sites in multiple sequence alignments using a binding site-specific evolutionary model. Genome Biol. 5:R98.

Moses AM, Hériché JK, Durbin R. 2007. Clustering of phosphorylation site recognition motifs can be exploited to predict the targets of cyclin-dependent kinase. Genome Biol. 8:R23.

Nakajima R, Masukata H. 2002. SpSld3 is required for loading and maintenance of SpCdc45 on chromatin in DNA replication in fission yeast. Mol. Biol. Cell. 13:1462-1472.

Neduva V, Russell RB. 2005. Linear motifs: Evolutionary interaction switches. FEBS Lett. 579:3342-3345.

Neduva V, et al. 2005. Systematic Discovery of New Recognition Peptides Mediating Protein Interaction Networks. PLoS Biol. 3:e405.

Neuert G, et al. 2013. Systematic Identification of Signal-Activated Stochastic Gene Regulation. Science 339:584-587.

Nguyen Ba AN, Moses AM. 2010. Evolution of characterized phosphorylation sites in budding yeast. Mol. Biol. Evol. 27:2027-2037.

Nguyen Ba AN, et al. 2012. Proteome-wide discovery of evolutionary conserved sequences in disordered regions. Sci. Signal. 5:rs1.

Nguyen Ba AN, et al. 2014. Detecting functional divergence after gene duplication through evolutionary changes in posttranslational regulatory sequences. PLoS Comput. Biol. 10:e1003977.

Nuismer SL, Gomulkiewicz R, Ridenhour BJ. 2010. When is correlation coevolution? Am. Nat. 175:525-537.

Oates ME, et al. 2013. D2P2: Database of Disordered Protein Predictions. Nucleic Acids Res. 41:D508-D516.

O’Rourke SM, Herskowitz I, O’Shea EK. 2002. Yeast go the whole HOG for the hyperosmotic response. Trends Genet. TIG. 18:405-412.

O’Rourke SM, Herskowitz I. 2004. Unique and Redundant Roles for HOG MAPK Pathway Components as Revealed by Whole-Genome Expression Analysis. Mol. Biol. Cell. 15:532-542.

Overton WR. 1988. Modified histogram subtraction technique for analysis of flow cytometry data. Cytometry 9:619-626.

Pagel M. 1994. Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters. Proc. R. Soc. Lond. B 255:37-45.

Pagel M. 1997. Inferring evolutionary processes from phylogenies. Zool. Scripta. 26:331-348.

Pagel M. 1999. Inferring the historical patterns of biological evolution. Nature 401:877-884.

Pelet S, Peter M. 2011. Dynamic processes at stress promoters regulate the bimodal expression of HOG response genes. Commun. Integr. Biol. 4:699-702.

Pentony MM, Jones DT. 2010. Modularity of intrinsic disorder in the human proteome. Proteins 78:212-221.

Pfleger CM, Kirschner MW. 2000. The KEN box: an APC recognition signal distinct from the D box targeted by Cdh1. Genes Dev. 14:655-665.

Pic-Taylor A, Darieva Z, Morgan BA, Sharrocks AD. 2004. Regulation of cell cycle-specific gene expression through cyclin-dependent kinase-mediated phosphorylation of the forkhead transcription factor Fkh2p. Mol. Cell. Biol. 24:10036-10046.

Pierce WK, et al. 2015. Multiple weak linear motifs enhance recruitment and processivity in SPOP-mediated substrate ubiquitination. J. Mol. Biol. 428:1256-1271.

Poon IK, Jans DA. 2005. Regulation of nuclear transport: central role in development and transformation? Traffic 6:173-186.

Ptacek J, et al. 2005. Global analysis of protein phosphorylation in yeast. Nature 438:679-684.

Reményi A, Good MC, Lim WA. 2006. Docking interactions in protein kinase and phosphatase networks. Curr. Opin. Struct. Biol. 16:676-685.

Reynolds D, et al. 2003. Recruitment of Thr 319-phosphorylated Ndd1p to the FHA domain of Fkh2p requires Clb kinase activity: a mechanism for CLB cluster gene activation. Genes Dev. 17:1789-1802.

Ridenhour BJ. 2014. Coevolution. In Oxford Bibliographies in Evolutionary Biology. New York: Oxford University Press.

Robellet X, et al. 2015. A high-sensitivity phospho-switch triggered by Cdk1 governs chromosome morphogenesis during cell division. Genes Dev. 29:426-439.

Romero P, Obradovic Z, Kissinger CR, Villafranca JE, Dunker AK. 1997. Identifying disordered regions in proteins from amino acid sequences. Proc. IEEE Int. Conf. Neural Netw. 1:90-95.

Romero P, Obradovic Z, Li X, Garner EC, Brown CJ, Dunker AK. 2001. Sequence complexity of disordered proteins. Proteins 42:38-42.

Ruis H, Schüller C. 1995. Stress signaling in yeast. BioEssays News Rev. Mol. Cell. Dev. Biol. 17:959-965.

Schaber J, Baltanas R, Bush A, Klipp E, Colman-Lerner A. 2012. Modelling reveals novel roles of two parallel signalling pathways and homeostatic feedbacks in yeast. Mol. Syst. Biol. 8:622.

Schneider TD. 2002. Consensus sequence Zen. Appl. Bioinformatics 1:111-119.

Schwob E, Nasmyth K. 1993. CLB5 and CLB6, a new pair of B cyclins involved in DNA replication in Saccharomyces cerevisiae. Genes Dev. 7:1160-1175.

SGD Project. 2011. Saccharomyces Genome Database. Available from: http://www.yeastgenome.org/

Sharifpoor S, et al. 2011. A quantitative literature-curated gold standard for kinase-substrate pairs. Genome Biol. 12:R39.

Sia RH, Bardes ES, Lew DJ. 1998. Control of Swe1p degradation by the morphogenesis checkpoint. EMBO J. 17:6678-6688.

Sherriff JA, Kent NA, Mellor J. 2007. The Isw2 chromatin-remodeling ATPase cooperates with the Fkh2 transcription factor to repress transcription of the B-type cyclin gene CLB2. Mol. Cell. Biol. 27:2848-2860.

Spellman PT, et al. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell. 9:3273-3297.

Stein A, Ceol A, Aloy P. 2011. 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Res. 39:D718-D723.

Tackett AJ, et al. 2005. Proteomic and genomic characterization of chromatin complexes at a boundary. J. Cell. Biol. 169:35-47.

Takaku T, Ogura K, Kumeta H, Yoshida N, Inagaki F. 2010. Solution structure of a novel Cdc42 binding module of Bem1 and its interaction with Ste20 and Cdc42. J. Biol. Chem. 285:19346-19353.

Tanaka K, et al. 2005. Molecular mechanisms of kinetochore capture by spindle microtubules. Nature 434:987-994.

Tanaka S, et al. 2007. CDK-dependent phosphorylation of Sld2 and Sld3 initiates DNA replication in budding yeast. Nature 445:328-332.

Tatebayashi K, et al. 2007. Transmembrane mucins Hkr1 and Msb2 are putative osmosensors in the SHO1 branch of yeast HOG pathway. EMBO J. 26:3521-3533.

Tompa P, Davey NE, Gibson TJ, Babu MM. 2014. A Million Peptide Motifs for the Molecular Biologist. Mol. Cell. 55:161-169.

Tompa P, Schad E, Tantos A, Kalmar L. 2015. Intrinsically disordered proteins: emerging interaction specialists. Curr. Opin. Struct. Biol. 35:49-59.

Toyn JH, Araki H, Sugino A, Johnston LH. 1991. The cell-cycle-regulated budding yeast gene DBF2, encoding a putative protein kinase, has a homologue that is not under cell-cycle control. Gene 104:63-70.

Tyers M, Tokiwa G, Futcher B. 1993. Comparison of the Saccharomyces cerevisiae G1 cyclins: Cln3 may be an upstream activator of Cln1, Cln2 and other cyclins. EMBO J. 12:1955-1968.

Ubersax JA, Ferrell JE Jr. 2007. Mechanisms of specificity in protein phosphorylation. Nat. Rev. Mol. Cell Biol. 8:530-541.

Valk E, et al. 2014. Multistep phosphorylation systems: tunable components of biological signaling circuits. Mol. Biol. Cell. 25:3456-3460.

Van Roey K, Dinkel H, Weatheritt RJ, Gibson TJ, Davey NE. 2013. The switches.ELM resource: a compendium of conditional regulatory interaction interfaces. Sci. Signal. 6:rs7.

Van Roey K, et al. 2014. Short linear motifs: ubiquitous and functionally diverse protein interaction modules directing cell regulation. Chem. Rev. 114:6733-6778.

Vucetic S, Brown CJ, Dunker AK, Obradovic Z. 2003. Flavors of protein disorder. Proteins 52:573-584.

Vucetic S, et al. 2005. DisProt: a database of protein disorder. Bioinformatics 21:137-140.

Wallace B. 1953. On coadaptation in Drosophila. Am. Nat. 87:343-358.

Walsh CT, Garneau-Tsodikova S, Gatto GJ Jr. 2005. Protein posttranslational modifications: the chemistry of proteome diversifications. Angew. Chem. Int. Ed. Engl. 44:7342-7372.

Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54-61.

Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT. 2004. The DISOPRED server for the prediction of protein disorder. Bioinformatics 20:2138-2139.

Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT. 2004. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 337:635-645.

Weiss EL. 2012. Mitotic exit and separation of mother and daughter cells. Genetics 192:1165-1202.

Westfall PJ, Ballon DR, Thorner J. 2004. When the Stress of Your Environment Makes You Go HOG Wild. Science 306:1511-1512.

Whelan S, Goldman N. 2001. A general model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol. Biol. Evol. 18:691-699.

Widlund PO, et al. 2006. Phosphorylation of the chromosomal passenger protein Bir1 is required for localization of Ndc10 to the spindle during anaphase and full spindle elongation. Mol. Biol. Cell. 17:1065-1074.

Wittenberg C, Reed SI. 2005. Cell cycle-dependent transcription in yeast: promoters, transcription factors, and transcriptomes. Oncogene 24:2746-2755.

Wright PE, Dyson HJ. 2015. Intrinsically disordered proteins in cellular signalling and regulation. Nat. Rev. Mol. Cell Biol. 16:18-29.

Xie H, et al. 2007. Functional Anthology of Intrinsic Disorder. 1. Biological Processes and Functions of Proteins with Long Disordered Regions. J. Proteome Res. 6:1882-1898.

Yamamoto K, Tatebayashi K, Tanaka K, Saito H. 2010. Dynamic control of yeast MAP kinase network by induced association and dissociation between the Ste50 scaffold and the Opy2 membrane anchor. Mol. Cell. 40:87-98.

Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586-1591.

Zegerman P, Diffley JF. 2007. Phosphorylation of Sld2 and Sld3 by cyclin-dependent kinases promotes DNA replication in budding yeast. Nature 445:281-285.

8. APPENDIX

Appendix Table 1. Predicted short linear motifs (pSLiMs) from PhyloHMM in the Hypotonic stress pathway of S. cerevisiae. Signaling components were identified from KEGG, with 36 SLiMs predicted in 11 Hypotonic stress signaling proteins using PhyloHMM. Periods indicate positions where any amino acid can match.


Appendix Table 2. Predicted short linear motifs (pSLiMs) from SLiMPrints (Davey et al. 2012b) in the canonical human Wnt signaling pathway. Signaling components were identified from KEGG, with 24 SLiMs predicted in 16 Wnt signaling proteins using SLiMPrints. Periods indicate positions where any amino acid can match.


Appendix Table 3. T-tests comparing the mean fraction of "on" cells and the mean change in mean GFP intensity of "on" cells for each predicted SLiM to wild-type. All data shown were wild-type centered, for both the fraction of "on" cells and the change in mean GFP intensity, using the mean of the wild-type replicates in each experimental set (to account for day-to-day experimental variation). Means of the wild-type centered data were subsequently calculated, and statistical significance was assessed using a false discovery rate of 0.01 to obtain a P-value threshold. Predicted SLiMs with a P-value (out of 38 total, with 2 P-values for each predicted SLiM) below the P-value threshold were identified as significant. Of these, 3 of 4 known SLiMs tested and 6 of 14 novel predicted SLiMs were identified as significant. Three standard errors were also calculated for the mean change in GFP intensity (see Figure 4).
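As an illustration of how a false discovery rate can be turned into a P-value threshold, the following minimal sketch uses the Benjamini-Hochberg step-up procedure. This is an assumption: the caption does not name the procedure that was used, and the function name and the example P-values are hypothetical. A test is called significant here when its P-value is at or below the returned threshold.

```python
def bh_threshold(p_values, fdr=0.01):
    """Return the Benjamini-Hochberg P-value threshold for a target FDR.

    Assumed procedure (not stated in the caption): sort the P-values,
    find the largest rank k with p_(k) <= fdr * k / m, and use p_(k)
    as the threshold. Returns 0.0 if no P-value qualifies.
    """
    m = len(p_values)
    threshold = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= fdr * rank / m:
            threshold = p  # keep the largest qualifying P-value
    return threshold


# Hypothetical P-values, for illustration only.
example_p = [0.0001, 0.0004, 0.002, 0.03, 0.2]
cutoff = bh_threshold(example_p, fdr=0.01)
significant = [p for p in example_p if p <= cutoff]
```

With 38 P-values, as in this table, the same function would be called once on the pooled list so that both measurements for each predicted SLiM share one threshold.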


Appendix Figure 1. Discrepancy in disorder predictions for false-negatives in Clb5 specific targets, Fin1, Spc110, and Cnn1. Schematic showing the Cks1, Cdk1 and Clb5 motifs ordered from N to C terminus in orange, red and blue, respectively. Black indicates the sequence of the target protein from N to C terminus, with the amino acid positions along the x-axis. Grey indicates predicted IDRs with 9 predictors shown along the x-axis.


Appendix Figure 2. Comparison of Clb5 specific targets that lack the Clb5 sequence pattern (3) to the rest of the Clb5 specific targets (16). Distribution of phosphorylation sites on the top bar plot and docking sites on the bottom bar plot. Clb5 specific targets that do not match are shown in grey, the rest of the Clb5 specific targets are shown in black.


Appendix Figure 3. Loss of correlated evolution signal due to errors in disorder predictors. A) Schematic of motifs with predicted IDRs indicated for both DISOPRED3 and PrDOS for Saccharomyces cerevisiae, shown on the bottom in black. The Cbk1 phosphorylation site is shown in red and the Cbk1 docking site is shown in blue. Vertical dashed lines show whether these motifs were found within the predicted IDR, with arrows indicating regions that mapped to motifs but were not predicted to be disordered. B) Bar plot comparing all Ace2 coevolution analyses: mapped to IDRs predicted by DISOPRED, mapped to IDRs predicted by PrDOS, and mapped to manually defined IDRs. The red horizontal line indicates an LRT score that is significant at P = 0.05 with 4 degrees of freedom, based on the difference in the number of parameters between the dependent and independent models.
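The position of the significance line can be checked with a short calculation. The sketch below assumes the usual likelihood ratio statistic, LRT = 2(lnL_dependent - lnL_independent), compared against a chi-square distribution with 4 degrees of freedom; the function names and the example log-likelihoods are hypothetical. For 4 degrees of freedom the chi-square survival function has the closed form exp(-x/2)(1 + x/2), so only the standard library is needed.

```python
import math


def lrt_statistic(lnl_dependent, lnl_independent):
    """Likelihood ratio statistic for nested models (assumed definition)."""
    return 2.0 * (lnl_dependent - lnl_independent)


def chi2_sf_df4(x):
    """P(X > x) for a chi-square variable with 4 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def critical_value_df4(alpha=0.05, tol=1e-9):
    """Find x with chi2_sf_df4(x) == alpha by bisection."""
    lo, hi = 0.0, 100.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_sf_df4(mid) > alpha:
            lo = mid  # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical log-likelihoods, for illustration only.
score = lrt_statistic(-10.0, -16.0)
is_significant = score > critical_value_df4(0.05)
```

Under these assumptions the critical value is approximately 9.49, which is where the line would fall on an untransformed LRT axis.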


Appendix Figure 4. Simulation of Ace2 ortholog sequence evolution shows no evidence for correlated evolution. The schematic on the left shows Cbk1 phosphorylation and docking sites in evolved sequences for Ace2 and each of its orthologs. The conserved zinc-finger DNA binding domain is shown as black rectangles, Cbk1 phosphorylation sites are shown as black vertical lines, and Cbk1 docking sites are shown as grey vertical lines. The sequence schematic is drawn to scale, with 100 amino acids shown at the bottom for reference. The bar plot on the right represents the log LRT scores comparing Cbk1 phosphorylation site and docking motif trait data fit to a model of dependent evolution versus independent evolution. Simulated Ace2 is compared with the real Ace2 analysis. The red horizontal line indicates an LRT score that is significant at P = 0.05 with 4 degrees of freedom, based on the difference in the number of parameters between the dependent and independent models.


Appendix Figure 5. Cbk1 targets contain non-binary occurrences of phosphorylation and docking sites. Distribution of quantitative traits for 7 substrates used to determine individual thresholds (Ace2 and Fir1 showed binary traits, so their analysis was performed using presence/absence of the motif, i.e. threshold = 1). The distribution of Cbk1 phosphorylation sites is shown in the bar plot on the left in black, and the distribution of Cbk1 docking sites is shown in the bar plot on the right in grey. Red dashed lines indicate the threshold value used to make the trait binary for each target.
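The thresholding step can be sketched as follows. This is a minimal illustration that assumes a trait is scored as present when the motif count in an ortholog is at or above the threshold; the function name, species labels and counts are hypothetical and are not taken from the thesis data.

```python
def binarize_trait(counts_by_species, threshold=1):
    """Convert per-species motif counts into a binary (0/1) trait.

    Assumption: a species scores 1 when its count is >= threshold,
    so threshold = 1 reduces to simple presence/absence.
    """
    return {
        species: int(count >= threshold)
        for species, count in counts_by_species.items()
    }


# Hypothetical motif counts, for illustration only.
docking_counts = {"species_A": 0, "species_B": 1, "species_C": 3}
presence_absence = binarize_trait(docking_counts, threshold=1)
stricter = binarize_trait(docking_counts, threshold=2)
```

The resulting 0/1 values are the kind of discrete character that a dependent versus independent model comparison takes as input.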


Appendix Figure 6. Detecting correlated evolution in Cbk1 substrates before filtering poor ortholog assignments results in loss of signal in Dsf2. The bar plots represent the log LRT scores comparing Cbk1 phosphorylation site and docking motif trait data fit to a model of dependent evolution versus independent evolution. Known substrates are indicated in parentheses. Grey bars indicate the analysis inclusive of all ortholog sequences, prior to assessment of ortholog assignment, using the threshold determined by the distributions of phosphorylation and docking sites (Appendix Figure 5). White bars indicate the analysis performed after filtering poor ortholog assignments, based on presence/absence of phosphorylation and docking sites (threshold = 1). Black bars indicate the analysis of filtered ortholog assignments and threshold traits (Figure 12). Missing bars indicate that there was no difference from the analysis of filtered ortholog assignments and threshold traits for that target. The red horizontal line indicates an LRT score that is significant at P = 0.05 with 4 degrees of freedom, based on the difference in the number of parameters between the dependent and independent models.


Appendix Table 4. List of species used for analysis of each Cbk1 target (predicted and known from Gógl et al. 2015). The 33 species listed are those for which sequences were obtained from YGOB (Byrne and Wolfe 2005) and the Fungal Repository (Wapinski et al. 2007). The first column lists species in short form, in the order shown in the phylogenies (see Figures 10 and 11), and the second column contains the corresponding full species name. Substrates are listed in each subsequent column. Checkmarks indicate that the target protein had an ortholog sequence in the corresponding species. Checkmarks highlighted in red indicate that the ortholog sequence did not pass the quality control filtering (assessment of ortholog assignment).
