Bioinformatical analysis of intrinsically disordered regions in eukaryotes: Insights into the evolution of folding-on-binding regions and post-translational modifications

Mohanalakshmi Narasumani

Department of Biology McGill University Montreal, Quebec, Canada

APRIL 2018

A thesis submitted to McGill University in partial fulfilment of the requirements of the degree of Doctor of Philosophy in Biology (Bioinformatics concentration)

© 2018 Mohanalakshmi Narasumani

ABSTRACT

Intrinsically disordered proteins (IDPs) or intrinsically disordered regions (IDRs) in proteins can exhibit a partially folded or unfolded state under physiological conditions, yet confer several functional advantages. IDRs can fold into a stable tertiary structure when bound to their partner molecule, a transition that can be promoted by post-translational modifications (PTMs). Intrinsic disorder is found in all domains of life but is particularly prevalent in eukaryotes. This thesis uses computational methods and tools to investigate the composition and evolutionary behaviour of disordered regions that undergo a disorder-to-order transition, and the evolutionary trends of PTMs in IDRs across eukaryotes.

Bioinformatical parsing of human folding-on-binding (FB) proteins into four subsets (ordered regions, FBs, disordered regions that surround FBs, and other disordered regions) was performed to examine whether composition and evolutionary behaviour (across vertebrate orthologs) differ among these four subsets. This analysis revealed that, compositionally, ordered regions are distinct from the three other subsets, but that FB regions are of comparable evolutionary conservation to the ordered regions. Disordered regions surrounding FB regions are more negatively charged and less conserved than their adjacent FB regions. These results suggest a role for hydrophilic or charged residues around FBs in steering FB regions towards the binding sites of partner molecules. The insights gained from the analysis of evolutionary conservation of FBs provided motivation to examine a related question, namely the evolutionary conservation of PTMs in IDPs/IDRs in comparison with PTMs in ordered regions.


In another bioinformatical approach, the conservation and emergence of methylation, acetylation and ubiquitination (MAU) sites in ordered and disordered regions were examined across 11 evolutionary clades, from the whole eukaryotic domain down to the ape superfamily. These sites occur mainly at arginine and lysine residues. MAU modification was found to be a major driver of conservation for arginines and lysines in both ordered and disordered regions across the 11 levels, most significantly across the mammalian clade. Furthermore, a significant number of new lysine MAU sites emerge in the disordered regions of proteins in deuterostomes and mammals. In histones, MAU sites exhibit a distinct, significant conservation pattern that is evident as far back as the last common ancestor of mammals. In a separate multi-level evolutionary analysis of the experimentally verified human FB regions, a significant enrichment of conserved ubiquitination sites in FB regions was identified at all evolutionary levels as far back as mammals. Similarly, FB regions showed a significant preference for sites with multiple MAU modifications, whether treated as a sample of ordered regions or of disordered regions. These results indicate the need to analyse IDR sequences at multiple evolutionary levels in order to understand their complex evolutionary patterns. The study as a whole demonstrates the distinctive amino acid composition, PTM preference and conservation of IDRs that exhibit different conformational states (e.g. disordered, disordered around FB, and FB regions), and the interplay between these properties.


RÉSUMÉ

Les protéines intrinsèquement désordonnées (IDPs) ou les régions intrinsèquement désordonnées (IDRs) dans les protéines peuvent montrer un état partiellement replié ou non replié dans des conditions physiologiques mais confèrent plusieurs avantages fonctionnels. Les IDRs peuvent se replier en une structure tertiaire stable quand elles se lient à leur molécule associée, une transition qui peut être facilitée par des modifications post-traductionnelles (PTMs). Le désordre intrinsèque se rencontre dans tous les domaines du vivant mais est très fréquent chez les eucaryotes. Cette thèse étudie la composition et le comportement évolutif des régions désordonnées qui subissent une transition du désordre vers l’ordre et la tendance évolutive des PTMs dans les IDRs chez les eucaryotes grâce à des méthodes et des outils informatiques.

Le classement bioinformatique des protéines humaines se repliant en se liant (FB) en quatre ensembles (régions ordonnées, FBs, régions désordonnées qui entourent les FBs, et autres régions désordonnées) a été effectué pour examiner si la composition et le comportement évolutif (chez les orthologues de vertébrés) sont différents dans ces quatre ensembles. L’analyse a révélé que, par leur composition, les régions ordonnées des protéines sont distinctes des trois autres ensembles, mais que les régions FB montrent une conservation évolutive similaire à celle des régions ordonnées. Les régions désordonnées qui entourent les régions FB sont plus négativement chargées et moins conservées que leurs régions FB adjacentes. Les résultats présentés suggèrent le rôle joué par les résidus hydrophiles ou chargés autour des FBs pour piloter les régions FB vers les sites de liaison des molécules associées. La connaissance fournie par l’analyse de conservation évolutive pour les FBs encourage à étudier une question associée, à savoir la conservation évolutive des PTMs dans les IDPs/IDRs, comparée à celle des PTMs des régions ordonnées.

Dans une autre approche bioinformatique, la conservation et l’émergence de sites de méthylation, d’acétylation et d’ubiquitination (MAU) dans les régions ordonnées et désordonnées ont été étudiées sur 11 clades évolutifs, depuis le domaine eucaryote entier jusqu’à la superfamille des grands singes. Ces sites se rencontrent surtout au niveau des résidus arginine et lysine. Il a été montré que la modification MAU joue un rôle majeur dans la conservation des arginines et des lysines à la fois dans les régions ordonnées et désordonnées, dans l’ensemble des 11 niveaux, de manière plus significative dans le clade des mammifères. Il a aussi été montré qu’un nombre significatif de nouveaux sites MAU sur des lysines apparaît dans les régions désordonnées des protéines chez les deutérostomes et les mammifères. Dans les histones, les sites MAU montrent un patron de conservation distinct et significatif, évident jusqu’au dernier ancêtre commun des mammifères. Dans une analyse séparée, à des niveaux évolutifs multiples, des régions FB humaines vérifiées expérimentalement, un enrichissement significatif des sites d’ubiquitination conservés dans les régions FB a été découvert à tous les niveaux évolutifs jusqu’aux mammifères. De manière similaire, les régions FB montrent une préférence significative pour les sites avec de multiples modifications MAU, qu’elles soient traitées comme un échantillon de régions ordonnées ou de régions désordonnées. Ces résultats indiquent la nécessité de considérer l’analyse des séquences des IDRs à des niveaux évolutifs multiples afin de comprendre leurs patrons évolutifs complexes. La présente étude dans son ensemble démontre la composition distincte en acides aminés, la préférence pour les PTMs et la conservation des IDRs qui montrent différents états de conformation (p. ex. régions désordonnées, régions désordonnées autour des FB et régions FB), et les interrelations entre ces propriétés.


ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my supervisor Professor Paul Harrison, for his guidance, motivation and continuous support during my studies. He introduced me to the fascinating world of intrinsically disordered proteins. Most of all, I thank him for his patience and encouragement. Working under his supervision is a once-in-a-lifetime experience for me. His insightful comments, critical thinking and expertise helped me to improve my research in the past five years.

I am grateful to the members of my supervisory committee, Prof. Jérôme Waldispühl and Prof. Jacek Majewski for the guidance, suggestions and fruitful discussions during supervisory committee meetings.

I am extremely grateful to the Biology Department for the support to complete my thesis. I would like to especially thank Ancil Gittens, Susan Bocti, Susan Gabe, Anne-Marie L'Heureux, Sonal Patel and Tony for their support in the Biology Department.

Special thanks to Sébastien Portalier for translating my thesis abstract into French.

I would like to thank the members of Compute Canada and Calcul Québec support group for their prompt responses and patience in answering my questions and requests.

I owe my sincere thanks to my undergraduate mentor Prof. C.S. Parameshwari for the support and encouragement. Many thanks to my Master’s thesis supervisors Dr. Trishul Artham and Prof. Mukesh Doble for giving me the opportunity to work under their guidance.


I must thank my friends Swetha, Charana, Aman, Jaskaran, Saleem, Saravan and Elika for their support.

I would like to thank my brothers, N. Kannabiran and N. Sathiya Narayanan for their constant support and encouragement. I thank my sister-in-law Revathy Kannabiran for her support. I would also like to thank my little friends, Sujan and Charan for their love and affection.

Finally, this thesis would not have been possible without the support of my fiancé, Vijay. You are my source of inspiration, my best friend and you made me who I am today. I would like to dedicate this thesis to my parents M. V. Narasumani and N. Thulasi for their faith, unconditional love, motivation, and constant support. I am blessed to have parents like them.


PREFACE

Thesis format and organization

This thesis is written in the manuscript-based format prescribed by McGill University Graduate Studies and Research. The work presented here is the original work of the candidate. It comprises two manuscripts on which the candidate is the lead author.

Chapter 1 presents detailed background from the current literature, a review of the topics, and the objectives of the research project. I investigated the sequence composition, post-translational modifications and evolutionary trends of disordered regions in eukaryotes. The findings of this investigation are presented in Chapter 2 and Chapter 3. In Chapter 4, I discuss the implications of this work and provide the conclusion, thereby creating a cohesive document.

Contribution of authors

Chapter 2:

Narasumani, M. and Harrison, P. M. Bioinformatical parsing of folding-on-binding proteins reveals their compositional and evolutionary sequence design. Sci. Rep. 5, 18586; doi: 10.1038/srep18586 (2015).

Professor Harrison and I designed the study and I performed almost all of the data analysis.


Chapter 3:

Narasumani, M. and Harrison, P. M. Discerning evolutionary trends in post-translational modification and the effect of intrinsic disorder: Analysis of methylation, acetylation and ubiquitination sites in human proteins. Accepted in PLOS Computational Biology.

Professor Harrison and I designed the study and I performed all of the analysis.

I wrote the initial drafts of both manuscripts, and Professor Harrison edited the later drafts.

Contribution to Knowledge

Data analysis and interpretation of the research presented in Chapters II and III was performed by me, under the supervision of Prof. Paul Harrison. The results of these studies have been prepared and submitted to peer-reviewed publications.

Chapter 2

This study attempts to understand the composition and evolutionary behaviour of the human proteins that contain folding-on-binding regions. Here, I examined the amino acid composition and evolution of the four parsed regions: (i) Ordered, (ii) Other Disordered, (iii) Disordered-around-FB (DFB) and (iv) FB regions. As a result of this study, I found that Ordered and FB regions group together as highly conserved. I also observed that DFB regions are more strongly hydrophilic than the other disordered datasets. In this study, we describe a possible composition-based steering mechanism for FB region folding-on-binding. This analysis is the first of its kind to perform bioinformatical parsing of FB proteins; it emphasizes the similarities between Ordered and FB regions and highlights the differences between FB and other disordered datasets.

Chapter 3

The presented work is a large-scale evolutionary analysis of more than 15,000 experimentally determined human methylation, acetylation and ubiquitination (MAU) sites in ordered, FB and disordered regions across 11 eukaryotic clades. This is, to my knowledge, the first large-scale analysis of evolutionary trends in ordered and disordered regions of proteins across 380 eukaryotic species. In this study, I identified significant conservation of ubiquitination sites in FB regions across mammals. I found that MAU modification is a major driver of conservation for arginines and lysines in both ordered and disordered regions across the mammalian clade. I also observed the emergence of new lysine MAU sites in the disordered regions of proteins in deuterostomes and mammals. The analysis of sequence conservation in ordered and disordered regions at multiple evolutionary levels is a novel approach to examining the complex patterns of PTM evolution across eukaryotes.


Table of Contents

ABSTRACT
RÉSUMÉ
ACKNOWLEDGEMENTS
PREFACE
Contribution of authors
Contribution to Knowledge
LIST OF ABBREVIATIONS

1 Introduction
1.1 Intrinsically Disordered Regions (IDRs)
1.1.1 Brief history of Protein structure and function
1.1.2 Intrinsically Disordered Regions/Proteins
1.1.3 Coupled folding and binding of IDRs
1.1.4 Sequence composition of IDRs
1.1.5 Characterization of intrinsically disordered regions/proteins
1.1.6 Experimental determination of IDRs
1.1.7 Computational tools to predict intrinsically disordered regions and proteins
1.2 Functional advantages of IDRs
1.2.1 Molecular Recognition
1.2.2 Post-translational modifications in disordered regions
1.3 Evolution of IDRs
1.4 Classification of IDRs
1.5 Disease associated with IDRs
1.6 Role of IDRs in drug development
1.7 Objectives of the Research
1.8 References

2 Bioinformatical parsing of folding-on-binding proteins reveals compositional sequence design and evidence for a general guiding mechanism for binding
2.1 Abstract
2.2 Introduction
2.3 Methods
2.3.1 Data sets
2.3.2 Multiple sequence alignments
2.3.3 Conservation analysis of the aligned sequences
2.3.4 Hydrophobicity and Charge calculation
2.4 Results and Discussion
2.4.1 Overview of the data sets
2.4.2 Analysis of Ordered, Disordered, FB and Disordered around FB regions as populations of sequences
2.4.3 Further analysis of compositional differences between the Ordered, Disordered, FB and Disordered around FB parsed subsets
2.4.4 Complex pattern of sequence conservation in FB-containing proteins
2.4.5 Sampling analysis of parsed subsets
2.4.6 A possible guidance mechanism during FB folding-on-binding with protein interaction partners
2.5 Concluding remarks
2.6 References
2.7 Connecting Text for Chapter 2 to Chapter 3

3 Discerning evolutionary trends in post-translational modification and the effect of intrinsic disorder: Analysis of methylation, acetylation and ubiquitination sites in human proteins
3.1 Abstract
3.2 Introduction
3.3 Methods
3.3.1 PTM Datasets
3.3.2 Eukaryotic
3.3.3 Sequence analysis
3.3.4 Identification of ordered and disordered regions in proteins
3.3.5 Conservation & statistical analysis
3.4 Results and Discussion
3.4.1 Distribution of MAU sites in ordered and disordered regions
3.4.2 FB regions as display areas for PTMs
3.4.3 PTMs are depleted in homopeptides and prion-like proteins
3.4.4 Evolutionary behaviour of MAU sites at eleven evolutionary levels
3.4.5 Evidence for methylation as a driver of lysine conservation during eukaryotic evolution, and for the emergence of new lysine methylation sites
3.4.6 Arginine methylation conservation is highly favoured in ordered regions across human evolutionary descent in eukaryotes
3.4.7 Human acetylated lysines are favoured for significant conservation in disordered regions rather than in ordered regions across eukaryote evolution
3.4.8 Ubiquitination-site residue conservation is favoured in disordered regions of eukaryotic proteins
3.4.9 Conservation signals for MAU sites in Histones
3.4.10 Methylation site lysine residues in the disordered regions of linker H1 and H3 variants are conserved as far back as mammals
3.4.11 Ubiquitination sites in H2A and H3 variants in mammalian histones
3.4.12 Sites with multiple MAU PTMs
3.4.13 Ubiquitination is a major driver of conservation of lysines in folding-on-binding (FB) regions
3.4.14 Functional trends in MAU-site containing proteins
3.5 Concluding remarks
3.6 References

Chapter IV
4 Discussion and Conclusion
4.1 Discussion
4.2 Conclusion
4.3 References

APPENDIX A
Supplementary Data for Chapter III

List of Figures

Figure 1.1 Structure of human calcineurin heterodimer
Figure 1.2 Proposed structure model of Hrk and its binding mechanism
Figure 1.3 Structural polymorphism of disordered regions
Figure 1.4 Post-translational modifications induce structural changes in IDRs

Figure 2.1 Pipeline of the analysis performed
Figure 2.2 Example alignment of a parsed protein
Figure 2.3 Analysis of the four region types as populations of sequences
Figure 2.4 Analysis of the four region types as populations of sequences
Figure 2.5 Trends in composition and conservation for the four parsed region types
Figure 2.6 Comparison of the overall amino-acid composition of the four region types

Figure 3.1 Overview of the number of methylation, acetylation and ubiquitination sites and the coincidence of different MAU types at the same residues in ordered and disordered regions
Figure 3.2 Percentage distribution of MAU and phosphorylation sites in ordered and disordered regions of human proteins
Figure 3.3 Distribution of PTM sites in ordered and disordered regions of human proteins for various subsets of the data
Figure 3.4 Organismal phylogeny and pipeline
Figure 3.5 Example of a protein with methylation, acetylation and ubiquitination sites in ordered and disordered regions
Figure 3.6 Summary of significantly enriched conserved MAU sites in ordered and disordered regions at 11 evolutionary clades

SUPPLEMENTARY FIGURES

Figure S3.1 Conservation of lysine methylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level
Figure S3.2 Conservation of arginine methylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level
Figure S3.3 Conservation of lysine acetylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level
Figure S3.4 Conservation of lysine ubiquitination sites and new conserved sites in ordered and disordered regions at each eukaryotic level
Figure S3.5 Conservation of MAU sites in ordered and disordered regions across all eukaryotic clades
Figure S3.6 Conservation of MAU sites as other MAU residue type in ordered and disordered regions across all eukaryotic clades
Figure S3.7 Conservation of MAU sites filtered with ZORRO program in ordered and disordered regions across all eukaryotic clades
Figure S3.8 Conservation of MAU sites in non-histone proteins in ordered and disordered regions across all eukaryotic clades
Figure S3.9 Conservation of new MAU sites in 'old' proteins in ordered and disordered regions across all eukaryotic clades
Figure S3.10 Conservation of MAU sites in ordered and disordered regions of histone proteins across all eukaryotic clades
Figure S3.11 Conservation of MAU sites as other MAU residue type in ordered and disordered regions of histone proteins across all eukaryotic clades
Figure S3.12 Conservation of sites with multiple MAUs in ordered and disordered regions across all eukaryotic clades
Figure S3.13 Conservation of sites with MAUs in FB (treated as a sample of O and DO) regions across all eukaryotic clades
Figure S3.14 Conservation of MAU sites in disordered regions across all eukaryotic clades, with sequences aligned using the KMAD alignment tool
Figure S3.15 Conservation of MAU sites in disordered regions (predicted by the IUPRED tool) across all eukaryotic clades
Figure S3.16 Gene Ontology category enrichments at different evolutionary levels

LIST OF TABLES

Table 2.1 Comparison of the hydrophobicities of the parsed subsets
Table 2.2 Mean hydrophobicity values of the four region types
Table 2.3 Comparison of the net charges of the parsed subsets
Table 2.4 Mean net-charge values of the parsed subsets
Table 2.5 Comparison of the conservation scores of the parsed subsets
Table 2.6 Mean conservation score values of the parsed region types
Table 2.7 FB set as sample of total ordered and disordered sets

Table 3.1 Enrichment of ‘multiple-MAU’ sites in FB (treated as a sample of either ordered (O) or disordered (DO) regions) across all eukaryotes
Table 3.2 Percentages of human MAU-site residues in ordered and disordered regions that are conserved across all eukaryotes

LIST OF ABBREVIATIONS

4E-BP2: 4E-binding protein 2
APC: Adenomatous polyposis coli
BH3: Bcl-2 homology 3
BLASTP: Basic Local Alignment Search Tool for proteins
BRD4: Bromodomain-containing Protein 4
CBP: CREB-binding protein
CD: Circular Dichroism
CH: Charge-Hydropathy
CRD1: Cell-cycle regulatory domain-1
DFB: Disordered around FB
DISOPRED: Disorder Prediction tool
DisProt: Database of Protein Disorder
DO: Disordered
ELM: Eukaryotic Linear Motif
FB: Folding on Binding
GO: Gene Ontology
HIPK2: Homeodomain-interacting protein kinase-2
HPV: Human papillomavirus
Hrk: Harakiri
IDEAL: Intrinsically disordered proteins with extensive annotation and literature
IDP: Intrinsically Disordered Proteins
IDR: Intrinsically Disordered Regions
IQR: Inter-quartile range
IUPRED: Prediction of intrinsically unstructured regions
MAU: Methylation, Acetylation and Ubiquitination
MoRFs: Molecular Recognition Features
MSA: Multiple sequence alignments
NMD: Nonsense-mediated decay
NMR: Nuclear Magnetic Resonance
NUPR1: Nuclear protein 1
O: Ordered
PDB: Protein Data Bank
PDZ: Postsynaptic protein PSD-95/SAP90, Drosophila septate junction protein Discs-large, tight junction protein ZO-1
PTM: Post-translational modification
SLiMs: Short Linear Motifs
UPF1: Up-frameshift 1
WASP: Wiskott-Aldrich Syndrome Protein

CHAPTER I

1 Introduction


1.1 Intrinsically Disordered Regions (IDRs)

1.1.1 Brief history of Protein structure and function

The classical protein structure-function paradigm is derived from experiments favouring the view that the three-dimensional structure of a protein is the prerequisite for its function. In 1894, Fischer [1] proposed the lock-and-key model, explaining that enzymes exhibit geometric shapes complementary to those of their substrates [2]. In this model, the enzyme and substrate fit together like a lock and key, so the interaction between protein and ligand is facilitated by complementary structures in the binding site of the protein [2]. Underlining the importance of the protein structure-function paradigm, Fischer wrote in 1906 [2]: “Since the proteins participate in one way or another in all chemical processes in the living organism, one may expect highly significant information for biological chemistry from the elucidation of their structure and their transformations.” Indeed, protein structure is crucial for understanding function, and the fact that proteins unfolded under denaturing conditions lose both their structure and their function has supported this paradigm [2, 3].

However, two later models, 'configurational adaptability' by Karush [4] and 'induced fit' by Koshland [5], proposed that the active sites or whole domains of enzymes undergo conformational changes to facilitate the interaction of specific functional groups with the substrate, and that these conformational changes are crucial for function [2, 3]. Several proteins have been proposed to exhibit an induced-fit mechanism [6-8]. For example, the neutron and X-ray structures of α-cyclodextrin, a model macromolecule, in complex with water and other substrates demonstrated that changes in the conformation and hydrogen-bonding energy of α-cyclodextrin mediate complex formation [9, 10]. Furthermore, the conformational diversity of the immunoglobulin E (IgE) antibody SPE7 enables it to bind unrelated antigens [11]. Despite these observations, a defined structure continued to be regarded as the prerequisite for protein function.

1.1.2 Intrinsically Disordered Regions/Proteins

The well-defined three-dimensional structure of a protein, often referred to as the ordered state, determines its biological function. However, the crystal structures of numerous proteins have been reported to contain missing regions. These were often attributed to protein purification errors or to a failure to solve the phase problem, but the most common reason was the failure of the unobserved atoms or residues to scatter X-rays coherently [3, 12-15]. Many of these regions were later identified as genuinely disordered regions, or local disorder [3]. In 1978, both X-ray crystallography and nuclear magnetic resonance (NMR) studies revealed functional disorder in proteins. More precisely, NMR determined the structure of the functional yet disordered tail of histone H5 [3]. Solution-state NMR spectroscopy has since been used to characterize disordered regions and proteins, and numerous NMR studies have described proteins with functional disordered regions [3]. These are referred to as intrinsically disordered (or unstructured) regions or proteins (IDRs or IDPs) [3, 16, 17]. One of the earlier examples of a protein containing an IDR is calcineurin, a serine/threonine phosphatase and abundant calmodulin-binding protein in the brain. Calcineurin plays a vital role in T cell activation. The calmodulin-binding region of calcineurin subunit A is situated in a long disordered region. The binding of calcium to calmodulin activates calcineurin’s phosphatase activity (Figure 1.1, panels A and B) [18, 19].

Figure 1.1 Structure of human calcineurin heterodimer. (A) Linear representation of the catalytic (A subunit) and regulatory (B subunit) subunits in calcineurin. Calcineurin A subunit (in red) with the N-terminal catalytic domain, a calcineurin B-binding segment, a calmodulin-binding segment and an autoinhibitory peptide are highlighted. Calcineurin B subunit (in green) with four EF hands that bind four Ca2+ ions is also indicated. Figure from [18] (Copyright Confirmation Number: 4326820691899). (B) Three-dimensional structure of calcineurin showing the disordered region in the A subunit and the ordered regions. The experimentally determined structures of the A subunit (yellow), B subunit (yellow surface) and autoinhibitory domain (green) are shown. The long disordered tail (red) and the calmodulin-binding site (helix in red) of the A subunit are also shown. Figure from [19] (Copyright Confirmation Number: 4392560289396).

IDRs are described as extremely flexible, with no defined secondary structure under physiological conditions [20]. Several studies have reported that the lack of structure of disordered regions confers several functional advantages [3, 12, 15, 21-23]. One such advantage is that disordered regions can bind to various targets in different conformations and undergo a disorder-to-order transition.

1.1.3 Coupled folding and binding of IDRs

Many disordered regions adopt a well-defined structure when they bind to a specific partner molecule and remain disordered in the absence of their interacting molecules [3, 13-15, 24]. It has been hypothesized that a disordered region first binds its target weakly and non-specifically and then folds as it approaches the binding site; this has been described as the ‘fly-casting’ mechanism [25]. This mechanism is observed in the assembly of the nonsense-mediated decay (NMD) complex, where the disordered C-terminal domain of UPF2 (up-frameshift 2) initiates complex formation by binding to UPF1 (up-frameshift 1) [26]. Furthermore, either short regions within longer disordered regions or entire disordered regions can undergo disorder-to-order transition [24]. These regions are referred to here as ‘folding on binding’ (FB) regions.

For example, the interaction between the FB region of adenomatous polyposis coli (APC), a tumour suppressor protein, and axin promotes complex formation for the phosphorylation of beta-catenin [27]. Among membrane proteins, Harakiri (Hrk), a member of the BH3-only (Bcl-2 homology 3) subfamily of the Bcl-2 (B-cell lymphoma 2) family, induces cell death [28]. Binding of the disordered cytosolic domain of Hrk to the prosurvival Bcl-2 and Bcl-xL (B-cell lymphoma-extra large) members allows the cytosolic domain to adopt an α-helical conformation, and it has been suggested that the disordered domain increases the capture radius for prosurvival partners to mediate binding (Figure 1.2) [28].

Figure 1.2 Proposed structural model of Hrk and its binding mechanism. The arrow shows the disorder-to-order transition of the cytosolic domain of Hrk (green ribbon) upon binding to the Bcl-2 or Bcl-xL protein (red ribbon). The hydrophobic (light yellow surface) and hydrophilic (dark yellow surface) parts of the lipid bilayer are also highlighted. Figure from [28]. (Copyright permission to reuse this figure is not needed).

The coupled folding and binding of IDRs allows high specificity to be combined with low affinity towards a partner molecule [3, 13-15, 24]. IDRs therefore play prominent roles in signalling processes: highly specific binding mediates the initiation of signalling pathways, and low affinity facilitates rapid dissociation of partner molecules [29, 30]. An example of this short-lived association is the interaction between p27 and cyclin-CDK during the cell cycle [29, 30]. More examples of coupled folding and binding in IDRs are discussed below.

The kinase-inducible activation domain (pKID) of CREB is disordered in the free state [31]. NMR titrations and ¹⁵N relaxation dispersion studies revealed that phosphorylated pKID forms two α-helices when bound to the KIX domain of the transcriptional coactivator CREB-binding protein (CBP) [31, 32]. Interestingly, small variations in the entropy or enthalpy of binding [29, 33] caused by post-translational modifications have been suggested to facilitate the transition from disordered to ordered conformations and to change protein charge [15, 34, 35]. For example, phosphorylation of Ser684, Ser686 and Ser692 in E-cadherin stabilizes the cadherin structure by promoting additional hydrogen-bond interactions with beta-catenin [15]. Another example is that serine phosphorylation in the calmodulin domain of human p4.1 enables a 17-residue peptide to adopt an alpha-helical conformation [36]. Proteins involved in coupled folding and binding exhibit many functional advantages; however, the evolution and amino acid composition of these regions and of the disordered regions around them are not well studied.

1.1.4 Sequence composition of IDRs

Compositionally biased or repetitive regions and low-complexity regions are often associated with IDRs, which show a preference for charged, polar and structure-breaking amino acids such as alanine, arginine, glutamic acid, glutamine, glycine, lysine, proline and serine, termed “disorder-promoting amino acids”. On the other hand, amino acids such as asparagine, cysteine, isoleucine, leucine, methionine, phenylalanine, tryptophan, tyrosine and valine are common in ordered regions and are termed “order-promoting amino acids” [19, 37, 38]. Thus, sequence composition holds information about protein structure. In 2000, Uversky et al. developed the charge-hydropathy (CH) plot to distinguish ordered and disordered regions based on net charge and hydrophobicity. In this plot, a linear boundary separates the ordered and disordered regions, and IDRs occupy the area of the plane with relatively high net charge and low hydrophobicity [13].
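The CH-plot idea can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of any particular tool: it assumes the standard Kyte-Doolittle hydropathy scale rescaled to the range 0 to 1, counts Lys/Arg as +1 and Asp/Glu as -1 (His is treated as neutral), and uses the commonly quoted boundary ⟨R⟩ = 2.785⟨H⟩ − 1.151, which is not stated in this thesis and should be checked against [13]. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(seq):
    """Mean hydropathy with each value rescaled from [-4.5, 4.5] to [0, 1]."""
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)

def mean_net_charge(seq):
    """Absolute mean net charge per residue (K/R = +1, D/E = -1)."""
    charge = sum((aa in "KR") - (aa in "DE") for aa in seq)
    return abs(charge) / len(seq)

def ch_plot_class(seq):
    """Classify a sequence by its position relative to the assumed boundary."""
    boundary = 2.785 * mean_hydropathy(seq) - 1.151  # assumed coefficients
    return "disordered" if mean_net_charge(seq) > boundary else "ordered"

print(ch_plot_class("EEEEKKKKEEEE"))  # highly charged, low hydropathy
print(ch_plot_class("LLLLIIIIVVVV"))  # uncharged, high hydropathy
```

The sketch expects upper-case sequences containing only the 20 standard residues; any other character raises a KeyError rather than being silently ignored.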

The amino acid composition of IDRs is more highly conserved among orthologs than their sequence [39, 40]. The distinct sequence composition of IDRs can influence their conformational stability and functions [29]. For instance, computational studies of sequence composition have shown that charged and hydrophobic residues dictate the conformation of IDRs and regulate the cell cycle [41].
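The distinction between conservation of composition and conservation of sequence can be illustrated with a minimal Python sketch. It assumes two segments that are already aligned and gap-free; the function names and the toy sequences are hypothetical and are not taken from the cited studies.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the fraction of each of the 20 standard residues in seq."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]

def composition_distance(seq_a, seq_b):
    """Euclidean distance between two composition vectors (0 = identical)."""
    pairs = zip(composition(seq_a), composition(seq_b))
    return sqrt(sum((a - b) ** 2 for a, b in pairs))

def percent_identity(seq_a, seq_b):
    """Fraction of identical positions in two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)

# Toy example: the same residues arranged in a different order.
seq_a = "KAKPAAKK"
seq_b = "AKKAPKAK"
print(composition_distance(seq_a, seq_b))  # 0.0, identical composition
print(percent_identity(seq_a, seq_b))      # 0.25, low positional identity
```

In this toy case the two segments share only a quarter of their positions yet have exactly the same composition, which is the pattern the paragraph above describes for IDRs among orthologs.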

The relationship between amino acid composition and intrinsic disorder in histones has been reviewed [42, 43]. In histones, the sequence composition of the terminal domains and disordered regions determines their role in molecular recognition [42, 43]. The disordered C-terminal domain (CTD) of linker histones, which is rich in Ala, Lys and Pro, is necessary to stabilize chromatin fibers. Many higher eukaryotes have six somatic linker histone isoforms, each with a distinct primary sequence in its CTD. However, the amino acid composition of the CTDs is similar, with ~40% Lys, ~20-35% Ala and ~15% Pro. Furthermore, the composition of CTDs has been reported to play key roles in DNA binding and protein-protein interactions. In the case of core histones, the disordered N-terminal domains of H2A and H4 share a similar composition, whereas those of H2B and H3 resemble the CTD of linker histones in composition [42]. In addition, amino acid composition is associated with different functional types of IDRs [42, 44].

The distinct amino acid composition of IDRs has been used to develop several computational tools to predict intrinsic disorder. Disorder prediction tools based on the flexibility, hydropathy and charge of amino acids are discussed in a later section (Section 1.4.2).

1.1.5 Characterization of intrinsically disordered regions/proteins

The structural differences between ordered and disordered regions have led to the rapid development of a variety of computational and experimental methods. Although regions with missing electron density in X-ray crystal structures have been reported to indicate disordered regions, such missing regions may also result from protein purification errors or crystal defects [3]. NMR spectroscopy and circular dichroism (CD) spectroscopy are among the experimental methods used to observe the conformational propensities of disordered regions [45, 46].

1.1.6 Experimental determination of IDRs

Structural determination of disordered regions by NMR has advantages over X-ray diffraction because NMR does not require crystallization and can provide conformational ensembles for disordered regions [3, 46]. In addition, NMR can reveal disordered regions within ordered regions, regions that fold upon binding to other proteins ('folding-on-binding' regions), and completely disordered proteins in solution [47]. For example, the disordered NH2-terminal region of p21 in the unbound state and the ordered conformation of the same region when bound to Cdk2 have been observed using NMR [47]. Similarly, the structural and functional details of the disordered regions and structured domains of CBP (CREB-binding protein) and its homologue p300 have been determined by NMR [46].

CD spectroscopy is also used to determine the presence of ordered, molten globule and random coil states of a protein. However, it cannot provide clear structural information for both ordered and disordered regions [48].

1.1.7 Computational tools to predict intrinsically disordered regions and proteins

Limitations of, and differences between, experimental methods have led to the development of numerous prediction tools to identify disordered regions in proteins [12, 49]. IDRs are characterized by amino acid compositional bias, low sequence complexity, low hydrophobicity and somewhat higher net charge [13, 17, 22]. Initially, disordered regions were predicted by identifying missing coordinates in crystal structures [49]. Later, the differences between the amino acid composition of ordered and disordered regions were used to predict disordered regions from the primary sequence [17, 49]. Several disorder predictors have been developed, such as ANCHOR [50], DisEMBL [51], DISOPRED [52] and DISOPRED3 [53], DisPredict [54], DISpro [55], DNdisorder [56], FoldIndex [57], FoldUnfold [58], GlobPlot [59], iPDA [60], IUPred [61], the PONDR® predictors [62-64], PrDOS [65], RONN [66], SPINE-D [67], Spritz [68], CSpritz [69], ESpritz [70] and RAPID [71]. Although the past decade has seen an increasing number of predictors, the gap between the number of structures available for disordered regions in the PDB (Protein Data Bank) and the number that exist in nature is immense. Moreover, an earlier analysis of proteins deposited in the major sequence databases showed the abundance of disordered regions in proteins [22]. Therefore, predicting disordered regions has a significant role in understanding their function.
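Many of the predictors listed above report a per-residue disorder score, which is then converted into discrete regions. The following Python sketch shows one generic way to do this. It is not the post-processing of any specific tool: the 0.5 threshold, the 30-residue minimum length and the function name are assumptions chosen for illustration.

```python
def disordered_segments(scores, threshold=0.5, min_length=30):
    """Return (start, end) pairs of runs with score >= threshold.

    Positions are 0-based and the end is exclusive. Runs shorter than
    min_length residues are discarded.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        segments.append((start, len(scores)))
    return segments

# Toy score profile; a short minimum length is used so the example is readable.
toy_scores = [0.1] * 5 + [0.9] * 4 + [0.2] * 3 + [0.8] * 2
print(disordered_segments(toy_scores, min_length=3))  # [(5, 9)]
```

Half-open intervals are used so that a segment can be sliced directly from the sequence as seq[start:end].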

1.2 Functional advantages of IDRs

The lack of a well-defined structure provides numerous functional advantages to disordered regions. Bioinformatical analysis of complete proteomes revealed that IDRs in eukaryotes are abundant in regulatory and signaling proteins [12, 14, 22]. Several studies have indicated a substantial number of intrinsically disordered regions in complex organisms, and much greater percentages of proteins with predicted disorder are reported in eukaryotes (35 to 51%) than in bacteria (6 to 33%) and archaea (9 to 37%) [12, 51, 72-76]. Furthermore, a study of structural disorder in eukaryotes reported that protists (single-celled eukaryotes) have high levels of predicted structural disorder [77-79]. Disordered regions are abundant in parasites with complex life cycles and host-changing pathogenic lifestyles, such as apicomplexans and members of the genus Trypanosoma, and are involved in host-parasite interactions [78]. Therefore, the functional importance of disordered regions is emphasized by their abundance in a wide range of organisms.

Furthermore, in many eukaryotic proteins, the globular domains are connected by linker regions. These regions are enriched in polar, uncharged and proline residues, although high sequence conservation and an increased number of charged and hydrophobic residues have occasionally been observed. For example, the cell-cycle regulatory domain-1 (CRD1), which is enriched in charged residues, is located between the KIX domain and the bromodomain in CBP, and sumoylation in the CRD1 motif is necessary for transcriptional repression.

Previously, Dunker et al. reported that disordered regions of 30 or more residues are mainly involved in cell signaling, regulation of transcription and translation, and protein-nucleic acid and protein-protein binding [22]. The structural polymorphism of disordered regions enables interaction with different target molecules and adoption of different structures depending on the binding partner [15]. For example, the GTPase binding domain of Wiskott-Aldrich syndrome protein (WASP) exhibits different folded structures when it binds to Cdc42 GTPase and to the VCA inhibitory peptide (Figure 1.3A) [15]. Another interesting example is the interaction of glycogen synthase kinase 3β (GSK3β) with FRAT1 and axin. Both FRAT1 and axin bind at the C-terminal domain of GSK3β and acquire different conformations (Figure 1.3B) [37]. In addition, the nuclear coactivator binding domain of the CBP protein exhibits two different structures when it binds to the activation domain of p160 [80]. Furthermore, thymosin β4, a small actin-binding protein, is reported to have disordered regions in solution and is involved in the recognition of several target molecules [81]. Other major functions of IDRs include the housing of sites for protein phosphorylation and other post-translational modifications (PTMs). IDRs are reported to provide enhanced accessibility of PTM sites for regulation [22]. From the above studies, it is clear that the discovery of IDRs has created a new paradigm in the protein structure/function relationship, which augments the paradigm that had been accepted for the past 60 years. Hence, exhaustive bioinformatical studies of the protein sequences of IDRs will illuminate the evolution and function of these relatively less characterized protein types. More detailed functions of IDRs are discussed below.

Figure 1.3 Structural polymorphism of disordered regions. (A) The GTPase binding domain of WASP showing different folded structures when interacting with (a) Cdc42 GTPase and (b) VCA peptide. Figure from [15]. (Copyright Confirmation Number: 4326830865336). (B) Polymorphism in the bound state of GSK3β. The binding of axin and FRAT proteins to GSK3β demonstrates different conformations and interactions. Figure from [37] (Copyright Order Number: 4326831050111).

1.2.1 Molecular Recognition

As mentioned earlier, disorder-to-order transition is a common process in disordered regions. Disorder-mediated molecular recognition requires low binding affinity and often involves interaction with a large number of diverse partner molecules. DNA recognition and transcription activation are mediated by disorder-to-order transition. For instance, the disordered high mobility group (HMG) domain of lymphoid enhancer-binding factor (LEF-1) binds to DNA and regulates the T cell receptor-α gene enhancer. The analysis of IDRs in transmembrane proteins revealed that loop regions rich in positively charged amino acids provide structural stability and facilitate regulatory interactions, whereas IDRs with a deficit of positive residues in terminal regions mediate protein-protein interactions (e.g., receptor clustering or recruiting signaling partners) [82]. An earlier analysis of protein complexes deposited in the PDB identified short peptides (10-70 residues) bound to globular proteins [83]. These peptides are located within long IDRs and adopt α-helical, β-sheet or irregular secondary structure upon binding to target molecules. They have been called molecular recognition features (MoRFs) and are primarily associated with molecular recognition and protein-protein interactions in signaling events [83, 84]. Furthermore, predicted phosphorylation sites are reported in one third of MoRFs [83], and sites with multiple PTMs show a strong preference for MoRFs [85].

1.2.2 Post-translational modifications in disordered regions

Post-translational modifications play vital roles in signaling, maturation, the folding of newly synthesized proteins and protein interaction networks [86, 87]. Additionally, PTMs in disordered regions have been reported to regulate protein-protein interactions in transcriptional and developmental processes [85]. Interestingly, PTMs can facilitate disorder-to-order transition by changing the physical and chemical properties of IDRs (Figure 1.4) [88]. It has also been hypothesized that PTMs are primarily present in disordered regions because of their high accessibility [88]. Indeed, histones undergo several types of PTMs, such as acetylation, methylation, phosphorylation, ubiquitination, sumoylation and ADP-ribosylation, and these modifications are important for nucleosome stability, transcription activation and gene repression and confer distinct functions on chromatin [89, 90]. Acetylation and methylation in the disordered N-terminal tails of core histones facilitate specific protein-protein interactions and induce coupled folding and binding of the N-terminal domain [91].

Figure 1.4 Post-translational modifications induce structural changes in IDRs. Figure from [88] (Copyright permission to reuse this figure is not needed).

In DNA-binding proteins, acetylation and phosphorylation can modulate specific and non-specific interactions with DNA by altering the charge of the modified residues [92]. In addition, the intrinsically disordered N- and C-terminal domains of p53 are subject to various PTMs, including acetylation, methylation, glycosylation, phosphorylation, poly-ribosylation, O-GlcNAcylation, sumoylation and ubiquitination [93]. These modifications regulate the interaction between p53 and its partner molecules. For example, phosphorylation in the transcription activation domain (TAD) of p53 can increase its binding affinity for CH3 and TAZ1 (transcriptional adapter zinc-binding). On the other hand, phosphorylation at Ser15, Thr18 and Ser20 can inhibit Mdm2 binding [93]. In the case of membrane proteins, myristoylation has a significant role in membrane targeting. The disordered N-terminus of Hrk, a BH3-only member of the Bcl-2 family, is predicted to contain an N-myristoylation site at Gly 63 [28].

Many studies have shown that PTMs such as acetylation, fatty acylation, methylation, glycosylation, phosphorylation and ubiquitination occur predominantly in disordered regions. Specifically, phosphorylation is important for signaling in eukaryotic proteins, and nearly one-third of eukaryotic proteins undergo phosphorylation [94]. Recently, an analysis of IDRs in 504 kinases showed that 83% of kinases contain at least one disordered region [95, 96], and each kinase group is categorised based on its differential evolution [97]. In addition, it has been reported that phosphorylation shows a preference for IDRs in both animals and plants [3, 34, 81, 91, 98].

The amino acids adjacent to phosphorylation sites generally have properties similar to those of residues in disordered regions. Therefore, a web-based tool has been developed to identify phosphorylation sites in disordered regions [34]. However, a correlation study between protein disorder and PTMs has shown contradictory results [99]. Hence, IDRs may not preferentially harbour sites for all types of PTMs.

Evolutionary studies of regulatory enzymes and modification sites have revealed that PTMs such as acetylation, glycosylation and phosphorylation are found in all domains of life [100]. In addition, the non-enzymatic acetylation of lysine residues and phosphorylation further suggest their ancient origin [100]. In 2012, Hagai and co-workers studied the rate of evolutionary change and the formation of ubiquitination sites. They reported that ubiquitination sites in mammalian proteins are more conserved than unmodified residues and that a shift in the location of ubiquitination sites may be compensated by residues in disordered regions [101]. Furthermore, bioinformatics analysis revealed the emergence of 281 novel ubiquitination sites in the human lineage during primate evolution [102]. Similarly, 37 human-specific phosphorylation sites have been identified [103]. Recent advances in sequencing technologies have facilitated the large-scale analysis of PTMs and their evolution. However, PTMs in many proteins still need to be studied to understand their function and to obtain a broader view of the origin of PTMs in different clades. Moreover, a large-scale analysis comparing the conservation of PTMs in ordered and disordered regions is necessary to understand their functional role in different species. This thesis focuses on the bioinformatics analysis of the evolution and emergence of PTMs in disordered regions.
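The notion of PTM-site conservation across species can be made concrete with a small Python sketch. It is only an illustration under simple assumptions: the orthologous sequences are already aligned to equal length with '-' as the gap character, a site counts as conserved when the aligned residue belongs to a set of modifiable residues, and a gap counts as not conserved. The function name and the toy alignment are hypothetical and do not represent the pipeline used in this thesis.

```python
def site_conservation(aligned_orthologs, column, allowed):
    """Fraction of aligned sequences whose residue at `column` is in `allowed`.

    aligned_orthologs: list of equal-length aligned sequences ('-' = gap).
    column: 0-based alignment column of the PTM site.
    allowed: residues that could carry the modification, e.g. "ST".
    """
    if len({len(seq) for seq in aligned_orthologs}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    residues = [seq[column] for seq in aligned_orthologs]
    return sum(res in allowed for res in residues) / len(residues)

# Toy alignment of four hypothetical orthologs.
alignment = ["MKSPT", "MRSPT", "MK-PA", "MKTPT"]
print(site_conservation(alignment, 2, "ST"))  # 0.75 (S, S, gap, T)
print(site_conservation(alignment, 1, "K"))   # 0.75 (K, R, K, K)
```

Allowing a set of residues rather than a single residue lets a serine-to-threonine substitution count as retention of a phosphorylatable site, which is one possible design choice among several.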

1.3 Evolution of IDRs

The evolution of disordered regions differs from that of ordered regions [39, 96, 104-109]. An early comparative study of the evolution of IDRs and ordered proteins showed that disordered regions evolve faster than ordered regions in 19 of 26 families [110]. However, a smaller group of IDRs has been shown to evolve slowly [110]. The faster-evolving disordered regions include binding sites for proteins, DNA and RNA as well as flexible linkers, whereas the slowly evolving regions are involved in DNA binding. In addition, disordered proteins tend to undergo more amino acid replacements than ordered proteins [110].

The rewiring of protein interactions during evolution suggests that disordered interactions are less conserved than ordered interactions [111]. IDRs show a distinct pattern of point accepted mutations and higher rates of insertion and deletion. Aromatic amino acids have a lower substitution rate than charged amino acids; this also contributes to the difference in evolutionary rate between ordered and disordered regions [111]. In addition, a recent study suggested that evolutionarily young proteins are enriched in disordered regions and can become ordered over evolutionary time [112].

In 2018, an evolutionary analysis of human proteins with IDRs showed that IDRs are more frequent targets of positive selection than other regions of the protein [113]. Furthermore, a dynamic analysis of the linker domain of the RPA70 protein using NMR found that similar backbone flexibility and the same length are maintained across diverged species, even though the amino acid sequences show no conserved regions [114]. However, a recent study has suggested that conserved IDRs within a single-domain protein may provide multiple functions that are typically observed in proteins with multiple domains [115]. Moreover, many studies have used different approaches to study evolution and sequence conservation in disordered regions [106, 116-120]. Thus, understanding the natural selection and domain evolution of IDRs is important for deciphering their functions. Evolutionary studies of disordered proteins will provide insight into the degree of natural selection acting on specific parts of intrinsically disordered proteins and how these relate to each other during evolution.

1.4 Classification of IDRs

The structural and amino acid sequence differences between ordered and disordered regions provide many ways to classify IDRs [3, 121-125]. Furthermore, previous studies have used different approaches to identify different types of IDRs [126, 127]. In 1997, Romero and coworkers classified IDRs into three types: short (7-21), medium (22-44) and long (>45 residues) [123, 124]. Sequence analyses of short (≤30 residues) and long (>30 residues) disordered regions show that short disordered regions are enriched in glycine and aspartic acid, whereas long disordered regions are enriched in lysine, glutamic acid and proline [64, 128-130], and disordered regions of different lengths exhibit different types of functions [44, 131]. For instance, short disordered regions may contain short linkers, MoRFs or short linear motifs (SLiMs) of 3-10 residues involved in partner recognition and post-translational modification [44, 132-135]. On the other hand, long disordered regions may contain multiple motifs or domains [44]. Therefore, several studies have investigated the functions of IDRs based on their amino acid sequence length [75, 110, 132, 136, 137].
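The length classes attributed above to Romero and coworkers translate directly into a small helper function. The Python sketch below follows the ranges as quoted in the text, with two assumptions that are not in the source: a region of exactly 45 residues, which falls between "22-44" and ">45" as written, is treated as long, and regions shorter than 7 residues are reported as unclassified. The function name is hypothetical.

```python
def classify_idr_length(length):
    """Assign a disordered region to a length class by residue count."""
    if length < 7:
        return "unclassified"  # assumption: below the shortest quoted class
    if length <= 21:
        return "short"
    if length <= 44:
        return "medium"
    return "long"  # assumption: 45 residues is grouped with >45

for n in (5, 7, 21, 22, 44, 45, 120):
    print(n, classify_idr_length(n))
```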

Moreover, IDRs can also be classified in terms of their structure, function, functional features, evolution, regulation, protein interactions and biophysical properties [44, 138]. Specifically, functional features such as MoRFs and SLiMs are identified as partner-binding regions. MoRFs can undergo disorder-to-order transition upon binding to a target molecule, and SLiMs exhibit sequence conservation and provide sites for post-translational modifications [139]. For instance, phosphorylation in the cyclin-dependent kinase binding motif is involved in regulating cell cycle progression [140]. In addition, MoRFs can be predicted using several predictors, including alpha-MoRFpred [73, 141], MoRFpred [142], MFSPSSMpred [143], MoRFCHiBi [144], MoRFchibi SYSTEM [145], fMoRFpred [146], retro-MoRF [147] and OPAL [148]. Predictors of SLiMs include SLiMpred [149], PepBindPred [150], SLiMDisc [151], HHMOTiF [152] and SLiMSearch [153]. In addition, manually curated SLiMs can be found in the eukaryotic linear motif (ELM) resource [154]. Both MoRFs and SLiMs can be identified based on their sequence conservation [143, 155]. For example, SH2-, SH3- and Ser/Thr kinase-interacting SLiMs are conserved in disordered regions [156].

1.5 Diseases associated with IDRs

The abundance of IDRs in eukaryotes indicates that these proteins play crucial roles in normal cellular function; malfunctioning of IDRs can therefore lead to many diseases. A broad range of diseases originates from the misfolding or unfolding of certain proteins, and these are referred to as protein misfolding diseases [157]. In 2008, Uversky et al. introduced the D2 (disorder in disorders) concept to highlight the abundant association of IDRs with human diseases [158]. Furthermore, impaired interactions with endogenous factors such as chaperones, intracellular or extracellular matrices, or small molecules can increase the misfolding propensity of pathogenic proteins [159]. For example, the aggregation of the intrinsically disordered protein alpha-synuclein in the cytoplasm is involved in the pathogenesis of Parkinson's disease, dementia with Lewy bodies, Alzheimer's disease, Down's syndrome, multiple system atrophy, and neurodegeneration with brain iron accumulation type 1 [160]. These diseases are termed synucleinopathies.

Studies on disordered regions in p53 (a tumour suppressor) and human papillomavirus (HPV) proteins showed that these proteins contain an increased amount of intrinsic disorder [160]. The intrinsically disordered regulatory region near the C-terminus of p53 folds into helical, β-strand, and extended irregular structures on binding to different protein partners [161]. Misfolding diseases tend to spread to multiple tissues and to cause damage to multiple organs. Hence, these studies highlight the pressing need to understand the regulation of IDRs, because altered expression of IDRs is associated with many diseases.

1.6 Role of IDRs in drug development

Several studies on disease-associated IDRs and their significant role in cellular function have postulated that IDRs could be potential drug targets for many life-threatening diseases such as cancer and neurological disorders [162-167]. Uversky et al. surmised that a protein-protein interaction between one disordered partner and one structured partner is likely to be a good target for drug discovery [158]. In addition, IDRs can form a helix with a hydrophobic surface that sits in a groove of a structured protein, as observed in the MoRF dataset referred to above. One such example is the interaction between p53 and Mdm2 (a negative regulator of p53). The binding site on p53 contains an alpha-helical MoRF, and Mdm2 provides a groove against which this helix forms [163]. Previous studies have suggested that the binding of small molecules to MoRFs in IDRs stabilises their bound state and inhibits protein interaction with other partners [168]. Interestingly, a recent study has reported that the disordered region of nuclear protein 1 (NUPR1) remains disordered upon binding of fifteen FDA-approved compounds [164].

During the disorder-to-order transition of IDRs, the binding energy is used to overcome the high entropy of the unfolded state. As a consequence, the interaction between a disordered and a structured partner is weaker than the interaction between two structured proteins, which makes such interactions more amenable to inhibition by small molecules [163].

As mentioned earlier, proline-rich IDRs might be drug targets for immune-mediated disorders [162]. In addition, binding of small molecules in the disordered region of the Myc protein can regulate the over-expression of Myc in cancer [159]. Furthermore, the Dishevelled PDZ domain (which is up-regulated in some cancers) facilitates the binding of designed peptides that inhibit the Wnt signaling pathway [169]. Recently, a cell-permeable small molecule was shown to displace the BRD4 (bromodomain-containing protein 4) oncoprotein fusion from chromatin and to inhibit cell proliferation in human squamous-cell cancer [170]. In addition, targeting the enzymes that post-translationally modify IDRs could be a possible approach. Inhibition of Sirt1 deacetylase by phytochemicals, or siRNA-mediated reduction of the level of the kinase HIPK2 (homeodomain-interacting protein kinase-2), led to increased stability of p53 and facilitated apoptosis in cancer cells [171]. Interestingly, venetoclax (ABT-199), a drug that mimics the binding of the intrinsically disordered BH3 protein to Bcl-2, has been shown to inhibit the growth of Bcl-2-dependent tumours [172-174].

1.7 Objectives of the Research

Within the past three decades, interest in studying intrinsically disordered regions has increased exponentially. This is mainly due to their distinct structural and functional characteristics, such as coupled folding and binding, and their roles in cell signaling and post-translational modifications. Due to the difficulties in determining the structure of disordered regions, a significant number of bioinformatics approaches have been developed to understand their sequence composition, evolution, structure and functional properties. However, our understanding of the composition and evolutionary behaviour of intrinsically disordered regions remains incomplete. Hence, this thesis performs a bioinformatical analysis of IDRs that specifically aims to:

(i) Analyse the composition of different disordered region types in human FB proteins, and examine the evolutionary behaviour of these regions across vertebrate orthologs.

(ii) Analyse the distribution and evolutionary trend of PTMs in ordered, FB and disordered regions across eukaryotic organisms.

Chapter 2 is a bioinformatical parsing of human folding-on-binding proteins into four different subsets. This chapter aims to compare the charge, hydrophobicity and evolutionary behaviour of (i) ordered, (ii) other disordered, (iii) folding-on-binding and (iv) disordered-around-FB regions in human proteins. To this end, the conservation of the four parsed datasets across vertebrate evolution and the compositional differences among the three parsed disordered region types are analysed.
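As a rough illustration of the kind of per-region comparison described above, the following minimal sketch computes two simple composition measures for a sequence segment. It rests on assumptions that are not stated in this section: hydrophobicity is taken as the mean Kyte-Doolittle hydropathy, net charge counts Lys/Arg as +1 and Asp/Glu as -1 with histidine treated as neutral, and the two peptide sequences are invented toy examples. The measures actually used in Chapter 2 may differ.

```python
# Kyte-Doolittle hydropathy scale (assumed here; the thesis may use another scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified charge model (assumption): K/R = +1, D/E = -1, all others neutral.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_charge_per_residue(seq):
    """Net charge divided by segment length."""
    return sum(CHARGE.get(aa, 0) for aa in seq) / len(seq)


def mean_hydropathy(seq):
    """Mean Kyte-Doolittle hydropathy; expects only the 20 standard residues."""
    return sum(KD[aa] for aa in seq) / len(seq)


# Toy segments (hypothetical, not from the thesis datasets).
regions = {
    "FB (toy)": "LLKAIFEL",
    "disordered-around-FB (toy)": "EEDSDEEG",
}
for name, seq in regions.items():
    print(name, round(net_charge_per_residue(seq), 3), round(mean_hydropathy(seq), 3))
```

Normalising by segment length keeps regions of different sizes comparable, which matters when the four subsets differ in typical length.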

Chapter 3 is a bioinformatical sequence analysis that attempts to understand the evolutionary trend of methylation, acetylation and ubiquitination (MAU) sites in ordered, disordered and FB regions across 11 evolutionary clades, from the whole eukaryotic domain down to the level of the ape superfamily. This study also aims to determine the enrichment of MAU and other PTM sites in subsets of IDRs such as prion-like domains and FB regions.

1.8 References

1. Fischer, E., Einfluss der configuration auf die wirkung der enzyme. Ber. Dt. Chem. Ges., 1894. 27: p. 2985-2993.
2. Lehninger, A.L., D.L. Nelson, and M.M. Cox, Lehninger principles of biochemistry. 6th ed. 2013, New York: W.H. Freeman.
3. Dunker, A.K., et al., Intrinsically disordered proteins. Journal of Molecular Graphics, 2001. 19(1): p. 26-59.
4. Karush, F., Heterogeneity of the binding sites of bovine serum albumin. J. Am. Chem. Soc., 1950. 72: p. 2705-2713.
5. Koshland, D.E., Application of a Theory of Enzyme Specificity to Protein Synthesis. Proc Natl Acad Sci U S A, 1958. 44(2): p. 98-104.
6. McEwan, I.J., et al., Functional interaction of the c-Myc transactivation domain with the TATA binding protein: Evidence for an induced fit model of transactivation domain folding. Biochemistry, 1996. 35(29): p. 9584-9593.
7. Mazza, C., et al., Large-scale induced fit recognition of an m(7)GpppG cap analogue by the human nuclear cap-binding complex. EMBO Journal, 2002. 21(20): p. 5548-5557.
8. Fletcher, C.M. and G. Wagner, The interaction of eIF4E with 4E-BP1 is an induced fit to a completely disordered protein. Protein Science, 1998. 7(7): p. 1639-1642.
9. Hingerty, B., et al., Neutron diffraction of alpha, beta and gamma cyclodextrins: hydrogen bonding patterns. J Biomol Struct Dyn, 1984. 2(1): p. 249-60.
10. Koshland, D.E., Jr., Enzyme flexibility and enzyme action. J Cell Comp Physiol, 1959. 54: p. 245-58.
11. James, L.C. and D.S. Tawfik, The specificity of cross-reactivity: Promiscuous antibody binding involves specific hydrogen bonds rather than nonspecific hydrophobic stickiness. Protein Science, 2003. 12(10): p. 2183-2193.

12. Dunker, A.K., et al., Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform, 2000. 11: p. 161-71.
13. Uversky, V.N., J.R. Gillespie, and A.L. Fink, Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins: Structure, Function, and Genetics, 2000. 41: p. 415-427.
14. Wright, P.E. and H.J. Dyson, Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 1999. 293: p. 321-331.
15. Dyson, H.J. and P.E. Wright, Coupling of folding and binding for unstructured proteins. Current Opinion in Structural Biology, 2002. 12(1): p. 54-60.
16. Tompa, P., Intrinsically unstructured proteins. Trends Biochem Sci, 2002. 27(10): p. 527-33.
17. Dyson, H.J. and P.E. Wright, Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol, 2005. 6(3): p. 197-208.
18. Li, H.M., A. Rao, and P.G. Hogan, Interaction of calcineurin with substrates and targeting proteins. Trends in Cell Biology, 2011. 21(2): p. 91-103.
19. Oldfield, C.J. and A.K. Dunker, Intrinsically disordered proteins and intrinsically disordered protein regions. Annu Rev Biochem, 2014. 83: p. 553-84.
20. Uversky, V.N., What does it mean to be natively unfolded. Eur. J. Biochem, 2002. 269: p. 2-12.
21. Dunker, A.K. and Z. Obradovic, The protein trinity - linking function and disorder. Nature Biotechnology, 2001. 19.
22. Dunker, A.K., et al., Intrinsic disorder and protein function. Biochemistry, 2002. 41(21): p. 6573-82.
23. Dunker, A.K., C.J. Brown, and Z. Obradovic, Identification and functions of usefully disordered proteins. Adv Protein Chem, 2002. 62: p. 25-49.
24. Oldfield, C.J., et al., Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry, 2005. 44(37): p. 12454-70.

25. Shoemaker, B.A., J.J. Portman, and P.G. Wolynes, Speeding molecular recognition by using the folding funnel: The fly-casting mechanism. Proceedings of the National Academy of Sciences of the United States of America, 2000. 97(16): p. 8868-8873.
26. Tompa, P., Unstructural biology coming of age. Current Opinion in Structural Biology, 2011. 21(3): p. 419-425.
27. Spink, K.E., P. Polakis, and W.I. Weis, Structural basis of the Axin-adenomatous polyposis coli interaction. EMBO J, 2000. 19(10): p. 2270-9.
28. Barrera-Vilarmau, S., P. Obregon, and E. de Alba, Intrinsic order and disorder in the bcl-2 member harakiri: insights into its proapoptotic activity. PLoS One, 2011. 6(6): p. e21413.
29. Babu, M.M., The contribution of intrinsically disordered regions to protein function, cellular complexity, and human disease. Biochem Soc Trans, 2016. 44(5): p. 1185-1200.
30. Zhou, H.X., Intrinsic disorder: signaling via highly specific but short-lived association. Trends Biochem Sci, 2012. 37(2): p. 43-8.
31. Sugase, K., H.J. Dyson, and P.E. Wright, Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature, 2007. 447(7147): p. 1021-5.
32. Wright, P.E. and H.J. Dyson, Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol, 2015. 16(1): p. 18-29.
33. Flock, T., et al., Controlling entropy to tune the functions of intrinsically disordered regions. Curr Opin Struct Biol, 2014. 26: p. 62-72.
34. Iakoucheva, L.M., et al., The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Research, 2004. 32(3): p. 1037-1049.
35. Johnson, L.N. and R.J. Lewis, Structural basis for control by phosphorylation. Chem Rev, 2001. 101(8): p. 2209-42.
36. Vetter, S.W. and E. Leclerc, Phosphorylation of serine residues affects the conformation of the calmodulin binding domain of human protein 4.1. Eur J Biochem, 2001. 268(15): p. 4292-9.

37. Uversky, V.N. and A.K. Dunker, Understanding protein non-folding. Biochim Biophys Acta, 2010. 1804(6): p. 1231-64.
38. Habchi, J., et al., Introducing protein intrinsic disorder. Chem Rev, 2014. 114(13): p. 6561-88.
39. Brown, C.J., et al., Evolution and disorder. Curr Opin Struct Biol, 2011. 21(3): p. 441-6.
40. Moesa, H.A., et al., Chemical composition is maintained in poorly conserved intrinsically disordered regions and suggests a means for their classification. Mol Biosyst, 2012. 8(12): p. 3262-73.
41. Das, R.K., et al., Cryptic sequence features within the disordered protein p27(Kip1) regulate cell cycle signaling. Proceedings of the National Academy of Sciences of the United States of America, 2016. 113(20): p. 5616-5621.
42. Hansen, J.C., et al., Intrinsic protein disorder, amino acid composition, and histone terminal domains. Journal of Biological Chemistry, 2006. 281(4): p. 1853-1856.
43. Lu, X., et al., Chromatin Condensing Functions of the Linker Histone C-Terminal Domain Are Mediated by Specific Amino Acid Composition and Intrinsic Protein Disorder. Biochemistry, 2009. 48(1): p. 164-172.
44. van der Lee, R., et al., Classification of intrinsically disordered regions and proteins. Chem Rev, 2014. 114(13): p. 6589-631.
45. Schanda, P., V. Forge, and B. Brutscher, Protein folding and unfolding studied at atomic resolution by fast two-dimensional NMR spectroscopy. Proceedings of the National Academy of Sciences of the United States of America, 2007. 104(27): p. 11257-11262.
46. Dyson, H.J. and P.E. Wright, Unfolded proteins and protein folding studied by NMR. Chemical Reviews, 2004. 104(8): p. 3607-3622.
47. Kriwacki, R.W., et al., Structural studies of p21Waf1/Cip1/Sdi1 in the free and Cdk2-bound state: conformational disorder mediates binding diversity. Proc Natl Acad Sci U S A, 1996. 93(21): p. 11504-9.

48. Mohan, A., A study of intrinsic disorder and its role in functional proteomics. Indiana University, 2009.
49. Obradovic, Z., et al., Predicting intrinsic disorder from amino acid sequence. Proteins: Structure, Function, and Genetics, 2003.
50. Dosztanyi, Z., B. Meszaros, and I. Simon, ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics, 2009. 25(20): p. 2745-6.
51. Linding, R., et al., Protein disorder prediction: Implications for structural proteomics. Structure, 2003. 11(11): p. 1453-1459.
52. Ward, J.J., et al., The DISOPRED server for the prediction of protein disorder. Bioinformatics, 2004. 20(13): p. 2138-9.
53. Jones, D.T. and D. Cozzetto, DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics, 2015. 31(6): p. 857-63.
54. Iqbal, S. and M.T. Hoque, DisPredict: A Predictor of Disordered Protein Using Optimized RBF Kernel. PLoS One, 2015. 10(10): p. e0141551.
55. Cheng, J.L., M.J. Sweredoski, and P. Baldi, Accurate prediction of protein disordered regions by mining protein structure data. Data Mining and Knowledge Discovery, 2005. 11(3): p. 213-222.
56. Eickholt, J. and J. Cheng, DNdisorder: predicting protein disorder using boosting and deep networks. BMC Bioinformatics, 2013. 14: p. 88.
57. Prilusky, J., et al., FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 2005. 21(16): p. 3435-8.
58. Galzitskaya, O.V., S.O. Garbuzynskiy, and M.Y. Lobanov, FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics, 2006. 22(23): p. 2948-2949.
59. Linding, R., et al., GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res, 2003. 31(13): p. 3701-8.

60. Su, C.T., C.Y. Chen, and C.M. Hsu, iPDA: integrated protein disorder analyzer. Nucleic Acids Res, 2007. 35(Web Server issue): p. W465-72.
61. Dosztanyi, Z., et al., IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 2005. 21(16): p. 3433-3434.
62. Romero, P., et al., Sequence complexity of disordered protein. Proteins, 2001. 42(1): p. 38-48.
63. Obradovic, Z., et al., Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins, 2005. 61 Suppl 7: p. 176-82.
64. Peng, K., et al., Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics, 2006. 7: p. 208.
65. Ishida, T. and K. Kinoshita, PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res, 2007. 35(Web Server issue): p. W460-4.
66. Yang, Z.R., et al., RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics, 2005. 21(16): p. 3369-3376.
67. Zhang, T., et al., Intrinsic Disorder and Semi-disorder Prediction by SPINE-D. Methods Mol Biol, 2017. 1484: p. 159-174.
68. Vullo, A., et al., Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res, 2006. 34(Web Server issue): p. W164-8.
69. Walsh, I., et al., CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs. Nucleic Acids Research, 2011. 39: p. W190-W196.
70. Walsh, I., et al., ESpritz: accurate and fast prediction of protein disorder. Bioinformatics, 2012. 28(4): p. 503-9.

71. Yan, J., et al., RAPID: Fast and accurate sequence-based prediction of intrinsic disorder content on proteomic scale. Biochimica Et Biophysica Acta-Proteins and Proteomics, 2013. 1834(8): p. 1671-1680.
72. Oates, M.E., et al., D(2)P(2): database of disordered protein predictions. Nucleic Acids Res, 2013. 41(Database issue): p. D508-16.
73. Oldfield, C.J., et al., Comparing and combining predictors of mostly disordered proteins. Biochemistry, 2005. 44(6): p. 1989-2000.
74. Uversky, V.N., Protein folding revisited. A polypeptide chain at the folding-misfolding-nonfolding cross-roads: which way to go? Cellular and Molecular Life Sciences, 2003. 60(9): p. 1852-1871.
75. Ward, J.J., et al., Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol, 2004. 337(3): p. 635-45.
76. Dunker, A.K., et al., Intrinsic disorder and protein function. Biochemistry, 2002. 41(21): p. 6573-82.
77. Pancsa, R. and P. Tompa, Structural disorder in eukaryotes. PLoS One, 2012. 7(4): p. e34687.
78. Feng, Z.P., et al., Abundance of intrinsically unstructured proteins in P. falciparum and other apicomplexan parasite proteomes. Mol Biochem Parasitol, 2006. 150(2): p. 256-67.
79. Mohan, A., et al., Intrinsic disorder in pathogenic and non-pathogenic microbes: discovering and analyzing the unfoldomes of early-branching eukaryotes. Mol Biosyst, 2008. 4(4): p. 328-40.
80. Uversky, V.N., Functional roles of transiently and intrinsically disordered regions within proteins. FEBS J, 2015. 282(7): p. 1182-9.
81. Xie, H., et al., Functional anthology of intrinsic disorder. 3. Ligands, post-translational modifications, and diseases associated with intrinsically disordered proteins. J Proteome Res, 2007. 6(5): p. 1917-32.
82. Tusnady, G.E., L. Dobson, and P. Tompa, Disordered regions in transmembrane proteins. Biochim Biophys Acta, 2015. 1848(11 Pt A): p. 2839-48.

83. Mohan, A., et al., Analysis of molecular recognition features (MoRFs). Journal of Molecular Biology, 2006. 362(5): p. 1043-1059.
84. Vacic, V., et al., Characterization of molecular recognition features, MoRFs, and their binding partners. Journal of Proteome Research, 2007. 6(6): p. 2351-2366.
85. Pejaver, V., et al., The structural and functional signatures of proteins that undergo multiple events of post-translational modification. Protein Science, 2014. 23(8): p. 1077-1093.
86. Deribe, Y.L., T. Pawson, and I. Dikic, Post-translational modifications in signal integration. Nature Structural & Molecular Biology, 2010. 17(6): p. 666-672.
87. Duan, G.Y. and D. Walther, The Roles of Post-translational Modifications in the Context of Protein Interaction Networks. PLoS Computational Biology, 2015. 11(2).
88. Bah, A. and J.D. Forman-Kay, Modulation of Intrinsically Disordered Protein Function by Post-translational Modifications. J Biol Chem, 2016. 291(13): p. 6696-705.
89. Peterson, C.L. and M.A. Laniel, Histones and histone modifications. Current Biology, 2004. 14(14): p. R546-R551.
90. Garcia, B.A., et al., Organismal differences in post-translational modifications in histones H3 and H4. Journal of Biological Chemistry, 2007. 282(10): p. 7641-7655.
91. Hansen, J.C., et al., Intrinsic protein disorder, amino acid composition, and histone terminal domains. J Biol Chem, 2006. 281(4): p. 1853-6.
92. Vuzman, D., Y. Hoffman, and Y. Levy, Modulating Protein-DNA Interactions by Post-Translational Modifications at Disordered Regions. Pacific Symposium on Biocomputing 2012, 2012: p. 188-199.
93. Uversky, V.N., p53 Proteoforms and Intrinsic Disorder: An Illustration of the Protein Structure-Function Continuum Concept. Int J Mol Sci, 2016. 17(11).
94. Marks, F., Protein Phosphorylation. VCH Weinheim, New York, Basel, Cambridge, Tokyo, 1996.

95. Kathiriya, J.J., et al., Presence and utility of intrinsically disordered regions in kinases. Mol Biosyst, 2014. 10(11): p. 2876-88.
96. Kathiriya, J.J., et al., Data on evolution of intrinsically disordered regions of the human kinome and contribution of FAK1 IDRs to cytoskeletal remodeling. Data Brief, 2017. 10: p. 315-324.
97. Manning, G., et al., Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci, 2002. 27(10): p. 514-20.
98. Gao, J. and D. Xu, Correlation between posttranslational modification and intrinsic disorder in protein. Pac Symp Biocomput, 2012: p. 94-103.
99. Gao, J.J. and D. Xu, Correlation between Posttranslational Modification and Intrinsic Disorder in Protein. Pacific Symposium on Biocomputing 2012, 2012: p. 94-103.
100. Beltrao, P., et al., Evolution and functional cross-talk of protein post-translational modifications. Mol Syst Biol, 2013. 9: p. 714.
101. Hagai, T., et al., The origins and evolution of ubiquitination sites. Mol Biosyst, 2012. 8(7): p. 1865-77.
102. Kim, D.S. and Y. Hahn, Gains of ubiquitylation sites in highly conserved proteins in the human lineage. BMC Bioinformatics, 2012. 13: p. 306.
103. Kim, D.S. and Y. Hahn, Identification of novel phosphorylation modification sites in human proteins that originated after the human-chimpanzee divergence. Bioinformatics, 2011. 27(18): p. 2494-2501.
104. Nilsson, J., M. Grahn, and A.P. Wright, Proteome-wide evidence for enhanced positive Darwinian selection within intrinsically disordered regions in proteins. Genome Biol, 2011. 12(7): p. R65.
105. Chen, J.W., et al., Conservation of intrinsic disorder in protein domains and families: I. A database of conserved predicted disordered regions. J Proteome Res, 2006. 5(4): p. 879-87.
106. Brown, C.J., A.K. Johnson, and G.W. Daughdrill, Comparing models of evolution for ordered and disordered proteins. Mol Biol Evol, 2010. 27(3): p. 609-21.

107. Bellay, J., et al., Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biology, 2011. 12(2).
108. Kim, P.M., et al., The role of disorder in interaction networks: a structural analysis. Mol Syst Biol, 2008. 4: p. 179.
109. Zarin, T., et al., Selection maintains signaling function of a highly diverged intrinsically disordered region. Proceedings of the National Academy of Sciences of the United States of America, 2017. 114(8): p. E1450-E1459.
110. Brown, C.J., et al., Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol, 2002. 55(1): p. 104-10.
111. Mosca, R., R.A. Pache, and P. Aloy, The Role of Structural Disorder in the Rewiring of Protein Interactions through Evolution. Molecular & Cellular Proteomics, 2012. 11(7).
112. Wilson, B.A., et al., Young Genes are Highly Disordered as Predicted by the Preadaptation Hypothesis of De Novo Gene Birth. Nat Ecol Evol, 2017. 1(6): p. 0146-146.
113. Afanasyeva, A., et al., Human long intrinsically disordered protein regions are frequent targets of positive selection. Genome Res, 2018. 28(7): p. 975-982.
114. Daughdrill, G.W., et al., Dynamic behavior of an intrinsically unstructured linker domain is conserved in the face of negligible amino acid sequence conservation. Journal of Molecular Evolution, 2007. 65(3): p. 277-288.
115. Banerjee, S., S. Chakraborty, and R.K. De, Deciphering the cause of evolutionary variance within intrinsically disordered regions in human proteins. Journal of Biomolecular Structure & Dynamics, 2017. 35(2): p. 233-249.
116. Szalkowski, A.M. and M. Anisimova, Markov models of amino acid substitution to study proteins with intrinsically disordered regions. PLoS One, 2011. 6(5): p. e20488.
117. Midic, U., A.K. Dunker, and Z. Obradovic, Protein sequence alignment and structural disorder: a substitution matrix for an extended alphabet. Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics. New York, NY, USA: ACM, StReBio '09., 2009: p. 2731.
118. Thompson, J.D., et al., A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives. PLoS One, 2011. 6(3).
119. Varadi, M., et al., DisCons: a novel tool to quantify and classify evolutionary conservation of intrinsic protein disorder. BMC Bioinformatics, 2015. 16.
120. Lange, J., L.S. Wyrwicz, and G. Vriend, KMAD: knowledge-based multiple sequence alignment for intrinsically disordered proteins. Bioinformatics, 2016. 32(6): p. 932-6.
121. Uversky, V.N., Intrinsically disordered proteins from A to Z. Int J Biochem Cell Biol, 2011. 43(8): p. 1090-103.
122. Uversky, V.N., Intrinsically disordered proteins and their environment: effects of strong denaturants, temperature, pH, counter ions, membranes, binding partners, osmolytes, and macromolecular crowding. Protein J, 2009. 28(7-8): p. 305-25.
123. Romero, P., et al., Identifying disordered regions in proteins from amino acid sequence. 1997 IEEE International Conference on Neural Networks, Vols 1-4, 1997: p. 90-95.
124. Romero, P., et al., Sequence Complexity of Disordered Protein. Proteins: Structure, Function, and Genetics, 2001. 42.
125. Fukuchi, S., et al., Binary classification of protein molecules into intrinsically disordered and ordered segments. BMC Structural Biology, 2011. 11.
126. Zhang, T., et al., SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. J Biomol Struct Dyn, 2012. 29(4): p. 799-813.
127. Mészáros, B., I. Simon, and Z. Dosztányi, Prediction of Protein Binding Regions in Disordered Proteins. PLoS Comput Biol, 2009. 5(5).

128. He, B., et al., Predicting intrinsic disorder in proteins: an overview. Cell Research, 2009. 19(8): p. 929-949.
129. Radivojac, P., et al., Protein flexibility and intrinsic disorder. Protein Science, 2004. 13(1): p. 71-80.
130. Romero, P., Z. Obradovic, and A.K. Dunker, Sequence Data Analysis for Long Disordered Regions Prediction in the Calcineurin Family. Genome Inform Ser Workshop Genome Inform, 1997. 8: p. 110-124.
131. Lobley, A., et al., Inferring function using patterns of native disorder in proteins. PLoS Computational Biology, 2007. 3(8): p. 1567-1579.
132. Tompa, P. and L. Kalmar, Power Law Distribution Defines Structural Disorder as a Structural Element Directly Linked with Function. Journal of Molecular Biology, 2010. 403(3): p. 346-350.
133. Fuxreiter, M., P. Tompa, and I. Simon, Local structural disorder imparts plasticity on linear motifs. Bioinformatics, 2007. 23(8): p. 950-6.
134. Monika, F.J., et al., Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. Biophysical Journal, 2005. 88(1): p. 560a-560a.
135. Tompa, P., et al., Close encounters of the third kind: disordered domains and the interactions of proteins. Bioessays, 2009. 31(3): p. 328-335.
136. Pentony, M.M. and D.T. Jones, Modularity of intrinsic disorder in the human proteome. Proteins-Structure Function and Bioinformatics, 2010. 78(1): p. 212-221.
137. Edwards, Y.J., et al., Insights into the regulation of intrinsically disordered proteins in the human proteome by analyzing sequence and gene expression data. Genome Biol, 2009. 10(5): p. R50.
138. Gsponer, J. and M.M. Babu, The rules of disorder or why disorder rules. Progress in Biophysics & Molecular Biology, 2009. 99(2-3): p. 94-103.

139. Meng, F., V.N. Uversky, and L. Kurgan, Comprehensive review of methods for prediction of intrinsic disorder and its molecular functions. Cell Mol Life Sci, 2017. 74(17): p. 3069-3090.
140. Pines, J., Cyclins and Cyclin-Dependent Kinases - a Biochemical View. Biochemical Journal, 1995. 308: p. 697-711.
141. Cheng, Y., et al., Mining alpha-helix-forming molecular recognition features with cross species sequence alignments. Biochemistry, 2007. 46(47): p. 13468-77.
142. Disfani, F.M., et al., MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics, 2012. 28(12): p. I75-I83.
143. Fang, C., et al., MFSPSSMpred: identifying short disorder-to-order binding regions in disordered proteins based on contextual local evolutionary conservation. BMC Bioinformatics, 2013. 14.
144. Malhis, N. and J. Gsponer, Computational identification of MoRFs in protein sequences. Bioinformatics, 2015. 31(11): p. 1738-44.
145. Malhis, N., M. Jacobson, and J. Gsponer, MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences. Nucleic Acids Res, 2016. 44(W1): p. W488-93.
146. Yan, J., et al., Molecular recognition features (MoRFs) in three domains of life. Mol Biosyst, 2016. 12(3): p. 697-710.
147. Xue, B., A.K. Dunker, and V.N. Uversky, Retro-MoRFs: identifying protein binding sites by normal and reverse alignment and intrinsic disorder prediction. Int J Mol Sci, 2010. 11(10): p. 3725-47.
148. Sharma, R., et al., OPAL: prediction of MoRF regions in intrinsically disordered protein sequences. Bioinformatics, 2018. 34(11): p. 1850-1858.
149. Mooney, C., et al., Prediction of short linear protein binding regions. J Mol Biol, 2012. 415(1): p. 193-204.
150. Khan, W., et al., Predicting Binding within Disordered Protein Regions to Structurally Characterised Peptide-Binding Domains. PLoS One, 2013. 8(9).

151. Davey, N.E., D.C. Shields, and R.J. Edwards, SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent. Nucleic Acids Res, 2006. 34(12): p. 3546-54.
152. Prytuliak, R., et al., HH-MOTiF: de novo detection of short linear motifs in proteins by Hidden Markov Model comparisons. Nucleic Acids Res, 2017. 45(18): p. 10921.
153. Krystkowiak, I. and N.E. Davey, SLiMSearch: a framework for proteome-wide discovery and annotation of functional modules in intrinsically disordered regions. Nucleic Acids Research, 2017. 45(W1): p. W464-W469.
154. Gouw, M., et al., The eukaryotic linear motif resource - 2018 update. Nucleic Acids Res, 2018. 46(D1): p. D428-D434.
155. Davey, N.E., et al., SLiMPrints: conservation-based discovery of functional motif fingerprints in intrinsically disordered protein regions. Nucleic Acids Res, 2012. 40(21): p. 10628-41.
156. Ren, S., et al., Short Linear Motifs recognized by SH2, SH3 and Ser/Thr Kinase domains are conserved in disordered protein regions. BMC Genomics, 2008. 9.
157. Uversky, V.N., Intrinsically disordered proteins and their (disordered) proteomes in neurodegenerative disorders. Frontiers in Aging Neuroscience, 2015. 7.
158. Uversky, V.N., C.J. Oldfield, and A.K. Dunker, Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys, 2008. 37: p. 215-46.
159. Babu, M.M., et al., Intrinsically Disordered Proteins: Regulation and Disease. Biomolecular Forms and Functions: A Celebration of 50 Years of the Ramachandran Map, 2013: p. 346-361.
160. Uversky, V.N., et al., Unfoldomics of human diseases: linking protein intrinsic disorder with diseases. BMC Genomics, 2009. 10.
161. Oldfield, C.J., et al., Flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners. BMC Genomics, 2008. 9.

41

162. Srinivasan, M. and A.K. Dunker, Proline rich motifs as drug targets in immune mediated disorders. Int J Pept, 2012. 2012: p. 634769. 163. Uversky, V.N., et al., Unfoldomics of human diseases: linking protein intrinsic disorder with diseases. BMC Genomics, 2009. 10 Suppl 1: p. S7. 164. Neira, J.L., et al., Identification of a Drug Targeting an Intrinsically Disordered Protein Involved in Pancreatic Adenocarcinoma. Scientific Reports, 2017. 7. 165. Ambadipudi, S. and M. Zweckstetter, Targeting intrinsically disordered proteins in rational drug discovery. Expert Opinion on Drug Discovery, 2016. 11(1): p. 65-77. 166. Kumar, D., N. Sharma, and R. Giri, Therapeutic Interventions of Cancers Using Intrinsically Disordered Proteins as Drug Targets: c-Myc as Model System. Cancer Informatics, 2017. 16. 167. Maity, B.K., Dynamics Based Drug Design for Intrinsically Disordered Proteins. Biophysical Journal, 2018. 114(3): p. 590a-590a. 168. Metallo, S.J., Intrinsically disordered proteins are potential drug targets. Current Opinion in Chemical Biology, 2010. 14(4): p. 481-488. 169. Zhang, Y.N., et al., Inhibition of Wnt signaling by Dishevelled PDZ peptides. Nature Chemical Biology, 2009. 5(4): p. 217-219. 170. Filippakopoulos, P., et al., Selective inhibition of BET bromodomains. Nature, 2010. 468(7327): p. 1067-1073. 171. Puca, R., et al., Regulation of p53 activity by HIPK2: molecular mechanisms and therapeutical implications in human cancer cells. Oncogene, 2010. 29(31): p. 4378-4387. 172. Souers, A.J., et al., ABT-199, a potent and selective BCL-2 inhibitor, achieves antitumor activity while sparing platelets. Nature Medicine, 2013. 19(2): p. 202- 208. 173. Adams, J.M. and S. Cory, The BCL-2 arbiters of apoptosis and their growing role as cancer targets. Cell Death and Differentiation, 2018. 25(1): p. 27-36. 174. Reed, J.C., Bcl-2 on the brink of breakthroughs in cancer treatment. Cell Death and Differentiation, 2018. 25(1): p. 3-6.

42

CHAPTER II

2 Bioinformatical parsing of folding-on-binding proteins reveals compositional sequence design and evidence for a general guiding mechanism for binding

A version of this chapter was originally published as: Narasumani, M. and Harrison, P. M. Bioinformatical parsing of folding-on-binding proteins reveals their compositional and evolutionary sequence design. Sci. Rep. 5, 18586; doi: 10.1038/srep18586 (2015).

2.1 Abstract

Intrinsic disorder occurs when a protein, or part of a protein, remains unfolded during normal functioning. Intrinsically disordered regions can contain segments that ‘fold on binding’ to another molecule. Here, we perform bioinformatical parsing of human ‘folding-on-binding’ (FB) proteins into four subsets: Ordered regions, FB regions, Disordered regions that surround FB regions (‘Disordered-around-FB’), and Other-Disordered regions. We examined the composition and evolutionary behaviour (across vertebrate orthologs) of these subsets. From a convergence of three separate analyses, we find that for hydrophobicity, Ordered regions segregate from the other subsets, but for conservation the Ordered and FB regions group together as highly conserved, and the Disordered-around-FB and Other-Disordered regions as less conserved (with a smaller, but still significant, difference between Ordered and FB regions). FB regions are highly conserved with net positive charge, whereas Disordered-around-FB regions have net negative charge and are less hydrophobic than FB regions. Indeed, these Disordered-around-FB regions are excessively hydrophilic compared to other disordered regions generally. We describe how our results point towards a possible compositionally-based steering mechanism of folding-on-binding.

2.2 Introduction

Intrinsically disordered regions, in at least one of their functional modes, do not have a well-defined three-dimensional structure under physiological conditions [1]. They are involved in specific functions such as molecular recognition, molecular assembly, protein modification, and entropic chain activities [2]. They are more prevalent in eukaryotes than in prokaryotes [3, 4]. Approximately a third of proteins in eukaryotes are estimated to contain long disordered regions of 30 amino acids or more [3, 5]. These regions are associated with a wide variety of functions, most notably signal transduction, transcription and translation regulation [3, 5]. Disordered regions are characterized experimentally using several approaches, such as analysis of areas with missing electron density in an X-ray determined structure, or NMR spectroscopy. They can also be predicted by algorithms that analyse charge, hydrophobicity, low sequence complexity, amino acid composition and other factors [6-9]. Statistical studies of amino acid sequences show that disordered regions differ significantly from ordered regions [10].

Protein interaction analysis has shown that disordered regions are abundant in proteins with large numbers of interacting partners [11, 12]. Many proteins with disordered regions exhibit coupled folding and binding, which has been shown to be a common process of molecular recognition and plays significant roles in protein function [13, 14].

Such disordered regions, which are termed here ‘folding on binding’ (FB) regions, are highly flexible and exhibit a well-defined structure only upon binding to a specific partner molecule [15]. These regions have been reported to confer high specificity towards a partner molecule [16].

In general, disordered regions are characterised by low hydrophobicity and somewhat higher net charge [17, 18]. However, such trends are not clear for the specific character of FB regions [19, 20]. A study of FB region complexes showed that the interfaces of FB regions are enriched in hydrophobic residues and appear to be more conserved than other disordered regions in the same proteins [21]. IDRs exhibit different accepted point mutations, and show increased rates of insertions and deletions [17, 22, 23]. A comparative study on the evolution of ordered and disordered proteins suggested that disordered proteins evolve more rapidly than ordered proteins [17]. However, this is not always the case: a smaller group of disordered proteins appears to evolve very slowly [23]. Analysis of the evolution of disordered regions has thus yielded contradictory results [22, 24].

Here, we have studied the composition and conservation of proteins that form FB regions in human protein complexes. Specifically, we have parsed these proteins into four subsets of sequence: (i) Ordered regions, (ii) FB regions, (iii) disordered regions around FB regions (‘Disordered-around-FB’), and (iv) Other-Disordered regions in the proteins. We ask whether the composition, and the conservation behaviour across vertebrate orthologs, of these proteins differ significantly between these biophysically relevant subsets. We found a complex pattern of conservation and composition, with all of these regions having significantly different combinations of composition and conservation behaviour. Indeed, ‘Disordered-around-FB’ regions are the least hydrophobic regions and are more evolutionarily variable, and the FB regions are of comparable hydrophobicity to Other-Disordered regions in the proteins. We discuss the mechanistic implications of this compositional sequence design.

2.3 Methods

2.3.1 Data sets

Human experimentally-verified intrinsically disordered protein sequences were retrieved from the IDEAL (Intrinsically Disordered proteins with Extensive Annotations and Literature) database [25, 26] (sequences retrieved in August 2014). The data sets were reduced for sequence redundancy (at the 40% sequence identity level) using the CD-HIT tool [27]. This gave us a total of 99 human intrinsically disordered proteins with FB regions. For some analyses we also used a data set of 134 disordered proteins from DisProt (Database of Protein Disorder), release 6.02 [28]. To make multiple sequence alignments, orthologs of these human proteins in other vertebrates were obtained from the Ensembl BioMart data mining tool [29].

2.3.2 Multiple sequence alignments

Multiple sequence alignments (MSAs) of human intrinsically disordered proteins along with their orthologs from other vertebrates were generated using MUSCLE v3.8.31 [30].

2.3.3 Conservation analysis of the aligned sequences

The position-specific conservation of the aligned protein sequences was calculated using the AL2CO program [31]. This program was used to calculate a conservation index for each position of the human proteins in the MUSCLE multiple sequence alignments. In AL2CO, the amino acid frequencies at each position are estimated and the conservation index is calculated from these frequencies. The entropy-based method of AL2CO was used to calculate the conservation index. This uses sequence information entropy, and calculates the frequency of amino acids by grouping the amino acids with similar physicochemical properties. We think this is suitable for analysing intrinsically disordered regions, since they are compositionally defined regions of a protein sequence.
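To make the idea of a grouped, entropy-based conservation index concrete, the following is a minimal sketch rather than AL2CO itself. The residue grouping (`GROUPS`), the function names and the toy alignment are assumptions introduced for illustration; AL2CO's own grouping, sequence weighting and normalisation are not reproduced here.

```python
import math
from collections import Counter

# Hypothetical physicochemical grouping, for illustration only.
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}
AA_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def column_conservation(column):
    """Negative Shannon entropy of residue groups in one alignment column.

    Gaps and unknown symbols are ignored; 0.0 means the column contains
    a single group, and more negative values mean more variability.
    """
    groups = [AA_TO_GROUP[c] for c in column.upper() if c in AA_TO_GROUP]
    if not groups:
        return float("nan")
    n = len(groups)
    return sum((k / n) * math.log(k / n) for k in Counter(groups).values())


def human_position_scores(aligned, human_index=0):
    """Score each column in which the human sequence has a residue."""
    human = aligned[human_index]
    scores = []
    for col, residue in enumerate(human):
        if residue == "-":
            continue  # skip columns that are gaps in the human sequence
        column = "".join(seq[col] for seq in aligned)
        scores.append(column_conservation(column))
    return scores


# Toy alignment: the human sequence first, then two orthologs.
msa = ["MKDE-L", "MRDEAL", "MKEQ-V"]
scores = human_position_scores(msa)
```

In this toy alignment, K/R and D/E substitutions stay within one group and so score as fully conserved, whereas the E/E/Q column mixes two groups and scores lower.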

2.3.4 Hydrophobicity and Charge calculation

The hydrophobicity of the four parsed regions in human protein sequences was calculated by ProtScale [32] using the Kyte & Doolittle [33] hydrophobicity scale with a window size of 5. The net charge at pH 7.0 was also calculated from the total numbers of positively and negatively charged residues [18]. The absolute value (i.e., the total ‘chargedness’) was also calculated by making all negative values positive (this is presented in Figure 2.3A).
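The two quantities described above can be sketched as follows. This is an illustrative reimplementation, not ProtScale: the function names are hypothetical, the window is unweighted, and the charge model (K and R as +1, D and E as -1, histidine treated as neutral at pH 7.0) is an assumption.

```python
# Kyte & Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumed per-residue charges at pH 7.0 (histidine treated as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def windowed_hydropathy(seq, window=5):
    """Mean Kyte-Doolittle value of every full window along the sequence."""
    values = [KD[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def net_charge(seq):
    """Number of positive residues minus number of negative residues."""
    return sum(CHARGE.get(aa, 0) for aa in seq)


def mean_net_charge(seq, absolute=False):
    """Net charge per residue; absolute=True gives the 'chargedness'."""
    q = net_charge(seq) / len(seq)
    return abs(q) if absolute else q


profile = windowed_hydropathy("AAAAAK")   # two windows of length 5
raw = mean_net_charge("DDEEK")            # raw mean net charge
chargedness = mean_net_charge("DDEEK", absolute=True)
```

The raw value keeps the sign of the charge, as used for Figure 2.3B, while the absolute value corresponds to the quantity plotted in Figure 2.3A.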


2.4 Results and Discussion

2.4.1 Overview of the data sets

We parsed the 99 human proteins containing FB regions that are the subject of this study into the following four sets of regions: (i) ‘Ordered’ protein domains; (ii) folding-on-binding regions (‘FB’ set); (iii) the intrinsically disordered regions around FB regions (‘Disordered-around-FB’ regions), and (iv) intrinsically disordered regions that do not contain FB regions (‘Other-Disordered’ regions). The Ordered region set comprises experimentally verified structures that do not have a known alternative intrinsically disordered state. The Disordered-around-FB and Other-Disordered regions are only experimentally reported as intrinsically disordered. The FB regions contain experimentally determined structure in bound form to their partner molecule, as well as being shown to be intrinsically disordered at other times. These data sets are compared for their trends in composition and conservation, as populations of sequences, using the pipeline of methods detailed in Figure 2.1. The conservation of the four parsed region types across vertebrate evolution was analysed, and a conservation score calculated (as detailed in Methods). An example of the parsing of a sequence into the four subsets is shown for human parathyroid hormone-like protein (Figure 2.2), with the same colour scheme as Figure 2.1.
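The parsing into four region types can be illustrated with a short sketch. It rests on one reading of the definitions above, which is an assumption: a disordered segment that overlaps an FB region contributes its non-FB residues to Disordered-around-FB, and a disordered segment with no FB region is Other-Disordered. The function names and the coordinates (1-based, inclusive) are hypothetical.

```python
def parse_regions(length, ordered, disordered, fb):
    """Label each residue as one of the four region types (or None).

    ordered, disordered and fb are lists of (start, end) intervals,
    1-based and inclusive. Assumed rule: disordered segments overlapping
    an FB interval become 'Disordered-around-FB' outside the FB itself.
    """
    labels = [None] * length
    for start, end in ordered:
        for i in range(start - 1, end):
            labels[i] = "Ordered"
    for start, end in disordered:
        has_fb = any(f_start <= end and f_end >= start for f_start, f_end in fb)
        tag = "Disordered-around-FB" if has_fb else "Other-Disordered"
        for i in range(start - 1, end):
            labels[i] = tag
    for start, end in fb:  # FB labels take precedence over disorder
        for i in range(start - 1, end):
            labels[i] = "FB"
    return labels


def to_segments(labels):
    """Collapse per-residue labels into (label, start, end) runs."""
    segments = []
    for pos, label in enumerate(labels, start=1):
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], pos)
        else:
            segments.append((label, pos, pos))
    return segments


# Invented annotation for a 100-residue protein.
labels = parse_regions(
    100,
    ordered=[(1, 40)],
    disordered=[(41, 80), (85, 100)],
    fb=[(55, 65)],
)
segments = to_segments(labels)
```

Residues with no annotation stay unlabelled (`None`), so they do not enter any of the four populations.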

Figure 2.1 Pipeline of the analysis performed. (A) The four parsed datasets are represented: (i) Ordered set, (ii) Disordered set, (iii) folding-on-binding regions (‘FB’ set), and (iv) the disordered-around-FB regions (DFB). (B) The sequence analysis performed on the four parsed datasets is highlighted.

Figure 2.2 Example alignment of a parsed protein. Multiple sequence alignment of human parathyroid hormone-like protein and its vertebrate orthologs, depicted using JalView, showing the four region types. This figure uses the same colour scheme as Figure 2.1.

2.4.2 Analysis of Ordered, Disordered, FB and Disordered around FB regions as populations of sequences

Firstly, we asked whether we could distinguish the four region types according to their broad compositional characteristics. Since IDRs exhibit distinct amino acid composition, ordered and disordered regions can be classified based on their net charge and hydrophobicity. Specifically, we wish to understand whether the composition of the three parsed disordered region types is different. Comparison of mean hydrophobicity and mean net charge of the four parsed region types is shown in Figure 2.3A, B. For the first plot, we use the absolute value of the mean net charge (Figure 2.3A), and for the second plot the raw mean net charge value (Figure 2.3B; see Methods for details). In these plots we only consider longer tracts, ≥20 residues. In line with a previous study [18], the Ordered subset stands out as more hydrophobic than the three other region types. We fitted lines (as described in the figure legend) that give us optimum discrimination (>95%) of the Ordered subset from the Other-Disordered set. The black and red lines represent the two extremes of slope for such fitted boundary lines (Figure 2.3A). In Figure 2.3A, the other three sets scatter on either side of the lines and are not well segregated (24%–46% on the other side of the line). In Figure 2.3B, using the raw value of the mean net charge, while the two disordered sets are not well discriminated from the Ordered set (39–50%), the FB regions segregate better with the Ordered set (74% on same side of the line).

Figure 2.3 Analysis of the four region types as populations of sequences. [Panels A and B: scatter plots, with box plots, of mean net charge versus mean hydrophobicity for the Ordered, Folding on Binding (FB), Disordered around FB and Other Disordered datasets.]

Only fragments ≥20 residues in length are used in the plots. The values of mean hydrophobicity and mean conservation score are normalized to the range [0, 1]. (A) Mean hydrophobicity versus mean net-charge (absolute value). Lines were fitted to discriminate between Ordered and Other-Disordered regions by iterative Monte Carlo sampling of a wide range of intercept and slope values. The two lines (red and black) represent the two extremes of slope that give the same best percentage discrimination of Ordered regions (100%) (equations C = 1.21 H – 0.34, and C = 0.47 H – 0.06, where C is the mean net charge and H is the mean hydrophobicity, in the fragments). Here the absolute value of the mean net-charge is used (i.e., negative values are made positive). Box plots are drawn using the same colour coding as the main scatter plot. The whiskers extend from the hinge to the highest/lowest values that are within 1.5 * IQR of the hinge, where IQR is the inter-quartile range, or distance between the first and third quartiles. (B) Mean hydrophobicity versus mean net-charge (raw value). Lines were fitted as above in (A). The two lines (red and black) represent the two extremes of slope that give the same best percentage discrimination of Ordered regions (94%) (equations C = 0.11 H – 0.11, and C = 0.05 H – 0.08, where C is the mean net charge and H is the mean hydrophobicity, in the fragments).
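The line fitting described in the Figure 2.3 legend can be sketched as a random search over slope and intercept. This is a minimal illustration under stated assumptions, not the procedure actually used: the sampling ranges, the scoring rule (fraction of Ordered points below and disordered points above the line C = slope × H + intercept), the function names and the toy data are all hypothetical.

```python
import random


def discrimination(slope, intercept, ordered, disordered):
    """Fraction of points on the expected side of C = slope * H + intercept.

    Assumed convention: Ordered points should fall below the line and
    disordered points above it. Points are (H, C) pairs.
    """
    below = sum(1 for h, c in ordered if c < slope * h + intercept)
    above = sum(1 for h, c in disordered if c > slope * h + intercept)
    return (below + above) / (len(ordered) + len(disordered))


def sample_boundaries(ordered, disordered, n_samples=5000, seed=0):
    """Randomly sample lines; return the best score and the lines with
    the smallest and largest slope among those reaching that score."""
    rng = random.Random(seed)
    best_score, best_lines = -1.0, []
    for _ in range(n_samples):
        slope = rng.uniform(-2.0, 2.0)       # assumed sampling range
        intercept = rng.uniform(-1.0, 1.0)   # assumed sampling range
        score = discrimination(slope, intercept, ordered, disordered)
        if score > best_score:
            best_score, best_lines = score, [(slope, intercept)]
        elif score == best_score:
            best_lines.append((slope, intercept))
    return best_score, min(best_lines), max(best_lines)


# Invented (mean hydrophobicity, mean net charge) points.
ordered_pts = [(0.60, 0.05), (0.70, 0.10), (0.65, 0.02)]
disordered_pts = [(0.30, 0.40), (0.20, 0.30), (0.35, 0.50)]
score, shallowest, steepest = sample_boundaries(ordered_pts, disordered_pts)
```

Returning the two extreme slopes among equally good lines mirrors the red and black lines in the figure, which bracket the family of boundaries with the same discrimination.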

A plot of hydrophobicity versus region length shows that a single length threshold effectively segregates Ordered regions from the three other parsed subsets, which are intermingled (81% discrimination of Ordered set, >85% for other three sets on the other side of the line; Figure 2.4A). Finally, an almost horizontal boundary line was found to discriminate effectively the Ordered and Other-Disordered regions (Figure 2.4B), with the Ordered set pulling the FB regions with them (93% correct discrimination for Ordered, 62% for FB regions), and the Other-Disordered set pulling the Disordered-around-FB regions with them (85% for Other-Disordered, 82% for Disordered-around-FB regions).

Figure 2.4 Analysis of the four region types as populations of sequences. [Panel A: scatter plot of amino acid length versus mean hydrophobicity. Panel B: scatter plot of mean conservation score versus mean hydrophobicity. Both panels show the Ordered, Folding on Binding (FB), Disordered around FB and Other Disordered datasets.]

Only fragments ≥20 residues in length are used in the plots. The values of mean hydrophobicity and mean conservation score are normalized to the range [0, 1]. (A) Mean hydrophobicity versus length. The colour scheme is as for Figure 2.3. A simple length threshold of region length = 93 was found to be the best boundary between Ordered and Other-Disordered regions; the same line was also optimal for discriminating between Ordered and either Disordered-around-FB or FB regions. (B) Mean conservation score versus mean hydrophobicity. The colour scheme is as for part (A). An almost horizontal line was found to be the best boundary between Ordered and Other-Disordered regions (equation S = 0.01 H + 0.59, where S is the mean conservation score and H is the mean hydrophobicity, in the fragments). Box plots are drawn using the same colour coding as the main scatter plot (see Figure 2.3 legend for details).

Thus, Ordered regions are distinguished from the other region types by their hydrophobicity and length, whereas a clearer segregation of Ordered along with FB regions (versus Disordered-around-FB along with Other-Disordered regions) is achieved when conservation is considered.

2.4.3 Further analysis of compositional differences between the Ordered, Disordered, FB and Disordered around FB parsed subsets

The distribution of hydrophobicity and net charge for the populations of residues in the four parsed subsets (shown in Figure 2.5A, B) was analysed for significant differences (Tables 2.1 to 2.4). This analysis includes the data for shorter sequence tracts (<20 residues in length).

Figure 2.5 Trends in composition and conservation for the four parsed region types. [Panels A to C: histograms showing the percentage of residues in each bin for the Ordered, Folding on Binding (FB), Disordered around FB and Other Disordered subsets.]

(A) Histogram of charge for the total set of residues for the four subsets. The colour scheme is: Ordered, blue (total = 17868); Other-Disordered, red (2040); FB, green (3205); Disordered-around-FB, orange (2936). Percentages are shown. (B) Histogram of hydrophobicity for the total set of residues for the four subsets. The colour scheme is the same as part (A). (C) Histogram of conservation score for the total set of residues for the four subsets. The colour scheme is the same as part (A).

Table 2.1 Comparison of the hydrophobicities of the parsed subsets.

Datasets                                    P-value*
Ordered vs Other-Disordered                 <0.0001
Ordered vs FB                               <0.0001
Ordered vs Disordered-around-FB             <0.0001
Other-Disordered vs FB                      NS†
Other-Disordered vs Disordered-around-FB    <0.0001
FB vs Disordered-around-FB                  <0.0001

*P-values for Wilcoxon rank-sum test.
†Not significant.
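For readers unfamiliar with the test behind the P-values in Tables 2.1, 2.3 and 2.5, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test using the normal approximation with mid-ranks and a tie correction. The function name and the toy values are hypothetical, and the software actually used for the thesis analysis is not stated in the text.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Tied values receive the average of the ranks they span, and the
    variance is corrected for ties. Returns (z, p_value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i                             # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (rank_sum_x - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Invented per-residue values for two subsets.
z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

The tie correction matters for data such as per-residue charge, which takes only a few distinct values and therefore produces many tied ranks.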


Table 2.2 Mean hydrophobicity values of the four region types.

Subset                   Mean*
Ordered                  -0.3219 (±1.373)
Other-Disordered         -0.867 (±1.278)
FB                       -0.834 (±1.326)
Disordered-around-FB     -1.026 (±1.178)

*Sample size: 17869 (Ordered), 2036 (Other-Disordered), 3201 (FB), 2932 (Disordered-around-FB).

Table 2.3 Comparison of the net charges of the parsed subsets.

Datasets                                    P-value*
Ordered vs Other-Disordered                 <0.0001
Ordered vs FB                               NS†
Ordered vs Disordered-around-FB             <0.0001
Other-Disordered vs FB                      <0.0001
Other-Disordered vs Disordered-around-FB    NS†
FB vs Disordered-around-FB                  <0.0001

*P-values for Wilcoxon rank-sum test.
†Not significant.

Table 2.4 Mean net-charge values of the parsed subsets.

Subset                   Mean*
Ordered                  0.004 (±0.508)
Other-Disordered         -0.060 (±0.502)
FB                       0.020 (±0.559)
Disordered-around-FB     -0.045 (±0.493)

*Sample size: 17869 (Ordered), 2036 (Other-Disordered), 3201 (FB), 2932 (Disordered-around-FB).

In composite, the results for hydrophobicity (Tables 2.1 and 2.2) indicate the following significant trend:

Ordered > (Other-Disordered ~ FB) > Disordered-around-FB

Thus, Disordered-around-FB regions are distinctly the most hydrophilic parsed subset, with FB regions, in general, approximately as hydrophobic as Other-Disordered regions in the same sequences. It has been observed previously that the interfaces of proteins that undergo disorder-to-order transition are more hydrophobic [34, 35], as is generally observed in protein-protein interactions [36]. However, it has also been suggested that the polar and charged amino acids present in FB proteins play a major role in interacting with the partner molecules [37], thus leading to an overall hydrophobicity in FB regions that is here indistinguishable from that of other disordered tracts; however, the Disordered-around-FB regions are clearly distinct in composition from the FB regions.

The total net charge of each of the four datasets was calculated at pH 7 (Figure 2.5A). In composite, the results for net charge (Tables 2.3 and 2.4) indicate a significant trend, summarized by the following inequality:

(Ordered ~ FB) > (Disordered-around-FB ~ Other-Disordered)

Thus, regions that can be structured (Ordered and FB) have overall positive charge, whereas the other sets have negative charge overall. If we examine the prevalences of the twenty amino acids in the four subsets, there are some distinctive trends for each subset (Figure 2.6); the Disordered-around-FB regions have a pronounced preference for T, S, G and P, with the Other-Disordered regions having a similar, less pronounced preference for S, G and P. Glycine and proline residues control the flexibility of the polypeptide chain, and so areas rich in these residues may be designed to bend or deform in specific ways.

Figure 2.6 Comparison of the overall amino-acid composition of the four region types. The amino acid composition of Ordered, Disordered, FB and Disordered-around-FB (DFB) regions is represented. [Bar chart: percentage of each amino acid (R, K, D, E, C, Q, H, S, T, Y, N, A, I, L, M, V, W, F, G, P) in each region type.]

2.4.4 Complex pattern of sequence conservation in FB-containing proteins

The distribution of conservation scores (shown in Figure 2.5C) was analysed for significant trends (Tables 2.5 and 2.6). In composite, we get the following overall tendency for conservation:

Ordered > FB > (Disordered-around-FB ~ Other-Disordered)

Thus, FB regions are distinctly a highly conserved set, but not as highly conserved as the Ordered set. The Disordered-around-FB and Other-Disordered regions are the most evolutionarily variable (Tables 2.5 and 2.6).

Table 2.5 Comparison of the conservation scores of the parsed subsets.

Datasets                                    P-value*
Ordered vs Other-Disordered                 <0.0001
Ordered vs FB                               0.031
Ordered vs Disordered-around-FB             <0.0001
Other-Disordered vs FB                      <0.0001
Other-Disordered vs Disordered-around-FB    NS†
FB vs Disordered-around-FB                  <0.0001

*P-values for Wilcoxon rank-sum test.
†Not significant.

Table 2.6 Mean conservation score values of the parsed region types.

Subset                   Mean*
Ordered                  0.278 (±0.916)
Other-Disordered         -0.368 (±1.050)
FB                       0.234 (±1.021)
Disordered-around-FB     -0.310 (±0.986)

*Sample size: 17869 (Ordered), 2036 (Other-Disordered), 3201 (FB), 2932 (Disordered-around-FB).

2.4.5 Sampling analysis of parsed subsets

We also analysed the parsed FB subset as a sample of larger total ordered and disordered sets (Table 2.7). We examined the FB set as a sample of the total ordered regions (Ordered + FB), and also as a sample of the total disordered regions (FB + Disordered-around-FB + Other-Disordered). The results are in agreement with the analyses performed above, with the FB regions being very distinctive among the total disordered set for conservation (<0.1% of the random samples are more conserved) and net charge (<0.1% are more positively charged), and for hydrophobicity in the total ordered set (<0.1% are less hydrophobic).

Table 2.7 FB set as sample of total ordered and disordered sets.

Sampling*                   Ranking of the mean of each quality for the original set in the list of samples**

Conservation
  FB in total ordered       21.6 percentile
  FB in total disordered    99.9 percentile
Hydrophobicity
  FB in total ordered       0.1 percentile
  FB in total disordered    87.8 percentile
Charge
  FB in total ordered       89.5 percentile
  FB in total disordered    99.9 percentile

*Total ordered = Ordered + FB; total disordered = Disordered-around-FB + FB + Other-Disordered.

**10,000 samples of the same distribution of region lengths as observed for the FB set were taken from each total population of ordered and disordered regions. The ranking for the mean value of the original FB subset in the list of samples is expressed as a percentile, i.e. at the 5th percentile, 5% of the samples are less conserved, less hydrophobic or less positively charged.
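The sampling procedure in the Table 2.7 footnote can be sketched as follows. This is an illustration under assumptions, not the thesis code: each sampled fragment is taken as a random window from a randomly chosen region that is at least as long as the required length, and the function names and toy data are hypothetical.

```python
import random


def draw_sample_mean(population, lengths, rng):
    """Mean per-residue value of one random sample of fragments.

    population: list of regions, each a list of per-residue values.
    lengths: the region lengths observed for the FB set.
    Assumes every length can be matched by at least one region.
    """
    total, count = 0.0, 0
    for length in lengths:
        candidates = [region for region in population if len(region) >= length]
        region = rng.choice(candidates)
        start = rng.randrange(len(region) - length + 1)
        total += sum(region[start:start + length])
        count += length
    return total / count


def percentile_rank(observed_mean, population, lengths, n_samples=10000, seed=0):
    """Percentage of random samples whose mean is below the observed mean."""
    rng = random.Random(seed)
    lower = sum(
        1
        for _ in range(n_samples)
        if draw_sample_mean(population, lengths, rng) < observed_mean
    )
    return 100.0 * lower / n_samples


# Invented per-residue conservation scores for two regions.
population = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8, 0.9]]
fb_lengths = [2, 3]
rank = percentile_rank(0.55, population, fb_lengths, n_samples=1000)
```

Matching the length distribution of the FB set keeps the comparison fair, since the mean of a short fragment varies more than the mean of a long one.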

2.4.6 A possible guidance mechanism during FB folding-on-binding with protein interaction partners

FB regions have high conservation and slight net positive charge, with contiguous disordered regions having low conservation and slight net negative charge and excessive hydrophilicity. Indeed, the Disordered-around-FB regions are excessively hydrophilic compared to the Other-Disordered regions. It is interesting that these results parallel analyses of conserved areas in protein-protein interfaces, which tend to be more hydrophobic than non-conserved parts [36].

Our results suggest a possible guidance mechanism for FB regions, wherein excessively hydrophilic Disordered-around-FB regions steer the FB region towards the binding site of its interaction partner by lessening the occurrence of off-target interactions, thus facilitating the folding-on-binding [38-40]. Such an electrostatic steering mechanism has been shown experimentally and by simulation for the binding of the cell cycle regulator p27 to cyclin A [41, 42]. The positive charge in the FB region is likely due to the charge character of the binding partners, or to specific functional design. Indeed, fourteen of the FB regions analysed are for binding DNA/RNA (which are negatively charged), and a further eleven FB regions are nuclear localization signals, which are positively charged for their specific function. In some cases, disordered regions are also observed between FB regions. For example, cancer susceptibility candidate gene 3 protein (CASC3), a core component of the exon junction complex (EJC), exhibits intrinsic disorder within the region (residues 136-283) involved in RNA binding. In this functional segment, a short disordered region is located between the two FB regions [43]. This region could act as a flexible linker to prevent the steric hindrance of the structured region. Thus, the polar nature of the Disordered-around-FB regions suggests a significant role in mediating the binding of FB regions to the partner molecule.

We performed enrichment analysis of Gene Ontology molecular function categories, using GOrilla [44]. Indeed, the proteins with FB regions are significantly enriched for nucleic acid binding (GO:0003676, corrected P-value = 0.0074) and DNA binding (GO:0003677, corrected P-value = 0.018, using a non-redundant DisProt set as background population), which is consistent with the positive charge of the FB regions. It has been previously shown that the charge in disordered regions correlates with molecular function [44].
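As a rough illustration of what an enrichment P-value for a target set against a background set measures, the sketch below computes a one-sided hypergeometric tail probability. It is an assumption-laden stand-in, not GOrilla's implementation: GOrilla's own statistics and its multiple-testing correction are not reproduced, and the function name and all counts are invented.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_target, n_hits):
    """P(X >= n_hits) for a hypergeometric draw.

    n_background: proteins in the background population.
    n_annotated:  background proteins carrying the GO term.
    n_target:     proteins in the target set (drawn from the background).
    n_hits:       target proteins carrying the GO term.
    """
    upper = min(n_target, n_annotated)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_target - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / comb(n_background, n_target)


# Invented counts, for illustration only.
p_enriched = hypergeom_enrichment_p(
    n_background=200, n_annotated=40, n_target=50, n_hits=20
)
```

A small tail probability means that the target set carries the term more often than expected from random draws out of the background; correction for testing many GO terms would be applied on top of this raw value.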


2.5 Concluding remarks

We performed a bioinformatical parsing of folding-on-binding proteins into four distinct region types: Ordered, folding-on-binding (FB), Disordered-around-FB, and Other-Disordered. From a convergence of three separate analyses (treating the sets as fragments, as populations of residues and as samples of fragments from populations), we observe that compositionally, the Ordered regions segregate as more hydrophobic than the three other region types, but that in terms of conservation, the Ordered and FB regions tend to group together and the Disordered-around-FB and Other-Disordered regions with each other, although a smaller, still significant difference remains between the Ordered and FB sets. We described how our results point towards a possible compositionally-based steering mechanism of FB region folding-on-binding. Simulation studies of coupled folding and binding of disordered regions have been shown to improve the understanding of this process [45, 46]. Hence, further experimental and simulation work is required to investigate this hypothesis.

2.6 References

1. Wright, P.E. and H.J. Dyson, Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol, 1999. 293: p. 321-331.
2. Dunker, A.K., et al., Intrinsic disorder and protein function. Biochemistry, 2002. 41(21): p. 6573-82.
3. Ward, J.J., et al., Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol, 2004. 337(3): p. 635-45.
4. Tompa, P., Intrinsically unstructured proteins. Trends Biochem Sci, 2002. 27(10): p. 527-33.
5. Xie, H., et al., Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. J Proteome Res, 2007. 6(5): p. 1882-98.
6. Peng, K., et al., Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics, 2006. 7: p. 208.
7. Obradovic, Z., et al., Predicting intrinsic disorder from amino acid sequence. Proteins: Structure, Function, and Genetics, 2003.
8. Jones, D.T. and J.J. Ward, Prediction of disordered regions in proteins from position specific score matrices. Proteins: Structure, Function, and Bioinformatics, 2003. 53: p. 573-578.
9. Romero, P., et al., Sequence complexity of disordered protein. Proteins: Structure, Function, and Genetics, 2001. 42.
10. Radivojac, P., et al., Intrinsic disorder and functional proteomics. Biophys J, 2007. 92(5): p. 1439-56.

11. Dosztanyi, Z., et al., Disorder and sequence repeats in hub proteins and their implications for network evolution. Journal of Proteome Research, 2006. 5(11): p. 2985-2995.
12. Haynes, C., et al., Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Computational Biology, 2006. 2(8): p. 890-901.
13. Dyson, H.J. and P.E. Wright, Coupling of folding and binding for unstructured proteins. Current Opinion in Structural Biology, 2002. 12(1): p. 54-60.
14. Shoemaker, B.A., J.J. Portman, and P.G. Wolynes, Speeding molecular recognition by using the folding funnel: The fly-casting mechanism. Proceedings of the National Academy of Sciences of the United States of America, 2000. 97(16): p. 8868-+.
15. Uversky, V.N. and A.K. Dunker, Understanding protein non-folding. Biochim Biophys Acta, 2010. 1804(6): p. 1231-64.
16. Wright, P.E. and H.J. Dyson, Linking folding and binding. Curr Opin Struct Biol, 2009. 19(1): p. 31-8.
17. Brown, C.J., et al., Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol, 2002. 55(1): p. 104-10.
18. Uversky, V.N., J.R. Gillespie, and A.L. Fink, Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins: Structure, Function, and Genetics, 2000. 41: p. 415–427.
19. Sotomayor-Perez, A.C., D. Ladant, and A. Chenal, Disorder-to-Order Transition in the CyaA Toxin RTX Domain: Implications for Toxin Secretion. Toxins, 2015. 7(1): p. 1-20.
20. Forman-Kay, J.D. and T. Mittag, From sequence and forces to structure, function, and evolution of intrinsically disordered proteins. Structure, 2013. 21(9): p. 1492-9.
21. Meszaros, B., et al., Molecular principles of the interactions of disordered proteins. Journal of Molecular Biology, 2007. 372(2): p. 549-561.

22. Brown, C.J., A.K. Johnson, and G.W. Daughdrill, Comparing models of evolution for ordered and disordered proteins. Mol Biol Evol, 2010. 27(3): p. 609-21. 23. Brown, C.J., et al., Evolution and disorder. Curr Opin Struct Biol, 2011. 21(3): p. 441-6. 24. Szalkowski, A.M. and M. Anisimova, Markov models of amino acid substitution to study proteins with intrinsically disordered regions. PLoS One, 2011. 6(5): p. e20488. 25. Fukuchi, S., et al., IDEAL: Intrinsically Disordered proteins with Extensive Annotations and Literature. Nucleic Acids Res, 2012. 40(Database issue): p. D507-11. 26. Fukuchi, S., et al., IDEAL in 2014 illustrates interaction networks composed of intrinsically disordered proteins and their binding partners. Nucleic Acids Res, 2014. 42(Database issue): p. D320-5. 27. Huang, Y., et al., CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 2010. 26(5): p. 680-682. 28. Sickmeier, M., et al., DisProt: the Database of Disordered Proteins. Nucleic Acids Res, 2007. 35(Database issue): p. D786-93. 29. Flicek, P., et al., Ensembl 2014. Nucleic Acids Res, 2014. 42(Database issue): p. D749-55. 30. Edgar, R.C., MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 2004. 5: p. 113. 31. Grishin, J.P.a.N.V., AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics, 2001. 17(8). 32. Wilkins, M.R., et al., Protein identification and analysis tools in the ExPASy server. Methods Mol Biol, 1999. 112: p. 531-52. 33. Kyte, J. and R.F. Doolittle, A simple method for displaying the hydropathic character of a protein. J Mol Biol, 1982. 157(1): p. 105-32.

74

34. Gunasekaran, K., C.J. Tsai, and R. Nussinov, Analysis of ordered and disordered protein complexes reveals structural features discriminating between stable and unstable monomers. J Mol Biol, 2004. 341(5): p. 1327-41. 35. Vacic, V., et al., Characterization of molecular recognition features, MoRFs, and their binding partners. J Proteome Res, 2007. 6(6): p. 2351-66. 36. Guharoy, M. and P. Chakrabarti, Conserved residue clusters at protein-protein interfaces and their use in binding site identification. BMC Bioinformatics, 2010. 11: p. 286. 37. Wong, E.T., D. Na, and J. Gsponer, On the importance of polar interactions for complexes containing intrinsically disordered proteins. PLoS Comput Biol, 2013. 9(8): p. e1003192. 38. Kissinger, C.R., et al., Crystal structures of human calcineurin and the human FKBP12-FK506-calcineurin complex. Nature, 1995. 378(6557): p. 641-4. 39. Uversky, V.N., Multitude of binding modes attainable by intrinsically disordered proteins: a portrait gallery of disorder-based complexes. Chem Soc Rev, 2011. 40(3): p. 1623-34. 40. Romero, Obradovic, and K. Dunker, Sequence Data Analysis for Long Disordered Regions Prediction in the Calcineurin Family. Genome Inform Ser Workshop Genome Inform, 1997. 8: p. 110-124. 41. Ganguly, D., et al., Electrostatically accelerated coupled binding and folding of intrinsically disordered proteins. J Mol Biol, 2012. 422(5): p. 674-84. 42. Ganguly, D., W. Zhang, and J. Chen, Electrostatically accelerated encounter and folding for facile recognition of intrinsically disordered proteins. PLoS Comput Biol, 2013. 9(11): p. e1003363. 43. Nielsen, K.H., et al., Mechanism of ATP turnover inhibition in the EJC. RNA, 2009. 15(1): p. 67-75. 44. Eden, E., et al., GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics, 2009. 10: p. 48.

75

45. Verkhivker, G.M., et al., Simulating disorder-order transitions in molecular recognition of unstructured proteins: where folding meets binding. Proc Natl Acad Sci U S A, 2003. 100(9): p. 5148-53. 46. Chen, T., J. Song, and H.S. Chan, Theoretical perspectives on nonnative interactions and intrinsic disorder in protein folding and binding. Curr Opin Struct Biol, 2015. 30: p. 32-42.

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).


2.7 Connecting Text for Chapter 2 to Chapter 3

In Chapter 2, I performed a bioinformatical evolutionary analysis of the different types of intrinsically disordered regions in human FB proteins. FB and DFB regions showed interesting compositional and conservation trends when compared with other types of disordered region. I observed that FB regions have a mild positive charge and relatively high conservation. These results motivated me to use similar evolutionary methods to study trends of PTMs in disordered regions relative to ordered regions (PTMs are known to induce disorder-to-order transitions). In Chapter 3, I examined the major PTMs, namely methylation, acetylation and ubiquitination (MAU) sites, in FB regions across 380 eukaryotic species. However, the experimental data available for FB proteins are limited; hence, I expanded the study by performing a large-scale conservation analysis of MAU sites in ordered and disordered regions generally.


CHAPTER III

3 Discerning evolutionary trends in post-translational modification and the effect of intrinsic disorder: Analysis of methylation, acetylation and ubiquitination sites in human proteins

A version of this chapter has been submitted as:

Narasumani, M. and Harrison, P. M. Discerning evolutionary trends in post-translational modification and the effect of intrinsic disorder: Analysis of methylation, acetylation and ubiquitination sites in human proteins. Accepted in PLOS Computational Biology


3.1 Abstract

Intrinsically disordered regions (IDRs) of proteins play significant biological roles despite lacking a well-defined 3D structure. For example, IDRs provide efficient housing for large numbers of post-translational modification (PTM) sites in eukaryotic proteins. Here, we study the distribution of more than 15,000 experimentally determined human methylation, acetylation and ubiquitination sites (collectively termed ‘MAU’ sites) in ordered and disordered regions, and analyse their conservation across 380 eukaryotic species. Conservation signals for the maintenance and novel emergence of MAU sites are examined at 11 evolutionary levels from the whole eukaryotic domain down to the ape superfamily, in both ordered and disordered regions. We discover that MAU PTM is a major driver of conservation for arginines and lysines in both ordered and disordered regions, across the 11 levels, most significantly across the mammalian clade. Conservation of human methylatable arginines is very strongly favoured for ordered regions rather than for disordered, whereas methylatable lysines are conserved in either set of regions, and conservation of acetylatable and ubiquitinatable lysines is favoured in disordered over ordered. Notably, we find evidence for the emergence of new lysine MAU sites in disordered regions of proteins in deuterostomes and mammals, and in ordered regions after the dawn of eutherians. For histones specifically, MAU sites demonstrate an idiosyncratic significant conservation pattern that is evident since the last common ancestor of mammals. Similarly, folding-on-binding (FB) regions are highly enriched for MAU sites relative to either ordered or disordered regions, with ubiquitination sites in FBs being highly conserved at all evolutionary levels back as far as mammals. This investigation clearly demonstrates the complex patterns of PTM evolution across the human proteome, and shows that conservation of sequence features must be considered at multiple evolutionary levels to avoid an incomplete or misleading picture.

Keywords: post-translational modification; lysine; arginine; methylation; acetylation; ubiquitination; intrinsically disordered; folding-on-binding; conservation; human; eukaryote


3.2 Introduction

Intrinsically disordered regions (IDRs) in proteins were initially discovered as long stretches of amino acids in proteins that remain unfolded under physiological conditions [1, 2]. IDRs can be functional despite this absence of a well-defined three-dimensional structure, and have caused a re-examination of the protein structure-function paradigm [1-4]. They are involved in numerous biological functions [2, 4-8] and their improper functioning leads to various disease conditions [7, 9-11]. Bioinformatical studies have shown that long (>30 residues) IDRs are common in eukaryotic proteins (33% of them on average) and are much less common in archaea (2% of proteins) and eubacteria (4%) [12-14]. In addition, Ward et al. reported that long IDRs in yeast proteins are associated with transcription regulation and cell signalling [12]. The amino-acid sequences of IDRs show compositional bias and low sequence complexity [15]. Many computational tools have been developed to annotate disordered regions in amino acid sequences [16-21], facilitating the distinction between ordered and disordered regions.

In many proteins, IDRs exhibit low amino-acid sequence conservation [22] and tandem repeats are more abundant in IDRs than in ordered regions [23, 24]. Insertions and deletions are more common in IDRs [25, 26] and they contain more amino acid substitutions than the ordered regions of the same proteins [22]. Furthermore, some disordered regions in proteins show conservation for chemical composition, but not detailed amino-acid sequence conservation [27]. Studies on the evolution of ordered and disordered regions have revealed that disordered regions generally evolve differently from ordered regions, but in some cases similarly to ordered regions [22, 26-31]. Hence, understanding the evolution of disordered regions in comparison to ordered regions has been challenging.

IDRs are involved in protein-protein interactions [11], including binding to kinases [32], transcription factors [33], and translation inhibitors [34], and they also mediate interactions with nucleic acids [33, 35]. Numerous receptors and enzymes with disordered regions acquire structure when binding to a partner molecule [4, 36-38]. Proteins with such folding-on-binding (FB) regions exhibit high specificity and low affinity towards a partner molecule [1, 39]. Compared to other disordered regions, FB regions are enriched in hydrophobic residues and positively charged amino acids [40] and are more conserved [31]. Indeed, post-translational modifications (PTMs) can induce their disorder-to-order transitions, and the conformational flexibility of IDRs provides sites for many PTMs per amino-acid residue [41, 42]. Furthermore, PTMs in disordered regions have a significant role in signalling and regulation [42]. Experimental and computational studies suggest that PTMs including methylation and ubiquitination are enriched within IDRs [6, 7, 42-45], whereas analysis of acetylation has shown contradictory results [46]. Furthermore, the phosphorylation sites present in disordered regions have been suggested to facilitate the evolution of transcriptional regulation [45, 47, 48]. Methylation, acetylation, and ubiquitination (abbreviated here collectively as ‘MAU’) are the three major PTMs, next to phosphorylation and glycosylation, that regulate the function of many eukaryotic proteins. Crosstalk between MAU sites facilitates complex regulatory programs in both histone and non-histone proteins [49]. However, the evolution of MAU sites in IDRs across eukaryotic species is not well understood [50-53]. Therefore, a comparative study on the conservation of human MAU sites in ordered and disordered regions will illuminate their importance across the eukaryotic domain. Analysis of conservation across a large panel of genome-sequenced eukaryotes can give us more comprehensive insights into the evolutionary history of PTMs [45, 47, 48], while avoiding issues of data set completeness that may be a problem for experimental analysis of a variety of multicellular species.

We have performed a large-scale analysis of >15,000 experimentally-verified MAU sites from the ordered and disordered regions of >7,000 human proteins. We compiled four such data sets for both ordered and disordered regions: (i) methylated arginines, (ii) methylated lysines, (iii) acetylated lysines and (iv) ubiquitinated lysines. We studied the distribution and conservation of MAU sites in ordered and disordered regions across 380 eukaryotic organisms. Conservation signals for the maintenance and novel emergence of MAU sites were analysed at 11 evolutionary levels from the whole eukaryotic domain down to the level of the ape superfamily. We observed significant conservation attributable to lysine and arginine PTMs in both ordered and disordered regions across the 11 levels, and also some signals for the novel emergence of new MAU sites. Furthermore, we have pinpointed trends for biologically important subsets of IDRs, such as FB regions and prion-like domains. For example, we observed that MAU and other PTM sites are highly enriched in FB regions relative to both ordered and disordered regions generally and at evolutionary depths back as far as the emergence of the mammal class.


3.3 Methods

3.3.1 PTM Datasets

Human proteins with experimentally-verified PTM sites were retrieved from the dbPTM [54], PHOSIDA [55] and PhosphoSitePlus [56] databases as of November 2015. We focused on the evolutionary behaviour of methylation, acetylation and ubiquitination sites (‘MAU sites’). Redundant annotations for PTMs were removed. This resulted in 1,009 lysine and 1,676 arginine methylation sites, 10,044 acetylation sites and 14,396 ubiquitination sites. We also comparatively analysed the distribution of serine, threonine and tyrosine phosphorylation sites, and other rarer PTMs (but not their evolutionary conservation).
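As an illustration of the redundancy-removal step, the sketch below merges PTM records from several sources and collapses duplicates. It assumes each record has already been reduced to a (protein accession, residue position, residue, modification type) tuple; this layout, the function names and the toy accessions are hypothetical and are not taken from the databases' export formats or from the pipeline used here.

```python
from collections import Counter


def merge_ptm_sites(*sources):
    """Union PTM records from several sources, dropping exact duplicates.

    Each record is (accession, position, residue, ptm_type). Position is
    coerced to int and text fields are case-normalised so that the same
    site reported by two databases collapses to one entry.
    """
    merged = set()
    for records in sources:
        for accession, position, residue, ptm in records:
            merged.add((accession, int(position), residue.upper(), ptm.lower()))
    return sorted(merged)


def count_by_type(sites):
    """Count non-redundant sites per (modification type, residue) pair."""
    return Counter((ptm, residue) for _, _, residue, ptm in sites)


# Hypothetical toy records standing in for two database exports.
db_a = [("P00001", 9, "K", "acetylation"), ("P00001", 27, "K", "methylation")]
db_b = [("P00001", "9", "k", "Acetylation"), ("P00001", 9, "K", "ubiquitination")]

sites = merge_ptm_sites(db_a, db_b)
counts = count_by_type(sites)
```

Records that differ only in modification type are kept as separate sites, so a lysine annotated as both acetylated and ubiquitinated still counts once towards each set.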

3.3.2 Eukaryotic proteomes

Complete proteomes of 380 eukaryotic organisms were downloaded from the Ensembl [57], UniProt [58] and NCBI RefSeq [59] databases. The organisms were separated into eleven nested taxonomic levels that provide a range of focus on the human: eukaryotes, metazoans, deuterostomes, chordates, vertebrates, tetrapods, mammals, eutherians, supraprimates, primates, and apes. Human proteins with experimentally-verified FB regions were obtained from the IDEAL database [60].


3.3.3 Sequence analysis

Phylogenetic trees of the eukaryotic organisms were drawn with Evolview [60] using Newick-format files generated by phyloT (https://phylot.biobyte.de/) [61]. Orthologs of human proteins in the eukaryotic organisms were identified using the reciprocal best hit method with BLASTP and an E-value threshold of <1e-04 [62]. Multiple sequence alignment of human proteins with MAU sites and their orthologs in the 380 organisms was performed using Clustal Omega [63]. For the evolutionary analysis of a clade, human proteins with an ortholog in at least one of the organisms in that clade were included, and human proteins with no ortholog in any of its organisms were discarded. We used ZORRO, a probabilistic masking program, to evaluate the alignment quality of individual positions [64]. Aligned positions with a low ZORRO score were discarded, and positions within the recommended score range of five to ten were retained for conservation analysis. For comparison, the alignment program KMAD was also applied in some cases [65].
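The reciprocal-best-hit criterion can be sketched as follows. The sketch assumes that BLASTP results have already been parsed into (query, subject, E-value, bit score) tuples and that the "best" hit is the one with the highest bit score among hits passing the E-value threshold; the ranking criterion, function names and toy identifiers are assumptions made for illustration, not details given in the text.

```python
def best_hits(rows, max_evalue=1e-4):
    """Map each query to its top-scoring subject among hits with E-value < max_evalue.

    rows: iterable of (query, subject, evalue, bitscore) tuples.
    """
    best = {}
    for query, subject, evalue, bitscore in rows:
        if evalue >= max_evalue:
            continue  # hit fails the E-value threshold
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows, max_evalue=1e-4):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(forward_rows, max_evalue)
    reverse = best_hits(reverse_rows, max_evalue)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


# Hypothetical toy searches: human against mouse, and mouse against human.
human_vs_mouse = [
    ("HUMAN_A", "MOUSE_A", 1e-50, 200.0),
    ("HUMAN_A", "MOUSE_B", 1e-20, 90.0),
    ("HUMAN_B", "MOUSE_B", 1e-3, 30.0),   # rejected by the threshold
    ("HUMAN_C", "MOUSE_B", 1e-30, 120.0),
]
mouse_vs_human = [
    ("MOUSE_A", "HUMAN_A", 1e-48, 195.0),
    ("MOUSE_B", "HUMAN_A", 1e-18, 85.0),
    ("MOUSE_B", "HUMAN_C", 1e-10, 60.0),
]

orthologs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

In this toy case only HUMAN_A and MOUSE_A are retained: HUMAN_C picks MOUSE_B, but MOUSE_B's best hit is HUMAN_A, so the pair is not reciprocal.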

Enrichment analyses of gene ontology (GO) molecular function categories were performed using the GOrilla tool to identify GO terms enriched in different clades [66].

3.3.4 Identification of ordered and disordered regions in proteins

We ran BLASTP (version 2.2.28) [62] with the human proteins against the ASTRAL non-redundant database (95% identity threshold) [67]. We used the PDB atom records of proteins from the ASTRAL domain database to identify the experimentally validated positions of ordered regions in human proteins. The disordered regions in human proteins were annotated with DISOPRED and IUPRED per-residue prediction scores, using default parameters [18, 19]. Since ASTRAL domains are experimentally validated structures, regions covered by ASTRAL BLAST hits were treated as ordered even where they were also predicted to be disordered. To keep the analysis and presentation of results manageable, regions left unclassified in this way were not analysed.
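The precedence rule described above (structural evidence overrides a disorder prediction) can be sketched per residue as follows. The sketch assumes the structure-database hits have been reduced to 1-based inclusive intervals on the human sequence and that the predictor output has been reduced to one boolean per residue; how the DISOPRED and IUPRED scores are thresholded or combined is not specified here, so that step is deliberately left outside the function. All names are hypothetical.

```python
def classify_residues(length, ordered_spans, predicted_disordered):
    """Return one label per residue: 'O' ordered, 'D' disordered, 'U' unclassified.

    ordered_spans: 1-based inclusive (start, end) intervals covered by
        structure-database hits.
    predicted_disordered: sequence of booleans, one per residue.
    """
    if len(predicted_disordered) != length:
        raise ValueError("need exactly one disorder call per residue")

    # Start from the disorder prediction; everything else is unclassified.
    labels = ["D" if flag else "U" for flag in predicted_disordered]

    # Structural evidence takes precedence over the prediction.
    for start, end in ordered_spans:
        for i in range(max(start, 1) - 1, min(end, length)):
            labels[i] = "O"
    return "".join(labels)


# Hypothetical 10-residue protein with one structural hit on residues 3-6.
calls = [True, True, False, False, True, True, True, False, False, True]
labels = classify_residues(10, [(3, 6)], calls)
```

Residues 5 and 6 in the toy example are predicted disordered but fall inside the structural hit, so they are labelled ordered; residues 8 and 9 have neither kind of evidence and stay unclassified.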

Human prion-like proteins were defined as those with annotated disordered regions that have a compositional bias for asparagine or glutamine residues (identified using the fLPS program [68], run with default parameters except for a binomial P-value threshold of ≤1e-10, as used in previous studies [69-71]).

3.3.5 Conservation & statistical analysis

A Python script was written to find the conserved MAU sites in ordered and disordered regions by identifying the completely conserved lysine/arginine residues in the multiple sequence alignment for each clade. Newly-emerged conserved residues are those that are completely conserved in a clade but not across a more ancient, wider clade. To test the significance of conservation, we performed enrichment analysis of the conserved MAU-site residues at each evolutionary level as subsets of the total sets of conserved residues of the same type, with appropriate corrections for multiple hypotheses.
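The two definitions above (complete conservation within a clade, and new emergence relative to a wider clade) can be illustrated with a short sketch. This is not the script used in the study: it assumes the alignment is held as a dictionary from species name to aligned sequence, that clades are nested and listed from narrowest to widest, and that species without an ortholog are simply absent from the dictionary. All names and the toy alignment are hypothetical.

```python
def conserved_in(alignment, column, members, reference="human", residues="KR"):
    """True if the reference K/R at this column is identical in every aligned member."""
    ref_residue = alignment[reference][column]
    if ref_residue not in residues:
        return False
    present = [species for species in members if species in alignment]
    return bool(present) and all(
        alignment[species][column] == ref_residue for species in present
    )


def widest_conserved_clade(alignment, column, clades):
    """Name of the widest clade across which the residue is completely conserved.

    clades: list of (name, member_set), nested, ordered narrowest to widest.
    Returns None if the residue is not conserved even in the narrowest clade.
    """
    widest = None
    for name, members in clades:
        if not conserved_in(alignment, column, members):
            break
        widest = name
    return widest


def newly_emerged_in(alignment, column, clades):
    """Clade where the residue is conserved while the next wider clade is not."""
    widest = widest_conserved_clade(alignment, column, clades)
    if widest is None or widest == clades[-1][0]:
        return None  # not conserved at all, or conserved across every level
    return widest


# Hypothetical toy alignment (0-based columns) and three nested clades.
alignment = {"human": "MKRK", "chimp": "MKRK", "macaque": "MKRQ", "mouse": "MKQQ"}
clades = [
    ("apes", {"human", "chimp"}),
    ("primates", {"human", "chimp", "macaque"}),
    ("supraprimates", {"human", "chimp", "macaque", "mouse"}),
]
```

A gap character in any member sequence differs from the reference residue, so gapped columns are treated as not conserved under this sketch.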

Hypergeometric probability tests were used to find these enrichments of MAU-site residues in ordered and disordered regions for the different evolutionary levels. A Bonferroni correction for multiple hypothesis testing was applied for all tests for a given background population. The details of the enrichment calculations are given in the legends for Figures S3.5-S3.15. All enrichment and statistical analyses were performed using the R language [72].
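For readers unfamiliar with the test, the following is a minimal Python illustration of a one-sided hypergeometric enrichment P-value and a Bonferroni-corrected threshold. The analyses themselves were performed in R; this sketch only restates the standard formulas, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, sample, observed):
    """Upper-tail probability P(X >= observed) for a hypergeometric variable.

    population: size of the background set (e.g. all conserved lysines)
    successes:  background members carrying the feature (e.g. MAU sites)
    sample:     size of the subset drawn (e.g. lysines in the region of interest)
    observed:   subset members carrying the feature
    """
    upper = min(successes, sample)
    # Exact integer arithmetic for the tail, then a single division.
    numerator = sum(
        comb(successes, i) * comb(population - successes, sample - i)
        for i in range(observed, upper + 1)
    )
    return numerator / comb(population, sample)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Small worked case: 10 residues, 4 modified, 3 sampled, at least 2 modified.
p_small = hypergeom_enrichment_p(10, 4, 3, 2)  # (6*6 + 4*1) / 120 = 1/3
```

For example, with a family-wise level of 0.05 and a hypothetical seven tests, the per-test threshold would be about 0.0071; the number of tests is an assumption made here for illustration.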

3.4 Results and Discussion

First, we give an overview of the distribution of methylation, acetylation and ubiquitination (MAU) sites in ordered and disordered regions, and include some specific analysis and discussion of MAU sites in folding-on-binding (FB) regions, prion-like proteins and homopeptides (which are common features of disordered regions [73]).

Then, we examine the effect of MAU sites on the evolutionary behaviour of lysine and arginine residues. To what extent do MAU sites drive the conservation of these residues and the appearance of new conserved residues at different points in eukaryotic evolution? Is there evidence for the appearance of new conserved lysines in evolutionarily old proteins because of MAU site status?

These questions are examined for each of methylation, acetylation and ubiquitination separately in turn. In doing so, we also consider the effects of: (i) allowing mutation to other possible residue types for the same modification (e.g., allowing mutation between arginine and lysine for methylation); (ii) alignment quality on the results (through applying the program ZORRO, as described in Methods); (iii) removal of histones (which are known to have high levels of MAU).

The evolution of MAU sites is also specifically examined for histones, and for folding-on-binding proteins as subsets. Finally, we briefly consider the evolutionary behaviour of sites that are ‘multiple-MAU’ (i.e., that can have more than one different type of MAU modification).

3.4.1 Distribution of MAU sites in ordered and disordered regions

The numbers of MAU sites in the ordered and disordered regions are summarized in Figure 3.1A. Specific lysine residues can be sites for multiple PTMs, including MAU (Figure 3.1B). For MAU sites in ordered and disordered regions, the observed overlap between acetylation and ubiquitination sites correlates with an established regulatory relationship [74]. It is also interesting to note the high proportion of methylation sites (~51%) specifically in ordered regions that have other PTMs, in comparison to any other MAU in either ordered or disordered regions (Figure 3.1B).
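The co-incidence counts behind a three-way Venn diagram of this kind can be computed with plain set operations. The sketch below assumes each site is identified by a (protein, position) pair and that the three modification types are held as separate collections; the names and toy sites are hypothetical.

```python
def mau_overlap_counts(methylated, acetylated, ubiquitinated):
    """Counts for the seven regions of a three-set Venn diagram."""
    m, a, u = set(methylated), set(acetylated), set(ubiquitinated)
    return {
        "M": len(m - a - u),
        "A": len(a - m - u),
        "U": len(u - m - a),
        "M+A": len((m & a) - u),
        "M+U": len((m & u) - a),
        "A+U": len((a & u) - m),
        "M+A+U": len(m & a & u),
    }


def fraction_shared(sites, *others):
    """Fraction of `sites` that also carry at least one of the other modifications."""
    sites = set(sites)
    shared = sites & set().union(*others)
    return len(shared) / len(sites) if sites else 0.0


# Hypothetical toy sites keyed by (protein, residue position).
methylated = {("P1", 5), ("P1", 9)}
acetylated = {("P1", 9), ("P1", 14), ("P2", 3)}
ubiquitinated = {("P1", 9), ("P1", 14), ("P2", 7)}

overlap = mau_overlap_counts(methylated, acetylated, ubiquitinated)
shared_methyl = fraction_shared(methylated, acetylated, ubiquitinated)
```

Running the same functions separately on the ordered-region and disordered-region site sets would give the two diagrams' worth of counts.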

Figure 3.1 Overview of the number of methylation, acetylation and ubiquitination sites and the coincidence of different MAU types at the same residues in ordered and disordered regions

(A) Total number of MAU sites in ordered and disordered regions of 7160 human proteins, showing the higher number of MAU sites in disordered regions than in ordered regions, and that ubiquitination sites show a preference for ordered regions. (B) Venn diagram illustrating the co-incidence of MAU (i.e., how many residues can have two or three different MAU at the same residue) in ordered and disordered regions.

In general, PTM sites have been reported to be abundant in the disordered regions of eukaryotic proteins [7, 75]. However, not all PTMs show a preference for disordered regions. We examined the distribution in ordered and disordered regions of human proteins of experimentally-verified MAU sites, along with phosphorylation sites for comparison (as listed in Methods).

We observe that acetylation and ubiquitination sites and methylated lysine sites generally have a significant preference for ordered regions (Figure 3.2). It is known that lysine methylation in disordered regions blocks site-specific lysine ubiquitination to increase protein half-life [76]. This may contribute to the relative abundance of ubiquitination sites in ordered regions. In comparison, phosphorylation sites prefer disordered regions, as expected [7, 75] (Figure 3.2).

Figure 3.2 Percentage distribution of MAU and phosphorylation sites in ordered and disordered regions of human proteins

Percentages of MAU and phosphorylation sites (out of the total number of residues of the same type) in ordered and disordered regions of the human proteins analysed. The total number of each site present in ordered (olive green) and disordered (peach) regions is given in the centre of the bar. The hypergeometric distribution is used to identify the enrichment of MAU-modified residues in (dis)ordered regions among all lysines/arginines present in both ordered and disordered regions, with the total set of MAU sites as background population. A diamond symbol on top of a bar indicates significant enrichment of PTMs in ordered or disordered regions (corrected P-value threshold of 0.0071), and NS represents non-significant enrichment.

Previous studies have suggested that MAU sites are enriched in disordered regions [6, 7, 42-44] and that acetylated lysines have no preference for either ordered or disordered regions [46]. In contrast, our analysis here shows that MAU lysines are significantly enriched in ordered regions relative to disordered ones (Figure 3.2), whereas the opposite is true for phosphorylation sites (Figure 3.2).

3.4.2 FB regions as display areas for PTMs

FB regions in proteins are known to interact with multiple and diverse partners [1, 39], and are associated with PTMs [41, 42]. Previously, we found that FB regions are more conserved than contiguous disordered regions that are not known to exhibit disorder-to-order transition [31]. We have analysed the enrichment of MAU sites and other PTMs in FB regions (in 172 human proteins, data taken from the IDEAL database [77]). Phosphorylation sites are highest in number in FB regions, followed by MAU sites (Figure 3.3A).

[Figure 3.3: graphic not recoverable from the extracted text. Panel A: number of residues (K/R) and number of modified residues (K/R) in FB regions, and percentage distribution of PTM sites in FB regions versus non-FB/unclassified regions; symbols mark P-value < 0.0014 for FB relative to the ordered set and to the disordered set. Panel B: log odds ratio for ordered-region and disordered-region homopeptides. Panel C: log odds ratio for human prion-like proteins. In panels B and C, symbols mark P-value < 0.0014 (depletion); NS, not significant.]

Figure 3.3 Distribution of PTM sites in ordered and disordered regions of human proteins for various subsets of the data.

(A) Distribution of MAU and phosphorylation sites in folding-on-binding (FB) regions and the percentage distribution of sites in FB and non-FB/unclassified regions. Enrichment analysis is performed for the FB set as a sample of total ordered or disordered regions. Due to the limited experimental data, other PTM sites were detected only at very low levels or were not present: nitrosylated cysteines (2 sites), O-linked glycosylation (serine, 1 site; threonine, 5 sites), prenylated cysteines (2 sites), sulfated tyrosines (2 sites) and sumoylated lysines (24 sites), whereas carboxylation, myristoylation and palmitoylation sites are not present in the FB regions. We used hypergeometric probability tests to perform the enrichment/depletion analyses of PTM sites in FB regions. The critical P-value to test the significance is P<0.0014 (to correct for multiple hypotheses). (B) Distribution of MAU and phosphorylation sites in homopeptides. The enrichment and depletion analyses are calculated for homopeptides present in the ordered (olive green) and disordered (peach) regions. The statistical test and critical P-value are as for part (A). (C) Distribution of MAU and phosphorylation sites in human prion-like proteins (grey). Enrichment analysis is performed for lysines or arginines in the prion-like protein set as a sample of total lysines or arginines in the disordered set, as appropriate. The statistical test and critical P-value are as for part (B).

We observed that the major PTMs phosphorylation, methylation, acetylation, and ubiquitination are highly significantly enriched in FB regions treated as a sample either of ordered or of disordered regions (Figure 3.3A). In addition, two other less numerous PTMs, namely O-linked glycosylation on threonines (P-values≤3E-05) and sumoylation on lysines (P-value≤6.7E-15), are significantly enriched in FB regions treated as a sample of either ordered or disordered regions (not depicted in the figure). Hence, MAU / phosphorylation site enrichment is a distinctive feature of FB regions relative to other (dis)ordered regions. Furthermore, we calculated the percentage distribution of MAU and phosphorylation sites in FB and non-FB/unclassified regions; these sites show a preference for FB regions, although the number of sites is higher in non-FB/unclassified regions (Figure 3.3A).

PTMs have been reported to induce disorder-to-order transition and facilitate binding to multiple partners [42]. In addition, PTM sites and ‘multiple-MAU’ sites (i.e., individual sites that can have multiple different MAU modifications) have been previously reported to show a preference for molecular recognition features (MoRFs) [44]. MoRFs are short (10-70 residues) structured regions within disordered regions that are thought to undergo disorder-to-order transition on partner binding [78], whereas FB regions are of varying length within both ordered and disordered regions. We analysed the enrichment of multiple-MAU sites within FB regions (Table 3.1). We found a highly significant enrichment, treated as a sample of either ordered or disordered regions (P≤1.3e-56). FB regions could be involved in many significant functions due to the prevalence of long disordered regions (>50 residues) in eukaryotic proteins [79]. Indeed, FB proteins with multiple-MAU sites such as flap endonuclease 1 (FEN1), α-synuclein, HMG-I and p53 are involved in DNA/RNA binding. For example, acetylation regulates the activity of FEN1 through p300 [80] and N-terminal acetylation leads to the α-helical oligomerization of α-synuclein [81]. Generally, FB regions are known to be involved in many interactions with high specificity and low affinity towards a partner molecule. Hence, FB PTMs could be crucial for facilitating these interactions.

Table 3.1: Enrichment of ‘multiple-MAU’ sites in FB regions (treated as a sample of either ordered (O) or disordered (DO) regions) across all eukaryotes.

Region | No. of Lys residues in O/DO regions of FB proteins | No. of MAU-modified sites in O/DO with multiple MAU sites | No. of Lys residues in FB proteins | No. of MAU-modified sites with multiple MAUs in FB regions | Hypergeometric P-value
Ordered | 2697 | 160 | 109 | 65 | 1.3e-56
Disordered | 1296 | 80 | 109 | 65 | 5.1e-65

3.4.3 PTMs are depleted in homopeptides and prion-like proteins

Homopeptide repeats are common in eukaryotic proteins, and they tend to occur in disordered regions [82]. These repeats occur in a variety of nucleic-acid–binding domains linked to signalling and transcriptional processes [83]. We calculated the occurrence of PTMs in homopeptides (≥3 amino acids) in ordered and disordered regions. Among the major PTMs, a higher proportion of serine phosphorylation and lysine acetylation sites are present in the homopeptides of disordered regions (Figure 3.3B). However, enrichment/depletion analyses show that MAU sites are generally significantly depleted in both ordered- and disordered-region homopeptides, although phosphorylated tyrosines may be enriched in disordered-region homopeptides (Figure 3.3B). Other PTMs analysed do not show significant enrichment/depletion (i.e., P-values are not significant); this might be due to their very limited experimental data. We suggest that the lack of PTMs in homopeptides is due to the rapid evolution of amino-acid repeats [84], and also because homopeptides do not readily accommodate the required sequence motifs.
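Locating homopeptides of three or more residues, and the PTM sites that fall inside them, can be sketched with a regular expression. The sketch assumes sequences are given as uppercase one-letter strings and site positions are 1-based; the function names and the toy sequence are hypothetical and do not reproduce the tooling used for the analysis.

```python
import re


def homopeptide_spans(sequence, min_len=3):
    """Return (start, end, residue) for each run of >= min_len identical residues.

    Coordinates are 1-based and inclusive.
    """
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # A residue followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [(m.start() + 1, m.end(), m.group(1)) for m in pattern.finditer(sequence)]


def sites_in_homopeptides(sequence, site_positions, min_len=3):
    """Subset of 1-based site positions that fall inside a homopeptide."""
    spans = homopeptide_spans(sequence, min_len)
    return [
        pos for pos in site_positions
        if any(start <= pos <= end for start, end, _ in spans)
    ]


# Hypothetical toy sequence with a KKK run (3-5) and a QQQQ run (7-10).
toy_sequence = "MAKKKSQQQQLK"
spans = homopeptide_spans(toy_sequence)
inside = sites_in_homopeptides(toy_sequence, [4, 12])
```

The counts of sites inside and outside such runs are what an enrichment or depletion test would then compare.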

The intrinsically disordered nature of prion-like proteins and the role of PTMs such as N-glycosylation in changing the conformation and stability of prion proteins [42, 85-87] motivated us to study PTM occurrence in 1269 human prion-like proteins. We performed the analyses as described above (Figure 3.3C). As for homopeptides, there is a general trend for significant depletion. We hypothesize that PTMs may get in the way of regular side-chain hydrogen-bonding patterns that are essential for prion amyloid formation. Notably also, prion-like proteins do not show a significantly high proportion of N-glycosylation sites, even though they tend to be N-rich (i.e., P-values are non-significant).

3.4.4 Evolutionary behaviour of MAU sites at eleven evolutionary levels

The main goal of this work is to reveal to what extent the evolutionary behaviour of lysine and arginine amino acids is driven by MAU post-translational modification and by presence in intrinsic disorder. To this end, we analysed the evolutionary sequence variation of experimentally verified methylation (lysine: 1009 and arginine: 1676), lysine acetylation (10,044) and lysine ubiquitination (14,396) sites in human proteins.

We analysed the conservation trends at eleven evolutionary levels: (i) Apes, (ii) Primates, (iii) Supraprimates (primates + rodents + lagomorphs), (iv) Eutherians, (v) Mammals, (vi) Tetrapods, (vii) Vertebrates, (viii) Chordates, (ix) Deuterostomes, (x) Metazoans, and (xi) Eukaryotes (all 380 eukaryotic species examined) (Figure 3.4A). Conservation of MAU-site residues was investigated in the ordered and disordered regions across the 380 eukaryotic organisms using the pipeline of methods illustrated in Figure 3.4B. An illustrative example of a protein alignment (for ‘human chromobox protein homolog 3’) indicating the positions of MAU sites in ordered and disordered regions is shown in Figure 3.5. When we talk about conservation of PTM sites in the following analysis, it is the conservation of the amino-acid residues that is under consideration, and not of the PTMs explicitly. However, there is sufficient sequence information to discover conservation signals for the maintenance and emergence of new MAU sites during the evolutionary ancestry of humans.


Figure 3.4 Organismal phylogeny and pipeline. (A) Organismal phylogenetic tree of eukaryotes separated into eleven clades; the total number of organisms in each clade is given in brackets. (B) Pipeline for the conservation analysis. MAU sites conserved in ordered and disordered regions are considered as two separate datasets.


Figure 3.5 Example of a protein with methylation, acetylation and ubiquitination sites in ordered and disordered regions.

Multiple sequence alignment of human chromobox protein homolog 3 and its primate orthologs, depicted using JalView [88], showing methylation, acetylation (purple) and ubiquitination (yellow) sites in ordered (green) and disordered (peach) regions. Sites with both acetylation and methylation are highlighted in brown, sites with both acetylation and ubiquitination are highlighted in cyan, and sites with acetylation, methylation and ubiquitination are highlighted in red.


We examined the degree of conservation of arginines and lysines that are human MAU sites at each of the 11 evolutionary levels. We analysed: (i) the MAU site residues that are conserved (out of the total number of conserved arginines and lysines) for each of these 11 clades, and (ii) the MAU site residues that are newly emerged residues for that specific clade and are conserved right across it. To test the significance of conservation, we performed enrichment analysis of the conserved MAU sites at each evolutionary level, with appropriate corrections for multiple hypotheses. The fractions of conserved residues that are MAU sites at different evolutionary stages are shown on schematic species trees in Figure S3.1. A summary schematic of the major results is shown in Figure 3.6.
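An enrichment test of this kind can be sketched as below. The text does not name the statistical test or the correction method, so this example assumes a one-sided hypergeometric test and a Bonferroni correction. All counts, and the function names `hypergeom_upper_tail` and `bonferroni_threshold`, are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n residues are drawn from N, of which K are PTM sites."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical counts for a single evolutionary level (not thesis data).
N = 1000  # all aligned lysines in the region class
K = 100   # of these, lysines that are human PTM sites
n = 50    # lysines conserved across the clade
k = 15    # conserved lysines that are also PTM sites

p = hypergeom_upper_tail(k, N, K, n)
significant = p < bonferroni_threshold(0.05, 11)  # assumed: one test per level
print(f"P = {p:.3g}, significant after correction: {significant}")
```

The same function could be applied separately to ordered and disordered regions, mirroring the two-dataset design of the pipeline.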


Figure 3.6 Summary of significantly enriched conserved MAU sites in ordered and disordered regions at 11 evolutionary clades

Evolutionary levels with significant enrichment (after correction for multiple hypotheses) are labelled with four different shapes: lysine methylation (square), arginine methylation (circle), lysine acetylation (star) and ubiquitination (triangle) sites. The ordered and disordered regions with enriched MAU sites are labelled in olive green and light orange, respectively, and the sites with significant enrichment in both disordered and ordered regions are coloured blue. Where there are conservation signals for newly emerged MAU sites in ordered and disordered regions, the symbols are marked with black and red borders, respectively. The results are depicted in more detail (with P-values and total numbers of sites) in Figures S3.1-S3.4. Part (A) is for the total data set, and part (B) is for histones.

In general, we found that, of proteins with MAU sites, 7.3% in ordered and 1.0% in disordered regions have conserved sites across all eukaryotes, with 3.0% of sites in ordered and 0.5% of sites in disordered regions being completely conserved in this way (Table 3.2). For example, the abundant eukaryotic DEAD-box protein p68 contains such completely conserved acetylatable (ordered: K-351) and ubiquitinatable (ordered: K-351, disordered: K-375) residues in both ordered and disordered regions. PTMs such as acetylation and ubiquitination are reported to regulate transcriptional coactivation and increase the stability of p68 [89]. The presence of conserved acetylation- and ubiquitination-site residues suggests an essential role of very specific PTMs in p68 across all eukaryotes.

Table 3.2: Percentages of human MAU-site residues in ordered and disordered regions that are conserved across all eukaryotes.

PTMs                          Conserved residues                          Proteins with conserved residues
                              Ordered               Disordered            Ordered               Disordered
All MAU sites                 0.43% (393/91535)     0.02% (32/155852)     7.26% (270/3719)      1.01% (48/4757)
Lysine methylatable sites     5.06% (18/356)        1% (4/593)            4.78% (11/230)        0.48% (2/417)
Arginine methylatable sites   4.09% (7/171)         0.16% (2/1289)        3.33% (5/150)         0% (0/637)
Acetylatable sites            2.59% (113/4370)      0.18% (9/5011)        5.03% (97/1930)       0.54% (13/2387)
Ubiquitinatable sites         1.42% (114/8029)      0.43% (24/5583)       7.51% (212/2823)      1.21% (35/2890)
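The percentages in Table 3.2 follow from the counts given in brackets. As a minimal sketch, the snippet below recomputes the "Conserved residues" percentages for the four PTM types from the counts as printed; the dictionary layout and the helper name `percent` are hypothetical.

```python
# (conserved, total) residue counts copied from Table 3.2,
# stored as (ordered, disordered) pairs for each PTM type.
COUNTS = {
    "Lysine methylatable": ((18, 356), (4, 593)),
    "Arginine methylatable": ((7, 171), (2, 1289)),
    "Acetylatable": ((113, 4370), (9, 5011)),
    "Ubiquitinatable": ((114, 8029), (24, 5583)),
}


def percent(conserved, total):
    """Percentage of residues that are conserved, rounded to two decimals."""
    return round(100.0 * conserved / total, 2)


for ptm, (ordered, disordered) in COUNTS.items():
    print(f"{ptm}: ordered {percent(*ordered)}%, disordered {percent(*disordered)}%")
```

With these counts, 4/593 evaluates to 0.67%, which the table reports rounded to 1%; the other entries match the printed values to two decimals.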


3.4.5 Evidence for methylation as a driver of lysine conservation during eukaryotic evolution, and for the emergence of new lysine methylation sites

The fraction of conserved lysine methylation sites in each clade is shown in Figure S3.1, with ordered regions shown in green and disordered regions in peach. The bubble size indicates the fraction of conservation. We find substantial evidence for significant conservation of lysine methylation sites across most of the 11 levels (P-values = 0.004 to 5e-21), the exceptions being apes, primates and vertebrates for ordered regions, and apes, primates, supraprimates, vertebrates and all eukaryotic organisms for disordered regions (Figure S3.1, top and bottom left panels, and Figure S3.5). This strong, persistent conservation signal across most of the levels suggests that methylation is a major driver of lysine conservation in both ordered and disordered regions across eukaryote evolution.

In addition, for each clade we studied newly emerged lysines that are methylated in humans. By doing so, we can ask: Is lysine methylation also a driver of conservation for newly emerged lysine residues? We observed a significant enrichment of new lysine methylation sites in the ordered regions of eutherians (P=6.9e-06), and in the disordered regions of mammals (P=9.6e-04) and deuterostomes (P=0.0011) (Figures S3.1 and S3.5). Specifically, we observed a conservation signal for a significant number of evolutionarily new methylation sites appearing at various epochs in old proteins, i.e. proteins that emerged earlier in eukaryotic evolution. The significant enrichment of new sites in old proteins is similar to the above general results except that new sites are more highly enriched in the disordered regions of deuterostomes (P=5e-04) (Figure S3.9). Examples of such proteins in mammals are microtubule-associated protein tau and chromodomain Y-like protein (CDYL1). In tau, methylatable residues K-163 and K-267 in disordered regions are conserved across mammals. Methylation of K-267 is reported to increase the frequency of phosphorylation at S-262, and K-163 is identified as a site for both methylation and acetylation [90]. Moreover, methylation at these sites may play important roles in pathological conditions [90]. In CDYL1, methylatable K-135 in a disordered region is conserved across mammals, and is reported to regulate chromodomain binding to H3K9me3 [91]. These conservation signals for emergence of new lysine methylation sites suggest that clade-specific changes in modifying enzymes might cause progressive addition of more PTM sites to specific proteins in complex organisms.

All conservation signals for new emergent lysine methylation sites appear to be due to new sites in evolutionarily old proteins, i.e., there are no significant contributions from new proteins (such as those arising from new gene duplications). This is also observed generally for all the MAU sites analysed further below.

We also examined the conservation of human lysine methylation sites while allowing for mutation to arginine (since arginines can also be methylated) and vice versa. This analysis also yields significant conservation signals at various evolutionary levels, with a few differences (Figure S3.6). For example, specifically in eutherians, a signal for the emergence of new sites is observed in both ordered and disordered regions (Figure S3.6). This indicates that methylated lysine sites could have been mutated to arginines in the epoch after eutherian emergence. Furthermore, in general the conservation analyses of aligned positions for human lysine methylation sites after applying the alignment quality filtering program ZORRO give similar results, but with increased significance (Figure S3.7). Also, overall, there is little difference in the results upon removal of histones (Figure S3.8), with just three results switching significance status in three of the analysed levels. In addition, we checked the effect of using an alternative alignment tool called KMAD, which has some features designed for the alignment of disordered proteins [65] (Figure S3.14). This tool produced considerably fewer aligned positions overall at all evolutionary levels, but led to increased significance or acquisition of significance in the enrichments detected for 9 of the 11 levels, and decreases in significance for two of them (Deuterostomes and Metazoans). We also assessed the significance of conservation of methylation sites in the disordered regions predicted by the IUPRED software (Figure S3.15), for comparison. IUPRED annotates fewer disordered regions than DISOPRED; however, only one significance result changes (conservation at the primate level becomes significant) (Figure S3.15).

3.4.6 Arginine methylation conservation is highly favoured in ordered regions across human evolutionary descent in eukaryotes

Arginine methylation has been extensively studied in both histones and non-histones, and is generally involved in signal transduction, mRNA splicing, transcription factors and DNA repair (reviewed in [92]). Protein arginine methyltransferases have been identified in many non-mammalian organisms such as invertebrate chordates, arthropods and nematodes [92]. We find here that right across eukaryotic evolution human methylated arginine sites have had significant conservation, almost exclusively in ordered regions (Figure S3.2, top left panel and Figure S3.5). The human methylated arginines in ordered regions show a higher fraction of conservation than in disordered regions at almost all evolutionary levels. There are no significant conservation signals for the emergence of new methylated arginine sites during eukaryotic evolution. However, methylated arginine residues, when allowed to mutate to lysine, show potential emergence of new sites in metazoans, indicating potential allowance of such mutation (Figure S3.6). Similar conservation results are obtained for IUPRED-predicted disordered regions, with additional enrichment in metazoans (Figure S3.15). In addition, filtering for alignment quality using ZORRO yields similar results as for methylated Ks, i.e., increased and more pervasive significance, with additional enrichments in clades such as primates, eutherians, tetrapods and vertebrates. In general, since such quality filtering gives higher scores to conserved positions, ordered regions tend to gain higher scores than disordered regions; however, in our analyses we generally see further significant conservation in disordered regions as well (Figure S3.7). Also, similar results are obtained here when histones are removed from the data sets (Figure S3.8).

In the analysis of newly emerged arginine methylation sites at various evolutionary levels, we looked specifically for a conservation signal indicating the emergence of new arginine methylation sites in evolutionarily old proteins (Figure S3.9). We found a significant enrichment of such methylated arginines in the ordered regions of old proteins in tetrapods (P=0.028). In tetrapods, these sites are identified in the ordered regions of proteins such as heterogeneous ribonucleoproteins hnRNP A2/B1 and A0. Arginine methylation sites in hnRNP A2/B1 and hnRNP A0 are involved in cellular signaling and maturation of hnRNPs [93]. Furthermore, methylation-site arginine residues show conservation in the disordered regions of hnRNP H3 in tetrapods. hnRNP isoforms confer various splicing functions, and hnRNP is reported to transactivate tyrosine hydroxylase gene transcription in tetrapods [94]. Thus, methylation-site arginine residue conservation correlates with their vital role in tetrapod hnRNPs.

3.4.7 Human acetylated lysines are favoured for significant conservation in disordered regions rather than in ordered regions across eukaryote evolution

To explore the conservation of lysine acetylation in ordered and disordered regions, we performed the same analysis as for methylation. Here, we find that acetylation sites are significantly enriched (P<0.00417) among conserved lysines in disordered regions at 7 out of the 11 evolutionary levels, more so than in ordered regions (4/11 levels) (Figure 3.6A). Notably, acetylated lysine residues have highly significant conservation in disordered regions at several levels (P<1e-20) (Figure S3.5 and Figure S3.3, bottom left panel). Strong conservation evidence for the emergence of new disordered-region lysine acetylation sites is observed in Deuterostomes (P=3e-21). There is no conservation signal for the emergence of new lysine acetylation sites in ordered regions at any evolutionary level (Figure 3.6A and Figure S3.5), except that when mutation to other possible acetylation sites is allowed, it is observed in eutherians (Figure S3.6).

Since there is a conservation signal for new lysine acetylation sites in disordered regions across deuterostomes, we examined a few proteins that may have acquired new sites in this evolutionary epoch. For example, new conservation at MAU sites is found in the disordered regions of CREB-binding protein (CBP) and p300 HAT. Six acetylated K residues are conserved in CBP IDRs. CBP is hypothesized to increase the acetylation of H3 and H4 histones and NcoA3 [95]. In p300 HAT, we found eight conserved acetylatable K residues in IDRs in the p300 loop region. The autoacetylation of K residues within this region is proposed to regulate the p300 HAT domain [95].

We looked for evidence of new lysine acetylation sites in ‘old’ proteins (i.e., proteins that arose earlier in evolution) and in ‘new’ proteins (i.e., proteins that arose in each clade). We find conservation signals for new lysine acetylation sites in both the ordered (P=0.0046) and disordered (P=1e-21) regions of old proteins in deuterostomes (Figure S3.9).

As above for methylation, we checked whether the results are affected by the application of several criteria. Firstly, we compared the results to the case where conservation of K acetylation sites as other residue types is allowed (i.e., substitution of acetylated K by A, G, M, S, or T; these are amino acids which can also be acetylated). We observed that the two datasets exhibit little or no difference (Figure S3.6). This result suggests that the overall trend for conservation of human acetylation sites is robust to substitution of acetyl-lysine by other possible acetylatable residues. In addition, IUPRED-predicted disordered regions show similar results, but with decreased significance in supraprimates, eutherians and tetrapods, and additional enrichment in primates (Figure S3.15). As above, applying the ZORRO alignment quality filter or the KMAD tool, or removing histones, gives similar or more highly significant enrichments (Figures S3.7 and S3.14).
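The relaxed conservation criteria described above (K/R interchange for methylation sites, and substitution of acetylated K by A, G, M, S or T) can be expressed as a set of permitted residues per modification type. The following is a minimal sketch only: the `ALLOWED` dictionary, the mode names and the function `column_conserved` are hypothetical and do not represent the code used in this study.

```python
# Residues accepted at an aligned position for a given human site residue.
# "strict" requires identity; the other modes follow the substitutions
# named in the text for methylation (K <-> R) and acetylation (K -> AGMST).
ALLOWED = {
    "strict": {"K": {"K"}, "R": {"R"}},
    "methylation": {"K": {"K", "R"}, "R": {"R", "K"}},
    "acetylation": {"K": {"K", "A", "G", "M", "S", "T"}},
}


def column_conserved(human_residue, ortholog_residues, mode="strict"):
    """True if every ortholog residue is permitted under the chosen mode."""
    permitted = ALLOWED[mode][human_residue]
    return all(res in permitted for res in ortholog_residues)


# Hypothetical alignment column for a human methylated lysine.
column = ["K", "R", "K"]
print(column_conserved("K", column))                      # strict identity
print(column_conserved("K", column, mode="methylation"))  # K/R interchange allowed
```

Comparing the strict and relaxed calls across all sites is one way to see whether a conservation signal depends on tolerating such substitutions.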


3.4.8 Ubiquitination-site residue conservation is favoured in disordered regions of eukaryotic proteins

We analysed ubiquitination sites as above. We find that 4 out of 11 eukaryotic levels show significant enrichment of conserved ubiquitination sites in both ordered and disordered regions, and furthermore in apes, eutherians and vertebrates, only disordered regions exhibit significant conservation (P<0.0025) of these sites (Figures S3.4 and S3.5). In deuterostomes, we found a significant signal for new sites in disordered regions (P<0.00417). Moreover, when we focused on potential new sites in evolutionarily old proteins, we found similar enrichment for disordered regions, with all the potential additional sites found in deuterostomes present in such old proteins (P<1e-10) (Figure S3.9).

For example, the human ubiquitinated residue K-56, located in an IDR of RNA helicase p68, is newly conserved across deuterostomes. The poly-ubiquitination of overexpressed p68 is reported in colorectal neoplasms [96]. Moreover, mutation of sumoylation sites is reported to increase polyubiquitination, resulting in p68 aggregation [96]. In addition, ubiquitinatable K-207 is newly conserved across deuterostomes in the disordered regions of MCM3 (Minichromosome Maintenance Complex Component 3), an essential DNA replication licensing factor. K-207 in MCM3 is reported to be ubiquitinated by KEAP1 (Kelch-like ECH-associated protein 1), and KEAP1-mediated MCM3 ubiquitination sites are stated to be on predicted exposed surfaces of the C-terminal domain in MCM3 [97]. Such conservation suggests that these ubiquitinatable sites in the disordered regions could have facilitated macromolecular interactions since the dawn of deuterostomes.


Previously, for a much smaller data set, it has been observed that ubiquitination sites are more conserved than unmodified lysines in both ordered and disordered regions in mammals, whereas these sites are not more significantly conserved than unmodified sites in yeast [50]. Here, we discover that such conservation has been maintained throughout various stages of human eukaryotic ancestry. Also, we find a conservation signal for the emergence of new ubiquitination sites during deuterostome evolution (Figure 3.6A, Figure S3.5). Furthermore, similar conservation results are observed for the IUPRED-predicted disordered regions but with loss of significance for two clades (Figure S3.15). As above, filtering with the ZORRO program or application of the KMAD program (Figures S3.7 and S3.14) in general accentuates the conservation results with additional enrichments in several further clades, and removing histones makes little or no difference (Figures S3.7 and S3.8).

3.4.9 Conservation signals for MAU sites in Histones

Histone proteins are highly conserved in all eukaryotes, and their regulatory activity is intimately linked to MAU and phosphorylation. These modifications provide several functions to histones and can modify nucleosome shape and stability. For example, acetylation and phosphorylation alter the charge of histone proteins. Methylation is more complex, i.e. lysine can be mono-, di- or tri-methylated, and ubiquitination provides a much larger covalent modification [98]. Most histone modifications occur within the disordered N-terminal tails, where they are linked to regulation of chromatin structure and recruitment of enzymes to reposition nucleosomes [98]. Furthermore, ordered regions of histones are highly conserved and modifications in these regions are also observed. Extensive study of the cross-talk between PTMs in histone tails has given rise to the term “histone code”, wherein histone tails exhibit sites for multiple PTM types and function in transcriptional regulation [41, 99, 100]. Hence, we wished to compare the conservation behaviour of MAU sites in the ordered and disordered regions of histones.

We examined the MAU sites in histones that are significantly enriched in each clade. The percentage of histones in the total proteins analysed is 0.69%, which rises about 2.5-fold (to 1.74%) for proteins with conserved MAU sites across all eukaryotes. We found the same pattern of significant conservation signals across four evolutionary levels (mammals, eutherians, supraprimates and primates), i.e., for lysine methylation sites and ubiquitination sites in disordered regions, and for acetylation sites in ordered regions (P-values = 2e-04 to 2.8e-07) (Figure S3.10 and Figure 3.6B). Similar results are obtained when mutation to other possible methylation and acetylation sites is allowed (Figure S3.11).

3.4.10 Methylation site lysine residues in the disordered regions of linker H1 and H3 variants are conserved as far back as mammals

Histones have a significant enrichment for conserved methylation-site residues in disordered regions in mammalian, eutherian and supraprimate clade alignments (Figure S3.10). Hence, we examined some individual cases for further perspective. In mammals, we found a notable number of conserved methylation-site residues in the disordered regions of Histone H1 variants H1.0 (K-12, K-102 and K-108) and H1.3 (K-17, K-107 and K-169) and of Histone H3 variants H3.2 and H3.3 (7 conserved site residues each). The linker histone H1.1 binds between the nucleosomes and is part of higher-order chromatin structure. H1 variant PTMs might be involved in modulating DNA binding [101]. Lysine acetylation in the H1 N-terminal region reduces H1 affinity to chromatin, and also recruits TAF1 to activate transcription [102]. In addition, lysine methylation in the N-terminal region of Histone H3 has been linked with strong cognitive abilities [103]. Thus, conservation of methylatable lysine residues in the disordered regions of histone H1 and H3 variants might facilitate cell-specific transcription and play vital roles in neurodegenerative diseases.

3.4.11 Ubiquitination sites in H2A and H3 variants in mammalian histones

In mammalian histones, conserved ubiquitination-site lysines are significantly enriched in disordered regions (Figure S3.10). The highest numbers of conserved ubiquitinatable site residues are observed in Histone H2A variants such as Histone H2A.1 (positions 120, 126, 128 and 130) and Histone H2A type 2-B (positions 119, 120, 125, 128 and 130). This could be linked to monoubiquitination being common in H2A and H2B, and present in all cells of higher organisms [104]. PTMs in intrinsically disordered histone tail domains have diverse functional impacts. For example, during spermatogenesis, proteasome-mediated degradation of histones may facilitate chromatin condensation [104]. Also, ubiquitinated H2A is involved in gene silencing and suppresses transcription initiation by inhibiting methylation of H3 at K-4 [104]. Hence, the results suggest that modifications on the disordered regions of histone variants that altered nucleosome stability were consolidated in the epoch of evolution since the dawn of mammals.


3.4.12 Sites with multiple MAU PTMs

Multiple PTMs can occur on the same residue in a protein. Histone proteins are the best-known example of this; they have such ‘multiple-MAU’ sites in their N-terminal tail regions. The association between multiple-MAU sites and signaling is also observed in other proteins, e.g., α-tubulin, RNA polymerase II, p300/CBP and Cdc25C phosphatases [44]. PTM cross-talk at these sites, such as between phosphorylation/acetylation, phosphorylation/sumoylation, hydroxylation/O-linked glycosylation, and acetylation/ubiquitination, has been reported [74, 99]. As shown in Figure 3.1B, our analysis reveals a pronounced co-occurrence of acetylation and ubiquitination, which plays a major regulatory role [74]. Previously, it was shown that multiple-MAU sites show a strong preference for disordered regions [44].

We checked whether having multiple MAU modifications at one site is linked to increased sequence conservation for the 11 evolutionary levels. This would also be a further strong indicator that the conservation signals we have observed are due to conservation of PTMs at various evolutionary depths. PTM sites in human proteins with more than one MAU modification were separated into ordered (1836 sites) and disordered regions (676 sites). We found significant conservation of multiple-MAU sites in disordered regions in apes (P=0.009) and supraprimates (P=1e-05), and in ordered regions in apes (P=0.006), eutherians (P=9e-26), chordates (P=1.5e-43) and across all eukaryotes (P=3.5e-04). Also, there are conservation signals that appear due to the emergence of new conserved sites, e.g., in chordate, eutherian and supraprimate clades for ordered regions, and a very high significance is found in supraprimates (P=1e-82) for disordered regions (Figure S3.12). A previous study has reported that multiple-PTM sites are predominantly present in disordered regions [44]. In our study, we observed a strong conservation signal for disordered regions in supraprimates. This suggests that a prominent number of ‘multiple-PTM’ sites could have emerged in supraprimates. Many of the P-values for these are smaller than the P-value for any relevant individual PTM enrichment, indicating potential increased conservation due to their multiple-MAU status.
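Selecting multiple-MAU sites amounts to grouping site annotations by protein and position and keeping the positions that carry more than one modification type. The sketch below illustrates this with hypothetical records; the tuple layout, the placeholder protein names and the function `multiple_mau_sites` are illustrative assumptions, not the data format used in this study.

```python
from collections import defaultdict

# Hypothetical annotations: (protein, position, modification, region).
SITES = [
    ("P1", 10, "acetylation", "disordered"),
    ("P1", 10, "ubiquitination", "disordered"),
    ("P1", 42, "methylation", "ordered"),
    ("P2", 7, "acetylation", "ordered"),
    ("P2", 7, "methylation", "ordered"),
    ("P2", 7, "ubiquitination", "ordered"),
]


def multiple_mau_sites(sites):
    """Return sites with more than one MAU type, split by region class."""
    mods = defaultdict(set)
    region_of = {}
    for protein, position, modification, region in sites:
        mods[(protein, position)].add(modification)
        region_of[(protein, position)] = region
    result = {"ordered": [], "disordered": []}
    for key, kinds in sorted(mods.items()):
        if len(kinds) > 1:  # more than one distinct modification type
            result[region_of[key]].append(key)
    return result


print(multiple_mau_sites(SITES))
```

The two resulting lists could then be passed to the same conservation and enrichment steps as the single-modification sites.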

3.4.13 Ubiquitination is a major driver of conservation of lysines in folding-on-binding (FB) regions

Analysis of PTM sites in folding-on-binding regions showed that phosphorylation and MAU sites are significantly enriched (Figure 3.3A). So, we analysed the conservation of MAU sites in FB regions across the 11 evolutionary levels (Figure S3.13). Here, we treated the FB regions as a sample of both ordered regions and disordered regions.

In eutherians, we observe significant enrichments of conserved lysines/arginines for all types of MAU in the FB regions (as samples of either disordered or ordered regions) (Figure S3.13). Ubiquitination sites are the most enriched, with a persistence of enrichment as far back as the mammalian clade (P=4e-3), followed by acetylated lysines (Figure S3.13). Examples of human proteins with ubiquitination-site residues in FB regions that have such conservation in mammals are: Myc proto-oncogene, histone H3.3, Ras-related C3 botulinum toxin substrate-1 (Rac1), and Protein CASC3 (Cancer susceptibility candidate gene 3 protein).

For example, the Myc proto-oncogene, a transcription factor, is shown to undergo phosphorylation on T-58 and S-62 prior to its degradation by ubiquitination. The interaction of Fbw7 at T-58 is reported to promote degradation of Myc protein, and mutation of this site results in decreased degradation [105]. The ubiquitinated K-18 in the FB region of the histone H3 tail is identified to mediate DNA methylation by interacting with the N-terminal regulatory domain of DNMT1 [106]. Furthermore, it has been reported that ubiquitinated Rac1 might be involved in the internalization of Rac1 from the peripheral membrane and its relocation towards endocytic vesicles. In addition, mutations at evolutionarily conserved ubiquitination sites are identified to be enriched in cancer [106].

Therefore, these results suggest that the enriched ubiquitination sites in FB or disordered regions could be due to the shorter half-life of these proteins.

3.4.14 Functional trends in MAU-site containing proteins

We checked for any interesting functional trends in the conservation data through examining Gene Ontology (GO) annotation [107]. Specifically, we were interested in the functional trends for proteins that have newly emerged conserved MAU-site residues at each evolutionary level. We see high enrichments for very general GO categories involved in ‘binding’, such as ‘nucleotide binding’, ‘RNA binding’, ‘protein binding’, ‘anion binding’, etc. These functional trends tend to be detectable from the tetrapod clade down to primates, but not so much outside of this range. These results tally well with a general role for MAU PTMs in modifying binding specificities and modalities (Figure S3.16A). Interestingly, when compared to the GO category enrichments for the whole set of MAU-modified proteins (Figure S3.16B), there are a few missing categories, e.g., ‘drug binding’, ‘organic cyclic compound binding’, and ‘microtubule binding’, suggesting that MAU sites on proteins with these functions do not undergo concerted changes in conservation across deep eukaryotic evolutionary time.


3.5 Concluding remarks

Intrinsically disordered regions can house large numbers of post-translational modifications, such as the MAU sites which are the focus of this study. By examining the conservation of these sites in ordered and disordered regions separately, we have discovered that MAU is an important driver of arginine/lysine conservation throughout different stages of eukaryotic evolution, and that there is evolutionary evidence for key moments in human ancestry where new MAU sites have arisen in existing proteins, particularly during the epochs of deuterostome and eutherian evolution. The conservation signals for emergence of new PTM sites suggest that clade-specific changes in modifying enzymes might cause the progressive addition of more PTM sites to specific proteins in complex organisms. There is a surprising variety of conservation patterns for MAU-site residues when comparing disordered and ordered regions to each other. The four types of MAU site (methylatable K and R, acetylatable K and ubiquitinatable K) each have distinct conservation patterns, with conservation of methylatable Rs being strongly favoured in ordered regions. In contrast, methylatable Ks are conserved in either set of regions, and conservation of acetylatable and ubiquitinatable Ks is favoured in disordered regions over ordered. The strongest conservation signals occur across the mammalian clade, indicating its appropriate use as a baseline conservation level for further analyses (Figures S3.5A-B and S3.5C-D). Distinct patterns of MAU-site evolution are observed in histones during eukaryote evolution, as compared to non-histones. However, removal of histones from the data makes little or no difference to the overall results. Also, in general, filtering for alignment quality increases significances in both ordered and disordered regions.


Examining the scenario where mutation to other possible MAU sites is allowed yields some interesting variant results. For example, a conservation signal for an emergent allowance of mutation between methylated arginine (R) and lysine (K) is observed in a certain epoch. This suggests that some sites switched between R and K and left a trace of this in the conservation pattern.

Folding-on-binding regions have highly significant enrichments in MAU sites (particularly ubiquitination sites) relative to other ordered or disordered regions, which persist as far back as mammalian emergence. Also, in some cases ‘multiple-MAU’ sites, i.e., sites that can be modified in any of the three ways, demonstrate highly significant conservation that is much more significant than for any corresponding single PTM type.

The number of conserved sites gets smaller when conservation is analysed for larger, wider clades. Conversely, however, the residues that are PTMed in human are a larger fraction of these deeply conserved residues, so significant conservation is still detected.

This investigation demonstrates that analysis of conservation across a large panel of genome-sequenced eukaryotes can give us more comprehensive insights into the evolutionary history of PTMs, while avoiding issues of data set completeness that may be a problem for experimental analysis of such a variety of multi-cellular species. Also it is clear that we need to consider conservation of sequence features at multiple levels in order not to get an incomplete or misleading picture.


3.6 References

1. Wright, P.E. and H.J. Dyson, Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 1999. 293: p. 321-331.
2. Dunker, A.K., et al., Intrinsically disordered proteins. Journal of Molecular Graphics, 2001. 19(1): p. 26-59.
3. Tompa, P., Intrinsically unstructured proteins. Trends Biochem Sci, 2002. 27(10): p. 527-33.
4. Dunker, A.K., et al., Intrinsic disorder and protein function. Biochemistry, 2002. 41(21): p. 6573-82.
5. Dunker, A.K., et al., Function and structure of inherently disordered proteins. Curr Opin Struct Biol, 2008. 18(6): p. 756-64.
6. Xie, H., et al., Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. J Proteome Res, 2007. 6(5): p. 1882-98.
7. Xie, H., et al., Functional anthology of intrinsic disorder. 3. Ligands, post-translational modifications, and diseases associated with intrinsically disordered proteins. J Proteome Res, 2007. 6(5): p. 1917-32.
8. Vucetic, S., et al., Functional anthology of intrinsic disorder. 2. Cellular components, domains, technical terms, developmental processes, and coding sequence diversities correlated with long disordered regions. J Proteome Res, 2007. 6(5): p. 1899-916.
9. Uversky, V.N., C.J. Oldfield, and A.K. Dunker, Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys, 2008. 37: p. 215-46.
10. Uversky, V.N., et al., Unfoldomics of human diseases: linking protein intrinsic disorder with diseases. BMC Genomics, 2009. 10 Suppl 1: p. S7.
11. Weinreb, P.H., et al., NACP, a protein implicated in Alzheimer's disease and learning, is natively unfolded. Biochemistry, 1996. 35(43): p. 13709-15.
12. Ward, J.J., et al., Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol, 2004. 337(3): p. 635-45.
13. Dunker, A.K., et al., Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform, 2000. 11: p. 161-71.
14. Peng, Z., et al., Exceptionally abundant exceptions: comprehensive characterization of intrinsic disorder in all domains of life. Cell Mol Life Sci, 2015. 72(1): p. 137-51.
15. Romero, P., et al., Sequence complexity of disordered protein. Proteins: Structure, Function, and Genetics, 2001. 42(1): p. 38-48.
16. Prilusky, J., et al., FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 2005. 21(16): p. 3435-8.
17. Linding, R., et al., GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res, 2003. 31(13): p. 3701-8.
18. Dosztanyi, Z., et al., IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 2005. 21(16): p. 3433-3434.
19. Ward, J.J., et al., The DISOPRED server for the prediction of protein disorder. Bioinformatics, 2004. 20(13): p. 2138-9.
20. Xue, B., et al., PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim Biophys Acta, 2010. 1804(4): p. 996-1010.
21. Garner, E., et al., Predicting Disordered Regions from Amino Acid Sequence: Common Themes Despite Differing Structural Characterization. Genome Inform Ser Workshop Genome Inform, 1998. 9: p. 201-213.
22. Brown, C.J., et al., Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol, 2002. 55(1): p. 104-10.
23. Szalkowski, A.M. and M. Anisimova, Markov models of amino acid substitution to study proteins with intrinsically disordered regions. PLoS One, 2011. 6(5): p. e20488.
24. Jorda, J., et al., Protein tandem repeats - the more perfect, the less structured. FEBS J, 2010. 277(12): p. 2673-82.
25. Light, S., et al., Protein expansion is primarily due to indels in intrinsically disordered regions. Mol Biol Evol, 2013. 30(12): p. 2645-53.

125

26. Brown, C.J., A.K. Johnson, and G.W. Daughdrill, Comparing models of evolution for ordered and disordered proteins. Mol Biol Evol, 2010. 27(3): p. 609-21. 27. Uversky, V.N., A decade and a half of protein intrinsic disorder: biology still waits for physics. Protein Sci, 2013. 22(6): p. 693-724. 28. Tompa, P., Intrinsically unstructured proteins evolve by repeat expansion. Bioessays, 2003. 25(9): p. 847-55. 29. Chen, J.W., et al., Conservation of intrinsic disorder in protein domains and families: I. A database of conserved predicted disordered regions. J Proteome Res, 2006. 5(4): p. 879-87. 30. Brown, C.J., et al., Evolution and disorder. Curr Opin Struct Biol, 2011. 21(3): p. 441-6. 31. Narasumani, M. and P.M. Harrison, Bioinformatical parsing of folding-on-binding proteins reveals their compositional and evolutionary sequence design. Sci Rep, 2015. 5: p. 18586. 32. Adkins, J.N. and K.J. Lumb, Intrinsic structural disorder and sequence features of the cell cycle inhibitor p57Kip2. Proteins, 2002. 46(1): p. 1-7. 33. Chang, J.F., et al., Oct-1 POU and octamer DNA co-operate to recognise the Bob-1 transcription co-activator via induced folding. J Mol Biol, 1999. 288(5): p. 941-52. 34. Johansson, J., et al., Conformation-dependent antibacterial activity of the naturally occurring human peptide LL-37. J Biol Chem, 1998. 273(6): p. 3718-24. 35. Tucker, P.A., et al., Crystal structure of the adenovirus DNA binding protein reveals a hook-on model for cooperative DNA binding. EMBO J, 1994. 13(13): p. 2994-3002. 36. Cheng, E.H., et al., Conversion of Bcl-2 to a Bax-like death effector by caspases. Science, 1997. 278(5345): p. 1966-8. 37. Bidwell, L.M., et al., Crystal structure of human catecholamine sulfotransferase. J Mol Biol, 1999. 293(3): p. 521-30. 38. Huang, Y., et al., Mechanisms for auto-inhibition and forced product release in glycine N-methyltransferase: crystal structures of wild-type, mutant R175K and

126

S-adenosylhomocysteine-bound R175K enzymes. J Mol Biol, 2000. 298(1): p. 149-62. 39. Dunker, A.K., et al., Protein disorder and the evolution of molecular recognition: theory, predictions and observations. Pac Symp Biocomput, 1998: p. 473-84. 40. Meszaros, B., et al., Molecular principles of the interactions of disordered proteins. Journal of Molecular Biology, 2007. 372(2): p. 549-561. 41. Dyson, H.J. and P.E. Wright, Intrinsically unstructured proteins and their functions. Nature Reviews Molecular Cell Biology, 2005. 6(3): p. 197-208. 42. van der Lee, R., et al., Classification of intrinsically disordered regions and proteins. Chem Rev, 2014. 114(13): p. 6589-631. 43. Pang, C.N., A. Hayen, and M.R. Wilkins, Surface accessibility of protein post- translational modifications. J Proteome Res, 2007. 6(5): p. 1833-45. 44. Pejaver, V., et al., The structural and functional signatures of proteins that undergo multiple events of post-translational modification. Protein Science, 2014. 23(8): p. 1077-1093. 45. Holt, L.J., et al., Global Analysis of Cdk1 Substrate Phosphorylation Sites Provides Insights into Evolution. Science, 2009. 325(5948): p. 1682-1686. 46. Gao, J. and D. Xu, Correlation between posttranslational modification and intrinsic disorder in protein. Pac Symp Biocomput, 2012: p. 94-103. 47. Studer, R.A., et al., Evolution of protein phosphorylation across 18 fungal species. Science, 2016. 354(6309): p. 229-232. 48. Pearlman, S.M., Z. Serber, and J.E. Ferrell, A Mechanism for the Evolution of Phosphorylation Sites. Cell, 2011. 147(4): p. 934-946. 49. Yang, X.J. and E. Seto, Lysine acetylation: codified crosstalk with other posttranslational modifications. Mol Cell, 2008. 31(4): p. 449-61. 50. Hagai, T., et al., The origins and evolution of ubiquitination sites. Mol Biosyst, 2012. 8(7): p. 1865-77. 51. Lu, L., et al., Functional constraints on adaptive evolution of protein ubiquitination sites. Sci Rep, 2017. 7: p. 39949. 52. 
Simonti, C.N., et al., Evolution of lysine acetylation in the RNA polymerase II C- terminal domain. Bmc Evolutionary Biology, 2015. 15.

127

53. Drazic, A., et al., The world of protein acetylation. Biochimica Et Biophysica Acta- Proteins and Proteomics, 2016. 1864(10): p. 1372-1401. 54. Lee, T.Y., et al., dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res, 2006. 34(Database issue): p. D622-7. 55. Gnad, F., J. Gunawardena, and M. Mann, PHOSIDA 2011: the posttranslational modification database. Nucleic Acids Res, 2011. 39(Database issue): p. D253- 60. 56. Hornbeck, P.V., et al., PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post- translational modifications in man and mouse. Nucleic Acids Res, 2012. 40(Database issue): p. D261-70. 57. Yates, A., et al., Ensembl 2016. Nucleic Acids Res, 2016. 44(D1): p. D710-6. 58. Breuza, L., et al., The UniProtKB guide to the human proteome. Database (Oxford), 2016. 2016. 59. O'Leary, N.A., et al., Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res, 2016. 44(D1): p. D733-45. 60. He, Z.L., et al., Evolview v2: an online visualization and management tool for customized and annotated phylogenetic trees. Nucleic Acids Research, 2016. 44(W1): p. W236-W241. 61. Letunic, I. and P. Bork, Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res, 2016. 44(W1): p. W242-5. 62. Altschul, S.F., et al., Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10. 63. Sievers, F. and D.G. Higgins, Clustal Omega, accurate alignment of very large numbers of sequences. Methods Mol Biol, 2014. 1079: p. 105-16. 64. Wu, M., S. Chatterji, and J.A. Eisen, Accounting for alignment uncertainty in phylogenomics. PLoS One, 2012. 7(1): p. e30288.

128

65. Lange, J., L.S. Wyrwicz, and G. Vriend, KMAD: knowledge-based multiple sequence alignment for intrinsically disordered proteins. Bioinformatics, 2016. 32(6): p. 932-6. 66. Eden, E., et al., GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics, 2009. 10: p. 48. 67. Chandonia, J.M., et al., The ASTRAL Compendium in 2004. Nucleic Acids Res, 2004. 32(Database issue): p. D189-92. 68. Harrison, P.M., fLPS: Fast discovery of compositional biases for the protein universe. Bmc Bioinformatics, 2017. 18. 69. An, L., D. Fitzpatrick, and P.M. Harrison, Emergence and evolution of yeast prion and prion-like proteins. Bmc Evolutionary Biology, 2016. 16. 70. An, L. and P.M. Harrison, The evolutionary scope and neurological disease linkage of yeast-prion-like proteins in humans. Biology Direct, 2016. 11. 71. Harbi, D., et al., PrionHome: a database of prions and other sequences relevant to prion phenomena. PLoS One, 2012. 7(2): p. e31785. 72. Team, R.C., R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria., 2015. 73. Harrison, P.M., Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila. BMC Bioinformatics, 2006. 7: p. 441. 74. Caron, C., C. Boyault, and S. Khochbin, Regulatory cross-talk between lysine acetylation and ubiquitination: role in the control of protein stability. Bioessays, 2005. 27(4): p. 408-415. 75. Iakoucheva, L.M., et al., The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res, 2004. 32(3): p. 1037-49. 76. Wu, Z., J. Connolly, and K.K. Biggar, Beyond histones - the expanding roles of protein lysine methylation. FEBS J, 2017. 77. Fukuchi, S., et al., IDEAL: Intrinsically Disordered proteins with Extensive Annotations and Literature. Nucleic Acids Res, 2012. 40(Database issue): p. D507-11.

129

78. Mohan, A., et al., Analysis of molecular recognition features (MoRFs). J Mol Biol, 2006. 362(5): p. 1043-59. 79. Dyson, H.J. and P.E. Wright, Coupling of folding and binding for unstructured proteins. Curr Opin Struct Biol, 2002. 12(1): p. 54-60. 80. Hasan, S., et al., Regulation of human flap endonuclease-1 activity by acetylation through the transcriptional coactivator p300. Molecular Cell, 2001. 7(6): p. 1221- 1231. 81. Trexler, A.J. and E. Rhoades, N-terminal acetylation is critical for forming a- helical oligomer of a-synuclein. Protein Science, 2012. 21(5): p. 601-605. 82. Huntley, M. and G.B. Golding, Evolution of simple sequence in proteins. Journal of Molecular Evolution, 2000. 51(2): p. 131-140. 83. Faux, N.G., et al., Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Research, 2005. 15(4): p. 537- 551. 84. Alba, M.M., P. Tompa, and R.A. Veitia, Amino acid repeats and the structure and evolution of proteins. Genome Dyn, 2007. 3: p. 119-30. 85. Otvos, L., Jr. and M. Cudic, Post-translational modifications in prion proteins. Curr Protein Pept Sci, 2002. 3(6): p. 643-52. 86. Gendoo, D.M.A. and P.M. Harrison, The Landscape of the Prion Protein's Structural Response to Mutation Revealed by Principal Component Analysis of Multiple NMR Ensembles. Plos Computational Biology, 2012. 8(8). 87. Harrison, P.M., A. Khachane, and M. Kumar, Genomic assessment of the evolution of the prion protein gene family in vertebrates. Genomics, 2010. 95(5): p. 268-77. 88. Waterhouse, A.M., et al., Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics, 2009. 25(9): p. 1189-91. 89. Dai, T.Y., et al., P68 RNA helicase as a molecular target for cancer therapy. Journal of Experimental & Clinical Cancer Research, 2014. 33. 90. Kontaxi, C., P. Piccardo, and A.C. Gill, Lysine-Directed Post-translational Modifications of Tau Protein in Alzheimer's Disease and Related Tauopathies. 
Front Mol Biosci, 2017. 4: p. 56.

130

91. Rathert, P., et al., Protein lysine methyltransferase G9a acts on non-histone targets. Nat Chem Biol, 2008. 4(6): p. 344-6. 92. Wesche, J., et al., Protein arginine methylation: a prominent modification and its demethylation. Cell Mol Life Sci, 2017. 93. Ong, S.E., G. Mittler, and M. Mann, Identifying and quantifying in vivo methylation sites by heavy methyl SILAC. Nat Methods, 2004. 1(2): p. 119-26. 94. Banerjee, K., et al., Regulation of tyrosine hydroxylase transcription by hnRNP K and DNA secondary structure. Nature Communications, 2014. 5. 95. Choudhary, C., et al., Lysine Acetylation Targets Protein Complexes and Co- Regulates Major Cellular Functions. Science, 2009. 325(5942): p. 834-840. 96. Mooney, S.M., et al., Sumoylation of p68 and p72 RNA Helicases Affects Protein Stability and Transactivation Potential. Biochemistry, 2010. 49(1): p. 1-10. 97. Gilberto, S. and M. Peter, Dynamic ubiquitin signaling in cell cycle regulation. Journal of Cell Biology, 2017. 216(8): p. 2259-2271. 98. Bannister, A.J. and T. Kouzarides, Regulation of chromatin by histone modifications. Cell Res, 2011. 21(3): p. 381-95. 99. Beltrao, P., et al., Evolution and functional cross-talk of protein post-translational modifications. Mol Syst Biol, 2013. 9: p. 714. 100. Woo, Y.H. and W.H. Li, Evolutionary Conservation of Histone Modifications in Mammals. Molecular Biology and Evolution, 2012. 29(7): p. 1757-1767. 101. Wisniewski, J.R., et al., Mass spectrometric mapping of linker histone H1 variants reveals multiple acetylations, methylations, and phosphorylation as well as differences between cell culture and tissue. Molecular & Cellular Proteomics, 2007. 6(1): p. 72-87. 102. Hergeth, S.P. and R. Schneider, The H1 linker histones: multifunctional proteins beyond the nucleosomal core particle. Embo Reports, 2015. 16(11): p. 1439- 1453. 103. Parkel, S., J.P. Lopez-Atalaya, and A. Barco, Histone H3 lysine methylation in cognition and intellectual disability disorders. 
Learning & Memory, 2013. 20(10): p. 570-579.

131

104. Cao, J. and Q. Yan, Histone ubiquitination and deubiquitination in transcription, DNA damage response, and cancer. Front Oncol, 2012. 2: p. 26. 105. Sears, R., et al., Multiple Ras-dependent phosphorylation pathways regulate Myc protein stability. Genes & Development, 2000. 14(19): p. 2501-2514. 106. Qin, W.H., et al., DNA methylation requires a DNMT1 ubiquitin interacting motif (UIM) and histone ubiquitination. Cell Research, 2015. 25(8): p. 911-929. 107. Carbon, S., et al., Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Research, 2017. 45(D1): p. D331-D338.

Acknowledgements:

This work was supported by Natural Sciences and Engineering Research Council of Canada (NSERC).

Chapter IV

4 Discussion and Conclusion

4.1 Discussion

The functional importance of disordered regions has been well established by experimental and computational studies. Intrinsic disorder prediction tools were first developed in 1997 [1], and the field continues to grow because of the distinct sequence, evolutionary and structural characteristics of disordered regions. The present work investigates the distinctive compositional trends, PTM distributions and evolutionary patterns of different types of disordered regions in eukaryotic proteins.

The amino acid composition provides conformational flexibility to disordered regions and can control whether they adopt a random coil or an ordered state. In general, IDRs are characterized by relatively high net charge and low hydrophobicity, which raises the intriguing question of whether this composition is maintained in different types of disordered region, specifically in IDRs located around FB regions. The bioinformatical analysis of the parsed datasets (explained in Chapter 2) shows that FB regions have a slight net positive charge. The enrichment analysis of Gene Ontology molecular function categories indicates the significance of FB regions in nucleic acid binding and DNA binding. The positively charged FB regions are likely to participate in electrostatic interactions with nucleic acids. Furthermore, FB regions with positively charged residues can mediate recognition and stabilize the complex between the nucleic acid and the FB region. Indeed, the role of FB and other disordered regions in electrostatic interactions is highlighted in previous studies [2-4]. Although some of the associated functions are common to IDRs in general, the conservation analysis of the parsed datasets showed an evolutionary behaviour peculiar to FB regions and the IDRs that embed them. The conservation of FB regions is maintained across vertebrate orthologs, and they show conservation comparable to that of the ordered regions. This is further confirmed by the sampling analysis of the parsed datasets. These results agree with the analysis of conserved intrinsic disorder in RNA-binding proteins [4, 5]. The sequence conservation of FB regions therefore appears to be important for maintaining their function.
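
The compositional quantities discussed above (net charge and hydrophobicity) can be illustrated with a minimal sketch. It is not the pipeline used in Chapter 2: the function names and the toy sequences are hypothetical, histidine is treated as neutral as a simplifying assumption, and the hydropathy values are the standard Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge_per_residue(seq: str) -> float:
    """Return (K + R - D - E) / length; histidine counts as neutral (assumption)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return (positive - negative) / len(seq)


def mean_hydropathy(seq: str) -> float:
    """Return the mean Kyte-Doolittle hydropathy; only the 20 standard residues are accepted."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical toy segments, invented for illustration only.
toy_regions = {"FB": "KRKSAKRLG", "DFB": "SEDSQEDTG"}
for name, segment in toy_regions.items():
    print(name, round(net_charge_per_residue(segment), 2), round(mean_hydropathy(segment), 2))
```

In this toy case the first segment carries a net positive charge and the second a net negative charge, mirroring the direction of the trend reported for FB and surrounding disordered regions; real analyses would average over all residues of each parsed subset.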

The evolutionary analysis of the parsed datasets showed that ‘disordered-around-FB’ (DFB) regions have extremely hydrophilic residues and a mild negative charge. The polar nature of these regions suggests that they may have a major role in complex formation. This result also suggests a possible guiding mechanism in which DFB regions facilitate folding-on-binding by preventing off-target interactions. Such a mechanism is observed in calcineurin, where the disordered region around the calmodulin-binding domain mediates binding to its target molecule [6]. Furthermore, the polar interactions enabled by DFB regions can confer high specificity for molecular recognition [7].

Similarly, previous studies have separated disordered regions into three categories: (i) flexible disorder, in which the property of disorder is conserved but the amino acid sequence evolves rapidly; (ii) constrained disorder, in which both disorder and amino acid sequence are conserved; and (iii) non-conserved disorder. Compared with constrained and non-conserved disorder, flexible disorder shows a preference for linear motifs and is involved in signalling and regulatory roles [5, 8]. Thus, in addition to clarifying the functional roles of different disordered regions [5, 8], analysing these regions could help in developing new prediction tools and IDR databases [9-12]. Moreover, both MoRFs and FB regions can undergo disorder-to-order transition, linear motifs (SLiMs) overlap with MoRFs [13-16], and linear motifs are also involved in folding on binding [10]; hence we suggest that FB regions could also be associated with linear motifs that maintain sequence conservation and facilitate the disorder-to-order transition.
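
The three-way classification described above can be expressed as a small decision rule. The sketch below is illustrative only: the function name, the two conservation scores (fractions between 0 and 1) and the 0.5 cutoffs are assumptions made for the example and are not taken from the cited studies.

```python
def classify_disorder(
    disorder_conservation: float,
    sequence_conservation: float,
    disorder_cutoff: float = 0.5,   # hypothetical threshold
    sequence_cutoff: float = 0.5,   # hypothetical threshold
) -> str:
    """Assign a disordered residue or region to one of the three categories.

    disorder_conservation: fraction of orthologs in which the position is disordered.
    sequence_conservation: fraction of orthologs in which the residue is identical.
    """
    if disorder_conservation < disorder_cutoff:
        # Disorder itself is not maintained across orthologs.
        return "non-conserved disorder"
    if sequence_conservation >= sequence_cutoff:
        # Both the disordered state and the sequence are maintained.
        return "constrained disorder"
    # Disorder is maintained while the sequence changes rapidly.
    return "flexible disorder"


print(classify_disorder(0.9, 0.8))
print(classify_disorder(0.9, 0.2))
print(classify_disorder(0.1, 0.9))
```

Under this scheme, FB regions as characterized in Chapter 2 would be expected to resemble the constrained category, since both their disorder and their sequence are conserved.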

The evolutionary behaviour of the four parsed regions motivated us to perform a similar conservation analysis of PTMs in ordered and disordered regions. Chapter 3 demonstrates the distribution and evolution of MAU (methylation, acetylation and ubiquitination) sites in ordered, FB and disordered regions across eukaryotic species. The significant conservation of ubiquitination sites in FB regions across mammals is shown in Chapter 3. In Chapter 2, we found that FB regions are more highly conserved than other disordered region types. Hence, I propose that the dual (both disordered and ordered) conformation of FB regions may provide and preserve sites for ubiquitination. Mutations and divergence in disordered regions have been shown to affect ubiquitination sites. Such is the case in yeast, where mutational differences that arise in the disordered regions of proteins after gene duplication events are reported to affect protein half-life [17]. One of the important FB proteins that we examined, Myc, is a key transcription factor in cancer, and mutation of its ubiquitination sites disrupts protein degradation [18].

Although we could not observe the same conservation pattern for other PTMs in FB regions, we observed a strong conservation signal for ubiquitination sites across mammals. Specifically, significant conservation of ubiquitinatable lysine residues is observed in RAC1, a member of the evolutionarily conserved RhoGTPase family. In RhoGTPases, ubiquitination plays many vital roles, including signalling, regulation and mediating crosstalk between different RhoGTPases [19, 20]. Furthermore, a recent study has shown polyubiquitination of Rac1 by inhibitors of apoptosis (IAPs) in mammalian cells [21], and enhanced Rac1 signalling has been reported to contribute to breast cancer progression [22].

The significant enrichment of ‘multiple-PTM’ sites indicates that these sites could act as a major driver of conservation in FB regions and confer functional diversity [23]. A recent study shows that disordered regions with multiple-PTM sites are involved in transcriptional, post-transcriptional and developmental processes. For example, acetylation and dimethylation of histone H4 at lysine 20 facilitate the interaction of the transcriptional coactivator CBP and the DNA-damage response protein 53BP1 [24-28]. In addition, the intrinsically disordered N-terminal (transactivation domain) and C-terminal (regulatory domain) regions of p53 provide sites for multiple types of PTMs and adopt an ordered state upon binding [29]. Moreover, crosstalk between p53 modifications, such as between methylation at different sites and between methylation and acetylation, plays vital roles in protein regulation [24, 30]. For instance, methylation of p53 at lysine 372 inhibits methylation at lysine 370 and blocks SMYD2 binding [31]. In addition, SET8 monomethylation at lysine 372 inhibits acetylation at the same residue, suggesting that methylation inhibits the p53-mediated DNA damage response [32]. Furthermore, it has been reported that experimentally determined and predicted multiple-PTM sites show a higher preference for MoRFs than for non-MoRFs, and provide different recognition surfaces for different binding partners [24]. In 2014, Huang et al. reported that multiple types of PTMs are frequently located in IDRs and demonstrated their role in protein-protein interactions [31]. Thus, although MoRFs and FB regions differ in amino acid sequence length, the features they share (mentioned in Chapter 1 and in Chapter 3) suggest that multiple-PTMs could facilitate the binding of FB regions to diverse interaction partners [24].

In addition, PTMs in disordered regions have been reported to induce disorder-to-order transition. For example, phosphorylation of 4E-binding protein 2 (4E-BP2) at threonine 37 and threonine 46 induces folding of 4E-BP2 and weakens its affinity for eIF4E, thereby affecting translation initiation [33]. In another example, acetylation and methylation in the disordered regions of histones are involved in chromatin remodelling. Acetylation of lysine residues in histones can neutralise their positive charge and alter the interaction between histones and DNA [33, 34]. Furthermore, FB regions have more positively charged residues than other disordered region types (discussed in Chapter 2); hence it is possible that PTMs involving positively charged residues (lysine and arginine) show a preference for FB regions.

However, a comparative study using different types of disordered regions with PTM site information could provide a better understanding of the different types of disordered regions and their preference for different PTMs.

We also performed a large-scale evolutionary analysis of experimentally determined MAU sites in ordered and disordered regions generally across eukaryotic organisms. PTMs such as methylation and acetylation have been intensively studied in histone proteins; our analysis includes a wide range of eukaryotic proteins. Overall, the MAU sites are highly conserved in both ordered and disordered regions across the mammalian clade. Specifically, evidence for the emergence of new MAU sites (except arginine methylation sites) in disordered regions is observed in mammals and deuterostomes. This suggests that the conserved MAU sites in complex organisms could have a vital role in regulatory processes and diseases that arose during these evolutionary epochs. One such example is that MAU PTMs are conserved in myelin basic protein (MBP), an intrinsically disordered protein involved in multiple sclerosis, but this role is not reported for MBP in lower vertebrates [35-37]. Hence, studying the evolutionary trends of PTMs at multiple levels could provide insights into the pathogenesis of diseases.
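
The idea of site-level conservation across orthologs can be sketched as follows. This is a simplified illustration rather than the procedure of Chapter 3: the function names, the species labels and the toy alignment are hypothetical, and conservation is taken here simply as residue identity in the aligned column.

```python
def column_of_residue(aligned_ref: str, position: int) -> int:
    """Return the 0-based alignment column of a 1-based ungapped reference position."""
    seen = 0
    for column, char in enumerate(aligned_ref):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies outside the reference sequence")


def site_conservation(alignment: dict, ref: str, position: int) -> float:
    """Fraction of non-reference sequences that share the reference residue at the site."""
    column = column_of_residue(alignment[ref], position)
    residue = alignment[ref][column]
    others = [seq for name, seq in alignment.items() if name != ref]
    if not others:
        raise ValueError("alignment contains no orthologs")
    return sum(seq[column] == residue for seq in others) / len(others)


# Hypothetical toy alignment; '-' marks a gap.
toy_alignment = {
    "human":     "MK-AKR",
    "mouse":     "MKSAKR",
    "chicken":   "MR-ARR",
    "zebrafish": "M--AKQ",
}

# Conservation of the lysine at human position 4 among the three orthologs.
print(site_conservation(toy_alignment, "human", 4))
```

Restricting the set of orthologs to a given clade (for example mammals only) and repeating the calculation at each clade is one way to obtain conservation at multiple evolutionary levels.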

We observed a distinct evolutionary pattern in histones: the lysine methylation and ubiquitination sites are highly conserved in disordered regions across mammals, while lysine acetylation sites are conserved in ordered regions. Specifically, we observed a notable number of conserved lysine methylation site residues in histone H1 and H3 variants. In general, histones are highly conserved proteins, and PTMs on histones play a vital role in eukaryotic gene expression. However, the linker histone H1 is the most divergent family of histones. Interestingly, a variant-specific lysine methylation site has been reported in the C-terminal domain of a mammalian histone H1 variant [38]. It is possible that H1 variants acquired a considerable number of lysine methylation sites in mammals. We also observed a high proportion of conserved ubiquitination site residues in core histones. As mentioned earlier, highly conserved ubiquitination site residues are identified in the FB regions of mammalian proteins. Thus, we hypothesize that the core histones may have characteristics similar to those of FB regions in this respect.

4.2 Conclusion

In conclusion, the findings of our study show that FB regions are more conserved than other disordered region types. The highly polar nature of disordered-around-FB regions provides a possible composition-based steering mechanism for the folding-on-binding of FB regions. Moreover, FB regions show significant conservation of ubiquitination site residues. The presented study also finds a complex evolutionary trend for MAU PTMs in ordered and disordered regions across eukaryotes. Thus, this study highlights the need to analyse sequence conservation at different levels to gain a complete picture of the evolution of these protein regions.

4.3 References

1. Romero, P., et al., Identifying disordered regions in proteins from amino acid sequence. 1997 Ieee International Conference on Neural Networks, Vols 1-4, 1997: p. 90-95. 2. Chu, X.K., et al., Importance of Electrostatic Interactions in the Association of Intrinsically Disordered Histone Chaperone Chz1 and Histone H2A.Z-H2B. Plos Computational Biology, 2012. 8(7). 3. Vuzman, D. and Y. Levy, Intrinsically disordered regions as affinity tuners in protein-DNA interactions. Molecular Biosystems, 2012. 8(1): p. 47-57. 4. Varadi, M., et al., Functional Advantages of Conserved Intrinsic Disorder in RNA- Binding Proteins. Plos One, 2015. 10(10). 5. Bellay, J., et al., Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biology, 2011. 12(2). 6. Uversky, V.N., Multitude of binding modes attainable by intrinsically disordered proteins: a portrait gallery of disorder-based complexes. Chem Soc Rev, 2011. 40(3): p. 1623-34. 7. Wong, E.T., D. Na, and J. Gsponer, On the importance of polar interactions for complexes containing intrinsically disordered proteins. PLoS Comput Biol, 2013. 9(8): p. e1003192. 8. Colak, R., et al., Distinct Types of Disorder in the Human Proteome: Functional Implications for Alternative Splicing. Plos Computational Biology, 2013. 9(4). 9. Dosztanyi, Z., B. Meszaros, and I. Simon, ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics, 2009. 25(20): p. 2745-6. 10. van der Lee, R., et al., Classification of intrinsically disordered regions and proteins. Chem Rev, 2014. 114(13): p. 6589-631. 11. Potenza, E., et al., MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins. Nucleic Acids Res, 2014.

141

12. Di Domenico, T., et al., MobiDB: a comprehensive database of intrinsic protein disorder annotations. Bioinformatics, 2012. 28(15): p. 2080-2081. 13. Aouacheria, A., et al., Redefining the BH3 Death Domain as a ''. Trends in Biochemical Sciences, 2015. 40(12): p. 736-748. 14. Van Roey, K., et al., Short Linear Motifs: Ubiquitous and Functionally Diverse Protein Interaction Modules Directing Cell Regulation. Chemical Reviews, 2014. 114(13): p. 6733-6778. 15. Meszaros, B., Z. Dosztanyi, and I. Simon, Disordered Binding Regions and Linear Motifs-Bridging the Gap between Two Models of Molecular Recognition. Plos One, 2012. 7(10). 16. Fuxreiter, M., P. Tompa, and I. Simon, Local structural disorder imparts plasticity on linear motifs. Bioinformatics, 2007. 23(8): p. 950-956. 17. van der Lee, R., et al., Intrinsically Disordered Segments Affect Protein Half-Life in the Cell and during Evolution. Cell Reports, 2014. 8(6): p. 1832-1844. 18. Sears, R., et al., Multiple Ras-dependent phosphorylation pathways regulate Myc protein stability. Genes & Development, 2000. 14(19): p. 2501-2514. 19. Nethe, M. and P.L. Hordijk, The role of ubiquitylation and degradation in RhoGTPase signalling. Journal of Cell Science, 2010. 123(23): p. 4011-4018. 20. Visvikis, O., et al., Activated Rac1, but not the tumorigenic variant Rac1b, is ubiquitinated on Lys 147 through a JNK-regulated process. FEBS J, 2008. 275(2): p. 386-96. 21. Oberoi-Khanuja, T.K. and K. Rajalingam, Ubiquitination of Rac1 by inhibitors of apoptosis (IAPs). Methods Mol Biol, 2014. 1120: p. 43-54. 22. Goka, E.T. and M.E. Lippman, Loss of the E3 ubiquitin ligase HACE1 results in enhanced Rac1 signaling contributing to breast cancer progression. Oncogene, 2015. 34(42): p. 5395-405. 23. Darling, A.L. and V.N. Uversky, Intrinsic Disorder and Posttranslational Modifications: The Darker Side of the Biological Dark Matter. Front Genet, 2018. 9: p. 158.

142

24. Pejaver, V., et al., The structural and functional signatures of proteins that undergo multiple events of post-translational modification. Protein Sci, 2014. 23(8): p. 1077-93. 25. Chapman, J.R., M.R. Taylor, and S.J. Boulton, Playing the end game: DNA double-strand break repair pathway choice. Mol Cell, 2012. 47(4): p. 497-510. 26. Das, C., et al., Binding of the histone chaperone ASF1 to the CBP bromodomain promotes histone acetylation. Proc Natl Acad Sci U S A, 2014. 111(12): p. E1072-81. 27. Deng, Z., et al., The CBP bromodomain and nucleosome targeting are required for Zta-directed nucleosome acetylation and transcription activation. Mol Cell Biol, 2003. 23(8): p. 2633-44. 28. Zeng, L., et al., Structural basis of site-specific histone recognition by the bromodomains of human coactivators PCAF and CBP/p300. Structure, 2008. 16(4): p. 643-52. 29. Dai, C. and W. Gu, p53 post-translational modification: deregulated in tumorigenesis. Trends in Molecular Medicine, 2010. 16(11): p. 528-536. 30. Oldfield, C.J., et al., Flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners. Bmc Genomics, 2008. 9. 31. Huang, Q.L., et al., Human Proteins with Target Sites of Multiple Post- Translational Modification Types Are More Prone to Be Involved in Disease. Journal of Proteome Research, 2014. 13(6): p. 2735-2748. 32. Huang, J., et al., Repression of p53 activity by Smyd2-mediated methylation. Nature, 2006. 444(7119): p. 629-632. 33. Bah, A. and J.D. Forman-Kay, Modulation of Intrinsically Disordered Protein Function by Post-translational Modifications. J Biol Chem, 2016. 291(13): p. 6696-705. 34. Henikoff, S. and A. Shilatifard, Histone modification: cause or cog? Trends in Genetics, 2011. 27(10): p. 389-396. 35. Bamm, V.V., et al., Structured Functional Domains of Myelin Basic Protein: Cross Talk between Actin Polymerization and Ca2+-Dependent Calmodulin Interaction. Biophysical Journal, 2011. 101(5): p. 1248-1256.


36. Kim, J.K., et al., Multiple sclerosis - An important role for post-translational modifications of myelin basic protein in pathogenesis. Molecular & Cellular Proteomics, 2003. 2(7): p. 453-462.
37. Zhang, C.C., et al., Myelin Basic Protein Undergoes a Broader Range of Modifications in Mammals than in Lower Vertebrates. Journal of Proteome Research, 2012. 11(10): p. 4791-4802.
38. Weiss, T., et al., Histone H1 variant-specific lysine methylation by G9a/KMT1C and Glp1/KMT1D. Epigenetics & Chromatin, 2010. 3.


APPENDIX A

Supplementary Data for Chapter III

Discerning evolutionary trends in post-translational modification and the effect of intrinsic disorder: Analysis of methylation, acetylation and ubiquitination sites in human proteins


A.1: Figures

Figures S3.1-S3.4: Phylogenetic trees showing the conservation of methylation, acetylation and ubiquitination (MAU) sites as other MAU site residue types in ordered and disordered regions at eleven evolutionary levels. The bubble size in the evolutionary trees represents the fraction of conserved and newly emerged MAU sites in ordered (green) and disordered (orange) regions at each clade. A hypergeometric probability test was applied to identify the enrichment of conserved and new conserved MAU sites in ordered and disordered regions at each clade. The test was performed with all lysine/arginine residues in the proteins with MAU sites as the background population, all conserved lysine/arginine residues as successes in the background population, MAU-modified lysine/arginine residues as the sample, and conserved MAU-modified sites as successes in the sample. We applied a Bonferroni correction for multiple hypothesis testing; P-values are considered significant at P < 0.00417 for lysine modifications and P < 0.0125 for arginine methylation. The number of conserved MAU sites at each clade and the significance of the enrichment of conserved and newly emerged MAU sites in ordered and disordered regions are given inside the bubbles.

(A) Fraction of conserved and new conserved lysine methylation sites and the number of conserved and new conserved lysine methylation sites in ordered and disordered regions.

(B) Fraction of conserved and new conserved arginine methylation sites and the number of conserved and new arginine methylation sites in ordered and disordered regions.

(C) Fraction of conserved and new conserved lysine acetylation sites and the number of conserved and new lysine acetylation sites in ordered and disordered regions.
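The enrichment test described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the thesis pipeline: it assumes a one-sided (upper-tail) hypergeometric test, the function name `hypergeom_enrichment_p` is hypothetical, and the counts in the example are invented purely to show the call.

```python
from math import comb


def hypergeom_enrichment_p(population, successes_in_population,
                           sample, successes_in_sample):
    """Upper-tail hypergeometric P-value, P(X >= successes_in_sample).

    population              -- all lysine/arginine residues (background)
    successes_in_population -- all conserved lysine/arginine residues
    sample                  -- MAU-modified lysine/arginine residues
    successes_in_sample     -- conserved MAU-modified residues
    """
    N, K, n, k = (population, successes_in_population,
                  sample, successes_in_sample)
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    # Sum exact integer numerators, then divide once by C(N, n).
    numerator = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, min(K, n) + 1))
    return numerator / comb(N, n)


# Illustrative (made-up) counts: 20 residues, 8 conserved,
# 5 modified, 4 of the modified residues conserved.
p_value = hypergeom_enrichment_p(20, 8, 5, 4)
print(f"P(X >= 4) = {p_value:.4f}")
```

A small P-value here means that conserved residues are over-represented among the modified residues relative to the background; whether it counts as significant depends on the Bonferroni-corrected threshold quoted in the caption.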


[Figure S3.1: phylogenetic trees for methylation (K) sites in ordered regions and in disordered regions. Bubbles give the fraction of conserved and of new conserved sites at each clade (Other Eukaryotes, Other Metazoa, Other Deuterostomia, Other Chordates, Other Vertebrates, Other Tetrapods, Other Mammals, Other Eutheria, Other Supraprimates, Other Primates, Other Apes, Human). NS: not significant.]

Figure S3.1 Conservation of lysine methylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level.


[Figure S3.2: phylogenetic trees for methylation (R) sites in ordered regions and in disordered regions. Bubbles give the fraction of conserved and of new conserved sites at each clade (Other Eukaryotes, Other Metazoa, Other Deuterostomia, Other Chordates, Other Vertebrates, Other Tetrapods, Other Mammals, Other Eutheria, Other Supraprimates, Other Primates, Other Apes, Human). NS: not significant.]

Figure S3.2 Conservation of arginine methylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level.


[Figure S3.3: phylogenetic trees for acetylation (K) sites in ordered regions and in disordered regions. Bubbles give the fraction of conserved and of new conserved sites at each clade (Other Eukaryotes, Other Metazoa, Other Deuterostomia, Other Chordates, Other Vertebrates, Other Tetrapods, Other Mammals, Other Eutheria, Other Supraprimates, Other Primates, Other Apes, Human). NS: not significant.]

Figure S3.3 Conservation of lysine acetylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level.


[Figure S3.4: phylogenetic trees for ubiquitination (K) sites in ordered regions and in disordered regions. Bubbles give the fraction of conserved and of new conserved sites at each clade (Other Eukaryotes, Other Metazoa, Other Deuterostomia, Other Chordates, Other Vertebrates, Other Tetrapods, Other Mammals, Other Eutheria, Other Supraprimates, Other Primates, Other Apes, Human). NS: not significant.]

Figure S3.4 Conservation of lysine ubiquitination sites and new conserved sites in ordered and disordered regions at each eukaryotic level.


Figures S3.5-S3.15: Supplementary figures detailing the enrichment of conserved MAU sites at different evolutionary levels. The hypergeometric test is used to perform the enrichment calculations in each clade, taking the number of all lysine/arginine residues (population), the number of all conserved lysines/arginines (successes in population), the number of MAU-modified lysines/arginines (sample) and the number of conserved MAU-modified lysines/arginines (successes in sample). To calculate the enrichment of new (methylatable/acetylatable/ubiquitinatable) sites in ordered and disordered regions, we used the number of all lysines/arginines (population), the number of all new conserved lysines/arginines (successes in population), the number of MAU-modified lysines/arginines (sample) and the number of new conserved MAU-modified lysines/arginines (successes in sample).
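Written as a formula, the four counts named in this caption map onto the hypergeometric tail probability as follows. This is a sketch that assumes a one-sided (upper-tail) test; the symbols $N$, $K$, $n$ and $k$ are introduced here for illustration and do not appear in the thesis.

```latex
% N : all lysine/arginine residues (population)
% K : conserved (or new conserved) lysine/arginine residues (successes in population)
% n : MAU-modified lysine/arginine residues (sample)
% k : conserved (or new conserved) MAU-modified residues (successes in sample)
P(X \ge k) \;=\; \sum_{i=k}^{\min(K,\,n)}
  \frac{\binom{K}{i}\,\binom{N-K}{\,n-i\,}}{\binom{N}{n}}
```

For the new-site calculation only $K$ and $k$ change (new conserved residues replace conserved residues); $N$ and $n$ stay the same.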


[Figure S3.5, panels A-D: hypergeometric P-value (log scale) for lysine methylation, arginine methylation, lysine acetylation and ubiquitination sites in ordered and disordered regions. Panels A and B show conserved sites; panels C and D show new sites.]

Figure S3.5 Conservation of MAU sites in ordered and disordered regions across all eukaryotic clades.

The P-value threshold is < 0.0125 for arginine methylation and < 0.00417 for lysine methylation, acetylation and ubiquitination sites.
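The two thresholds quoted here are consistent with a Bonferroni correction of a family-wise significance level of 0.05. The sketch below shows the arithmetic. The number of tests (4 for arginine methylation, 12 for the lysine modifications) is inferred from the thresholds themselves and is an assumption, not a figure stated in this appendix. The helper names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True if p_value falls below the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Assumed test counts, chosen because they reproduce the quoted thresholds.
arginine_threshold = bonferroni_threshold(0.05, 4)   # 0.0125
lysine_threshold = bonferroni_threshold(0.05, 12)    # about 0.00417

print(f"arginine: {arginine_threshold:.4f}, lysine: {lysine_threshold:.5f}")
```

A hypergeometric P-value for a lysine modification would then be reported as significant only if `is_significant(p, 0.05, 12)` holds.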


[Figure S3.6, panels A-D: hypergeometric P-value (log scale) for lysine methylation, arginine methylation and lysine acetylation sites in ordered and disordered regions. Panels A and B show conserved sites; panels C and D show new sites.]

Figure S3.6 Conservation of MAU sites as other MAU residue type in ordered and disordered regions across all eukaryotic clades.

The P-value threshold is < 0.0125 for arginine methylation and < 0.00417 for lysine methylation, acetylation and ubiquitination sites.


[Figure S3.7, panels A-D: hypergeometric P-value (log scale) for lysine methylation, arginine methylation, lysine acetylation and ubiquitination sites in ordered and disordered regions. Panels A and B show conserved sites; panels C and D show new sites.]

Figure S3.7 Conservation of MAU sites filtered with the ZORRO program in ordered and disordered regions across all eukaryotic clades.

The P-value threshold is < 0.025 for arginine methylation and < 0.0083 for lysine methylation, acetylation and ubiquitination sites.


[Figure S3.8, panels A-D: hypergeometric P-value (log scale) for lysine methylation, arginine methylation, lysine acetylation and ubiquitination sites in ordered and disordered regions. Panels A and B show conserved sites; panels C and D show new sites.]

Figure S3.8 Conservation of MAU sites in non-histone proteins in ordered and disordered regions across all eukaryotic clades.

The P-value threshold is < 0.0125 for arginine methylation and < 0.00417 for lysine methylation, acetylation and ubiquitination sites.


[Figure S3.9, panels A and B: hypergeometric P-value (log scale) for new lysine methylation, arginine methylation, lysine acetylation and ubiquitination sites in ordered and disordered regions.]

Figure S3.9 Conservation of new MAU sites in 'old' proteins in ordered and disordered regions across all eukaryotic clades.

The P-value threshold is < 0.05 for arginine methylation and < 0.0167 for lysine methylation, acetylation and ubiquitination sites.


[Figure S3.10, panels A-D: hypergeometric P-value (log scale) for lysine methylation, arginine methylation, lysine acetylation and ubiquitination sites in ordered and disordered regions. Panels A and B show conserved sites; panels C and D show new sites.]

Figure S3.10 Conservation of MAU sites in ordered and disordered regions of histone proteins across all eukaryotic clades.

The P-value threshold is < 0.025 for arginine methylation and < 0.0083 for lysine methylation, acetylation and ubiquitination sites.


[Figure S3.11, panels A-D: hypergeometric P-value (log scale) for lysine methylation, arginine methylation and lysine acetylation sites in ordered and disordered regions. Panels A and B show conserved sites; panels C and D show new sites.]

Figure S3.11 Conservation of MAU sites as other MAU residue type in ordered and disordered regions of histone proteins across all eukaryotic clades.

The P-value threshold is < 0.025 for arginine methylation and < 0.0083 for lysine methylation, acetylation and ubiquitination sites.


[Figure S3.12, panels A and B: hypergeometric P-value (log scale) for sites with multiple MAUs and new sites with multiple MAUs, in ordered and disordered regions.]

Figure S3.12 Conservation of sites with multiple MAUs in ordered and disordered regions across all eukaryotic clades.

The P-value threshold is < 0.05.

[Figure S3.13, panels A and B: hypergeometric P-value (log scale) for lysine methylation, arginine methylation, lysine acetylation and ubiquitination sites in ordered and disordered regions.]

Figure S3.13 Conservation of sites with MAUs in FB (treated as a sample of O and DO) regions across all eukaryotic clades.

The P-value threshold is < 0.05.


[Figure S3.14: hypergeometric P-value (log scale, 1 to 1E-300) for lysine methylation, arginine methylation, lysine acetylation and lysine ubiquitination sites in disordered regions.]

Figure S3.14 Conservation of MAU sites in disordered regions across all eukaryotic clades; sequences were aligned using the KMAD alignment tool.

The P-value threshold is < 0.0125 for arginine methylation and < 0.00417 for lysine methylation, acetylation and ubiquitination sites.


[Figure S3.15: hypergeometric P-value (log scale, 1 to 1E-30) for lysine methylation, arginine methylation, lysine acetylation and lysine ubiquitination sites in disordered regions.]

Figure S3.15 Conservation of MAU sites in disordered regions (predicted by the IUPred tool) across all eukaryotic clades.

The P-value threshold is < 0.0125 for arginine methylation and < 0.00417 for lysine methylation, acetylation and ubiquitination sites.


Figure S3.16 Gene Ontology category enrichments at different evolutionary levels. Gene Ontology (GO) category enrichments for proteins with newly emerged conserved MAU-site residues at each evolutionary level are depicted as a heat map in part A of the figure. The heat map colour indicates the log P-value for significant GO category enrichments.

In part B of the figure, the same calculation is performed for the whole set of MAU-modified proteins.

These calculations were performed as described in the Methods section.


(Figure S3.16, continued)

[Figure S3.16, part A: heat map with a colour key spanning roughly -100 to 100. Rows are GO categories (for example heterocyclic compound binding, protein binding, RNA binding, nucleotide binding, ATP binding, catalytic activity, kinase activity, GTPase activity, structural constituent of ribosome); columns are the evolutionary levels Apes, Eutheria, Metazoa, Primates, Mammals, Tetrapods, Chordates, Vertebrates, Supraprimates and Deuterostomes.]


[Figure S3.16, part B: heat map with a colour key spanning roughly -150 to 150. Rows are GO categories for the whole set of MAU-modified proteins (for example heterocyclic compound binding, organic cyclic compound binding, protein binding, RNA binding, ATP binding, catalytic activity, kinase activity, chromatin binding, histone binding).]
