Bioinformatical analysis of intrinsically disordered regions in eukaryotes: Insights into the evolution of folding-on-binding
regions and post-translational modifications
Mohanalakshmi Narasumani
Department of Biology McGill University Montreal, Quebec, Canada
APRIL 2018
A thesis submitted to McGill University in partial fulfilment of the requirements of the degree of Doctor of Philosophy in Biology (Bioinformatics concentration)
© 2018 Mohanalakshmi Narasumani
ABSTRACT
Intrinsically disordered proteins (IDPs) or intrinsically disordered regions (IDRs) in proteins can exhibit a partially folded or unfolded state under physiological conditions but confer several functional advantages. IDRs can fold into a stable tertiary structure when bound to their partner molecule, a transition that can be promoted by post-translational modifications (PTMs). Intrinsic disorder is found in all domains of life but is particularly prevalent in eukaryotes. This thesis investigates the composition and evolutionary behaviour of disordered regions that undergo a disorder-to-order transition and the evolutionary trend of
PTMs in IDRs across eukaryotes using computational methods and tools.
Bioinformatical parsing of human folding-on-binding (FB) proteins into four subsets
(ordered, FBs, disordered regions that surround FBs, and other disordered regions) was performed to examine whether the composition and evolutionary behaviour (across vertebrate orthologs) are different in these four subsets. This analysis revealed that compositionally, ordered protein regions are distinct from the three other subsets, but the
FB regions are of comparable evolutionary conservation to the ordered regions.
Disordered regions surrounding FB regions are more negatively charged and less conserved than their adjacent FB regions. The presented results suggest a role for hydrophilic or charged residues around FBs in steering FB regions towards the binding sites of partner molecules. The insights gained from the analysis of evolutionary conservation for FBs motivated the examination of a related question, namely the evolutionary conservation of PTMs in IDPs/IDRs in comparison to PTMs in ordered regions.
In another bioinformatical approach, the conservation and emergence of methylation, acetylation and ubiquitination (MAU) sites in ordered and disordered regions were examined across 11 evolutionary clades, from the whole eukaryotic domain down to the ape superfamily. These sites occur mainly at arginine and lysine residues. MAU modification was found to be a major driver of conservation for arginines and lysines in both ordered and disordered regions across the 11 levels, most significantly across the mammalian clade. Furthermore, the emergence of a significant number of new lysine
MAU sites is found in the disordered regions of proteins in deuterostomes and mammals.
In histones, MAU sites exhibit a distinct significant conservation pattern evident as far back as the last common ancestor of mammals. In a separate multiple evolutionary level analysis of the experimentally-verified human FB regions, a significant enrichment of conserved ubiquitination sites in FB regions was identified at all evolutionary levels back as far as mammals. Similarly, FB regions showed a significant preference for sites with multiple MAU modifications when treated both as a sample of ordered and of disordered regions. These results indicate the need to consider sequence analysis of IDRs at multiple evolutionary levels in order to understand their complex evolutionary patterns. The presented study as a whole demonstrates the distinctive amino acid composition, PTM preference and conservation of IDRs that exhibit different conformational states (e.g. disordered, disordered around FB and FB regions), and the interplay between these properties.
RÉSUMÉ
Intrinsically disordered proteins (IDPs) or intrinsically disordered regions (IDRs) in proteins can show a partially folded or unfolded state under physiological conditions but confer several functional advantages. IDRs can fold into a stable tertiary structure when they bind to their partner molecule, a transition that can be facilitated by post-translational modifications (PTMs). Intrinsic disorder is found in all domains of life but is very common in eukaryotes. This thesis studies the composition and evolutionary behaviour of disordered regions that undergo a disorder-to-order transition, and the evolutionary trend of PTMs in IDRs across eukaryotes, using computational methods and tools.
Bioinformatic classification of human folding-on-binding (FB) proteins into four sets (ordered, FBs, disordered regions that surround FBs, and other disordered regions) was carried out to examine whether composition and evolutionary behaviour (across vertebrate orthologs) differ among these four sets. The analysis revealed that, in composition, the ordered regions of proteins are distinct from the three other sets, but the FB regions show evolutionary conservation similar to that of the ordered regions. The disordered regions that surround FB regions are more negatively charged and less conserved than their adjacent FB regions. The results presented suggest a role for hydrophilic or charged residues around FBs in steering FB regions towards the binding sites of partner molecules. The insight provided by the analysis of evolutionary conservation for FBs motivates the study of a related question, namely the evolutionary conservation of PTMs in IDPs/IDRs compared with PTMs in ordered regions.
In another bioinformatic approach, the conservation and emergence of methylation, acetylation and ubiquitination sites in ordered and disordered regions were studied across 11 evolutionary clades, from the entire eukaryotic domain down to the great ape superfamily. These sites occur mainly at arginine and lysine residues. It was shown that MAU PTM plays a major role in the conservation of arginines and lysines in both ordered and disordered regions, at all 11 levels, most significantly in the mammalian clade. It was also shown that a significant number of new lysine MAU sites appear in the disordered regions of proteins in deuterostomes and mammals. In histones, MAU sites show a significantly distinct conservation pattern that is evident as far back as the most distant common ancestor of mammals. In a separate analysis of experimentally verified human FB regions at multiple evolutionary levels, a significant enrichment of conserved ubiquitination sites in FB regions was found at all evolutionary levels within mammals. Similarly, FB regions show a significant preference for sites with multiple MAU modifications when they are treated as a sample of either ordered or disordered regions. These results indicate the need to consider the analysis of IDRs at multiple evolutionary levels in order to understand their complex evolutionary patterns. The present study as a whole demonstrates the distinct amino acid composition, PTM preference and conservation of IDRs that show different conformational states (e.g., disordered, disordered around FB, and FB regions), and the interrelationships between these properties.
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deepest gratitude to my supervisor
Professor Paul Harrison, for his guidance, motivation and continuous support during my studies. He introduced me to the fascinating world of intrinsically disordered proteins.
Most of all, I thank him for his patience and encouragement. Working under his supervision is a once-in-a-lifetime experience for me. His insightful comments, critical thinking and expertise helped me to improve my research in the past five years.
I am grateful to the members of my supervisory committee, Prof. Jérôme
Waldispühl and Prof. Jacek Majewski for the guidance, suggestions and fruitful discussions during supervisory committee meetings.
I am extremely grateful to the Biology Department for the support to complete my thesis. I would like to especially thank Ancil Gittens, Susan Bocti, Susan Gabe, Anne-
Marie L'Heureux, Sonal Patel and Tony for their support in the Biology Department.
Special thanks to Sébastien Portalier for translating my thesis abstract into French.
I would like to thank the members of Compute Canada and Calcul Québec support group for their prompt responses and patience in answering my questions and requests.
I owe my sincere thanks to my undergraduate mentor Prof. C.S. Parameshwari for the support and encouragement. Many thanks to my Master’s thesis supervisors Dr.
Trishul Artham and Prof. Mukesh Doble for giving me the opportunity to work under their guidance.
I must thank my friends Swetha, Charana, Aman, Jaskaran, Saleem, Saravan and
Elika for their support.
I would like to thank my brothers, N. Kannabiran and N. Sathiya Narayanan for
their constant support and encouragement. I thank my sister-in-law Revathy Kannabiran
for her support. I would also like to thank my little friends, Sujan and Charan for their love
and affection.
Finally, this thesis would not have been possible without the support of my fiancé,
Vijay. You are my source of inspiration, my best friend and you made me who I am today.
I would like to dedicate this thesis to my parents M. V. Narasumani and N. Thulasi for
their faith, unconditional love, motivation, and constant support. I am blessed to have
parents like them.
PREFACE
Thesis format and organization
This thesis is written in manuscript form as given by the McGill University Graduate
Studies and Research. The work presented here is the original work of the candidate. It
comprises two manuscripts on which the candidate is the lead author.
The detailed background information of the current literature, review of the topics
and objectives of the research project are presented in Chapter 1. I investigated the
sequence composition, post-translational modifications and evolutionary trends of
disordered regions in eukaryotes. The findings of this investigation are presented in
Chapter 2 and Chapter 3. In Chapter 4, I discuss the implications of this work and provide
the conclusion, thereby creating a cohesive document.
Contribution of authors
Chapter 2:
Narasumani, M. and Harrison, P. M. Bioinformatical parsing of folding-on-binding
proteins reveals their compositional and evolutionary sequence design. Sci. Rep.
5, 18586; doi: 10.1038/srep18586 (2015).
Professor Harrison and I designed the study and I performed almost all of the data analysis.
Chapter 3:
Narasumani, M. and Harrison, P. M. Discerning evolutionary trends in post-
translational modification and the effect of intrinsic disorder: Analysis of
methylation, acetylation and ubiquitination sites in human proteins. Accepted in
PLOS Computational Biology.
Professor Harrison and I designed the study and I performed all of the analysis.
I have written the initial drafts of all the manuscripts and Professor Harrison edited the later drafts of the manuscripts.
Contribution to Knowledge
Data analysis and interpretation of the research presented in Chapters 2 and 3 were
performed by me, under the supervision of Prof. Paul Harrison. The results of these
studies have been prepared and submitted to peer-reviewed publications.
Chapter 2
This study attempts to understand the composition and evolutionary behaviour of
the human proteins that contain folding-on-binding regions. Here, I examined the amino acid composition and evolution of the four parsed regions: (i) Ordered, (ii) Other
Disordered, (iii) Disordered-around-FB (DFB) and (iv) FB regions. As a result of this study, I found that Ordered and FB regions group together as highly conserved. I also observed
that DFB regions are more strongly hydrophilic than the other disordered datasets. This study describes a possible composition-based steering mechanism for FB region folding-on-binding. This analysis is the first of its kind to perform bioinformatical parsing of FB proteins; it emphasizes the similarities between Ordered and FB regions and highlights the differences between FB and other disordered datasets.
Chapter 3
The presented work is a large-scale evolutionary analysis of more than 15,000 experimentally determined human methylation, acetylation and ubiquitination (MAU) sites in ordered, FB and disordered regions across 11 eukaryotic clades. This is, to my knowledge, the first large-scale analysis of evolutionary trends in ordered and disordered regions of proteins in 380 eukaryotic species. In this study, I identified a significant conservation of ubiquitination sites in FB regions across mammals. I found that MAU modification is a major driver of conservation for arginines and lysines in both ordered and disordered regions across the mammalian clade. I also observed the emergence of new lysine MAU sites in the disordered regions of proteins in deuterostomes and mammals. The analysis of sequence conservation in ordered and disordered regions at multiple evolutionary levels is a novel approach to examining the complex patterns of PTM evolution across eukaryotes.
Table of Contents
ABSTRACT
RÉSUMÉ
ACKNOWLEDGEMENTS
PREFACE
  Contribution of authors
  Contribution to Knowledge
LIST OF ABBREVIATIONS
1 Introduction
  1.1 Intrinsically Disordered Regions (IDRs)
    1.1.1 Brief history of Protein structure and function
    1.1.2 Intrinsically Disordered Regions/Proteins
    1.1.3 Coupled folding and binding of IDRs
    1.1.4 Sequence composition of IDRs
    1.1.5 Characterization of intrinsically disordered regions/proteins
    1.1.6 Experimental determination of IDRs
    1.1.7 Computational tools to predict intrinsically disordered regions and proteins
  1.2 Functional advantages of IDRs
    1.2.1 Molecular Recognition
    1.2.2 Post-translational modifications in disordered regions
  1.3 Evolution of IDRs
  1.4 Classification of IDRs
  1.5 Disease associated with IDRs
  1.6 Role of IDRs in drug development
  1.7 Objectives of the Research
  1.8 References
2 Bioinformatical parsing of folding-on-binding proteins reveals compositional sequence design and evidence for a general guiding mechanism for binding
  2.1 Abstract
  2.2 Introduction
  2.3 Methods
    2.3.1 Data sets
    2.3.2 Multiple sequence alignments
    2.3.3 Conservation analysis of the aligned sequences
    2.3.4 Hydrophobicity and Charge calculation
  2.4 Results and Discussion
    2.4.1 Overview of the data sets
    2.4.2 Analysis of Ordered, Disordered, FB and Disordered around FB regions as populations of sequences
    2.4.3 Further analysis of compositional differences between the Ordered, Disordered, FB and Disordered around FB parsed subsets
    2.4.4 Complex pattern of sequence conservation in FB-containing proteins
    2.4.5 Sampling analysis of parsed subsets
    2.4.6 A possible guidance mechanism during FB folding-on-binding with protein interaction partners
  2.5 Concluding remarks
  2.6 References
  2.7 Connecting Text for Chapter 2 to Chapter 3
3 Discerning evolutionary trends in post-translational modification and the effect of intrinsic disorder: Analysis of methylation, acetylation and ubiquitination sites in human proteins
  3.1 Abstract
  3.2 Introduction
  3.3 Methods
    3.3.1 PTM Datasets
    3.3.2 Eukaryotic proteomes
    3.3.3 Sequence analysis
    3.3.4 Identification of ordered and disordered regions in proteins
    3.3.5 Conservation & statistical analysis
  3.4 Results and Discussion
    3.4.1 Distribution of MAU sites in ordered and disordered regions
    3.4.2 FB regions as display areas for PTMs
    3.4.3 PTMs are depleted in homopeptides and prion-like proteins
    3.4.4 Evolutionary behaviour of MAU sites at eleven evolutionary levels
    3.4.5 Evidence for methylation as a driver of lysine conservation during eukaryotic evolution, and for the emergence of new lysine methylation sites
    3.4.6 Arginine methylation conservation is highly favoured in ordered regions across human evolutionary descent in eukaryotes
    3.4.7 Human acetylated lysines are favoured for significant conservation in disordered regions rather than in ordered regions across eukaryote evolution
    3.4.8 Ubiquitination-site residue conservation is favoured in disordered regions of eukaryotic proteins
    3.4.9 Conservation signals for MAU sites in Histones
    3.4.10 Methylation site lysine residues in the disordered regions of linker H1 and H3 variants are conserved as far back as mammals
    3.4.11 Ubiquitination sites in H2A and H3 variants in mammalian histones
    3.4.12 Sites with multiple MAU PTMs
    3.4.13 Ubiquitination is a major driver of conservation of lysines in folding-on-binding (FB) regions
    3.4.14 Functional trends in MAU-site containing proteins
  3.5 Concluding remarks
  3.6 References
Chapter IV
4 Discussion and Conclusion
  4.1 Discussion
  4.2 Conclusion
  4.3 References
APPENDIX A
  Supplementary Data for Chapter III
List of Figures
Figure 1.1 Structure of human calcineurin heterodimer
Figure 1.2 Proposed structure model of Hrk and its binding mechanism
Figure 1.3 Structural polymorphism of disordered regions
Figure 1.4 Post-translational modifications induce structural changes in IDRs
Figure 2.1 Pipeline of the analysis performed
Figure 2.2 Example alignment of a parsed protein
Figure 2.3 Analysis of the four region types as populations of sequences
Figure 2.4 Analysis of the four region types as populations of sequences
Figure 2.5 Trends in composition and conservation for the four parsed region types
Figure 2.6 Comparison of the overall amino-acid composition of the four region types
Figure 3.1 Overview of the number of methylation, acetylation and ubiquitination sites and the coincidence of different MAU types at the same residues in ordered and disordered regions
Figure 3.2 Percentage distribution of MAU and phosphorylation sites in ordered and disordered regions of human proteins
Figure 3.3 Distribution of PTM sites in ordered and disordered regions of human proteins for various subsets of the data
Figure 3.4 Organismal phylogeny and pipeline
Figure 3.5 Example of a protein with methylation, acetylation and ubiquitination sites in ordered and disordered regions
Figure 3.6 Summary of significantly enriched conserved MAU sites in ordered and disordered regions at 11 evolutionary clades
SUPPLEMENTARY FIGURES
Figure S3.1 Conservation of lysine methylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level
Figure S3.2 Conservation of arginine methylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level
Figure S3.3 Conservation of lysine acetylation sites and new conserved sites in ordered and disordered regions at each eukaryotic level
Figure S3.4 Conservation of lysine ubiquitination sites and new conserved sites in ordered and disordered regions at each eukaryotic level
Figure S3.5 Conservation of MAU sites in ordered and disordered regions across all eukaryotic clades
Figure S3.6 Conservation of MAU sites as other MAU residue type in ordered and disordered regions across all eukaryotic clades
Figure S3.7 Conservation of MAU sites filtered with ZORRO program in ordered and disordered regions across all eukaryotic clades
Figure S3.8 Conservation of MAU sites in non-histone proteins in ordered and disordered regions across all eukaryotic clades
Figure S3.9 Conservation of new MAU sites in 'old' proteins in ordered and disordered regions across all eukaryotic clades
Figure S3.10 Conservation of MAU sites in ordered and disordered regions of histone proteins across all eukaryotic clades
Figure S3.11 Conservation of MAU sites as other MAU residue type in ordered and disordered regions of histone proteins across all eukaryotic clades
Figure S3.12 Conservation of sites with multiple MAUs in ordered and disordered regions across all eukaryotic clades
Figure S3.13 Conservation of sites with MAUs in FB (treated as a sample of O and DO) regions across all eukaryotic clades
Figure S3.14 Conservation of MAU sites in disordered regions across all eukaryotic clades, with sequences aligned using the KMAD alignment tool
Figure S3.15 Conservation of MAU sites in disordered regions (predicted by the IUPRED tool) across all eukaryotic clades
Figure S3.16 Gene Ontology category enrichments at different evolutionary levels
LIST OF TABLES
Table 2.1 Comparison of the hydrophobicities of the parsed subsets
Table 2.2 Mean hydrophobicity values of the four region types
Table 2.3 Comparison of the net charges of the parsed subsets
Table 2.4 Mean net-charge values of the parsed subsets
Table 2.5 Comparison of the conservation scores of the parsed subsets
Table 2.6 Mean conservation score values of the parsed region types
Table 2.7 FB set as sample of total ordered and disordered sets
Table 3.1 Enrichment of ‘multiple-MAU’ sites in FB (treated as a sample of either ordered (O) or disordered (DO) regions) across all eukaryotes
Table 3.2 Percentages of human MAU-site residues in ordered and disordered regions that are conserved across all eukaryotes
LIST OF ABBREVIATIONS
4E-BP2: 4E-binding protein 2
APC: adenomatous polyposis coli
BH3: Bcl-2 homology 3
BLASTP: Basic Local Alignment Search Tool for proteins
BRD4: Bromodomain-containing Protein 4
CBP: CREB-binding protein
CD: Circular Dichroism
CH: Charge-Hydropathy
CRD1: Cell-cycle regulatory domain-1
DFB: Disordered around FB
DISOPRED: Disorder Prediction tool
DisProt: Database of Protein Disorder
DO: Disordered
ELM: Eukaryotic Linear Motif
FB: Folding on Binding
GO: Gene Ontology
HIPK2: Homeodomain-interacting protein kinase-2
HPV: Human papillomavirus
Hrk: Harakiri
IDEAL: Intrinsically disordered proteins with extensive annotation and literature
IDP: Intrinsically Disordered Proteins
IDR: Intrinsically Disordered Regions
IQR: Inter-quartile range
IUPRED: Prediction of intrinsically unstructured regions
MAU: Methylation, Acetylation and Ubiquitination
MoRFs: Molecular Recognition Features
MSA: Multiple sequence alignments
NMD: nonsense-mediated decay
NMR: Nuclear Magnetic Resonance
NUPR1: Nuclear protein 1
O: Ordered
PDB: Protein Data Bank
PDZ: postsynaptic protein PSD-95/SAP90, Drosophila septate junction protein Discs-large, tight junction protein ZO-1
PTM: post-translational modification
SLiMs: Short Linear Motifs
UPF1: Up-frameshift 1
WASP: Wiskott-Aldrich Syndrome Protein
CHAPTER I
1 Introduction
1.1 Intrinsically Disordered Regions (IDRs)
1.1.1 Brief history of Protein structure and function
The classical protein structure-function paradigm is derived from the experiments
favouring the view that the three-dimensional structure of a protein is the prerequisite for
its function. In 1894, Fischer [1] proposed the lock and key model explaining that enzymes exhibit complementary geometric shapes to their substrates [2]. This model demonstrates that the enzyme and substrate fit together like a lock and key. Therefore, the protein and ligand interaction is facilitated by the complementary structures in the binding site of the protein [2]. To state the importance of protein’s structure-function paradigm, in 1906 Fischer wrote [2] “Since the proteins participate in one way or another in all chemical processes in the living organism, one may expect highly significant
information for biological chemistry from the elucidation of their structure and their
transformations.” Indeed, the protein structure is crucial for understanding its function and
the fact that proteins unfolding under denaturing conditions lose both their structure and
their function has supported this paradigm [2, 3].
Later, however, two major models, 'configurational adaptability' by Karush [4] and
'induced fit' by Koshland [5], proposed that the active sites or whole domains of enzymes undergo conformational changes to facilitate the interaction of specific functional groups with the substrate, and that these conformational changes are crucial for function [2, 3]. Several proteins have been proposed to exhibit an induced-fit mechanism [6-8]. For example, the neutron and X-ray structures of α-cyclodextrin, a model
macromolecule, in complex with water and other substrates demonstrated that
changes in the conformation and hydrogen-bonding energy of α-cyclodextrin mediate
complex formation [9, 10]. Furthermore, the conformational diversity of the Immunoglobulin
E (IgE) antibody SPE7 enables its binding to unrelated antigens [11]. Despite these observations, the view that a defined structure is the prerequisite for protein function persisted.
1.1.2 Intrinsically Disordered Regions/Proteins
A well-defined three-dimensional structure of a protein, often referred to as the
ordered state, determines its biological function. The crystal structures of numerous proteins,
however, are reported to contain missing regions. These were often attributed to protein purification errors
or to failure to solve the phase problem, but the most common reason was
the failure of the unobserved atoms or residues to scatter X-rays coherently [3, 12-15]. Many of
these regions were later identified as genuinely disordered regions or local disorder
[3]. In 1978, both X-ray crystallography and nuclear magnetic resonance (NMR) studies
revealed functional disorder in proteins. More precisely, NMR determined the
structure of the functional yet disordered tail of histone H5 [3]. Solution-state NMR
spectroscopy has since been used to characterize disordered regions and proteins, and
numerous NMR studies have characterized various proteins with functional disordered
regions [3]. These are referred to as intrinsically disordered/unstructured regions or
proteins (IDRs or IDPs) [3, 16, 17]. One of the earlier examples of IDRs is calcineurin, a
serine/threonine phosphatase and abundant calmodulin-binding protein in the brain.
Calcineurin plays a vital role in T cell activation. The calmodulin binding region in
calcineurin subunit A is situated in a long disordered region. The binding of calcium to
calmodulin activates calcineurin’s phosphatase activity (Figure 1.1A-B) [18, 19].
Figure 1.1 Structure of human calcineurin heterodimer
(A) Linear representation of the catalytic (A subunit) and regulatory (B subunit) subunits in calcineurin. Calcineurin A subunit (in red) with the N-terminal catalytic domain, a calcineurin B-binding segment, a calmodulin-binding segment and an autoinhibitory peptide are highlighted. Calcineurin B subunit (in green) with four EF hands that bind four Ca2+ ions are also indicated. Figure from [18] (Copyright Confirmation Number: 4326820691899).
(B) Three-dimensional structure of calcineurin showing the disordered region in A subunit and ordered regions. The experimentally determined structure of A subunit (yellow), B subunit (yellow surface) and autoinhibitory domain (green) are demonstrated. The long disordered tail (red) and the calmodulin binding site (helix in red) of A subunit are also demonstrated. Figure from [19] (Copyright Confirmation Number: 4392560289396).
IDRs are described as extremely flexible with no defined secondary structure under physiological conditions [20]. Several studies have reported that the lack of structure of disordered regions confers several functional advantages [3, 12, 15, 21-23]. One such advantage is that disordered regions can bind to various targets in different conformations and undergo disorder-to-order transition.
1.1.3 Coupled folding and binding of IDRs
Many disordered regions can exhibit a well-defined structure when they bind to a
specific partner molecule, and they remain disordered in the absence of their interacting
molecules [3, 13-15, 24]. It has been hypothesized that a disordered region first binds its target weakly and nonspecifically and then folds as it approaches the binding site; this has been described as the ‘fly-casting’ mechanism [25]. This mechanism is observed in the assembly of the nonsense-mediated decay (NMD) complex, where the disordered C-terminal domain of UPF2 (up-frameshift 2) initiates complex formation by binding to UPF1 (up-frameshift 1) [26]. Furthermore, either short regions within longer disordered regions or entire disordered regions can undergo disorder-to-order transition [24]. These regions are referred to here as ‘folding on binding’ (FB) regions.
For example, the interaction between the FB region of adenomatous polyposis coli (APC), a tumour suppressor protein, and axin promotes complex formation for the phosphorylation of beta-catenin [27]. Among membrane proteins, Harakiri (Hrk), a member of the BH3-only (Bcl-2 homology 3) subfamily of the Bcl-2 (B-cell lymphoma 2) family, induces cell death [28]. Binding of the disordered cytosolic domain of Hrk to the prosurvival Bcl-2 and Bcl-xL (B-cell lymphoma-extra large) members allows the cytosolic domain to adopt an α-helical conformation, and it has been suggested that the disordered domain could increase the capture radius for prosurvival partners to mediate binding (Figure 1.2) [28].
Figure 1.2 Proposed structure model of Hrk and its binding mechanism. The arrow shows the disorder-to-order transition of the cytosolic domain of Hrk (green ribbon) upon binding to Bcl-2 or Bcl-xL (red ribbon) protein. The hydrophobic (light yellow surface) and hydrophilic (dark yellow surface) parts of the lipid bilayer are also highlighted. Figure from [28]. (Copyright permission to reuse this figure is not needed).
The coupled folding and binding process of IDRs facilitates high specificity and low affinity towards a partner molecule [3, 13-15, 24]. Therefore, IDRs play prominent roles in signalling processes. The highly specific binding mediates the initiation of signalling pathways and the low affinity facilitates rapid dissociation of partner molecules [29, 30]. An example of this short-lived association is the interaction between p27 and cyclin-CDK during the cell cycle [29, 30]. More examples of coupled folding and binding in IDRs are discussed below.
The kinase inducible activation domain (pKID) of CREB is disordered in the free state [31]. NMR titrations and 15N relaxation dispersion studies revealed that phosphorylated pKID forms two α-helices when bound to the KIX domain of the transcriptional coactivator CREB binding protein (CBP) [31, 32]. Interestingly, small changes in the entropy or enthalpy of binding [29, 33] caused by post-translational modifications have been suggested to facilitate the transition from disordered to ordered conformations and to change protein charge [15, 34, 35]. For example, phosphorylation of Ser684, Ser686 and Ser692 in E-cadherin stabilizes the cadherin structure by promoting additional hydrogen bond interactions with beta-catenin [15]. Another example is that serine phosphorylation in the calmodulin domain of human p4.1 enables a 17-residue peptide to adopt an alpha-helical conformation [36]. Proteins involved in coupled folding and binding exhibit many functional advantages; however, the evolution and amino acid composition of these regions and of the disordered regions around them are not well studied.
1.1.4 Sequence composition of IDRs
Compositionally biased or repetitive regions and low-complexity regions are often associated with IDRs, which show a preference for charged, polar and structure-breaking amino acids such as alanine, arginine, glutamic acid, glutamine, glycine, lysine, proline and serine, termed “disorder-promoting amino acids”. On the other hand, amino acids such as asparagine, cysteine, isoleucine, leucine, methionine, phenylalanine, tryptophan, tyrosine and valine are common in ordered regions and are termed “order-promoting amino acids” [19, 37, 38]. Thus, sequence composition carries information about protein structure. In 2000, Uversky et al. developed a charge-hydropathy (CH) plot to distinguish ordered and disordered regions based on net charge and hydrophobicity. In this plot, a linear boundary separates the ordered and disordered regions, and IDRs occupy the area of relatively high net charge and low hydrophobicity [13].
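As an illustration of the CH-plot idea, the following minimal Python sketch places a sequence on the charge-hydropathy plane and reports which side of a linear boundary it falls on. The details are assumptions that are not stated in this chapter: hydropathy is taken from the Kyte-Doolittle scale rescaled to 0-1, net charge counts Lys/Arg as +1 and Asp/Glu as -1, and the boundary coefficients (slope 2.785, intercept -1.151) are the values commonly quoted for the CH plot. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean Kyte-Doolittle hydropathy, rescaled so each residue lies in 0-1."""
    values = [(KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq]
    return sum(values) / len(values)


def mean_net_charge(seq):
    """Absolute net charge per residue (K/R = +1, D/E = -1; assumption)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return abs(positive - negative) / len(seq)


def ch_plot_class(seq, slope=2.785, intercept=-1.151):
    """Classify a sequence by its side of the assumed linear boundary.

    Points with charge above slope * hydropathy + intercept fall in the
    high-charge, low-hydropathy area and are labelled 'disordered'.
    """
    h = mean_hydropathy(seq)
    r = mean_net_charge(seq)
    return "disordered" if r > slope * h + intercept else "ordered"


if __name__ == "__main__":
    for toy in ("EEEEKKKKEEEE", "LLLLIIIIVVVV"):
        print(toy, ch_plot_class(toy))
```

The two toy sequences are artificial extremes (highly charged versus highly hydrophobic) chosen only to show the two sides of the boundary; real proteins lie much closer to the line.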
The amino acid composition of IDRs is more highly conserved among orthologs than their sequence [39, 40]. The distinct sequence composition of IDRs can influence their conformational stability and functions [29]. For instance, computational studies of sequence composition have shown that charged and hydrophobic residues dictate the conformation of IDRs and regulate cell cycle [41].
The relationship between amino acid composition and intrinsic disorder in histones has been reviewed [42, 43]. In histones, the sequence composition of the terminal domains and disordered regions determines their role in molecular recognition [42, 43]. The disordered C-terminal domain (CTD) of linker histone, which is rich in Ala, Lys, and Pro, is necessary to stabilize chromatin fibers. Many higher eukaryotes have six somatic linker histone isoforms with distinct primary sequences in their CTDs. However, the amino acid composition of the CTDs is similar, with ~40% Lys, ~20-35% Ala, and ~15% Pro.
Furthermore, the composition of CTDs has been reported to play key roles in DNA binding and protein-protein interactions. In the case of core histones, the disordered N-terminal domains of H2A and H4 share a similar composition, whereas those of H2B and H3 show a composition similar to the CTD of linker histone [42]. In addition, the amino acid composition can mediate different types of functional IDRs [42, 44].
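The observation that CTDs differ in primary sequence but share a similar composition can be made concrete with a short Python sketch. It computes the percent composition of a sequence and a simple distance between two compositions. The two toy sequences and the function names are hypothetical, and the distance measure (sum of absolute percentage differences) is an assumption chosen for simplicity, not a method taken from the cited studies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_percent(seq):
    """Percent of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    return {aa: 100.0 * counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def composition_distance(seq_a, seq_b):
    """Sum of absolute differences in percent composition (illustrative)."""
    comp_a = composition_percent(seq_a)
    comp_b = composition_percent(seq_b)
    return sum(abs(comp_a[aa] - comp_b[aa]) for aa in AMINO_ACIDS)


if __name__ == "__main__":
    # Two made-up Lys/Ala/Pro-rich segments: different order, same composition.
    first = "KKAPKAKPAK"
    second = "AKPKKAKKPA"
    comp = composition_percent(first)
    print({aa: comp[aa] for aa in "KAP"})
    print(composition_distance(first, second))
```

Here the two segments differ at most aligned positions, yet their composition distance is zero, which is the pattern described above for linker histone CTDs.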
The distinct amino acid composition of IDRs has been used to develop several computational tools to predict intrinsic disorder. Disorder prediction tools based on the flexibility, hydropathy, and charge of the amino acids are discussed in a later section (Section 1.4.2).
1.1.5 Characterization of intrinsically disordered regions/proteins
The structural differences between ordered and disordered regions have led to the rapid development of a variety of computational and experimental methods. Although regions with missing electron density in X-ray crystallographic structures have been reported to indicate disordered regions, missing regions may also result from protein purification errors or crystal defects [3]. NMR spectroscopy and circular dichroism (CD) spectroscopy are among the experimental methods used to observe the conformational propensities of disordered regions [45, 46].
1.1.6 Experimental determination of IDRs
Structure determination of disordered regions using NMR has advantages over X-ray diffraction because NMR does not require crystallization and can provide conformational ensembles for disordered regions [3, 46]. In addition, NMR can show disordered regions within ordered regions, regions that fold upon binding to other proteins ('folding-on-binding' regions), and completely disordered proteins in solution [47]. For example, the disordered NH2-terminal region of p21 in the unbound state and the ordered conformation of the same region when bound to Cdk2 have been observed using NMR [47]. Likewise, the structural and functional details of the disordered regions and structured domains of CBP (CREB-binding protein) and its homologue p300 have been determined by NMR [46].
CD spectroscopy is also used to determine the presence of ordered, molten globule and random coil states of a protein. However, it cannot provide clear structural information on both ordered and disordered regions [48].
1.1.7 Computational tools to predict intrinsically disordered regions and proteins
Limitations of, and differences between, experimental methods have led to the development of numerous prediction tools to identify disordered regions in proteins [12, 49]. IDRs are characterized by amino acid compositional bias, low sequence complexity, low hydrophobicity and somewhat higher net charge [13, 17, 22]. Initially, disordered regions were predicted by identifying missing coordinates in crystal structures [49]. Later, the differences between the amino acid composition of ordered and disordered regions were used to predict disordered regions from the primary sequence [17, 49]. Several disorder predictors have been developed, such as: ANCHOR [50], DisEMBL [51], DISOPRED [52] and DISOPRED3 [53], DisPredict [54], DISpro [55], DNdisorder [56], FoldIndex [57], FoldUnfold [58], GlobPlot [59], iPDA [60], IUPred [61], PONDR® predictors [62-64], PrDOS [65], RONN [66], SPINE-D [67], Spritz [68], CSpritz [69], ESpritz [70] and RAPID [71]. Although the past decade has seen an increasing number of predictors, the gap between the number of disordered regions with structural data in the Protein Data Bank (PDB) and the number that exist in nature is immense. Moreover, an earlier analysis of proteins deposited in the major sequence databases showed the abundance of disordered regions in proteins [22]. Therefore, predicting disordered regions plays a significant role in understanding their function.
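One of the sequence features listed above, low sequence complexity, can be estimated directly from the primary sequence. The Python sketch below scores sliding windows by Shannon entropy and reports windows that fall below a cutoff. It is not the algorithm of any predictor named above; the window length (12) and entropy threshold (2.2 bits) are illustrative assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue distribution in a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=2.2):
    """Start indices (0-based) of windows whose entropy is below threshold.

    Sequences shorter than the window yield an empty list.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) < threshold
    ]


if __name__ == "__main__":
    toy = "MKVLAGHTWQERSSSSSSSSSSSSDFYCNIP"  # made-up sequence with a Ser run
    print(low_complexity_windows(toy))
```

Low complexity alone does not establish disorder; the predictors cited above combine several features, and this sketch only shows how one such feature can be computed.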
1.2 Functional advantages of IDRs
The lack of a well-defined structure provides numerous functional advantages to
disordered regions. Bioinformatical analysis of complete proteomes revealed that IDRs in
eukaryotes are abundant in regulatory and signaling proteins [12, 14, 22]. Several studies
have indicated a substantial number of intrinsically disordered regions in complex
organisms and much greater percentages of proteins with predicted disorder are reported
in eukaryotes (35 to 51%) in comparison to bacteria (6 to 33%) and archaea (9 to 37%)
[12, 51, 72-76]. Furthermore, a study on structural disorder in eukaryotes reported that protists (single-celled eukaryotes) have high levels of predicted structural disorder [77-79]. Disordered regions are abundant in parasites with complex life cycles and a host-changing pathogenic lifestyle, such as apicomplexans and the genus Trypanosoma, and are involved in host-parasite interactions [78]. Therefore, the functional importance of disordered regions is emphasized by their abundance in a wide range of organisms.
Furthermore, in many eukaryotic proteins, the globular domains are connected by linker regions. These regions are enriched in polar, uncharged and proline residues, although high sequence conservation and an increased number of charged and hydrophobic residues have occasionally been observed. For example, the cell-cycle regulatory domain-1 (CRD1), enriched in charged residues, is located between the KIX domain and the bromodomain in CBP, and sumoylation in the CRD1 motif is necessary for transcription repression.
Previously, Dunker et al. reported that disordered regions of 30 or more residues are mainly involved in cell signaling, transcription and translation regulation, protein-nucleic acid binding and protein-protein binding [22]. The structural polymorphism of disordered regions enables interaction with different target molecules and the adoption of different structures depending on the binding partner [15]. For example, the GTPase binding domain of Wiskott-Aldrich syndrome protein (WASP) exhibits different folded structures when it binds to Cdc42 GTPase and to the VCA inhibitory peptide (Figure 1.3A) [15]. Another interesting example is the interaction of glycogen synthase kinase 3β (GSK3β) with FRAT1 and axin. Both FRAT1 and axin bind at the C-terminal domain of GSK3β and acquire different conformations (Figure 1.3B) [37]. In addition, the nuclear coactivator binding domain of the CBP protein exhibits two different structures when it binds to the activation domain of p160 [80]. Furthermore, thymosin β4, a small actin-binding protein, is reported to have disordered regions in solution and is involved in the recognition of several target molecules [81]. Other major functions of IDRs include the housing of protein phosphorylation and other post-translational modifications (PTMs). IDRs are reported to provide enhanced accessibility of PTM sites for regulation [22]. From the above studies, it is clear that the discovery of IDRs has created a new paradigm in the protein structure/function relationship, which augments the paradigm that had been accepted for the past 60 years. Hence, exhaustive bioinformatical studies on the protein sequences of IDRs will illuminate the evolution and function of these relatively less characterized protein types. More detailed functions of IDRs are discussed below.
Figure 1.3 Structural polymorphism of disordered regions. (A) The GTPase binding domain of WASP showing different folded structures when interacting with (a) Cdc42 GTPase and (b) VCA peptide. Figure from [15]. (Copyright Confirmation Number: 4326830865336). (B) Polymorphism in the bound state of GSK3β. The binding of axin and FRAT proteins to GSK3β demonstrates different conformations and interactions. Figure from [37] (Copyright Order Number: 4326831050111).
1.2.1 Molecular Recognition
As mentioned earlier, disorder-to-order state transition is a common process in
disordered regions. Disorder-mediated molecular recognition requires low binding affinity
and often involves interaction with a large number of diverse partner molecules. DNA
recognition and transcription activation are mediated by disorder-to-order state transition.
For instance, the disordered high mobility group (HMG) domain in lymphoid enhancer-binding factor (LEF-1) binds to DNA and regulates the T cell receptor-α gene enhancer.
The analysis of IDRs in transmembrane proteins revealed that loop regions rich in
positively charged amino acids provide structural stability and facilitate regulatory
interactions whereas IDRs with a deficit of positive residues in terminal regions mediate
protein-protein interactions (e.g., receptor clustering or recruiting signaling partners) [82].
An earlier analysis of protein complexes deposited in the PDB has shown short peptides (10-70 residues) bound to globular proteins [83]. These peptides are located in long IDRs and adopt α-helix, β-sheet or irregular secondary structure upon binding to target molecules. They have been called molecular recognition features (MoRFs), and are primarily associated with molecular recognition and protein-protein interactions in signaling events [83, 84]. Furthermore, predicted phosphorylation sites are reported in one third of MoRFs [83], and sites with multiple PTMs show a strong preference for MoRFs [85].
1.2.2 Post-translational modifications in disordered regions
Post-translational modifications play vital roles in signaling, maturation, folding of
newly synthesized proteins and protein interaction networks [86, 87]. Additionally, PTMs
in disordered regions have been reported to regulate protein-protein interactions in
transcriptional and developmental processes [85]. Interestingly, PTMs can facilitate
disorder-to-order transition by changing the physical and chemical properties of IDRs
(Figure 1.4) [88]. It has also been hypothesized that PTMs are primarily present in the disordered regions due to their high accessibility [88]. Indeed, histones undergo several
types of PTMs such as acetylation, methylation, phosphorylation, ubiquitination,
sumoylation and ADP-ribosylation and these modifications are significant for nucleosome
stability, transcription activation, gene repression and offer a distinct function to chromatin
[89, 90]. Acetylation and methylation in the disordered N-terminal tail of core histones
facilitate specific protein-protein interactions and induce coupled folding and binding of
the N-terminal domain [91].
Figure 1.4 Post-translational modifications induce structural changes in IDRs. Figure from [88] (Copyright permission to reuse this figure is not needed).
In DNA binding proteins, acetylation and phosphorylation can modulate their specific and non-specific interactions with DNA by altering the charge of the modified residues
[92]. In addition, the intrinsically disordered N- and C-terminal domains of p53 are subjected to various PTMs including acetylation, methylation, glycosylation, phosphorylation, poly-ribosylation, O-GlcNAcylation, sumoylation and ubiquitination [93]. These modifications regulate the interaction between p53 and its partner molecules. For example, phosphorylation in the transcription activation domain (TAD) of p53 can increase its binding affinity to CH3 and TAZ1 (transcriptional adapter zinc-binding). On the other hand, phosphorylation at Ser15, Thr18, and Ser20 can inhibit Mdm2 binding [93]. In the case of membrane proteins, myristoylation has a significant role in membrane targeting. The disordered N-terminus of Hrk, a BH3-only member of the Bcl-2 family, is predicted to contain an N-myristoylation site at Gly 63 [28].
Many studies have shown that PTMs such as acetylation, fatty acylation, methylation, glycosylation, phosphorylation, and ubiquitination occur predominantly in disordered regions. Specifically, phosphorylation is significant for signaling in eukaryotic proteins, and nearly one-third of eukaryotic proteins undergo phosphorylation [94]. Recently, an analysis of IDRs in 504 kinases showed that 83% of kinases have at least one disordered region [95, 96], and each kinase group is categorised based on its differential evolution [97]. In addition, it has been reported that phosphorylation shows a preference for IDRs in both animals and plants [3, 34, 81, 91, 98].
The amino acids adjacent to phosphorylation sites generally have properties similar to those of residues in disordered regions. Therefore, a web-based tool has been developed to identify phosphorylation sites in disordered regions [34]. However, a correlation study between protein disorder and PTMs has shown contradicting results [99]. Hence, IDRs may not preferentially exhibit sites for all types of PTMs.
Evolutionary studies of regulatory enzymes and modification sites have revealed that PTMs such as acetylation, glycosylation and phosphorylation are found in all domains of life [100]. In addition, the non-enzymatic acetylation of lysine residues and phosphorylation further suggest their ancient origin [100]. In 2012, Hagai and co-workers studied the rate of evolutionary change and the formation of ubiquitination sites. They reported that ubiquitination sites in mammalian proteins are more conserved than unmodified residues and that a shift in the location of ubiquitination sites may be compensated by residues in the disordered regions [101]. Furthermore, bioinformatics analysis revealed the emergence of 281 novel ubiquitination sites in the human lineage during primate evolution [102]. Similarly, 37 human-specific phosphorylation sites have also been identified [103]. Recent advances in sequencing technologies have facilitated the large-scale analysis of PTMs and their evolution. However, PTMs in many proteins still need to be studied to understand their function and to obtain a broader view of the origin of PTMs in different clades. Moreover, a large-scale analysis comparing the conservation of PTMs in ordered and disordered regions is necessary to understand their functional role in different species. This thesis focuses on the bioinformatics analysis of the evolution and emergence of PTMs in disordered regions.
1.3 Evolution of IDRs
The evolution of disordered regions is different from that of ordered regions [39, 96, 104-109]. An early comparative study on the evolution of IDRs and ordered proteins showed that the disordered regions evolve faster than the ordered regions in 19 out of 26 families [110]. However, a smaller group of IDRs has been shown to evolve slowly [110]. The rapidly evolving disordered regions are involved in binding sites for proteins, DNA and RNA and in flexible linkers, while the slowly evolving regions are involved in DNA binding. In addition, disordered proteins tend to undergo more amino acid replacements than ordered proteins [110].
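A comparison of this kind can be sketched in Python as the fraction of substituted positions in a pairwise alignment, computed separately for columns labelled ordered and disordered. This is a deliberately simple p-distance, not the method of the cited study; the aligned toy sequences, the per-column disorder labels and the function name are hypothetical, and columns containing a gap in either sequence are skipped by assumption.

```python
def substitution_fraction(aligned_a, aligned_b, disorder_mask):
    """Fraction of differing residues in ordered and disordered columns.

    aligned_a, aligned_b: aligned sequences of equal length ('-' for gaps).
    disorder_mask: one boolean per alignment column (True = disordered).
    Returns a dict; a class with no compared columns maps to None.
    """
    if not (len(aligned_a) == len(aligned_b) == len(disorder_mask)):
        raise ValueError("alignment rows and mask must have equal length")

    # For each class: [number of substitutions, number of compared columns].
    totals = {"ordered": [0, 0], "disordered": [0, 0]}
    for a, b, is_disordered in zip(aligned_a, aligned_b, disorder_mask):
        if a == "-" or b == "-":
            continue  # gapped columns are ignored in this sketch
        key = "disordered" if is_disordered else "ordered"
        totals[key][1] += 1
        if a != b:
            totals[key][0] += 1

    return {
        key: (subs / compared if compared else None)
        for key, (subs, compared) in totals.items()
    }


if __name__ == "__main__":
    seq_a = "MKTLLVAGSE"
    seq_b = "MKTLLVSGPD"
    mask = [False] * 6 + [True] * 4  # last four columns labelled disordered
    print(substitution_fraction(seq_a, seq_b, mask))
```

In practice, evolutionary rates are estimated from multiple alignments with substitution models rather than from raw pairwise differences; the sketch only illustrates how the two region types are separated before comparison.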
The rewiring of protein interactions during evolution suggests that disordered interactions are less conserved than ordered interactions [111]. IDRs show a distinct pattern of point accepted mutations and higher rates of insertion and deletion. The aromatic amino acids have a lower substitution rate than the charged amino acids; this also contributes to the differences in evolutionary rate between ordered and disordered regions [111]. In addition, a recent study suggested that evolutionarily young proteins are enriched in disordered regions and can become ordered over evolutionary time [112].
In 2018, an evolutionary analysis of human proteins with IDRs showed that IDRs are more frequent targets of positive selection than other regions of the protein [113]. Furthermore, an NMR analysis of the dynamics of the linker domain in the RPA70 protein found that backbone flexibility and length are maintained across diverged species, yet the amino acid sequences showed no conserved regions [114]. However, a recent study has suggested that conserved IDRs within a single-domain protein may provide multiple functions that are typically observed in proteins with multiple domains [115]. Moreover, many studies have used different approaches to study the evolution and sequence conservation of disordered regions [106, 116-120]. Thus, understanding the natural selection and domain evolution of IDRs is important to decipher their functions. Evolutionary studies on disordered proteins will provide insight into the degree of natural selection on specific parts of intrinsically disordered proteins and how these relate to each other during evolution.
1.4 Classification of IDRs
The structural and amino acid sequence differences between ordered and disordered regions provide many ways to classify IDRs [3, 121-125]. Furthermore, previous studies have shown different approaches to identify different types of IDRs [126, 127]. In 1997, Romero and coworkers classified IDRs into three types: short (7-21), medium (22-44) and long (>45 residues) [123, 124]. Sequence analyses of short (≤30 residues) and long (>30) disordered regions show that short disordered regions are enriched in glycine and aspartic acid and long disordered regions are enriched in lysine, glutamic acid and proline [64, 128-130], and disordered regions of different lengths exhibit different types of functions [44, 131]. For instance, short disordered regions may contain short linkers, MoRFs or short linear motifs (SLiMs) of 3-10 residues involved in partner recognition and post-translational modification [44, 132-135]. On the other hand, long disordered regions may contain multiple motifs or domains [44]. Therefore, several studies have investigated the functions of IDRs based on their amino acid sequence length [75, 110, 132, 136, 137].
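A length-based classification of this kind can be sketched in Python, starting from per-residue disorder calls. The sketch extracts consecutive runs of disordered residues and labels each run as short (≤30 residues) or long (>30), following the cutoff quoted above. The boolean per-residue input is an assumption (it could come from any disorder predictor), and the function names are hypothetical.

```python
from itertools import groupby


def disordered_segments(mask):
    """Return (start, end) pairs, 0-based and end-exclusive, for runs of True."""
    segments = []
    position = 0
    for is_disordered, run in groupby(mask):
        length = len(list(run))
        if is_disordered:
            segments.append((position, position + length))
        position += length
    return segments


def length_class(length, cutoff=30):
    """Label a disordered segment by length: 'short' (<= cutoff) or 'long'."""
    return "short" if length <= cutoff else "long"


if __name__ == "__main__":
    # Made-up per-residue calls: a 10-residue and a 40-residue disordered run.
    mask = [False] * 5 + [True] * 10 + [False] * 3 + [True] * 40
    for start, end in disordered_segments(mask):
        print(start, end, length_class(end - start))
```

Other cutoffs, such as the three-class scheme of Romero and coworkers, could be substituted by changing `length_class`.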
Moreover, IDRs can also be classified in terms of their structure, function, functional features, evolution. regulation, protein interactions and biophysical properties[44, 138]. In specific, functional features such as MoRFs and (SLiMs) are identified as partner binding
22
regions. MoRFs can undergo disorder-to-order transition upon binding to a target
molecule and SLiMs exhibit sequence conservation and provide sites for post-
translational modifications [139]. For instance, phosphorylation in the cyclin dependent
kinase binding motif is involved in regulating cell cycle progression[140]. In addition,
MoRFs can be predicted using several predictors including alpha-MoRFpred [73, 141],
MoRFpred [142], MFSPSSMpred [143], MoRFCHiBi [144], MoRFchibi SYSTEM [145],
fMoRFpred [146], retro-MoRF [147] and OPAL [148]. Predictors of SLiMs include
SLiMpred [149], PepBindPred [150], SLiMDisc [151], HH-MOTiF [152] and SLiMSearch
[153]. In addition, manually curated SLiMs can be found in the Eukaryotic Linear Motif
(ELM) resource [154]. Both MoRFs and SLiMs can be identified based on their sequence conservation [143, 155]. For example, SLiMs that interact with SH2, SH3 and Ser/Thr kinase domains
are conserved in disordered regions [156].
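The idea of locating candidate MoRFs or SLiMs through sequence conservation can be sketched as follows. This is a simplified illustration and not the method of any predictor cited above: it scores each column of a multiple sequence alignment by one minus its normalised Shannon entropy and then finds the most conserved window of a chosen width. The function names, the handling of gaps (ignored) and the toy alignment are assumptions made for this example.

```python
import math
from collections import Counter


def column_conservation(column):
    """Return 1 - normalised Shannon entropy of the non-gap residues.

    1.0 means a fully conserved column; all-gap columns score 0.0.
    Entropy is normalised by log2(20), the maximum for 20 amino acids.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(20)


def alignment_conservation(aligned_sequences):
    """Score every column of an alignment given as equal-length strings."""
    if len({len(seq) for seq in aligned_sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_conservation(column) for column in zip(*aligned_sequences)]


def most_conserved_window(scores, width):
    """Return (start, mean score) of the highest-scoring window of `width`."""
    if width <= 0 or width > len(scores):
        raise ValueError("width must be between 1 and the number of columns")
    best_start = max(
        range(len(scores) - width + 1),
        key=lambda i: sum(scores[i:i + width]),
    )
    return best_start, sum(scores[best_start:best_start + width]) / width


if __name__ == "__main__":
    # Hypothetical toy alignment with a conserved "SP" pair in variable flanks.
    toy_alignment = ["ATSPEKL", "GQSPDKV", "DNSPTKA"]
    toy_scores = alignment_conservation(toy_alignment)
    print(most_conserved_window(toy_scores, 2))
```

In practice, conservation in disordered regions is usually judged relative to the flanking sequence rather than in absolute terms, so a realistic analysis would compare each window with its local background.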
1.5 Disease associated with IDRs
The abundance of IDRs in eukaryotes suggests that these regions play crucial roles
in normal cellular function; accordingly, the malfunctioning of IDRs is associated with many
diseases. A broad range of diseases originates from the misfolding or unfolding
of certain proteins, and these are referred to as protein misfolding diseases [157]. In 2008, Uversky et
al. introduced the D2 (disorder in disorders) concept to highlight the abundant association
of IDRs with human diseases [158]. Furthermore, impaired interactions with
endogenous factors such as chaperones, intracellular or extracellular matrices, or small
molecules can increase the misfolding propensity of pathogenic proteins [159]. For
example, the aggregation of the IDP alpha-synuclein in the cytoplasm is involved in the pathogenesis of Parkinson's disease, dementia with Lewy bodies, Alzheimer's disease,
Down's syndrome, multiple system atrophy, and neurodegeneration with brain iron accumulation type 1 [160]. These diseases are collectively termed synucleinopathies.
Studies of p53 (a tumour suppressor) and human
papillomavirus (HPV) proteins showed that these proteins contain an
increased amount of intrinsic disorder [160]. The intrinsically disordered regulatory region
near the C-terminus of p53 folds into helical, β-strand, and extended irregular structures
on binding to different protein partners [161]. Misfolding diseases tend to spread to
multiple tissues and to cause damage to multiple organs. Hence, these studies highlight
the pressing need to understand the regulation of IDRs, because altered expression
of IDRs is associated with many diseases.
1.6 Role of IDRs in drug development
Several studies on IDRs associated with diseases and their significant role in cellular
function postulated that IDRs could be potential drug targets for many life-threatening
diseases like cancer and neurological disorders [162-167]. Uversky et al. surmised that a protein-protein interaction between one disordered partner and one structured partner is likely to be a good target for drug discovery [158]. In addition, IDRs can form a helix with a hydrophobic surface that binds in a groove of a structured protein, as observed in the
MoRF dataset referred to above. One such example is the interaction between p53 and Mdm2 (a negative
regulator of p53). The binding site in p53 contains an alpha-helical MoRF, and
Mdm2 provides a groove against which this helix forms [163]. Previous studies have suggested that the binding of small molecules to MoRFs in IDRs stabilises their bound state and inhibits
protein interaction with other partners [168]. Interestingly, a recent study has reported that
the disordered region of nuclear protein 1 (NUPR1) remains disordered upon binding of
fifteen FDA-approved compounds [164].
During the disorder-to-order transition of IDRs, the binding energy is used to
compensate for the loss of the high conformational entropy of the unfolded state. As a consequence, the interaction
between a disordered and a structured partner is typically weaker than the interaction between
two structured proteins, which offers a way to inhibit such interactions with small molecules [163].
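The entropic argument in this paragraph can be written as a standard free-energy decomposition. The symbols below are introduced here for illustration only and do not come from the cited work: the subscript "conf" denotes the conformational entropy lost when the disordered region folds, "other" collects all remaining entropic terms, and c° is the standard-state concentration.

```latex
\begin{align}
\Delta G^{\circ}_{\mathrm{bind}} &= \Delta H^{\circ}_{\mathrm{bind}} - T\,\Delta S^{\circ}_{\mathrm{bind}}, \\
\Delta S^{\circ}_{\mathrm{bind}} &= \Delta S^{\circ}_{\mathrm{conf}} + \Delta S^{\circ}_{\mathrm{other}},
\qquad \Delta S^{\circ}_{\mathrm{conf}} < 0, \\
K_d &= c^{\circ} \exp\!\left(\frac{\Delta G^{\circ}_{\mathrm{bind}}}{RT}\right).
\end{align}
```

Because the conformational entropy change is negative, the term $-T\,\Delta S^{\circ}_{\mathrm{conf}}$ is positive and offsets part of the favourable binding enthalpy. All else being equal, the binding free energy is therefore less negative and $K_d$ is larger than for two pre-folded partners, which corresponds to the weaker interaction described above.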
As mentioned earlier, IDRs enriched in proline-rich motifs might be drug targets
for immune-mediated disorders [162]. In addition, binding of small molecules to the
disordered region of the Myc protein can regulate the over-expression of Myc in cancer [159].
Furthermore, the PDZ domain of Dishevelled (which is up-regulated in some cancers) binds designed peptides that inhibit the Wnt signaling pathway [169].
Recently, a cell-permeable small molecule was shown to displace the BRD4 (bromodomain-containing protein 4) fusion oncoprotein from chromatin and to inhibit cell proliferation in human squamous-cell cancer [170]. In addition, targeting the enzymes that post-translationally modify IDRs could be a possible approach. For instance, inhibition of the Sirt1 deacetylase by phytochemicals, or siRNA-mediated reduction of the level of the kinase HIPK2
(homeodomain-interacting protein kinase-2), has been reported to increase the stability of p53 and
facilitate apoptosis in cancer cells [171]. Interestingly, venetoclax (ABT-199), a drug that mimics the binding of intrinsically disordered BH3 proteins to Bcl-2, has been shown to inhibit the growth of Bcl-2-dependent tumours [172-174].
1.7 Objectives of the Research
Within the past three decades, interest in studying intrinsically disordered regions has grown rapidly. This is mainly due to their distinct structural and functional characteristics, such as coupled folding and binding, and their roles in cell signaling and post-translational modifications. Due to the difficulties in determining the structure of disordered regions, a significant number of bioinformatics approaches have been developed to understand their sequence composition, evolution, structure and functional properties. However, our understanding of the composition and evolutionary behaviour of intrinsically disordered regions remains incomplete. Hence, this thesis presents a bioinformatical analysis of IDRs that specifically aims to:
(i) Analyse the composition of different disordered region types in human FB proteins, and examine evolutionary behaviour of these regions across vertebrate orthologs.
(ii) Analyse the distribution and evolutionary trend of PTMs in ordered, FB and disordered regions across eukaryotic organisms.
Chapter 2 presents a bioinformatical parsing of human folding-on-binding proteins into four
different subsets. This chapter aims to compare the charge, hydrophobicity and evolutionary behaviour of (i) ordered, (ii) other disordered, (iii) folding-on-binding and (iv) disordered-around-FB regions in human proteins. To this end, the conservation of the four parsed datasets across vertebrate evolution and the compositional differences among the three parsed disordered region types were analysed.
Chapter 3 presents a bioinformatical sequence analysis that attempts to understand the evolutionary trend of methylation, acetylation and ubiquitination (MAU) sites in ordered, disordered and FB regions across 11 evolutionary clades, from the whole eukaryotic domain down to the level of the ape superfamily. This study also aims to determine the enrichment of MAU and other PTM sites in subsets of IDRs such as prion-like domains and FB regions.
1.8 References
1. Fischer, E., Einfluss der configuration auf die wirkung der enzyme. Ber. Dt. Chem. Ges., 1894. 27: p. 2985-2993.
2. Lehninger, A.L., D.L. Nelson, and M.M. Cox, Lehninger principles of biochemistry. 6th ed. 2013, New York: W.H. Freeman.
3. Dunker, A.K., et al., Intrinsically disordered proteins. Journal of Molecular Graphics, 2001. 19(1): p. 26-59.
4. Karush, F., Heterogeneity of the binding sites of bovine serum albumin. J. Am. Chem. Soc., 1950. 72: p. 2705-2713.
5. Koshland, D.E., Application of a Theory of Enzyme Specificity to Protein Synthesis. Proc Natl Acad Sci U S A, 1958. 44(2): p. 98-104.
6. McEwan, I.J., et al., Functional interaction of the c-Myc transactivation domain with the TATA binding protein: Evidence for an induced fit model of transactivation domain folding. Biochemistry, 1996. 35(29): p. 9584-9593.
7. Mazza, C., et al., Large-scale induced fit recognition of an m(7)GpppG cap analogue by the human nuclear cap-binding complex. EMBO Journal, 2002. 21(20): p. 5548-5557.
8. Fletcher, C.M. and G. Wagner, The interaction of eIF4E with 4E-BP1 is an induced fit to a completely disordered protein. Protein Science, 1998. 7(7): p. 1639-1642.
9. Hingerty, B., et al., Neutron diffraction of alpha, beta and gamma cyclodextrins: hydrogen bonding patterns. J Biomol Struct Dyn, 1984. 2(1): p. 249-60.
10. Koshland, D.E., Jr., Enzyme flexibility and enzyme action. J Cell Comp Physiol, 1959. 54: p. 245-58.
11. James, L.C. and D.S. Tawfik, The specificity of cross-reactivity: Promiscuous antibody binding involves specific hydrogen bonds rather than nonspecific hydrophobic stickiness. Protein Science, 2003. 12(10): p. 2183-2193.
12. Dunker, A.K., et al., Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform, 2000. 11: p. 161-71.
13. Uversky, V.N., J.R. Gillespie, and A.L. Fink, Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins: Structure, Function, and Genetics, 2000. 41: p. 415-427.
14. Wright, P.E. and H.J. Dyson, Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 1999. 293: p. 321-331.
15. Dyson, H.J. and P.E. Wright, Coupling of folding and binding for unstructured proteins. Current Opinion in Structural Biology, 2002. 12(1): p. 54-60.
16. Tompa, P., Intrinsically unstructured proteins. Trends Biochem Sci, 2002. 27(10): p. 527-33.
17. Dyson, H.J. and P.E. Wright, Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol, 2005. 6(3): p. 197-208.
18. Li, H.M., A. Rao, and P.G. Hogan, Interaction of calcineurin with substrates and targeting proteins. Trends in Cell Biology, 2011. 21(2): p. 91-103.
19. Oldfield, C.J. and A.K. Dunker, Intrinsically disordered proteins and intrinsically disordered protein regions. Annu Rev Biochem, 2014. 83: p. 553-84.
20. Uversky, V.N., What does it mean to be natively unfolded? Eur. J. Biochem, 2002. 269: p. 2-12.
21. Dunker, A.K. and Z. Obradovic, The protein trinity—linking function and disorder. Nature Biotechnology, 2001. 19.
22. Dunker, A.K., et al., Intrinsic disorder and protein function. Biochemistry, 2002. 41(21): p. 6573-82.
23. Dunker, A.K., et al., Identification and functions of usefully disordered proteins. Adv Protein Chem, 2002. 62: p. 25-49.
24. Oldfield, C.J., et al., Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry, 2005. 44(37): p. 12454-70.
25. Shoemaker, B.A., J.J. Portman, and P.G. Wolynes, Speeding molecular recognition by using the folding funnel: The fly-casting mechanism. Proceedings of the National Academy of Sciences of the United States of America, 2000. 97(16): p. 8868-+.
26. Tompa, P., Unstructural biology coming of age. Current Opinion in Structural Biology, 2011. 21(3): p. 419-425.
27. Spink, K.E., P. Polakis, and W.I. Weis, Structural basis of the Axin-adenomatous polyposis coli interaction. EMBO J, 2000. 19(10): p. 2270-9.
28. Barrera-Vilarmau, S., P. Obregon, and E. de Alba, Intrinsic order and disorder in the bcl-2 member harakiri: insights into its proapoptotic activity. PLoS One, 2011. 6(6): p. e21413.
29. Babu, M.M., The contribution of intrinsically disordered regions to protein function, cellular complexity, and human disease. Biochem Soc Trans, 2016. 44(5): p. 1185-1200.
30. Zhou, H.X., Intrinsic disorder: signaling via highly specific but short-lived association. Trends Biochem Sci, 2012. 37(2): p. 43-8.
31. Sugase, K., H.J. Dyson, and P.E. Wright, Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature, 2007. 447(7147): p. 1021-5.
32. Wright, P.E. and H.J. Dyson, Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol, 2015. 16(1): p. 18-29.
33. Flock, T., et al., Controlling entropy to tune the functions of intrinsically disordered regions. Curr Opin Struct Biol, 2014. 26: p. 62-72.
34. Iakoucheva, L.M., et al., The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Research, 2004. 32(3): p. 1037-1049.
35. Johnson, L.N. and R.J. Lewis, Structural basis for control by phosphorylation. Chem Rev, 2001. 101(8): p. 2209-42.
36. Vetter, S.W. and E. Leclerc, Phosphorylation of serine residues affects the conformation of the calmodulin binding domain of human protein 4.1. Eur J Biochem, 2001. 268(15): p. 4292-9.
37. Uversky, V.N. and A.K. Dunker, Understanding protein non-folding. Biochim Biophys Acta, 2010. 1804(6): p. 1231-64.
38. Habchi, J., et al., Introducing protein intrinsic disorder. Chem Rev, 2014. 114(13): p. 6561-88.
39. Brown, C.J., et al., Evolution and disorder. Curr Opin Struct Biol, 2011. 21(3): p. 441-6.
40. Moesa, H.A., et al., Chemical composition is maintained in poorly conserved intrinsically disordered regions and suggests a means for their classification. Mol Biosyst, 2012. 8(12): p. 3262-73.
41. Das, R.K., et al., Cryptic sequence features within the disordered protein p27(Kip1) regulate cell cycle signaling. Proceedings of the National Academy of Sciences of the United States of America, 2016. 113(20): p. 5616-5621.
42. Hansen, J.C., et al., Intrinsic protein disorder, amino acid composition, and histone terminal domains. Journal of Biological Chemistry, 2006. 281(4): p. 1853-1856.
43. Lu, X., et al., Chromatin Condensing Functions of the Linker Histone C-Terminal Domain Are Mediated by Specific Amino Acid Composition and Intrinsic Protein Disorder. Biochemistry, 2009. 48(1): p. 164-172.
44. van der Lee, R., et al., Classification of intrinsically disordered regions and proteins. Chem Rev, 2014. 114(13): p. 6589-631.
45. Schanda, P., V. Forge, and B. Brutscher, Protein folding and unfolding studied at atomic resolution by fast two-dimensional NMR spectroscopy. Proceedings of the National Academy of Sciences of the United States of America, 2007. 104(27): p. 11257-11262.
46. Dyson, H.J. and P.E. Wright, Unfolded proteins and protein folding studied by NMR. Chemical Reviews, 2004. 104(8): p. 3607-3622.
47. Kriwacki, R.W., et al., Structural studies of p21Waf1/Cip1/Sdi1 in the free and Cdk2-bound state: conformational disorder mediates binding diversity. Proc Natl Acad Sci U S A, 1996. 93(21): p. 11504-9.
48. Mohan, A., A study of intrinsic disorder and its role in functional proteomics. Indiana University, 2009.
49. Obradovic, Z., et al., Predicting intrinsic disorder from amino acid sequence. Proteins: Structure, Function, and Genetics, 2003.
50. Dosztanyi, Z., B. Meszaros, and I. Simon, ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics, 2009. 25(20): p. 2745-6.
51. Linding, R., et al., Protein disorder prediction: Implications for structural proteomics. Structure, 2003. 11(11): p. 1453-1459.
52. Ward, J.J., et al., The DISOPRED server for the prediction of protein disorder. Bioinformatics, 2004. 20(13): p. 2138-9.
53. Jones, D.T. and D. Cozzetto, DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics, 2015. 31(6): p. 857-63.
54. Iqbal, S. and M.T. Hoque, DisPredict: A Predictor of Disordered Protein Using Optimized RBF Kernel. PLoS One, 2015. 10(10): p. e0141551.
55. Cheng, J.L., M.J. Sweredoski, and P. Baldi, Accurate prediction of protein disordered regions by mining protein structure data. Data Mining and Knowledge Discovery, 2005. 11(3): p. 213-222.
56. Eickholt, J. and J. Cheng, DNdisorder: predicting protein disorder using boosting and deep networks. BMC Bioinformatics, 2013. 14: p. 88.
57. Prilusky, J., et al., FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 2005. 21(16): p. 3435-8.
58. Galzitskaya, O.V., S.O. Garbuzynskiy, and M.Y. Lobanov, FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics, 2006. 22(23): p. 2948-2949.
59. Linding, R., et al., GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res, 2003. 31(13): p. 3701-8.
60. Su, C.T., C.Y. Chen, and C.M. Hsu, iPDA: integrated protein disorder analyzer. Nucleic Acids Res, 2007. 35(Web Server issue): p. W465-72.
61. Dosztanyi, Z., et al., IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 2005. 21(16): p. 3433-3434.
62. Romero, P., et al., Sequence complexity of disordered protein. Proteins, 2001. 42(1): p. 38-48.
63. Obradovic, Z., et al., Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins, 2005. 61 Suppl 7: p. 176-82.
64. Peng, K., et al., Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics, 2006. 7: p. 208.
65. Ishida, T. and K. Kinoshita, PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res, 2007. 35(Web Server issue): p. W460-4.
66. Yang, Z.R., et al., RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics, 2005. 21(16): p. 3369-3376.
67. Zhang, T., et al., Intrinsic Disorder and Semi-disorder Prediction by SPINE-D. Methods Mol Biol, 2017. 1484: p. 159-174.
68. Vullo, A., et al., Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res, 2006. 34(Web Server issue): p. W164-8.
69. Walsh, I., et al., CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs. Nucleic Acids Research, 2011. 39: p. W190-W196.
70. Walsh, I., et al., ESpritz: accurate and fast prediction of protein disorder. Bioinformatics, 2012. 28(4): p. 503-9.
71. Yan, J., et al., RAPID: Fast and accurate sequence-based prediction of intrinsic disorder content on proteomic scale. Biochimica Et Biophysica Acta-Proteins and Proteomics, 2013. 1834(8): p. 1671-1680.
72. Oates, M.E., et al., D(2)P(2): database of disordered protein predictions. Nucleic Acids Res, 2013. 41(Database issue): p. D508-16.
73. Oldfield, C.J., et al., Comparing and combining predictors of mostly disordered proteins. Biochemistry, 2005. 44(6): p. 1989-2000.
74. Uversky, V.N., Protein folding revisited. A polypeptide chain at the folding-misfolding-nonfolding cross-roads: which way to go? Cellular and Molecular Life Sciences, 2003. 60(9): p. 1852-1871.
75. Ward, J.J., et al., Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol, 2004. 337(3): p. 635-45.
76. Dunker, A.K., et al., Intrinsic disorder and protein function. Biochemistry, 2002. 41(21): p. 6573-82.
77. Pancsa, R. and P. Tompa, Structural disorder in eukaryotes. PLoS One, 2012. 7(4): p. e34687.
78. Feng, Z.P., et al., Abundance of intrinsically unstructured proteins in P. falciparum and other apicomplexan parasite proteomes. Mol Biochem Parasitol, 2006. 150(2): p. 256-67.
79. Mohan, A., et al., Intrinsic disorder in pathogenic and non-pathogenic microbes: discovering and analyzing the unfoldomes of early-branching eukaryotes. Mol Biosyst, 2008. 4(4): p. 328-40.
80. Uversky, V.N., Functional roles of transiently and intrinsically disordered regions within proteins. FEBS J, 2015. 282(7): p. 1182-9.
81. Xie, H., et al., Functional anthology of intrinsic disorder. 3. Ligands, post-translational modifications, and diseases associated with intrinsically disordered proteins. J Proteome Res, 2007. 6(5): p. 1917-32.
82. Tusnady, G.E., L. Dobson, and P. Tompa, Disordered regions in transmembrane proteins. Biochim Biophys Acta, 2015. 1848(11 Pt A): p. 2839-48.
83. Mohan, A., et al., Analysis of molecular recognition features (MoRFs). Journal of Molecular Biology, 2006. 362(5): p. 1043-1059.
84. Vacic, V., et al., Characterization of molecular recognition features, MoRFs, and their binding partners. Journal of Proteome Research, 2007. 6(6): p. 2351-2366.
85. Pejaver, V., et al., The structural and functional signatures of proteins that undergo multiple events of post-translational modification. Protein Science, 2014. 23(8): p. 1077-1093.
86. Deribe, Y.L., T. Pawson, and I. Dikic, Post-translational modifications in signal integration. Nature Structural & Molecular Biology, 2010. 17(6): p. 666-672.
87. Duan, G.Y. and D. Walther, The Roles of Post-translational Modifications in the Context of Protein Interaction Networks. PLoS Computational Biology, 2015. 11(2).
88. Bah, A. and J.D. Forman-Kay, Modulation of Intrinsically Disordered Protein Function by Post-translational Modifications. J Biol Chem, 2016. 291(13): p. 6696-705.
89. Peterson, C.L. and M.A. Laniel, Histones and histone modifications. Current Biology, 2004. 14(14): p. R546-R551.
90. Garcia, B.A., et al., Organismal differences in post-translational modifications in histones H3 and H4. Journal of Biological Chemistry, 2007. 282(10): p. 7641-7655.
91. Hansen, J.C., et al., Intrinsic protein disorder, amino acid composition, and histone terminal domains. J Biol Chem, 2006. 281(4): p. 1853-6.
92. Vuzman, D., Y. Hoffman, and Y. Levy, Modulating Protein-DNA Interactions by Post-Translational Modifications at Disordered Regions. Pacific Symposium on Biocomputing 2012, 2012: p. 188-199.
93. Uversky, V.N., p53 Proteoforms and Intrinsic Disorder: An Illustration of the Protein Structure-Function Continuum Concept. Int J Mol Sci, 2016. 17(11).
94. Marks, F., Protein Phosphorylation. VCH Weinheim, New York, Basel, Cambridge, Tokyo, 1996.
95. Kathiriya, J.J., et al., Presence and utility of intrinsically disordered regions in kinases. Mol Biosyst, 2014. 10(11): p. 2876-88.
96. Kathiriya, J.J., et al., Data on evolution of intrinsically disordered regions of the human kinome and contribution of FAK1 IDRs to cytoskeletal remodeling. Data Brief, 2017. 10: p. 315-324.
97. Manning, G., et al., Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci, 2002. 27(10): p. 514-20.
98. Gao, J. and D. Xu, Correlation between posttranslational modification and intrinsic disorder in protein. Pac Symp Biocomput, 2012: p. 94-103.
99. Gao, J.J. and D. Xu, Correlation between Posttranslational Modification and Intrinsic Disorder in Protein. Pacific Symposium on Biocomputing 2012, 2012: p. 94-103.
100. Beltrao, P., et al., Evolution and functional cross-talk of protein post-translational modifications. Mol Syst Biol, 2013. 9: p. 714.
101. Hagai, T., et al., The origins and evolution of ubiquitination sites. Mol Biosyst, 2012. 8(7): p. 1865-77.
102. Kim, D.S. and Y. Hahn, Gains of ubiquitylation sites in highly conserved proteins in the human lineage. BMC Bioinformatics, 2012. 13: p. 306.
103. Kim, D.S. and Y. Hahn, Identification of novel phosphorylation modification sites in human proteins that originated after the human-chimpanzee divergence. Bioinformatics, 2011. 27(18): p. 2494-2501.
104. Nilsson, J., M. Grahn, and A.P. Wright, Proteome-wide evidence for enhanced positive Darwinian selection within intrinsically disordered regions in proteins. Genome Biol, 2011. 12(7): p. R65.
105. Chen, J.W., et al., Conservation of intrinsic disorder in protein domains and families: I. A database of conserved predicted disordered regions. J Proteome Res, 2006. 5(4): p. 879-87.
106. Brown, C.J., A.K. Johnson, and G.W. Daughdrill, Comparing models of evolution for ordered and disordered proteins. Mol Biol Evol, 2010. 27(3): p. 609-21.
107. Bellay, J., et al., Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biology, 2011. 12(2).
108. Kim, P.M., et al., The role of disorder in interaction networks: a structural analysis. Mol Syst Biol, 2008. 4: p. 179.
109. Zarin, T., et al., Selection maintains signaling function of a highly diverged intrinsically disordered region. Proceedings of the National Academy of Sciences of the United States of America, 2017. 114(8): p. E1450-E1459.
110. Brown, C.J., et al., Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol, 2002. 55(1): p. 104-10.
111. Mosca, R., R.A. Pache, and P. Aloy, The Role of Structural Disorder in the Rewiring of Protein Interactions through Evolution. Molecular & Cellular Proteomics, 2012. 11(7).
112. Wilson, B.A., et al., Young Genes are Highly Disordered as Predicted by the Preadaptation Hypothesis of De Novo Gene Birth. Nat Ecol Evol, 2017. 1(6): p. 0146-146.
113. Afanasyeva, A., et al., Human long intrinsically disordered protein regions are frequent targets of positive selection. Genome Res, 2018. 28(7): p. 975-982.
114. Daughdrill, G.W., et al., Dynamic behavior of an intrinsically unstructured linker domain is conserved in the face of negligible amino acid sequence conservation. Journal of Molecular Evolution, 2007. 65(3): p. 277-288.
115. Banerjee, S., S. Chakraborty, and R.K. De, Deciphering the cause of evolutionary variance within intrinsically disordered regions in human proteins. Journal of Biomolecular Structure & Dynamics, 2017. 35(2): p. 233-249.
116. Szalkowski, A.M. and M. Anisimova, Markov models of amino acid substitution to study proteins with intrinsically disordered regions. PLoS One, 2011. 6(5): p. e20488.
117. Midic, U., A.K. Dunker, and Z. Obradovic, Protein sequence alignment and structural disorder: a substitution matrix for an extended alphabet. Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics. New York, NY, USA: ACM, StReBio '09., 2009: p. 2731.
118. Thompson, J.D., et al., A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives. PLoS One, 2011. 6(3).
119. Varadi, M., et al., DisCons: a novel tool to quantify and classify evolutionary conservation of intrinsic protein disorder. BMC Bioinformatics, 2015. 16.
120. Lange, J., L.S. Wyrwicz, and G. Vriend, KMAD: knowledge-based multiple sequence alignment for intrinsically disordered proteins. Bioinformatics, 2016. 32(6): p. 932-6.
121. Uversky, V.N., Intrinsically disordered proteins from A to Z. Int J Biochem Cell Biol, 2011. 43(8): p. 1090-103.
122. Uversky, V.N., Intrinsically disordered proteins and their environment: effects of strong denaturants, temperature, pH, counter ions, membranes, binding partners, osmolytes, and macromolecular crowding. Protein J, 2009. 28(7-8): p. 305-25.
123. Romero, P., et al., Identifying disordered regions in proteins from amino acid sequence. 1997 IEEE International Conference on Neural Networks, Vols 1-4, 1997: p. 90-95.
124. Romero, P., et al., Sequence Complexity of Disordered Protein. Proteins: Structure, Function, and Genetics, 2001. 42.
125. Fukuchi, S., et al., Binary classification of protein molecules into intrinsically disordered and ordered segments. BMC Structural Biology, 2011. 11.
126. Zhang, T., et al., SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. J Biomol Struct Dyn, 2012. 29(4): p. 799-813.
127. Mészáros, B., I. Simon, and Z. Dosztányi, Prediction of Protein Binding Regions in Disordered Proteins. PLoS Comput Biol, 2009. 5(5).
128. He, B., et al., Predicting intrinsic disorder in proteins: an overview. Cell Research, 2009. 19(8): p. 929-949.
129. Radivojac, P., et al., Protein flexibility and intrinsic disorder. Protein Science, 2004. 13(1): p. 71-80.
130. Romero, P., Z. Obradovic, and A.K. Dunker, Sequence Data Analysis for Long Disordered Regions Prediction in the Calcineurin Family. Genome Inform Ser Workshop Genome Inform, 1997. 8: p. 110-124.
131. Lobley, A., et al., Inferring function using patterns of native disorder in proteins. PLoS Computational Biology, 2007. 3(8): p. 1567-1579.
132. Tompa, P. and L. Kalmar, Power Law Distribution Defines Structural Disorder as a Structural Element Directly Linked with Function. Journal of Molecular Biology, 2010. 403(3): p. 346-350.
133. Fuxreiter, M., P. Tompa, and I. Simon, Local structural disorder imparts plasticity on linear motifs. Bioinformatics, 2007. 23(8): p. 950-6.
134. Monika, F.J., et al., Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. Biophysical Journal, 2005. 88(1): p. 560a-560a.
135. Tompa, P., et al., Close encounters of the third kind: disordered domains and the interactions of proteins. Bioessays, 2009. 31(3): p. 328-335.
136. Pentony, M.M. and D.T. Jones, Modularity of intrinsic disorder in the human proteome. Proteins-Structure Function and Bioinformatics, 2010. 78(1): p. 212-221.
137. Edwards, Y.J., et al., Insights into the regulation of intrinsically disordered proteins in the human proteome by analyzing sequence and gene expression data. Genome Biol, 2009. 10(5): p. R50.
138. Gsponer, J. and M.M. Babu, The rules of disorder or why disorder rules. Progress in Biophysics & Molecular Biology, 2009. 99(2-3): p. 94-103.
139. Meng, F., V.N. Uversky, and L. Kurgan, Comprehensive review of methods for prediction of intrinsic disorder and its molecular functions. Cell Mol Life Sci, 2017. 74(17): p. 3069-3090.
140. Pines, J., Cyclins and Cyclin-Dependent Kinases - a Biochemical View. Biochemical Journal, 1995. 308: p. 697-711.
141. Cheng, Y., et al., Mining alpha-helix-forming molecular recognition features with cross species sequence alignments. Biochemistry, 2007. 46(47): p. 13468-77.
142. Disfani, F.M., et al., MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics, 2012. 28(12): p. I75-I83.
143. Fang, C., et al., MFSPSSMpred: identifying short disorder-to-order binding regions in disordered proteins based on contextual local evolutionary conservation. BMC Bioinformatics, 2013. 14.
144. Malhis, N. and J. Gsponer, Computational identification of MoRFs in protein sequences. Bioinformatics, 2015. 31(11): p. 1738-44.
145. Malhis, N., M. Jacobson, and J. Gsponer, MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences. Nucleic Acids Res, 2016. 44(W1): p. W488-93.
146. Yan, J., et al., Molecular recognition features (MoRFs) in three domains of life. Mol Biosyst, 2016. 12(3): p. 697-710.
147. Xue, B., A.K. Dunker, and V.N. Uversky, Retro-MoRFs: identifying protein binding sites by normal and reverse alignment and intrinsic disorder prediction. Int J Mol Sci, 2010. 11(10): p. 3725-47.
148. Sharma, R., et al., OPAL: prediction of MoRF regions in intrinsically disordered protein sequences. Bioinformatics, 2018. 34(11): p. 1850-1858.
149. Mooney, C., et al., Prediction of short linear protein binding regions. J Mol Biol, 2012. 415(1): p. 193-204.
150. Khan, W., et al., Predicting Binding within Disordered Protein Regions to Structurally Characterised Peptide-Binding Domains. PLoS One, 2013. 8(9).
151. Davey, N.E., D.C. Shields, and R.J. Edwards, SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent. Nucleic Acids Res, 2006. 34(12): p. 3546-54.
152. Prytuliak, R., et al., HH-MOTiF: de novo detection of short linear motifs in proteins by Hidden Markov Model comparisons. Nucleic Acids Res, 2017. 45(18): p. 10921.
153. Krystkowiak, I. and N.E. Davey, SLiMSearch: a framework for proteome-wide discovery and annotation of functional modules in intrinsically disordered regions. Nucleic Acids Research, 2017. 45(W1): p. W464-W469.
154. Gouw, M., et al., The eukaryotic linear motif resource - 2018 update. Nucleic Acids Res, 2018. 46(D1): p. D428-D434.
155. Davey, N.E., et al., SLiMPrints: conservation-based discovery of functional motif fingerprints in intrinsically disordered protein regions. Nucleic Acids Res, 2012. 40(21): p. 10628-41.
156. Ren, S., et al., Short Linear Motifs recognized by SH2, SH3 and Ser/Thr Kinase domains are conserved in disordered protein regions. BMC Genomics, 2008. 9.
157. Uversky, V.N., Intrinsically disordered proteins and their (disordered) proteomes in neurodegenerative disorders. Frontiers in Aging Neuroscience, 2015. 7.
158. Uversky, V.N., C.J. Oldfield, and A.K. Dunker, Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys, 2008. 37: p. 215-46.
159. Babu, M.M., et al., Intrinsically Disordered Proteins: Regulation and Disease. Biomolecular Forms and Functions: A Celebration of 50 Years of the Ramachandran Map, 2013: p. 346-361.
160. Uversky, V.N., et al., Unfoldomics of human diseases: linking protein intrinsic disorder with diseases. BMC Genomics, 2009. 10.
161. Oldfield, C.J., et al., Flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners. BMC Genomics, 2008. 9.
41
162. Srinivasan, M. and A.K. Dunker, Proline rich motifs as drug targets in immune mediated disorders. Int J Pept, 2012. 2012: p. 634769. 163. Uversky, V.N., et al., Unfoldomics of human diseases: linking protein intrinsic disorder with diseases. BMC Genomics, 2009. 10 Suppl 1: p. S7. 164. Neira, J.L., et al., Identification of a Drug Targeting an Intrinsically Disordered Protein Involved in Pancreatic Adenocarcinoma. Scientific Reports, 2017. 7. 165. Ambadipudi, S. and M. Zweckstetter, Targeting intrinsically disordered proteins in rational drug discovery. Expert Opinion on Drug Discovery, 2016. 11(1): p. 65-77. 166. Kumar, D., N. Sharma, and R. Giri, Therapeutic Interventions of Cancers Using Intrinsically Disordered Proteins as Drug Targets: c-Myc as Model System. Cancer Informatics, 2017. 16. 167. Maity, B.K., Dynamics Based Drug Design for Intrinsically Disordered Proteins. Biophysical Journal, 2018. 114(3): p. 590a-590a. 168. Metallo, S.J., Intrinsically disordered proteins are potential drug targets. Current Opinion in Chemical Biology, 2010. 14(4): p. 481-488. 169. Zhang, Y.N., et al., Inhibition of Wnt signaling by Dishevelled PDZ peptides. Nature Chemical Biology, 2009. 5(4): p. 217-219. 170. Filippakopoulos, P., et al., Selective inhibition of BET bromodomains. Nature, 2010. 468(7327): p. 1067-1073. 171. Puca, R., et al., Regulation of p53 activity by HIPK2: molecular mechanisms and therapeutical implications in human cancer cells. Oncogene, 2010. 29(31): p. 4378-4387. 172. Souers, A.J., et al., ABT-199, a potent and selective BCL-2 inhibitor, achieves antitumor activity while sparing platelets. Nature Medicine, 2013. 19(2): p. 202- 208. 173. Adams, J.M. and S. Cory, The BCL-2 arbiters of apoptosis and their growing role as cancer targets. Cell Death and Differentiation, 2018. 25(1): p. 27-36. 174. Reed, J.C., Bcl-2 on the brink of breakthroughs in cancer treatment. Cell Death and Differentiation, 2018. 25(1): p. 3-6.
42
CHAPTER II
2 Bioinformatical parsing of folding-on-binding proteins
reveals compositional sequence design and evidence for
a general guiding mechanism for binding
A version of this chapter was originally published as: Narasumani, M. and Harrison, P. M. Bioinformatical parsing of folding-on-binding proteins reveals their compositional and evolutionary sequence design. Sci. Rep. 5, 18586; doi: 10.1038/srep18586 (2015).
2.1 Abstract
Intrinsic disorder occurs when (part of) a protein remains unfolded during normal functioning. Intrinsically disordered regions can contain segments that ‘fold on binding’ to another molecule. Here, we perform bioinformatical parsing of human ‘folding-on-binding’ (FB) proteins into four subsets: Ordered regions, FB regions, Disordered regions that surround FB regions (‘Disordered-around-FB’), and Other-Disordered regions. We examined the composition and evolutionary behaviour (across vertebrate orthologs) of these subsets. From a convergence of three separate analyses, we find that for hydrophobicity, Ordered regions segregate from the other subsets, whereas for conservation the Ordered and FB regions group together as highly conserved and the Disordered-around-FB and Other-Disordered regions group together as less conserved (with a smaller, but still significant, difference between Ordered and FB regions). FB regions are highly conserved with net positive charge, whereas Disordered-around-FB regions have net negative charge and are less hydrophobic than FB regions. Indeed, these Disordered-around-FB regions are more hydrophilic than other disordered regions generally. We describe how our results point towards a possible compositionally based steering mechanism of folding-on-binding.
2.2 Introduction
Intrinsically disordered regions, in at least one of their functional modes, do not have a well-defined three-dimensional structure under physiological conditions [1]. They are involved in specific functions such as molecular recognition, molecular assembly, protein modification, and entropic chain activities [2]. They are more prevalent in eukaryotes than in prokaryotes [3, 4]. Approximately a third of proteins in eukaryotes are estimated to contain long disordered regions of 30 or more amino acids [3, 5]. These regions are associated with a wide variety of functions, most notably signal transduction and the regulation of transcription and translation [3, 5]. Disordered regions are characterized experimentally using several approaches, such as analysis of areas with missing electron density in an X-ray determined structure, or NMR spectroscopy. They can also be predicted by algorithms that analyse charge, hydrophobicity, low sequence complexity, amino acid composition and other factors [6-9]. Statistical studies show that the amino acid sequences of disordered regions differ significantly from those of ordered regions [10].
Protein interaction analysis has shown that disordered regions are abundant in proteins with large numbers of interacting partners [11, 12]. Many proteins with disordered regions exhibit coupled folding and binding, which has been shown to be a common process of molecular recognition and plays significant roles in protein function [13, 14]. Such disordered regions, which are termed here ‘folding on binding’ (FB) regions, are highly flexible and exhibit a well-defined structure only upon binding to a specific partner molecule [15]. These regions have been reported to confer high specificity towards a partner molecule [16].
In general, disordered regions are characterised by low hydrophobicity and somewhat higher net charge [17, 18]. However, such trends are not clear for FB regions specifically [19, 20]. A study of FB region complexes showed that the interfaces of FB regions are enriched in hydrophobic residues and appear to be more conserved than other disordered regions in the same proteins [21]. IDRs exhibit different accepted point mutations, and show increased rates of insertions and deletions [17, 22, 23]. A comparative study on the evolution of ordered and disordered proteins suggested that disordered proteins evolve more rapidly than ordered proteins [17]. However, this is not always the case, and a smaller group of disordered proteins appears to evolve very slowly [23]. Analyses of the evolution of disordered regions have thus yielded contradictory results [22, 24].
Here, we have studied the composition and conservation of proteins that form FB regions in human protein complexes. Specifically, we have parsed these proteins into four subsets of sequence: (i) Ordered regions, (ii) FB regions, (iii) disordered regions around FB regions (‘Disordered-around-FB’), and (iv) Other-Disordered regions in the proteins. We ask whether the composition, and the conservation behaviour across vertebrate orthologs, differ significantly between these biophysically relevant subsets. We found a complex pattern of conservation and composition, with all of these regions having significantly different combinations of composition and conservation behaviour. Indeed, ‘Disordered-around-FB’ regions are the least hydrophobic regions and are more evolutionarily variable than the FB regions, and the FB regions are of comparable hydrophobicity to Other-Disordered regions in the proteins. We discuss the mechanistic implications of this compositional sequence design.
2.3 Methods
2.3.1 Data sets
Human experimentally verified intrinsically disordered protein sequences were retrieved from the IDEAL (Intrinsically Disordered proteins with Extensive Annotations and Literature) database [25, 26] (sequences retrieved in August 2014). The data sets were reduced for sequence redundancy (at the 40% sequence identity level) using the CD-HIT tool [27]. This gave a total of 99 human intrinsically disordered proteins with FB regions. For some analyses we also used a data set of 134 disordered proteins from DisProt (Database of Protein Disorder), release 6.02 [28]. To make multiple sequence alignments, orthologs of these human proteins in other vertebrates were obtained using the Ensembl BioMart data mining tool [29].
2.3.2 Multiple sequence alignments
Multiple sequence alignments (MSAs) of human intrinsically disordered proteins
along with their orthologs from other vertebrates were generated using MUSCLE v3.8.31
[30].
2.3.3 Conservation analysis of the aligned sequences
The position-specific conservation of the aligned protein sequences was calculated using the AL2CO program [31], which gives a conservation index for each position of the human proteins in the MUSCLE multiple sequence alignments. In AL2CO, the amino acid frequencies at each position are estimated and the conservation index is calculated from these frequencies. We used the entropy-based method of AL2CO, which uses sequence information entropy and calculates the frequency of amino acids by grouping amino acids with similar physicochemical properties. We consider this suitable for analysing intrinsically disordered regions, since they are compositionally defined regions of protein sequences.
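As an illustration of what an entropy-based conservation index measures, the following minimal Python sketch scores each alignment column as the sum of f ln f over the observed residue frequencies f. It is not AL2CO itself: the function names are hypothetical, gaps are simply ignored (an assumption), and AL2CO's own weighting, normalisation and the residue grouping described above are not reproduced.

```python
import math
from collections import Counter


def column_entropy_conservation(column, gap="-"):
    """Return sum(f_a * ln f_a) over the residues in one alignment column.

    The score is 0 for a fully conserved column and becomes more negative
    as the column becomes more variable. Gaps are ignored (an assumption).
    """
    residues = [aa for aa in column if aa != gap]
    if not residues:
        return float("nan")
    counts = Counter(residues)
    total = len(residues)
    return sum((n / total) * math.log(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Score every column of an alignment given as equal-length strings."""
    return [column_entropy_conservation(col) for col in zip(*alignment)]


# Toy alignment (hypothetical sequences, for illustration only).
toy_alignment = ["MKV", "MRV", "MKL", "MKI"]
profile = conservation_profile(toy_alignment)
```

In this toy alignment the first column is invariant and scores 0, while the second and third columns score progressively lower as more residue types appear.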
2.3.4 Hydrophobicity and Charge calculation
The hydrophobicity of the four parsed regions in human protein sequences was calculated with ProtScale [32], using the Kyte & Doolittle [33] hydrophobicity scale and a window size of 5. The net charge at pH 7.0 was calculated by summing the charges of the positively and negatively charged residues [18]. The absolute value of the net charge (i.e., the total ‘chargedness’) was also calculated by making all negative values positive (this is presented in Figure 2.3A).
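The two quantities described above can be sketched in a few lines of Python. This is an illustrative approximation rather than the ProtScale implementation: the function names are hypothetical, the window is unweighted, and treating K and R as +1, D and E as -1 and histidine as neutral at pH 7.0 is an assumption.

```python
# Kyte & Doolittle (1982) hydropathy values for the twenty amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def windowed_hydropathy(sequence, window=5):
    """Mean Kyte-Doolittle value for every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def net_charge(sequence):
    """Count of K and R minus count of D and E (histidine taken as neutral)."""
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return positive - negative


def mean_net_charge(sequence, absolute=False):
    """Net charge per residue; optionally the absolute value ('chargedness')."""
    value = net_charge(sequence) / len(sequence)
    return abs(value) if absolute else value
```

With a window of 5, a sequence of n residues yields n - 4 hydropathy values, because only full windows are scored in this sketch.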
2.4 Results and Discussion
2.4.1 Overview of the data sets
From the 99 human proteins containing FB regions that are the subject of this study, we parsed the following four sets of regions: (i) ‘Ordered’ protein domains; (ii) folding-on-binding regions (‘FB’ set); (iii) the intrinsically disordered regions around FB regions (‘Disordered-around-FB’ regions); and (iv) intrinsically disordered regions that do not contain FB regions (‘Other-Disordered’ regions). The Ordered region set comprises experimentally verified structures that do not have a known alternative intrinsically disordered state. The Disordered-around-FB and Other-Disordered regions are experimentally reported only as intrinsically disordered. The FB regions have an experimentally determined structure when bound to their partner molecule, and have also been shown to be intrinsically disordered at other times. These data sets are compared for their trends in composition and conservation, as populations of sequences, using the pipeline of methods detailed in Figure 2.1. The conservation of the four parsed region types across vertebrate evolution was analysed, and a conservation score calculated (as detailed in Methods). An example of the parsing of a sequence into the four subsets is shown for human parathyroid hormone-like protein (Figure 2.2), with the same colour scheme as Figure 2.1.
Figure 2.1 Pipeline of the analysis performed. (A) The four parsed datasets are represented: (i) Ordered set, (ii) Disordered set, (iii) folding-on-binding regions (‘FB’ set), and (iv) the disordered-around-FB regions (DFB). (B) The sequence analysis performed on the four parsed datasets is highlighted.
Figure 2.2 Example alignment of a parsed protein. Multiple sequence alignment of human parathyroid hormone-like protein and its vertebrate orthologs, depicted using JalView, showing the four region types. This figure uses the same colour scheme as Figure 2.1.
2.4.2 Analysis of Ordered, Disordered, FB and Disordered around FB
regions as populations of sequences
Firstly, we asked whether we could distinguish the four region types according to their broad compositional characteristics. Since IDRs exhibit a distinct amino acid composition, ordered and disordered regions can be classified based on their net charge and hydrophobicity. Specifically, we wished to understand whether the composition of the three parsed disordered region types is different. A comparison of the mean hydrophobicity and mean net charge of the four parsed region types is shown in Figure 2.3A, B. For the first plot, we use the absolute value of the mean net charge (Figure 2.3A), and for the second plot the raw mean net charge value (Figure 2.3B; see Methods for details). In these plots we only consider longer tracts (≥20 residues). In line with a previous study [18], the Ordered subset stands out as more hydrophobic than the three other region types. We fitted lines (as described in the figure legend) that give optimum discrimination (>95%) of the Ordered subset from the Other-Disordered set. The black and red lines represent the two extremes of slope for such fitted boundary lines (Figure 2.3A). In Figure 2.3A, the other three sets scatter on either side of the lines and are not well segregated (24%–46% on the other side of the line). In Figure 2.3B, using the raw value of the mean net charge, the two disordered sets are not well discriminated from the Ordered set (39–50%), but the FB regions segregate better with the Ordered set (74% on the same side of the line).
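The boundary lines discussed here were obtained by Monte Carlo sampling of slope and intercept values. A minimal Python sketch of such a search is given below, under stated assumptions: the function names, the sampling ranges, the number of trials and the toy points are all hypothetical, and the score is the overall fraction of points on the expected side of the line rather than the per-set percentages reported in the text.

```python
import random


def discrimination(points_below, points_above, slope, intercept):
    """Fraction of (H, C) points on their expected side of C = slope*H + intercept."""
    correct = sum(1 for h, c in points_below if c < slope * h + intercept)
    correct += sum(1 for h, c in points_above if c > slope * h + intercept)
    return correct / (len(points_below) + len(points_above))


def fit_boundary(points_below, points_above, n_trials=5000, seed=0,
                 slope_range=(-2.0, 2.0), intercept_range=(-1.0, 1.0)):
    """Random search over slope and intercept; returns (slope, intercept, score)."""
    rng = random.Random(seed)
    best = (0.0, 0.0, -1.0)
    for _ in range(n_trials):
        slope = rng.uniform(*slope_range)
        intercept = rng.uniform(*intercept_range)
        score = discrimination(points_below, points_above, slope, intercept)
        if score > best[2]:
            best = (slope, intercept, score)
    return best


# Toy (mean hydrophobicity, mean net charge) points, for illustration only.
ordered_like = [(0.8, 0.10), (0.7, 0.05), (0.9, 0.20)]
disordered_like = [(0.2, 0.50), (0.3, 0.60), (0.1, 0.40)]
slope, intercept, score = fit_boundary(ordered_like, disordered_like)
```

Because many lines can give the same best score, a search of this kind naturally yields a range of equally good slopes, which is why two extreme lines are reported in the figure.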
[Figure 2.3, panels A and B: mean net charge plotted against mean hydrophobicity for the Ordered, Folding on Binding (FB), Disordered around FB and Other Disordered data sets.]
Figure 2.3 Analysis of the four region types as populations of sequences. Only fragments ≥20 residues in length are used in the plots. The values of mean hydrophobicity and mean conservation score are normalized to the range [0, 1]. (A) Mean hydrophobicity versus mean net-charge (absolute value). Lines were fitted to discriminate between Ordered and Other-Disordered regions by iterative Monte Carlo sampling of a wide range of intercept and slope values. The two lines (red and black) represent the two extremes of slope that give the same best percentage discrimination of Ordered regions (100%) (equations C = 1.21 H - 0.34, and C = 0.47 H - 0.06, where C is the mean net charge and H is the mean hydrophobicity, in the fragments). Here the absolute value of the mean net-charge is used (i.e., negative values are made positive). Box plots are drawn using the same colour coding as the main scatter plot. The whiskers extend from the hinge to the highest/lowest values that are within 1.5 * IQR of the hinge, where IQR is the inter-quartile range, or distance between the first and third quartiles. (B) Mean hydrophobicity versus mean net-charge (raw value). Lines were fitted as above in (A). The two lines (red and black) represent the two extremes of slope that give the same best percentage discrimination of Ordered regions (94%) (equations C = 0.11 H - 0.11, and C = 0.05 H - 0.08, where C is the mean net charge and H is the mean hydrophobicity, in the fragments).
A plot of hydrophobicity versus region length shows that a single length threshold effectively segregates Ordered regions from the three other parsed subsets, which are intermingled (81% discrimination of the Ordered set, and >85% of each of the other three sets on the other side of the line; Figure 2.4A). Finally, an almost horizontal boundary line in the plot of conservation score versus hydrophobicity was found to discriminate effectively between the Ordered and Other-Disordered regions (Figure 2.4B), with the Ordered set pulling the FB regions with it (93% correct discrimination for Ordered regions, 62% for FB regions), and the Other-Disordered set pulling the Disordered-around-FB regions with it (85% for Other-Disordered regions, 82% for Disordered-around-FB regions).
[Figure 2.4, panels A and B: (A) amino acid length and (B) mean conservation score, each plotted against mean hydrophobicity, for the Ordered, Folding on Binding (FB), Disordered around FB and Other Disordered data sets.]
Figure 2.4 Analysis of the four region types as populations of sequences. Only fragments ≥20 residues in length are used in the plots. The values of mean hydrophobicity and mean conservation score are normalized to the range [0, 1]. (A) Mean hydrophobicity versus length. The colour scheme is as for Figure 2.3. A simple length threshold of region length = 93 was found to be the best boundary between Ordered and Other-Disordered regions; the same line was also optimal for discriminating between Ordered and either Disordered-around-FB or FB regions. (B) Mean conservation score versus mean hydrophobicity. The colour scheme is as for part (A). An almost horizontal line was found to be the best boundary between Ordered and Other-Disordered regions (equation S = 0.01 H + 0.59, where S is the mean conservation score and H is the mean hydrophobicity, in the fragments). Box plots are drawn using the same colour coding as the main scatter plot (see Figure 2.3 legend for details).
Thus, Ordered regions are distinguished from the other region types by their hydrophobicity and length, whereas when conservation is considered the Ordered regions segregate along with the FB regions, apart from the Disordered-around-FB and Other-Disordered regions.
2.4.3 Further analysis of compositional differences between the
Ordered, Disordered, FB and Disordered around FB parsed
subsets
The distribution of hydrophobicity and net charge for the populations of residues in the four parsed subsets (shown in Figure 2.5A, B) was analysed for significant differences
(Tables 2.1 to 2.4). This analysis includes the data for shorter sequence tracts (<20 residues in length).
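The pairwise comparisons in Tables 2.1 and 2.3 use the Wilcoxon rank-sum test. For readers unfamiliar with it, a minimal pure-Python sketch follows; it assumes the large-sample normal approximation without tie or continuity correction, and the function names are hypothetical. It is an illustration of the test, not the software used to produce the tables.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum test; returns (z, p) by normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    w = sum(ranks[:n_a])                      # rank sum of the first sample
    mean_w = n_a * (n_a + n_b + 1) / 2        # expected rank sum under H0
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Here `sample_a` and `sample_b` would be the per-residue values (for example hydrophobicity) of two parsed subsets.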
[Figure 2.5: histograms (percentage) of charge, conservation score and hydrophobicity for the Ordered, Folding on Binding (FB), Disordered around FB and Other Disordered sets.]
Figure 2.5 Trends in composition and conservation for the four parsed region types. (A) Histogram of charge for the total set of residues for the four subsets. The colour scheme is: Ordered, blue (total = 17868); Other-Disordered, red (2040); FB, green (3205); Disordered-around-FB, orange (2936). Percentages are shown. (B) Histogram of hydrophobicity for the total set of residues for the four subsets. The colour scheme is the same as part (A). (C) Histogram of conservation score for the total set of residues for the four subsets. The colour scheme is the same as part (A).
Table 2.1 Comparison of the hydrophobicities of the parsed subsets.
Datasets                                      P-value*
Ordered vs Other-Disordered                   <0.0001
Ordered vs FB                                 <0.0001
Ordered vs Disordered-around-FB               <0.0001
Other-Disordered vs FB                        NS†
Other-Disordered vs Disordered-around-FB      <0.0001
FB vs Disordered-around-FB                    <0.0001
*P-values for the Wilcoxon rank-sum test.
†Not significant.
Table 2.2 Mean hydrophobicity values of the four region types.
Subset                    Mean*
Ordered                   -0.3219 (±1.373)
Other-Disordered          -0.867 (±1.278)
FB                        -0.834 (±1.326)
Disordered-around-FB      -1.026 (±1.178)
*Sample size: 17869 (Ordered), 2036 (Other-Disordered), 3201 (FB), 2932 (Disordered-around-FB).
Table 2.3 Comparison of the net charges of the parsed subsets.
Datasets                                      P-value*
Ordered vs Other-Disordered                   <0.0001
Ordered vs FB                                 NS†
Ordered vs Disordered-around-FB               <0.0001
Other-Disordered vs FB                        <0.0001
Other-Disordered vs Disordered-around-FB      NS†
FB vs Disordered-around-FB                    <0.0001
*P-values for the Wilcoxon rank-sum test.
†Not significant.
Table 2.4 Mean net-charge values of the parsed subsets.
Subset                    Mean*
Ordered                   0.004 (±0.508)
Other-Disordered          -0.060 (±0.502)
FB                        0.020 (±0.559)
Disordered-around-FB      -0.045 (±0.493)
*Sample size: 17869 (Ordered), 2036 (Other-Disordered), 3201 (FB), 2932 (Disordered-around-FB).
Taken together, the results for hydrophobicity (Tables 2.1 and 2.2) indicate the following significant trend:
Ordered > (Other-Disordered ~ FB) > Disordered-around-FB
Thus, Disordered-around-FB regions are distinctly the most hydrophilic parsed subset, with FB regions, in general, approximately as hydrophobic as Other-Disordered regions in the same sequences. It has been observed previously that the interfaces of proteins that undergo a disorder-to-order transition are more hydrophobic [34, 35], as is generally observed in protein-protein interactions [36]. However, it has also been suggested that the polar and charged amino acids present in FB proteins play a major role in interacting with the partner molecules [37], which may explain why the overall hydrophobicity of FB regions is here indistinguishable from that of other disordered tracts. The Disordered-around-FB regions, however, are clearly distinct in composition from the FB regions.
The total net charge of each of the four datasets was calculated at pH 7 (Figure 2.5A). Taken together, the results for net charge (Tables 2.3 and 2.4) indicate a significant trend, summarized by the following inequality:
(Ordered ~ FB) > (Disordered-around-FB ~ Other-Disordered)
Thus, regions that can be structured (Ordered and FB) have an overall positive charge, whereas the other sets have an overall negative charge. If we examine the frequencies of the twenty amino acids in the four subsets, there are some distinctive trends for each subset (Figure 2.6): the Disordered-around-FB regions have a pronounced preference for T, S, G and P, with the Other-Disordered regions having a similar but less pronounced preference for S, G and P. Glycine and proline residues control the flexibility of the polypeptide chain, and so areas rich in these residues may be designed to bend or deform in specific ways.
[Figure 2.6: bar chart of the percentage of each amino acid (in the order R, K, D, E, C, Q, H, S, T, Y, N, A, I, L, M, V, W, F, G, P) for the Ordered (O), Disordered (D), FB and Disordered-around-FB (DFB) sets.]
Figure 2.6 Comparison of the overall amino-acid composition of the four region types. The amino acid compositions of the Ordered, Disordered, FB and Disordered-around-FB regions are represented.
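The composition comparison in Figure 2.6 amounts to counting residues over all regions of a subset and expressing each count as a percentage. A minimal Python sketch is shown below; the function name and the toy regions are hypothetical, and characters outside the twenty standard amino acids are skipped by assumption.

```python
from collections import Counter

# The twenty amino acids, in the order used on the Figure 2.6 axis.
AMINO_ACIDS = "RKDECQHSTYNAILMVWFGP"


def composition_percent(regions):
    """Percentage of each of the twenty amino acids over a set of regions."""
    counts = Counter()
    for region in regions:
        counts.update(aa for aa in region if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Toy regions (hypothetical sequences, for illustration only).
toy_composition = composition_percent(["GGPS", "PGSS"])
```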
2.4.4 Complex pattern of sequence conservation in FB-containing
proteins
The distribution of conservation scores (shown in Figure 2.5C) was analysed for significant trends (Tables 2.5 and 2.6). Taken together, the results give the following overall tendency for conservation:
Ordered > FB > (Disordered-around-FB ~ Other-Disordered)
Thus, FB regions are distinctly a highly conserved set, but not as highly conserved as the Ordered set. The Disordered-around-FB and Other-Disordered regions are the most evolutionarily variable (Tables 2.5 and 2.6).
Table 2.5 Comparison of the conservation scores of the parsed subsets.
Datasets                                      P-value*
Ordered vs Other-Disordered                   <0.0001
Ordered vs FB                                 0.031
Ordered vs Disordered-around-FB               <0.0001
Other-Disordered vs FB                        <0.0001
Other-Disordered vs Disordered-around-FB      NS†
FB vs Disordered-around-FB                    <0.0001
*P-values for the Wilcoxon rank-sum test.
†Not significant.
Table 2.6 Mean conservation score values of the parsed region types.
Subset                    Mean*
Ordered                   0.278 (±0.916)
Other-Disordered          -0.368 (±1.050)
FB                        0.234 (±1.021)
Disordered-around-FB      -0.310 (±0.986)
*Sample size: 17869 (Ordered), 2036 (Other-Disordered), 3201 (FB), 2932 (Disordered-around-FB).
2.4.5 Sampling analysis of parsed subsets
We also analysed the parsed FB subset as a sample of the larger total ordered and disordered sets (Table 2.7). We examined the FB set as a sample of the total ordered regions (Ordered + FB), and also as a sample of the total disordered regions (FB + Disordered-around-FB + Other-Disordered). The results are in agreement with the analyses performed above, with the FB regions being very distinctive among the total disordered set for conservation (<0.1% of the random samples are more conserved) and net charge (<0.1% are more positively charged), and among the total ordered set for hydrophobicity (<0.1% are less hydrophobic).
Table 2.7 FB set as a sample of the total ordered and disordered sets.
Sampling*                    Ranking of the mean of each quality for the original set in the list of samples**
Conservation
  FB in total ordered        21.6 percentile
  FB in total disordered     99.9 percentile
Hydrophobicity
  FB in total ordered        0.1 percentile
  FB in total disordered     87.8 percentile
Charge
  FB in total ordered        89.5 percentile
  FB in total disordered     99.9 percentile
*Total ordered = Ordered + FB; total disordered = Disordered-around-FB + FB + Other-Disordered.
**10,000 samples with the same distribution of region lengths as observed for the FB set were taken from each total population of ordered and disordered regions. The ranking of the mean value of the original FB subset in the list of samples is expressed as a percentile, i.e. at the 5th percentile, 5% of the samples are less conserved, less hydrophobic or less positively charged.
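The sampling procedure summarised in the Table 2.7 footnote can be sketched as follows. This is a simplified illustration under stated assumptions, not the original analysis code: regions are represented as lists of per-residue values, each random fragment is cut from a uniformly chosen region that is long enough, and the function names and default seed are hypothetical.

```python
import random


def sample_fragment(regions, length, rng):
    """Draw one contiguous fragment of the given length from a random region."""
    # Assumes at least one region in the population is long enough.
    candidates = [r for r in regions if len(r) >= length]
    region = rng.choice(candidates)
    start = rng.randrange(len(region) - length + 1)
    return region[start:start + length]


def percentile_of_observed(observed_regions, population_regions,
                           n_samples=10000, seed=0):
    """Percentage of random samples whose mean is below the observed mean."""
    rng = random.Random(seed)
    lengths = [len(r) for r in observed_regions]
    observed_values = [v for r in observed_regions for v in r]
    observed_mean = sum(observed_values) / len(observed_values)
    below = 0
    for _ in range(n_samples):
        # One sample reproduces the length distribution of the observed set.
        values = []
        for length in lengths:
            values.extend(sample_fragment(population_regions, length, rng))
        if sum(values) / len(values) < observed_mean:
            below += 1
    return 100.0 * below / n_samples
```

A returned value of 99.9 would correspond to the "99.9 percentile" entries in the table, meaning that almost all random samples have a lower mean than the observed FB set.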
2.4.6 A possible guidance mechanism during FB folding-on-binding
with protein interaction partners
FB regions have high conservation and a slight net positive charge, whereas the contiguous disordered regions have low conservation, a slight net negative charge and high hydrophilicity. Indeed, the Disordered-around-FB regions are more hydrophilic than the Other-Disordered regions. It is interesting that these results parallel analyses of conserved areas in protein-protein interfaces, which tend to be more hydrophobic than non-conserved parts [36].
Our results suggest a possible guidance mechanism for FB regions, wherein highly hydrophilic Disordered-around-FB regions steer the FB region towards the binding site of its interaction partner by lessening the occurrence of off-target interactions, thus facilitating folding-on-binding [38-40]. Such an electrostatic steering mechanism has been shown experimentally and in simulations for the binding of the cell cycle regulator p27 to cyclin A [41, 42]. The positive charge in the FB regions is likely due to the charge character of the binding partners, or to specific functional design. Indeed, fourteen of the FB regions analysed bind DNA/RNA (which are negatively charged), and a further eleven FB regions are nuclear localization signals, which are positively charged for their specific function. In some cases, disordered regions are also observed between FB regions. For example, cancer susceptibility candidate gene 3 protein (CASC3), a core component of the exon junction complex (EJC), exhibits intrinsic disorder within the region (residues 136-283) involved in RNA binding. In this functional segment, a short disordered region is located between the two FB regions [43]. This region could act as a flexible linker to prevent steric hindrance of the structured region. Thus, the polar nature of the Disordered-around-FB regions suggests a significant role in mediating the binding of FB regions to the partner molecule.
We performed enrichment analysis of Gene Ontology molecular function categories using GOrilla [44]. The proteins with FB regions are significantly enriched for nucleic acid binding (GO:0003676, corrected P-value = 0.0074) and DNA binding (GO:0003677, corrected P-value = 0.018; a non-redundant DisProt set was used as the background population), which is consistent with the positive charge of the FB regions. It has been previously shown that the charge in disordered regions correlates with molecular function [44].
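To make the idea of enrichment against a background population concrete, the sketch below computes a plain hypergeometric tail probability, which is a standard model for target-versus-background enrichment. It is an assumption that this resembles the statistic behind the values above; GOrilla's own procedure and its multiple-testing correction are not reproduced, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeometric_tail(population, annotated, sample, hits):
    """P(X >= hits) when `sample` proteins are drawn without replacement from
    a background of `population` proteins, `annotated` of which carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total
```

For example, if every one of 5 target proteins carried a term found in only 5 of 10 background proteins, `hypergeometric_tail(10, 5, 5, 5)` gives 1/252, about 0.004.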
2.5 Concluding remarks
We performed a bioinformatical parsing of folding-on-binding proteins into four distinct region types: Ordered, folding-on-binding (FB), Disordered-around-FB, and Other-Disordered. From a convergence of three separate analyses (treating the sets as fragments, as populations of residues and as samples of fragments from populations), we observe that, compositionally, the Ordered regions segregate as more hydrophobic than the three other region types, but that in terms of conservation the Ordered and FB regions tend to group together, as do the Disordered-around-FB and Other-Disordered regions, although there is still a smaller significant difference between the Ordered and FB sets. We described how our results point towards a possible compositionally based steering mechanism of FB region folding-on-binding. Simulation studies of coupled folding and binding of disordered regions have been shown to improve the understanding of this process [45, 46]. Further experimental and simulation work is therefore required to investigate this hypothesis.
2.6 References
1. Wright , P.E. and H.J. Dyson, Intrinsically unstructured proteins- re-assessing the protein structure-function paradigm. J. Mol. Biol., 1999. 293: p. 321-331. 2. Dunker, A.K., et al., Intrinsic disorder and protein function. Biochemistry, 2002. 41(21): p. 6573-82. 3. Ward, J.J., et al., Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol, 2004. 337(3): p. 635-45. 4. Tompa, P., Intrinsically unstructured proteins. Trends Biochem Sci, 2002. 27(10): p. 527-33. 5. Xie, H., et al., Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. J Proteome Res, 2007. 6(5): p. 1882-98. 6. Peng, K., et al., Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics, 2006. 7: p. 208. 7. Zoran Obradovic, K.P., Slobodan Vucetic, Predrag Radivojac, Celeste J. Brown, and A. Keith Dunker, Predicting intrinsic disorder from amino acid sequence. PROTEINS: Structure, Function, and Genetics, 2003. 8. Jones, D.T. and J.J. Ward, Prediction of disordered regions in proteins from position specific score matrices. Proteins-Structure Function and Bioinformatics, 2003. 53: p. 573-578. 9. Pedro Romero, Z.O., 1¥ Xiaohong Li,1‡ Ethan C. Garner,2† Celeste J. Brown,2 and A. Keith Dunker, Sequence Complexity of Disordered Protein. PROTEINS: Structure, Function, and Genetics, 2001. 42. 10. Radivojac, P., et al., Intrinsic disorder and functional proteomics. Biophys J, 2007. 92(5): p. 1439-56.
72
11. Dosztanyi, Z., et al., Disorder and sequence repeats in hub proteins and their implications for network evolution. Journal of Proteome Research, 2006. 5(11): p. 2985-2995. 12. Haynes, C., et al., Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. Plos Computational Biology, 2006. 2(8): p. 890- 901. 13. Dyson, H.J. and P.E. Wright, Coupling of folding and binding for unstructured proteins. Current Opinion in Structural Biology, 2002. 12(1): p. 54-60. 14. Shoemaker, B.A., J.J. Portman, and P.G. Wolynes, Speeding molecular recognition by using the folding funnel: The fly-casting mechanism. Proceedings of the National Academy of Sciences of the United States of America, 2000. 97(16): p. 8868-+. 15. Uversky, V.N. and A.K. Dunker, Understanding protein non-folding. Biochim Biophys Acta, 2010. 1804(6): p. 1231-64. 16. Wright, P.E. and H.J. Dyson, Linking folding and binding. Curr Opin Struct Biol, 2009. 19(1): p. 31-8. 17. Brown, C.J., et al., Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol, 2002. 55(1): p. 104-10. 18. Vladimir N. Uversky, J.R.G., and Anthony L. Fink, Why are “natively unfolded” proteins unstructured under physiologic conditions. PROTEINS: Structure, Function, and Genetics, 2000. 41: p. 415–427. 19. Sotomayor-Perez, A.C., D. Ladant, and A. Chenal, Disorder-to-Order Transition in the CyaA Toxin RTX Domain: Implications for Toxin Secretion. Toxins, 2015. 7(1): p. 1-20. 20. Forman-Kay, J.D. and T. Mittag, From sequence and forces to structure, function, and evolution of intrinsically disordered proteins. Structure, 2013. 21(9): p. 1492-9. 21. Meszaros, B., et al., Molecular principles of the interactions of disordered proteins. Journal of Molecular Biology, 2007. 372(2): p. 549-561.
22. Brown, C.J., A.K. Johnson, and G.W. Daughdrill, Comparing models of evolution for ordered and disordered proteins. Mol Biol Evol, 2010. 27(3): p. 609-21.
23. Brown, C.J., et al., Evolution and disorder. Curr Opin Struct Biol, 2011. 21(3): p. 441-6.
24. Szalkowski, A.M. and M. Anisimova, Markov models of amino acid substitution to study proteins with intrinsically disordered regions. PLoS One, 2011. 6(5): p. e20488.
25. Fukuchi, S., et al., IDEAL: Intrinsically Disordered proteins with Extensive Annotations and Literature. Nucleic Acids Res, 2012. 40(Database issue): p. D507-11.
26. Fukuchi, S., et al., IDEAL in 2014 illustrates interaction networks composed of intrinsically disordered proteins and their binding partners. Nucleic Acids Res, 2014. 42(Database issue): p. D320-5.
27. Huang, Y., et al., CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 2010. 26(5): p. 680-682.
28. Sickmeier, M., et al., DisProt: the Database of Disordered Proteins. Nucleic Acids Res, 2007. 35(Database issue): p. D786-93.
29. Flicek, P., et al., Ensembl 2014. Nucleic Acids Res, 2014. 42(Database issue): p. D749-55.
30. Edgar, R.C., MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 2004. 5: p. 113.
31. Pei, J. and N.V. Grishin, AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics, 2001. 17(8).
32. Wilkins, M.R., et al., Protein identification and analysis tools in the ExPASy server. Methods Mol Biol, 1999. 112: p. 531-52.
33. Kyte, J. and R.F. Doolittle, A simple method for displaying the hydropathic character of a protein. J Mol Biol, 1982. 157(1): p. 105-32.
34. Gunasekaran, K., C.J. Tsai, and R. Nussinov, Analysis of ordered and disordered protein complexes reveals structural features discriminating between stable and unstable monomers. J Mol Biol, 2004. 341(5): p. 1327-41.
35. Vacic, V., et al., Characterization of molecular recognition features, MoRFs, and their binding partners. J Proteome Res, 2007. 6(6): p. 2351-66.
36. Guharoy, M. and P. Chakrabarti, Conserved residue clusters at protein-protein interfaces and their use in binding site identification. BMC Bioinformatics, 2010. 11: p. 286.
37. Wong, E.T., D. Na, and J. Gsponer, On the importance of polar interactions for complexes containing intrinsically disordered proteins. PLoS Comput Biol, 2013. 9(8): p. e1003192.
38. Kissinger, C.R., et al., Crystal structures of human calcineurin and the human FKBP12-FK506-calcineurin complex. Nature, 1995. 378(6557): p. 641-4.
39. Uversky, V.N., Multitude of binding modes attainable by intrinsically disordered proteins: a portrait gallery of disorder-based complexes. Chem Soc Rev, 2011. 40(3): p. 1623-34.
40. Romero, P., Z. Obradovic, and A.K. Dunker, Sequence Data Analysis for Long Disordered Regions Prediction in the Calcineurin Family. Genome Inform Ser Workshop Genome Inform, 1997. 8: p. 110-124.
41. Ganguly, D., et al., Electrostatically accelerated coupled binding and folding of intrinsically disordered proteins. J Mol Biol, 2012. 422(5): p. 674-84.
42. Ganguly, D., W. Zhang, and J. Chen, Electrostatically accelerated encounter and folding for facile recognition of intrinsically disordered proteins. PLoS Comput Biol, 2013. 9(11): p. e1003363.
43. Nielsen, K.H., et al., Mechanism of ATP turnover inhibition in the EJC. RNA, 2009. 15(1): p. 67-75.
44. Eden, E., et al., GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics, 2009. 10: p. 48.
45. Verkhivker, G.M., et al., Simulating disorder-order transitions in molecular recognition of unstructured proteins: where folding meets binding. Proc Natl Acad Sci U S A, 2003. 100(9): p. 5148-53.
46. Chen, T., J. Song, and H.S. Chan, Theoretical perspectives on nonnative interactions and intrinsic disorder in protein folding and binding. Curr Opin Struct Biol, 2015. 30: p. 32-42.
Acknowledgements
This work was supported by the Natural Sciences and Engineering Research Council of
Canada (NSERC).
2.7 Connecting Text for Chapter 2 to Chapter 3
In Chapter 2, I performed a bioinformatical evolutionary analysis of the different types of intrinsically disordered regions in human FB proteins. FB and DFB regions showed an interesting compositional and conservation trend when compared with other types of disordered regions. I observed that FB regions have a mild positive charge and relatively high conservation. These results motivated me to use similar evolutionary methods to study trends of PTMs in disordered regions relative to ordered regions (PTMs are known to induce disorder-to-order transitions). In Chapter 3, I examined major PTM sites, namely methylation, acetylation and ubiquitination (MAU) sites, in FB regions across 380 eukaryotic species. However, the experimental data available for FB proteins are limited; hence I expanded the study by performing a large-scale conservation analysis of MAU sites in ordered and disordered regions generally.
CHAPTER III
3 Discerning evolutionary trends in post-translational
modification and the effect of intrinsic disorder: Analysis
of methylation, acetylation and ubiquitination sites in
human proteins
A version of this chapter has been submitted as:
Narasumani, M. and Harrison, P. M. Discerning evolutionary trends in post-translational modification and the effect of intrinsic disorder: Analysis of methylation, acetylation and ubiquitination sites in human proteins. Accepted in PLOS Computational Biology
3.1 Abstract
Intrinsically disordered regions (IDRs) of proteins play significant biological functional
roles despite lacking a well-defined 3D structure. For example, IDRs provide efficient
housing for large numbers of post-translational modification (PTM) sites in eukaryotic
proteins. Here, we study the distribution of more than 15,000 experimentally determined
human methylation, acetylation and ubiquitination sites (collectively termed ‘MAU’ sites)
in ordered and disordered regions, and analyse their conservation across 380 eukaryotic
species. Conservation signals for the maintenance and novel emergence of MAU sites
are examined at 11 evolutionary levels from the whole eukaryotic domain down to the
ape superfamily, in both ordered and disordered regions. We discover that MAU PTM is
a major driver of conservation for arginines and lysines in both ordered and disordered regions, across the 11 levels, most significantly across the mammalian clade.
Conservation of human methylatable arginines is very strongly favoured for ordered regions rather than for disordered, whereas methylatable lysines are conserved in either
set of regions, and conservation of acetylatable and ubiquitinatable lysines is favoured in
disordered over ordered. Notably, we find evidence for the emergence of new lysine MAU
sites in disordered regions of proteins in deuterostomes and mammals, and in ordered
regions after the dawn of eutherians. For histones specifically, MAU sites demonstrate an
idiosyncratic significant conservation pattern that is evident since the last common
ancestor of mammals. Similarly, folding-on-binding (FB) regions are highly enriched for
MAU sites relative to either ordered or disordered regions, with ubiquitination sites in FBs
being highly conserved at all evolutionary levels back as far as mammals. This
investigation clearly demonstrates the complex patterns of PTM evolution across the
human proteome and shows that conservation of sequence features must be considered at multiple evolutionary levels to avoid an incomplete or misleading picture.
Keywords: post-translational modification; lysine; arginine; methylation; acetylation; ubiquitination; intrinsically disordered; folding-on-binding; conservation; human; eukaryote
3.2 Introduction
Intrinsically disordered regions (IDRs) in proteins were initially discovered as long
stretches of amino acids in proteins that remain unfolded under physiological conditions
[1, 2]. IDRs can be functional despite this absence of a well-defined three-dimensional
structure, and have caused a re-examination of the protein structure-function paradigm
[1-4]. They are involved in numerous biological functions [2, 4-8] and their improper
functioning leads to various disease conditions [7, 9-11]. Bioinformatical studies have
shown that long (>30 residues) IDRs are common in eukaryotic proteins (33% of them on
average) and occur much less in archaea (2% of proteins) and eubacteria (4%) [12-14].
In addition, Ward et al. reported that long IDRs (>30 residues) in yeast proteins are
associated with transcription regulation and cell signalling [12]. The amino-acid
sequences of IDRs contain compositional bias and low sequence complexity [15]. Many
computational tools have been developed to annotate disordered regions in amino acid
sequences [16-21], facilitating the distinction between ordered and disordered regions.
In many proteins, IDRs exhibit low amino-acid sequence conservation [22] and
tandem repeats are more abundant in IDRs than in ordered regions [23, 24]. Insertions
and deletions are more common in IDRs [25, 26] and they contain more amino acid substitutions than the ordered regions of the same proteins [22]. Furthermore, some disordered regions in proteins show conservation for chemical composition, but not detailed amino-acid sequence conservation [27]. Studies on the evolution of ordered and disordered regions have revealed that disordered regions generally evolve differently from ordered regions, but in some cases similarly to ordered regions [22, 26-31]. Hence,
understanding the evolution of disordered regions in comparison to ordered regions has
been challenging.
IDRs are involved in protein-protein interaction [11], including binding to kinases
[32], transcription factors [33], and translation inhibitors [34], and they also mediate
interaction with nucleic acids [33, 35]. Numerous receptors and enzymes with disordered
regions acquire structure when binding to a partner molecule [4, 36-38]. Proteins with
such folding-on-binding (FB) regions exhibit high specificity and low affinity towards a
partner molecule [1, 39]. Compared to other disordered regions, they are enriched in
hydrophobic residues, and positively charged amino acids [40] and are more conserved
[31]. Indeed, post-translational modifications (PTMs) can induce their disorder-to-order
transitions, and the conformational flexibility of IDRs provides sites for many PTMs per
amino-acid residue [41, 42]. Furthermore, PTMs in disordered regions have a significant
role in signaling and regulation [42]. Experimental and computational studies suggest that
PTMs including methylation and ubiquitination are enriched within IDRs [6, 7, 42-45],
whereas analysis of acetylation has shown contradictory results [46]. Furthermore, the
phosphorylation sites present in disordered regions have been suggested to facilitate the
evolution of transcriptional regulation [45, 47, 48]. Methylation, Acetylation, and
Ubiquitination (abbreviated here collectively as ‘MAU’) are the three major PTMs, next to phosphorylation and glycosylation, which regulate the function of many eukaryotic proteins. Crosstalk between MAU sites facilitates complex regulatory programs in both histone and non-histone proteins [49]. However, the evolution of MAU sites in IDRs across eukaryotic species is not well understood [50-53]. Therefore, a comparative study on the conservation of human MAU sites in ordered and disordered regions will illuminate
their importance across the eukaryotic domain. Analysis of conservation across a large panel of genome-sequenced eukaryotes can give us more comprehensive insights into the evolutionary history of PTMs [45, 47, 48], while avoiding issues of data set completeness that may be a problem for experimental analysis of a variety of multicellular species.
We have performed a large-scale analysis of >15,000 experimentally-verified MAU sites from the ordered and disordered regions of >7,000 human proteins. We compiled four such data sets for both ordered and disordered regions: (i) methylated arginines, (ii) methylated lysines, (iii) acetylated lysines and (iv) ubiquitinated lysines. We studied the distribution and conservation of MAU sites in ordered and disordered regions across 380 eukaryotic organisms. Conservation signals for the maintenance and novel emergence of MAU sites were analysed at 11 evolutionary levels from the whole eukaryotic domain down to the level of the ape superfamily. We observed significant conservation attributable to lysine and arginine PTMs in both ordered and disordered regions across the 11 levels, and also some signals for the novel emergence of new MAU sites.
Furthermore, we have pinpointed trends for biologically important subsets of IDRs, such as FB regions and prion-like domains. For example, we observed that MAU and other
PTM sites are highly enriched in FB regions relative to both ordered and disordered regions generally and at evolutionary depths back as far as the emergence of the mammal class.
3.3 Methods
3.3.1 PTM Datasets
Human proteins with experimentally-verified PTM sites were retrieved from dbPTM
[54], PHOSIDA [55] and PhosphoSitePlus [56] databases as of November 2015. We
focused on the evolutionary behaviour of Methylation, Acetylation and Ubiquitination sites
(‘MAU sites’). Redundant annotations for PTMs were removed. This resulted in 1,009
lysine and 1,676 arginine methylation sites, 10,044 acetylation sites and 14,396
ubiquitination sites. We also comparatively analysed the distribution of serine, threonine
and tyrosine phosphorylation sites, and other rarer PTMs (but not their evolutionary
conservation).
3.3.2 Eukaryotic proteomes
Complete proteomes of 380 eukaryotic organisms were downloaded from
ENSEMBL [57], UniProt [58] and NCBI RefSeq [59] databases. The organisms were separated into eleven nested taxonomic levels that progressively narrow the focus towards human: eukaryotes, metazoans, deuterostomes, chordates, vertebrates, tetrapods, mammals, eutherians, supraprimates, primates, and apes. Human proteins with experimentally-verified FB regions were obtained from the IDEAL database [60].
3.3.3 Sequence analysis
Phylogenetic trees of the eukaryotic organisms were drawn with Evolview [60]
using Newick-format files generated by phyloT (https://phylot.biobyte.de/) [61]. Human
orthologs in eukaryotic organisms were identified using the reciprocal best hit method
with BLASTP and e-value threshold <1e-04 [62]. Multiple sequence alignment of human
proteins with MAU sites and their orthologs in the 380 organisms was performed using
Clustal Omega [63]. For the evolutionary analysis of each clade, human proteins with an
orthologue in at least one of the organisms in that clade were included, and human proteins
with no orthologue in any organism of the clade were discarded. We used ZORRO, a
probabilistic masking program to evaluate the alignment quality of individual positions
[64]. In doing this, the aligned positions with low ZORRO score were discarded, and the
positions within the recommended score range of five to ten were retained for
conservation analysis. For comparison, the alignment program KMAD was also applied
in some cases [65].
Enrichment analysis of gene ontology (GO) molecular function categories was performed
using the GOrilla tool to identify GO terms enriched in different clades [66].
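To illustrate the reciprocal best hit criterion used for orthologue identification in this section, the following Python sketch pairs a human protein with a protein from another organism only when each is the other's best BLASTP hit below the e-value threshold. It is a minimal illustration and not the script used for this chapter: the tuple layout, the function names, the use of bit score to rank hits, and the example identifiers are all assumptions made for the example.

```python
E_VALUE_CUTOFF = 1e-4  # e-value threshold stated in the Methods


def best_hits(rows, max_evalue=E_VALUE_CUTOFF):
    """Return {query: best subject} for hits below the e-value cutoff.

    rows: iterable of (query, subject, evalue, bitscore) tuples, assumed to
    have been parsed beforehand from BLASTP tabular output (hypothetical input).
    Ranking hits by bit score is an assumption of this sketch.
    """
    best = {}
    for query, subject, evalue, bitscore in rows:
        if evalue >= max_evalue:
            continue  # hit does not pass the threshold
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Keep pairs where each protein is the best hit of the other."""
    forward = best_hits(forward_rows)   # human -> other organism
    reverse = best_hits(reverse_rows)   # other organism -> human
    return {h: o for h, o in forward.items() if reverse.get(o) == h}


# Hypothetical example data (identifiers are invented)
forward = [
    ("HUMAN_A", "MOUSE_A", 1e-50, 200.0),
    ("HUMAN_A", "MOUSE_B", 1e-20, 90.0),
    ("HUMAN_B", "MOUSE_B", 1e-30, 120.0),
    ("HUMAN_C", "MOUSE_C", 1e-3, 30.0),   # fails the e-value threshold
]
reverse = [
    ("MOUSE_A", "HUMAN_A", 1e-48, 195.0),
    ("MOUSE_B", "HUMAN_A", 1e-20, 90.0),
    ("MOUSE_B", "HUMAN_B", 1e-30, 120.0),
    ("MOUSE_C", "HUMAN_C", 1e-3, 30.0),
]
orthologs = reciprocal_best_hits(forward, reverse)
```

In this toy example, HUMAN_A and HUMAN_B each receive an orthologue, while HUMAN_C is left without one because its only hit does not pass the e-value threshold.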
3.3.4 Identification of ordered and disordered regions in proteins
We ran BLASTP (version 2.2.28) [62] against the ASTRAL non-redundant protein domain
database (95% identity threshold) [67]. We used the PDB atom records of proteins from the
ASTRAL domain database to identify the experimentally validated positions of ordered
regions in human proteins. Disordered regions in human proteins were annotated with
DISOPRED and IUPred per-residue prediction scores, using default parameters [18, 19].
Since ASTRAL domains are experimentally validated structures, regions covered by
ASTRAL BLAST hits were treated as ordered even where they were also predicted to be
disordered. To keep the analysis and presentation of results manageable, regions left
unclassified by this procedure were not analysed.
Human prion-like proteins were annotated as those with disordered regions that have a
compositional bias for asparagine or glutamine residues, identified using the fLPS program [68] run with default parameters except for a binomial P-value threshold of ≤1e-10, as used in previous studies
[69-71].
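The precedence rule described in this section (structural evidence from ASTRAL overrides a disorder prediction) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the one-letter labels, and the representation of the disorder predictions as a precomputed list of booleans are hypothetical, and the way DISOPRED and IUPred scores are combined into that list is deliberately left outside the sketch.

```python
def label_residues(predicted_disordered, astral_hits):
    """Label each residue as 'O' (ordered), 'D' (disordered) or '-' (unclassified).

    predicted_disordered: list of bool, one per residue (hypothetical input that
        summarises the per-residue disorder predictions).
    astral_hits: list of (start, end) residue ranges, 1-based and inclusive,
        covered by ASTRAL BLAST hits (hypothetical input).
    """
    length = len(predicted_disordered)
    labels = ["D" if flag else "-" for flag in predicted_disordered]
    for start, end in astral_hits:
        # Structural evidence takes precedence over a disorder prediction
        for i in range(max(start, 1) - 1, min(end, length)):
            labels[i] = "O"
    return "".join(labels)


# Ten-residue toy protein: residues 1-3 and 8-10 predicted disordered,
# residues 3-6 covered by an ASTRAL hit
predicted = [True] * 3 + [False] * 4 + [True] * 3
labels = label_residues(predicted, [(3, 6)])
```

Here residue 3 is labelled ordered despite its disorder prediction, and residue 7, which has neither structural coverage nor a disorder prediction, stays unclassified and would be excluded from the analysis.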
3.3.5 Conservation & statistical analysis
A Python script was written to find the conserved MAU sites in ordered and
disordered regions by calculating the completely conserved lysine/arginine residues in
the multiple sequence alignment at each clade. Newly-emerged conserved residues are
those that are completely conserved in a clade but not across a more ancient, wider clade.
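The two definitions above can be sketched in Python as follows. This is an illustrative reconstruction, not the script used in this chapter: the function names, the dictionary representation of the alignment, the toy sequences and the assumption that the wider clade contains all members of the narrower one are ours.

```python
def conserved_columns(alignment, residues=("K",)):
    """Return the 0-based alignment columns where every sequence carries one
    of the allowed residues. Gaps ('-') therefore break conservation.

    alignment: dict mapping organism name -> aligned sequence (equal lengths).
    residues: allowed residue types; ("K", "R") would permit K/R exchange.
    """
    sequences = list(alignment.values())
    if not sequences:
        return set()
    allowed = set(residues)
    return {
        i for i, column in enumerate(zip(*sequences))
        if all(residue in allowed for residue in column)
    }


def newly_emerged(alignment, clade, wider_clade, residues=("K",)):
    """Columns completely conserved within `clade` but not across `wider_clade`."""
    inner = conserved_columns({n: alignment[n] for n in clade}, residues)
    outer = conserved_columns({n: alignment[n] for n in wider_clade}, residues)
    return inner - outer


# Toy alignment (sequences are invented)
alignment = {
    "human": "MKRAK",
    "chimp": "MKRAK",
    "mouse": "MKQAK",
    "fish":  "MRQ-K",
}
mammals = ["human", "chimp", "mouse"]
vertebrates = ["human", "chimp", "mouse", "fish"]

conserved_k_in_mammals = conserved_columns({n: alignment[n] for n in mammals})
new_k_in_mammals = newly_emerged(alignment, mammals, vertebrates)
```

In the toy alignment, the lysine at column 4 is conserved across all four organisms, whereas the lysine at column 1 is conserved only within mammals and is therefore counted as newly emerged at the mammalian level.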
To test the significance of conservation, we performed enrichment analysis of the
conserved MAU-site residues at each evolutionary level as subsets of the total sets of
conserved residues of the same type, with appropriate corrections for multiple
hypotheses.
Hypergeometric probability tests were used to find these enrichments of MAU-site
residues in ordered and disordered regions for the different evolutionary levels. A
Bonferroni correction for multiple hypothesis testing was applied for all tests for a given
background population. The details of the enrichment calculations are given in the legends for Figures S3.5-S3.15. All enrichment and statistical analyses were performed using the R language [72].
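As an illustration of the kind of test described above, the following sketch computes a one-sided hypergeometric enrichment P-value and a Bonferroni-corrected significance threshold. The analyses in this chapter were performed in R; Python's standard library is used here purely for illustration, and the function names, the argument names and the mapping of the arguments onto residue sets are assumptions of the sketch rather than the exact calculations given in the supplementary figure legends.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, observed):
    """One-sided P-value P(X >= observed) for a hypergeometric variable X.

    population: size of the background set (e.g. all residues of one type)
    successes:  number of background residues that are MAU sites
    draws:      size of the subset tested (e.g. the conserved residues)
    observed:   number of MAU sites found in that subset
    (This mapping onto residue sets is an assumption of the sketch.)
    """
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / comb(population, draws)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Toy numbers: 20 residues, 5 of them modified; 4 conserved, 2 of those modified
p_value = hypergeom_enrichment_p(population=20, successes=5, draws=4, observed=2)
threshold = bonferroni_threshold(alpha=0.05, n_tests=7)
is_significant = p_value <= threshold
```

With these toy numbers the P-value is 1205/4845, far above the corrected threshold, so the enrichment would not be called significant. The values of alpha and the number of tests are example inputs only.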
3.4 Results and Discussion
First, we overview the distribution of methylation, acetylation and ubiquitination
(MAU) sites in ordered and disordered regions, and include some specific analysis and discussion of MAU sites in folding-on-binding (FB) regions, prion-like proteins and homopeptides (which are common features of disordered regions [73]).
Then, we examine the effect of MAU sites on the evolutionary behaviour of lysine and arginine residues. To what extent do MAU sites drive the conservation of these residues and the appearance of new conserved residues at different points in eukaryotic evolution? Is there evidence for the appearance of new conserved lysines in evolutionarily old proteins because of MAU site status?
These questions are examined for each of methylation, acetylation and ubiquitination separately in turn. In doing so, we also consider the effects of: (i) allowing mutation to other possible residue types for the same modification (e.g., allowing mutation between arginine and lysine for methylation); (ii) alignment quality on the results (through applying the program ZORRO, as described in Methods); (iii) removal of histones (which are known to have high levels of MAU).
The evolution of MAU sites is also specifically examined for histones, and for folding-on-binding proteins as subsets. Finally, we briefly consider the evolutionary behaviour of sites that are ‘multiple-MAU’ (i.e., that can have more than one different type of MAU modification).
3.4.1 Distribution of MAU sites in ordered and disordered regions
The MAU site contents in the ordered and disordered regions are summarized in
Figure 3.1A. Specific lysine residues can be sites for multiple PTMs, including MAU
(Figure 3.1B). For MAU sites in ordered and disordered regions, the observed overlap between acetylation and ubiquitination sites correlates with an established regulatory relationship [74], and it is also interesting to note the high proportion of methylation sites
(~51%) specifically in ordered regions that have other PTMs, in comparison to any other
MAU in either ordered or disordered regions (Figure 3.1B).
Figure 3.1 Overview of the number of methylation, acetylation and ubiquitination
sites and the coincidence of different MAU types at the same residues in ordered
and disordered regions
(A) Total number of MAU sites in ordered and disordered regions of 7,160 human proteins, showing that there are more MAU sites in disordered regions than in ordered regions and that ubiquitination sites show a preference for ordered regions. (B) Venn diagrams illustrating the coincidence of MAU types (i.e., how many residues can carry two or three different MAU modifications) in ordered and disordered regions.
In general, PTM sites have been reported to be abundant in the disordered regions of eukaryotic proteins [7, 75]. However, not all PTMs show a preference for disordered
regions. We examined the distribution in ordered and disordered regions of human
proteins of experimentally-verified MAU sites, along with phosphorylation sites for
comparison (as listed in Methods).
We observe that acetylation and ubiquitination sites and methylated lysine sites
generally have a significant preference for ordered regions (Figure 3.2). It is known that
lysine methylation in disordered regions blocks site-specific lysine ubiquitination to
increase protein half-life [76]. This may contribute to the relative abundance of
ubiquitination sites in ordered regions. In comparison, phosphorylation sites prefer
disordered regions, as expected [7, 75] (Figure 3.2).
Figure 3.2 Percentage distribution of MAU and phosphorylation sites in ordered and disordered regions of human proteins
Percentages of MAU and phosphorylation sites (out of the total number of residues of the same type) in ordered and disordered regions of the human proteins analysed. The total number of each type of site present in ordered (olive green) and disordered (peach) regions is given in the centre of each bar. The hypergeometric distribution was used to test for enrichment of MAU-modified residues in ordered or disordered regions among all lysines/arginines present in both ordered and disordered regions, with the total set of MAU sites as the background population. A diamond symbol on top of a bar indicates significant enrichment of PTMs in ordered or disordered regions at the corrected P-value (0.0071), and NS indicates non-significant enrichment.
Previous studies have suggested that MAU sites are enriched in disordered regions [6, 7, 42-44] and that acetylated lysines have no preference for either ordered or disordered regions [46]. In contrast, our analysis here shows that MAU lysines are significantly enriched in ordered regions relative to disordered ones (Figure 3.2), whereas the opposite is true for phosphorylation sites (Figure 3.2).
3.4.2 FB regions as display areas for PTMs
FB regions in proteins are known to interact with multiple and diverse partners [1,
39], and are associated with PTMs [41, 42]. Previously, we found that FB regions are more conserved than contiguous disordered regions that are not known to exhibit disorder-to-order transition [31]. We have analysed the enrichment of MAU sites and other PTMs in FB regions (in 172 human proteins, data taken from the IDEAL database
[77]). Phosphorylation sites are highest in number in FB regions, followed by MAU sites
(Figure 3.3A).
[Figure 3.3A: bar chart. Y-axis: Number of residues in FB regions. Legend: No. of residues (K/R) in FB regions; No. of modified residues (K/R) in FB regions.]