University of South Florida Scholar Commons

Graduate Theses and Dissertations Graduate School

6-27-2016 Intrinsic Disorder Where You Least Expect It: The Incidence and Functional Relevance of Intrinsic Disorder in Enzymes and the Protein Data Bank Shelly Deforte University of South Florida, [email protected]


Scholar Commons Citation Deforte, Shelly, "Intrinsic Disorder Where You Least Expect It: The Incidence and Functional Relevance of Intrinsic Disorder in Enzymes and the Protein Data Bank" (2016). Graduate Theses and Dissertations. http://scholarcommons.usf.edu/etd/6219


Intrinsic Disorder Where You Least Expect It: The Incidence and Functional Relevance of

Intrinsic Disorder in Enzymes and the Protein Data Bank

by

Shelly DeForte

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Molecular Medicine
College of Medicine
University of South Florida

Major Professor: Vladimir Uversky, Ph.D.
Yu Chen, Ph.D.
Robert Deschenes, Ph.D.
Sandy Westerheide, Ph.D.
Bin Xue, Ph.D.

Date of Approval: June 14, 2016

Keywords: intrinsically disordered protein, x-ray crystallography, structural biology, function

Copyright © 2016, Shelly DeForte

Table of Contents

List of Tables

List of Figures

Abstract

1. Introduction to Intrinsically Disordered Proteins
    1.1 The Dominant Paradigms in Protein Science
    1.2 Defining Intrinsically Disordered Proteins
    1.3 The Subtler Side of Disorder
    1.4 The Line Between Order and Disorder
    1.5 The Mechanisms of Disorder
        1.5.1 Entropy
        1.5.2 Accessibility
        1.5.3 Plasticity
    1.6 Biological Functions
        1.6.1 Signaling
        1.6.2 Regulation
    1.7 Disorder and Protein Evolution
    1.8 The Tools of the Un-Structural Biologist
        1.8.1 Experimental techniques
            1.8.1.1 X-Ray crystallography
            1.8.1.2 Nuclear Magnetic Resonance
            1.8.1.3 Combining experimental techniques
        1.8.2 Bioinformatics analysis
            1.8.2.1 Sequence characteristics
            1.8.2.2 Disorder prediction
            1.8.2.3 Classification of function
            1.8.2.4 Proteome level studies
    1.9 Protein Intrinsic Disorder and Disease
    1.10 Protein Intrinsic Disorder and Drug Design and Discovery
        1.10.1 The story of PTP1B
    1.11 The Field of Protein Intrinsic Disorder
    1.12 Intrinsic Disorder Where You Least Expect It

2. Intrinsic Disorder in the Protein Data Bank
    2.1 Background
        2.1.1 The Protein Data Bank
        2.1.2 Missing regions in the PDB
        2.1.3 B-factors
        2.1.4 Missing regions and the development of disorder prediction
        2.1.5 Previous studies
    2.2 Results
        2.2.1 A new method for the characterization of missing regions
        2.2.2 Ambiguous regions have greater secondary structure variation
        2.2.3 Different types of missing regions have distinct characteristics
        2.2.4 Disorder prediction correlates with missing residue conservation
        2.2.5 Static disorder and wobbly domains are rare in the PDB
    2.3 Conclusions

3. Intrinsic Disorder in Enzymes
    3.1 Background
        3.1.1 Intrinsically disordered enzymes in the literature
        3.1.2 Previous studies
    3.2 Results
        3.2.1 Experimental design
        3.2.2 Enzymes and non-enzymes have a similar incidence of IDPRs
        3.2.3 Enzymes and non-enzymes have IDPRs of similar lengths
        3.2.4 Disorder is specific to enzyme length and type
        3.2.5 Disorder increases with organismic complexity
        3.2.6 Eukaryotic proteins in the PDB are highly truncated
        3.2.7 Long IDPRs in enzymes are associated with specific functions
        3.2.8 Promiscuity is not correlated with disorder in enzymes
    3.3 Conclusions

4. Materials and Methods
    4.1 PubMed Data and Analysis
        4.1.1 IDP terminology in PubMed
    4.2 PDB Data and Analysis
        4.2.1 Parsing and preparation of the PDB dataset
        4.2.2 The assignment of missing residues from PDB files
        4.2.3 The creation of PDB composite data in Python
        4.2.4 Amino acid composition
        4.2.5 Disorder, binding, and MoRF predictions
    4.3 Reference Proteomes Data and Analysis
        4.3.1 Parsing and preparation of the Reference QFO dataset
        4.3.2 Enzyme Commission (EC) designations
        4.3.3 Disorder prediction
        4.3.4 Disorder analysis
            4.3.4.1 Disorder calculations
            4.3.4.2 Expectation values
            4.3.4.3 Transmembrane domains
        4.3.5 Ontology enrichment

References

Appendix A: Glossary

Appendix B: IDP Search Terms and PubMed IDs

Appendix C: IDP Search Terms, Search Results, and Disorder Scores

Appendix D: Intrinsically Disordered Enzymes

Appendix E: Copyright Permissions


List of Tables

Table 1 Secondary structure abbreviations.

Table 2 Characterization of the datasets analyzed in this study.

Table 3 Disorder content and content of disorder-based binding sites in the datasets analyzed in this study.

Table 4 Agreement between disorder predictors.

Table 5 Agreement between MoRF/ predictors.

Table 6 Selected enzymes with experimental evidence of functionally relevant regions of protein intrinsic disorder.

Table 7 The number of proteins in the reference proteome dataset assigned to each EC number, organized by taxon and enzyme type (or non-enzyme).

Table 8 The number of proteins assigned to each EC number in the PDB dataset, organized by taxon and enzyme type (or non-enzyme), for the set of proteins with at least one continuous missing (CM) region of three residues in length.

Table 9 Matthews correlation coefficient comparison for each disorder predictor and a consensus.

Table B1 IDP search terms and PubMed IDs.

Table C1 IDP Search Terms, Search Results, and Disorder Scores.

Table D1 Intrinsically disordered enzymes.


List of Figures

Figure 1 The usage of IDP terminology in PubMed abstracts.

Figure 2 The fraction of PubMed IDs using IDP terminology by year.

Figure 3 The fraction of predicted disorder versus the fraction of PubMed IDs that use IDP language.

Figure 4 Entropic chain functions.

Figure 5 Experimental and bioinformatics techniques work together to describe the properties of disorder in proteomes and proteins.

Figure 6 Amino acid scales and disorder and order promoting residues.

Figure 7 A schematic representation of the secretion of adenylate cyclase toxin through the type 1 secretion system.

Figure 8 A schematic representation of the effects of caffeine on the aggregation properties of alpha-synuclein.

Figure 9 A representative ensemble of 100 conformers for PTP1B.

Figure 10 The number of papers per author for the search term in PubMed, plotted against the fraction of those papers that use IDP terminology.

Figure 11 Missing regions in X-ray crystal structures can have many causes.

Figure 12 The classification scheme for PDB sequence regions used in this study.

Figure 13 The distribution of protein lengths for the PDB chains used in this study compared to the distribution of lengths for the corresponding UniProt entries.

Figure 14 Analysis of secondary structure in observed vs. missing regions.

Figure 15 Analysis of sequence and PDB file characteristics sorted by missing region type.

Figure 16 The amino acid composition of missing regions relative to the observed residues.

Figure 17 The fraction of the set of each missing region type vs. the fraction of predicted disorder for a given missing region.

Figure 18 An analysis of possible static disorder.

Figure 19 The activation of by calmodulin through a disorder-order transition.

Figure 20 A schematic of kinetic regulation by an order-disorder transition in

Figure 21 The intrinsically disordered C-terminal region of the reductase VIMP contains a selenocysteine that is critical for

Figure 22 The N-terminal region of MsrB1 samples a wide range of dynamic conformations, and contains a resolving cysteine.

Figure 23 The disordered C-terminal region of Ube2w helps bind diverse substrates.

Figure 24 RNase E forms a flexible scaffold for protein interactions.

Figure 25 The disordered C-terminus of NEIL1 allows it to engage in multiple molecular interactions.

Figure 26 The composition of the reference proteome dataset (top) and the PDB dataset (bottom).

Figure 27 Predicted intrinsic disorder calculations for eukaryotes, prokaryotes, and archaea in reference proteomes.

Figure 28 Missing region calculations for eukaryotes, prokaryotes, and archaea in the PDB dataset.

Figure 29 Distribution of protein lengths and mean fraction of disorder by protein length in reference proteomes.

Figure 30 Length distribution of the characterized portion of eukaryotes in the PDB overlaid on top of the length distribution of full length prokaryotes.

Figure 31 Length distribution of the PDB dataset divided into eukaryotes (left) and prokaryotes (right).

Figure 32 Gene Ontology (GO) term enrichment in human enzymes, categorized by longest CD length per protein.

Figure 33 The distribution of disorder scores in the human proteome for seven disorder predictors and their consensus scores.


Abstract

Intrinsically disordered proteins (IDPs) and intrinsically disordered protein regions (IDPRs) exist as interconverting conformational ensembles, without a single fixed three-dimensional structure in vivo. The focus in the literature up to this point has been primarily on IDPs that are mostly or entirely disordered. Therefore, we have an incomplete understanding of the incidence and functional relevance of IDPRs in proteins that have regions of both order and disorder. This work explores these populations by examining IDPRs in the Protein Data Bank (PDB) and in enzymes. By applying disorder prediction methods combined with an analysis of missing regions in crystal structure data, this work shows that enzymes have a similar incidence and length of IDPRs as do non-enzymes, and that these IDPRs are correlated with functions related to macromolecular metabolism, signaling, and regulation. Furthermore, extensive analyses of missing regions with conflicting information between multiple structures in the PDB show that, rather than experimental artifacts, this ambiguity most likely arises due to partially or conditionally disordered regions. This work documents the first proteome level study of protein intrinsic disorder in enzyme populations and demonstrates a novel way of analyzing missing regions in the PDB. Furthermore, an extensive literature search as part of this work provides information for 1127 IDPs with experimental evidence documented in the literature, 96 of which are enzymes. The results contained herein present a new model of the protein universe, where disorder is directed by evolution in both non-enzymes and enzymes to make the most of limited proteomes in complex organisms through complicated signaling networks and tightly controlled regulation.


1. Introduction to Intrinsically Disordered Proteins

Note to reader

Portions of this chapter have been previously published in RSC Advances, 2016, 6(14):11513-11521, and have been reproduced with permission from the Royal Society of Chemistry.

1.1 The Dominant Paradigms in Protein Science

The dominant paradigms in protein science were, in many ways, shaped by the earliest experiments in the field. Those early experiments were constrained by many of the same limitations we have today. Can this protein be purified? Does it have consistency and simplicity in its function? Ultimately, the measures of success in the dominant methods of experimentation can direct scientific thought regarding what is most important in the study of proteins. As an example, early protein studies in the nineteenth century revolved around the easily obtained and easily crystallized protein hemoglobin. Myosin, also easily available and identified around the same time as hemoglobin, was largely ignored because it was not crystalline. Because of this, it would be another 100 years before we understood myosin at even the most basic level [1].

Jacob Berzelius coined the term “catalysis” in 1836 [2], amidst intense interest in enzymes, which at that point had not yet been shown to be proteins. Emil Fischer demonstrated enzyme specificity and established his seminal lock and key model in 1894 [3], a model that still dominates our understanding of enzymes today. Interestingly, Fischer also proposed that proteins would prove to have a maximum size of 4000 amino acids in length [1], demonstrating the intuitive, and incorrect, speculations that follow from a strict adherence to the lock and key model. Experimental confirmation of the structure-function relationship continued with early X-ray crystal studies of enzymes such as lysozyme [4] and ribonuclease-S [5], followed by Anfinsen’s demonstration that RNase A could be renatured with an accompanying restoration of function [6]. It is undeniable that the success of these early experiments helped to shape dominant ideas of well-behaved protein behavior, where well-behaved was synonymous with well-structured, with one singular function, and one mechanism of action.

In many ways, these early experiments represented canonical examples of how proteins should be, which all further experiments were then held against. Therefore, protein behaviors that ran counter to the expected results were considered anomalies. That a protein could take an extended form with minimal residual structure was well understood due to numerous denaturation experiments. However, it was assumed that the native and functional state of a protein must have a stable structure. Therefore, results that ran counter to this assumption were typically considered a problem with the experiment or the experimenter, and not a result of the intrinsic properties of the protein. As these anomalies accumulated to the point where they could no longer be ignored, these problem proteins and problem regions were often seen as functionally irrelevant, and in many cases, removed before experimentation.

1.2 Defining Intrinsically Disordered Proteins

Even while the structure-function paradigm was strengthened in protein science, examples of intrinsically disordered proteins (IDPs) regularly appeared in the literature. Prompted by the increased application of optical rotatory dispersion in the 1950s and 1960s to the investigation of protein structure, Jirgensons suggested a classification scheme that included a category called “disordered” [7]. By this time phosvitin [7], casein [8], and histones [9] had been shown to have unusual structural properties. In 1971, it was proposed that two regions of missing electron density in the X-ray crystal structure of staphylococcal nuclease were “disordered” [10] as well. However, as more IDPs were uncovered, a wider variety of terms were applied to describe the phenomenon. Tau was initially referred to as “natively denatured” [11], while alpha-synuclein was called “natively unfolded” [12]. Early reviews and theoretical work in the field used various terms as well, such as “intrinsically unstructured” [13], “natively disordered” [14] and “loopy” [15], among others. Vague terms such as “flexible” [16] and “mobile” [17] have a long history of hiding in the literature as well. In fact, until about 2005, the four most common terms, “intrinsically disordered,” “intrinsically unstructured,” “natively disordered,” and “natively unfolded,” were all used about equally in the literature (Figure 1) [18]. However, after that time, due in part to a concerted effort in the field to use consistent terminology, the term “intrinsically disordered” became the predominant and agreed-upon term.

The field of un-structural biology has arisen to try to explain all cases of proteins that fall outside of the structure-function paradigm, and it is necessarily broad in scope because of this. Therefore, just as there have been challenges in reaching a consensus on terminology, there have also been similar challenges in defining protein intrinsic disorder. However, despite these challenges, several common definitions have emerged.

Different definitions of IDPs emphasize different experimental and theoretical perspectives. An IDP may be described as having little or no ordered secondary or tertiary structure. This definition emphasizes an IDP’s difference from proteins as understood by the tools of structural biology. An IDP may also be described as under-folded, or as failing to fold independently. This definition emphasizes that an IDP may have the same physical properties as the unfolded state or as a folding intermediate of an ordered protein, such as random coil, molten globule, or pre-molten globule. Finally, it has become increasingly common to focus on the ensemble of IDPs when defining them. This places IDPs in the context of behavior that can increasingly be measured by NMR. The properties of IDPs are typically described as being present in vivo, in vitro, or under functional conditions. This is to emphasize that IDPs display their structural properties in a functional, native state.

Figure 1. The usage of IDP terminology in PubMed abstracts. The occurrence of each IDP term was counted for each year in abstracts of articles in PubMed associated with 1127 known IDPs.
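As an illustration only, the per-year counting summarized in Figure 1 can be sketched in Python. This is not the code used for this work: the record layout (a year paired with abstract text), the function name `count_terms_by_year`, and the choice to count each abstract at most once per term are all assumptions made for the sketch.

```python
from collections import Counter, defaultdict

# The four terms compared in Figure 1.
IDP_TERMS = (
    "intrinsically disordered",
    "intrinsically unstructured",
    "natively disordered",
    "natively unfolded",
)


def count_terms_by_year(records):
    """Count, per year, how many abstracts mention each IDP term.

    `records` is an iterable of (year, abstract_text) pairs. This layout is
    hypothetical and stands in for whatever a PubMed export provides.
    Matching is case-insensitive, and an abstract contributes at most one
    count per term regardless of how often the term is repeated.
    """
    counts = defaultdict(Counter)
    for year, abstract in records:
        text = abstract.lower()
        for term in IDP_TERMS:
            if term in text:
                counts[year][term] += 1
    return counts


# Toy abstracts, invented for illustration; they are not real PubMed records.
toy_records = [
    (2001, "A natively unfolded protein binds its partner."),
    (2001, "This Natively Unfolded region is protease sensitive."),
    (2008, "An intrinsically disordered region, formerly called natively unfolded."),
]
toy_counts = count_terms_by_year(toy_records)
```

A simple substring match is used here for brevity; hyphenated or line-wrapped variants of a term would need additional normalization.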

It is interesting to note that none of these definitions places disordered proteins in a single opposite position from ordered proteins; instead, they place ordered and disordered proteins at different points on a continuum. This is clearer when we understand that both ordered and disordered proteins have movement at the atomic level and about their Ramachandran angles. However, in ordered proteins this motion is sufficiently small that a consensus position can be inferred. On the other hand, an intrinsically disordered protein has movement that precludes the collapse into a single point, both at the ensemble and individual protein level.

From these definitions, one can extrapolate a theoretical definition based on theories of protein folding. If a protein folds into its lowest energy conformation, then we can define an IDP as a protein that does not have a single global minimum in conformational space [19]. Alternatively, IDPs can be described as having a relatively flat free energy surface [20]. Unfortunately, this theoretical definition cannot currently be characterized experimentally for structured or disordered proteins.

Finally, it is necessary to distinguish proteins that are mostly or fully disordered from proteins with isolated regions of disorder. The term IDP is used to refer to proteins that are fully disordered, or contain long, defining regions of disorder. In contrast, when a protein is mostly structured but displays some regions of disorder, it is said to have intrinsically disordered protein regions (IDPRs). Proteins that contain a mix of ordered and disordered regions are also called hybrid proteins. In this work, I will attempt to distinguish IDPs from proteins with IDPRs; however, for brevity, the term IDP(s) will be used when referring to a set of proteins with varying levels of disorder, or when the disordered properties of a particular protein are being emphasized.

1.3 The Subtler Side of Disorder

In much the same way that protein science has been shaped by the dominant experimental methods, so too has the field which specializes in studying protein intrinsic disorder. Early measurements of protein intrinsic disorder were obtained by low-resolution methods such as optical rotatory dispersion and circular dichroism. These methods cannot measure individual regions of disorder, but only the structural properties over the whole protein. X-ray crystallography can indicate small regions of possible disorder by their absence in the resolved three-dimensional structure, but cannot establish the cause. Therefore, the early emphasis in the field was on proteins that are mostly or fully disordered in vitro or in vivo, such as myelin basic protein [21], alpha-synuclein [12], MAP2 [22], and tau [11].

Aside from the experimental challenges, there are additional reasons why the IDP literature has remained focused on highly and consistently disordered proteins. Acceptance of the relevance, and even the existence, of IDPs and IDPRs has not yet solidified in the literature. Discussions of IDPs in textbooks are still largely absent, with a few exceptions in the last five years [23]. The citation aggregator PubMed did not add “intrinsically disordered proteins” to its MeSH (Medical Subject Headings) terms until 2014. The proportion of papers in the body of literature covering IDPs that actually use IDP terminology is still a fraction of a percent (Figure 2).

Figure 2. The fraction of PubMed IDs using IDP terminology by year. The fraction for each year is calculated by the number of PubMed IDs associated with IDPs that use IDP language, divided by the total number of PMIDs associated with the IDP proteins in the set. High confidence IDPs are those that have an extensive amount of experimental evidence verifying that the protein is intrinsically disordered.


Despite this, the rise of NMR and sequence-based bioinformatics methods has greatly expanded the experimental and theoretical literature on the topic of IDPs [23]. We now have the experimental tools to begin to characterize subtler categories of disorder, such as conditional disorder and partial disorder. A protein that is conditionally disordered is either ordered or disordered based on the environmental context or interaction partner. It is a term that encompasses both disorder-to-order transitions and transient (or cryptic) disorder, which is functionally relevant disorder that arises from structured regions (order-to-disorder) [24]. Furthermore, an increasing number of examples of small, but important, IDPRs are appearing in the literature, as can be seen in Figure 3 (red square).

Figure 3. The fraction of predicted disorder versus the fraction of PubMed IDs that use IDP language. Each blue dot represents a protein. The percent predicted disorder is plotted against the fraction of PubMed IDs that use intrinsic disorder language divided by all PubMed IDs associated with that protein search term. For each fraction of predicted disorder interval (0-10%, 10-20%, etc.), the fraction of the total proteins in that interval is plotted in red. The mean of the fraction of disorder PubMed IDs is plotted for each fraction of disorder interval in black.
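The interval summary described in the caption of Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code behind the figure: the input layout (percent predicted disorder, number of PubMed IDs using IDP language, total PubMed IDs), the function name `summarize_by_disorder_interval`, and the handling of edge cases (100% joins the last interval, proteins with no PubMed IDs are skipped) are hypothetical choices.

```python
def summarize_by_disorder_interval(proteins, width=10):
    """Summarize proteins by predicted-disorder interval (0-10%, 10-20%, ...).

    `proteins` is an iterable of (percent_disorder, idp_pmids, total_pmids)
    tuples; this layout is an assumption made for the sketch. `width` is
    assumed to divide 100 evenly.

    Returns one (low, high, share_of_proteins, mean_idp_fraction) tuple per
    interval. The mean is None for an interval that contains no proteins.
    """
    n_bins = 100 // width
    members = [[] for _ in range(n_bins)]
    for percent, idp_pmids, total_pmids in proteins:
        if total_pmids == 0:
            continue  # no literature for this protein, so no fraction is defined
        # A protein at exactly 100% disorder is placed in the last interval.
        index = min(int(percent // width), n_bins - 1)
        members[index].append(idp_pmids / total_pmids)

    n_total = sum(len(fractions) for fractions in members)
    summary = []
    for i, fractions in enumerate(members):
        share = len(fractions) / n_total if n_total else 0.0
        mean = sum(fractions) / len(fractions) if fractions else None
        summary.append((i * width, (i + 1) * width, share, mean))
    return summary


# Toy values, invented for illustration only.
toy_proteins = [(5.0, 1, 10), (8.0, 3, 10), (95.0, 5, 10), (100.0, 10, 10)]
toy_summary = summarize_by_disorder_interval(toy_proteins)
```

In terms of the caption, the third element of each tuple corresponds to the quantity plotted in red and the fourth to the quantity plotted in black.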

1.4 The Line Between Order and Disorder

The emphasis in this work is on populations of proteins that display the properties of both structure and disorder, specifically proteins that have been at least partially crystallized, and enzymes, which will have a structured catalytic region in most cases. While structure and disorder are often treated as binary states, they actually sit on a continuum. This work concerns proteins that exist in the middle of the structure-disorder continuum; therefore, it is useful to identify the conceptual line between a structured and disordered protein or protein region, and explore why it is necessary to use the tools and language of disorder when structure is present.

When attempting to semantically separate structure from disorder, it becomes clear that neither the term “structured” nor “disordered” is precisely correct. All proteins have some movement, and no protein is completely chaotic. Because these are conceptual frameworks that do not point to precise biological realities, the tools of structural or un-structural biology should be applied when most useful to solve the problem. The separating line between order and disorder is therefore drawn not by theoretical descriptions, but by practical considerations.

When a protein can no longer be adequately described by a single three-dimensional structure or a series of snapshots in three dimensions, the language of disorder becomes useful. IDPs are defined by conformational uncertainty, and are typically characterized by a combination of sequence level features and a description of the overall shape, i.e., extended, random coil, molten globule, or pre-molten globule. Additionally, an IDP that changes shape must also be described using the axis of time, and it is along this axis that structural biology and un-structural biology most acutely diverge. The introduction of a time variable greatly increases the possibilities when describing IDPs and IDPRs, and it is also change over time that allows us to describe the mechanisms that IDPs may employ and the advantages imparted by disorder.

As an example, short segments of disorder are commonly observed in the form of hinges that move a domain in a controlled way, or loops that have an open and closed conformation, such as the WPD loop in the bacterial protein tyrosine phosphatase YopH [25]. While these regions are technically disordered, the ability to describe the movement as a series of structural snapshots typically places these dynamic movements within the realm of structural biology. However, other small disordered segments called Molecular Recognition Features (MoRFs) [26], which undergo a contextual transition between disorder and order upon binding, have a function that is defined by the presence of disorder and the transition to an ordered state, and not by a specifically defined three-dimensional structure. Therefore, despite their short lengths and disorder-order transitions, MoRFs fall within the realm of un-structural biology.

Furthermore, whether a protein is considered to be an IDP or to have an IDPR is largely determined by the functional significance assigned to the disorder. Disordered regions that have no known function are often considered to be functionally neutral sequence noise. An a priori assumption that a disordered region is function-neutral can create circular support for itself if this region is removed before experimentation; therefore, the identification of disorder-specific functions is of critical importance.

1.5 The Mechanisms of Disorder

A great deal of experimental and theoretical work has been done to illuminate how IDPs and IDPRs fit within a functional protein universe. When viewed within the context of finely regulated interaction and signaling networks, where proteins may need to display multiple context-dependent behaviors, the advantages of disorder become clear. While many historical examples of the functional properties of IDPs and IDPRs have come from studies of non-enzymes, an increasing number of more recent studies have shown that these functional advantages are also demonstrable in enzymes.


1.5.1 Entropy

Entropic functions make use of the advantages inherent in dynamic movement within a disordered protein chain. Entropic chains can provide precise spacing between functional domains, creating a less restricted search space, maintaining separation between domains, or creating the opportunity for two or more domains to interact with each other, or with another partner. Two enzymes with kinase domains demonstrate the utility of a disordered interdomain linker. The kinase Yck2 has a long disordered interdomain linker that allows the kinase domain and a conserved C-terminal peptide (CCTP) domain to interact with two separate Akr1 domains simultaneously [27], while a disordered interdomain linker in Phototropin 2 becomes elongated when exposed to blue light irradiation, preventing the LOV2 domain from making contact with and activating the kinase domain [28] (Figure 4). Additionally, there are non-enzyme examples of entropic clock functions, such as the voltage-gated potassium channel of nerve axons, which uses a ball and chain mechanism to inactivate the channel [29]. Entropic bristles use entropy to fill space, as is seen in the gating of the nuclear pore complex through the repulsion caused by disordered nucleoporins [30].

Figure 4. Entropic chain functions. A) Yck2 (in blue) uses a disordered interdomain linker to bind to two separate domains on Akr1 (in orange). B) The disordered interdomain linker in Phototropin 2 (in green) becomes elongated when irradiated with blue light, causing the activating LOV2 domain to separate from the kinase domain.


1.5.2 Accessibility

Posttranslational modification (PTM) requires site accessibility, so it is not surprising that many PTM sites are embedded in disordered regions, which can provide a large surface area with a limited number of residues. Phosphorylation sites in particular have been shown to be enriched in disordered residues [31]. Most well-known IDPs have phosphorylation sites, and they have been demonstrated in the IDPRs of enzymes as well. For example, the intrinsically disordered Cap region of both Abl and Arg non-receptor tyrosine kinases is rich in phosphorylation sites that regulate multiple domains [32].

Site accessibility is also required for proteolytic processing that generates protein fragments with altered activity. For example, the phosphatase calcineurin contains an intrinsically disordered regulatory domain that is susceptible to proteolytic cleavage in vitro [33] and in vivo [34] that significantly increases its activity.

1.5.3 Plasticity

Some of the most striking functional advantages of IDPs and IDPRs come from their ability to contextually change due to binding or environmental cues. IDPs and IDPRs may change shape in response to interactions with proteins, nucleic acids, and other ligands, allowing them to specifically bind a wide variety of partners. For example, the disordered loop near the active site of mitochondrial 2,4-dienoyl-CoA reductase allows this enzyme to accommodate a wide range of fatty acids [35]. Protein kinase R has two intrinsically disordered interdomain regions that may allow it to effectively dimerize when interacting with RNA activators of varying size and shape [36]. Plasticity may also be helpful in identifying and negotiating disordered regions in substrates, and appears to play an important role in ubiquitination pathways. For instance, E3 ligases that bind both substrates and E2 ubiquitin conjugating enzymes are significantly more disordered than those that engage in single interactions [37].


1.6 Biological Functions

The biophysical mechanisms used by disordered regions are disproportionately connected to particular biological roles, such as signaling and molecular and cellular regulation [38]. This is not surprising, considering the ability of disordered regions to change over time and to adapt based on the environmental context.

1.6.1 Signaling

Signaling pathways provide a way for complicated biological systems to coordinate physiological activities and responses. Signaling pathways can usually be described as a linear cascade of interactions that triggers some kind of change in the cell. PTMs such as phosphorylation frequently play a key role in this, and disorder contributes both to the accessibility of PTM sites and to the activity of kinases [39]. Key sequence signals may be encoded in disordered sequences, such as in sulfhydryl oxidase ALR, which has an IDPR that acts as a mitochondrial targeting signal in the cytosol and a recognition site in the disulfide relay system of the intermembrane space [40]. Receptors, which are often a starting point for a signaling cascade, may have IDPRs that expose phosphorylation sites or bind to signaling molecules, as is seen in the receptor tyrosine-protein kinase ErbB2 [41].

1.6.2 Regulation

IDPs are frequently involved in regulation at the cellular level through involvement in gene [42] and protein degradation [37, 43], and at the protein level, through allosteric effects or PTMs that result in the masking and unmasking of interaction sites. As an example, phosphorylation of the IDP 4E-BP2 acts as a regulatory switch by inducing a disorder-order transition and preventing binding with eIF4E [44]. Conversely, regulation of glucokinase is facilitated by an order-disorder transition that causes a time-delay when glucose is low [45].


IDPs are also abundant in protein degradation pathways. A number of E3 ubiquitin-protein ligases have long stretches of disorder that appear to mediate interactions with a variety of mostly disordered substrates [37]. For example, San1 is an E3 ubiquitin-protein ligase which has extended disorder in its N- and C-terminal binding regions. Interestingly, San1 avoids auto-ubiquitination through the absence of lysines in its disordered binding regions [46]. Ubiquitin-independent protein degradation pathways also involve disordered protein regions. The enzymes thymidylate synthase and ornithine decarboxylase both contain IDPRs that appear to contain the sole requirements for ubiquitin-independent degradation [43].

1.7 Disorder and Protein Evolution

The evolution of protein coding regions in genomes presents a fundamental mystery. The , the genome of the flowering , and the genome of the protozoan Tetrahymena are all estimated to have approximately 27,000 - 29,000 protein coding regions. On the other hand, Danio rerio (zebrafish) and Mus musculus have close to 40,000 protein coding regions each [47]. It is clear that protein coding regions do not scale linearly with organismic complexity. One compelling explanation for this is that protein intrinsic disorder combined with alternative splicing facilitates tightly regulated and context specific multi-functional behavior that allows more complicated organisms to make the most of a limited genome [42, 48].

There are several pieces of evidence to support a hypothesis of evolutionarily directed functional disorder. Bioinformatics analyses show that eukaryotes are more disordered than prokaryotes [49], natural sequences are more disordered than random sequences even with the same amino acid composition [50], and disorder within natural sequences is non-random in its patterns [51]. Disordered regions tend to evolve more rapidly while maintaining their physiological functions [52], therefore, functional disorder may provide an advantage by buffering a genome against mutations. Finally, it can be argued that complex signaling networks and finely tuned regulatory mechanisms are themselves a response to organismic complexity, therefore the overrepresentation of IDPs and IDPRs in signaling and regulation also supports the hypothesis of the directed evolution of disorder in genomes. Interestingly, some protists have more disorder than multicellular eukaryotes, suggesting there may be an optimal amount of disorder for an organism that is partly based on lifestyle [49].

1.8 The Tools of the Un-Structural Biologist

The work herein focuses on computational analysis, however these results sit upon a foundation of decades of experimental work. The study of IDPs and IDPRs requires a large number of experimental and computational methods, and typically the results of these experiments combine together to form a picture of the disorder properties over the protein and the proteome (Figure 5).

Figure 5. Experimental and Bioinformatics techniques work together to describe the properties of disorder in proteomes and proteins.

1.8.1 Experimental techniques

1.8.1.1 X-Ray crystallography. It is somewhat surprising that X-ray crystal structures provide one of the largest datasets of experimentally indicated IDPRs, considering that X-ray crystallography is one of the primary tools of structural biology. However, missing regions in X-ray crystal structures are often caused by IDPRs. The challenge with using this data, however, is that missing regions are an imperfect indication of protein intrinsic disorder, as there are multiple possible explanations for a missing region, including experimental artifacts or annotation errors (for more on this, see Chapter 2). Furthermore, authentic IDPRs identified in X-ray crystal structures will be non-representative in terms of the size of the region and the amino acid composition, due to their emergence from a very structured set of proteins. The decision to use X-ray crystal structure data as an indication of disorder must therefore be made by balancing the usefulness of a large amount of data against the imperfections in the data.

1.8.1.2 Nuclear Magnetic Resonance. Nuclear Magnetic Resonance (NMR) is arguably the current best experimental technique for identifying protein intrinsic disorder and conformational ensembles. The key differences that make NMR superior to X-ray crystallography for identifying IDPs are that NMR does not require crystallization and NMR can provide direct observation of disorder instead of simply indicating a lack of structure. However, there are limitations in the size of the protein that restrict the applicability of NMR, and the amount of NMR data is still significantly less than the amount of X-ray crystal structure data.

Identifying IDPs and IDPRs using NMR can be accomplished through several different approaches. A collapsed heteronuclear single quantum coherence spectroscopy (HSQC) NMR spectrum will indicate disorder over the entire protein, whereas a dispersed spectrum will indicate a structured protein. NMR techniques can also be used to generate conformational ensembles and multiple methods can be employed to measure the differences between ensembles [53, 54]. Additionally, chemical shift and 15N (1H) Nuclear Overhauser Effect (NOE) data can provide flexibility information without the requirement for any structural models [55].

1.8.1.3 Combining experimental techniques. The number of experimental techniques that can be used to study IDPs is extensive, and in fact multiple books [56, 57] and reviews [58, 59] have been dedicated to this topic. In practice, multiple techniques of varying resolution are typically employed and the aggregated evidence is used to create models of the disordered regions. These include low-resolution spectroscopic techniques such as circular dichroism, optical rotatory dispersion, Fourier-transform infrared spectroscopy, and deep-UV resonance Raman spectroscopy. Additionally, the level of protein compaction can be measured by small angle X-ray scattering, small angle neutron scattering, gel-filtration, and viscometry. The properties of individual protein molecules can help identify ensemble properties and can be obtained via high speed atomic force microscopy (AFM) and single-molecule fluorescence resonance energy transfer (SM-FRET).

1.8.2 Bioinformatics analysis

Bioinformatics tools and analysis have played a large part in the study of IDPs and the establishing of the field. The tools used to study IDPs typically focus on extracting information from protein sequences, however genome studies focusing on the evolution of IDPs and the enrichment of splicing sites in disordered regions are common as well [60, 61]. Several recent reviews have been written focusing on different aspects of bioinformatics analyses of IDPs such as the discovery of degenerate motifs in IDPs [62], predicting function in IDPs [63], and the prediction of IDPs by protein sequence [64]. Indeed, the computational tools used to analyze IDPs and IDPRs are as vast as the experimental tools. Therefore, a brief introduction will be provided here of the methods used in this work.

1.8.2.1 Sequence characteristics. Because of the lack of a stable three-dimensional structure, the computational study of IDPs and IDPRs is predominantly dependent on primary sequence information. Anfinsen’s dogma suggests that the three-dimensional structure of a protein is encoded into the primary sequence [65], however the accurate prediction of the folded structure from primary sequence remains elusive. Tools to predict disorder from primary sequence have been much more successful, however. This is intuitive from the perspective of entropy. A three-dimensional structure has only one form that it can take, whereas the conformational fluctuations that can define a disordered protein are nearly infinite in their possibilities within steric limitations, therefore predictors of disorder require less information than predictors of structure.

IDPs have distinct sequence characteristics that facilitate the identification of disorder from sequence. IDPs are enriched in specific disorder promoting residues, such as alanine, glycine, serine, proline, glutamine, glutamic acid, lysine and arginine, and they are depleted in the order promoting residues isoleucine, valine, leucine, phenylalanine, cysteine, tryptophan, tyrosine, and asparagine [66, 67]. These residues are roughly correlated with flexibility [68] and hydrophobicity scales [69] (Figure 6).

Figure 6. Amino acid scales and disorder and order promoting residues. Top) Ranking of the 20 amino acids by the Kyte-Doolittle hydrophobicity scale from most to least hydrophobic. Bottom) Ranking of the amino acids from most to least flexible by Vihinen’s flexibility scale.
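The residue classes listed above lend themselves to a simple composition check. The following is a minimal sketch, assuming one-letter amino acid codes; the function name `composition_profile` and the "other" category are illustrative choices and are not taken from the predictors cited in this work.

```python
# One-letter codes for the residue classes named in the text [66, 67].
DISORDER_PROMOTING = set("AGSPQEKR")  # Ala, Gly, Ser, Pro, Gln, Glu, Lys, Arg
ORDER_PROMOTING = set("IVLFCWYN")     # Ile, Val, Leu, Phe, Cys, Trp, Tyr, Asn


def composition_profile(sequence: str) -> dict:
    """Return the fraction of residues in each class for one sequence.

    Residues in neither class (an illustrative 'other' bucket) are
    reported separately so that the three fractions sum to 1.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n_disorder = sum(residue in DISORDER_PROMOTING for residue in seq)
    n_order = sum(residue in ORDER_PROMOTING for residue in seq)
    total = len(seq)
    return {
        "disorder_promoting": n_disorder / total,
        "order_promoting": n_order / total,
        "other": (total - n_disorder - n_order) / total,
    }
```

A composition profile of this kind is only a rough indicator; the predictors discussed in the next section combine composition with additional sequence features.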

Low complexity regions are often disordered, and disordered regions are often enriched in low complexity motifs. However, neither disorder nor low complexity necessarily implies the other [70]. Disordered regions tend to have lower sequence conservation in families, however there are also well conserved disordered domains [71]. Furthermore, in poorly conserved disordered regions, the chemical composition is often preserved [72, 73].


1.8.2.2 Disorder prediction. The distinct sequence features that are present in IDPs and IDPRs allow the construction of sequence based rules that can facilitate high performance disorder prediction. Over 70 predictors of disorder have been created since 1997 [73, 74]. A favorable balance between true positives (TP) / true negatives (TN) and false positives (FP) / false negatives (FN) is the objective, and this is typically expressed by the Matthews Correlation Coefficient (MCC).

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TN + FP)(TP + FN)(TN + FN)}}
\]

The Critical Assessment of protein Structure Prediction (CASP) competition judges disorder predictors based on as yet unpublished disordered regions, which are usually obtained from missing regions in newly published X-ray crystal structures [75]. The highest ranking predictors in the CASP experiment have an MCC of approximately 0.5 and these results are usually achieved by slower predictors that use multiple sequence alignments along with sequence based features such as amino acid composition and the physicochemical properties of the amino acids. These slow but high performing predictors, such as Protein DisOrder prediction System (PrDOS) [76], Sequence based Prediction with Integrated Neural network for Disordered residues (SPINE-D) [77], and DISOPRED3 [78], are best for small datasets and single protein prediction. Large datasets, however, require the use of faster predictors that can be run on a local computer, such as ESpritz [79] and IUPred [80, 81].
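The MCC described above can be computed directly from per-residue labels. The following is a minimal sketch, assuming that predictions and experimental annotations are available as equal-length sequences of booleans (True for disordered); the function names are illustrative, and returning 0.0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math


def confusion_counts(predicted, observed):
    """Count TP, TN, FP, FN over paired per-residue boolean labels."""
    if len(predicted) != len(observed):
        raise ValueError("label sequences must have equal length")
    tp = sum(1 for p, o in zip(predicted, observed) if p and o)
    tn = sum(1 for p, o in zip(predicted, observed) if not p and not o)
    fp = sum(1 for p, o in zip(predicted, observed) if p and not o)
    fn = sum(1 for p, o in zip(predicted, observed) if not p and o)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion counts."""
    denominator = math.sqrt((tp + fp) * (tn + fp) * (tp + fn) * (tn + fn))
    if denominator == 0:
        # Assumption: treat the undefined case as no correlation.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The MCC ranges from -1 (complete disagreement) through 0 (no better than chance) to 1 (perfect agreement), which puts the CASP value of approximately 0.5 into context.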

Disorder predictors must be trained and tested on datasets of experimentally indicated disordered residues. These datasets commonly come from DisProt [82], X-ray crystal structures in the PDB [83], or NMR data. Most disorder predictors will give a per-residue disorder score between 0.0 and 1.0, and the generally agreed upon threshold for disorder is greater than or equal to a score of 0.5. Some disorder predictors, such as SLIDER [84] or RAPID [85], will provide fast prediction for complete proteomes by calculating a single score across the entire protein. Other predictors may provide a score that is calibrated differently, such as DynaMine, which produces scores in the form of backbone N-H S2 order parameter values [55, 86]. When using DynaMine, a score below 0.7 is considered flexible.
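The thresholding step described above can be illustrated with a short sketch that converts per-residue scores into contiguous predicted disordered regions. It assumes scores are given as a list of floats in sequence order; the function name `disordered_regions` and the `min_length` parameter are illustrative and do not correspond to any specific predictor's interface.

```python
def disordered_regions(scores, threshold=0.5, min_length=1):
    """Return predicted disordered regions as (start, end) tuples.

    Positions are 1-based and inclusive. A residue counts as disordered
    when its score is greater than or equal to the threshold.
    """
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = position  # a new region opens here
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a region that runs to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions
```

For a predictor calibrated like DynaMine, where low values indicate flexibility, the comparison would need to be inverted (for example, a score below 0.7), so this sketch applies only to predictors that follow the 0.5 convention.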

Meta-predictors, which combine the outputs from multiple single predictors are also common, such as PONDR-FIT [87] and MetaDisorder [88]. A consensus of multiple predictors can provide improved results [89], as it will theoretically reduce the bias inherent in single predictors that were trained on limited datasets. However, despite a modest improvement through consensus methods, disorder prediction based on currently available datasets has likely hit a bottleneck in terms of the maximum possible MCC scores.
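As an illustration of the consensus idea, the following minimal sketch applies a simple per-residue majority vote across several predictors. This is an assumption made for illustration only: it is not how PONDR-FIT or MetaDisorder combine their inputs, and the function name and the shared 0.5 threshold are hypothetical choices.

```python
def majority_consensus(per_predictor_scores, threshold=0.5):
    """Return a per-residue boolean consensus from several predictors.

    `per_predictor_scores` is a list with one score list per predictor,
    all of the same length. A residue is called disordered when more
    than half of the predictors score it at or above the threshold.
    """
    if not per_predictor_scores:
        raise ValueError("at least one predictor is required")
    length = len(per_predictor_scores[0])
    if any(len(scores) != length for scores in per_predictor_scores):
        raise ValueError("all predictors must score the same residues")
    n_predictors = len(per_predictor_scores)
    consensus = []
    for residue_scores in zip(*per_predictor_scores):
        votes = sum(score >= threshold for score in residue_scores)
        consensus.append(votes * 2 > n_predictors)
    return consensus
```

With an even number of predictors a tie counts as ordered under this rule; a different tie-breaking choice would be equally defensible.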

This limitation arises in part from imperfections in the testing and training sets used to develop disorder predictors. Experimental indications of disorder are gathered from a wide range of experimental techniques. These include low-resolution techniques such as circular dichroism, which may not provide accurate estimates of the exact disordered residues; X-ray crystallography, in which a missing region only indicates disorder and may be caused by other factors; and NMR, which requires significant, and therefore variable, interpretation in order to assign disorder. A second issue, which further compounds this, is the presence of conditionally and partially disordered regions, which may be assigned as ordered or disordered depending on the experiment. Finally, it is likely that there are different flavors of disorder [90] with different sequence based markers, yet a clear classification scheme has not yet been created.

Disorder prediction should be employed in analysis with an appropriate awareness of the inherent level of error. However, despite these considerations, disorder prediction still provides a way to separate unique sequence regions that indicate a propensity towards disorder, and provides a useful level of biological accuracy, especially at the proteome level. While individual residues may not be assigned correctly in all cases, disorder prediction still provides an illuminating look into the propensity of the protein to be solvent exposed, to undergo dynamic transitions, or to be destabilized by environmental factors.

1.8.2.3 Classification of function. There appears to be a relationship between biophysical function, cellular function, and sequence characteristics, however the identification and development of these relationships is still in its early stages [91]. Therefore, a major task in bioinformatics is to attach biological sequence information to physical behavior, biological functions, and cellular response. Functional classification based on sequence requires two sets of information. The first is sequence based features. These may be disorder prediction scores, calculations based on the physicochemical features, or amino acid motifs. The second set of information is functional annotation. Gene Ontology (GO) term assignments are one of the primary sources of annotations related to cellular components, biological processes, and molecular functions [92]. GO terms can be assigned based on experimental evidence, or can be inferred based on homology. Additionally, Enzyme Commission (EC) numbers provide a useful annotation protocol when classifying enzymes [93]. EC numbers are assigned based on the chemical reaction that is catalyzed by the enzyme. Similar to GO terms, EC designations can be made based on direct experimental evidence or can be inferred through homology.

1.8.2.4 Proteome level studies. The early assumption in protein science was that protein intrinsic disorder represented an unusual and isolated phenomenon. In fact, it can be argued that this assumption is still held by many researchers today [18]. Therefore, the application of disorder prediction to whole proteomes has been critical to establishing the ubiquity and relevance of protein intrinsic disorder while the tools for large scale experimental identification are still nascent. Despite differences in the disorder predictors used, the proteomes they have been applied to, and the different measures applied, the consensus is that eukaryotes tend to have more predicted disorder than prokaryotes or archaea [49], and within eukaryotic proteomes especially, intrinsic disorder is exceptionally common [94]. For instance, Ward et al. found that 2.0% of archaean, 4.2% of eubacterial and 33.0% of eukaryotic proteins had disordered regions greater than 30 residues in length [95]. Estimates of the average fraction of disorder for eukaryotic proteomes tend to be between 20-30%, while prokaryotes tend to be closer to 5-10%, however there is significant variation and overlap in disorder prediction between the taxa [49, 94].
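The two proteome-level measures mentioned above can be sketched as follows. This minimal example assumes each protein is represented by a list of per-residue booleans (True for predicted disorder) and interprets the "fraction of disorder" as the fraction of all residues in the proteome that are disordered; that interpretation, the function names, and the dictionary layout are assumptions made for illustration.

```python
def longest_disordered_run(flags):
    """Length of the longest run of consecutive disordered residues."""
    longest = current = 0
    for is_disordered in flags:
        current = current + 1 if is_disordered else 0
        longest = max(longest, current)
    return longest


def proteome_summary(proteome, long_region_cutoff=30):
    """Summarize predicted disorder across a proteome.

    `proteome` maps a protein identifier to its per-residue flags.
    A protein has a long disordered region when its longest run is
    greater than `long_region_cutoff` residues.
    """
    total_residues = sum(len(flags) for flags in proteome.values())
    if total_residues == 0:
        raise ValueError("proteome contains no residues")
    disordered_residues = sum(sum(flags) for flags in proteome.values())
    with_long_region = sum(
        longest_disordered_run(flags) > long_region_cutoff
        for flags in proteome.values()
    )
    return {
        "fraction_with_long_region": with_long_region / len(proteome),
        "fraction_disordered_residues": disordered_residues / total_residues,
    }
```

Different studies use different measures and cutoffs, which is one reason the reported values vary between analyses.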

1.9 Protein Intrinsic Disorder and Disease

In 2008, Uversky et al. introduced the disorder in disorders (D2) concept, and showed that proteins with IDPRs greater than thirty residues in length are overrepresented in proteins involved with signaling, cancer, neurodegenerative diseases, cardiovascular diseases, and diabetes [96]. While this relationship may suggest innate pathogenicity in IDPs and IDPRs, studies suggest instead that IDPs and proteins with IDPRs perform tightly regulated [97] and necessary functions, many of which depend on the lack of a three-dimensional structure [98]. However, as is the case in structured proteins, genetic, environmental, or systemic perturbations can make IDPs and IDPRs susceptible to misfolding and misregulation.

Some of the most well-known examples of disease related IDPs are implicated in neurodegeneration, such as tau, alpha-synuclein, beta amyloid, and prion protein [99]. Flexibility in these proteins can facilitate perturbations into misfolded, aggregated states, and therefore the disease state is directly related to a structural transition facilitated by the disordered properties of the protein. However, there are also a myriad of potential roles that IDPs and IDPRs can play in disease processes. The pathogenic behavior of an IDP or protein with an IDPR may be triggered by genetic factors such as pathogenic mutations, alternative transcription, or aberrant splicing, or non-genetic factors such as altered protein expression levels, PTMs, or aberrant cleavage. These cellular changes may result in misfolding, loss of normal function, gain of toxic function, protein aggregation, misidentification, misregulation, or missignaling (reviewed in [100] and [96]).

While uncommon, there are some interesting examples of disease associated enzyme IDPs in the literature. For instance, virulence factors with catalytic domains may utilize long IDPRs to translocate or avoid host defense systems. The adenylate cyclase toxin in Bordetella pertussis provides an interesting example of this phenomenon. The adenylate cyclase toxin contains a Repeat in ToXin (RTX) motif, which is intrinsically disordered in the absence of calcium inside the bacterial cell. This allows the enzyme to translocate the catalytic domain across the narrow type 1 secretion channel, and then transition to a globular structure when exposed to the calcium gradient on the bacterial cell wall [101] (Figure 7).

Figure 7. A schematic representation of the secretion of adenylate cyclase toxin through the type 1 secretion system. Reprinted under the creative commons license from Sotomayor-Pérez AC, Ladant D, Chenal A: Disorder-to-order transition in the CyaA toxin RTX domain: implications for toxin secretion. Toxins (Basel) 2015, 7(1):1-20.

The human acetylcholinesterase variant, AChE-R, provides an interesting counterpoint to IDPR associated pathology. The AChE-R variant has an intrinsically disordered C-terminus, and the presence of this disordered region appears to provide neuroprotective effects in Alzheimer’s disease as compared to the AChE-S variant, which has a helical C-terminus, and appears to accelerate the formation of amyloid fibrils [102]. It is likely that many examples of enzymes with IDPRs that are involved in disease processes in both pathogenic and protective capacities will emerge with the increased acceptance of the functional roles that IDPRs in enzymes may play.

1.10 Protein Intrinsic Disorder and Drug Design and Discovery

The enrichment of long IDPRs in proteins associated with disease processes presents a rich source of potential drug targets, however there are several challenges in drug design and discovery when IDPRs are targets. Secure binding to a small molecule requires stabilization in the binding site, and this may not be possible in all IDPRs. However, functional IDPRs will often take on transient structure due to binding or environmental factors which suggests that inducible structure may be possible in many cases. In proteins that are mostly disordered in their native state, the disorder can be understood as a population of many interconverting conformations, with the potential to stabilize a single non-pathogenic conformation. The investigation of natural compounds with known effects on IDPs can provide a powerful route to discovery. For instance, the consumption of coffee appears to provide some protection against the development of Parkinson’s disease. A study on the effects of caffeine on alpha-synuclein aggregation showed that caffeine modifies the conformation of the monomer form, thus accelerating the aggregation of a less toxic species [103] (Figure 8).

Drug design and discovery in IDPRs may be more akin to navigating a handshake than placing a key in a lock, however many of the principles of design and discovery are still the same. When stabilizing binding partners are not known, a clear challenge is the absence of a priori knowledge of the three-dimensional structure for in silico screening or rational drug design of favorable compounds. This challenge is not insurmountable however, as blind exploratory assays are standard in drug discovery, and the presence of an IDPR should not prohibit these kinds of screens. Instead, the biggest challenge in drug discovery for IDPRs may be the standard practice of the exclusion of IDPRs before drug screens.

Figure 8. A schematic representation of the effects of caffeine on the aggregation properties of alpha-synuclein. Reprinted with permission from Kardani J, Roy I: Understanding Caffeine's Role in Attenuating the Toxicity of α-Synuclein Aggregates: Implications for Risk of Parkinson's Disease. ACS Chem Neurosci 2015, 6(9):1613-1625. Copyright 2015 American Chemical Society.

1.10.1 The story of PTP1B

The phosphatase PTP1B (Figure 9) provides an illustrative example of delayed drug discovery for an IDPR due to the consistent truncation of the region before drug screens. The catalytic region of PTP1B, encompassing residues 1-321, was purified in 1988 from the human placenta [104], and in 1990 the full length form of 435 residues was uncovered through cDNA cloning [105]. Despite knowledge of the full length form, and an early demonstration of its role in the regulation of PTP1B [106], studies on PTP1B between 1990 and 2014 focused almost exclusively on the originally purified form encompassing residues 1-321, therefore ignoring the disordered C-terminal region. PTP1B became an attractive therapeutic target due to its involvement in multiple signaling pathways, including those implicated in obesity and diabetes [107], however the development of a small molecule inhibitor for the catalytic domain of PTP1B was frustrated by the highly charged nature of the catalytic site [108], and the practice of testing inhibitors against the truncated form. It was not until 2014 that MSI-1436, a known inhibitor of PTP1B in vivo [109], was screened against the full length form, and found to be an effective inhibitor [110].

Figure 9. A representative ensemble of 100 conformers for PTP1B. Reprinted with permission from Macmillan Publishers Ltd: Nature Chemical Biology. Krishnan N, Koveal D, Miller DH, Xue B, Akshinthala SD, Kragelj J, Jensen MR, Gauss CM, Page R, Blackledge M et al: Targeting the disordered C terminus of PTP1B with an allosteric inhibitor. Nat Chem Biol 2014, 10(7):558-566. Copyright 2014.

The C-terminal region of PTP1B is intrinsically disordered, moving within a wide range of three-dimensional space (Figure 9). However, as is often the case in IDPRs, there is residual secondary structure in the form of two small alpha helical regions. The most peripheral of these alpha helical regions provides one anchor point for MSI-1436, while the second anchor point is found close to the catalytic domain. Upon binding to these two regions, PTP1B becomes more compact, and Vmax is decreased. While MSI-1436 provides a small amount of inhibition of the truncated form of the enzyme, the primary binding site is between residues 367 and 394, and therefore this region is required to observe the full strength of the inhibition. To our knowledge, this was the first drug screen against the full length form, therefore it is possible that MSI-1436 or other effective inhibitors had been tried and discarded previously.

The story of PTP1B demonstrates that it is sometimes assumptions about the lack of functional relevance of IDPRs in enzymes that creates one of the largest obstacles to understanding and utilizing these regions in disease intervention.

1.11 The Field of Protein Intrinsic Disorder

An un-structural biologist specializes in the tools and techniques used to study IDPs. Furthermore, a specialist in the field of IDPs must be aware of individual proteins identified as IDPs and the body of experimental, proteomic and bioinformatics literature validating the existence of disorder in these proteins. Due to the broad scope of the material covered by the field of protein intrinsic disorder, and the nascence and relative obscurity of the field, there are a number of researchers who focus almost exclusively on the study of IDPs from various perspectives.

The field of protein intrinsic disorder represents a powerful example of the productive relationship between experimental and bioinformatics techniques (Figure 5). For example, an experimentalist who notices unusual structural behavior in their protein may employ disorder prediction to assess the propensity of their protein towards disorder. Using this information as a guide, they can target their research towards those regions predicted to be intrinsically disordered, and apply the appropriate experimental techniques. This experimental data yields information, which can then be used to revise and improve the prediction of disorder in other proteins, to extrapolate evolutionary information for the , or to predict function or biophysical mechanism. However, the number of researchers who specialize in the study of specific proteins that are intrinsically disordered, and who also embrace the language of disorder to describe these properties, is remarkably small (Figure 10), demonstrating that the tools and language of protein intrinsic disorder have not propagated far beyond those who specialize in studying IDPs.


Figure 10. The number of papers per author for the search term in PubMed, plotted against the fraction of those papers that use IDP terminology. Each point represents an author on one or more papers associated with the given search term. The darker the dot, the larger the concentration of authors at that point. Blue dots are authors who have an IDP paper in the field in question (alpha-synuclein or tau, in this case), while the red dots are authors who have an IDP paper in the field in question and also have an IDP paper in a different field. The fraction of IDP papers is the number of papers by that author that use IDP terminology divided by all papers for that author and search term. The following search terms were used: (A) (Top) “alpha synuclein” (B) (Bottom) “tau AND (protein OR Alzheimer's OR tauopathies OR neuronal)”.


1.12 Intrinsic Disorder Where You Least Expect It

As this new paradigm of un-structural biology becomes more accepted, it becomes clear that the disciplines of structural and un-structural biology can work in tandem to explain the dynamics of a protein over time. The line between order and disorder is a practical line, and the language and tools of protein intrinsic disorder become necessary when a protein or protein region can no longer be described in three-dimensions. The acceptance of the presence and functional relevance of protein intrinsic disorder, however, remains relatively low, especially in protein populations that are expected to be structured.

Therefore, this introduction highlights several challenges:

• There is a focus in the IDP literature on mostly or fully disordered proteins, and an incomplete understanding of the diverse mechanisms and functions employed by proteins that have regions of both order and disorder.

• The development and validation of disorder prediction depends on experimental datasets of missing regions from the PDB, however it is not always clear what the cause and nature of the missing region is.

• There is an increasing number of enzymes with experimentally measured IDPRs in the literature, but no proteome level studies up to this point.

• There is limited acceptance of the language of protein intrinsic disorder outside of those who specialize in studying IDPs.

• Drug design and discovery frequently focuses on the truncated form of the protein, potentially resulting in missed opportunities.

This work has direct relevance to applications in human health and disease by addressing the question of how likely IDPRs are to occur in canonically structured protein classes and whether and how these regions are functionally significant.

The original research contained within this work has the following main results:


• Ambiguous missing regions in the PDB are likely to indicate varying levels of conditional and partial disorder instead of static disorder.

• Enzymes have a similar incidence and length of IDPRs as non-enzymes.

• Enzymes implicated in macromolecular metabolism, signaling, and regulation are enriched in long IDPRs.

• Protein intrinsic disorder scales with organismic complexity in both enzymes and non-enzymes.

• There is a non-random retention of disorder in enzymes, suggesting that disorder in enzymes is not sequence noise, and probably has evolutionarily directed functional relevance in most cases.

Furthermore, the following resources have been developed and compiled for this work:

• 1127 experimentally identified IDPs and proteins with IDPRs, and their supporting literature, hand curated through an extensive literature search.

• 96 enzymes with experimentally verified regions of protein intrinsic disorder and their supporting literature.

• A publicly available method for compiling ambiguous regions and missing regions in the PDB, implemented in the programming language Python [111].

This work has a direct bearing on the design of experiments targeted towards intervention in human disease. By highlighting “Disorder Where You Least Expect It” this work will contribute to the ideological expansion of the field of protein intrinsic disorder so that it is increasingly seen as an integrated toolset in the study of proteins.


2. Intrinsic Disorder in the Protein Data Bank

Note to reader

Portions of this chapter have been previously published in Protein Sci 2016, 25(3):676-688, and have been reproduced with permission from John Wiley & Sons publishing.

2.1 Background

2.1.1 The Protein Data Bank

The Protein Data Bank (PDB) is the foremost archive of three-dimensional structural information for proteins and nucleic acids. The PDB has experienced impressive growth since its creation in 1971, and as of July 2015, there were over 110,000 entries. PDB structures are obtained primarily by X-ray crystallography (89% of structures) and nuclear magnetic resonance (10% of structures), with a small number of structures obtained by electron microscopy and other methods. In this study, we have focused on the missing residues from X-ray crystal structures where multiple PDB structures representing the same sequence are available.

2.1.2 Missing regions in the PDB

Missing residues in a three-dimensional crystal structure occur due to regions of low or poorly defined electron density that cannot be resolved into a single point in space.

Oftentimes, this is due to dynamic atomic movement resulting in non-coherent X-ray scattering, and therefore it is not surprising that these missing regions were some of the first to be called “disordered” by the scientific community [10]. However, it is important to note that this early use of the word was meant to encompass a wide range of structural possibilities. A missing region in the electron density map of a crystal structure indicates the lack of a single stable structure, but it is not a direct measure of the cause. This “disorder” was divided roughly by Bennett in 1984 into dynamic and static disorder [112] (Figure 11).

Dynamic disorder, he proposed, was caused by continual motion in the protein, whereas static disorder encompassed all other possibilities, such as multiple stable conformations or crystal packing imperfections. In 1998, Garner et al. introduced the term domain wobble to describe missing regions that result from cooperative movements of a structurally intact unit, which are typically facilitated by a small flexible hinge [113]. They also differentiated these regions, along with structural ensembles, from intrinsic disorder.

Figure 11 Missing regions in X-ray crystal structures can have many causes.

A precise definition of intrinsic disorder in the PDB is further complicated by the presence of conditionally (dis)ordered regions [114] and partially disordered regions, introduced in section 1.3. Conditionally disordered regions are intrinsically disordered under some conditions and structured under others. This most often manifests as a disorder-to-order transition upon binding, which is often facilitated by molecular recognition features (MoRFs) [26]. There are also many examples of proteins that have transient or cryptic disorder, which is functionally relevant disorder that arises only under certain conditions [24]. Partially disordered or semi-disordered [115] regions display intermediate amounts of highly flexible, residual, and/or transient secondary structure. Both conditional disorder and partial disorder are difficult to detect experimentally and to predict computationally.

It has long been understood that not all missing regions in X-ray crystal structures are intrinsically disordered. Static disorder, wobbly/mobile domains, packing imperfections, and missing regions that arise from experimental conditions would not be considered intrinsically disordered. Therefore, for the sake of clarity, we will refer to protein regions with missing electron density as missing regions and consider these as distinct from, but often correlated with, IDPRs.

2.1.3 B-factors

The B-factor (also called a temperature factor) of an atomic coordinate in a PDB file describes the average displacement of atoms from their mean position in a crystal structure. Therefore, the B-factor is usually interpreted as an indication of local flexibility in a protein or the degree of solvent accessibility, but it can also be correlated with crystallographic resolution and crystal-packing contacts [116]. In addition to variable interpretations, the B-factors themselves can be highly variable and may require normalization for proper interpretation [116]. Due to the lack of a consistent definition for B-factor values, disorder prediction methods are not usually trained using B-factor data, and B-factor data is not typically used as an indication of disorder, with the exception of the prediction of loops with high B-factors by the disorder predictor DisEMBL [117]. High B-factors were used to determine Vihinen’s flexibility scale, which roughly correlates with disorder-promoting residues (Figure 6) [68], and is commonly used to display the biased composition of disordered regions.

Radivojac et al. studied the characteristics of high B-factor regions, low B-factor regions, and missing regions and found that high B-factor regions were similar to short missing regions, with some differences in the amino acid compositions [118]. In practice, missing regions provide a better indication of intrinsic disorder in X-ray crystal structure data than do B-factors, and therefore we have focused exclusively on the measurement of missing regions for this study.

2.1.4 Missing regions and the development of disorder prediction

Bioinformatics tools, introduced in section 1.8.2, have played a large part in establishing the field of intrinsic disorder, and in the study of IDPs/IDPRs, particularly at the proteome level, where high-throughput experimental methods to recognize intrinsic disorder are lacking. In order to help fill this gap, over 70 in silico predictors of intrinsic disorder have been developed [73, 74] (introduced in section 1.8.2.2). Predictors of intrinsic disorder typically use sequence-based features to predict the likelihood that a particular residue or region is intrinsically disordered. The experimental identification of IDPRs often requires a consensus of methods that may leave some uncertainty as to the nature of the disorder and the precise location. Therefore, the development of datasets of known intrinsically disordered regions that can be used to train predictors is a slow and arduous process. DisProt [82] provides the largest and most well-known database of experimentally verified intrinsically disordered regions. However, at 694 entries (as of July 2015), its coverage is infinitesimal compared to the predicted amount of intrinsic disorder in various proteomes, and it is unlikely to be fully representative. Several groups have compiled NMR-based datasets as well [55, 79]; however, the largest dataset of experimentally indicated IDPRs continues to come from X-ray crystal structures in the PDB.

While one can address the problem of scarce experimental data by using missing regions as an indication of disorder, missing electron density is also a weaker indication of an IDPR than NMR or a consensus of multiple methods. Therefore, the use of data from the PDB for training and testing predictors introduces some uncertainty. It is likely that noise in the training and testing sets for disorder predictors is currently the largest bottleneck to increased accuracy. For instance, the Critical Assessment of protein Structure Prediction (CASP) competition, which measures the accuracy of disorder predictors, uses missing regions in newly published X-ray crystal structures to measure the accuracy of competing predictors, despite the acknowledgement that these missing regions could arise for multiple reasons, including annotation errors [75]. Disorder predictors are often refined for best performance against CASP datasets, but this does not necessarily mean that they are best optimized to predict in vivo intrinsic disorder. The fidelity of datasets of IDPRs is of utmost importance; therefore, it is critical that we continue to examine the best ways to extract genuine intrinsic disorder data from the PDB.

2.1.5 Previous studies

Several studies have examined intrinsic disorder in the PDB [119-123]. Of particular interest to us were the ambiguous or dual personality fragments, defined in 2007 by Le Gall [119] and Zhang [120], respectively. These are regions in PDB chains where multiple structures of the same sequence show a conflict between missing and observed assignments.

The PDB currently contains nearly three times as many entries as it did in 2007, when Le Gall and Zhang published their works. With this expanded source data, we were able to further investigate these ambiguous regions by preparing a large dataset that consists only of UniProt sequences that have at least two structures (PDB chains) available. Furthermore, using precompiled information from the PDB providing per-residue assignments of missing residues and secondary structure has allowed us to simplify this analysis and provide an easy-to-use method for proteomics studies that make use of PDB data.

2.2 Results

2.2.1 A new method for the characterization of missing regions

PDB files contain coordinates for a molecular structure (usually a protein) in three-dimensional space. A single file may have one structure, or it may contain multiple homogeneous or heterogeneous structures in complex. Each structure is assigned a chain identifier, and in this study, we call these individual structures PDB chains, to distinguish them from the PDB file, which may contain multiple chains. In most cases, some or all of a PDB chain can be mapped to a UniProt identifier [124], which provides sequence information for the entire protein. However, it is often the case that the PDB chain does not contain the entire UniProt sequence, or it may happen that a single PDB chain contains mappings to multiple UniProt identifiers, or has additional non-mapped residues. Therefore, we treat the PDB file, the PDB chain, and the UniProt identifier as three separate entities. A PDB file may be mapped to multiple PDB chains, a single UniProt identifier may be mapped to multiple PDB chains and multiple PDB files, and multiple UniProt identifiers may be mapped to a single PDB chain.

Our base dataset consists of PDB chains that contain identical sequence residues in at least some portion of the chain, where those residues can be mapped to all or part of a UniProt identifier. A PDB chain is a single contiguous peptide or protein in a PDB file, where some PDB files may contain multiple heterologous or homologous chains in complex. We have developed a method that allows us to classify missing regions in these PDB chains according to the pattern the missing residues display when those chains disagree. Our method employs the following steps, which are outlined in Figure 12:

1. Create a representative sequence for each PDB chain composed of missing residues, uncharacterized residues, and secondary structure information.

2. Create a representation of the UniProt sequence by compiling information over all PDB chains and recording only observed, missing, and uncharacterized assignments.

3. For each missing region in the UniProt sequence, assign a category (conserved, contained, conflicting, overlapping, or discarded), established by the pattern of missing residues between PDB chains.


We used the following definitions for a single-residue column across multiple PDB chains:

• Uncharacterized: No PDB chain has an observed or missing residue in this position.

• Characterized: At least one PDB chain has an observed or missing residue in this position.

• Observed: At least one PDB chain has an observed residue in this position, and no PDB chains have a missing residue in this position.

• Missing residue: At least one PDB chain has a missing residue in this position.

• Missing region: There are at least three contiguous missing residues from the composite of all structures.
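The column definitions above can be illustrated with a minimal Python sketch. It is not the published implementation [111]. It assumes that each PDB chain has already been aligned to the UniProt sequence as an equal-length string, with X for a missing residue, - for an uncharacterized position, and any other character for an observed residue; the function names and the O symbol for an observed column are hypothetical.

```python
from itertools import groupby

# Assumed single-character codes (hypothetical encoding for this sketch).
MISSING, UNCHAR, OBSERVED = "X", "-", "O"

def column_state(column):
    """Collapse one residue column (one character per PDB chain)."""
    if MISSING in column:
        return MISSING      # at least one chain is missing here
    if all(c == UNCHAR for c in column):
        return UNCHAR       # no chain characterizes this position
    return OBSERVED         # observed somewhere, missing nowhere

def composite(chains):
    """Build the composite string from aligned, equal-length chain strings."""
    if len({len(c) for c in chains}) != 1:
        raise ValueError("chains must be aligned to the same length")
    return "".join(column_state(col) for col in zip(*chains))

def missing_regions(comp, min_len=3):
    """Return half-open (start, end) spans of >= min_len contiguous missing residues."""
    spans, pos = [], 0
    for state, run in groupby(comp):
        n = len(list(run))
        if state == MISSING and n >= min_len:
            spans.append((pos, pos + n))
        pos += n
    return spans

chains = ["--HHHXXXXEE-", "--HHHHXXXXE-", "---HHHHHHEE-"]
comp = composite(chains)
regions = missing_regions(comp)
```

Because a column counts as missing when any chain is missing at that position, the composite region can be longer than the missing stretch in any single chain.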

The missing region categories were assigned based on the following criteria, and in the following order:

• Conserved: The contiguous missing residues are present in all PDB chains.

• Conflicting: At least one PDB chain was completely observed in the missing region.

• Contained: At least one PDB chain had the full length of the missing region, and all other regions were the same length or contained within (but not completely conserved).

• Overlapping: The missing regions overlap or are contiguous, but no one structure has a missing region that contains all others.

If there was a missing region in only one structure and there was not a fully characterized region in any of the other structures, the missing region was discarded because we felt this left insufficient information for comparison.
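The ordered criteria can be sketched in Python as follows. This is one possible reading of the rules, not the published code [111]. It assumes the same hypothetical chain encoding (X for missing, - for uncharacterized, any other character for observed), applies the discard rule before the other categories, and ignores chains that do not cover the region at all; each of these choices is an assumption made for the sketch.

```python
MISSING, UNCHAR = "X", "-"  # assumed encoding, hypothetical

def classify_region(chains, start, end):
    """Assign a category to the composite missing region [start, end).

    chains: equal-length strings aligned to the UniProt sequence.
    """
    slices = [chain[start:end] for chain in chains]
    has_missing = [MISSING in s for s in slices]
    fully_observed = [MISSING not in s and UNCHAR not in s for s in slices]

    # Discard rule (assumed to be checked first): only one chain reports
    # missing residues and no other chain fully characterizes the region.
    if sum(has_missing) == 1 and not any(fully_observed):
        return "discarded"

    # Chains that are entirely uncharacterized here are ignored (assumption).
    covering = [s for s in slices if set(s) != {UNCHAR}]

    if all(set(s) == {MISSING} for s in covering):
        return "conserved"      # every covering chain is missing throughout
    if any(fully_observed):
        return "conflicting"    # some chain observes the whole region
    if any(set(s) == {MISSING} for s in covering):
        return "contained"      # one chain spans the full missing region
    return "overlapping"        # no single chain contains all the others

examples = {
    "conserved": classify_region(["HHXXXXHH", "HHXXXXHH"], 2, 6),
    "conflicting": classify_region(["HHXXXXHH", "HHHHHHHH"], 2, 6),
    "contained": classify_region(["HHXXXXHH", "HHHXXHHH"], 2, 6),
    "overlapping": classify_region(["HXXXHHHH", "HHHXXXHH"], 1, 6),
    "discarded": classify_region(["HHXXXXHH", "--------"], 2, 6),
}
```

Testing the categories in a fixed order means that the contained test only needs to look for a chain that is missing across the entire region, since the conserved and conflicting cases have already been excluded.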

Our final dataset consisted of 19153 UniProt entries, representing 54937 PDB files and 147800 PDB chains. 5% of the residues were missing and 34% of the residues were uncharacterized, which means they were not crystallized in the experiment. Therefore, it is not surprising that the set of PDB chains was significantly shorter overall than the corresponding set of UniProt sequences. The shortest PDB chain was 4 residues in length, and the longest PDB chain was 4187 residues in length, with an average length of 250 residues across all PDB chains. The shortest UniProt sequence was 7 residues in length, and the longest was 7737 residues in length, with an average length of 419 residues (Figure 13).

Figure 12 The classification scheme for PDB sequence regions used in this study.


Figure 13 The distribution of protein lengths for the PDB chains used in this study compared to the distribution of lengths for the corresponding UniProt entries.

2.2.2 Ambiguous regions have greater secondary structure variation

Ambiguous regions, by definition, are missing regions that have observed residues in some of their associated PDB chains. Therefore, we were able to compare the difference in the secondary structure assignments between the observed portions of the ambiguous regions and the fully observed regions in our dataset. Secondary structure assignments are provided in a single file by the PDB (available at http://www.rcsb.org/pdb/files/ss_dis.txt) and are calculated by the DSSP (Define Secondary Structure of Proteins) program [83]. These are not secondary structure predictions, but rather they are calculated by rigorously defined geometrical restraints based on the three-dimensional structure of the protein [125]. When no defined geometrical restraints are met, the secondary structure for that residue position is left blank by DSSP. These irregular assignments are not devoid of information, however, because the lack of assignment indicates that these regions have low curvature and lack hydrogen-bonded structure [125]. We assigned the letter P to these residues and found that they were very highly represented in ambiguous regions. In addition to irregular assignments (P), ambiguous regions are also enriched in hydrogen-bonded turns (T) and bends (S), while observed regions are enriched in alpha helices (H) and beta sheets (E) (Figure 14(A)). A list of secondary structure, missing residue, and uncharacterized assignments and their abbreviations is provided in Table 1.

Table 1 Secondary structure abbreviations.

P = low curvature without H-bonded structure

H = α-helix

B = residue in isolated β-bridge

E = extended strand, participates in β ladder

G = 3-helix (3₁₀ helix)

I = 5 helix (π-helix)

T = hydrogen bonded turn

S = bend

- = uncharacterized

X = not observed (missing)
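As an illustration of how the abbreviations in Table 1 could be combined into a single per-chain string, the following Python sketch merges a secondary structure string with a per-residue missing-residue string. It assumes the two strings are already available for one PDB chain and are of equal length, that a blank character marks a residue with no DSSP assignment, and that X in the second string flags a missing residue; this input encoding and the function name are assumptions and are not taken from the published method. Uncharacterized positions (-) arise only after mapping to the UniProt sequence and are not handled here.

```python
def chain_representation(secstr, missing_flags, blank=" ", flag="X"):
    """Merge DSSP assignments and missing-residue flags for one PDB chain.

    secstr:        one DSSP character per residue, blank if unassigned
    missing_flags: one character per residue, `flag` if the residue is missing
    """
    if len(secstr) != len(missing_flags):
        raise ValueError("inputs must have the same length")
    out = []
    for ss, miss in zip(secstr, missing_flags):
        if miss == flag:
            out.append("X")   # not observed (missing)
        elif ss == blank:
            out.append("P")   # irregular: no DSSP assignment
        else:
            out.append(ss)    # keep the DSSP letter (H, E, T, S, ...)
    return "".join(out)

rep = chain_representation("  HHHH TT  ", "XX-------XX")
```

Recording the irregular state as an explicit letter, rather than leaving it blank, keeps it distinct from both missing and uncharacterized residues in later column-wise comparisons.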

Ambiguous regions are more likely to have secondary structure variation between different PDB chains in a single-residue column (Figure 14(B, C, D)). Figure 14(B) shows the top 10 most common secondary structure combinations (including those columns with only one secondary structure assignment) in an ambiguous residue column position. Surprisingly, the combination of P and S is actually more common than a beta sheet assignment. The pairs PS, PE, ST, and PH all commonly occur in the same residue position between multiple structures in ambiguous regions. This suggests that between the different PDB chains in an ambiguous region, recognizable secondary structure elements are relaxing to the point where they no longer have a recognized secondary structure type. Figure 14(C) shows the Shannon entropy of the secondary structure within residue columns of observed and ambiguous regions. The Shannon entropy measures the amount of information within a text string, and therefore it increases in proportion to the variety and relative proportion of secondary structure assignments in a single-residue position [126]. Nearly 90% of the observed regions had only 1 secondary structure assignment in a residue column, and over 40% of the ambiguous regions had at least 2 different secondary structure assignments (Figure 14(C, D)).
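The Shannon entropy of a residue column can be computed as H = -Σ p_i log p_i over the relative frequencies p_i of the secondary structure symbols in that column. The Python sketch below is a minimal illustration; the base-2 logarithm and the exclusion of missing (X) and uncharacterized (-) entries are assumptions, since neither detail is stated in this section.

```python
from collections import Counter
from math import log2

def column_entropy(column, ignore="X-"):
    """Shannon entropy (in bits) of the secondary structure symbols in one column."""
    symbols = [c for c in column if c not in ignore]
    if not symbols:
        return 0.0
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((k / total) * log2(k / total) for k in counts.values())

def unique_assignments(column, ignore="X-"):
    """Number of distinct secondary structure symbols in one column."""
    return len({c for c in column if c not in ignore})

single = column_entropy("HHHH")   # one assignment only
mixed = column_entropy("PSX-")    # P and S in equal proportion
```

A column with a single assignment has an entropy of zero, and the value rises both with the number of distinct assignments and with how evenly they are represented.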

Figure 14 Analysis of secondary structure in observed vs. missing regions. (A) The relative fraction of secondary structure assignments on a per-residue basis across all PDB structures with uncharacterized and missing residues removed. (B) The 10 highest-occurring secondary structure combinations in ambiguous region columns and the relative fraction in ambiguous and observed residue columns. (C) The cumulative distribution of Shannon entropy by secondary structure in residue columns. (D) The number of unique secondary structure elements per residue column.

Therefore, results described in this section suggest that:

• Ambiguous regions have greater variation in the secondary structure between multiple PDB files of the same sequence than observed regions.

• Irregular secondary structure, which has low curvature, is highly represented in ambiguous regions.


2.2.3 Different types of missing regions have distinct characteristics

The missing regions in this study are defined as an all-or-nothing composite of the missing residues amongst all PDB chains associated with a particular UniProt ID. 73% (13914/19153) of the UniProt IDs in our set had a missing region of at least 3 residues in length, with an average of 2.3 (31531/13914) regions per UniProt ID within that set. Each missing region was assigned a category, depending on the pattern of missing residues between PDB chains mapped to the same UniProt ID. The quantities of each region sorted by category are provided in Table 2. 62% of the missing regions were less than 10 residues in length, with distinctive differences in the length distribution between each missing region type (Figure 15(A)). Conflicting regions, which have at least one PDB chain that is completely observed, were the shortest on average, and also occurred most often (77%) between multiple files (Figure 15(D)). The overlapping pattern was the longest on average, and was quite rare, with only 1708 examples in our set. Overlapping patterns are composed of 53% uncharacterized residues, yet occur only slightly more often on the ends of the protein (Figure 15(B)). Additionally, 76% are produced between different PDB files (Figure 15(D)), in a similar proportion to the conflicting regions. This suggests that the overlapping pattern may often be an artifact of variable truncation of the PDB chain between multiple files and rarely a “naturally occurring” pattern.

Contained regions, where at least one PDB chain has a longest missing region that encompasses all others, were more than 2.5 times as likely to occur as completely conserved regions, and were the most commonly occurring pattern overall. While contained and conserved regions have similar amino acid compositions (Figure 16), they come from very different file combinations. Conserved regions arise from multiple PDB chains within the same file 67% of the time, far more often than any other pattern. It is likely that many of these are symmetric oligomers, and the identical missing regions arise from identical environmental conditions and interaction circumstances. Conserved regions are rarely seen in situations where PDB chains are pulled from both complex files and monomer files (4%, 208 regions) (Figure 15(D)). This suggests that full conservation of a missing region may be somewhat delicate, and when different environmental factors are present, including different or absent binding partners, the missing region may take on variable lengths, as seen in the contained pattern.

None of the ambiguous region types display the same secondary structure composition as the observed regions (Figure 14(A), Figure 15(C)). The vast majority of the residues in the contained regions are missing (Figure 15(C), inset). However, when residues were observed in the contained regions, almost 50% of the secondary structure assignments were irregular (P), indicating low curvature in these regions. The conflicting regions have fewer missing residues, and therefore more assigned secondary structure, but do not show an increase in helical regions (H) or beta sheets (E), as might be expected if experimental artifacts caused the conflicting regions. Instead, where secondary structure is assigned, conflicting regions show more turns (T) and bends (S).

Figure 16 displays the amino acid composition of each missing region type versus sequence residues from the observed regions. It is displayed using Vihinen’s [68] flexibility index, which sorts amino acids from least to most flexible. It is clear that all missing region types display a composition bias away from globularity and towards flexibility. The differences between the missing region types scale roughly with conserved being the most biased and conflicting being the least (Figure 16). Both conserved and contained regions show a high bias towards methionine, which is likely due to their increased likelihood of occurring at the N-terminus (Figure 15(B)). All the missing region types show a remarkably similar amino acid bias to DisProt. However, there are some differences, such as a reduced amount of proline in the missing regions compared to DisProt.


Table 2 Characterization of the datasets analyzed in this study.

Missing Region Type        Number of Regions   Number of Residues
Conserved                   4744                55040
Contained (ambiguous)      12088               178277
Conflicting (ambiguous)    11848               102410
Overlapping (ambiguous)     1708                42837
Discarded                   1143                15845

Figure 15 Analysis of sequence and PDB file characteristics sorted by missing region type. (A) The cumulative length distribution of missing regions by missing region type. (B) The fraction of regions occurring at different locations along the length of the full protein sequence. The full sequence is divided into 10 sections, and the missing region location is defined as the midpoint of the missing region. (C) The relative secondary structure composition, excluding uncharacterized and missing residues. (Inset) The fraction of residues that are not uncharacterized or missing and are therefore assigned a secondary structure or are irregular. (D) The fraction of missing regions occurring in different PDB file combinations. Mult. Files refers to PDB chains attached to a single UniProt ID that were obtained from more than one PDB file. Mult. Files Monomers refers to missing regions obtained from PDB files containing only one PDB chain. Mult. Files Mixed refers to missing regions that are obtained from multiple PDB files where at least one PDB file had a single PDB chain and at least one PDB file had more than one associated PDB chain. Mult. Files Complexes refers to PDB chains obtained from multiple files that all had more than one PDB chain. Same File Complex refers to PDB chains obtained from a single file.


Figure 16 The amino acid composition of missing regions relative to the observed residues. DisProt vs. PDB select 25 is provided as a reference.

We drew the following conclusions from the results reported in this section:

• Different missing region types have different secondary structure characteristics and different amino acid compositions, and reside in different locations along the primary sequence.

• Missing residues between PDB files show greater variation when there is experimental variation between PDB files.

• The contained pattern appears to be a common result when PDB chains are crystallized under different contexts.


2.2.4 Disorder prediction correlates with missing residue conservation

We measured the fraction of predicted disorder for observed regions, uncharacterized regions, and each missing region type using the predictors IUPred-short, ESpritz X-ray, and DynaMine, displayed in Table 3. Further information on these predictors is available in Chapter 4, Materials and Methods. The disorder, MoRF and binding site predictors, despite using different training sets and prediction methods, were in close agreement, both in the overall percentages, and by a per-residue pairwise comparison of prediction scores, which yielded agreement between 84% and 96% (Table 4, Table 5). As expected, the highest prediction of disorder was within the conserved regions. However, predictions for contained regions were very close, while conflicting regions had the lowest number of predicted disordered residues of the missing region types. Observed regions had very low prediction scores, further validating the sequence-based differences between observed and missing regions. The MoRF and binding predictors followed a similar trend, which would be expected, given these predictors are geared towards binding residues within disordered regions. Uncharacterized regions were also predicted to be significantly more disordered than observed regions. One surprising result is that uncharacterized residues were predicted to be within a MoRF 15% of the time by ANCHOR. This may be because uncharacterized regions are frequently in N- and C-terminal regions.
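A per-residue pairwise comparison of this kind can be sketched in Python as the fraction of residues on which two predictors make the same binary call. The sketch below is illustrative only: the 0.5 cutoff is a placeholder, since each predictor has its own recommended threshold and score direction, and the function names are hypothetical.

```python
def binarize(scores, threshold=0.5):
    """Convert per-residue scores to disorder calls (placeholder threshold)."""
    return [s >= threshold for s in scores]

def fraction_disordered(calls):
    """Fraction of residues called disordered."""
    if not calls:
        raise ValueError("no residues to evaluate")
    return sum(calls) / len(calls)

def agreement(calls_a, calls_b):
    """Fraction of residues on which two predictors make the same call."""
    if len(calls_a) != len(calls_b):
        raise ValueError("predictions must cover the same residues")
    if not calls_a:
        raise ValueError("no residues to compare")
    same = sum(a == b for a, b in zip(calls_a, calls_b))
    return same / len(calls_a)

calls_1 = binarize([0.9, 0.2, 0.6, 0.1])   # hypothetical scores, predictor 1
calls_2 = binarize([0.8, 0.7, 0.4, 0.3])   # hypothetical scores, predictor 2
pairwise = agreement(calls_1, calls_2)
```

Agreement measured this way counts matching ordered calls as well as matching disordered calls, so it tends to be high in regions where most residues are predicted to be ordered.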

We found that the average of the disorder scores was a misleading calculation, however. The majority of the missing regions are predicted to be either 100% ordered or 100% disordered, with little in between (Figure 17). The most dramatic example is the ESpritz X-ray prediction for conflicting regions: 80% of the regions are predicted to be 100% ordered, 13% are predicted to be 100% disordered, and only 7% are somewhere in between. Figure 17 presents an interesting perspective on the differences between each predictor in terms of “spread.” DynaMine tends to show the largest fraction of regions between 0 and 100%, with ESpritz X-ray showing the smallest fraction, and IUPred between the two.

Table 3 Disorder content and content of disorder-based binding sites in the datasets analyzed in this study.

                  Disorder Predictor                   MoRF / Binding Site Predictor
Region Type       IUPred   ESpritz X-Ray   DynaMine    Anchor   DisoRDP DNA   DisoRDP Prot   DisoRDP RNA   MoRFpred
Conserved         0.43     0.48            0.47        0.06     0.07          0.04           0.07          0.10
Contained         0.39     0.41            0.41        0.07     0.07          0.05           0.08          0.08
Overlapping       0.29     0.29            0.29        0.08     0.06          0.05           0.07          0.03
Conflicting       0.15     0.12            0.19        0.02     0.04          0.01           0.08          0.03
Observed          0.05     0.03            0.06        0.02     0.02          0.00           0.06          0.01
Uncharacterized   0.23     0.23            0.23        0.15     0.04          0.07           0.06          0.01

Table 4 Agreement between disorder predictors.

Region Type    DynaMine-ESpritz X-Ray   ESpritz X-Ray-IUPred   IUPred-DynaMine
Conserved      0.84                     0.84                   0.84
Contained      0.84                     0.85                   0.84
Conflicting    0.89                     0.92                   0.88
Overlapping    0.84                     0.85                   0.84

Table 5 Agreement between MoRF/binding site predictors.

Region Type    Anchor-DisoRDP DNA   Anchor-DisoRDP RNA   Anchor-DisoRDP Prot   Anchor-MoRFpred   DisoRDP DNA-DisoRDP Prot
Conserved      0.88                 0.88                 0.93                  0.86              0.89
Contained      0.88                 0.86                 0.91                  0.87              0.89
Conflicting    0.95                 0.91                 0.98                  0.95              0.96
Overlapping    0.88                 0.86                 0.91                  0.91              0.89

Region Type    DisoRDP DNA-DisoRDP RNA   DisoRDP DNA-MoRFpred   DisoRDP RNA-DisoRDP Prot   DisoRDP RNA-MoRFpred   MoRFpred-DisoRDP Prot
Conserved      0.87                      0.85                   0.88                       0.85                   0.87
Contained      0.87                      0.88                   0.87                       0.86                   0.88
Conflicting    0.90                      0.94                   0.91                       0.90                   0.96
Overlapping    0.88                      0.92                   0.88                       0.91                   0.93


Figure 17 The fraction of the set of each missing region type vs. the fraction of predicted disorder for a given missing region. A Savitzky-Golay filter was applied to smooth intermediate values for clearer viewing.

Therefore, data reported in this section clearly show that:

• Missing regions, as well as uncharacterized regions, are predicted on average to be more disordered than observed regions.

• The amount of average predicted disorder for each missing region type scales with the level of missing residue conservation (with conserved regions being the most conserved, and conflicting regions being the least conserved) in the region.

• In most cases, missing regions in the dataset are predicted to be either 100% disordered or 100% ordered, with little in between, and the average disorder scores are mostly determined by the relative fractions of each.


2.2.5 Static disorder and wobbly domains are rare in the PDB

One interpretation for the narrow split between regions predicted to be entirely ordered or entirely disordered could be that the line between the two is a rough divider between static and dynamic disorder. In order to investigate this, we felt the best candidate subset of our data for static disorder was conflicting regions that were predicted to be 100% ordered by a 3/3 consensus of IUPred-short, ESpritz X-ray, and DynaMine. This subset was composed of 7033 regions, representing 59% of the total conflicting regions. If we start with the assumption that static disorder occurs in regions that are still essentially structured, then it would make sense that these regions should have the same amino acid composition as the observed regions. We compared the amino acid composition to the observed region composition, and found that this subset had a composition bias suggestive of flexibility, though less so than other missing regions or conflicting regions as a whole (Figure 18(A)). This result suggests that static disorder may be uncommon, and that prediction of structure by disorder predictors may not be the best indicator of static disorder. Instead, many of these conflicting regions may arise from conditionally or partially disordered residues, which are difficult to detect by disorder predictors.

In order to investigate the probable incidence of wobbling domains, we examined long missing regions at least 50 residues in length as a subset. Domain wobble describes the movement of a large structured region, typically facilitated by a smaller flexible region at the edges of the domain. As a result of this behavior, we expected that domain wobble may have a pattern of predicted disorder at one or both of the ends of the domain, and predicted structure in the center. Therefore, for this subset of our data containing long missing regions, we looked at the pattern of predicted disorder, by a 2/3 consensus of IUPred-short, ESpritz X-ray, and DynaMine.


We then examined the distribution of the disorder scores across each individual missing region (Figure 18(B)). We divided the missing region into three segments, consisting of the first 20%, the middle 60%, and the last 20% of the region. If there were at least five disordered residues by consensus, the location of those residues along the region was recorded. If all disordered residues occurred within the first or last 20% of the region, and they all occurred at the beginning or end of the protein chain (the starting point and ending point were defined as the first and last characterized residue from all the PDB chains), then these regions were assigned as Tails. If all disordered residues occurred in the first and last 20% of the region and some or all were not on a tail region, then these were assigned as Ends. If all disordered residues occurred in the middle 60% of the region, these were assigned as Centered. All others were considered to be Dispersed, which includes 100% disordered regions. All those with fewer than five disordered residues were labeled as Ordered.
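The assignment rules above can be sketched in Python. This is an interpretation for illustration rather than the code used in the study. It assumes that per-residue consensus calls are available for the missing region, and that a flank counts as a tail when that edge of the region coincides with the first or last characterized residue of the protein; the flags at_n_term and at_c_term and the function names are hypothetical.

```python
def consensus(calls_by_predictor, min_votes=2):
    """Per-residue consensus: True where at least min_votes predictors agree."""
    return [sum(col) >= min_votes for col in zip(*calls_by_predictor)]

def disorder_pattern(disordered, at_n_term=False, at_c_term=False,
                     min_disordered=5):
    """Label one long missing region by where its disordered residues fall."""
    n = len(disordered)
    hits = [i for i, d in enumerate(disordered) if d]
    if len(hits) < min_disordered:
        return "Ordered"
    # Integer arithmetic avoids floating-point edge cases at the 20%/80% marks.
    first = [i for i in hits if 5 * i < n]         # first 20% of the region
    last = [i for i in hits if 5 * i >= 4 * n]     # last 20% of the region
    middle = [i for i in hits if n <= 5 * i < 4 * n]
    if not middle:
        # All disordered residues sit in the flanks of the region.
        tails_only = (not first or at_n_term) and (not last or at_c_term)
        return "Tails" if tails_only else "Ends"
    if not first and not last:
        return "Centered"
    return "Dispersed"

flank = [True] * 6 + [False] * 44
labels = [
    disorder_pattern(flank),                                   # hinge-like flank
    disorder_pattern(flank, at_n_term=True),                   # flank at chain start
    disorder_pattern([False] * 20 + [True] * 10 + [False] * 20),
    disorder_pattern([True] * 50),
    disorder_pattern([True] * 4 + [False] * 46),
]
```

Under these assumptions, only the Ends label corresponds to the hinge pattern expected for domain wobble, because Tails are flanks that lie at a chain terminus rather than next to a structured domain.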

We expected that wobbly domains would have disordered residues concentrated at one or both ends (but not in a tail region), therefore indicating a small flexible hinge that could move the larger structured region. The incidence of this pattern was low overall, with only 43 out of 865 regions displaying a possible hinge and large movable domain pattern. Much more common was a dispersed pattern, with 410 regions displaying a spread-out pattern of predicted disordered residues, and only 253 regions predicted to be completely ordered. This supports the conclusion that domain wobble is probably rare, and it is more likely that many of these regions are conditionally or partially disordered.


Figure 18 An analysis of possible static disorder. (A) The composition of conflicting missing regions predicted to be ordered by a 3/3 consensus of IUPred-short, ESpritz X-ray, and DynaMine. DisProt vs. PDB select 25 is provided for reference. (B) The relative fraction of long missing regions (> 49 residues) that fall into each disorder distribution. Ends refers to disorder scores only occurring in the first or last 20% of the missing region, when those residues do not occupy a tail position. Tails refers to disorder scores in a missing region occurring only at the ends of the PDB chain. Centered refers to disorder scores only occurring in the center 60% of the missing region. Dispersed refers to all other cases with 5 or more disordered residues. If there are fewer than 5 disordered residues, the region is considered Ordered.

We drew the following conclusions for this section:

• Conflicting regions that are predicted to be 100% ordered still display composition bias towards flexibility.

• Long missing regions rarely display a predicted hinge pattern that would be suggestive of domain wobble.

• Static disorder and domain wobble are probably rare in the PDB.

2.3 Conclusions

We have introduced a method for easily creating and categorizing a dataset of missing regions when there are multiple PDB chains attached to a single UniProt identifier. This classification scheme further divides ambiguous regions, those where PDB structures disagree as to whether a given residue is missing or observed, into three categories: conflicting, contained, and overlapping. This classification scheme may be useful in the investigation of individual proteins, large sets of proteins, and the development and refinement of disorder prediction software. Furthermore, we have provided analysis that will help clarify the nature of ambiguous missing regions.

Our analysis provides further validation that there is a measurable difference between missing regions and observed regions, which indicates increased flexibility. Missing regions have a greater variation in secondary structure, an amino acid composition biased in favor of intrinsic disorder, and a significantly higher fraction of residues that are predicted to be disordered. Furthermore, it appears that the extent of these differences roughly scales with the level of ambiguity in the region. Fully conserved regions show the strongest indications of intrinsic disorder, followed by contained, overlapping, and conflicting patterns. However, our analysis also shows that ambiguity is more likely to arise as different PDB chains with the same sequence are exposed to greater environmental differences. Our results indicate that perfect conservation in a missing region should not necessarily be correlated to higher confidence that a region is intrinsically disordered. Variable lengths of the missing region between different files may be a very natural result when intrinsically disordered regions are exposed to different environments. Additionally, conflicting regions should not necessarily be discarded from IDP sets, as they may simply be an indication of conditional disorder placed within different contexts. In other words, whether the missing region displays a pattern of conservation, ambiguity, or conflict may in some cases be a function of the differences between the source files rather than the extent of the disorder. We found little evidence of static disorder and domain wobble, and suspect that the incidence is probably quite low. Instead, it is likely that the ambiguous regions in the PDB are a rich source of conditional and partial disorder.

In summary, results reported in our study support the following main conclusions:

• In the majority of cases, the characteristics of missing regions indicate protein intrinsic disorder instead of static disorder, domain wobble, or experimental artifacts.

• The presence of an ambiguous region and the degree of ambiguity in that region is more likely to indicate varying levels of conditional or partial disorder, rather than static disorder.

3. Intrinsic Disorder in Enzymes

3.1 Background

3.1.1 Intrinsically disordered enzymes in the literature

Enzymes have been central to the development of the structure-function paradigm, which tells us that the unique three-dimensional structure of a protein is the key to understanding that protein’s function. Therefore, as the widespread prevalence and functional relevance of protein intrinsic disorder have been demonstrated throughout the years, enzymes have consistently been considered an exception to the rule. This assumption has largely gone unchallenged and unquantified. This is likely due to an eclipsing focus on the catalytic region of enzymes, as the story of PTP1B in section 1.10.1 demonstrates. However, catalysis alone does not fully describe the complicated life of an enzyme that may need to bind many different partners and substrates, be intricately regulated, or have inducible multi-functionality.

There is an understanding that many enzymes have a certain amount of conformational flexibility and that in fact all proteins interconvert to some extent; however, the relationship of this movement to catalysis is still hotly debated [127-136]. Observations of large domain movements via small hinge regions, small local vibrations, and ensembles of structures have been regular fixtures in the discussion of enzyme structure and function. Therefore, we would like to distinguish our discussion of IDPs and IDPRs by focusing specifically on longer regions of protein intrinsic disorder that spend some part of their functional life cycle without a fixed three-dimensional structure, and without a specific and defined range of movement. On a practical level, these are frequently the regions that cause experimental difficulties, and may be removed before experimentation.

Interestingly, some of the earliest examples of IDPRs came from enzymes. In the late 1970s, trypsinogen [137, 138], the precursor to the serine protease trypsin, provided one of the most extensively documented disordered regions. Disorder facilitates the activation and regulation of trypsin in two ways. The first is through increased susceptibility to cleavage, which converts trypsinogen to trypsin, the active form. The second is through an IDPR which remains after cleavage and tethers a small two-residue knob, which has the spatial flexibility to search for and bind a hole within trypsin, promoting a disorder-order transition and activating trypsin.

Another early example of an intrinsically disordered enzyme came in 1983, when Manalan and Klee showed that the phosphatase calcineurin can be activated by the cleavage of an exposed IDPR, which simultaneously prevents calmodulin binding [33]. Later studies showed that this happens due to an intrinsically disordered autoinhibitory domain in calcineurin that impedes the active site. Upon binding to calmodulin, the autoinhibitory domain becomes structured, and moves away from the active site, therefore activating calcineurin [139] (Figure 19). Cleavage of the autoinhibitory domain therefore activates the enzyme, but also removes a critical regulatory mechanism.

Regulation through an IDPR is a common theme in many enzymes. CTP:phosphocholine cytidylyltransferase (CCT) has a disordered tail region that acts as an inhibitor of catalysis in the unbound form of the enzyme. However, this IDPR also appears to facilitate binding to the cell membrane, and upon binding the inhibition is relieved, and the enzyme becomes active. Ding et al. performed a number of experiments with chimeric forms of the enzyme and found that this regulatory region was able to accommodate significant changes in length and sequence while still being functionally viable [140].

Figure 19 The activation of calcineurin by calmodulin through a disorder-order transition. In the absence of calmodulin, the disordered autoinhibitory domain impedes substrate access by hovering near the active site. Upon binding to calmodulin the autoinhibitory domain undergoes a disorder-order transition and is displaced from the active site. Reprinted from Ye Q, Feng Y, Yin Y, Faucher F, Currie MA, Rahman MN, Jin J, Li S, Wei Q, Jia Z: Structural basis of calcineurin activation by calmodulin. Cell Signal 2013, 25(12):2661-2667, with permission from Elsevier.

Human glucokinase provides an example of regulation through a small, but critical IDPR that regulates the turnover number (kcat) of the enzyme in a glucose-sensitive manner. Upon binding to glucose the IDPR undergoes a disorder-order transition that causes the enzyme to become catalytically active [45]. Therefore, when glucose is low, the small IDPR in glucokinase creates a built-in time delay caused by an order-disorder transition (Figure 20).

Disordered regions may also be directly involved in catalysis. The critical selenocysteine of the selenoprotein VIMP reductase is found in the disordered C-terminal region [141] (Figure 21). The disorder in this region may allow the protein to effectively interact with a variety of partially unfolded endoplasmic reticulum associated degradation (ERAD) substrates. Similarly, the resolving cysteine in MsrB1 is present in the disordered N-terminal region [142] (Figure 22). The flexibility in this region is critical to catalytic action, as it allows the resolving cysteine to fold up into the proper position. Additionally, flexibility in this region may enable interaction with diverse substrates.

Figure 20 A schematic of kinetic regulation by an order-disorder transition in glucokinase. (A) The small domain of unliganded GCK is intrinsically disordered, giving rise to a broad conformational ensemble. (B) Glucose binding, activator binding, or an activating PHHI-associated mutation promotes folding of the disordered regions in the small domain, narrowing the conformational distribution. Upon formation of the GCK–glucose binary complex, ATP binds and catalysis proceeds with little additional reorganization. (C) Following release, ordered unliganded GCK persists until the small domain undergoes an order–disorder transition on the millisecond time scale, allowing access to the “time delay loop” (red): Under low glucose concentrations, the delay loop is operational, leading to slow turnover and kinetic cooperativity. Under high glucose concentrations (or when GCK is activated), the delay loop is effectively bypassed, turnover is fast, and cooperativity is eliminated (green). Reprinted under the creative commons license from Larion M, Salinas RK, Bruschweiler-Li L, Miller BG, Brüschweiler R: Order-disorder transitions govern kinetic cooperativity and allostery of monomeric human glucokinase. PLoS Biol 2012, 10(12):e1001452.

Figure 21 The intrinsically disordered C-terminal region of the reductase VIMP contains a selenocysteine that is critical for catalysis. The disordered region encompasses residues 123-189 (shown in green). This research was originally published in The Journal of Biological Chemistry. Christensen LC, Jensen NW, Vala A, Kamarauskaite J, Johansson L, Winther JR, Hofmann K, Teilum K, Ellgaard L: The human selenoprotein VCP-interacting membrane protein (VIMP) is non-globular and harbors a reductase function in an intrinsically disordered region. J Biol Chem 2012, 287(31):26388-26399 © the American Society for Biochemistry and Molecular Biology.

Figure 22 The N terminal region of MsrB1 samples a wide range of dynamic conformations, and contains a resolving cysteine. MsrB1 structural family consisting of 20 conformers with the lowest target function. This research was originally published in The Journal of Biological Chemistry. Aachmann FL, Sal LS, Kim HY, Marino SM, Gladyshev VN, Dikiy A: Insights into function, catalytic mechanism, and fold evolution of selenoprotein methionine sulfoxide reductase B1 through structural analysis. J Biol Chem 2010, 285(43):33315-33323 © the American Society for Biochemistry and Molecular Biology.

Disorder has been shown in some cases to be necessary for substrate recognition, binding, and promiscuity. In particular, enzymes involved in ubiquitination and deubiquitination, which must interact with a large number of unique substrates, appear to be enriched in disorder [37]. The disordered regions of the E3 ubiquitin ligase San1 [143] and the deubiquitinase Ubp10 [144] are punctuated by small ordered regions that recognize a large number of substrates, while the E2 ubiquitin-conjugating enzyme Ube2w uses disorder to specifically recognize the disordered N termini of its substrates [145] (Figure 23).

Figure 23 The disordered C-terminal region of Ube2w helps bind diverse substrates. A) Side-view of the full Ube2w ensemble looking down the helix-3 axis reveals the three clusters. B) In all twenty members of the Ube2w ensemble residues N136-W145 occupy positions beneath the active site, C91 (orange). Residues 119-135 are not shown for clarity. Adapted with permission from Macmillan Publishers Ltd: Nature Chemical Biology Vittal V, Shi L, Wenzel DM, Scaglione KM, Duncan ED, Basrur V, Elenitoba-Johnson KS, Baker D, Paulson HL, Brzovic PS et al: Intrinsic disorder drives N-terminal ubiquitination by Ube2w. Nat Chem Biol 2015, 11(1):83-89, copyright 2015.

There are a number of examples of enzymes that use disorder as a flexible tether to form transient protein complexes and interact with nucleic acids. The DNA helicase Sgs1 has a disordered N-terminus that facilitates stability and binding to Top3/Rmi1 [146]. The membrane-bound prokaryotic enzyme RNase E has an extended IDPR in the C-terminal region that is punctuated by MoRFs. These small microdomains allow RNase E to use an extended region of disorder as a scaffold to form a multi-enzyme RNA degradosome [147] (Figure 24). The DNA glycosylase NEIL1 provides an example of disorder that facilitates a large number of protein interactions. NEIL1 has a disordered C-terminal region that interacts with a large number of base excision repair proteins. It is likely that the flexibility in this IDPR enables increased specificity [148]. Interestingly, the disordered C-terminal region in NEIL1 also exerts a stabilizing influence on the catalytic region.

Figure 24 RNase E forms a flexible scaffold for protein interactions. A) The primary binding partners of RNase E form the degradosome. B) The disordered C-terminal region forms a flexible scaffold. Reprinted under the creative commons license from Aït-Bara S, Carpousis AJ, Quentin Y: RNase E in the γ-Proteobacteria: conservation of intrinsically disordered noncatalytic region and molecular evolution of microdomains. Mol Genet Genomics 2014.

Figure 25 The disordered C-terminus of NEIL1 allows it to engage in multiple molecular interactions. (A) The ab initio shape predicted from the experimental data shows not only a compact volume consistent with the crystal structure but also a volume extending from the C-terminal portion of the protein. (B) BilboMD models and their percentage representation in the population that were selected by the MES fit the experimental scattering data as an ensemble. Reprinted from Hegde ML, Tsutakawa SE, Hegde PM, Holthauzen LM, Li J, Oezguen N, Hilser VJ, Tainer JA, Mitra S: The disordered C-terminal domain of human DNA glycosylase NEIL1 contributes to its stability via intramolecular interactions. J Mol Biol 2013, 425(13):2359-2371 with permission from Elsevier.

IDPs and IDPRs have been extensively linked to signaling pathways [38]; therefore, it is not surprising that IDPRs in enzymes often play a critical role in signaling. The sulfhydryl oxidase lf-ALR has an IDPR that performs a dual function dependent on its subcellular localization. In the cytosol it contains a mitochondrial targeting signal, while in the intermembrane space it provides a recognition site in the disulfide relay system [40]. The intrinsically disordered juxtamembrane domain in the receptor EGFR contains multiple signals that are critical for receptor trafficking and are modulated by interaction with the membrane [149]. Peptidylglycine alpha-amidating monooxygenase (PAM) is a secretory granule membrane protein that has many phosphorylation sites in its intrinsically disordered cytosolic domain that may help relay information to the cytosol and nucleus [150].

IDPRs in enzymes may also help attenuate or contribute to disease states. The stress-induced variant of acetylcholinesterase, AChE-R, has an IDPR at the C terminus [102]. Interestingly, this variant attenuates Alzheimer’s disease in the mouse, while the variant without the disordered region does not. Alternatively, we see disorder in the enzymes of multiple pathogens, such as adenylate cyclase toxin in B. pertussis [101], alphavirus capsid protease [151], and the HIV-1 protease [152].

A survey of the literature documenting IDPRs in enzymes suggests that protein intrinsic disorder fulfills many of the same functions in enzymes as it does in non-enzymes, such as facilitating DNA binding and protein-protein interactions, and participating in regulation and signaling. However, there are also more puzzling cases of enzymes that do not take on significant structure. The disordered DNA endonuclease anhydrin [153] remains “fuzzy” when bound to DNA, and the GTPases TPPP/p25 [154] and UreG [155] are both catalytically active in an extended molten globule form.

Table 6 summarizes just a few of the many examples of enzymes with functional IDPRs found in the literature. A more extensive list of enzymes with experimentally identified IDPRs is provided in Appendix D.

Table 6 Selected enzymes with experimental evidence of functionally relevant regions of protein intrinsic disorder.

| Name | Enzyme Type | Organism | Disordered Region | References |
|---|---|---|---|---|
| Regulation of Catalysis | | | | |
| calcineurin | phosphatase | Human | central domain 372-467, 18% disordered | [33, 139, 156, 157] |
| glucokinase | kinase | Human | inner loop 151-179, 6% disordered | [45] |
| CCT | choline-phosphate cytidylyltransferase | Various | N/C terminal | [140, 158-160] |
| Participation in Catalysis | | | | |
| VIMP | reductase, peroxidase | Human | C terminal 123-189, 35% disordered | [141, 161, 162] |
| MsrB1 | methionine sulfoxide reductase | Mouse | N/C terminal 1-18, 105-116, 25% disordered | [142] |
| Substrate Recognition, Binding and Promiscuity | | | | |
| San1 | E3 Ub-ligase | S. cerevisiae | N/C terminal, ~60% disordered | [46, 143] |
| Ubp10 | deubiquitinase | S. cerevisiae | N terminal 1-359, 45% disordered | [144] |
| Ube2w | E2 Ub-conjugating | Human | C terminal, ~137-145, 5% disordered | [145] |
| Protein-protein and protein-nucleic acid interactions | | | | |
| Sgs1 | DNA helicase | S. cerevisiae | N terminal 1-125, 9% disordered | [146, 163] |
| NEIL1 | DNA glycosylase | Human | C terminal 290-390, 26% disordered | [148, 164] |
| RNase E | RNase | E. coli | C terminal | [147, 165-167] |
| Signaling | | | | |
| lf-ALR | sulfhydryl oxidase | Human | N terminal, 1-80, 39% disordered | [40, 168, 169] |
| EGFR | tyrosine kinase | Human | Inter-domain, 645-697, 4% disordered | [149, 170] |
| PAM | monooxygenase, lyase | Rat | C terminal, 896-976, 8% disordered | [150] |
| PTP1B | phosphatase | Human | C terminal 300-393, 23% disordered | [110] |
| Disease Involvement | | | | |
| AChE-R | acetylcholinesterase | Human | C terminal 575-603, 5% disordered | [102, 171] |
| adenylate cyclase toxin | adenylate cyclase | B. pertussis | C terminal 1000-1706, 41% disordered | [101, 172-175] |
| alphavirus capsid protease | protease | Aura virus | C terminal | [151] |
| HIV-1 protease | protease | HIV-1 | linker 79-82 | [152, 176-178] |
| Mostly Disordered | | | | |
| anhydrin | DNA endonuclease | A. avenae | mostly disordered | [153] |
| TPPP/p25 | GTPase | Human | mostly disordered | [154, 179-181] |
| UreG | GTPase | B. pasteurii | mostly disordered | [155, 182] |

3.1.2 Previous studies

Multiple proteome-wide investigations of intrinsic disorder have been undertaken using various disorder predictors [94, 183]. Some proteome-level disorder studies have performed Gene Ontology (GO) term analyses that included enzyme-related keywords [84, 95], other studies have focused specifically on ubiquitination pathways [37] and kinases [39], and a recent review addresses protein disorder and catalysis [133]. However, to our knowledge, no large-scale proteome-level examination of intrinsic disorder in enzymes vs. non-enzymes has been undertaken.

3.2 Results

3.2.1 Experimental design

In this study, we have used two complementary datasets in order to examine the incidence of intrinsic disorder in enzymes. The first dataset was compiled from the protein sequences in 66 reference proteomes from the Quest for Orthologs (QFO) project [184]. A 4/7 binary consensus from seven disorder predictors was obtained in order to assess the amount of disorder in this dataset (see section 4.3.3 for more details). Additionally, missing residues from the PDB were used as a complementary experimental indication of disorder. Due to the difficulty in experimentally identifying IDPRs, missing regions in X-ray crystal structures are often used as an indication of IDPRs in order to supplement other experimental datasets. In Chapter 2, it is shown that there are several possible causes of a missing region in an X-ray crystal structure; however, supporting evidence, such as an amino acid composition biased in favor of flexibility, and variability in secondary structure between multiple structures with ambiguous missing regions, suggests that missing regions are very likely to be intrinsically disordered, though these regions may have differing amounts of conditional or residual secondary structure that depend on the environment [185]. The reference proteomes dataset allows us to examine an unbiased and representative collection of proteins, while the PDB dataset allows us to examine experimental indications of disorder. Additionally, we have examined the relationship of disorder to enzyme function as defined by the Enzyme Commission (EC) designation [186], as well as Gene Ontology (GO) [92] enrichment. It is our objective to better quantify the incidence of IDPRs in enzymes compared to non-enzymes, and to broadly ascertain the functional relevance of these regions.
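As an illustration of the 4/7 binary consensus mentioned above, the following minimal sketch assigns a residue as disordered when at least four of seven predictors call it disordered. It assumes that each predictor's output has already been thresholded to a per-residue 0/1 call; the function name and input layout are hypothetical and are not taken from the pipeline described in section 4.3.3.

```python
def binary_consensus(predictions, min_votes=4):
    """Combine per-residue binary calls from several predictors.

    predictions: list of equal-length sequences of 0/1 calls, one sequence
                 per predictor (assumed to be thresholded already).
    min_votes:   number of predictors that must agree on disorder
                 (4 of 7 in the consensus described in the text).
    Returns a list with 1 for consensus disorder and 0 otherwise.
    """
    if not predictions:
        raise ValueError("at least one predictor is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictors must cover the same residues")
    # zip(*predictions) yields one tuple of votes per residue.
    return [1 if sum(votes) >= min_votes else 0
            for votes in zip(*predictions)]


# Seven hypothetical predictors scored over three residues.
calls = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
consensus = binary_consensus(calls)
```

In this toy input the first and third residues each receive four votes and are called disordered, while the second residue receives only two.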

Table 7 details the relative amounts of each enzyme type in the reference proteomes, sorted by taxon, while Table 8 provides similar statistics for the PDB dataset. Due to the low number of archaea in the PDB, these proteins were excluded from this dataset. The majority of enzymes in all data subsets are composed of transferases (EC2) and hydrolases (EC3), followed by oxidoreductases (EC1).

Table 7 The number of proteins in the reference proteome dataset assigned to each EC number, organized by taxon, and enzyme type (or non-enzyme).

| EC# | Description | Eukaryotes | Prokaryotes | Archaea |
|---|---|---|---|---|
| EC1 | Oxidoreductases | 20285 (16%) | 5751 (20%) | 946 (18%) |
| EC2 | Transferases | 43112 (33%) | 8344 (29%) | 1591 (31%) |
| EC3 | Hydrolases | 49002 (38%) | 8784 (31%) | 1521 (29%) |
| EC4 | Lyases | 3278 (3%) | 1563 (5%) | 297 (6%) |
| EC5 | Isomerases | 2915 (2%) | 1059 (4%) | 172 (3%) |
| EC6 | Ligases | 4062 (3%) | 1385 (5%) | 352 (7%) |
| Multi | Multiple EC# | 6860 (5%) | 1704 (6%) | 314 (6%) |
| NE | Non-enzymes | 428126 | 48443 | 10036 |

Multi refers to enzymes that have been assigned to multiple top level EC numbers. Percentages show percent relative to all enzymes.

Table 8 The number of proteins assigned to each EC number in the PDB dataset, organized by taxon, and enzyme type (or non-enzyme) for the set of proteins with at least one continuous missing (CM) region of three residues in length.

| EC# | Description | Eukaryotes | Prokaryotes |
|---|---|---|---|
| EC1 | Oxidoreductases | 341 (16%) | 416 (18%) |
| EC2 | Transferases | 828 (39%) | 750 (33%) |
| EC3 | Hydrolases | 627 (30%) | 461 (20%) |
| EC4 | Lyases | 119 (6%) | 270 (12%) |
| EC5 | Isomerases | 77 (4%) | 148 (7%) |
| EC6 | Ligases | 57 (3%) | 174 (8%) |
| Multi | Multiple EC# | 53 (3%) | 37 (2%) |
| NE | Non-enzymes | 5648 | 7560 |

Multi refers to enzymes that have been assigned to multiple top level EC numbers. Percentages show percent relative to all enzymes.

3.2.2 Enzymes and non-enzymes have a similar incidence of IDPRs.

Figure 26 shows the relative composition of the reference proteomes (top) and the PDB dataset (bottom) by taxon and then further subdivided into enzyme vs. non-enzyme. Due to the larger size of eukaryotic proteomes, eukaryotes were most strongly represented in the reference proteomes dataset. 23% of eukaryotic proteins were assigned as enzymes, while 37% and 34% of prokaryotes and archaea were assigned as enzymes respectively. 92% of eukaryotic enzymes and 93% of eukaryotic non-enzymes were predicted to have at least one continuously disordered (CD) region of at least three residues in length. Interestingly, while the number of proteins with predicted CD regions increases as the taxon becomes more complex, there is negligible difference between enzymes and non-enzymes in our dataset in terms of the number of proteins with predicted CD regions.
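The enzyme fractions quoted above follow directly from the counts in Table 7. The short worked calculation below reproduces them; the variable and function names are illustrative only.

```python
# Per-taxon enzyme counts from Table 7 (EC1-EC6 plus Multi) and the
# corresponding non-enzyme counts.
table7 = {
    "Eukaryotes": ([20285, 43112, 49002, 3278, 2915, 4062, 6860], 428126),
    "Prokaryotes": ([5751, 8344, 8784, 1563, 1059, 1385, 1704], 48443),
    "Archaea": ([946, 1591, 1521, 297, 172, 352, 314], 10036),
}


def enzyme_fraction(enzyme_counts, non_enzymes):
    """Fraction of all proteins in a taxon that carry an EC assignment."""
    enzymes = sum(enzyme_counts)
    return enzymes / (enzymes + non_enzymes)


# Whole-number percentages per taxon.
enzyme_percent = {
    taxon: round(100 * enzyme_fraction(counts, non_enzymes))
    for taxon, (counts, non_enzymes) in table7.items()
}
```

With the Table 7 counts this gives 23% for eukaryotes, 37% for prokaryotes, and 34% for archaea.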

Our PDB dataset is composed of unique UniProt IDs that are mapped to one or more X-ray crystal structures. A single composite missing region is compiled using the method outlined in Section 4.1; therefore, we are able to minimize the bias that is introduced by overrepresented proteins in the PDB. We have removed viruses, as they are a small portion of the PDB and provide a unique case due to the number of polyproteins. There are a large number of prokaryotic proteins represented in the PDB (Figure 26, bottom), especially considering that prokaryotic proteomes tend to be smaller. The fraction of proteins with at least one missing region in the PDB is remarkably similar between different taxa and between enzymes and non-enzymes, at 69-75% for eukaryotes and prokaryotes. Archaea represent the exception, with only 48% of archaeal enzymes having a missing region, compared with 68% for non-enzymes; however, the sample size is small for archaea. Eukaryotic enzymes have slightly more missing regions than non-enzymes, while the numbers are almost identical for prokaryotes.

Figure 26 The composition of the reference proteome dataset (top) and the PDB dataset (bottom). A protein is considered to have a predicted IDPR or missing region if there is a continuous stretch of predicted disorder or missing region of at least 3 residues in length.

3.2.3 Enzymes and non-enzymes have IDPRs of similar lengths

When comparing the fraction of predicted disordered residues per protein, eukaryotic non-enzymes in the reference proteomes dataset have an average of 23% predicted disorder, while enzymes vary from 7-13% depending on the EC assignment (Figure 27A). However, when comparing the average length of the longest CD region between enzymes and non-enzymes, the similarity between the groups increases. Transferases, hydrolases, and enzymes with multiple EC designations have an average longest CD region of 50, 43, and 47 residues, respectively, while non-enzymes have an average longest CD region of 59 residues. A similar trend is observed in the number of CD regions (Figure 27B), with transferases and hydrolases having an average of 4.5 CD regions, enzymes with multiple EC designations having an average of 4.9, and non-enzymes having an average of 4.8.
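The three per-protein measures compared here (fraction of disordered residues, number of CD regions, and longest CD region) can be derived from a per-residue binary disorder call, as in the following minimal sketch. The function names are hypothetical, and the sketch assumes that the fraction counts every disordered residue while CD regions require at least three contiguous residues; the exact conventions of the original analysis may differ.

```python
from itertools import groupby


def cd_regions(calls, min_len=3):
    """Return (start, end) half-open index pairs for each run of
    consecutive 1s that is at least min_len residues long."""
    regions = []
    pos = 0
    for value, run in groupby(calls):
        run_len = len(list(run))
        if value == 1 and run_len >= min_len:
            regions.append((pos, pos + run_len))
        pos += run_len
    return regions


def disorder_summary(calls, min_len=3):
    """Summarize one protein's binary disorder calls."""
    regions = cd_regions(calls, min_len)
    lengths = [end - start for start, end in regions]
    return {
        "fraction_disordered": sum(calls) / len(calls) if calls else 0.0,
        "n_cd_regions": len(regions),
        "longest_cd": max(lengths, default=0),
    }


# Toy 12-residue protein: runs of 3, 2, and 4 disordered residues.
summary = disorder_summary([1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1])
```

In this toy example the run of two residues is too short to count as a CD region, so the protein has two CD regions, the longest of which spans four residues.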

Figure 27 Predicted intrinsic disorder calculations for eukaryotes, prokaryotes, and archaea in reference proteomes. Box plots are first quartile, second quartile (median), and third quartile. Plot whiskers extend to 5% and 95% of the data. A) The fraction predicted disorder. B) The number of continuously disordered (CD) regions per protein. C) The longest CD region.

Overall, there is more variability between enzyme types than between enzymes and non-enzymes with regard to the presence and length of missing regions in the PDB dataset. We have used the same measurements as in the predicted disorder analysis (Figure 27); however, we have only included proteins with a missing region of at least 3 contiguous residues in the analyzed dataset. We have also excluded archaea due to an insufficient dataset size. The average longest continuous missing (CM) region is between 18 and 24 residues, with non-enzymes having an average longest CM length of 22 (Figure 28). The exception is ligases, which show an average longest missing region length of 39 for eukaryotes. This may not generalize, however, considering the small sample size of 57 proteins.

Figure 28 Missing region calculations for eukaryotes, prokaryotes, and archaea in the PDB dataset. Calculations were performed only on those proteins that had at least one continuous missing region that was three residues in length. Box plots are first quartile, second quartile (median), and third quartile. Plot whiskers extend to 5% and 95% of the data. A) The fraction of missing residues per protein. B) The number of continuous missing (CM) regions per protein. C) The longest CM region.

3.2.4 Disorder is specific to enzyme length and type.

Prokaryotic and archaeal proteins are shorter on average than eukaryotic proteins, and non-enzymes are shorter on average than enzymes for all taxa (Figure 29A-C, top). Enzymes with multiple assigned EC numbers are longer for all taxa, which is likely due to the presence of multiple catalytic domains. Transferases, hydrolases, and ligases are the longest in eukaryotes for single EC assignments. We created subsets based on protein length for each EC number and for non-enzymes in order to compare fractional disorder to length (Figure 29A-C, bottom). For very short proteins of all types, fractional disorder is higher, and then dips to a lowest point as the protein length is increased. We expect that many of these very short proteins do not have sufficient length to form a stable core, and are subunits that must be stabilized in order to be functional. Therefore, the correlation between length and fractional disorder for eukaryotes can be described as more parabolic than linear; however, there is significant variety between EC numbers. Prokaryotes and archaea, however, do not show significant variation in fractional disorder based on length by enzyme type, once they reach a minimum stable length.
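The length-based subsetting described above can be sketched as a simple binning step, in which proteins are grouped into length intervals and the mean fractional disorder is computed per interval. The interval width of 100 residues and the function name are assumptions for illustration; the intervals used for Figure 29 are not specified in this section.

```python
from collections import defaultdict


def mean_disorder_by_length(proteins, bin_width=100):
    """Group proteins into length intervals and average their disorder.

    proteins:  iterable of (length, fraction_disordered) pairs.
    bin_width: width of each length interval in residues (assumed value).
    Returns {interval_start: mean fraction disordered}, ordered by start.
    """
    totals = defaultdict(lambda: [0.0, 0])  # start -> [sum, count]
    for length, fraction in proteins:
        start = (length // bin_width) * bin_width
        totals[start][0] += fraction
        totals[start][1] += 1
    return {start: total / count
            for start, (total, count) in sorted(totals.items())}


# Toy input: two short proteins, one mid-length, one longer protein.
binned = mean_disorder_by_length(
    [(50, 0.5), (80, 0.25), (150, 0.125), (420, 0.25)]
)
```

Applying the same function separately to each EC class and to non-enzymes would give one curve per enzyme type, which is the comparison the text describes.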

3.2.5 Disorder increases with organismic complexity.

In alignment with previous studies [49], we found that eukaryotes are predicted by all measures to be the most disordered, followed by prokaryotes, and then archaea (Figure 27, Figure 28), though it should be emphasized that there is some proteomic variability within each taxon. Eukaryotes showed significant variability in their disorder metrics between EC numbers, which was not observed to the same extent in prokaryotes or archaea. Eukaryotes in particular showed higher predicted disorder in transferases (EC2) and hydrolases (EC3). Additionally, the relative proportion of transferases and hydrolases to other enzymes was highest for eukaryotes at 72% of total enzymes, versus 60% in prokaryotes and archaea.

3.2.6 Eukaryotic proteins in the PDB are highly truncated

Proteins are frequently truncated as part of preparation for crystallization. The remaining protein region is referred to as the characterized portion, while the removed region is the uncharacterized portion. Typically, this is to remove areas that have caused problems in crystallization, though this is not always the case. These uncharacterized regions are predicted to be more disordered than the observed regions, though not as disordered as missing regions (Table 3).

Figure 29 Distribution of protein lengths and mean fraction of disorder by protein length in reference proteomes. Top plots show the distribution of protein lengths separated by enzyme type (or non-enzyme). Box plots are first quartile, second quartile (median), and third quartile. Plot whiskers extend to 5% and 95% of the data. Bottom plots display the mean fraction of predicted disorder by protein length interval, separated by enzyme type. Colors of mean fractions correspond to box plot coloring. Sorted into A) Eukaryotes, B) Prokaryotes, and C) Archaea.

When examining the length distributions for the characterized portion of the protein in the X-ray crystal structure, we found that the average lengths of the characterized portion of eukaryotic proteins that are crystallized are remarkably similar to the average lengths for the reference prokaryotes (Figure 30). This relationship was observed for all enzyme types and non-enzymes. The difference between the average truncated length and the average full length of the proteins in eukaryotes was highly variable depending on the enzyme type (Figure 31, left; red circle, black circle). Additionally, in most cases, the mean length of the eukaryotic proteins was divergent from the average lengths from the reference proteome set. In contrast, prokaryotes were minimally truncated, and length averages were similar to the prokaryotic reference proteomes length averages (Figure 31, right).

Figure 30 Length distribution of the characterized portion of eukaryotes in the PDB overlaid on top of the length distribution of full length prokaryotes.

Figure 31 Length distribution of the PDB dataset divided into eukaryotes (left) and prokaryotes (right). Box plots are first quartile, second quartile (median), and third quartile. Plot whiskers extend to 5% and 95% of the data.

3.2.7 Long IDPRs in enzymes are associated with specific functions

We performed Gene Ontology (GO) enrichment on human enzymes in order to investigate the relationship between CD regions and function (Figure 32). We found that enrichment occurred at the extreme ends of the disorder-order spectrum. Enzymes predicted to be very ordered were enriched in small molecule and lipid metabolic processes, while enzymes predicted to have CD regions longer than one hundred residues were enriched in macromolecular metabolic processes, phosphorylation, DNA binding, and chromosome organization (Figure 32). Small molecule catabolic processes showed significant enrichment only at CD lengths of 5-10 residues, and deubiquitination activity showed enrichment only at lengths of 50-100 residues, suggesting that some functions may be specific to particular CD lengths.


Figure 32 Gene Ontology (GO) term enrichment in human enzymes, categorized by longest CD length per protein. A dashed line indicates processes enriched in proteins predicted to be mostly structured. Solid lines correspond to processes enriched in proteins predicted to have long CD regions. Thick lines correspond to top scoring general GO terms. Thin lines correspond to more specific GO terms with high enrichment. Only terms that had a P value of less than or equal to 10^-10 are displayed, and highly redundant terms have been removed. The plots have been divided into A) Biological Processes and B) Molecular Functions.

3.2.8 Promiscuity is not correlated with disorder in enzymes

We performed an analysis of the protein-protein interaction properties of enzymes using interaction data from IntAct [187] and STRING [188], and did not find any strong correlation between the number of binding partners and the amount of predicted disorder. We also did not find any compelling relationships between the amount of predicted disorder in enzymes and measures of multi-functionality such as the number of GO terms, the keyword multifunctional enzyme in UniProt, or experimentally identified moonlighting enzymes in MoonProt [189]. While the literature supports a role for IDPRs in promiscuous binding and multi-functionality, our bioinformatics analysis suggests that this relationship may be subtle at the proteome level, or that the currently available datasets are incomplete.

3.3 Conclusions

Experimental evidence (Table 6) clearly demonstrates that the existence of an IDPR or multiple IDPRs in an enzyme does not preclude catalysis, and in fact an enzyme may be highly disordered while still being catalytically active. Our disorder prediction measurements (Figure 27) support the conclusion that enzymes are rarely entirely disordered, and are therefore more structured than non-enzymes on the whole. While there are counterexamples that merit a closer look, our results are in line with the scientific paradigm that catalysis is facilitated by a structured core.

However, our results also demonstrate the existence of CD regions in enzymes at a similar frequency and length as non-enzymes (Figure 27C). Additionally, CM regions in X-ray crystal structures in the PDB show more difference between enzyme types than between enzymes and non-enzymes, and show a high frequency for both (Figure 28). While it is difficult to directly compare prediction scores in representative proteomes to CM regions in the PDB, due to the large proportion of uncharacterized residues in the PDB, we feel the two measurements still provide consistent support for the common occurrence of IDPRs in enzymes and non-enzymes as well as the minimal difference between the two.

While it is difficult to capture the diversity of enzyme structural and related functional behaviors at this level, there are still some clear trends. Broadly speaking, enzymes with long CD regions are enriched in macromolecular metabolic processes, such as those involving DNA, RNA, and glycoproteins, while proteins predicted to be completely or almost completely structured are enriched in small molecule and lipid metabolic processes. Transferases and hydrolases are not only enriched in disorder in eukaryotes, but they also represent a much larger portion of total enzymes in eukaryotes than in prokaryotes/archaea. GO term enrichment analysis highlights deubiquitination processes, histone modification, and serine/threonine kinase activity in enzymes with long CD regions. We feel these findings demonstrate parallels to commonly accepted functional enrichments in intrinsically disordered non-enzymes, such as DNA/RNA binding, regulation, and signaling processes.

It should be noted that there are many disorder-related functions that are not well represented in currently available ontologies, such as scaffolding protein complexes, promiscuity, and various flavors of entropically driven . Furthermore, there are IDPRs that have been shown to be very impactful at very short lengths, and so relating quantities of disorder to function will necessarily be incomplete. A lack of GO enrichment for intermediate lengths of CD does not necessarily imply that these regions are not functionally relevant; it only conveys a lack of overrepresentation for any particular function or process.

Additionally, it is clear that disorder does not increase randomly as a function of length (Figure 29). For instance, the isomerases (EC5), which are overall predicted to be short and structured, increase dramatically in their predicted fraction of disorder in eukaryotes for protein lengths between 400-800, and enzymes with multiple assigned EC numbers peak in their fraction of predicted disorder at a length of 1600-800, but then become more structured at longer lengths. Rather than accumulating stretches of ‘junk’ sequence, it is more likely that protein evolution employs both protein length and protein disorder to cope with changing functional needs. Further studies focusing on the evolution of protein intrinsic disorder in specific enzyme families will be necessary to shed further light on this.

The minimal action an enzyme must perform to be considered functionally viable is catalysis. The reduced length of prokaryotic enzymes compared to eukaryotic enzymes suggests that, in some cases of functionally similar enzymes, the prokaryotic version represents a minimum length for catalytic viability. The overrepresentation of prokaryotes in the PDB (Figure 26) and the truncation of eukaryotic proteins to prokaryotic lengths before crystallization (Figure 30, Figure 31) suggest that enzymes are often being studied at this minimum viable length. The critical question, then, is whether IDPRs that extend beyond the catalytic core are important, and this study answers that question with a resounding yes.

Broadly speaking, IDPRs are likely to be important in regulatory mechanisms, signaling, recognition, and interaction with other proteins and biological molecules and may be useful targets for intervention. Furthermore, in cases where the IDPR is mediating different substrate interactions, targeting these areas creates the possibility of substrate specific inhibition or activation.

It is a generally agreed upon theory that protein intrinsic disorder increases in proteomes as an evolutionarily directed response to complexity. Our results support and extend this hypothesis to include IDPRs in enzymes. Our results provide striking parallels to (primarily) non-enzyme studies of protein intrinsic disorder that show enrichment in protein signaling and regulation [38].

It has typically been assumed that enzymes represent the structured portion of the protein kingdom, and when IDPRs are present, they are functionally neutral. However, our analysis does not support this assumption. Instead, structure and disorder are more tightly coupled to the complexity of the organism, and in more complex organisms, both enzymes and non-enzymes show a marked increase in protein intrinsic disorder. Disorder at the proteome level appears to emerge in response to organismic and functional complexity, and enzymes are not an exception to this rule.

In summary, the results from this study clearly show the following:

• The presence of an IDPR in an enzyme does not preclude catalysis.

• IDPRs are common in enzymes and occur at similar lengths and with similar frequencies to non-enzymes.

• IDPRs in enzymes are functionally important and specific.

• The experimental emphasis on the structured, catalytic portion of enzymes may exclude functionally important regions.

• Protein intrinsic disorder is a response to complexity in both enzymes and non-enzymes.


4. Materials and Methods

Note to Reader

Portions of this chapter have been previously published in RSC Advances 2016, 6(14):11513-11521 and Protein Sci 2016, 25(3):676-688, and have been reproduced with permission from the Royal Society of Chemistry and John Wiley & Sons publishing.

4.1 PubMed Data and Analysis

4.1.1 IDP terminology in PubMed

IDPs in the literature can be referred to by a number of different terms. In order to try to maximize coverage while minimizing irrelevant results, we used the search term (intrinsically OR natively OR naturally OR inherently) AND (disordered OR unfolded OR unstructured OR denatured OR flexible) AND (protein OR region OR peptide OR domain) AND (1978/1/1:2014/10/15[dp]) in PubMed at http://www.ncbi.nlm.nih.gov/pubmed/. This search covers the date range from 1/1/1978 through 10/15/2014. This search yielded 3343 results.
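The search term above can be assembled and submitted programmatically. The following is a minimal sketch, not the script used in this work: the function names, the `retmax` value, and the `email` argument are illustrative assumptions, and the network call uses Biopython's `Entrez.esearch`, which is imported lazily so that the query builder runs without Biopython installed.

```python
# Hypothetical helper names; only the query text itself comes from section 4.1.1.
TERM_GROUPS = [
    ["intrinsically", "natively", "naturally", "inherently"],
    ["disordered", "unfolded", "unstructured", "denatured", "flexible"],
    ["protein", "region", "peptide", "domain"],
]


def build_query(groups, start="1978/1/1", end="2014/10/15"):
    """Join each OR-group with AND and append a publication-date filter."""
    clauses = ["(" + " OR ".join(terms) + ")" for terms in groups]
    clauses.append(f"({start}:{end}[dp])")
    return " AND ".join(clauses)


def search_pubmed(query, email, retmax=5000):
    """Return PubMed IDs for a query (requires Biopython and network access)."""
    from Bio import Entrez  # lazy import: build_query works without Biopython

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]


QUERY = build_query(TERM_GROUPS)
```

Keeping the term groups as data makes it straightforward to reuse the same builder for the per-protein search terms described later in this section.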

From the initial 3343 results, we manually examined each article to try to ascertain which proteins were referred to as an IDP or indicated to have an IDPR. We recorded these names using the same language used in the corresponding literature. We discarded review, theory, proteomic, and method papers, as well as irrelevant results.

This filtering resulted in 2278 PubMed articles attached to 1127 search terms, each corresponding to a protein or protein domain. These search terms and their corresponding PubMed IDs can be found in Appendix B.


Our emphasis was primarily on the language used in the literature for the initial search; therefore, we did not evaluate the experimental evidence when compiling the initial list. For each of the 1127 identified proteins and protein domains, we created a search term and attempted to maximize relevant results by adding qualifiers as necessary. For instance, the search term we created for tau was “tau AND (protein OR Alzheimer’s OR tauopathies OR neuronal),” because a search for “tau” alone would return many irrelevant results. Similarly, the search term for p53 was “p53 AND (CTD or C-terminal or C-terminus),” because we wanted to specifically target our search towards the region that had been identified as intrinsically disordered. We attached DisProt and UniProt IDs to each protein search term; however, in many cases, this required an educated guess due to variations in naming conventions. In some cases, more than one UniProt and/or DisProt ID was attached when multiple organisms were referred to in the paper(s). In cases where only a domain was mentioned, a UniProt ID was not assigned. There were 630 proteins in our set that could be attached to DisProt IDs. For each UniProt ID assigned to a protein, a disorder prediction was obtained by the disorder predictor RAPID [85] at http://biomine-ws.ece.ualberta.ca/RAPID/index.php.

In order to get the number of both IDP and non-IDP papers per year, PubMed was automatically queried for each PubMed ID using the Biopython suite of tools [190] and custom Python scripts. The fraction of IDP papers is calculated as the number of IDP papers divided by the entire set of papers for that protein search query.
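As an illustration of the calculation described above, the sketch below counts IDP and non-IDP papers per year and computes the IDP fraction for one protein search query. The function names and the input layout (a mapping from PubMed ID to publication year, plus a set of PubMed IDs flagged as IDP papers) are assumptions made for this example.

```python
from collections import Counter


def papers_per_year(pmid_to_year, idp_pmids):
    """Count IDP and non-IDP papers per publication year."""
    idp_pmids = set(idp_pmids)
    idp, non_idp = Counter(), Counter()
    for pmid, year in pmid_to_year.items():
        if pmid in idp_pmids:
            idp[year] += 1
        else:
            non_idp[year] += 1
    return idp, non_idp


def idp_fraction(pmid_to_year, idp_pmids):
    """IDP papers divided by all papers returned for the search query."""
    if not pmid_to_year:
        return 0.0
    n_idp = len(set(idp_pmids) & set(pmid_to_year))
    return n_idp / len(pmid_to_year)
```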


4.2 PDB Data and Analysis

4.2.1 Parsing and preparation of the PDB dataset

The initial PDB dataset and enzyme designations were obtained from the structure integration with function, taxonomy and sequence (SIFTS) project, which provides a mapping between PDB chains and UniProt identifiers [191]. This mapping allows us to match sequence regions between multiple PDB files with different residue numbering schemes. The missing residues and secondary structure assignments for each PDB file are available in a precompiled text file provided by the PDB (available at http://www.rcsb.org/pdb/files/ss_dis.txt). By using this precompiled information, we were able to avoid directly parsing the PDB files, thus greatly simplifying the method. From this starting point, we performed the following filtering:

1. Remove obsolete PDB files and obsolete UniProt entries, and retain only X-ray crystallography files of individual proteins, protein complexes, or proteins and nucleic acid complexes.

2. Remove any entries with unclear mappings between the UniProt and PDB files, or where the mapping spanned fewer than four residues.

3. Remove any PDB files that do not have any secondary structure information available.

4. Remove any PDB chain that was not a 100% match with the corresponding region of the mapped UniProt entry.

5. Remove any UniProt ID that had a sequence longer than 10000 residues or that contained non-standard amino acids (for consistency with disorder prediction).
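The filtering steps above can be sketched as a single predicate over per-chain records. This is a simplified illustration rather than the published implementation: the `ChainRecord` fields, the method string, and the function names are hypothetical, and the checks for entry type in step 1 and for unclear mappings in step 2 are reduced to an obsolete flag, a method check, and a length check.

```python
from dataclasses import dataclass

STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class ChainRecord:
    pdb_chain: str                 # e.g. "1abc_A" (hypothetical identifier format)
    method: str                    # experimental method of the PDB entry
    obsolete: bool                 # PDB file or UniProt entry is obsolete
    mapped_length: int             # residues covered by the PDB-UniProt mapping
    has_secondary_structure: bool  # secondary structure information is available
    chain_seq: str                 # PDB chain sequence over the mapped region
    uniprot_region: str            # corresponding region of the UniProt sequence
    uniprot_seq: str               # full UniProt sequence


def keep(rec, max_len=10000):
    """Apply steps 1-5 to a single chain record."""
    return (
        not rec.obsolete                              # step 1
        and rec.method == "X-RAY DIFFRACTION"         # step 1
        and rec.mapped_length >= 4                    # step 2
        and rec.has_secondary_structure               # step 3
        and rec.chain_seq == rec.uniprot_region       # step 4: 100% match
        and len(rec.uniprot_seq) <= max_len           # step 5
        and set(rec.uniprot_seq) <= STANDARD_AA       # step 5
    )


def filter_chains(records):
    """Return only the chain records that pass every filter."""
    return [rec for rec in records if keep(rec)]
```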

By removing all crystal structures that did not exactly match their corresponding UniProt entry, we hope to minimize any confounding effects from sequence variation. We did not filter our dataset to remove homologs or fragments, nor did we filter based on date or resolution, as our primary objective was to provide a comprehensive survey of missing regions in the PDB. Additionally, for all analysis involving ambiguous regions, covered in Chapter 2, all UniProt IDs with only one associated PDB chain were removed. However, for the analysis performed in Chapter 3, these UniProt IDs were retained.

4.2.2 The assignment of missing residues from PDB files

We created a representative sequence of secondary structure assignments, uncharacterized residues, and missing residues for each PDB chain. Where a residue was not characterized, we used a dash character; where the residue was missing, we used an X character; and where the residue was observed but not assigned a secondary structure designation, we used a P character. The remaining secondary structure designations were from DSSP [83, 125]. The PDB chains could then be directly aligned to compare the missing regions.
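The encoding described above can be sketched as follows. The function name and its inputs (0-based UniProt positions, a set of characterized positions, a set of missing positions, and a mapping from position to DSSP code) are assumptions chosen for illustration and are not taken from the published implementation.

```python
def representative_sequence(length, characterized, missing, dssp):
    """Build one character per UniProt position for a single PDB chain.

    length        -- length of the UniProt sequence
    characterized -- positions included in the crystallized construct
    missing       -- characterized positions that are missing from the structure
    dssp          -- dict mapping observed positions to a DSSP code
    """
    chars = []
    for pos in range(length):
        if pos not in characterized:
            chars.append("-")                 # uncharacterized residue
        elif pos in missing:
            chars.append("X")                 # missing residue
        else:
            chars.append(dssp.get(pos, "P"))  # observed; P if no DSSP assignment
    return "".join(chars)
```

Because every chain is written on the coordinate system of the UniProt sequence, strings for different chains of the same protein have equal length and can be compared position by position.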

In our analysis, we have considered both the individual residues in each PDB chain, as well as a single composite of the PDB chains that is attached to a residue position on the UniProt sequence. We distinguish these two by referring to a position in the UniProt entry that spans all associated PDB chains as a residue column or a residue position.
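A residue column can be obtained by transposing the aligned per-chain strings. The sketch below is illustrative only: the four column labels, and in particular the working definition of an ambiguous column as one that is missing in at least one chain and observed in at least one other, are assumptions made for this example.

```python
def residue_columns(chain_strings):
    """Transpose aligned chain strings into one tuple of states per UniProt position."""
    if len({len(s) for s in chain_strings}) > 1:
        raise ValueError("chain strings must share the UniProt sequence length")
    return list(zip(*chain_strings))


def summarize_column(column):
    """Label a residue column using '-' (uncharacterized) and 'X' (missing)."""
    states = {c for c in column if c != "-"}
    if not states:
        return "uncharacterized"
    if states == {"X"}:
        return "missing"
    if "X" not in states:
        return "observed"
    return "ambiguous"  # assumed definition: missing in some chains, observed in others
```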

4.2.3 The creation of PDB composite data in Python

An implementation of this method was written in the programming language Python using the Pandas data analysis library [111]. It is available on GitHub at https://github.com/shellydeforte/PDB.

4.2.4 Amino acid composition

The amino acid composition analysis in Chapter 2 was performed with Composition Profiler [192] using 10,000 bootstrap iterations. It is displayed using the flexibility index proposed by Vihinen et al. [68].

4.2.5 Disorder, binding, and MoRF predictions

We used ESpritz X-ray [79], IUPred-short [80, 81], and DynaMine [55, 86] to predict intrinsic disorder in the PDB dataset for analysis in Chapter 2. We chose these predictors because they are all fast, perform well on short regions, and do not use multiple sequence alignments. Furthermore, each predictor used a different training set in its development, including a dataset based on missing regions in X-ray crystal structures (ESpritz X-ray), purely globular regions (IUPred-short), and NMR chemical shifts (DynaMine). We chose these predictors because we felt they would be best at highlighting distinct physicochemical features and would not be biased by specific sequence patterns that may be present in the PDB. However, because the ESpritz X-ray training set was most likely to have crossover with our dataset, we compared the ESpritz X-ray training set to our dataset and found that there were 2029 PDB chains in common, representing only 1.4% of our total dataset. Therefore, ESpritz X-ray should not be overly biased towards the PDB dataset.

In order to predict binding propensity and the presence of MoRFs, we used the DNA, RNA, and protein binding predictor DisoRDPbind [193], as well as the MoRF predictors ANCHOR [194, 195] and a new fast version of MoRFpred (unpublished work). All disorder and binding scores were treated as binary (either ordered or disordered), with the threshold set based on published materials of the predictor in question.

4.3 Reference Proteomes Data and Analysis

4.3.1 Parsing and preparation of the Reference QFO dataset

The 66 reference proteomes composing the primary proteome set for the Quest for Orthologs (QFO) project [184], corresponding to UniProt release 2014_04, were obtained from the EMBL-EBI website at http://www.ebi.ac.uk/reference_proteomes. The sequences were loaded into a MongoDB database. Before analysis was performed, protein fragments (as defined by UniProt), proteins with non-standard amino acids, and proteins of length 10,000 or greater were removed. After filtering, our dataset consisted of 649,902 proteins.


4.3.2 Enzyme Commission (EC) designations

EC numbers are assigned based on the chemical reaction that is catalyzed by the enzyme. On a per protein basis, these designations are assigned based on experimental evidence, as well as evidence from homology. We obtained these designations from two sources. The first was provided by ExPASy at ftp://ftp.expasy.org/databases/enzyme (released 2/4/15). These designations appeared to cover primarily the Swiss-Prot portion of the UniProt Knowledgebase, and therefore did not provide complete coverage for our reference proteome set. The second source of EC assignments was obtained from QuickGO at http://www.ebi.ac.uk/QuickGO/ on 2/20/15 using the EC to GO mapping.

4.3.3 Disorder prediction

We used a 4/7 per-residue, binary (each residue is assigned by each predictor as ordered or disordered) consensus of seven different disorder predictors for disorder prediction on the reference proteomes, used in the analysis in Chapter 3. The predictors we used were the three flavors of Espritz (Espritz Disprot, Espritz NMR, and Espritz X-ray) [79] and the three flavors of IUPred (IUPred Glob, IUPred Long, and IUPred Short) [80, 81], as well as the consensus predictor PONDR-fit [87]. Due to the size of our dataset, our first criterion for choosing disorder predictors was that they needed to be fast and to perform reasonably well against other predictors via a published benchmark. The three Espritz methods and the three IUPred predictors use various biophysical methods obtained from unique training sets, while PONDR-fit is itself a meta predictor using the output of 6 methods. The rationale for using a large number of predictors was to reduce bias towards one kind of disordered region over another.
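The per-residue consensus can be sketched as a simple vote. This example assumes that "4/7" means a residue is called disordered when at least four of the seven binary predictions agree, and that a raw score is converted to a binary call when it is greater than or equal to a predictor-specific threshold; both conventions and the function names are assumptions rather than details given in the text.

```python
def binarize(scores, threshold):
    """Convert per-residue scores to 1 (disordered) or 0 (ordered)."""
    return [1 if score >= threshold else 0 for score in scores]


def consensus(binary_predictions, min_votes=4):
    """Per-residue vote over equal-length 0/1 lists, one list per predictor."""
    if len({len(p) for p in binary_predictions}) > 1:
        raise ValueError("all predictions must cover the same residues")
    return [
        1 if sum(votes) >= min_votes else 0
        for votes in zip(*binary_predictions)
    ]
```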

We examined the Matthews correlation coefficient (MCC) for the output of the seven predictors and consensus score, for the 166 sequences in our dataset that overlapped with the MxD dataset [196], which contains experimentally verified disordered and structured regions.


The MCC, introduced in section 1.8.2.2, provides a score that combines information from true positives and negatives as well as false positives and negatives, and is one of the metrics used for evaluating disorder predictor performance in the CASP competition [75]. Our goal was to balance the strengths of proven predictors through consensus, as well as produce a high MCC. Only IUPred Glob performed better by the MCC measurement, and this was through a high true negative rate, but low true positive rate, which we better balanced through our 4/7 consensus score (Table 9).
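For reference, the MCC for binary per-residue calls can be computed with the standard formula (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a minimal illustration; the function name, the 0/1 encoding (1 for disordered), and the convention of returning 0 when the denominator is zero are choices made for this example.

```python
import math


def mcc(predicted, observed):
    """Matthews correlation coefficient for two equal-length 0/1 sequences."""
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p == 1 and o == 1)
    tn = sum(1 for p, o in pairs if p == 0 and o == 0)
    fp = sum(1 for p, o in pairs if p == 1 and o == 0)
    fn = sum(1 for p, o in pairs if p == 0 and o == 1)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when any marginal count is zero
    return (tp * tn - fp * fn) / denominator
```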

Figure 33 shows box plots for the fraction of predicted disorder per protein in the human proteome, for each of the seven disorder predictors and the 4/7 consensus. This plot demonstrates some of the differences between each predictor at the proteome level. In particular, Espritz Disprot and IUPred Glob tend to predict more proteins to be 100% structured, and have high true negative rates, while predictors such as PONDR-fit and Espritz NMR predict that most proteins will have some disorder, and have high true positive rates. Our consensus score finds a balance between these while maintaining a high MCC.

Table 9 Matthews correlation coefficient comparison for each disorder predictor and a consensus.

Disorder Predictor    MCC
Espritz Disprot       0.486
Espritz NMR           0.415
Espritz X-ray         0.419
IUPred Glob           0.491
IUPred Long           0.466
IUPred Short          0.417
PONDR-fit             0.459
Consensus             0.487


Figure 33 The distribution of disorder scores in the human proteome for seven disorder predictors and their consensus scores. Box plots are first quartile, second quartile (median), and third quartile. Plot whiskers extend to 5% and 95% of the data.

4.3.4 Disorder analysis

4.3.4.1 Disorder calculations

We used three different calculations for each whole protein sequence, all based on the binary disorder consensus score: the fraction of predicted disorder, the length of the longest continuously disordered (CD) region, and the number of CD regions. The fraction of disorder is calculated as the number of predicted disordered residues divided by the length of the protein. A CD region is defined as a stretch of contiguous predicted disordered residues, and its length is the number of residues in that stretch.
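The three calculations can be sketched from a per-residue binary consensus as follows. The function names are illustrative, and the example assumes that any run of one or more disordered residues counts as a CD region, since no minimum length is stated here.

```python
from itertools import groupby


def cd_region_lengths(binary):
    """Lengths of all runs of consecutive 1s (predicted disordered residues)."""
    return [sum(1 for _ in run) for state, run in groupby(binary) if state == 1]


def disorder_summary(binary):
    """Fraction of disorder, longest CD region, and number of CD regions."""
    runs = cd_region_lengths(binary)
    return {
        "fraction": sum(binary) / len(binary) if binary else 0.0,
        "longest_cd": max(runs, default=0),
        "n_cd": len(runs),
    }
```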

4.3.4.2 Expectation values

Due to the non-normality of the distributions of calculations used in this study, including disorder calculations, missing region calculations, and protein length calculations, we have used kernel density estimation (KDE) to approximate the probability density function and have calculated expectation values from this by the method outlined by Vincent et al. (submitted). KDE approximations were obtained via the Python package PyQt and integration was performed using the SciPy Python package. For fractional calculations with limits between 0 and 1, we used the quad integration method in SciPy, and for all other calculations we used Simpson’s rule, with the width set to 0.5. In this paper, we have referred to expectation values as “mean” or “average” values, but in no case were they obtained from an assumed normal distribution.
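To make the idea of an expectation value taken from a KDE concrete, the following standard-library sketch estimates a Gaussian KDE and integrates x times the density with composite Simpson's rule at a step of 0.5. It is a stand-in for illustration only and does not reproduce the packages or the exact method referenced above: the Gaussian kernel, the rule-of-thumb bandwidth, the integration limits of six bandwidths beyond the data, and the renormalization by the integrated mass are all assumptions.

```python
import math
import statistics


def gaussian_kde(data, bandwidth):
    """Return a Gaussian kernel density estimate as a function of x."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


def simpson(f, a, b, step=0.5):
    """Composite Simpson's rule with a spacing of at most `step`."""
    n = max(2, math.ceil((b - a) / step))
    if n % 2:
        n += 1  # Simpson's rule needs an even number of intervals
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def expectation(data, lo=None, hi=None, step=0.5, bandwidth=None):
    """E[x] under the KDE, integrated over [lo, hi] and renormalized."""
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an assumption, not the source's choice).
        bandwidth = 1.06 * statistics.stdev(data) * len(data) ** -0.2
    density = gaussian_kde(data, bandwidth)
    lo = min(data) - 6 * bandwidth if lo is None else lo
    hi = max(data) + 6 * bandwidth if hi is None else hi
    mass = simpson(density, lo, hi, step)
    return simpson(lambda x: x * density(x), lo, hi, step) / mass
```

With symmetric kernels and wide limits the result is close to the arithmetic mean of the data; the two differ once the integration is restricted to a bounded interval, as for fractions between 0 and 1, where a much smaller step than 0.5 is needed.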

4.3.4.3 Transmembrane domains

Disorder prediction was performed for every residue over the intact protein sequence; however, before disorder calculations (fraction of disorder, longest CD region, number of CD regions) were obtained, the transmembrane (TM) domains were removed, because the disorder predictors we used in this study have not been properly calibrated on these regions [197]. For the purposes of calculating the CD length, the regions surrounding a TM domain were considered to be contiguous. The region spanning the TM domain was obtained from UniProt. 20% of the proteins in our dataset had at least one transmembrane domain (20% each for eukaryotes, prokaryotes and archaea).
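The handling of TM domains can be sketched as deleting the annotated positions from the per-residue binary prediction, which makes the flanking regions contiguous for the CD length calculation. The function names and the use of 1-based inclusive (start, end) ranges are assumptions made for this example.

```python
from itertools import groupby


def remove_tm(binary, tm_ranges):
    """Drop residues inside TM domains given 1-based inclusive (start, end) ranges."""
    tm_positions = set()
    for start, end in tm_ranges:
        tm_positions.update(range(start, end + 1))
    return [
        state
        for position, state in enumerate(binary, start=1)
        if position not in tm_positions
    ]


def longest_cd(binary):
    """Length of the longest run of predicted disordered residues."""
    return max(
        (sum(1 for _ in run) for state, run in groupby(binary) if state == 1),
        default=0,
    )
```

In the usage below, two disordered stretches of length two on either side of a TM domain are joined into a single CD region of length four once the TM residues are removed.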

4.3.5 Gene Ontology enrichment

Gene ontology (GO) enrichment was performed using the R library topGO [198]. The human proteome was used to examine enrichment, due to its high coverage of GO terms. No filtering was done based on the evidence code for the GO terms, as we suspected that experimental evidence for ontology assignments might be biased in favor of structured enzymes. The ontology file and the gene association files were downloaded from http://geneontology.org/page/download-ontology and http://geneontology.org/page/download-annotations on 12/29/15. Human enzymes were divided into subsets based on their longest CD length, and interval lengths were chosen so that the subsets were of approximately equal size. Enrichment analysis was performed on each length interval subset against the entire set of human enzymes. Both the classic method, which gives the top scoring GO term, and the weight01 method of topGO, which obtains more specific GO terms, were used. P-values were obtained using Fisher’s exact test.
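As an illustration of the statistic involved, the sketch below computes a one-sided Fisher's exact test for over-representation of a single GO term in one length-interval subset, using the hypergeometric distribution. It is written in Python for consistency with the other scripts in this chapter and is not the topGO implementation; the function and parameter names are hypothetical, and the weight01 method is not covered.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided P(X >= k) for X ~ Hypergeometric(N, K, n).

    N -- enzymes in the background set (all human enzymes)
    K -- background enzymes annotated with the GO term
    n -- enzymes in the length-interval subset
    k -- subset enzymes annotated with the GO term
    """
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / total
```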


References

1. Tanford C, Reynolds J: Nature's Robots, A History of Proteins. Oxford, New York: Oxford University Press; 2001.
2. Jorpes JE: Jacob Berzelius, his life and work (trans. B. Steele). Stockholm: Almqvist & Wiksell; 1970.
3. Kunz H: Emil Fischer--unequalled classicist, master of organic chemistry research, and inspired trailblazer of biological chemistry. Angew Chem Int Ed Engl 2002, 41(23):4439-4451.
4. Blake CC, Koenig DF, Mair GA, North AC, Phillips DC, Sarma VR: Structure of hen egg-white lysozyme. A three-dimensional Fourier synthesis at 2 Angstrom resolution. Nature 1965, 206(4986):757-761.
5. Wyckoff HW, Hardman KD, Allewell NM, Inagami T, Tsernoglou D, Johnson LN, Richards FM: The structure of ribonuclease-S at 6 A resolution. J Biol Chem 1967, 242(16):3749-3753.
6. Anfinsen CB, Redfield RR, Choate WL, Page J, Carroll WR: Studies on the gross structure, cross-linkages, and terminal sequences in ribonuclease. J Biol Chem 1954, 207(1):201-210.
7. Jirgensons B: Classification of proteins according to conformation. Die Makromolekulare Chemie 1966, 91:74-86.
8. Herriott RM: Isolation, crystallization, and properties of swine pepsinogen. J Gen Physiol 1938, 21(4):501-540.
9. Jirgensons B: Optical rotation and viscosity of native and denatured proteins. X. Further studies on optical rotatory dispersion. Arch Biochem Biophys 1958, 74(1):57-69.
10. Arnone A, Bier CJ, Cotton FA, Day VW, Hazen EE, Richardson DC, Yonath A, Richardson JS: A high resolution structure of an inhibitor complex of the extracellular nuclease of Staphylococcus aureus. I. Experimental procedures and chain tracing. J Biol Chem 1971, 246(7):2302-2316.
11. Schweers O, Schönbrunn-Hanebeck E, Marx A, Mandelkow E: Structural studies of tau protein and Alzheimer paired helical filaments show no evidence for beta-structure. J Biol Chem 1994, 269(39):24290-24297.
12. Weinreb PH, Zhen W, Poon AW, Conway KA, Lansbury PT: NACP, a protein implicated in Alzheimer's disease and learning, is natively unfolded. Biochemistry 1996, 35(43):13709-13715.
13. Wright PE, Dyson HJ: Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol 1999, 293(2):321-331.
14. Bailey RW, Dunker AK, Brown CJ, Garner EC, Griswold MD: , a binding protein with a molten globule-like region. Biochemistry 2001, 40(39):11828-11840.
15. Liu J, Tan H, Rost B: Loopy proteins appear conserved in evolution. J Mol Biol 2002, 322(1):53-64.
16. Pullen RA, Jenkins JA, Tickle IJ, Wood SP, Blundell TL: The relation of polypeptide hormone structure and flexibility to receptor binding: the relevance of X-ray studies on insulins, glucagon and human placental lactogen. Mol Cell Biochem 1975, 8(1):5-20.
17. Cary PD, Moss T, Bradbury EM: High-resolution proton-magnetic-resonance studies of chromatin core particles. Eur J Biochem 1978, 89(2):475-482.
18. DeForte S, Uversky VN: Intrinsically disordered proteins in PubMed: what can the tip of the iceberg tell us about what lies below? RSC Advances 2016, 6(14):11513-11521.
19. Tompa P: Structure and function of intrinsically disordered proteins. Boca Raton: CRC Press; 2010.
20. Uversky VN: Introduction to intrinsically disordered proteins (IDPs). Chem Rev 2014, 114(13):6557-6560.
21. Thomas WH, Weser U, Hempel K: Conformational changes induced by ionic strength and pH in two bovine myelin basic proteins. Hoppe Seylers Z Physiol Chem 1977, 358(10):1345-1352.
22. Hernández MA, Avila J, Andreu JM: Physicochemical characterization of the heat-stable microtubule-associated protein MAP2. Eur J Biochem 1986, 154(1):41-48.
23. Dunker AK, Oldfield CJ: Back to the Future: Nuclear Magnetic Resonance and Bioinformatics Studies on Intrinsically Disordered Proteins. Adv Exp Med Biol 2015, 870:1-34.
24. Jakob U, Kriwacki R, Uversky VN: Conditionally and transiently disordered proteins: awakening cryptic disorder to regulate protein function. Chem Rev 2014, 114(13):6779-6805.
25. Hu X, Stebbins CE: Dynamics of the WPD loop of the Yersinia protein tyrosine phosphatase. Biophys J 2006, 91(3):948-956.
26. Vacic V, Oldfield CJ, Mohan A, Radivojac P, Cortese MS, Uversky VN, Dunker AK: Characterization of molecular recognition features, MoRFs, and their binding partners. J Proteome Res 2007, 6(6):2351-2366.
27. Roth AF, Papanayotou I, Davis NG: The yeast kinase Yck2 has a tripartite palmitoylation signal. Mol Biol Cell 2011, 22(15):2702-2715.
28. Takayama Y, Nakasako M, Okajima K, Iwata A, Kashojiya S, Matsui Y, Tokutomi S: Light-induced movement of the LOV2 domain in an Asp720Asn mutant LOV2-kinase fragment of Arabidopsis phototropin 2. Biochemistry 2011, 50(7):1174-1183.
29. Bentrop D, Beyermann M, Wissmann R, Fakler B: NMR structure of the "ball-and-chain" domain of KCNMB2, the beta 2-subunit of large conductance Ca2+- and voltage-activated potassium channels. J Biol Chem 2001, 276(45):42116-42121.
30. Patel SS, Belmont BJ, Sante JM, Rexach MF: Natively unfolded nucleoporins gate protein diffusion across the nuclear pore complex. Cell 2007, 129(1):83-96.
31. Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obradovic Z, Dunker AK: The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res 2004, 32(3):1037-1049.
32. Buffa P, Manzella L, Consoli ML, Messina A, Vigneri P: Modelling of the ABL and ARG proteins predicts two functionally critical regions that are natively unfolded. Proteins 2007, 67(1):1-11.
33. Manalan AS, Klee CB: Activation of calcineurin by limited proteolysis. Proc Natl Acad Sci U S A 1983, 80(14):4291-4295.
34. Mukerjee N, McGinnis KM, Park YH, Gnegy ME, Wang KK: Caspase-mediated proteolytic activation of calcineurin in thapsigargin-mediated apoptosis in SH-SY5Y neuroblastoma cells. Arch Biochem Biophys 2000, 379(2):337-343.
35. Alphey MS, Yu W, Byres E, Li D, Hunter WN: Structure and reactivity of human mitochondrial 2,4-dienoyl-CoA reductase: enzyme-ligand interactions in a distinctive short-chain reductase active site. J Biol Chem 2005, 280(4):3068-3077.
36. VanOudenhove J, Anderson E, Krueger S, Cole JL: Analysis of PKR structure by small-angle scattering. J Mol Biol 2009, 387(4):910-920.
37. Bhowmick P, Pancsa R, Guharoy M, Tompa P: Functional diversity and structural disorder in the human ubiquitination pathway. PLoS One 2013, 8(5):e65443.
38. Wright PE, Dyson HJ: Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 2015, 16(1):18-29.
39. Kathiriya JJ, Pathak RR, Clayman E, Xue B, Uversky VN, Davé V: Presence and utility of intrinsically disordered regions in kinases. Mol Biosyst 2014, 10(11):2876-2888.
40. Banci L, Bertini I, Cefaro C, Ciofi-Baffoni S, Gajda K, Felli IC, Gallo A, Pavelkova A, Kallergi E, Andreadaki M et al: An intrinsically disordered domain has a dual function coupled to compartment-dependent redox control. J Mol Biol 2013, 425(3):594-608.
41. Bornet O, Nouailler M, Feracci M, Sebban-Kreuzer C, Byrne D, Halimi H, Morelli X, Badache A, Guerlesquin F: Identification of a Src kinase SH3 binding site in the C-terminal domain of the human ErbB2 receptor tyrosine kinase. FEBS Lett 2014, 588(12):2031-2036.
42. Niklas KJ, Bondos SE, Dunker AK, Newman SA: Rethinking gene regulatory networks in light of alternative splicing, intrinsically disordered protein domains, and post-translational modifications. Front Cell Dev Biol 2015, 3:8.
43. Erales J, Coffino P: Ubiquitin-independent proteasomal degradation. Biochim Biophys Acta 2014, 1843(1):216-221.
44. Bah A, Vernon RM, Siddiqui Z, Krzeminski M, Muhandiram R, Zhao C, Sonenberg N, Kay LE, Forman-Kay JD: Folding of an intrinsically disordered protein by phosphorylation as a regulatory switch. Nature 2015, 519(7541):106-109.
45. Larion M, Salinas RK, Bruschweiler-Li L, Miller BG, Brüschweiler R: Order-disorder transitions govern kinetic cooperativity and allostery of monomeric human glucokinase. PLoS Biol 2012, 10(12):e1001452.
46. Fredrickson EK, Clowes Candadai SV, Tam CH, Gardner RG: Means of self-preservation: how an intrinsically disordered ubiquitin-protein ligase averts self-destruction. Mol Biol Cell 2013, 24(7):1041-1052.
47. Hou Y, Lin S: Distinct gene number-genome size relationships for eukaryotes and non-eukaryotes: gene content estimation for dinoflagellate genomes. PLoS One 2009, 4(9):e6978.
48. Dunker AK, Bondos SE, Huang F, Oldfield CJ: Intrinsically disordered proteins and multicellular organisms. Semin Cell Dev Biol 2015, 37:44-55.
49. Pancsa R, Tompa P: Structural disorder in eukaryotes. PLoS One 2012, 7(4):e34687.
50. Yu JF, Cao Z, Yang Y, Wang CL, Su ZD, Zhao YW, Wang JH, Zhou Y: Natural protein sequences are more intrinsically disordered than random sequences. Cell Mol Life Sci 2016.
51. Teraguchi S, Patil A, Standley DM: Intrinsically disordered domains deviate significantly from random sequences in mammalian proteins. BMC Bioinformatics 2010, 11 Suppl 7:S7.
52. Brown CJ, Takayama S, Campen AM, Vise P, Marshall TW, Oldfield CJ, Williams CJ, Dunker AK: Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol 2002, 55(1):104-110.
53. Potenza E, Di Domenico T, Walsh I, Tosatto SC: MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins. Nucleic Acids Res 2015, 43(Database issue):D315-320.
54. Ota M, Koike R, Amemiya T, Tenno T, Romero PR, Hiroaki H, Dunker AK, Fukuchi S: An assignment of intrinsically disordered regions of proteins based on NMR structures. J Struct Biol 2013, 181(1):29-36.
55. Cilia E, Pancsa R, Tompa P, Lenaerts T, Vranken WF: From protein sequence to dynamics and disorder with DynaMine. Nat Commun 2013, 4:2741.
56. Intrinsically Disordered Protein Analysis, vol. 1, 1 edn. Humana Press; 2012.
57. Intrinsically Disordered Protein Analysis, vol. 2, 1 edn. Humana Press; 2012.
58. Uversky VN, Dunker AK: Multiparametric analysis of intrinsically disordered proteins: looking at intrinsic disorder through compound eyes. Anal Chem 2012, 84(5):2096-2104.
59. Uversky VN: Biophysical Methods to Investigate Intrinsically Disordered Proteins: Avoiding an "Elephant and Blind Men" Situation. Adv Exp Med Biol 2015, 870:215-260.
60. Khan T, Douglas GM, Patel P, Nguyen Ba AN, Moses AM: Polymorphism Analysis Reveals Reduced Negative Selection and Elevated Rate of Insertions and Deletions in Intrinsically Disordered Protein Regions. Genome Biol Evol 2015, 7(6):1815-1826.
61. Smithers B, Oates ME, Gough J: Splice junctions are constrained by protein disorder. Nucleic Acids Res 2015, 43(10):4814-4822.
62. Bhowmick P, Guharoy M, Tompa P: Bioinformatics Approaches for Predicting Disordered Protein Motifs. Adv Exp Med Biol 2015, 870:291-318.
63. Varadi M, Vranken W, Guharoy M, Tompa P: Computational approaches for inferring the functions of intrinsically disordered proteins. Front Mol Biosci 2015, 2:45.
64. Punta M, Simon I, Dosztányi Z: Prediction and analysis of intrinsically disordered proteins. Methods Mol Biol 2015, 1261:35-59.

91

65. Anfinsen CB: Principles that govern the folding of protein chains. Science 1973, 181(4096):223-230. 66. Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW et al: Intrinsically disordered protein. J Mol Graph Model 2001, 19(1):26-59. 67. Williams RM, Obradovi Z, Mathura V, Braun W, Garner EC, Young J, Takayama S, Brown CJ, Dunker AK: The protein non-folding problem: amino acid determinants of intrinsic order and disorder. Pac Symp Biocomput 2001:89-100. 68. Vihinen M, Torkkila E, Riikonen P: Accuracy of protein flexibility predictions. Proteins 1994, 19(2):141-149. 69. Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982, 157(1):105-132. 70. Kumari B, Kumar R, Kumar M: Low complexity and disordered regions of proteins have different structural and amino acid preferences. Mol Biosyst 2015, 11(2):585-594. 71. Chen JW, Romero P, Uversky VN, Dunker AK: Conservation of intrinsic disorder in protein domains and families: I. A database of conserved predicted disordered regions. J Proteome Res 2006, 5(4):879-887. 72. Moesa HA, Wakabayashi S, Nakai K, Patil A: Chemical composition is maintained in poorly conserved intrinsically disordered regions and suggests a means for their classification. Mol Biosyst 2012, 8(12):3262-3273. 73. Li J, Feng Y, Wang X, Liu W, Rong L, Bao J: An Overview of Predictors for Intrinsically Disordered Proteins over 2010-2014. Int J Mol Sci 2015, 16(10):23446-23462. 74. He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK: Predicting intrinsic disorder in proteins: an overview. Cell Res 2009, 19(8):929-949. 75. Monastyrskyy B, Kryshtafovych A, Moult J, Tramontano A, Fidelis K: Assessment of protein disorder region predictions in CASP10. Proteins 2014, 82 Suppl 2:127- 137. 76. Ishida T, Kinoshita K: PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res 2007, 35(Web Server issue):W460-464. 77. 
Zhang T, Faraggi E, Xue B, Dunker AK, Uversky VN, Zhou Y: SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. J Biomol Struct Dyn 2012, 29(4):799-813. 78. Jones DT, Cozzetto D: DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 2015, 31(6):857-863. 79. Walsh I, Martin AJ, Di Domenico T, Tosatto SC: ESpritz: accurate and fast prediction of protein disorder. Bioinformatics 2012, 28(4):503-509. 80. Dosztányi Z, Csizmok V, Tompa P, Simon I: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 2005, 21(16):3433-3434. 81. Dosztányi Z, Csizmók V, Tompa P, Simon I: The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol 2005, 347(4):827-839.

92

82. Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN et al: DisProt: the Database of Disordered Proteins. Nucleic Acids Res 2007, 35(Database issue):D786-793. 83. Touw WG, Baakman C, Black J, te Beek TA, Krieger E, Joosten RP, Vriend G: A series of PDB-related databanks for everyday needs. Nucleic Acids Res 2015, 43(Database issue):D364-368. 84. Peng Z, Mizianty MJ, Kurgan L: Genome-scale prediction of proteins with long intrinsically disordered regions. Proteins 2014, 82(1):145-158. 85. Yan J, Mizianty MJ, Filipow PL, Uversky VN, Kurgan L: RAPID: fast and accurate sequence-based prediction of intrinsic disorder content on proteomic scale. Biochim Biophys Acta 2013, 1834(8):1671-1680. 86. Cilia E, Pancsa R, Tompa P, Lenaerts T, Vranken WF: The DynaMine webserver: predicting protein dynamics from sequence. Nucleic Acids Res 2014, 42(Web Server issue):W264-270. 87. Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN: PONDR-FIT: a meta- predictor of intrinsically disordered amino acids. Biochim Biophys Acta 2010, 1804(4):996-1010. 88. Kozlowski LP, Bujnicki JM: MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 2012, 13:111. 89. Fan X, Kurgan L: Accurate prediction of disorder in protein chains with a comprehensive and empirically designed consensus. J Biomol Struct Dyn 2014, 32(3):448-464. 90. Vucetic S, Brown CJ, Dunker AK, Obradovic Z: Flavors of protein disorder. Proteins 2003, 52(4):573-584. 91. Cozzetto D, Jones DT: The contribution of intrinsic disorder prediction to the elucidation of protein function. Curr Opin Struct Biol 2013, 23(3):467-472. 92. Consortium GO: Gene Ontology Consortium: going forward. Nucleic Acids Res 2015, 43(Database issue):D1049-1056. 93. Bairoch A: The ENZYME database in 2000. Nucleic Acids Res 2000, 28(1):304-305. 94. 
Peng Z, Yan J, Fan X, Mizianty MJ, Xue B, Wang K, Hu G, Uversky VN, Kurgan L: Exceptionally abundant exceptions: comprehensive characterization of intrinsic disorder in all domains of life. Cell Mol Life Sci 2015, 72(1):137-151. 95. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 2004, 337(3):635-645. 96. Uversky VN, Oldfield CJ, Dunker AK: Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys 2008, 37:215-246. 97. Gsponer J, Futschik ME, Teichmann SA, Babu MM: Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science 2008, 322(5906):1365-1368. 98. Uversky VN, Dunker AK: Understanding protein non-folding. Biochim Biophys Acta 2010, 1804(6):1231-1264. 99. Uversky VN: Intrinsically disordered proteins and their (disordered) proteomes in neurodegenerative disorders. Front Aging Neurosci 2015, 7:18.

93

100. Uversky VN, Davé V, Iakoucheva LM, Malaney P, Metallo SJ, Pathak RR, Joerger AC: Pathological unfoldomics of uncontrolled chaos: intrinsically disordered proteins and human diseases. Chem Rev 2014, 114(13):6844-6879. 101. Sotomayor-Pérez AC, Ladant D, Chenal A: Disorder-to-order transition in the CyaA toxin RTX domain: implications for toxin secretion. Toxins (Basel) 2015, 7(1):1-20. 102. Berson A, Soreq H: It all starts at the ends: multifaceted involvement of C- and N-terminally modified in Alzheimer's disease. Rambam Maimonides Med J 2010, 1(2):e0014. 103. Kardani J, Roy I: Understanding Caffeine's Role in Attenuating the Toxicity of α- Synuclein Aggregates: Implications for Risk of Parkinson's Disease. ACS Chem Neurosci 2015, 6(9):1613-1625. 104. Tonks NK, Diltz CD, Fischer EH: Purification of the major protein-tyrosine- of human placenta. J Biol Chem 1988, 263(14):6722-6730. 105. Chernoff J, Schievella AR, Jost CA, Erikson RL, Neel BG: Cloning of a cDNA for a major human protein-tyrosine-phosphatase. Proc Natl Acad Sci U S A 1990, 87(7):2735-2739. 106. Hao L, Tiganis T, Tonks NK, Charbonneau H: The noncatalytic C-terminal segment of the T cell protein tyrosine phosphatase regulates activity via an intramolecular mechanism. J Biol Chem 1997, 272(46):29322-29329. 107. Tonks NK: Protein tyrosine phosphatases--from housekeeping enzymes to master regulators of signal transduction. FEBS J 2013, 280(2):346-378. 108. Anderson JN, Tonks NK: Protein tyrosine phosphatase-based therapeutics: lessons from PTP1B. Topics in Current Genetics 2004, 5:201-230. 109. Lantz KA, Hart SG, Planey SL, Roitman MF, Ruiz-White IA, Wolfe HR, McLane MP: Inhibition of PTP1B by trodusquemine (MSI-1436) causes fat-specific weight loss in diet-induced obese mice. Obesity (Silver Spring) 2010, 18(8):1516-1523. 110. 
Krishnan N, Koveal D, Miller DH, Xue B, Akshinthala SD, Kragelj J, Jensen MR, Gauss CM, Page R, Blackledge M et al: Targeting the disordered C terminus of PTP1B with an allosteric inhibitor. Nat Chem Biol 2014, 10(7):558-566. 111. McKinney W: Data Structures for Statistical Computing in Python. In. Proceedings of the 9th Python in Science Conference; 2010: 51-56. 112. Bennett WS, Huber R: Structural and functional aspects of domain motions in proteins. CRC Crit Rev Biochem 1984, 15(4):291-384. 113. Garner E, Cannon P, Romero P, Obradovic Z, Dunker AK: Predicting Disordered Regions from Amino Acid Sequence: Common Themes Despite Differing Structural Characterization. Genome Inform Ser Workshop Genome Inform 1998, 9:201-213. 114. Bardwell JC, Jakob U: Conditional disorder in action. Trends Biochem Sci 2012, 37(12):517-525. 115. Zhang T, Faraggi E, Li Z, Zhou Y: Intrinsically semi-disordered state and its role in induced folding and protein aggregation. Cell Biochem Biophys 2013, 67(3):1193-1205. 116. Touw WG, Vriend G: BDB: databank of PDB files with consistent B-factors. Protein Eng Des Sel 2014, 27(11):457-462.

94

117. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB: Protein disorder prediction: implications for structural proteomics. Structure 2003, 11(11):1453-1459. 118. Radivojac P, Obradovic Z, Smith DK, Zhu G, Vucetic S, Brown CJ, Lawson JD, Dunker AK: Protein flexibility and intrinsic disorder. Protein Sci 2004, 13(1):71-80. 119. Le Gall T, Romero PR, Cortese MS, Uversky VN, Dunker AK: Intrinsic disorder in the Protein Data Bank. J Biomol Struct Dyn 2007, 24(4):325-342. 120. Zhang Y, Stec B, Godzik A: Between order and disorder in protein structures: analysis of "dual personality" fragments in proteins. Structure 2007, 15(9):1141-1147. 121. Lobanov MY, Galzitskaya OV: Disordered patterns in clustered Protein Data Bank and in eukaryotic and bacterial proteomes. PLoS One 2011, 6(11):e27142. 122. Oldfield CJ, Xue B, Van YY, Ulrich EL, Markley JL, Dunker AK, Uversky VN: Utilization of protein intrinsic disorder knowledge in structural proteomics. Biochim Biophys Acta 2013, 1834(2):487-498. 123. Walsh I, Giollo M, Di Domenico T, Ferrari C, Zimmermann O, Tosatto SC: Comprehensive large-scale assessment of intrinsic protein disorder. Bioinformatics 2015, 31(2):201-208. 124. Consortium U: Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res 2014, 42(Database issue):D191-198. 125. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577-2637. 126. Shannon CE: A mathematical theory of communication. Bell System Technical Journal 1948, 27(3):379-423. 127. Fraser JS, Clarkson MW, Degnan SC, Erion R, Kern D, Alber T: Hidden alternative structures of proline essential for catalysis. Nature 2009, 462(7273):669-673. 128. Klinman JP: Importance of protein dynamics during enzymatic C-H bond cleavage catalysis. Biochemistry 2013, 52(12):2068-2077. 129. Henzler-Wildman K, Kern D: Dynamic personalities of proteins. Nature 2007, 450(7172):964-972. 130. 
McGowan LC, Hamelberg D: Conformational plasticity of an enzyme during catalysis: intricate coupling between cyclophilin A dynamics and substrate turnover. Biophys J 2013, 104(1):216-226. 131. Bhabha G, Biel JT, Fraser JS: Keep on moving: discovering and perturbing the conformational dynamics of enzymes. Acc Chem Res 2015, 48(2):423-430. 132. Callender R, Dyer RB: The dynamical nature of enzymatic catalysis. Acc Chem Res 2015, 48(2):407-413. 133. Schulenburg C, Hilvert D: Protein conformational disorder and enzyme catalysis. Top Curr Chem 2013, 337:41-67. 134. Vendruscolo M: Enzymatic activity in disordered states of proteins. Curr Opin Chem Biol 2010, 14(5):671-675. 135. Hammes-Schiffer S, Benkovic SJ: Relating protein motion to catalysis. Annu Rev Biochem 2006, 75:519-541.

95

136. Hammes GG, Benkovic SJ, Hammes-Schiffer S: Flexibility, diversity, and cooperativity: pillars of enzyme catalysis. Biochemistry 2011, 50(48):10422- 10430. 137. Bode W, Fehlhammer H, Huber R: Crystal structure of bovine trypsinogen at 1-8 A resolution. I. Data collection, application of patterson search techniques and preliminary structural interpretation. J Mol Biol 1976, 106(2):325-335. 138. Oldfield CJ, Dunker AK: Intrinsically disordered proteins and intrinsically disordered protein regions. Annu Rev Biochem 2014, 83:553-584. 139. Ye Q, Feng Y, Yin Y, Faucher F, Currie MA, Rahman MN, Jin J, Li S, Wei Q, Jia Z: Structural basis of calcineurin activation by calmodulin. Cell Signal 2013, 25(12):2661-2667. 140. Ding Z, Taneva SG, Huang HK, Campbell SA, Semenec L, Chen N, Cornell RB: A 22- mer segment in the structurally pliable regulatory domain of metazoan CTP: phosphocholine cytidylyltransferase facilitates both silencing and activating functions. J Biol Chem 2012, 287(46):38980-38991. 141. Christensen LC, Jensen NW, Vala A, Kamarauskaite J, Johansson L, Winther JR, Hofmann K, Teilum K, Ellgaard L: The human selenoprotein VCP-interacting membrane protein (VIMP) is non-globular and harbors a reductase function in an intrinsically disordered region. J Biol Chem 2012, 287(31):26388-26399. 142. Aachmann FL, Sal LS, Kim HY, Marino SM, Gladyshev VN, Dikiy A: Insights into function, catalytic mechanism, and fold evolution of selenoprotein methionine sulfoxide reductase B1 through structural analysis. J Biol Chem 2010, 285(43):33315-33323. 143. Rosenbaum JC, Fredrickson EK, Oeser ML, Garrett-Engele CM, Locke MN, Richardson LA, Nelson ZW, Hetrick ED, Milac TI, Gottschling DE et al: Disorder targets misorder in nuclear quality control degradation: a disordered directly recognizes its misfolded substrates. Mol Cell 2011, 41(1):93-106. 144. Reed BJ, Locke MN, Gardner RG: A Conserved Uses Intrinsically Disordered Regions to Scaffold Multiple Protein Interaction Sites. 
J Biol Chem 2015, 290(33):20601-20612. 145. Vittal V, Shi L, Wenzel DM, Scaglione KM, Duncan ED, Basrur V, Elenitoba-Johnson KS, Baker D, Paulson HL, Brzovic PS et al: Intrinsic disorder drives N-terminal ubiquitination by Ube2w. Nat Chem Biol 2015, 11(1):83-89. 146. Kennedy JA, Daughdrill GW, Schmidt KH: A transient α-helical molecular recognition element in the disordered N-terminus of the Sgs1 helicase is critical for chromosome stability and binding of Top3/Rmi1. Nucleic Acids Res 2013, 41(22):10215-10227. 147. Aït-Bara S, Carpousis AJ, Quentin Y: RNase E in the γ-Proteobacteria: conservation of intrinsically disordered noncatalytic region and molecular evolution of microdomains. Mol Genet Genomics 2014. 148. Hegde ML, Tsutakawa SE, Hegde PM, Holthauzen LM, Li J, Oezguen N, Hilser VJ, Tainer JA, Mitra S: The disordered C-terminal domain of human DNA glycosylase NEIL1 contributes to its stability via intramolecular interactions. J Mol Biol 2013, 425(13):2359-2371.

96

149. Choowongkomon K, Carlin CR, Sönnichsen FD: A structural model for the membrane-bound form of the juxtamembrane domain of the epidermal growth factor receptor. J Biol Chem 2005, 280(25):24043-24052. 150. Rajagopal C, Stone KL, Francone VP, Mains RE, Eipper BA: Secretory granule to the nucleus: role of a multiply phosphorylated intrinsically unstructured domain. J Biol Chem 2009, 284(38):25723-25734. 151. Aggarwal M, Dhindwal S, Kumar P, Kuhn RJ, Tomar S: trans-Protease activity and structural insights into the active form of the alphavirus capsid protease. J Virol 2014, 88(21):12242-12253. 152. Tie Y, Kovalevsky AY, Boross P, Wang YF, Ghosh AK, Tozser J, Harrison RW, Weber IT: Atomic resolution crystal structures of HIV-1 protease and mutants V82A and I84V with saquinavir. Proteins 2007, 67(1):232-242. 153. Chakrabortee S, Meersman F, Kaminski Schierle GS, Bertoncini CW, McGee B, Kaminski CF, Tunnacliffe A: Catalytic and chaperone-like functions in an intrinsically disordered protein associated with desiccation tolerance. Proc Natl Acad Sci U S A 2010, 107(37):16084-16089. 154. Zotter Á, Oláh J, Hlavanda E, Bodor A, Perczel A, Szigeti K, Fidy J, Ovádi J: Zn²+- induced rearrangement of the disordered TPPP/p25 affects its microtubule assembly and GTPase activity. Biochemistry 2011, 50(44):9568-9578. 155. Zambelli B, Stola M, Musiani F, De Vriendt K, Samyn B, Devreese B, Van Beeumen J, Turano P, Dikiy A, Bryant DA et al: UreG, a chaperone in the urease assembly process, is an intrinsically unstructured GTPase that specifically binds Zn2+. J Biol Chem 2005, 280(6):4684-4695. 156. Rumi-Masante J, Rusinga FI, Lester TE, Dunlap TB, Williams TD, Dunker AK, Weis DD, Creamer TP: Structural basis for activation of calcineurin by calmodulin. J Mol Biol 2012, 415(2):307-317. 157. Dunlap TB, Cook EC, Rumi-Masante J, Arvin HG, Lester TE, Creamer TP: The distal helix in the regulatory domain of calcineurin is important for domain stability and enzyme function. 
Biochemistry 2013, 52(48):8643-8651. 158. Dennis MK, Taneva SG, Cornell RB: The intrinsically disordered nuclear localization signal and phosphorylation segments distinguish the membrane affinity of two cytidylyltransferase isoforms. J Biol Chem 2011, 286(14):12349- 12360. 159. Huang HK, Taneva SG, Lee J, Silva LP, Schriemer DC, Cornell RB: The membrane- binding domain of an amphitropic enzyme suppresses catalysis by contact with an amphipathic helix flanking its active site. J Mol Biol 2013, 425(9):1546- 1564. 160. Chong SS, Taneva SG, Lee JM, Cornell RB: The curvature sensitivity of a membrane-binding amphipathic helix can be modulated by the charge on a flanking region. Biochemistry 2014, 53(3):450-461. 161. Liu J, Li F, Rozovsky S: The intrinsically disordered membrane protein selenoprotein S is a reductase in vitro. Biochemistry 2013, 52(18):3051-3061. 162. Liu J, Rozovsky S: Contribution of selenocysteine to the peroxidase activity of selenoprotein S. Biochemistry 2013, 52(33):5514-5516.

97

163. Mirzaei H, Syed S, Kennedy J, Schmidt KH: Sgs1 truncations induce genome rearrangements but suppress detrimental effects of BLM overexpression in Saccharomyces cerevisiae. J Mol Biol 2011, 405(4):877-891. 164. Hegde ML, Banerjee S, Hegde PM, Bellot LJ, Hazra TK, Boldogh I, Mitra S: Enhancement of NEIL1 protein-initiated oxidized DNA base excision repair by heterogeneous nuclear ribonucleoprotein U (hnRNP-U) via direct interaction. J Biol Chem 2012, 287(41):34202-34211. 165. Redko Y, Tock MR, Adams CJ, Kaberdin VR, Grasby JA, McDowall KJ: Determination of the catalytic parameters of the N-terminal half of ribonuclease E and the identification of critical functional groups in RNA substrates. J Biol Chem 2003, 278(45):44001-44008. 166. Carpousis AJ: The RNA degradosome of Escherichia coli: an mRNA-degrading machine assembled on RNase E. Annu Rev Microbiol 2007, 61:71-87. 167. Callaghan AJ, Aurikko JP, Ilag LL, Günter Grossmann J, Chandran V, Kühnel K, Poljak L, Carpousis AJ, Robinson CV, Symmons MF et al: Studies of the RNA degradosome-organizing domain of the Escherichia coli ribonuclease RNase E. J Mol Biol 2004, 340(5):965-979. 168. Banci L, Bertini I, Calderone V, Cefaro C, Ciofi-Baffoni S, Gallo A, Kallergi E, Lionaki E, Pozidis C, Tokatlidis K: Molecular recognition and substrate mimicry drive the electron-transfer process between MIA40 and ALR. Proc Natl Acad Sci U S A 2011, 108(12):4811-4816. 169. Vitu E, Bentzur M, Lisowsky T, Kaiser CA, Fass D: Gain of function in an ERV/ALR sulfhydryl oxidase by molecular engineering of the shuttle disulfide. J Mol Biol 2006, 362(1):89-101. 170. Shan Y, Eastwood MP, Zhang X, Kim ET, Arkhipov A, Dror RO, Jumper J, Kuriyan J, Shaw DE: Oncogenic mutations counteract intrinsic disorder in the EGFR kinase and promote receptor dimerization. Cell 2012, 149(4):860-870. 171. Greenberg DS, Toiber D, Berson A, Soreq H: Acetylcholinesterase variants in Alzheimer's disease: from neuroprotection to programmed cell death. 
Neurodegener Dis 2010, 7(1-3):60-63. 172. Szilvay GR, Blenner MA, Shur O, Cropek DM, Banta S: A FRET-based method for probing the conformational behavior of an intrinsically disordered repeat domain from Bordetella pertussis adenylate cyclase. Biochemistry 2009, 48(47):11273-11282. 173. Sotomayor Pérez AC, Karst JC, Davi M, Guijarro JI, Ladant D, Chenal A: Characterization of the regions involved in the calcium-induced folding of the intrinsically disordered RTX motifs from the bordetella pertussis adenylate cyclase toxin. J Mol Biol 2010, 397(2):534-549. 174. Shur O, Wu J, Cropek DM, Banta S: Monitoring the conformational changes of an intrinsically disordered peptide using a quartz crystal microbalance. Protein Sci 2011, 20(5):925-930. 175. Kang'ethe W, Bernstein HD: Charge-dependent secretion of an intrinsically disordered protein via the autotransporter pathway. Proc Natl Acad Sci U S A 2013, 110(45):E4246-4255.

98

176. Munshi S, Chen Z, Yan Y, Li Y, Olsen DB, Schock HB, Galvin BB, Dorsey B, Kuo LC: An alternate binding site for the P1-P3 group of a class of potent HIV-1 protease inhibitors as a result of concerted structural change in the 80s loop of the protease. Acta Crystallogr D Biol Crystallogr 2000, 56(Pt 4):381-388. 177. Ohtaka H, Freire E: Adaptive inhibitors of the HIV-1 protease. Prog Biophys Mol Biol 2005, 88(2):193-208. 178. Torbeev VY, Raghuraman H, Hamelberg D, Tonelli M, Westler WM, Perozo E, Kent SB: Protein conformational dynamics in the mechanism of HIV-1 protease catalysis. Proc Natl Acad Sci U S A 2011, 108(52):20982-20987. 179. Kovács GG, László L, Kovács J, Jensen PH, Lindersson E, Botond G, Molnár T, Perczel A, Hudecz F, Mezo G et al: Natively unfolded tubulin polymerization promoting protein TPPP/p25 is a common marker of alpha-synucleinopathies. Neurobiol Dis 2004, 17(2):155-162. 180. Orosz F, Kovács GG, Lehotzky A, Oláh J, Vincze O, Ovádi J: TPPP/p25: from unfolded protein to misfolding disease: prediction and experiments. Biol Cell 2004, 96(9):701-711. 181. Zotter A, Bodor A, Oláh J, Hlavanda E, Orosz F, Perczel A, Ovádi J: Disordered TPPP/p25 binds GTP and displays Mg2+-dependent GTPase activity. FEBS Lett 2011, 585(5):803-808. 182. Zambelli B, Musiani F, Savini M, Tucker P, Ciurli S: Biochemical studies on Mycobacterium tuberculosis UreG and comparative modeling reveal structural and functional conservation among the bacterial UreG family. Biochemistry 2007, 46(11):3171-3182. 183. Xue B, Dunker AK, Uversky VN: Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. J Biomol Struct Dyn 2012, 30(2):137-149. 184. Dessimoz C, Gabaldón T, Roos DS, Sonnhammer EL, Herrero J, Consortium QfO: Toward community standards in the quest for orthologs. Bioinformatics 2012, 28(6):900-904. 185. 
DeForte S, Uversky VN: Resolving the ambiguity: Making sense of intrinsic disorder when PDB structures disagree. Protein Sci 2016, 25(3):676-688. 186. Barrett AJ: Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Enzyme Nomenclature. Recommendations 1992. Supplement 4: corrections and additions (1997). Eur J Biochem 1997, 250(1):1-6. 187. Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N et al: The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014, 42(Database issue):D358-363. 188. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP et al: STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015, 43(Database issue):D447-452. 189. Mani M, Chen C, Amblee V, Liu H, Mathur T, Zwicke G, Zabad S, Patel B, Thakkar J, Jeffery CJ: MoonProt: a database for proteins that are known to moonlight. Nucleic Acids Res 2015, 43(Database issue):D277-282.

99

190. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B et al: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25(11):1422-1423. 191. Velankar S, Dana JM, Jacobsen J, van Ginkel G, Gane PJ, Luo J, Oldfield TJ, O'Donovan C, Martin MJ, Kleywegt GJ: SIFTS: Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Res 2013, 41(Database issue):D483-489. 192. Vacic V, Uversky VN, Dunker AK, Lonardi S: Composition Profiler: a tool for discovery and visualization of amino acid composition differences. BMC Bioinformatics 2007, 8:211. 193. Peng Z, Kurgan L: High-throughput prediction of RNA, DNA and protein binding regions mediated by intrinsic disorder. Nucleic Acids Res 2015, 43(18):e121. 194. Mészáros B, Simon I, Dosztányi Z: Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 2009, 5(5):e1000376. 195. Dosztányi Z, Mészáros B, Simon I: ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics 2009, 25(20):2745-2746. 196. Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L: Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 2010, 26(18):i489-496. 197. Tusnády GE, Dobson L, Tompa P: Disordered regions in transmembrane proteins. Biochim Biophys Acta 2015, 1848(11 Pt A):2839-2848. 198. A A, J R: topGO: topGO: Enrichment analysis for Gene Ontology. In., 2.22.0 edn; 2010.

100

Appendix A: Glossary

Ambiguous / dual personality region: A missing region for which multiple PDB structures of the same sequence give conflicting information about which residues are missing.

CD: A predicted continuously disordered region.

CM: A continuous missing region in an X-ray crystal structure, formed from a composite of all structures that match that sequence.

Conditionally disordered region: An intrinsically disordered region that is structured under some circumstances and disordered under others.

Conflicting region: An ambiguous missing region where at least one crystal structure is fully observed in the region and one crystal structure is fully missing in the region.

Conserved region: A missing region that is identical between all crystal structures.

Contained region: An ambiguous missing region where at least one crystal structure contains the full length of the missing region, and the missing regions in all other structures are contained within it.

Domain wobble: A missing region that arises from the wholesale movement of a structured domain, typically facilitated by a small flexible hinge.

Dynamic disorder: Disorder that is characterized by missing regions that arise from perpetual motion at the backbone level, in a region of the crystallized protein. The presence of dynamic disorder does not necessarily indicate that this region is intrinsically disordered in vivo.

EC (Enzyme Commission) number: A numerical classification scheme for enzymes based on the reactions they catalyze.

Hybrid protein: A protein that contains a mix of ordered and disordered regions.

IDP (Intrinsically disordered protein): A protein that does not have a stable three-dimensional state in vivo, though it may have some small regions of structure.

IDPR (Intrinsically disordered protein region): A region of a protein that is intrinsically disordered.

Missing region: Missing residues in a three-dimensional crystal structure that occur due to regions of low or poorly defined electron density.

MoRF: Molecular Recognition Feature. A small region, typically between ten and seventy residues, within a longer disordered region, which undergoes a disorder-to-order transition upon binding to the correct molecular partner.

Overlapping region: An ambiguous missing region that is composed of multiple missing regions in crystal structures which overlap or are contiguous, where no one crystal structure contains a missing region that encompasses all of the others.

Partially disordered region: An intrinsically disordered protein region that displays significant residual secondary structure.

PDB: Protein Data Bank

Static disorder: Disorder that is characterized by missing regions that arise for reasons other than dynamic disorder. These include an ensemble of stable structures, wobbling domains, and crystal packing imperfections. Static disorder and intrinsic disorder are mutually exclusive.

Transient / cryptic disorder: Conditional intrinsic disorder that arises due to environmental triggers, and typically provides a functional advantage.

QFO: Quest for Orthologs
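
The region categories defined in this glossary (conserved, conflicting, contained, and overlapping) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the analysis code used in this work: the function names `composite_regions` and `classify_region`, the representation of each structure as a set of missing residue positions, and the order in which the categories are tested are all hypothetical choices made for this example.

```python
from typing import List, Set


def composite_regions(missing_per_structure: List[Set[int]]) -> List[Set[int]]:
    """Build composite missing regions (CMs): runs of consecutive positions
    in the union of missing residues over all structures (assumed layout)."""
    union = sorted(set().union(*missing_per_structure))
    regions: List[Set[int]] = []
    current: List[int] = []
    for pos in union:
        # A gap in the numbering closes the current run.
        if current and pos != current[-1] + 1:
            regions.append(set(current))
            current = []
        current.append(pos)
    if current:
        regions.append(set(current))
    return regions


def classify_region(region: Set[int], missing_per_structure: List[Set[int]]) -> str:
    """Label one composite region using the glossary definitions.
    The precedence of the tests below is an assumption of this sketch."""
    # Restrict each structure's missing residues to this region.
    local = [missing & region for missing in missing_per_structure]
    if all(m == region for m in local):
        return "conserved"  # identical in every structure
    fully_missing = any(m == region for m in local)
    fully_observed = any(not m for m in local)
    if fully_missing and fully_observed:
        return "conflicting"
    if fully_missing:
        return "contained"  # one structure spans it; the rest lie within
    return "overlapping"  # no single structure spans the whole region


# Hypothetical missing-residue sets for three structures of one sequence.
structures = [
    {1, 2, 3, 10, 11, 12, 20, 21, 22, 30, 31},
    {1, 2, 3, 20, 21, 32, 33},
    {1, 2, 3, 10, 11, 12, 22},
]
labels = [classify_region(r, structures) for r in composite_regions(structures)]
print(labels)
```

Representing each structure as a set of positions keeps the comparison with the composite region to simple set operations; a real analysis would also need to map residue numbering between structures of the same sequence, which is not shown here.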

Appendix B: IDP Search Terms and PubMed IDs

Table B1 IDP search terms and PubMed IDs.

Search term PubMed IDs that use IDP terminology Aavlea1 12569097, 21909508 ABL tyrosine kinase 17211892, 22632137 (“Abscisic acid stress 17189335, 23470734 ripening” OR “Abscisic acid-, stress-, and ripening-induced“) Acetolactate synthase catalytic 11902841 subunit AND mitochondria* Acetylcholinesterase 10892800 acetylcholinesterase variant 20173328, 23908786 AND AChE-R Acetyl-CoA carboxylase 15341732 ("actin-binding protein") 11668184, 16110343, 19590096 ACTR 11823864, 20556825, 21766125, 21894929, 21898648, 22253588, 23327569, 23373423, 23586525, 23758617, 23799450, 24449148, 24811666 Acylamino-acid-releasing 15296741 enzyme Acyl carrier protein AND e. coli 12057197 (“acyl carrier protein” AND 17604643, 18059524, 18060858, 18773978 “Vibrio harveyi“) acylphosphatase AND Sulfolobus 14724277, 16287076 solfataricus Ad41 16254343 Adapter molecule crk 12384576, 17515907 Adenomatous polyposis coli 16293619, 16753179, 21859464, 24130866 protein adenovirus early region 1A 23783631 (“adenylate cyclase” OR CyaA) 2522624, 19015266, 19860484, 20096704, 21112299, 21416544, AND bordetella pertussis 21454565, 23941183, 24145447 ADP-ribosylation factor-binding 12668765, 12679809, 12767220 protein GGA1 ADR1 regulatory protein 3515197, 9642072 AF9 23260655 aggrecan AND CS-attachment 20806220 region AND chondroitin sulfate


AGR2 | 23780840
(AKAP79 OR AKAP5) | 16442664
(AKAP250 OR AKAP12 OR AKAP gravin) | 16442664, 16762919
(Alb3 OR A3CT OR ALBINO3) | 20018841
Aldehyde dehydrogenase AND mitochondria* | 15983043
ALG11 | 11846551, 12538870, 15653326, 16037492, 19929855
Alkylmercury lyase | 15222745
alpha4 | 22194938
Alpha-adducin | 7642559
alpha/beta AND (SASP OR small acid soluble) AND protein | 24029407
Alpha- A chain | 9650074, 12235146
Alpha-crystallin B chain | 9650074, 12235146
(alpha hemolysin OR HlyA) AND Escherichia coli | 7703231, 17407262
alpha lactalbumin AND HAMLET | 20977665
(alpha S1 OR alpha s) AND casein | 7305393, 18155889, 18700180, 20025277, 21689790, 22155633
alpha spectrin AND (N-terminal or N-terminus) | 23200054, 24055379
(“alpha synuclein”) | 8901511, 9264546, 10942772, 10978144, 11152691, 11425308, 11560511, 11604526, 11734199, 11734201, 11784308, 11812782, 12032141, 12062445, 12428728, 12534279, 12586824, 12621030, 12649428, 12667059, 12754258, 12815044, 12834338, 12956606, 14982446, 15005622, 15096050, 15103328, 15643843, 15671169, 15744056, 15790533, 15925383, 15939304, 16081040, 16092089, 16162499, 16197548, 16223878, 16305213, 16343531, 16366524, 16452621, 16464864, 16489768, 16519533, 16571022, 16981679, 16981712, 17088319, 17209570, 17279794, 17315997, 17391701, 17605001, 17623039, 17681539, 17893145, 17904099, 18167451, 18198943, 18282005, 18419123, 18423664, 18436957, 18511942, 18541383, 18665616, 18691903, 18692132, 18855701, 18948383, 18976814, 19081538, 19099437, 19208933, 19293380, 19481090, 19538146, 19554627, 19555081, 19576220, 19634918, 19645507, 19655784, 19759002, 19763886, 19847913, 19891973, 19910228, 20026206, 20028147, 20121219, 20141569, 20199073, 20209636, 20359221, 20385841, 20490633, 20499903, 20522010, 20580965, 20620148, 20714568, 20828147, 20847048, 20858207, 20872101, 20923645, 20947801, 21044603, 21108951, 21120859, 21163351, 21237164, 21280118, 21297620, 21348834, 21376144, 21570984, 21641618, 21686180, 21819966, 21841800, 21880361, 22009045, 22024360, 22147495, 22153624, 22155643, 22166445, 22267729, 22274962, 22308332, 22315227, 22355530, 22438319, 22560500, 22573613, 22620680, 22662273, 22760321, 22760334, 22767608, 22771474, 22820150, 22826265, 22927976, 22947936, 22960996, 22988846, 23123341, 23162382, 23185649, 23189168, 23199922, 23214618, 23295967, 23314729, 23349712, 23374074, 23398174, 23398399, 23477540, 23526115, 23557146, 23567152, 23583776, 23607785, 23648364, 23775688, 23813793, 23817832, 23927048, 23941114, 23964651, 23978162, 24003031, 24010662, 24018100, 24058647, 24066973, 24072065, 24099487, 24140056, 24144701, 24192542, 24338013, 24360766, 24361273, 24367999, 24374342, 24383916, 24397337, 24474217, 24475132, 24489820, 24507596, 24551051, 24552879, 24634806, 24725464, 24739028, 24785077, 24817693, 24947141, 24970188, 24976112, 25079425, 25081642, 25129622, 25135664, 25139280, 25209675, 25210774, 25224747, 25246573, 25264250
alpha tropomyosin AND (N-terminal OR N-terminus) | 11575936, 21584876, 22754618, 23052974
alpha Tubulin | 24835459, 25307498
alphavirus capsid protease AND (C-terminal or C-terminus) | 25100849
ameloblastin | 18353005, 22243255, 23782691
amelogenin | 19081741, 19086270, 19236004, 20304108, 21351181, 24114119, 25298002
Amidophosphoribosyltransferase | 9914248
(amyloid beta OR Abeta OR APP OR “amyloid precursor protein”) | 12149256, 15925383, 16197548, 17534931, 18059284, 18511942, 18625543, 19208933, 19260715, 19540204, 20036826, 20306540, 20336261, 20385841, 20483339, 20598937, 20709081, 21056574, 21209058, 21354340, 21376731, 21797254, 21985427, 22155633, 22270944, 22674434, 22828513, 22952038, 23145167, 23421682, 23484883, 23561531, 23640306, 23679641, 23798407, 23811057, 23883055, 24028075, 24077017, 24131107, 24410358, 24871565, 25018569, 25080056, 25260075
Andes virus Gn tail | 22203819
AND NH2 | 11896058, 15023052, 15107424, 23722902
Ankyrin-2 | 16368689
Antibacterial protein LL-37 | 9452503
Antitermination protein N | 9659923
AP7 AND nacre | 19159266, 24837160, 24977921
AP-2 complex subunit mu | 12086608
(Aplysia nucleolar protein OR ApLLP) | 18078811
apolipoprotein A-I | 12062424, 15476409
Apolipoprotein E | 2063194, 10850798
(apo-parvalbumin OR (“apo form” AND parvalbumin)) | 18260106, 19651438
Arabidopsis phototropin 2 | 21222437
aragonite associated biomineralization proteins | 23060620
Aragonite protein AP24 | 14648763, 15222016
ARG tyrosine kinase | 17211892


arrestin2 | 19710023
Aryl hydrocarbon receptor | 15641800, 22653727
ASPP2 | 18448430
ataxin-3 | 15265035, 23891935
At3g04780 | 15741346
Atg13 | 23670046, 25139988
Atg29 | 23858448
Atg3 | 24879155
Atg1 | 25139988
ATP synthase-coupling factor 6 AND mitochondria* | 15327958
Atrial natriuretic peptide | 8043577, 16870210, 21508037, 23110718
phosphatase | 12732633
(axin OR axin1) | 16293619, 21087614, 23603389
7B2 | 22947085
bacillus | 25001212
bacterial luciferase AND mobile loop | 11827516, 19435287, 19710008, 21156144
bacteriophage p22 coat protein | 15784254
bacteriophage T4 tail lysozyme | 15701513
Bag3 | 21767525
Bag6 | 23665563
(basic region OR bZIP) | 22226835
(Basic salivary proline-rich protein 4 OR IB5 salivary protein) | 4093242, 16529944, 17503833, 19402144, 19838685, 20643086, 20665010, 21524106
BASP1 | 23520002, 23821606
BBK32 | 15292204
B cell receptor AND Igalpha | 14967045, 17176095, 21487502, 24769851
B cell receptor AND Igbeta | 14967045, 17176095, 21487502, 24769851
Bcl-2 AND apoptosis | 15733859, 23272207
bcl2 antagonist of cell death | 11206074, 16645638
(bcl-2-like protein 11 OR BCL2L11) | 16645638
Bcl-2-modifying factor | 16645638
(Bcl-xL OR "Bcl-2-like protein 1") AND human | 8692274, 9346936, 14534311
(Bcl-xL OR "Bcl-2-like protein 1") AND rat | 9346936
(BDNF OR "brain derived neurotrophic factor") | 24048383


BECN1 | 24115198
Beta-adducin | 7642559
Beta-arrestin-1 | 11566136, 19710023
Beta casein | 2387396, 7305393, 7979373, 8373815, 11784308, 18155889, 18303844, 18327957, 18336901, 18700180, 19322774, 21689790, 24918971, 25144497, 25227946
Beta-defensin 12 | 7577957
beta dystroglycan | 14622018, 15295116
Betagamma-crystallin | 15536081, 20929244, 21549498, 22836948, 24671380
Beta-lactoglobulin | 9141136
beta spectrin AND (C-terminal or C-terminus) | 23200054, 24055379
(“beta synuclein”) | 10942772, 11812782, 12667059, 15966733, 17681539, 18436957, 19081538, 23526115, 24489820
BimC | 14530265
BimL | 16645638, 21378313, 24974286
Biotin carboxyl carrier protein of acetyl-CoA carboxylase | 11956202
BirA bifunctional protein | 1409631, 10497026, 11124029, 20169168
2,3-bisphosphoglycerate-dependent phosphoglycerate mutase | 15735341
(BMFP OR Brucella abortus MFP) | 18616282
BN28a | 9046590
Bone sialoprotein 2 | 11162539
Borealin | 19146389
(B23 OR Nucleophosmin OR NPM1) AND RNA | 24106084, 24576674, 24952945
Botulinum neurotoxin | 15157097
bovine dentin phosphophoryn | 15892946, 16000119
(bovine viral diarrhea virus core OR (BVDV AND bovine)) | 18032507, 18033802
bovine viral diarrhea virus poly protein | 18033802
(15b protein OR COR15b) | 20510170, 25096979
BQP35 | 22317784
bradykinin OR kininogen domain 5 | 19272336, 21355573, 24810540
BRCA1 | 15571721, 15609993, 24303310
(Brinker OR apo-brinker) | 21088782
BSP-30 | 10377061, 20331968
(BTPC OR bacterial type phosphoenolypyruvate carboxylase) | 21524275, 22404138, 22569262, 24266766
Bud13p | 18809678
bunyavirus nucleocapsid protein | 16428606
bZIP28 | 23624714
Cadherin-1 | 11121423
CagA AND (C-terminal OR C-terminus) | 22817985, 24223932, 25116152
Calcineurin AND (subunit B type 1 OR regulatory subunit) | 7498455, 8142351, 8524402, 12357034, 17498738, 22100452
calcitonin AND salmon | 23208874
calcium-dependent protein kinase SK5 | 15568805
Calcyclin-binding protein | 15996101
caldesmon | 14635127
Calmodulin | 1606151, 1909892, 15099569, 19190183
Calmodulin-sensitive adenylate cyclase | 11807546
Calpastatin | 2407243, 9366272, 14500891, 15751971, 18519038, 18537264, 19020623, 19378261
calponin AND (“regulatory region” OR (N-terminus OR N-terminal)) | 21463585, 22424482
Calreticulin | 11101311
Calsenilin | 15746104
Calsequestrin | 4472093, 6480588, 6725251, 9628486, 25091206
cAMP-dependent protein kinase inhibitor alpha | 1862343, 2040607
cAMP-dependent protein kinase type I-alpha regulatory subunit | 6487597, 15692043
Canavalin | 8310056
Carbonic anhydrase | 10924115
Carbon monoxide dehydrogenase | 11593006
(carbon monoxide oxidation activator OR CooA) | 22544803
Carnitine O-acetyltransferase | 12562770
carnobacteriocin-B2 immunity protein | 15362858
(CASK-interactive protein1 OR Caskin-1) | 19523119
Catenin beta-1 | 9298899, 15629534
cathepsin F | 23684953
(Cby OR Chibby) | 21182262
CcdA AND antitoxin AND (C-terminal OR C-terminus) | 17007877, 19647513, 23289531


ccl26 | 11425309
C4 Cotton leaf curl Kokhran virus | 23500017
(CCT OR phosphocholine cytidylyltransferase) AND alpha | 21303909
(CCT OR phosphocholine cytidylyltransferase) AND beta | 21303909
CDK2AP1 | 22427660
CDNF | 22528768
(CDSN OR corneodesmosin) | 20448140
(“C/EBP homologous protein”) | 18534616, 22496840
Cel7a | 21112302
Cell invasion protein sipA | 14512630
Cellular retinoic acid-binding protein 1 | 15907702
(C/EPB OR CCAAT enhancer binding protein) | 17102635
CETN1 | 24252580
(c-Fos AND (C-terminal OR C-terminus OR “activation domain”)) | 10704222, 11749217, 20498278, 22301802
Chd64 | 24805353
Chemotaxis protein cheA | 8639521
Chemotaxis protein cheW | 11799399, 12488014
CHMP3 | 21220121
cholera enterotoxin subunit A | 7669757, 15049684
Choriogonadotropin subunit beta | 8202136
chorismate mutase AroQ | 9506949, 15322276
Chromogranin-A | 3473966, 8243650
Chz1 | 22807669
Circadian locomoter output cycles protein kaput | 22653727
(CITED2 OR CIT-ED2 OR Cbp/p300-interacting transactivator 2) | 14594809
cJun | 12437352
(CLASP2 OR cytoplasmic linker associated protein 2) | 22467876
Clathrin coat assembly protein AP180 | 11756460
(CLPH OR casein like phosphoprotein) | 19271754
clusterin | 11570883


c-Maf | 25220806
C- AND “transactivation domain” | 23026051, 23875714, 23980173, 24753318
(CMyBP-C OR cardiac myosin binding protein) | 17876814, 22004751
c- AND oncoprotein | 8755740, 15663936, 19022175, 19432426, 20598937, 22399317, 22457068, 22815918, 24098099
CNBP | 24161561
Col7 | 24810542
(“colicin E9” OR colE9) | 12054823, 16114886, 16166265, 16894158, 17375930, 18408035, 19021565, 20687482, 21098297, 22310049, 23176512, 23812713
("colicin N") | 9680221, 9687368, 11099384, 12679333, 15004032, 15465872, 17375930, 18573254, 19306883, 20687482, 21098297, 23176512, 23672584
Collagen adhesin | 8218209, 9334749
Collagen alpha-1(II) chain | 15466413
complexin | 21481779
(connexin43 OR “connexin 43” OR cx43) AND cytoplasmic | 18411056, 18649183, 19808665, 23828237
(connexin45 OR Cx45) | 24853747
(connexin32 OR Cx32) | 22718765
(connexin40 OR Cx40CT) | 19808665
Conotoxin iota-RXIA | 17696362
Consortin | 19864490
Convicilins | 18545275
Copper-transporting ATPase 1 | 14572476, 15035611
COR15A | 20510170, 25096979
(cordon-bleu OR Cobl) | 24415668
(C8orf4 OR “Thyroid Cancer 1” OR TC-1) | 15087392, 17905836, 23189168, 23880650, 24941347
Cox17 | 15465825, 23060071, 23106082
CP12 AND protein AND calvin | 12846565, 15047759, 16427344, 17947231, 19456123, 20224939, 21271977, 22153507, 22514274, 22955105, 22988853, 24056937, 24523724, 24863370
(CPAP OR CENPJ) AND PN2-3 | 19131341
c1R AND CUB2 domain | 20178990
(CREB-binding protein OR CBP) | 11823864, 18565542, 19278259, 19894758, 20616042, 20949632, 20969867, 21268115, 21766125, 21892446, 22174264, 22253588, 22280219, 22506277, 22829760, 23307074, 23327569, 23373423, 23586525, 23799450, 23822324, 23875714, 23980173, 24253305, 24446368, 24449148, 24970129, 25002472, 25049401, 25092343, 25215518
(CREB-1 OR Cyclic AMP-responsive element-binding protein 1) | 9413984, 19278259, 19894758, 20949632, 21268115, 22506277, 23822324, 24446368, 24970129, 25215518


C21 reductase | 20124700
2 | 15751956, 21841916
cryptochrome 1 | 16164372
CsGA | 22493266
CsGB | 22493266
c-Src kinase AND (N-terminal or N-terminus) | 19520085, 23744817
CSTB protein | 16155205
CtBP AND (C-terminal or C-terminus) | 16597837
cucumber mosaic virus cucumovirus coat | 20056465
cwf10 | 24014766
cyclin B AND G2 | 12220679
Cyclin-dependent kinase inhibitor 2A | 11718560, 12630860
(Cyclin-dependent kinase inhibitor 1B OR p27(Kip1) OR p27) | 8684460, 11749217, 11790096, 12697905, 15024385, 15609350, 16214166, 16458085, 17426451, 18177895, 18627125, 21715330, 22080214, 22276948, 22399317, 22721951, 22988851, 23029651, 23071750, 24278008, 24646893
cyclin E LMW | 24947816
Cyclin-H | 8941715
Cysteine AND glycine-rich protein 2 | 9722554
(Cystic fibrosis transmembrane conductance regulator OR CTFR) | 10792060, 17660831, 18423665, 22683332, 23052212, 23578801, 23826884, 24191035, 25120007
("cytochrome b5") | 18398853
Cytochrome b-c1 complex subunit Rieske AND mitochondria* | 10971589
("cytochrome c") | 4344990
Cytochrome c oxidase copper chaperone | 18093982
CytR | 24127726
Dad1 | 20727855
DARPP-32 AND PP1 | 18954090, 20826336
(DAXX OR Death domain-associated protein 6) | 17081986, 19106612, 21134643
Dbp5p | 19281819, 21884706
DcpS | 15769464
ddc1 | 19464966
(DEAF1 OR DEAF-1) | 23417771, 25310299
DE-cadherin | 7958432, 11121423


Decorin | 15501918
Defensin-like protein | 10775411
Defensin-related cryptdin-4 | 15595831
Dehydrin AND (TsDHN-1 OR TsDHN-2) | 20921991, 20924623, 21970344
Dehydrin AND (Xero2 OR Lti30) | 16565295, 21665998
Dehydrin CIDhn | 18452657
Dehydrin COR47 | 16565295, 18849483
Dehydrin dhn13 | 15711789
Dehydrin DHN1 | 12529538, 12644649, 19439573, 22882870
Dehydrin ERD14 | 18359842, 21336827
Dehydrin ERD10 | 16565295, 18359842
Dehydrin K2 | 19842064, 21031484
Dehydrin-like protein | 12644649
Dehydrin MpDhn12 | 21737399
Dehydrin Rab18 | 16565295
Dehydrin YSK2 | 19842064
DELLA protein GAI | 19037309, 20103592
Dematin | 19241372, 23355471
(“Dentin phosphoprotein”) | 20037676, 20157636
deoxycytidylate deaminase | 15504034
Deoxyuridine 5'-triphosphate nucleotidohydrolase | 1311056, 11257499
DF31 | 18484763
(dHR38 OR hormone receptor 38) AND drosophila | 21064127
Dihydrofolate reductase | 9012674, 14717591
(dihydroorotase OR DHO) | 19128030
dihydropyridine receptor AND (alpha1 OR II-III) | 18761102, 19076161, 21525002
Diol dehydrase beta subunit | 10467140
Diol dehydrase gamma subunit | 10467140
Diphtheria toxin repressor | 10339551
DISC1 | 21853134
DivlC | 11994149
DNA-binding protein fis | 1619650, 1946369, 9362499, 11183780
DNA-directed RNA II subunit RPB1 | 10704311
DNA-directed RNA polymerase subunit RPABC3 | 16632472
DNA fragmentation factor subunit alpha | 11371636
DNA helicase II | 17190599, 19762288
DnaK suppressor protein | 15294156
DNA ligase AND (Enterococcus faecalis OR Streptococcus faecalis) | 15296738
DNA lyase AND (apurinic OR apyrimidinic) | 9351835, 11286553
DNA-repair protein complementing XP-A cells | 9699634, 10050037
DNA repair protein XRCC4 | 11080143, 11702069, 23197791
DNA topoisomerase 1 AND human | 8567649, 8631793, 8631794, 9488644, 9488652, 9611241, 10380229, 10497031, 11283003, 14654701, 14741206, 21046176
DNA topoisomerase 2 AND yeast | 8538787, 10380229, 12022860
Dribble nucleolar protein | 16542639
DRM1 | 23516402, 24442277
Dss1 | 19919104, 23094644
dTIS11 | 25246635
duffy binding protein II | 24384095
dynamitin | 24829381
(dynein intermediate chain AND (N-terminal OR N-terminus)) | 15581372, 16949604, 19759397, 20472935, 20974845, 22988856
(E7 AND HPV) | 8245034, 15035602, 16889404, 17715947, 19553340, 19598264, 20088881, 21787785, 22590549, 22683353, 23118886, 23504368, 23864166, 24086265, 24559112
(E6 AND HPV) | 16889404, 19553340
Early activation antigen CD69 | 11101293
Early E2A DNA-binding protein | 4040872, 8039495, 9545375
E1B-55K | 21851959
EBNA2 | 24675874
EBNA1 | 25011696
4E-BP | 21183464, 22977176
Ebps | 24787448
E1B-93R | 21851959
E-cadherin | 16293619
(Ecdysteroid receptor OR EcR OR Ecdysone receptor) AND Drosophila melanogaster | 15192079, 19156821, 22628309, 23727127
Ectodysplasin-A | 14656435
EDEM1 | 22905195
E2 enzymes 3R | 22507829
 | 16734427


EFL1 | 24406167
eglin | 1505678
(EhPCTP-L OR Phosphatidylcholine transfer protein) AND E. histolytica | 25223890
E1 HPV | 22278251
(eIF-2B OR eIF2beta) | 22683627
Eisenia lombricine kinase | 15327979
EJ97 | 20385607
Elongation factor G | 8070397, 11054294
EMB-1 protein | 2339072
EMGAM56 | 21819990
EMI1 | 23708605
EMILIN1 protein | 18463100
Endonuclease VIII | 16145054
(ENSA OR endosulfine alpha OR endosulfine-alpha) | 18973346
EntI protein | 12832790, 15753083
Eotaxin | 9712872, 11425309
(Epidermal Growth Factor Receptor OR EGFR) AND (juxtamembrane domain OR kinase domain) | 12196540, 15840573, 22579287
Epsin-1 | 11756460
equine lysozyme AND ELOA | 20977665
ERM AND PEA3 | 20647002
Er_P1 | 23665109
EscJ | 15752191
esculentin | 22899362
EspF(U) | 22921828
Estradiol 17-beta-dehydrogenase 1 | 7663947, 8756321, 8805577, 9525918, 9927655, 10625652, 12223444, 12490543
AND alpha AND (AF1 domain OR N-terminal OR DBD) | 9439992, 11595744, 16604168, 22064478, 23792173
Estrogen receptor AND beta AND (N-terminal OR N-terminus) | 11595744, 16984883
(Ets-1 OR Ets1) AND (SRR OR “serine rich region”) | 18692067, 25017730, 25024220
Eukaryotic initiation factor 4F subunit p150 | 10409688
Eukaryotic peptide chain release factor GTP-binding subunit | 15099522
Eukaryotic peptide chain release factor subunit 1 | 10676813
Eukaryotic translation initiation factor 4E-binding protein 1 | 9684899, 24122746
Eukaryotic translation initiation factor 4 gamma 1 | 16698542
Ewings Sarcoma AND oncogene AND (EWS OR EWS-Fli1) | 17202261, 19584866, 20598937, 22383402, 22399321, 24086122
Exostosin-like 2 | 12562774
ExsE | 22138394
Fab DNA-1 | 7506692, 15784256
FACT AND spt16 | 18698566, 19605348, 23708362
FACT AND Ssrp1 | 18698566, 19605348, 23708362
Fatty acid-binding protein AND intestinal | 9063893
Fatty acid multifunctional protein | 20463021
Fc receptor gamma chain | 21487502
Fc receptor immunoglobulin alpha | 12783876
Fe(3+)-pyochelin receptor | 16139844
Fermitin AND mouse | 19804783
Ferripyoverdine receptor | 15733922
2Fe-2S ferredoxin | 21760931
(FEZ1 OR fasciculation AND elongation protein Zeta 1) | 18615714
fibrinogen AND (alpha chain OR αIIbβ3) | 11601975, 16288455, 20828133, 21044602, 21247890
fibrinogen AND (gammaC peptide OR gamma subunit) | 18710925
Fibrinogen beta chain | 11601975
Fibrinogen gamma chain | 11601975
(“fibronectin binding protein A” OR FnBPA) | 9398523, 18713862
Fimbrial protein | 11294863
Fip1 | 18537269
Flagellin | 2810365
flavocytochrome beta 2 | 8706682
FL cytokine receptor | 14759363
FlgB AND salmonella | 15136044
FlgC AND salmonella | 15136044
FlgF AND salmonella | 15136044


FlgG AND salmonella | 15136044
FlgM | 9454599, 12271132, 20298817, 23352839
FliE AND salmonella | 15136044
formin homology protein OR FHOD1 | 16361249
fowlicidin-1 AND cathelicidins | 16817888
FOXO3a | 19821614
Fragile X mental retardation 1 protein | 10496225
Frataxin AND mitochondria* | 19843162
FRQ | 24316221
Fst AND par toxin | 20677831
Fth AND SRP | 9659905
FtsL | 11994149
FtsY | 21543314
FtsZ | 23692518, 23714328, 25305578
Gab2 | 24739176
Gab1 | 21935523, 22536782
GAGA | 9878427
GAGE-12I | 23029259
Gal4 | 15826952
Gamma-aminobutyric acid type B receptor subunit 1 | 15304491
(“gamma synuclein”) | 10942772, 11812782, 12667059, 17234772, 17567567, 19081538, 22620680, 23526115, 23532302, 24489820
Gap junction alpha-1 protein | 12151412, 14699011, 15284189, 15492000
Gap junction alpha-5 protein | 19808665
GARP | 16280326
General control protein | 2145515, 2204805, 16483603
Gibberellin receptor GID1A | 19037309
(GIP OR SCP1) | 24751520, 24925644
Gir2 | 14715270, 15896712
Gli3 | 24146948
Glial cell line-derived neurotrophic factor | 9187648, 10545102
(gliotactin OR Gli-cyt) AND Drosophila | 14579366
globulin cruciferin B | 15912356
glucokinase | 23271955
Glutathione S- alpha-1 | 11119643


Glutenin subunit DX5 | 12741823
Glycine amidinotransferase mitochondria* | 9165070, 9218780
Glycine N-methyltransferase | 10756111
Glycogen synthase kinase-3 beta | 11427888, 11440715, 12554650
glycolate oxidase | 8706682
Glycyl-tRNA synthetase | 7556056, 10064708
GmPM1 | 20071374
GNA1870 | 16407174
G3P attachment protein | 10329170, 10756036
gpNU3 | 21821043
GRA2 AND “Toxoplasma Gondii” | 24327774, 25246715
GRAS Proteins | 21732203, 22280012
Grb14 AND PIR domain | 14623073, 15465854
GRE1 | 21420397
GroEL | 8876186, 15238634, 16223749, 24970895
GroES | 21507961, 23583779
Growth arrest AND DNA damage-inducible protein GADD45 alpha | 20460379
Growth factor receptor-bound protein 14 | 15901248
Growth factor receptor-bound protein 2 | 11178911
Growth hormone-binding protein | 1549776, 8943276, 9571026
GTA AND blood | 12198488, 15987364
GTB AND blood | 12198488, 15987364
G/T mismatch-specific thymine DNA glycosylase | 15959518, 16626738, 18512959, 18587051, 21620710
GTPase HRas | 2406906, 8142349, 9230043, 12842038
GTP-binding protein Rheb | 15728574
Guanine nucleotide-binding protein Gt | 9539726
GW182 protein | 24043833
H/ACA ribonucleoprotein complex subunit 3 | 16373493
(HBx OR hepatitis B protein x) | 22820921
HCN1 | 25142030
protein AND (Kluyveromyces lactis) | 7849597, 8745404
Heat shock factor protein AND (Saccharomyces cerevisiae OR yeast) | 8745404
beta-1 | 7649277
Hef | 24947516
Heh2 | 23357007
Hemagglutinin | 8072525, 21763731
hemocyanin | 16332393
(henipavirus OR hendra virus) AND (nucleoprotein) | 20657787, 21317293, 22108848, 22881220
(henipavirus OR hendra virus) AND (phosphoprotein) | 20657787, 21317293, 24086133
Hepatitis B polymerase | 23202419
(“hepatitis C virus” AND “core protein”) | 18992225, 20453932, 24030713
Hepatitis GB virus B core protein | 18033802
Hepatocyte growth factor receptor | 14559966, 15167892
herpes virus AND VP16 | 10398682, 15654739, 15826952, 19643037
Heterogeneous nuclear ribonucleoprotein A1 | 11917013
(HEV OR hepatitis E) AND polyproline region | 22811526
Hfq AND (Escherichia coli OR E. Coli) AND (C-terminal or C-terminus) | 21330354, 23396917
HIF-1 alpha AND (CAD OR N-TAD OR ODD) | 11959977, 15629713
High affinity immunoglobulin epsilon receptor subunit gamma | 14967045
High mobility group protein B1 | 15379539
High mobility group protein HMG-I/HMG-Y | 7559428, 10372360, 11498590, 11593421, 19732855, 21545188
High mobility group-T protein | 6273163
Hirudin variant-1 | 2567183, 3013692
Histone-binding protein N1/N2 | 12695505
Histone H1.0 | 10794405, 11413144, 11584004, 16006555
Histone H3 | 19878690, 21517079, 22212475
Histone H1 | 2373370, 9109385, 14654695, 15050829, 15878177, 21464206, 22021384, 22813934, 24622397, 24906881
Histone H5 | 2181148
histone H1.2 | 2373370, 9109385
Histone H2A | 21517079, 22212475
Histone H4 AND N-terminal domain | 19395382, 21517079, 22988066
Histone H2B | 21517079


HIV AND P6 AND gag | 16234236
HIV-1 AND Vif AND (C-terminal OR C-terminus) | 17598142, 19218568, 20450485
(HIV-1 OR HIV-1 gag OR HIV-1 p24) | 2438988, 9735293, 21134384, 24960591
HIV-1 Rev | 2125482, 7510518, 23972852
HIV-1 Tat | 16423825, 18189286, 19941902, 20034112, 20450479, 21035463, 21189124, 21303342, 21560167, 23471103, 23557146
HIV-2 Vpx | 24956595
HMGB1 | 25190813
HMGN5 | 21518955
(hNopp 140 OR Nopp 140 OR human nucleolar phosphoprotein p140) | 22906532, 24218616
hnRNP-u | 22902625
protein Nkx-2.1 | 8282100, 8898894, 9425125
Homeobox protein Nkx-3.1 | 19780584
Homoaconitase small subunit | 20170198
Hop1 | 23841450
(h-prune OR human prune protein) | 23939913
hRTN3 | 19364499
HSE1 | 16110343, 17543868
HSF1 AND (C-terminal OR C-terminus) | 17323918, 22044151
hsl protease AND ATP | 10693812
Hsp12 | 20797624, 21420397, 21998307, 22848679
Hsp22 | 17722063, 19783089
HSP16.9 | 11702068
Hsp33 | 22385960
 | 22057845
atpase | 22660624
(“human cardiac hormone”) AND (N-terminal OR N-terminus) AND “B-type natriuretic peptide” | 17399679, 18440296
human desmoglein 1 | 19136012
human farnesyltransferase | 11687658
human AND (AF1 OR transcription-activating fragment OR activation function subdomain OR activation domain) | 10196139, 15545613, 18946767, 19841061, 20184958, 21501657, 21603604, 21760925, 22003412, 22669939, 22988850, 23132854


Humanin | 20116397
human metapneumovirus glycoprotein ectodomain | 25031352
human replication protein A | 16060685, 17976647
(“human respiratory syncytial virus”) AND (phosphoprotein OR protein) AND (N-terminal OR N-terminus OR C-terminal OR C-terminus) | 16361428, 24031640
(Huntingtin-interacting protein 1 OR HIP1) | 18155047
(huntingtin yeast two hybrid protein OR hypk) | 18076027, 23272104, 24465598, 25116620
H5 vaccinia virus | 23476017
(HvNAC013 OR NAC013) | 21856750
HvNAC005 OR NAC005 | 21856750
HY5 | 17001643
hydrophilin | 21420397
Hypoxanthine-guanine phosphoribosyltransferase | 9790669
I-309 | 10821677
(“IA3” OR “IA 3” OR “IA(3)” OR "IA₃") AND (Saccharomyces cerevisiae OR Yeast) | 15065849, 17087512, 18681437, 19003993, 21080428, 21490720
IC AND cytoplasmic dynein | 25263009
ICAT AND transcriptional inhibitor | 16293619
(Icln OR pICln) | 15905169, 22179008
Id protein HLH domain | 24981796
IE1 | 19812155
(IE62 OR varicella-zoster virus major transactivator) | 19357160
IF1 AND Bovine | 25049402
(IF7 OR IF17) AND “glutamine synthetase” | 12824490, 21992216, 22023175
IgE cepsilonmx | 24457896
IgG heavy chain CH1 | 11036070, 19524537
(importin-beta OR Kap95p) | 20816072, 25435324
Inosine-5'-monophosphate dehydrogenase | 9271497
Insulin-like growth factor-binding protein 6 | 15308688
Integrase p46 | 20026028
Interferon-induced guanylate-binding protein 1 | 10676968


interferon regulatory factor 1 | 20947504, 21245151, 23134341
Involucrin | 10601302
IRF5 | 22995936
IRSp53 | 18417251
(“islet amyloid polypeptide” OR IAPP) | 10191146, 15576552, 16197548, 17123962, 19647750, 20201512, 20337445, 21215287, 23266002, 23380070, 24009497, 24021023, 25260075
Jagged-1 AND human AND cytoplasmic tail | 16427310, 17892488
JARID1A | 18270511, 19430464
JARID1B | 19636953, 20403335
Juxtanodin | 23198089
kappa casein | 11784308, 18155889, 18700180, 20025277, 21689790, 21961598, 22443319, 25035108
Ki-1/57 | 18788774, 20436279
KorB | 20200158
Lactose operon repressor | 8638105, 8683581, 9032054
(Lactotransferrin OR lactoferrin) | 19858187, 25245670
Lamin A/C | 12057196, 21635954
L20 AND ribosomal protein AND (extension OR "amino terminal" OR N-terminal OR N-terminus) | 16977336, 19399222
L5 AND xenopus AND ribosome | 12860121
Latent membrane protein 2A | 17174309
(LBH OR limb bud and heart) AND (transcription or transcriptional) | 20005203
LC8 | 21533121, 21777386
(L1-CAM OR Neural cell adhesion molecule L1) | 18321067
LEA protein 1 AND soybean | 11891239
Lef-1 | 16293619, 22089506
Leukotoxin | 7703231
Lhp1p | 21212361
(Lipid A export ATP-binding OR permease protein msbA) | 11546864
Lipoate-protein ligase A subunit 1 | 16384580
Lipocalin-1 | 15489503, 19770509
Listeria monocytogenes ActA | 18577520
LjIDP1 | 18779323
LLA23 | 23317817


Lombricine Kinase | 21212263
L-rhamnose isomerase | 10891278
Lupus La protein | 12842046, 15004549, 16387655
luxU Phosphorelay protein | 15740742
Lymphoid enhancer-binding factor 1 | 7651541
Lysozyme C | 1515108, 15805597
Lysyl oxidase | 20192271
lytM | 16269153
MA16 | 22270696
Macrophage metalloelastase | 15809432
Macrophage scavenger receptor types I and II | 11785981
major prion protein | 9132005, 10618385, 11685242, 12890024, 17190613, 17210575, 17256089, 17299036, 17359979, 19913031, 20356930, 20580965, 20627399, 20958083, 21091436, 21196244, 21441896, 21726811, 22339436, 22421432, 22529103, 22676969, 22987112, 23145167, 24106878, 25034251, 25101991
(MANF OR mesencephalic astrocyte-derived neutrotrophic factor) | 19258449
MAP1B | 23339032
(MAP2c OR “microtubule-associated protein 2”) | 15751971, 21634433, 23877929
(MAP4 OR microtubule associated protein 4) | 25054624
MARCKS | 8608129, 11829734, 15640140, 18502797, 22427633, 24590112
Max AND c-myc | 16171389, 19022175, 19114306, 22815918, 23106332
(MaZE OR PemI-like protein 1) AND (Escherichia coli OR E. coli) AND (C-terminal OR C-terminus) | 12533537, 12718874, 12743116, 15735309
MaZF AND (Escherichia coli OR E. coli) | 12718874
MBD1 | 24810720
(MCL-1 OR “myeloid cell leukemia 1”) AND (N-terminal OR N-terminus) | 20392693, 20480043
MCoCC-1 | 19711988
 | 11718560, 12630860, 15953616, 18809412, 20303977, 20436290, 21191186, 22807444, 24127580
(MDMX OR MDM4) | 24127580
(“measles virus”) AND nucleoprotein | 12621042, 14645906, 14749181, 17034249, 18536007, 19198564, 19445677, 19718689, 20058326, 20303863, 20450481, 20816082, 20925409, 21533140, 21613569, 21805002, 22399322, 22887965, 22901047, 23811056, 24043820


(“measles virus”) AND phosphoprotein | 12069524, 12944395, 14645906, 15479804, 16046624, 20058326, 20450481, 21805002, 22887965
mec1 | 19464966
(MeCP2 OR Methyl-CpG-binding protein 2) | 17371874, 20405910, 21031501, 21278419, 22294343, 24766768
MEG-14 | 23746503
melanophilin | 17513864
Membrane fusion protein p14 | 15448165
(mesoderm development candidate 2 OR MESD) | 17488095
Metal-binding protein smbP | 15366930
Metallo beta lactamase | 19395380
metallothionein-2A | 24918957
metapneumovirus phosphoprotein | 24224051
Methane monooxygenase component C | 15379538
Methyl-accepting chemotaxis protein I | 15774032
Methyl-accepting chemotaxis protein II | 15774032
(mfp-1 OR fp1) AND mussel | 22915553
(mfp-3 OR fp3) AND mussel | 22915553
(mfp-2 OR fp2) AND mussel | 22915553
MGM101 | 23418572
MHC class I polypeptide-related sequence A | 10367903
(minichromosome maintenance protein OR MCM OR N-mtMCM) AND protein | 22208199
2mit | 24098788
(Mitochondria* fission 1 protein OR Fis1) | 14623186
(“mixed lineage leukemia”) AND protein | 20961854, 20969867
MKL1 | 24909411
Modification methylase PvuII | 9207015
Mothers against decapentaplegic homolog 4 | 10647180
MpAsr | 21327389
Msh6 | 17531814
Msn2 | 22505609
MSP2 AND “Plasmodium Falciparum” | 17883245, 18440022, 19450733, 20545323, 20865513, 21784057, 22304430, 22749949, 22966050
MST1 SARAH Domain | 22112013


mSYD1A | 23791195
Multidrug resistance protein mexA | 15226509
Mu-type opioid receptor | 15680247
(“myelin basic protein”) | 12372316, 12605403, 14695288, 15219899, 16420483, 16794783, 17131428, 17676872, 18284662, 18326633, 18449534, 19134474, 19519451, 19636827, 19642704, 19855925, 19856323, 19903451, 20169373, 20453917, 20593886, 20831157, 21044600, 21887699, 21889463, 22249765, 22405011, 22728818, 22947219, 23618134, 23861868, 24516125, 24758710, 24956930
Myelin transcription factor 1-like protein | 10606515
MyoD | 1327135
Myoglobin or apomyoglobin | 8844864, 12079388, 21640124
Myomesin-1 | 15890201
myosin 5a | 25312846
(myosin II heavy chain kinase B OR MHCK-B) | 20199682
Myosin light chain 1 | 3233216
("myosin motor domain") | 9147986, 9741621
n16.3 | 24720254
Nab3 | 24100036
NACHT LRR PYD domains-containing protein 1 | 14527388
Naked2 | 17045239
(NCoR OR NCoR-1 OR SMRT) | 21925568
Negative regulator of flagellin synthesis | 9095196
NEIL1 AND (C-terminal or C-terminus) | 22902625, 23542007
Neurofilament heavy polypeptide | 3220257, 9424114, 9714161
Neurofilament light polypeptide | 3920075, 8634266
neurogranin | 23462742, 24713697
(“neuroligin 3”) | 18456828, 19898942, 21481779, 21647611
neuromodulin | 23462742
neurotensin | 15994900
NF-kappa-B inhibitor alpha OR IkappaBalpha | 7891711, 9872404, 18565540, 21628581, 24605363
NGF AND neurotrophin | 20923662
(NGN1 OR neurogenin 1) | 20102160
NHE1 AND (C-terminal OR C-terminus) | 20556825, 21425832
(Nipah virus OR NiV) AND glycoprotein | 18815311
(Nipah virus OR NIV) AND nucleoprotein | 20657787
(Nipah virus OR NIV) AND phosphoprotein | 20657787
NIPP1 | 22866172, 22940584, 22988849
NleH | 24373767
(NMDAR2B OR GluN2B) | 18197970, 21481779, 21712388, 23782697
(NMII OR non muscle myosin II) | 19959848
Nogo-B AND (N-terminal OR N-terminus OR C-terminal OR C-terminus) | 17397058, 17437522, 17585875, 19508346
Non-histone chromosomal protein 6A | 10228169
Non-histone chromosomal protein H6 | 6273163
Non-histone chromosomal protein HMG-14 | 6273163
Non-histone chromosomal protein HMG-17 | 565710
(n16 OR n16N) | 21473588
Nrf2 AND Neh2 domain | 16581765, 18762866
(NRSF OR neural restrictive silencer factor) | 21627111, 24446368
(NS5A OR Non structural protein 5A) AND hepatitis | 15247283, 15339921, 15902263, 17880107, 18032500, 19244328, 19249289, 19297321, 20450484, 21489988, 21516467, 22543239, 22720086, 23740998, 23947833, 25031324
(NS5B OR Non structural protein 5B) AND hepatitis | 22815741
NS1-2 norovirus | 22347381
(NS3 OR non structural protein 3) AND hepatitis C virus | 23803659
(Nsp OR nucleoskeletal-like protein) AND bacillus subtilis | 23358668
NT4 AND neurotrophin | 20923662
NtGR-RBP1 | 24957607
(NTL9 OR N terminal domain of the ribosomal protein L9) | 12416980
Ntrc1 | 16169010
Nuclear cap-binding protein subunit 1 | 11545740
Nuclear cap-binding protein subunit 2 | 11545740
Nuclear pore complex protein Nup133 | 15557116, 17360435
coactivator 3 | 11823864
Nucleocapsid protein p7 | 1304355, 1639074


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
Nucleoplasmin | 11248694, 24121686
nudix hydrolase | 20657662
Nup214 | 24739174
Nup159 | 23223634, 24037535, 25263009
Nup107 | 17360435, 24739174
Nup62 | 24739174
Nup153 | 17446086, 21961597, 22665783, 24739174, 24898547
Nup116 AND (Saccharomyces cerevisiae OR yeast) | 18688269
Nup96 OR Nup98 | 17360435, 24739174
NUP2P nucleoporin AND (Saccharomyces cerevisiae OR yeast) | 12065587
(Nupr1 OR p8 nucleoprotein) AND human | 23929272, 24205110
("ole e 6") | 15247256
Olfactory marker protein | 16008352
Omega gliadin storage protein | 12741823
Oncomodulin | 8487302, 17766386
Opaque 2 | 15096055
(OPN OR osteopontin) | 11162539, 15892946, 19636917, 20152802, 20174473, 21609000, 22342723, 22730383, 24327307, 24928493
Opy2p | 19846660
Ornithine decarboxylase | 10563800, 10985770
OS-9 | 25193139
Osteocalcin | 8101026, 8218200
OVOL2 | 22737237
OVOL3 | 22737237
OVOL1 | 22737237
p300 acetyltransferase | 17438265, 23133622, 23307074, 24253305
(PAGE5 OR Prostate associated gene 5) | 19768387, 22073178
(PAGE4 OR “Prostate associated gene 4”) | 21357425, 22078626, 24263171, 24559171
Paired box 6 | 10346815
PAK4 | 23876315
Pan1 | 23801378
pan3 | 23340509
Pancreatic trypsin inhibitor | 11320305
p53 AND (CTD or C-terminal or C-terminus) | 7559631, 8875929, 9169453, 9405613, 12110597, 12351827, 12681938, 14499615, 14534297, 15837201, 16118206, 16461914, 17335006, 17438265, 17620598, 17972286, 18189286, 18366598, 18391200, 18410249, 18809412, 18812399, 18824006, 19162020, 19216110, 19557012, 19819244, 19847292, 19933326, 20516128, 20873738, 20876941, 20961098, 21457718, 21528875, 21541066, 21686180, 21943426, 21979461, 22046250, 22176582, 22280219, 22777995, 22807444, 22915551, 22972749, 23028280, 23145047, 23205890, 23313430, 23352836, 23606624, 23609977, 23807285, 23836939, 24150971, 24303310, 25075982, 25099811, 25185827, 25244701
p8 AND human AND nucleoprotein | 11056169, 11350901
PAP AND protein | 25267253
Par-4 AND (C-terminal or C-terminus) | 11714921, 12538889, 19490121, 25195896
Parathyroid Hormone AND human | 10837469, 11604398
Parathyroid hormone-related protein | 8138348, 10050767
ParB | 21839743
parG AND (N-terminal OR N-terminus) | 14622405, 15951570
PCaP1 | 18664522, 20448467
P130Cas AND CasSD | 23827411, 24722239
PCC6 OR Dsp16 | 9067253
(p21(Cip1) OR p21 OR Cyclin-dependent kinase inhibitor 1) | 8876165, 12964161, 16270364, 18627125, 19165719, 21358637, 21634433, 22399317, 22988851, 23029651
pdtaR | 15341725
Pectate lyase | 11717490
(PELPK1 OR At5g09530) | 21559969
Penaeidin-3a | 12842879
PEP-19 | 19106096, 20973509, 23204517
peptidylglycine alpha-amidating monooxygenase | 19635792
Peptidyl-prolyl cis-trans isomerase | 17125854
Perfringolysin O | 10555145, 15797734
Peripherin-2 | 16522184
Periplasmic hydrogenase large subunit | 10368269
peroxiredoxin AND Aeropyrum pernix | 16214169
Peroxisome proliferator-activated receptor gamma | 19043829, 21620710
Pex19 | 25062251
(Pex5p OR peroxisomal cycling receptor) | 16403517
(PfAMA1 OR (plasmodium falciparum AND "apical membrane antigen 1")) | 12270711
PfEMP1 | 18773118
PFMG1 | 23865482
(PfTIM OR triosephosphate isomerase) AND plasmodium falciparum | 19914198
PGC-1alpha AND “activation domain” | 22049338, 23648480, 23884631
PhaF | 23457638
phasin AND "Novosphingobium nitrogenifigens" | 22582058
Phenylalanyl-tRNA synthetase alpha chain | 7664121, 9016717
phob | 17313959
phosphatase 2B subunit alpha | 8524402
1-phosphatidylinositol-4,5-bisphosphate delta-1 | 8602259, 8784353
Phosphatidylinositol-4-phosphate 5-kinase type II beta | 9753329
(Phospholamban OR PLN) | 25251363
((photosystem II OR PSII) AND manganese stabilizing protein OR "Oxygen-evolving enhancer protein 1") | 9890923, 10569936, 10675525, 12146968, 16049787, 16170637, 16228381, 16503666, 21316983, 23940038, 24437616
Phytanoyl-CoA dioxygenase | 16186124
Pih1 | 23139418
Pilosulin-1 | 15639237
pin1 | 17892493
PIPKIgamma661 | 19287005
p57Kip2 | 11746698
(PKR OR ) | 19232355
(PKS OR modular polyketide synthase) | 22282160
Plasminogen | 15299951
Plasminogen activator inhibitor 1 | 7664104
PLK1 | 14592974
PMLII | 19088278
Pml1p | 18809678, 19010333
(PM28 OR GmPM28) | 20071374
(pNGF OR pro-Nerve Growth Factor OR proNGF) | 16856872, 19089979, 21818348
PNUTS | 24591642


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
poliovirus 3AB | 23908350
polyprotein foot-and-mouth disease virus | 2537470, 6194313
(positive 4 OR PC4) AND Human | 16605275
Potassium voltage-gated channel protein Shaker | 2122519
POU domain AND class 2 AND transcription factor | 11380252
POU domain class 2-associating factor 1 | 10329190, 10541551, 11380252
PPARgamma | 16823031
P1 protease | 24603811
PPYR1 OR CG15031 | 16631104
(PQBP-1 OR “polyglutamine tract binding protein 1”) | 19303059, 22500761, 23649393
precol-NG AND "mussel byssus" | 23947342
preS1 | 17766372, 23851574
prestin | 24453323
Prevent host death protein | 9915794, 18757857, 20603017
prickle AND PET domain | 19053268
AND CTE | 19553667, 21803119, 23995840
(“prokaryotic ubiquitin-like protein”) | 19580545, 19607839, 20636328, 20953180, 23198822, 23557784, 23601177
ProP effector | 15476391
Prostaglandin E synthase 3 | 10543959, 10811660
("prostatic ") | 10639192, 19995078
Protease A Inhibitor 3 | 10655612
Protein-arginine deiminase type-4 | 15247907
Protein argonaute-2 | 22539551
Protein B-Myc | 16893186
Protein C-ets-1 | 9770451, 15994560
Protein grpE | 9103205
("") | 17222345, 23946424, 24192038
Protein kinase C alpha V5 domain | 23762412
Protein L precursor | 9007989
Protein Nef | 10339411
( inhibitor 2 OR PP1 I-2) | 6245879, 10807923, 17636256, 18954090, 20826336
Protein phosphatase 1 regulatory subunit 11 | 9843442


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
Protein phosphatase 1 regulatory subunit 1A | 208844
(Protein phosphatase 1 regulatory subunit 12A OR MYPT1) | 15164081, 21142030
Protein phosphatase 1 regulatory subunit 1B | 6319628
Protein transport protein SEC9 | 9326367, 10048921
Protein transport protein Sec61 subunit gamma | 8942632
Protein tyrosine phosphatase type IVA 1 | 15213447
Protein tyrosine phosphatase type IVA 3 | 14704153
proteoglycans syndecans | 24956062
prothymosin alpha | 7548085, 10555983, 10631119, 12062405, 12582818, 16628001, 17355803, 17929838, 17949994, 22125611, 23050820, 23189168, 23318954, 23359453
Proto-oncogene serine/threonine-protein kinase Pim-1 | 15525646
Prp8 | 24643059
(PSC OR posterior sex combs) | 22517748
(PSD-95 OR Post Synaptic Density 95) | 16601002, 17666528
PsHSP18.1 | 19717454
PTEN | 23783762, 24056727
PTP1B | 24845231
PulS | 21878629
PUMA | 23301700, 24654952, 25313042
(Purkinje cell protein 4 OR PEP19) | 19106096, 20973509, 23204517
(PvAMA-1 OR Plasmodium Vivax Apical Membrane Antigen 1) | 21713006
PVY AND potato virus | 17971447
PWL2 OR PWL2D | 20438845
PXR receptor AND human | 11408620
Pyridoxine-5'-phosphate oxidase | 12824491
Pyridoxine/pyridoxamine 5'-phosphate oxidase | 15858270
pyrrhocoricin | 15478127
Pyruvate dehydrogenase E1 | 11955070
quaking-A protein | 15811367
rabies virus phosphoprotein | 19341745
Rab proteins geranylgeranyltransferase component A 1 | 11886217, 12620235
RAC-beta serine/threonine-protein kinase | 12086620, 12517337
RAD52 homolog DNA repair protein | 12370410
RAF kinase | 7766599, 7791872, 8710867, 8756332
Rap-2a | 9312017, 10591105
RAP1 AND DNA | 8620531
RAR gamma AND (N-terminal OR N-terminus) | 24333369
Ras-related C3 botulinum toxin substrate 1 | 14625275
Ras-related protein Ral-A | 15530367
RcdA | 19747489
recA intein AND Mycobacterium tuberculosis | 16288917
Regulator of chromosome condensation | 9510255, 20347844
Regulator of G-protein signaling 4 | 9108480
Regulatory protein cro | 2146682, 6452580, 7500341, 9653036, 9653037
(RelA OR SPoT) AND (C-terminus OR C-terminal) | 24717772
Replication protein A AND 70 | 10526407
(Reticulon-4 OR RTN 4B) | 17397058
(retinal OR cGMP) AND phosphodiesterase AND “gamma subunit” | 12643535, 18230733, 19075750, 21393250, 21978030, 22514270
RXR-alpha | 9698548, 9826495
(RGS9-2 OR Regulator of G-protein signaling 9) | 20095651
Rhabdoviridae AND nucleoprotein | 20450482
Rhabdoviridae AND phosphoprotein | 20450482
rhodopsin | 10926528, 23883288
Rho GTPase | 9009196
protein component 1 | 15518563
Ribonucleoside-diphosphate reductase M2 subunit | 7589490, 8130196, 8876648
Ribonucleoside-diphosphate reductase small chain 2 | 11526233
Ribonucleoside-diphosphate reductase small chain 1 | 11526233


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
Ribonucleoside-diphosphate reductase 1 subunit beta | 8805591
ribosomal protein L2 | 21592750
ribosomal protein L4 | 1618378, 19399222
ribosomal protein L11 | 8989327, 9398519
ribosomal protein L33 | 3297162
ribosomal protein P1-alpha | 10913306
ribosomal protein P1-B | 15182941
ribosomal protein P2-beta | 9236009, 10913306
ribosomal protein S4 | 21156135, 22458631
ribosomal protein S17 | 6751823
ribosomal protein SA AND human | 22640394, 23137297
Ribosyldihydronicotinamide dehydrogenase | 16302811
RIC-3 | 19899809
Rif1 | 24634216
RNA Polymerase AND Escherichia coli | 25261014
RNA polymerase II subunit A C-terminal domain phosphatase OR FCP1 | 12732728, 19888685, 21672523, 21988473
("Rnase E" AND caulobacter crescentus) | 20952404
(“RNase E” AND (escherichia coli OR e. coli)) | 15236960, 16516921, 17447862
(“RNase P” OR “ribonuclease P” ) AND bacillus subtilis | 11258888, 20476778
("RNase Y") AND Bacillus subtilis | 21803996
RNF4 ubiquitin ligase | 24844634
ROP6 | 24993791
(4.1R OR "Protein 4.1") | 20109190
RPM1-interacting protein 4 | 25039985
Rpn4 | 22349505
Rv2377c | 20434955
Rv3221c | 18004752
(Rwdd1 OR RWD domain-conatining protein 1) | 18954556
RYBP | 19170609
RyR1 | 22937102
S12 | 24442609
S100A4 | 24098542
S100A3 | 24098542


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
S100A12 | 24098542
S100A2 | 24098542
S100A6 | 24098542
(SAA OR serum amyloid A) | 22448726
San1 | 21211726, 23363599
(S100A8 OR S100A9) | 24098542
Saposin-C | 14674747, 15713488, 16823039, 18462685
(“SARS coronavirus” AND (“nucleocapsid protein” OR “nucleocapsid phosphoprotein”)) | 16228284, 23717688
SAS-5 | 24778935
S100B | 24098542
(SbASR-1 OR Abscisic acid stress ripening protein) | 22639284
(SBDS OR TcSBDS) AND Trypanosoma cruzi | 19121363
Sbi-III | 18550524
(SBP2 OR SECIS binding protein 2) | 11238886, 19467292
Secretogranin-1 | 9136897
(securin OR PTTG1 OR pituitary transforming gene 1 product) AND (N-terminal OR N-terminus) | 12220679, 15929994, 19053469
Selenocysteine lyase | 20164179
(“selenoprotein s” OR SelS OR VIMP) | 22700979, 23566202, 23914919
Sem1 | 24412063
(Seminal vesicle protein number 4 OR SV-IV) | 18215165, 19851073
(SeMV OR "sesbania mosaic virus") AND polyprotein 2a | 19995563
(“sendai virus”) AND nucleoprotein | 17459940, 20450486
(“sendai virus”) AND phosphoprotein | 14980481, 16284250, 17459940, 17586564, 20450486
sept4 | 21949740
SEPT4 AND septins | 17105210
SERF1a | 22854022
Serine-aspartate repeat-containing protein D | 9813018
Serine protease HTRA2 AND mitochondria* | 11967569
Serine/threonine protein phosphatase 5 | 15713458


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
(SERT OR serotonin transporter) | 21129485
("serum albumin") AND human | 10388840 11406578
Seryl-tRNA synthetase | 8128220, 8230201, 8654381
SF3b155 | 24795046
Sfbl-5 | 21840989
sfr1 | 23324799
S-Gi alpha 1 | 8073283
Sgs1 helicase | 24038467
SHC-transforming protein 1 | 12906822
shematrin | 25001481
Sialidase-2 | 15501818
Sic1 AND CDK | 11734834, 17522259, 17660831, 19008353, 19280601, 20399178, 20399186, 20589454, 21053335, 21539793, 22356687, 23189058, 24673507
Sigma-E factor negative regulatory protein | 18421143
Simian virus Major capsid protein VP1 | 1659663, 8805523
SIMPL | 17233114
sindbis virus AND capsid protein | 1944569
SIP18 | 21420397
(Sir3p OR silent information regulator 3 protein OR SIR3) AND saccharomyces cerevisiae | 2121770, 16581798, 16641491, 17176117, 22096199
sirt1 AND (N-terminal OR N-terminus OR C-terminal OR C-terminus) | 23497088, 23811471, 24020004
Sis1 AND yeast | 23227221
(SKIP OR Ski interaction protein) AND SNW | 20007319
skp1 | 24506136
SLBP OR (Histone RNA hairpin-binding protein) | 15260482, 23286197, 25002523
Small heat shock protein HSP16.5 | 9707123
(small molecule reductase regulatory protein OR SmI1 OR knr4) | 19086274
Small proline-rich protein 2E | 3133554, 9722562
SMARCA2 | 15368101
SMG-9 | 20817927
SMK toxin | 9089808


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
(Smoothelin-like protein 1 OR SMTNL1) | 18310078, 18477568, 22424482, 24905744
snurportin | 19619473
Somatoliberin | 9375414
SopB | 24075929
sortase | 11371637, 15117963, 22468560
Sortilin | 18191449
Sorting nexin-3 | 14514667
Sos1 | 23528987
southern bean mosaic virus capsid protein | 6854633
Sp1 | 15609997, 19292861
(SPA OR septal pore associated proteins) | 22955885
SPARC | 3427055
Spatzle | 12872120
spd2 | 24652833
spd1 | 20516199, 20556825, 24652833
Sperm histone | 2243113, 2738040
spinophilin | 18028445
split intein | 24236406
SpoIISB | 21147767
src family kinase | 25071818
50S ribosomal protein L27 | 3297162, 11673426
30S ribosomal protein S12 | 6751823
30S ribosomal protein S18 | 6751823
30S ribosomal protein S19 | 6751823
SRP19 | 17434535
SSB AND (escherichia coli OR e. coli) AND (C-terminal OR C-terminus) | 15169953, 20360609, 24021816
(SseJ OR SseJ-H OR SseJ-L) AND salmonella | 20877914
Sso Acp | 24893801
Stannin | 16246365
starmaker | 18636772, 19635593, 22821534
statherin | 15892946
stathmin | 10675326, 12860982, 16554300, 17029844, 17034249
STIL | 24022511
Stim1 | 24650897


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
stringent starvation protein A | 15735307
(stringent starvation protein B OR SspB) | 14536075
subtilisin AND (propeptide OR pro-peptide) | 11519747, 15740747, 23009354
SUFU | 24311597
sulfhydryl oxidase ALR | 23207295
("sulfite reductase") AND NADPH | 10860732
Sulfotransferase 1A3/1A4 | 10543947
(SULT2B1 OR SULT2B1b OR Sulfotransferase family cytosolic 2B member 1) | 12923182
SUMO-activating enzyme subunit 1 | 15660128
SUMO-activating enzyme subunit 2 | 15660128
("Superoxide dismutase") | 19800308
superoxide dismutase-like yojM | 15897454
supervillin | 23075227
Suppressor of cytokine signaling 3 | 16302975
SVIP | 24055875
Swallow AND cytoplasmic dynein | 25263009
Synaptobrevin homolog 1 | 9326367
(synaptobrevin OR Vesicle-associated membrane protein 2) | 9346956, 19918058, 21481779
Synaptopodin 2 variant OR Fesselin | 17676886, 18457655
Synaptosomal-associated protein 25 | 9671503
syndecan-1 | 23888783
Syntaxin-1A | 12680753
Talin-1 | 19804783
Tau AND (protein OR alzheimer’s OR tauopathies OR neuronal) | 10995239, 11606569, 11807946, 12358741, 14769047, 15615633, 15628855, 15855160, 15925383, 16464864, 16475817, 16515451, 16908029, 17047358, 17081491, 17241479, 17262987, 17385861, 17493042, 18061582, 18495933, 18725412, 18725924, 18771286, 18783251, 18834853, 18953106, 19075586, 19149675, 19214739, 19226187, 19549281, 19625749, 19769346, 19826005, 20160453, 20184958, 20678071, 20687558, 21056617, 21498513, 21560166, 21677644, 21910444, 21931162, 22083130, 22291015, 22401494, 22528085, 22762014, 22891813, 22901047, 22998648, 23027743, 23027744, 23199922, 23515417, 24018100, 24361273, 24559475, 24581495, 24733915, 24945760, 25206938, 25284680


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
Tax AND HTLV transcriptional activator | 14580193
TB1-C-Grx1 | 24830542
TBPL2 | 17570761
T-cell surface glycoprotein CD4 | 9622505
T-cell surface glycoprotein CD3 delta chain | 14967045, 17176095
T-cell surface glycoprotein CD3 epsilon chain | 14967045, 17176095
T-cell surface glycoprotein CD3 gamma chain | 14967045, 17176095
T-cell surface glycoprotein CD3 zeta chain | 14967045, 17174464, 17176095, 17410622, 19012413, 19733547, 21487502, 24120941
TCP8 | 23760157
(TDH OR thermostable direct hemolysin) AND Vibrio parahaemolyticus | 20335168
TDP-43 | 23962724, 24497973
Teg12 | 20361791
Telomere-binding protein subunit beta | 9201953, 17082188
(Tex1 OR Trophozoite exported protein 1) | 23056243
TGBp1 | 19675186, 22349738
TgDRE | 16605254
TgGCN5 AND "toxoplasma gondii" | 21055425
THAP11 | 15368101
thioredoxin AND (escherichia coli OR e. coli) | 12641467, 16113108, 16542678, 16787768
thioredoxin AND human | 17611012
thioredoxin-glutathione reductase AND Schistosoma mansoni | 19710012
thrombopoietin receptor MPL | 21858098
Thylakoid soluble phosphoprotein | 17176085
Thylakoid-soluble phosphoprotein AND Arabidopsis thaliana | 19113838
| 7628623, 9336833
thymidylate synthase | 8845352
Thymosin beta-4 | 8269922, 15039431
TIM23 | 20718036
Tir AND "Escherichia coli" | 17449672


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
AND (SH3 OR PEVK) | 7569978, 9472037, 16766517, 16949547, 22910563, 23062340, 23063534
Tma46 | 23002146
tnrc6c | 23340509
Tob2 | 23340509
Toc159 | 20042108, 21057194
Tom70 | 15316022
tomato aspermy virus cucumovirus coat | 20056465
tonB | 15644214
(Tppp3 OR tubulin polymerization-promoting protein family member 3) | 19235716
TPPP/p25 | 15474353, 15567525, 15883183, 17105200, 23166627
TRAF3IP2 | 24021976
Transcriptional activator protein traR | 11171981
("Transcription factor 4") AND human | 11237626, 23821606
("Transcription factor 7-like 2") | 11237626, 11713476
(Transcription factor p65 OR NF-kappab p65) | 9384558, 12686541, 21220295
Transcription initiation factor IIA large subunit | 8610010
Transcription initiation factor IIA small subunit | 8610010
Transcription initiation factor TFIID subunit 1 | 9741622
(transient receptor potential OR TRPV1) | 22575650
Transitional endoplasmic reticulum ATPase | 15740751
Translation initiation factor IF-3 | 9054966
of 132 | 20042108
(translocated promoter region protein OR Traf3ip2) | 21216290
Transthyretin | 15477096
Tricorn protease | 11719810
triosephosphate isomerase yeast | 2402636, 3430618
Trisk 95 | 22937102
(tropoelastin OR elastin) | 19501564, 24550393, 25142785
tropomodulin AND (N-terminal OR N-terminus) | 11423419, 17706248, 21584876
Troponin C AND (slow skeletal OR cardiac muscles) | 2933134


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
Troponin I cardiac muscle | 12840750, 18433059, 20889975, 21322033
TRP channel AND (C-terminal or C-terminus) | 23553631, 24723374
TRPV6 | 21664972
TRPV5 | 21664972
TRTK-12 | 23028280
Trypsin inhibitor 2 | 11292835, 11434766
Tryptophan synthase alpha chain | 15667212
(TTN-1 OR 2MDa_1) | 20346955
Tubulin beta-4 chain | 10211825, 24835459, 25307498
Tubulin beta-2 chain | 10211825
Tumor protein | 11264583
Type II secretion system protein M | 15081815
Tyrosine 3-monooxygenase | 8104613
Tyrosine-tRNA ligase | 17855524
Tyrosyl-tRNA synthetase | 2504923, 12005430
U1A | 15476390
U2AF65 | 24734879
Ubiquinol oxidase subunit 2 | 1324168, 8433374, 8618822, 11017202
Ubiquinol oxidase subunit 1 | 2162835, 11017202
UDP-N-acetylhexosamine pyrophosphorylase | 11707391
ULK1 | 21853163
Ultrabithorax AND (N-terminal OR N-terminus OR C-terminal OR C-terminus) | 18508761, 22399320, 25286318
ultraspiracle AND "aedes aegypti" | 24704038
Ump1 | 24065419, 24688736
UmuD | 18216271
Undecaprenyl pyrophosphate synthetase | 11581264
UPF2 | 19556969
URE2 | 10224139, 11327838, 15628874
(UreG OR urease accessory protein) | 15542602, 16846235, 17309280, 21922108, 22271305, 23560717, 25200810
Uroporphyrinogen decarboxylase | 9564029, 11719352
UV excision repair protein RAD23 homolog A | 14557549
vanilla mosaic virus coat protein | 16421638


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
VapD | 25084333
Vinculin | 15642262
VIP1 | 25212215
Viral macrophage inflammatory protein 2 | 10595530
("Vitamin D3 receptor") | 10678179
Voltage-dependent L-type calcium channel subunit alpha-1S | 12620094
Voltage-gated potassium channel subunit beta-1 | 10585425
Von Hippel-Lindau disease tumor suppressor | 14963040
VP30 | 17567691
VP16 AND transcription factor activator | 15826952
VP2 foot-and-mouth disease virus | 2537470
VP1 foot-and-mouth disease virus | 2537470
VP3 foot-and-mouth disease virus | 2537470
VPg AND “potato virus” | 18533220, 19800647
VP4 protein | 2537470
Vpr AND "Bacillus cereus" | 19383694
(VSVP OR vesicular stomatitis virus phosphoprotein) | 22789567
(WASP interacting protein OR WIP(C)) | 23060071, 23870269
(WASP OR Wiskott-Aldrich syndrome) AND protein AND (C-terminus or C-terminal) | 10724160, 15235593, 18650809, 19260013, 20536449, 21875562
WDR46 | 23848194 19858290
Werner syndrome ATP-dependent helicase | 16339893
west nile virus AND (capsid protein C OR polyprotein) | 18033802
WSK3 | 18948596
Xanthine AND (dehydrogenase OR oxidase) | 11005854
XO lethal protein 1 | 12672694
XPA AND DNA | 11344324
yacG | 12211008
(YB-1 OR “Y Box binding protein 1” OR YBX1) | 22590640, 24217978


Table B1 (Continued)

Search term | PubMed IDs that use IDP terminology
Yck2 kinase | 21653825
YefM | 14672926, 15980067, 18793646
Yersinia crystallin | 15536081, 16470515
YopE | 18502763
(Yorkie homolog OR YAP1) | 20123905, 20368466
(zetacyt OR T-cell surface glycoprotein CD3 zeta chain) | 18311971
FYVE domain containing protein 9 | 15231848
Zinc finger protein 593 | 18287285
Zinc finger protein Eos | 15491138
ZipA | 12107152, 23660966


Appendix C: IDP Search Terms, Search Results, and Disorder Scores

Table C1 IDP Search Terms, Search Results, and Disorder Scores

UniProt | RAPID score(s) | DisProt | Search Term | num IDP PMIDs | num all PMIDs
Q95V77 | 97.2 | DP00186 | Aavlea1 | 2 | 6
P00519 | 30.8 | x | ABL tyrosine kinase | 2 | 9588
Q08655 | 90.43 | DP00531 | (“Abscisic acid stress ripening” OR “Abscisic acid-, stress-, and ripening-induced“) | 2 | 11
P07342 | 12.95 | DP00398 | Acetolactate synthase catalytic subunit AND mitochondria* | 1 | 3
P07140 | 2.62 | DP00346 | Acetylcholinesterase | 1 | 22467
P22303 | 3.42 | x | acetylcholinesterase variant AND AChE-R | 2 | 60
Q00955 | 4.93 | DP00557 | Acetyl-CoA carboxylase | 1 | 3789
P15891 | 58.11 | DP00634 | ("actin-binding protein") | 3 | 2248
Q9Y6Q9 | 36.94 | x | ACTR | 13 | 283
Q9YBQ2 | 4.47 | DP00248 | Acylamino-acid-releasing enzyme | 1 | 115
P0A6A8 | 55.13 | DP00416 | Acyl carrier protein AND e. coli | 1 | 1139
P55336 | 3.69 | x | (“acyl carrier protein” AND “Vibrio harveyi“) | 4 | 32
Q97ZL0 | 15.84 | DP00513 | acylphosphatase AND Sulfolobus solfataricus | 2 | 17
E2IJZ8 | 2.58 | x | Ad41 | 1 | 132
P46108 | 28.29 | DP00748, DP00748_A002 | Adapter molecule crk | 2 | 27
P25054 | 37.88 | DP00519 | Adenomatous polyposis coli protein | 1 | 2839
P03255 | 43.94 | x | adenovirus early region 1A | 1 | 249
P0DKX7 | 6.74 | DP00591 | (“adenylate cyclase” OR CyaA) AND bordetella pertussis | 10 | 1798
Q9UJY5 | 29.11 | DP00314 | ADP-ribosylation factor-binding protein GGA1 | 3 | 17
P07248 | 9.07 | DP00077 | ADR1 regulatory protein | 2 | 52
P42568 | 52.82 | x | AF9 | 1 | 286
P16112 | 23.85 | x | aggrecan AND CS-attachment region AND chondroitin sulfate | 1 | 5
O95994 | 18.86 | x | AGR2 | 1 | 159


Table C1 (Continued)

UniProt | RAPID score(s) | DisProt | Search Term | num IDP PMIDs | num all PMIDs
P24588 | 60.42 | x | (AKAP79 OR AKAP5) | 1 | 230
Q02952 | 60.44 | x | (AKAP250 OR AKAP12 OR AKAP gravin) | 2 | 186
Q8LBP4 | 17.97 | DP00662 | (Alb3 OR A3CT OR ALBINO3) | 1 | 91
P05091 | 11.03 | DP00383 | Aldehyde dehydrogenase AND mitochondria* | 1 | 1226
P53954 | 0.73 | DP00618 | ALG11 | 5 | 19
P77072 | 8.96 | DP00575 | Alkylmercury lyase | 1 | 29
P78318 | 35.4 | x | alpha4 | 1 | 3804
P35611 | 33.65 | DP00240 | Alpha-adducin | 1 | 583
P02959 | 37.1 | x | alpha/beta AND (SASP OR small acid soluble) AND protein | 1 | 103
P02489 | 10.4 | DP00444 | Alpha-crystallin A chain | 2 | 240
P02511 | 18.86 | DP00445 | Alpha-crystallin B chain | 2 | 700
P09983 | 9.48 | DP00389 | (alpha hemolysin OR HlyA) AND Escherichia coli | 2 | 1085
P00709 | 16.9 | x | alpha lactalbumin AND HAMLET | 1 | 77
P02662 | 32.71 | DP00330 | (alpha S1 OR alpha s) AND casein | 6 | 625
P02549 | 17.98 | x | alpha spectrin AND (N-terminal or N-terminus) | 2 | 212
P37840 | 43.57 | DP00070 | (“alpha synuclein”) | 218 | 6160
P09493 | 57.39 | x | alpha tropomyosin AND (N-terminal OR N-terminus) | 4 | 324
Q71U36 | 9.76 | x | alpha Tubulin | 2 | 25523
Q86925 | 8.28 | x | alphavirus capsid protease AND (C-terminal or C-terminus) | 1 | 12
Q9NP70 | 38.03 | x | ameloblastin | 3 | 217
P63277, P45561 | 48.57, 53.97 | DP00692, DP00692_A001, DP00692_A002, DP00693 | amelogenin | 7 | 1577
P0AG16 | 6.73 | DP00578 | Amidophosphoribosyltransferase | 1 | 237
P05067 | 18.57 | x | (amyloid beta OR Abeta OR APP OR “amyloid precursor protein”) | 45 | 47039
Q99BV0 | 2.28 | x | Andes virus Gn tail | 1 | 13
P10275 | 21.65 | DP00492 | androgen receptor AND NH2 | 4 | 87
Q01484 | 34.24 | DP00467 | Ankyrin-2 | 1 | 21
P49913 | 11.18 | DP00004, DP00004_C002 | Antibacterial protein LL-37 | 1 | 223
P03045 | 67.29 | DP00005 | Antitermination protein N | 1 | 230
Q9BP37 | 10.23 | DP00714 | AP7 AND nacre | 3 | 8
P84092 | 5.75 | DP00455 | AP-2 complex subunit mu | 1 | 67
B0FRH7 | 100.0 | DP00544 | (Aplysia nucleolar protein OR ApLLP) | 1 | 5


Table C1 (Continued)

UniProt | RAPID score(s) | DisProt | Search Term | num IDP PMIDs | num all PMIDs
P02647 | 20.97 | DP00386, DP00386_C001 | apolipoprotein A-I | 2 | 9783
P02649 | 25.87 | DP00355 | Apolipoprotein E | 2 | 19057
P02628 | 32.41 | DP00550 | (apo-parvalbumin OR (“apo form” AND parvalbumin)) | 2 | 17
P93025 | 26.23 | x | Arabidopsis phototropin 2 | 1 | 112
x | x | x | aragonite associated biomineralization proteins | 1 | 35
Q9BP38 | 14.04 | DP00715 | Aragonite protein AP24 | 2 | 3
P42684 | 21.91 | x | ARG tyrosine kinase | 1 | 88
P32120 | 18.1 | x | arrestin2 | 1 | 330
P35869, Q9WTL8 | 10.73, 25.0 | DP00381, DP00735_A004 | Aryl hydrocarbon receptor | 2 | 6732
Q13625 | 35.02 | x | ASPP2 | 1 | 83
P54252 | 37.09 | DP00576 | ataxin-3 | 2 | 355
Q9SQZ9 | 14.77 | DP00434 | At3g04780 | 1 | 1
Q06628 | 34.82 | x | Atg13 | 2 | 95
Q12092 | 33.8 | x | Atg29 | 1 | 16
P40344 | 21.94 | x | Atg3 | 1 | 142
P53104 | 17.5 | x | Atg1 | 1 | 235
P02721 | 31.48 | DP00201 | ATP synthase-coupling factor 6 AND mitochondria* | 1 | 4
P01160 | 29.41 | DP00747, DP00747_C002 | Atrial natriuretic peptide | 4 | 18856
Q27974 | 25.05 | DP00351 | auxilin phosphatase | 1 | 21
O15169 | 31.32 | x | (axin OR axin1) | 3 | 1100
P05408 | 39.15 | x | 7B2 | 1 | 288
P37957 | 2.36 | x | bacillus lipase | 1 | 459
P07740 | 9.58 | x | bacterial luciferase AND mobile loop | 4 | 4
P26747 | 8.37 | DP00268 | bacteriophage p22 coat protein | 1 | 113
P16009 | 8.17 | DP00284 | bacteriophage T4 tail lysozyme | 1 | 15
O95817 | 51.83 | x | Bag3 | 1 | 219
P46379 | 27.92 | x | Bag6 | 1 | 81
x | x | x | (basic region leucine zipper OR bZIP) | 1 | 2952
P10163 | 100.0 | DP00119 | (Basic salivary proline-rich protein 4 OR IB5 salivary protein) | 8 | 42
P80723 | 100.0 | x | BASP1 | 2 | 98
O50835 | 37.85 | DP00354 | BBK32 | 1 | 39
P11912 | 4.87 | DP00502 | B cell receptor AND Igalpha | 4 | 106
P40259 | 10.92 | DP00503 | B cell receptor AND Igbeta | 4 | 101


Table C1 (Continued)

UniProt | RAPID score(s) | DisProt | Search Term | num IDP PMIDs | num all PMIDs
P10415 | 17.15 | DP00297 | Bcl-2 AND apoptosis | 2 | 37605
Q61337 | 63.73 | DP00563 | bcl2 antagonist of cell death | 2 | 1472
O43521, O43521-2 | 35.86, 57.97 | DP00643, DP00643_A002 | (bcl-2-like protein 11 OR BCL2L11) | 1 | 1327
Q91ZE9 | 34.59 | DP00645 | Bcl-2-modifying factor | 1 | 22
Q07817 | 15.88 | DP00298 | (Bcl-xL OR "Bcl-2-like protein 1") AND human | 3 | 4497
P53563 | 26.18 | DP00449 | (Bcl-xL OR "Bcl-2-like protein 1") AND rat | 1 | 926
P23560 | 13.77 | x | (BDNF OR "brain derived neurotrophic factor") | 1 | 14751
Q14457 | 18.22 | x | BECN1 | 1 | 1231
P35612 | 38.98 | DP00241 | Beta-adducin | 1 | 583
P17870 | 20.33 | DP00390 | Beta-arrestin-1 | 2 | 1906
P02666, P05814 | 30.36, 25.66 | DP00329, DP00199 | Beta casein | 15 | 12216
P46170 | 13.16 | DP00209 | Beta-defensin 12 | 1 | 338
Q62165 | 27.21 | DP00491, DP00491_C002 | beta dystroglycan | 2 | 1363
A0A0E1V051, Q2SHN6 | 3.85, 13.78 | x | Betagamma-crystallin | 5 | 75
P02754 | 15.73 | DP00193 | Beta-lactoglobulin | 1 | 8488
P11277 | 17.36 | x | beta spectrin AND (C-terminal or C-terminus) | 2 | 248
Q16143 | 56.72 | DP00555 | (“beta synuclein”) | 9 | 1276
P17120 | 25.08 | DP00636 | BimC | 1 | 81
O54918 | 44.9 | DP00518 | BimL | 3 | 47
P0ABE0 | 25.64 | DP00415 | Biotin carboxyl carrier protein of acetyl-CoA carboxylase | 1 | 146
P96884, P06709 | 8.65, 5.61 | DP00695, DP00349 | BirA bifunctional protein | 4 | 18
P9WIC8 | 12.85 | DP00295 | 2,3-bisphosphoglycerate-dependent phosphoglycerate mutase | 1 | 9
Q2YPE7 | 14.61 | x | (BMFP OR Brucella abortus MFP ) | 1 | 4
Q9FUM5 | 69.23 | DP00216 | BN28a | 1 | 2
P21815 | 60.25 | DP00332 | Bone sialoprotein 2 | 1 | 971
Q53HL2 | 39.29 | x | Borealin | 1 | 79
P06748 | 53.4 | x | (B23 OR Nucleophosmin OR NPM1) AND RNA | 3 | 614
Q00496 | 1.36 | DP00732_C001 | Botulinum neurotoxin | 1 | 12550
A0JNH0 | 57.54 | x | bovine dentin phosphophoryn | 2 | 38
P19711 | 0.65 | DP00675_C002 | (bovine viral diarrhea virus core OR (BVDV AND bovine)) | 2 | 1937


Table C1 (Continued)

UniProt | RAPID score(s) | DisProt | Search Term | num IDP PMIDs | num all PMIDs
P19711 | 0.65 | DP00675 | bovine viral diarrhea virus poly protein | 1 | 17
Q9SIN5 | 43.97 | DP00743 | (15b protein OR COR15b) | 2 | 350
D0V3W5 | 37.94 | x | BQP35 | 1 | 1
P01042 | 22.67 | x | bradykinin OR kininogen domain 5 | 3 | 16948
P38398 | 38.06 | DP00238 | BRCA1 | 3 | 11054
Q9XTN4 | 40.34 | x | (Brinker OR apo-brinker) | 1 | 975
P81019 | 21.31 | DP00669 | BSP-30 | 2 | 36
P00864 | 12.46 | x | (BTPC OR bacterial type phosphoenolypyruvate carboxylase) | 4 | 25
G2WDT1 | 80.08 | x | Bud13p | 1 | 8
P04873 | 13.19 | x | bunyavirus nucleocapsid protein | 1 | 127
Q9SG86 | 19.56 | x | bZIP28 | 1 | 25
P09803 | 10.86 | DP00159 | Cadherin-1 | 1 | 17095
Q9JP55 | 20.55 | x | CagA AND (C-terminal OR C-terminus) | 3 | 64
P63098 | 8.82 | DP00565 | Calcineurin AND (subunit B type 1 OR regulatory subunit) | 6 | 296
P01263 | 31.62 | x | calcitonin AND salmon | 1 | 1841
P28583 | 15.55 | DP00561 | calcium-dependent protein kinase SK5 | 1 | 1
Q9CXW3 | 33.62 | DP00226 | Calcyclin-binding protein | 1 | 59
P12957 | 91.7 | DP00120 | caldesmon | 1 | 5222
P62152 | 36.24 | DP00344 | Calmodulin | 4 | 38337
P40136 | 21.5 | DP00395 | Calmodulin-sensitive adenylate cyclase | 1 | 94
P20810 | 78.39 | DP00196 | Calpastatin | 8 | 1258
P51911 | 26.94 | x | calponin AND (“regulatory region” OR (N-terminus OR N-terminal)) | 2 | 171
P27797 | 35.01 | DP00333 | Calreticulin | 1 | 2412
Q9QXT8 | 21.48 | DP00291 | Calsenilin | 1 | 395
P07221 | 19.75 | DP00132 | Calsequestrin | 5 | 1038
P61926 | 48.68 | DP00015 | cAMP-dependent protein kinase inhibitor alpha | 2 | 1334
P00514 | 16.05 | DP00245 | cAMP-dependent protein kinase type I-alpha regulatory subunit | 2 | 38
P50477 | 10.11 | DP00436 | Canavalin | 1 | 55
P40881 | 6.07 | DP00110 | Carbonic anhydrase | 1 | 10911
P31896 | 5.48 | DP00239 | Carbon monoxide dehydrogenase | 1 | 444
Q3AB29 | 4.15 | x | (carbon monoxide oxidation activator OR CooA) | 1 | 110


Table C1 (Continued)

UniProt | RAPID score(s) | DisProt | Search Term | num IDP PMIDs | num all PMIDs
P43155 | 6.55 | DP00305 | Carnitine O-acetyltransferase | 1 | 373
P38582 | 9.01 | DP00380 | carnobacteriocin-B2 immunity protein | 1 | 10
Q8WXD9 | 43.12 | x | (CASK-interactive protein1 OR Caskin-1) | 1 | 2
Q02248 | 16.39 | DP00341 | Catenin beta-1 | 2 | 122
Q9UBX1 | 6.82 | x | cathepsin F | 1 | 59
Q9Y3M2 | 30.95 | DP00709 | (Cby OR Chibby) | 1 | 257
P62552 | 25.0 | x | CcdA AND antitoxin AND (C-terminal OR C-terminus) | 3 | 4
Q9Y258 | 8.51 | DP00696 | ccl26 | 1 | 216
C1K010 | 69.0 | x | C4 Cotton leaf curl Kokhran virus | 1 | 2
P49585 | 29.7 | x | (CCT OR phosphocholine cytidylyltransferase) AND alpha | 1 | 325
Q9Y5K3 | 31.44 | x | (CCT OR phosphocholine cytidylyltransferase) AND beta | 1 | 328
O14519 | 33.91 | x | CDK2AP1 | 1 | 62
Q49AH0 | 12.3 | x | CDNF | 1 | 39
Q15517 | 52.36 | DP00706 | (CDSN OR corneodesmosin) | 1 | 141
P35638 | 73.37 | DP00624 | (“C/EBP homologous protein”) | 2 | 2273
G0RVK1 | 19.07 | x | Cel7a | 1 | 163
E1WAC6 | 23.8 | DP00157 | Cell invasion protein sipA | 1 | 54
P29762 | 11.68 | x | Cellular retinoic acid-binding protein 1 | 1 | 551
Q03701 | 27.8 | x | (C/EPB OR CCAAT enhancer binding protein) | 1 | 9824
Q12798 | 44.19 | x | CETN1 | 1 | 9
P01100 | 44.21 | DP00078 | (c-Fos AND (C-terminal OR C-terminus OR “activation domain”)) | 4 | 832
Q9VZI1 | 18.09 | x | Chd64 | 1 | 3
P07363 | 7.65 | DP00407 | Chemotaxis protein cheA | 1 | 421
Q56311 | 5.96 | DP00350 | Chemotaxis protein cheW | 2 | 244
Q9Y3E7 | 53.15 | x | CHMP3 | 1 | 44
P01555 | 6.2 | DP00250 | cholera enterotoxin subunit A | 2 | 51
P01233 | 23.03 | DP00013 | Choriogonadotropin subunit beta | 1 | 4569
Q57696 | 23.23 | DP00465 | chorismate mutase AroQ | 2 | 25
P05059 | 67.48 | DP00118 | Chromogranin-A | 2 | 4460
P40019 | 88.24 | x | Chz1 | 1 | 11
O08785 | 23.39 | DP00734 | Circadian locomoter output cycles protein kaput | 1 | 3
Q99967 | 39.63 | DP00356 | (CITED2 OR CIT-ED2 OR Cbp/p300-interacting transactivator 2) | 1 | 169
P05412 | 33.23 | x | cJun | 1 | 398
O75122 | 17.23 | x | (CLASP2 OR cytoplasmic linker associated protein 2) | 1 | 291
Q05140 | 36.72 | DP00225 | Clathrin coat assembly protein AP180 | 1 | 174
Q68FX6 | 55.13 | x | (CLPH OR casein like phosphoprotein) | 1 | 770
P05371 | 25.28 | DP00014 | clusterin | 1 | 2078
O75444 | 23.06 | x | c-Maf | 1 | 397
Q708E1 | 45.42 | x | C-myb AND “transactivation domain” | 4 | 210
Q14896 | 15.93 | x | (CMyBP-C OR cardiac myosin binding protein) | 2 | 2450
P01106 | 42.14 | DP00260 | c-myc AND oncoprotein | 9 | 12286
P62633 | 11.3 | x | CNBP | 1 | 166
Q02388 | 37.02 | x | Col7 | 1 | 22
P09883 | 36.94 | DP00342 | (“colicin E9” OR colE9) | 12 | 111
P08083 | 21.71 | DP00461 | ("colicin N") | 13 | 51
Q53654 | 49.11 | DP00098 | Collagen adhesin | 2 | 397
P02458 | 57.7 | DP00274 | Collagen alpha-1(II) chain | 1 | 367
O14810 | 90.3 | x | complexin | 1 | 238
P17302 | 20.16 | x | (connexin43 OR “connexin 43” OR cx43) AND cytoplasmic | 4 | 728
P36383 | 30.56 | x | (connexin45 OR Cx45) | 1 | 461
P08034 | 10.95 | x | (connexin32 OR Cx32) | 1 | 986
P36382 | 20.67 | x | (connexin40 OR Cx40CT) | 1 | 239
Q7Z094 | 15.22 | x | Conotoxin iota-RXIA | 1 | 3
Q6PJW8 | 34.48 | x | Consortin | 1 | 2
x | x | x | Convicilins | 1 | 4
Q04656 | 7.0 | DP00282 | Copper-transporting ATPase 1 | 2 | 107
Q42512 | 45.32 | DP00536 | COR15A | 2 | 83
Q53SF7 | 45.18 | x | (cordon-bleu OR Cobl) | 1 | 57
Q9NR00 | 30.19 | DP00372 | (C8orf4 OR “Thyroid Cancer 1” OR TC-1) | 5 | 21820
Q12287 | 47.83 | DP00277 | Cox17 | 3 | 103
Q9LZP9, A6Q0K5 | 47.33, 24.3 | DP00534, DP00359 | CP12 AND protein AND calvin | 14 | 39
Q9HC77 | 48.95 | x | (CPAP OR CENPJ) AND PN2-3 | 1 | 2
P00736 | 3.26 | DP00621 | c1R AND CUB2 domain | 1 | 8
P45481 | 32.28 | DP00348 | (CREB-binding protein OR CBP) | 31 | 12497
P15337 | 33.72 | DP00682, DP00080 | (CREB-1 OR Cyclic AMP-responsive element-binding protein 1) | 10 | 5167
Q91WR5 | 4.33 | DP00690 | C21 reductase | 1 | 215
Q49AN0 | 17.37 | DP00473 | cryptochrome 2 | 2 | 548
Q43125 | 18.94 | DP00474 | cryptochrome 1 | 1 | 665
B2CY49 | 50.0 | x | CsGA | 1 | 246
Q548S0 | 11.26 | x | CsGB | 1 | 49
P41240 | 4.89 | x | c-Src kinase AND (N-terminal or N-terminus) | 2 | 73
Q76LA1 | 28.57 | DP00511 | CSTB protein | 1 | 245
Q9Z2F5 | 12.79 | DP00499 | CtBP AND (C-terminal or C-terminus) | 1 | 269
Q6Q4B0 | 18.81 | x | cucumber mosaic virus cucumovirus coat | 1 | 141
O94316 | 8.74 | x | cwf10 | 1 | 1
P14635 | 26.79 | DP00223 | cyclin B AND G2 | 1 | 1466
Q64364, Q8N726 | 27.81, 69.7 | DP00335, DP00336 | Cyclin-dependent kinase inhibitor 2A | 2 | 6615
P46527 | 54.55 | DP00018 | (Cyclin-dependent kinase inhibitor 1B OR p27(Kip1) OR p27) | 21 | 10136
P24941 | 5.7 | x | cyclin E LMW | 1 | 39
P51946 | 9.6 | DP00307 | Cyclin-H | 1 | 210
Q05158 | 24.23 | DP00438 | Cysteine AND glycine-rich protein 2 | 1 | 125
P13569 | 1.89 | DP00012 | (Cystic fibrosis transmembrane conductance regulator OR CTFR) | 9 | 8214
P20070 | 4.65 | x | ("cytochrome b5") | 1 | 2494
P08067 | 11.63 | DP00687 | Cytochrome b-c1 complex subunit Rieske AND mitochondria* | 1 | 4
P00004 | 54.29 | DP00006 | ("cytochrome c") | 1 | 38417
Q14061 | 52.38 | DP00543 | Cytochrome c oxidase copper chaperone | 1 | 93
P0ACN7 | 11.73 | x | CytR | 1 | 101
Q12248 | 24.47 | x | Dad1 | 1 | 123
Q9UD71 | 85.78 | x | DARPP-32 AND PP1 | 2 | 52
Q9UER7, O35613 | 40.81, 38.29 | DP00707, DP00708 | (DAXX OR Death domain-associated protein 6) | 3 | 471
P20449, Q09747 | 13.9, 16.7 | x | Dbp5p | 2 | 17
Q96C86 | 10.09 | x | DcpS | 1 | 200
Q08949 | 22.55 | x | ddc1 | 1 | 66
O75398 | 24.96 | x | (DEAF1 OR DEAF-1) | 2 | 76


Q24298 | 5.37 | DP00269 | DE-cadherin | 2 | 132
P21793 | 6.11 | DP00489 | Decorin | 1 | 2244
P56552 | 25.93 | DP00582 | Defensin-like protein | 1 | 188
P28311 | 44.57 | DP00388 | Defensin-related cryptdin-4 | 1 | 3
E2GK57 | 95.51 | x | Dehydrin AND (TsDHN-1 OR TsDHN-2) | 3 | 5
P42758 | 100.0 | DP00658 | Dehydrin AND (Xero2 OR Lti30) | 2 | 10
Q6L8H6 | 97.48 | x | Dehydrin CIDhn | 1 | 2
P31168 | 83.77 | DP00657 | Dehydrin COR47 | 2 | 9
A1IVM4 | 100.0 | x | Dehydrin dhn13 | 1 | 1
P12950 | 83.93 | DP00530 | Dehydrin DHN1 | 4 | 14
P42763 | 100.0 | DP00667 | Dehydrin ERD14 | 2 | 7
P42759 | 93.85 | DP00606 | Dehydrin ERD10 | 2 | 7
Q4VT48 | 100.0 | x | Dehydrin K2 | 2 | 3
Q39805 | 59.73 | DP00170 | Dehydrin-like protein | 1 | 37
F5CAF0 | 100.0 | x | Dehydrin MpDhn12 | 1 | 1
P30185 | 74.73 | DP00689 | Dehydrin Rab18 | 1 | 5
Q3ZNL4 | 100.0 | x | Dehydrin YSK2 | 1 | 4
Q9LQT8 | 6.94 | DP00724 | DELLA protein GAI | 2 | 79
Q08495 | 55.56 | x | Dematin | 2 | 54
Q9NZW4 | 62.64 | x | (“Dentin phosphoprotein”) | 2 | 622
P16006 | 5.7 | DP00583 | deoxycytidylate deaminase | 1 | 191
P06968 | 7.95 | DP00337 | Deoxyuridine 5'-triphosphate nucleotidohydrolase | 2 | 30
O16042 | 83.7 | x | DF31 | 1 | 7
P49869 | 25.91 | DP00594, DP00594_A002 | (dHR38 OR hormone receptor 38) AND drosophila | 1 | 43
P0ABQ4 | 12.58 | DP00301 | Dihydrofolate reductase | 2 | 6752
O66990 | 3.55 | x | (dihydroorotase OR DHO) | 1 | 693
Q13062 | 9.84 | x | dihydropyridine receptor AND (alpha1 OR II-III) | 3 | 477
Q59471 | 26.79 | DP00105 | Diol dehydrase beta subunit | 1 | 10
Q59472 | 30.06 | DP00106 | Diol dehydrase gamma subunit | 1 | 7
H2I233 | 11.5 | DP00374 | Diphtheria toxin repressor | 1 | 106
Q9NRI5 | 29.16 | x | DISC1 | 1 | 559
P37471 | 36.0 | x | DivlC | 1 | 2
P0A6R3 | 16.33 | DP00422 | DNA-binding protein fis | 4 | 409
P08775 | 16.85 | DP00181 | DNA-directed RNA polymerase II subunit RPB1 | 1 | 165
P52434 | 14.0 | DP00504 | DNA-directed RNA polymerase subunit RPABC3 | 1 | 4963


O00273 | 18.13 | DP00173 | DNA fragmentation factor subunit alpha | 1 | 93
P03018 | 8.19 | DP00684 | DNA helicase II | 2 | 12888
P0ABS1 | 34.44 | DP00414 | DnaK suppressor protein | 1 | 44
Q837V6 | 8.73 | DP00326 | DNA ligase AND (Enterococcus faecalis OR Streptococcus faecalis) | 1 | 77
P27695 | 36.79 | DP00007 | DNA lyase AND (apurinic OR apyrimidinic) | 2 | 1842
P23025 | 30.77 | DP00243 | DNA-repair protein complementing XP-A cells | 2 | 9
Q13426 | 27.38 | DP00152 | DNA repair protein XRCC4 | 3 | 425
P11387 | 48.63 | DP00075 | DNA topoisomerase 1 AND human | 12 | 3961
P06786 | 19.4 | DP00076 | DNA topoisomerase 2 AND yeast | 3 | 292
Q9VPU8 | 45.8 | DP00540 | Dribble nucleolar protein | 1 | 1
x | x | x | DRM1 | 2 | 23
P60896 | 84.29 | DP00617 | Dss1 | 2 | 84
P47980 | 33.94 | x | dTIS11 | 1 | 6
C9WHS9 | 5.18 | x | duffy binding protein II | 1 | 96
Q13561 | 25.69 | x | dynamitin | 1 | 131
O14576, Q24246 | 25.43, 19.0 | DP00360, DP00605 | (dynein intermediate chain AND (N-terminal OR N-terminus)) | 6 | 37
P03129 | 36.73 | DP00024 | (E7 AND HPV) | 15 | 3366
P06463 | 12.66 | x | (E6 AND HPV) | 2 | 3507
Q07108 | 13.07 | DP00306 | Early activation antigen CD69 | 1 | 1717
P03265 | 34.97 | DP00003 | Early E2A DNA-binding protein | 3 | 193
P03243 | 14.31 | x | E1B-55K | 1 | 118
P12978 | 54.41 | x | EBNA2 | 1 | 497
P03211 | 50.55 | x | EBNA1 | 1 | 724
A3KLJ6 | 59.29 | x | 4E-BP | 2 | 219
Q53630 | 53.7 | x | Ebps | 1 | 581
F2YRV4 | 100.0 | x | E1B-93R | 1 | 2
P12830 | 14.29 | x | E-cadherin | 1 | 21477
P34021 | 14.58 | x | (Ecdysteroid receptor OR EcR OR Ecdysone receptor) AND Drosophila melanogaster | 4 | 279
Q92838, Q92838-3 | 35.04, 35.22 | DP00460, DP00460_A003 | Ectodysplasin-A | 1 | 583
Q92611 | 4.57 | x | EDEM1 | 1 | 83
x | x | x | E2 enzymes 3R | 1 | 13
Q01094 | 30.21 | x | E2F1 | 1 | 2720
Q7Z2Z2 | 20.0 | x | EFL1 | 1 | 9


P01051 | 11.43 | x | eglin | 1 | 543
C4LXF1 | 6.11 | x | (EhPCTP-L OR Phosphatidylcholine transfer protein) AND E. histolytica | 1 | 1
P17382 | 9.54 | x | E1 HPV | 1 | 555
P32501 | 16.99 | x | (eIF-2B OR eIF2beta) | 1 | 509
O15991 | 6.47 | x | Eisenia lombricine kinase | 1 | 6
Q8KMU4 | 36.36 | x | EJ97 | 1 | 7
P13551 | 18.23 | DP00021 | Elongation factor G | 2 | 691
P17639 | 100.0 | DP00022 | EMB-1 protein | 1 | 5
Q8I1K3 | 21.85 | x | EMGAM56 | 1 | 1
O00423 | 10.31 | x | EMI1 | 1 | 90
Q9Y6C2 | 19.49 | DP00569 | EMILIN1 protein | 1 | 26
P50465 | 5.7 | DP00375 | Endonuclease VIII | 1 | 730
O43768 | 93.39 | x | (ENSA OR endosulfine alpha OR endosulfine-alpha) | 1 | 186
Q47785 | 10.68 | DP00289 | EntI protein | 2 | 8
P51671 | 25.77 | DP00641 | Eotaxin | 2 | 2776
P00533 | 4.79 | DP00309 | (Epidermal Growth Factor Receptor OR EGFR) AND (juxtamembrane domain OR kinase domain) | 3 | 4100
O88339 | 45.39 | DP00251 | Epsin-1 | 1 | 237
x | x | x | equine lysozyme AND ELOA | 1 | 6
P41161 | 33.33 | x | ERM AND PEA3 | 1 | 72
x | x | x | Er_P1 | 1 | 1
Q7DB66 | 10.0 | DP00290 | EscJ | 1 | 17
P86155 | 32.61 | x | esculentin | 1 | 67
Q8X482 | 60.42 | x | EspF(U) | 1 | 89
P14061 | 11.89 | DP00023 | Estradiol 17-beta-dehydrogenase 1 | 8 | 916
P03372 | 11.43 | DP00074 | Estrogen receptor AND alpha AND (AF1 domain OR N-terminal OR DBD) | 5 | 254
Q92731 | 15.47 | DP00079 | Estrogen receptor AND beta AND (N-terminal OR N-terminus) | 2 | 111
P14921 | 15.42 | x | (Ets-1 OR Ets1) AND (SRR OR “serine rich region”) | 3 | 10
P39935 | 45.27 | DP00082 | Eukaryotic initiation factor 4F subunit p150 | 1 | 2
O74718 | 28.7 | DP00396 | Eukaryotic peptide chain release factor GTP-binding subunit | 1 | 7
P62495 | 5.72 | DP00310 | Eukaryotic peptide chain release factor subunit 1 | 1 | 26


Q13541 | 61.86 | DP00028 | Eukaryotic translation initiation factor 4E-binding protein 1 | 2 | 365
Q04637 | 33.83 | DP00406 | Eukaryotic translation initiation factor 4 gamma 1 | 1 | 303
Q01844 | 41.92 | DP00632 | Ewings Sarcoma AND oncogene AND (EWS OR EWS-Fli1) | 6 | 484
Q9ES89 | 1.82 | DP00397 | Exostosin-like 2 | 1 | 4
Q9I322 | 49.38 | x | ExsE | 1 | 21
x | x | DP00270 | Fab DNA-1 | 2 | 10
Q8IRG6 | 28.16 | DP00721 | FACT AND spt16 | 3 | 70
Q05344 | 46.89 | DP00720 | FACT AND Ssrp1 | 3 | 75
P02693 | 13.64 | DP00263 | Fatty acid-binding protein AND intestinal | 1 | 1108
Q9ZPI5 | 4.41 | DP00654 | Fatty acid multifunctional protein | 1 | 605
P12318 | 11.04 | x | Fc receptor gamma chain | 1 | 1286
P24071 | 4.88 | DP00311 | Fc receptor immunoglobulin alpha | 1 | 4960
P42512 | 9.31 | DP00176 | Fe(3+)-pyochelin receptor | 1 | 6
P59113 | 15.07 | DP00655 | Fermitin AND mouse | 1 | 5
P48632 | 4.29 | DP00183 | Ferripyoverdine receptor | 1 | 23
P00250 | 14.43 | x | 2Fe-2S ferredoxin | 1 | 223
Q99689 | 40.31 | x | (FEZ1 OR fasciculation AND elongation protein Zeta 1) | 1 | 32
P14448 | 20.78 | DP00233 | fibrinogen AND (alpha chain OR αIIbβ3) | 5 | 1408
P02679 | 10.6 | x | fibrinogen AND (gammaC peptide OR gamma subunit) | 1 | 153
Q02020 | 10.8 | DP00234 | Fibrinogen beta chain | 1 | 967
O93568 | 4.37 | DP00235 | Fibrinogen gamma chain | 1 | 311
P14738, Q06556 | 39.29, 34.19 | DP00025, DP00127 | (“fibronectin binding protein A” OR FnBPA) | 2 | 406
P17838 | 12.1 | DP00206 | Fimbrial protein | 1 | 1793
P45976 | 40.06 | DP00625 | Fip1 | 1 | 99
P06179 | 11.11 | DP00026 | Flagellin | 1 | 3902
P00175 | 9.98 | x | flavocytochrome beta 2 | 1 | 46
P36888 | 4.83 | DP00312 | FL cytokine receptor | 1 | 973
P16437 | 18.84 | x | FlgB AND salmonella | 1 | 24
P0A1I7 | 11.19 | x | FlgC AND salmonella | 1 | 10
P16323 | 29.08 | x | FlgF AND salmonella | 1 | 13
P0A1J3 | 19.23 | x | FlgG AND salmonella | 1 | 20
O66683 | 73.86 | DP00701 | FlgM | 4 | 141
P26462 | 21.15 | x | FliE AND salmonella | 1 | 20


Q9Y613 | 21.48 | DP00448 | formin homology protein OR FHOD1 | 1 | 278
Q6QLQ5 | 10.81 | x | fowlicidin-1 AND cathelicidins | 1 | 7
O43524 | 39.08 | x | FOXO3a | 1 | 1132
Q06787 | 29.43 | DP00134 | Fragile X mental retardation 1 protein | 1 | 2158
Q16595 | 18.1 | DP00607 | Frataxin AND mitochondria* | 1 | 508
P19970 | 40.44 | x | FRQ | 1 | 260
Q9RLG7 | 45.45 | x | Fst AND par toxin | 1 | 10
x | x | x | Fth AND SRP | 1 | 1
Q07867 | 23.93 | x | FtsL | 1 | 82
P10121 | 31.99 | x | FtsY | 1 | 205
P0A9A6 | 12.27 | x | FtsZ | 3 | 1602
Q9UQC2 | 42.16 | x | Gab2 | 1 | 315
Q13480 | 40.63 | x | Gab1 | 2 | 413
Q08605-2 | 40.66 | DP00328 | GAGA transcription factor | 1 | 211
P0CL82 | 86.32 | x | GAGE-12I | 1 | 1
P04386 | 11.58 | x | Gal4 | 1 | 4292
Q9Z0U4 | 4.94 | DP00463 | Gamma-aminobutyric acid type B receptor subunit 1 | 1 | 178
O76070 | 53.54 | DP00630 | (“gamma synuclein”) | 10 | 354
P08050 | 19.9 | DP00278 | Gap junction alpha-1 protein | 4 | 130
Q01231 | 23.18 | DP00646 | Gap junction alpha-5 protein | 1 | 9
Q28181-4 | 73.39 | DP00441 | GARP | 1 | 255
P03069 | 51.6 | DP00083 | General control protein GCN4 | 3 | 104
Q9MAA7 | 3.77 | DP00723 | Gibberellin receptor GID1A | 1 | 31
P58466 | 13.03 | x | (GIP OR SCP1) | 2 | 2919
Q03768 | 36.98 | DP00163 | Gir2 | 2 | 11
P10071 | 33.23 | x | Gli3 | 1 | 681
Q07731 | 29.86 | DP00029 | Glial cell line-derived neurotrophic factor | 2 | 3683
Q7KT70 | 8.68 | x | (gliotactin OR Gli-cyt) AND Drosophila | 1 | 12
P15456 | 7.91 | x | globulin cruciferin B | 1 | 8
P35557 | 5.59 | x | glucokinase | 1 | 3364
P00502 | 12.61 | DP00731 | Glutathione S-transferase alpha-1 | 1 | 176
P10388 | 50.83 | DP00285 | Glutenin subunit DX5 | 1 | 9
P50440 | 7.33 | DP00099 | Glycine amidinotransferase mitochondria* | 2 | 23
P13255 | 2.39 | DP00031 | Glycine N-methyltransferase | 1 | 376


P49841 | 19.29 | DP00385 | Glycogen synthase kinase-3 beta | 3 | 4189
F8WQN2 | 6.77 | x | glycolate oxidase | 1 | 314
P56206 | 6.72 | DP00032 | Glycyl-tRNA synthetase | 2 | 172
Q01417 | 80.35 | DP00664 | GmPM1 | 1 | 2
Q6QCC2 | 17.15 | DP00471 | GNA1870 | 1 | 9
P03661 | 29.01 | DP00034 | G3P attachment protein | 2 | 10
P03712 | 30.91 | x | gpNU3 | 1 | 5
P13404 | 43.78 | x | GRA2 AND “Toxoplasma Gondii” | 2 | 50
x | x | x | GRAS Proteins | 2 | 599
O88900 | 14.68 | DP00490 | Grb14 AND PIR domain | 2 | 9
Q08969 | 89.88 | x | GRE1 | 1 | 14
P0A6F5 | 9.49 | x | GroEL | 4 | 2724
P0A6G1 | 14.43 | DP00412 | GroES | 2 | 1176
P24522 | 20.0 | DP00704 | Growth arrest AND DNA damage-inducible protein GADD45 alpha | 1 | 23
Q14449 | 13.15 | DP00230 | Growth factor receptor-bound protein 14 | 1 | 48
P62993 | 9.68 | DP00210 | Growth factor receptor-bound protein 2 | 1 | 2094
P10912 | 10.34 | DP00033, DP00033_C001 | Growth hormone-binding protein | 3 | 768
P16442 | 4.8 | DP00339 | GTA AND blood | 2 | 78
B0B1U2 | 9.6 | DP00338 | GTB AND blood | 2 | 175
Q13569 | 27.56 | DP00719 | G/T mismatch-specific thymine DNA glycosylase | 5 | 531
P01112 | 3.7 | DP00153 | GTPase HRas | 4 | 2018
Q15382 | 7.07 | DP00364 | GTP-binding protein Rheb | 1 | 289
P04695 | 16.86 | DP00273 | Guanine nucleotide-binding protein Gt | 1 | 242
Q8NDV7 | 41.28 | x | GW182 protein | 1 | 141
Q6Q547 | 81.03 | DP00475 | H/ACA ribonucleoprotein complex subunit 3 | 1 | 5
P0C686 | 20.78 | x | (HBx OR hepatitis B protein x) | 1 | 2716
O60741 | 16.63 | x | HCN1 | 1 | 392
P22121 | 32.2 | DP00036 | Heat shock factor protein AND (Kluyveromyces lactis) | 2 | 15
P10961 | 33.73 | DP00135 | Heat shock factor protein AND (Saccharomyces cerevisiae OR yeast) | 1 | 930
P14602 | 21.05 | DP00142 | Heat shock protein beta-1 | 1 | 2899
D4GYC2 | 25.29 | x | Hef | 1 | 296
Q03281 | 41.93 | x | Heh2 | 1 | 23
P13102 | 0.18 | DP00566 | Hemagglutinin | 2 | 18266


P80960 | 9.09 | x | hemocyanin | 1 | 4981
O89339 | 18.42 | DP00698 | (henipavirus OR hendra virus) AND (nucleoprotein) | 4 | 21
O55778 | 30.83 | DP00700 | (henipavirus OR hendra virus) AND (phosphoprotein) | 3 | 34
P03159 | 4.85 | x | Hepatitis B polymerase | 1 | 6736
P27958 | 3.29 | DP00588_C002 | (“hepatitis C virus” AND “core protein” ) | 3 | 3236
Q69422 | 3.63 | DP00674, DP00674_C001 | Hepatitis GB virus B core protein | 1 | 11
P08581 | 1.22 | DP00317 | Hepatocyte growth factor receptor | 2 | 6193
P06492 | 22.45 | DP00087 | herpes virus AND VP16 | 4 | 723
P09651 | 44.62 | DP00324 | Heterogeneous nuclear ribonucleoprotein A1 | 1 | 545
A0A075C3Q2 | 8.19 | x | (HEV OR hepatitis E) AND polyproline region | 1 | 4
P0A6X3 | 27.45 | x | Hfq AND (Escherichia coli OR E. Coli) AND (C-terminal or C-terminus) | 2 | 23
Q16665 | 25.91 | DP00262 | HIF-1 alpha AND (CAD OR N-TAD OR ODD) | 2 | 105
P30273 | 17.44 | DP00509 | High affinity immunoglobulin epsilon receptor subunit gamma | 1 | 2
P63158 | 80.47 | DP00384 | High mobility group protein B1 | 1 | 2941
P17096 | 100.0 | DP00040 | High mobility group protein HMG-I/HMG-Y | 6 | 69
P07746 | 80.39 | DP00041 | High mobility group-T protein | 1 | 6
P01050 | 32.31 | DP00137 | Hirudin variant-1 | 2 | 32
P06180 | 58.98 | DP00213 | Histone-binding protein N1/N2 | 1 | 12
P10922 | 100.0 | DP00097 | Histone H1.0 | 4 | 213
P68432 | 36.76 | x | Histone H3 | 3 | 36292
P53551 | 92.64 | DP00423 | Histone H1 | 10 | 34935
P02259 | 100.0 | DP00044 | Histone H5 | 1 | 33570
P15865 | 100.0 | DP00136 | histone H1.2 | 2 | 86
P04908 | 37.69 | x | Histone H2A | 2 | 33886
P62805 | 38.83 | x | Histone H4 AND N-terminal domain | 3 | 463
P62807 | 62.7 | x | Histone H2B | 1 | 33905
P12493, P27958 | 32.4, 3.29 | DP00101, DP00101_C006 | HIV AND P6 AND gag | 1 | 312
P69723 | 18.75 | x | HIV-1 AND Vif AND (C-terminal OR C-terminus) | 3 | 74
P12497 | 13.52 | DP00410_C011, DP00410 | (HIV-1 Integrase OR HIV-1 gag OR HIV-1 p24) | 4 | 10585


P04325 | 50.86 | DP00424 | HIV-1 Rev | 3 | 3075
Q1PAB4 | 64.36 | DP00650 | HIV-1 Tat | 11 | 4193
P06939 | 28.57 | x | HIV-2 Vpx | 1 | 145
P09429 | 80.47 | x | HMGB1 | 1 | 3734
P82970 | 100.0 | x | HMGN5 | 1 | 16
Q14978 | 86.98 | x | (hNopp 140 OR Nopp 140 OR human nucleolar phosphoprotein p140) | 2 | 9
Q00839 | 34.18 | x | hnRNP-u | 1 | 106
P23441 | 32.53 | DP00071 | Homeobox protein Nkx-2.1 | 3 | 12
Q99801 | 77.35 | DP00683 | Homeobox protein Nkx-3.1 | 1 | 10
Q58667 | 7.65 | DP00619 | Homoaconitase small subunit | 1 | 5
P20050 | 12.73 | x | Hop1 | 1 | 128
Q86TP1 | 14.35 | x | (h-prune OR human prune protein) | 1 | 101
O95197 | 30.52 | x | hRTN3 | 1 | 2
P38753 | 32.74 | DP00635 | HSE1 | 2 | 31
Q00613 | 32.89 | x | HSF1 AND (C-terminal OR C-terminus) | 2 | 46
P0A6H5 | 19.41 | DP00100 | hsl protease AND ATP | 1 | 5
P22943 | 77.06 | DP00705 | Hsp12 | 4 | 103
Q9UJY1 | 32.65 | x | Hsp22 | 2 | 147
P12810 | 23.84 | DP00677 | HSP16.9 | 1 | 22
Q08914 | 3.38 | x | Hsp33 | 1 | 52
P04792 | 37.07 | x | Hsp27 | 1 | 3274
P02829 | 25.39 | x | hsp90 atpase | 1 | 815
P16860 | 32.09 | DP00551 | (“human cardiac hormone”) AND (N-terminal OR N-terminus) AND “B-type natriuretic peptide” | 2 | 2142
Q02413 | 8.1 | x | human desmoglein 1 | 1 | 672
P49354 | 21.9 | DP00558 | human farnesyltransferase | 1 | 1391
P04150 | 17.63 | DP00030 | human glucocorticoid receptor AND (AF1 OR transcription-activating fragment OR activation function subdomain OR activation domain) | 12 | 457
Q8IVG9 | 66.67 | x | Humanin | 1 | 173
Q6WB98 | 1.67 | x | human metapneumovirus glycoprotein ectodomain | 1 | 3
P15927 | 20.0 | x | human replication protein A | 2 | 981
P12579 | 57.68 | DP00447 | (“human respiratory syncytial virus”) AND (phosphoprotein OR protein) AND (N-terminal OR N-terminus OR C-terminal OR C-terminus) | 2 | 144
Q9NX55 | 72.09 | DP00546 | (Huntingtin-interacting protein 1 OR HIP1) | 1 | 247
Q9NX55 | 72.09 | x | (huntingtin yeast two hybrid protein OR hypk) | 4 | 60
P07242 | 50.74 | x | H5 vaccinia virus | 1 | 32
F6IAY2 | 19.08 | x | (HvNAC013 OR NAC013) | 1 | 2
K9S1M5 | 33.33 | x | HvNAC005 OR NAC005 | 1 | 2
O24646 | 76.19 | DP00469 | HY5 | 1 | 262
P38216, P47009, P53872 | 50.78, 74.04, 84.8 | x | hydrophilin | 1 | 6
Q27796 | 8.6 | DP00045 | Hypoxanthine-guanine phosphoribosyltransferase | 1 | 4495
P22362 | 17.71 | DP00644 | I-309 | 1 | 138
P01094 | 85.29 | DP00586 | (“IA3” OR “IA 3” OR “IA(3)” OR "IA₃") AND (Saccharomyces cerevisiae OR Yeast) | 6 | 248
O14576 | 25.43 | x | IC AND cytoplasmic dynein | 1 | 39
Q9NSA3 | 51.85 | x | ICAT AND transcriptional inhibitor | 1 | 17
P35521 | 45.96 | DP00717 | (Icln OR pICln) | 2 | 106
P41134 | 20.0 | x | Id protein HLH domain | 1 | 264
P03169 | 34.01 | x | IE1 | 1 | 822
P09310 | 28.4 | x | (IE62 OR varicella-zoster virus major transactivator) | 1 | 145
P01096 | 71.56 | x | IF1 AND Bovine | 1 | 49
P73124 | 46.15 | DP00158 | (IF7 OR IF17) AND “glutamine synthetase” | 3 | 8
x | x | x | IgE cepsilonmx | 1 | 4
x | x | DP00710 | IgG heavy chain CH1 | 2 | 62
G2WJG6 | 9.64 | x | (importin-beta OR Kap95p) | 2 | 1310
P50097 | 0.6 | DP00399 | Inosine-5'-monophosphate dehydrogenase | 1 | 303
P24592 | 25.42 | DP00211 | Insulin-like growth factor-binding protein 6 | 1 | 262
P03355 | 22.5 | DP00651, DP00651_C007 | Integrase p46 | 1 | 2
P32455 | 19.43 | DP00313 | Interferon-induced guanylate-binding protein 1 | 1 | 21
P10914 | 44.62 | x | interferon regulatory factor 1 | 3 | 1689
P07476 | 65.81 | DP00221 | Involucrin | 1 | 1510
Q13568 | 30.52 | x | IRF5 | 1 | 364


Q9UQB8 | 33.7 | x | IRSp53 | 1 | 114
P10997 | 7.87 | x | (“islet amyloid polypeptide” OR IAPP) | 13 | 2321
P78504 | 5.17 | DP00418 | Jagged-1 AND human AND cytoplasmic tail | 2 | 4
P29375 | 22.13 | DP00713 | JARID1A | 2 | 32
Q9UGL1 | 17.29 | DP00712 | JARID1B | 2 | 60
Q8TAM6 | 66.2 | x | Juxtanodin | 1 | 6
P02668 | 49.47 | DP00192 | kappa casein | 8 | 10004
Q5JVS0 | 49.88 | x | Ki-1/57 | 2 | 17
P07674 | 39.11 | DP00656 | KorB | 1 | 644
P03023 | 8.61 | DP00433 | Lactose operon repressor | 3 | 949
P02788 | 2.96 | DP00616 | (Lactotransferrin OR lactoferrin) | 2 | 6859
P02545 | 34.79 | DP00716, DP00716_C001 | Lamin A/C | 2 | 1243
A3MYU7 | 21.37 | x | L20 AND ribosomal protein AND (extension OR "amino terminal" OR N-terminal OR N-terminus) | 2 | 10
Q6PBE3 | 27.03 | DP00579 | L5 AND xenopus AND ribosome | 1 | 14
A8CDV5 | 94.07 | DP00538 | Latent membrane protein 2A | 1 | 217
Q9CX60 | 51.43 | DP00661 | (LBH OR limb bud and heart) AND (transcription or transcriptional) | 1 | 82
Q2PEE1 | 15.73 | x | LC8 | 2 | 133
P32004 | 5.33 | DP00666 | (L1-CAM OR Neural cell adhesion molecule L1) | 1 | 1557
I1JLC8 | 100.0 | DP00185 | LEA protein 1 AND soybean | 1 | 16
Q9UJU2 | 51.13 | x | Lef-1 | 2 | 1110
P16535 | 7.76 | DP00345 | Leukotoxin | 1 | 768
E7Q1N2 | 43.27 | x | Lhp1p | 1 | 20
P60752 | 2.92 | DP00400 | (Lipid A export ATP-binding OR permease protein msbA) | 1 | 141
Q9HKT1 | 9.92 | DP00096 | Lipoate-protein ligase A subunit 1 | 1 | 4
P31025 | 11.93 | DP00647 | Lipocalin-1 | 2 | 209
Q6EAJ7 | 62.42 | x | Listeria monocytogenes ActA | 1 | 494
x | x | x | LjIDP1 | 1 | 1
Q9SEW0 | 87.32 | x | LLA23 | 1 | 5
Q8T6T7 | 6.56 | x | Lombricine Kinase | 1 | 19
P32170 | 9.79 | DP00429 | L-rhamnose isomerase | 1 | 38
P05455 | 41.91 | DP00229 | Lupus La protein | 3 | 2365
P0C5S4 | 16.67 | DP00292 | luxU Phosphorelay protein | 1 | 9
Q9QXN1 | 55.42 | DP00046 | Lymphoid enhancer-binding factor 1 | 1 | 906


P00703 | 0.68 | DP00259 | Lysozyme C | 2 | 4276
P16636 | 22.14 | x | Lysyl oxidase | 1 | 1537
O33599 | 26.58 | DP00352 | lytM | 1 | 46
B6TRJ5 | 29.11 | x | MA16 | 1 | 17
P39900 | 4.04 | DP00571 | Macrophage metalloelastase | 1 | 617
P21758 | 28.92 | DP00246 | Macrophage scavenger receptor types I and II | 1 | 96
P04156, P04925, P04273 | 32.81, 31.5, 36.22 | DP00466, DP00265, DP00187, DP00483 | major prion protein | 27 | 1025
P55145 | 13.19 | x | (MANF OR mesencephalic astrocyte-derived neutrotrophic factor) | 1 | 65
P46821 | 47.89 | x | MAP1B | 1 | 515
P15146 | 54.0 | DP00122 | (MAP2c OR “microtubule-associated protein 2”) | 3 | 33929
P27816 | 63.8 | x | (MAP4 OR microtubule associated protein 4) | 1 | 291
P26645 | 100.0 | DP00253 | MARCKS | 6 | 742
P61244, P61244-2 | 94.38, 95.36 | DP00084, DP00084_A002 | Max AND c-myc | 5 | 868
P0AE73 | 21.95 | DP00296 | (MaZE OR PemI-like protein 1) AND (Escherichia coli OR E. coli) AND (C-terminal OR C-terminus) | 4 | 8
P0AE70 | 9.91 | DP00299 | MaZF AND (Escherichia coli OR E. coli) | 1 | 116
Q9UIS9 | 37.69 | x | MBD1 | 1 | 205
Q07820 | 30.0 | x | (MCL-1 OR “myeloid cell leukemia 1”) AND (N-terminal OR N-terminus) | 2 | 368
D0VWX1 | 27.27 | DP00681 | MCoCC-1 | 1 | 2
Q00987 | 31.16 | DP00334 | MDM2 | 9 | 6283
O15151 | 22.65 | x | (MDMX OR MDM4) | 1 | 497
Q89933, P04851 | 14.1, 15.11 | DP00160, DP00640 | (“measles virus”) AND nucleoprotein | 21 | 360
P03422 | 44.77 | DP00133 | (“measles virus”) AND phosphoprotein | 9 | 229
P38111 | 3.55 | x | mec1 | 1 | 515
P51608 | 74.49 | DP00539 | (MeCP2 OR Methyl-CpG-binding protein 2) | 6 | 1977
G4V6N9 | 68.38 | x | MEG-14 | 1 | 2
Q91V27 | 40.34 | DP00541 | melanophilin | 1 | 81
Q80FJ1 | 34.4 | DP00276 | Membrane fusion protein p14 | 1 | 27
Q14696 | 52.56 | x | (mesoderm development candidate 2 OR MESD) | 1 | 139
Q82S91 | 83.76 | DP00205 | Metal-binding protein smbP | 1 | 1
Q9XBN7 | 7.23 | x | Metallo beta lactamase | 1 | 2055
P02795 | 19.67 | x | metallothionein-2A | 1 | 75
Q8B9Q8 | 69.05 | x | metapneumovirus phosphoprotein | 1 | 19
P22868 | 21.26 | DP00379 | Methane monooxygenase component C | 1 | 800
P02942 | 9.62 | DP00300 | Methyl-accepting chemotaxis protein I | 1 | 931
P07017 | 14.47 | DP00294 | Methyl-accepting chemotaxis protein II | 1 | 930
Q25460 | 64.8 | x | (mfp-1 OR fp1) AND mussel | 1 | 10
Q2Q9Z9 | 10.14 | x | (mfp-3 OR fp3) AND mussel | 1 | 10
Q25464 | 2.75 | x | (mfp-2 OR fp2) AND mussel | 1 | 3
P32787 | 17.1 | x | MGM101 | 1 | 13
Q29983 | 7.57 | DP00670 | MHC class I polypeptide-related sequence A | 1 | 30
O27798 | 11.11 | x | (minichromosome maintenance protein OR MCM OR N-mtMCM) AND protein | 1 | 2175
Q9VFY9 | 23.49 | x | 2mit | 1 | 1
Q9Y3D6 | 9.87 | DP00457 | (Mitochondria* fission 1 protein OR Fis1) | 1 | 829
Q03164 | 38.47 | x | (“mixed lineage leukemia”) AND protein | 2 | 825
Q969V6 | 46.29 | x | MKL1 | 1 | 178
P11409 | 11.01 | DP00060 | Modification methylase PvuII | 1 | 49
Q13485 | 17.03 | DP00464 | Mothers against decapentaplegic homolog 4 | 1 | 2781
x | x | x | MpAsr | 1 | 2
Q03834 | 19.57 | x | Msh6 | 1 | 1146
P33748 | 37.22 | x | Msn2 | 1 | 248
B7SX18 | 83.2 | x | MSP2 AND “Plasmodium Falciparum” | 9 | 204
Q13043 | 31.62 | x | MST1 SARAH Domain | 1 | 14
Q9DBZ9 | 29.99 | x | mSYD1A | 1 | 1
P52477 | 22.19 | DP00401 | Multidrug resistance protein mexA | 1 | 101
P35372 | 8.5 | DP00272 | Mu-type opioid receptor | 1 | 114
P02687, P02686, P04370, P81558 | 57.4, 50.0, 57.6, 76.61 | DP00047, DP00236, DP00237, DP00663 | (“myelin basic protein”) | 34 | 9697


P70475 | 46.67 | DP00049 | Myelin transcription factor 1-like protein | 1 | 6
P15172 | 35.94 | x | MyoD | 1 | 3390
P02185 | 16.88 | DP00303 | Myoglobin or apomyoglobin | 3 | 11645
P52179 | 18.87 | DP00517 | Myomesin-1 | 1 | 11
Q9Y4I1 | 14.56 | x | myosin 5a | 1 | 49
P90648 | 15.3 | x | (myosin II heavy chain kinase B OR MHCK-B) | 1 | 69
P24844 | 37.21 | x | Myosin light chain 1 | 1 | 5351
P10587 | 34.66 | DP00102 | ("myosin motor domain") | 2 | 117
Q9TW98 | 11.45 | x | n16.3 | 1 | 1
P38996 | 42.39 | x | Nab3 | 1 | 57
Q9C000 | 11.34 | DP00554 | NACHT LRR PYD domains-containing protein 1 | 1 | 15
Q969F2 | 51.66 | DP00520 | Naked2 | 1 | 8
O75376 | 48.85 | x | (NCoR OR NCoR-1 OR SMRT) | 1 | 896
P26477 | 56.7 | DP00027 | Negative regulator of flagellin synthesis | 1 | 47
Q96FI4 | 37.18 | x | NEIL1 AND (C-terminal or C-terminus) | 2 | 14
P19246 | 74.22 | DP00050 | Neurofilament heavy polypeptide | 3 | 139
P02547 | 36.98 | DP00151 | Neurofilament light polypeptide | 2 | 317
Q92686 | 69.23 | x | neurogranin | 2 | 290
Q9NZ94 | 8.84 | DP00553 | (“neuroligin 3”) | 4 | 105
P17677 | 100.0 | x | neuromodulin | 1 | 2519
P30990 | 14.71 | x | neurotensin | 1 | 5218
P25963 | 25.55 | DP00468 | NF-kappa-B inhibitor alpha OR IkappaBalpha | 5 | 7656
P01138 | 14.94 | x | NGF AND neurotrophin | 1 | 2287
Q92886 | 40.08 | DP00672 | (NGN1 OR neurogenin 1) | 1 | 430
P19634 | 4.66 | x | NHE1 AND (C-terminal OR C-terminus) | 2 | 50
Q9IH62 | 1.83 | DP00686 | (Nipah virus OR NiV) AND glycoprotein | 1 | 177
Q9IK92 | 15.79 | DP00697 | (Nipah virus OR NIV) AND nucleoprotein | 1 | 18
Q9IK91 | 35.26 | DP00699 | (Nipah virus OR NIV) AND phosphoprotein | 1 | 39
Q12972 | 33.9 | x | NIPP1 | 3 | 39
A0A023YU88 | 7.85 | x | NleH | 1 | 22
Q13224 | 7.35 | x | (NMDAR2B OR GluN2B) | 4 | 459
Q7Z406 | 34.54 | x | (NMII OR non muscle myosin II) | 1 | 2413
Q96E22 | 13.31 | x | Nogo-B AND (N-terminal OR N-terminus OR C-terminal OR C-terminus) | 4 | 17
P11632 | 94.62 | DP00432 | Non-histone chromosomal protein 6A | 1 | 4
P02315 | 100.0 | DP00042 | Non-histone chromosomal protein H6 | 1 | 14
P02316 | 100.0 | DP00038 | Non-histone chromosomal protein HMG-14 | 1 | 71
P05204, P02313 | 100.0, 100.0 | DP00039, DP00195 | Non-histone chromosomal protein HMG-17 | 1 | 75
I7GQ94 | 22.48 | x | (n16 OR n16N) | 1 | 185
Q61985 | 29.82 | DP00671 | Nrf2 AND Neh2 domain | 2 | 27
Q13127 | 50.23 | x | (NRSF OR neural restrictive silencer factor) | 2 | 317
Q9WMX2, P27958 | 3.32, 3.29 | DP00588_C010, DP00615_C010, DP00615 | (NS5A OR Non structural protein 5A) AND hepatitis | 16 | 1342
P26662 | 2.96 | x | (NS5B OR Non structural protein 5B) AND hepatitis | 1 | 1446
Q80J95 | 16.6 | x | NS1-2 norovirus | 1 | 8
P26662 | 2.96 | x | (NS3 OR non structural protein 3) AND hepatitis C virus | 1 | 2768
x | x | x | (Nsp OR nucleoskeletal-like protein) AND bacillus subtilis | 1 | 2
P34130 | 15.24 | x | NT4 AND neurotrophin | 1 | 146
Q9SQ56 | 66.56 | x | NtGR-RBP1 | 1 | 1
Q9BYD2 | 12.36 | x | (NTL9 OR N terminal domain of the ribosomal protein L9) | 1 | 66
O67198 | 9.34 | x | Ntrc1 | 1 | 13
Q09161 | 19.11 | DP00392 | Nuclear cap-binding protein subunit 1 | 1 | 39
P52298 | 20.51 | DP00393 | Nuclear cap-binding protein subunit 2 | 1 | 37
Q8WUM0 | 8.04 | DP00318 | Nuclear pore complex protein Nup133 | 2 | 39
Q9Y6Q9-3 | 36.75 | DP00343 | Nuclear receptor coactivator 3 | 1 | 506
P03347 | 32.62 | DP00148, DP00148_C004 | Nucleocapsid protein p7 | 2 | 164
P05221 | 66.5 | DP00217 | Nucleoplasmin | 2 | 335
Q9RY71 | 16.37 | x | nudix hydrolase | 1 | 214
P35658 | 30.96 | x | Nup214 | 1 | 182
P40477 | 29.66 | x | Nup159 | 3 | 43
P57740 | 11.24 | x | Nup107 | 2 | 82
P37198 | 26.44 | x | Nup62 | 1 | 70
P49790 | 40.41 | x | Nup153 | 5 | 181


Q02630 | 37.11 | x | Nup116 AND (Saccharomyces cerevisiae OR yeast) | 1 | 37
P52948 | 19.15 | x | Nup96 OR Nup98 | 2 | 389
P32499 | 67.22 | DP00222 | NUP2P nucleoporin AND (Saccharomyces cerevisiae OR yeast) | 1 | 19
O60356 | 100.0 | x | (Nupr1 OR p8 nucleoprotein) AND human | 2 | 58
O24172 | 36.0 | DP00580 | ("ole e 6") | 1 | 9
P08523 | 28.22 | DP00279 | Olfactory marker protein | 1 | 845
Q9FUW7 | 100.0 | DP00227 | Omega gliadin storage protein | 1 | 24
P02631 | 32.11 | DP00730 | Oncomodulin | 2 | 143
Q39532 | 44.36 | DP00470 | Opaque 2 | 1 | 2868
P10451 | 45.86 | DP00214 | (OPN OR osteopontin) | 10 | 7702
E7QAI1 | 42.5 | x | Opy2p | 1 | 2
P07805 | 2.84 | DP00051 | Ornithine decarboxylase | 2 | 7562
Q13438 | 37.03 | x | OS-9 | 1 | 124
P81455 | 38.78 | DP00116 | Osteocalcin | 2 | 14028
Q9BRP0 | 26.91 | x | OVOL2 | 1 | 15
O00110 | 25.23 | x | OVOL3 | 1 | 1
Q86XL8 | 14.75 | x | OVOL1 | 1 | 26
Q09472 | 28.38 | DP00633 | p300 acetyltransferase | 4 | 930
Q96GU1 | 72.31 | x | (PAGE5 OR Prostate associated gene 5) | 2 | 2001
O60829 | 100.0 | x | (PAGE4 OR “Prostate associated gene 4”) | 4 | 2014
Q66SS1 | 35.07 | DP00568 | Paired box 6 | 1 | 1305
O96013 | 31.64 | x | PAK4 | 1 | 160
P32521 | 37.3 | x | Pan1 | 1 | 77
Q58A45 | 12.97 | x | pan3 | 1 | 39
P00974 | 5.0 | DP00729 | Pancreatic trypsin inhibitor | 1 | 4801
P04637 | 39.69 | DP00086 | p53 AND (CTD or C-terminal or C-terminus) | 61 | 1240
O60356 | 100.0 | DP00510 | p8 AND human AND nucleoprotein | 2 | 21
P15309 | 6.22 | x | PAP AND protein | 1 | 5439
Q96IZ0 | 68.24 | x | Par-4 AND (C-terminal or C-terminus) | 4 | 17
P01270 | 49.57 | DP00637 | Parathyroid Hormone AND human | 2 | 24877
P12272 | 76.84 | DP00138 | Parathyroid hormone-related protein | 2 | 4240
P9WIJ9 | 18.6 | x | ParB | 1 | 279
Q9KJ82 | 34.21 | DP00529 | parG AND (N-terminal OR N-terminus) | 2 | 16
Q96262 | 71.11 | x | PCaP1 | 2 | 8
Q61140 | 36.16 | x | P130Cas AND CasSD | 2 | 6
P22239 | 100.0 | DP00112 | PCC6 OR Dsp16 | 1 | 7
P38936 | 44.51 | DP00016 | (p21(Cip1) OR p21 OR Cyclin-dependent kinase inhibitor 1) | 10 | 29902
P9WGM2 | 20.0 | DP00247 | pdtaR | 1 | 3
Q9RHW0 | 5.36 | DP00738 | Pectate lyase | 1 | 638
Q9LXB8 | 95.41 | x | (PELPK1 OR At5g09530) | 1 | 2
P81058 | 4.88 | DP00373 | Penaeidin-3a | 1 | 2
P48539 | 91.94 | x | PEP-19 | 3 | 64
P19021 | 8.74 | x | peptidylglycine alpha-amidating monooxygenase | 1 | 484
Q7RSH5 | 7.14 | DP00745 | Peptidyl-prolyl cis-trans isomerase | 1 | 6121
P0C2E9 | 4.8 | DP00280 | Perfringolysin O | 2 | 233
P17810 | 4.64 | DP00220 | Peripherin-2 | 1 | 32
P07598 | 8.31 | DP00108 | Periplasmic hydrogenase large subunit | 1 | 23
Q9Y9L0 | 11.2 | DP00037 | peroxiredoxin AND Aeropyrum pernix | 1 | 8
P37231-2 | 6.92 | DP00718, DP00718_A001 | Peroxisome proliferator-activated receptor gamma | 2 | 15469
P40855 | 48.83 | x | Pex19 | 1 | 95
P50542-3 | 24.56 | DP00472 | (Pex5p OR peroxisomal cycling receptor) | 1 | 182
P22621 | 10.29 | x | (PfAMA1 OR (plasmodium falciparum AND "apical membrane antigen 1")) | 1 | 248
Q25733 | 28.56 | DP00746 | PfEMP1 | 1 | 404
Q3YL59 | 21.32 | x | PFMG1 | 1 | 2
Q07412 | 2.82 | DP00614 | (PfTIM OR triosephosphate isomerase) AND plasmodium falciparum | 1 | 31
Q9UBK2 | 32.08 | x | PGC-1alpha AND “activation domain” | 3 | 44
Q88HS5 | 15.81 | x | PhaF | 1 | 64
F1ZCA0 | 71.26 | x | phasin AND "Novosphingobium nitrogenifigens" | 1 | 1
P27001 | 10.29 | DP00053 | Phenylalanyl-tRNA synthetase alpha chain | 2 | 12
P0AFJ5 | 12.23 | x | phob | 1 | 354
Q08209 | 20.35 | DP00092 | phosphatase 2B subunit alpha | 1 | 62
P10688 | 10.85 | DP00055 | 1-phosphatidylinositol-4,5-bisphosphate phosphodiesterase delta-1 | 2 | 86
P78356 | 15.87 | DP00054 | Phosphatidylinositol-4-phosphate 5-kinase type II beta | 1 | 9
P26678 | 38.46 | x | (Phospholamban OR PLN) | 1 | 2847
P12359 | 21.99 | DP00188 | ((photosystem II OR PSII) AND manganese stabilizing protein OR "Oxygen-evolving enhancer protein 1") | 11 | 160
O14832 | 6.51 | DP00327 | Phytanoyl-CoA dioxygenase | 1 | 7
Q9NWS0 | 27.59 | x | Pih1 | 1 | 26
Q07932 | 22.32 | DP00567 | Pilosulin-1 | 1 | 7
Q13526 | 49.69 | x | pin1 | 1 | 1054
O70161 | 26.48 | x | PIPKIgamma661 | 1 | 6
P49918 | 45.25 | DP00017 | p57Kip2 | 1 | 292
P19525 | 21.23 | x | (PKR OR protein kinase R) | 1 | 21830
Q03133 | 9.21 | x | (PKS OR modular polyketide synthase) | 1 | 2214
P00747 | 8.89 | DP00191 | Plasminogen | 1 | 47732
P05121 | 6.72 | DP00320 | Plasminogen activator inhibitor 1 | 1 | 10218
P53350 | 10.12 | DP00428 | PLK1 | 1 | 1171
P29590 | 24.94 | x | PMLII | 1 | 3
E7Q6S2 | 12.25 | x | Pml1p | 2 | 6
Q9XES8 | 66.29 | DP00665 | (PM28 OR GmPM28) | 1 | 4
P01139 | 46.06 | x | (pNGF OR pro-Nerve Growth Factor OR proNGF) | 3 | 219
Q96QC0 | 45.96 | x | PNUTS | 1 | 33
P03300 | 2.26 | x | poliovirus 3AB | 1 | 59
P03305 | 11.41 | DP00573 | polyprotein foot-and-mouth disease virus | 2 | 140
Q6IBA2 | 74.8 | DP00501 | (positive cofactor 4 OR PC4) AND Human | 1 | 498
P08510 | 14.5 | DP00267 | Potassium voltage-gated channel protein Shaker | 1 | 631
P14859 | 26.38 | DP00231 | POU domain AND class 2 AND transcription factor | 1 | 99
Q64693, Q16633 | 33.98, 33.98 | DP00008, DP00172 | POU domain class 2-associating factor 1 | 3 | 3
P37231 | 8.51 | x | PPARgamma | 1 | 15073
P17767 | 8.44 | x | P1 protease | 1 | 3149
Q9VXX1 | 55.66 | x | PPYR1 OR CG15031 | 1 | 7
O60828 | 56.23 | x | (PQBP-1 OR “polyglutamine tract binding protein 1”) | 3 | 162
U5Y3Y2 | 74.66 | x | precol-NG AND "mussel byssus" | 1 | 2


Q6VBP1 | 7.25 | x | preS1 | 2 | 365
P58743 | 2.55 | x | prestin | 1 | 290
Q06253 | 24.66 | DP00288 | Prevent host death protein | 3 | 373
A1Z6W3 | 30.64 | x | prickle AND PET domain | 1 | 6
P06401 | 23.9 | x | progesterone receptor AND CTE | 3 | 6
P9WHN4 | 90.62 | DP00293 | (“prokaryotic ubiquitin-like protein”) | 7 | 69
P45577 | 53.88 | DP00377 | ProP effector | 1 | 4
Q15185 | 45.62 | DP00358 | Prostaglandin E synthase 3 | 2 | 1038
P15309 | 6.22 | DP00628 | ("prostatic acid phosphatase") | 2 | 1647
P01094 | 85.29 | DP00179 | Protease A Inhibitor 3 | 1 | 38105
Q9UM07 | 7.69 | DP00321 | Protein-arginine deiminase type-4 | 1 | 10
Q9UKV8 | 15.13 | DP00736 | Protein argonaute-2 | 1 | 255
Q6P8Z1 | 51.18 | DP00564 | Protein B-Myc | 1 | 15
P27577 | 14.32 | DP00111 | Protein C-ets-1 | 2 | 1114
P09372 | 32.99 | DP00103 | Protein grpE | 1 | 561
P10644 | 13.91 | x | ("protein kinase A") | 3 | 14902
P17252 | 8.04 | x | Protein kinase C alpha V5 domain | 1 | 5
Q51912 | 47.84 | DP00121 | Protein L precursor | 1 | 7035
P04324, P03406 | 22.82, 27.18 | DP00189, DP00048 | Protein Nef | 1 | 3009
P11845, P41236, Q8R3G1 | 60.98, 72.2, 30.2 | DP00232 | (Protein phosphatase inhibitor 2 OR PP1 I-2) | 5 | 180
O60927 | 69.05 | DP00219 | Protein phosphatase 1 regulatory subunit 11 | 1 | 36
P01099 | 100.0 | DP00325 | Protein phosphatase 1 regulatory subunit 1A | 1 | 7
Q90623-2 | 45.38 | DP00218 | (Protein phosphatase 1 regulatory subunit 12A OR MYPT1) | 2 | 370
P07516 | 85.15 | DP00421 | Protein phosphatase 1 regulatory subunit 1B | 1 | 7
P40357 | 43.16 | DP00128 | Protein transport protein SEC9 | 2 | 32
P60058 | 13.24 | DP00117 | Protein transport protein Sec61 subunit gamma | 1 | 8
Q93096 | 8.09 | DP00255 | Protein tyrosine phosphatase type IVA 1 | 1 | 6
O75365 | 10.4 | DP00254 | Protein tyrosine phosphatase type IVA 3 | 1 | 4
x | x | x | proteoglycans syndecans | 1 | 2090
P06302 | 100.0 | DP00058 | prothymosin alpha | 14 | 377
P11309-2 | 8.63 | DP00322 | Proto-oncogene serine/threonine-protein kinase Pim-1 | 1 | 538
Q6P2Q9 | 5.35 | x | Prp8 | 1 | 153
P35820 | 45.03 | x | (PSC OR posterior sex combs) | 1 | 4066
P78352 | 8.84 | x | (PSD-95 OR Post Synaptic Density 95) | 2 | 1744
P19243 | 29.11 | DP00676 | PsHSP18.1 | 1 | 3
P60484 | 19.85 | x | PTEN | 2 | 9534
P18031 | 24.6 | x | PTP1B | 1 | 1232
H0USY9 | 24.73 | x | PulS | 1 | 516
Q9BXH1 | 53.89 | x | PUMA | 3 | 1997
P48539 | 91.94 | DP00592 | (Purkinje cell protein 4 OR PEP19) | 3 | 2220
Q9N607 | 12.32 | x | (PvAMA-1 OR Plasmodium Vivax Apical Membrane Antigen 1) | 1 | 68
Q02597 | 6.04 | DP00560, DP00560_C007 | PVY AND potato virus | 1 | 365
Q01144 | 63.45 | x | PWL2 OR PWL2D | 1 | 9
O75469 | 12.44 | DP00323 | PXR receptor AND human | 1 | 1030
P0AFI7 | 11.93 | DP00165 | Pyridoxine-5'-phosphate oxidase | 1 | 78
Q9NVS9 | 16.09 | DP00168 | Pyridoxine/pyridoxamine 5'-phosphate oxidase | 1 | 3
P37362 | 90.0 | x | pyrrhocoricin | 1 | 24
P0AFG8 | 4.96 | DP00427 | Pyruvate dehydrogenase E1 | 1 | 1159
Q32NN2 | 22.58 | DP00286 | quaking-A protein | 1 | 258
P22363 | 30.3 | x | rabies virus phosphoprotein | 1 | 164
P37727 | 16.92 | DP00458 | Rab proteins geranylgeranyltransferase component A 1 | 2 | 8
P31751 | 9.98 | DP00304 | RAC-beta serine/threonine-protein kinase | 2 | 5
P43351 | 32.54 | DP00437 | RAD52 homolog DNA repair protein | 1 | 63
P04049 | 15.28 | DP00171 | RAF kinase | 4 | 11589
P10114 | 12.02 | DP00167 | Rap-2a | 2 | 3
P11938 | 30.71 | DP00020 | RAP1 AND DNA | 1 | 539
P51449 | 20.85 | x | RAR gamma AND (N-terminal OR N-terminus) | 1 | 22
P63000 | 5.21 | DP00408 | Ras-related C3 botulinum toxin substrate 1 | 1 | 4520
P11233 | 34.47 | DP00581 | Ras-related protein Ral-A | 1 | 6
Q9A3A9 | 24.85 | x | RcdA | 1 | 230
P9WHJ3 | 7.22 | x | recA intein AND Mycobacterium tuberculosis | 1 | 31


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
P18754 | 9.26 | DP00691 | Regulator of chromosome condensation | 2 | 140
P49799 | 27.32 | DP00063 | Regulator of G-protein signaling 4 | 1 | 846
P03040 | 30.3 | DP00741 | Regulatory protein cro | 5 | 330
P9WHG9 | 8.54 | x | (RelA OR SPoT) AND (C-terminus OR C-terminal) | 1 | 375
P27694 | 13.8 | DP00061 | Replication protein A AND 70 | 1 | 137
Q9NQC3-2 | 31.9 | DP00524 | (Reticulon-4 OR RTN 4B) | 1 | 600
P04972 | 79.31 | DP00638, DP00347 | (retinal OR cGMP) AND phosphodiesterase AND “gamma subunit” | 6 | 287
P62965 | 12.41 | DP00340 | Retinoic acid receptor RXR-alpha | 2 | 1745
O54828 | 27.11 | x | (RGS9-2 OR Regulator of G-protein signaling 9) | 1 | 125
x | x | x | Rhabdoviridae AND nucleoprotein | 1 | 613
x | x | x | Rhabdoviridae AND phosphoprotein | 1 | 433
P02699 | 13.51 | DP00271 | rhodopsin | 2 | 8760
Q07960 | 24.6 | DP00459 | Rho GTPase | 1 | 11298
O28362 | 13.73 | DP00382 | Ribonuclease P protein component 1 | 1 | 53
P11157 | 10.51 | DP00462 | Ribonucleoside-diphosphate reductase M2 subunit | 3 | 74
P49723 | 10.72 | DP00488 | Ribonucleoside-diphosphate reductase small chain 2 | 1 | 4
P09938 | 18.3 | DP00487 | Ribonucleoside-diphosphate reductase small chain 1 | 1 | 5
P69924 | 9.57 | DP00107 | Ribonucleoside-diphosphate reductase 1 subunit beta | 1 | 9
B0B899 | 30.63 | x | ribosomal protein L2 | 1 | 89
P60723 | 28.86 | DP00600 | ribosomal protein L4 | 2 | 168
P56210 | 27.82 | DP00512 | ribosomal protein L11 | 2 | 277
P0A7N9 | 100.0 | DP00143 | ribosomal protein L33 | 1 | 74
P05318 | 40.57 | DP00164 | ribosomal protein P1-alpha | 1 | 3
Q9HFQ6 | 46.3 | DP00001 | ribosomal protein P1-B | 1 | 1
P02400 | 54.55 | DP00002 | ribosomal protein P2-beta | 2 | 8
B9JVV3 | 19.02 | x | ribosomal protein S4 | 2 | 205
P0AG63 | 16.67 | DP00242 | ribosomal protein S17 | 1 | 45
P08865 | 30.85 | x | ribosomal protein SA AND human | 2 | 58
P16083 | 6.49 | DP00727 | Ribosyldihydronicotinamide dehydrogenase | 1 | 105
Q22472 | 56.08 | DP00613 | RIC-3 | 1 | 51
Q6PR54 | 22.57 | x | Rif1 | 1 | 170


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
P0AEM6 | 12.97 | x | RNA Polymerase AND Escherichia coli | 1 | 11148
Q9Y5B0 | 40.27 | DP00177 | RNA polymerase II subunit A C-terminal domain phosphatase OR FCP1 | 4 | 121
Q9A749 | 26.17 | x | ("Rnase E" AND caulobacter crescentus) | 1 | 3
P21513 | 28.84 | DP00207 | (“RNase E” AND (escherichia coli OR e. coli)) | 3 | 496
P25814 | 43.1 | DP00387 | (“RNase P” OR “ribonuclease P” ) AND bacillus subtilis | 2 | 157
O31774 | 17.88 | x | ("RNase Y") AND Bacillus subtilis | 1 | 16
P78317 | 37.89 | x | RNF4 ubiquitin ligase | 1 | 64
S8GC24 | 38.75 | x | ROP6 | 1 | 19
P11171 | 37.5 | DP00678 | (4.1R OR "Protein 4.1") | 1 | 704
Q8GYN5 | 69.19 | x | RPM1-interacting protein 4 | 1 | 9
Q03465 | 32.02 | x | Rpn4 | 1 | 65
P9WIP5 | 21.13 | x | Rv2377c | 1 | 1
P9WPQ1 | 15.49 | x | Rv3221c | 1 | 1
Q9CQK7 | 50.62 | DP00587 | (Rwdd1 OR RWD domain-conatining protein 1) | 1 | 4
Q8N488 | 78.51 | DP00694 | RYBP | 1 | 57
P21817 | 11.49 | x | RyR1 | 1 | 5680
P0AEI4 | 6.35 | x | S12 | 1 | 2307
P26447 | 21.78 | x | S100A4 | 1 | 899
P33764 | 19.8 | x | S100A3 | 1 | 45
P80511 | 13.04 | x | S100A12 | 1 | 331
P29034 | 16.33 | x | S100A2 | 1 | 195
P06703 | 14.44 | x | S100A6 | 1 | 410
P0DJI8 | 24.59 | x | (SAA OR serum amyloid A) | 1 | 6478
P22470 | 42.46 | x | San1 | 2 | 29
P05109 | 16.13 | x | (S100A8 OR S100A9) | 1 | 1466
P07602 | 10.69 | DP00733_C004 | Saposin-C | 4 | 635
Q19QV9 | 23.47 | x | (“SARS coronavirus” AND (“nucleocapsid protein” OR “nucleocapsid phosphoprotein”)) | 2 | 368
Q20010 | 60.4 | x | SAS-5 | 1 | 39
P04271 | 25.0 | x | S100B | 1 | 1667
C4MN95 | 80.69 | x | (SbASR-1 OR Abscisic acid stress ripening protein) | 1 | 63
C0J347 | 28.88 | x | (SBDS OR TcSBDS) AND Trypanosoma cruzi | 1 | 1
Q931F4 | 40.85 | x | Sbi-III | 1 | 1


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
Q96T21 | 37.0 | DP00420 | (SBP2 OR SECIS binding protein 2) | 2 | 142
P23389 | 50.93 | DP00124 | Secretogranin-1 | 1 | 413
P40316, O95997 | 69.44, 71.78 | DP00256, DP00521 | (securin OR PTTG1 OR pituitary transforming gene 1 product) AND (N-terminal OR N-terminus) | 3 | 25
Q68FT9 | 7.64 | DP00620 | Selenocysteine lyase | 1 | 71
Q9BQE4 | 45.5 | x | (“selenoprotein s” OR SelS OR VIMP) | 3 | 2084
O94742 | 47.19 | x | Sem1 | 1 | 33
P02783 | 83.93 | DP00527 | (Seminal vesicle protein number 4 OR SV-IV) | 2 | 153
Q9EB08 | 20.28 | x | (SeMV OR "sesbania mosaic virus") AND polyprotein 2a | 1 | 4
Q07097 | 16.98 | DP00629 | (“sendai virus”) AND nucleoprotein | 2 | 191
P04859 | 52.11 | x | (“sendai virus”) AND phosphoprotein | 5 | 308
O43236 | 23.01 | x | sept4 | 1 | 79
O43236 | 23.01 | DP00537 | SEPT4 AND septins | 1 | 74
O75920 | 65.45 | x | SERF1a | 1 | 7
O86488 | 31.79 | DP00065 | Serine-aspartate repeat-containing protein D | 1 | 2
O43464 | 6.99 | DP00315 | Serine protease HTRA2 AND mitochondria* | 1 | 231
P53041 | 10.82 | DP00365 | Serine/threonine protein phosphatase 5 | 1 | 68
P31645 | 5.56 | x | (SERT OR serotonin transporter) | 1 | 8514
P02768 | 11.82 | DP00515 | ("serum albumin") AND human | 1 | 42565
P11831 | 36.02 | DP00574 | serum response factor | 1 | 24776
P34945 | 15.91 | DP00514 | Seryl-tRNA synthetase | 3 | 258
O75533 | 12.27 | x | SF3b155 | 1 | 23
x | x | x | Sfbl-5 | 1 | 2
Q86XK3 | 35.92 | x | sfr1 | 1 | 35
P10824 | 15.82 | DP00035 | S-Gi alpha 1 | 1 | 1
P35187 | 26.88 | x | Sgs1 helicase | 1 | 323
P29353 | 30.19 | DP00154 | SHC-transforming protein 1 | 1 | 1430
U5Y0K9 | 25.95 | x | shematrin | 1 | 5
Q9Y3R4 | 21.05 | DP00261 | Sialidase-2 | 1 | 5
P38634 | 57.75 | DP00631 | Sic1 AND CDK | 13 | 87
P0AFX7 | 43.98 | DP00552 | Sigma-E factor negative regulatory protein | 1 | 16
P03087 | 27.9 | DP00182 | Simian virus Major capsid protein VP1 | 2 | 79


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
Q5VVH5 | 33.08 | x | SIMPL | 1 | 23
P27285 | 9.96 | DP00066, DP00066_C001 | sindbis virus AND capsid protein | 1 | 195
P50263 | 86.08 | x | SIP18 | 1 | 11
P06701 | 15.54 | DP00533 | (Sir3p OR silent information regulator 3 protein OR SIR3) AND saccharomyces cerevisiae | 6 | 385
Q96EB6 | 31.06 | x | sirt1 AND (N-terminal OR N-terminus OR C-terminal OR C-terminus) | 3 | 83
P25294 | 53.12 | x | Sis1 AND yeast | 1 | 80
Q13573 | 49.63 | DP00608 | (SKIP OR Ski interaction protein) AND SNW | 1 | 16
P52285 | 27.16 | x | skp1 | 1 | 918
Q9VAN6 | 63.41 | DP00144 | SLBP OR (Histone RNA hairpin-binding protein) | 3 | 111
Q57733 | 14.29 | DP00067 | Small heat shock protein HSP16.5 | 1 | 28
P32566 | 32.87 | x | (small molecule reductase regulatory protein OR SmI1 OR knr4) | 1 | 131
P22531 | 59.72 | DP00130 | Small proline-rich protein 2E | 2 | 249
Q6DIC0 | 34.12 | x | SMARCA2 | 1 | 301
Q9H0W8 | 31.15 | x | SMG-9 | 1 | 5
P19972 | 12.61 | DP00180, DP00180_C003 | SMK toxin | 1 | 6
Q99LM3 | 70.81 | DP00742 | (Smoothelin-like protein 1 OR SMTNL1) | 4 | 13
O95149 | 21.11 | x | snurportin | 1 | 23
P63292 | 21.7 | DP00125 | Somatoliberin | 1 | 6955
O30916 | 12.12 | x | SopB | 1 | 200
Q9S446 | 36.89 | DP00208 | sortase | 3 | 528
Q99523 | 3.85 | x | Sortilin | 1 | 387
Q08826 | 14.2 | DP00482 | Sorting nexin-3 | 1 | 9
Q07889 | 20.63 | x | Sos1 | 1 | 515
P03607 | 18.28 | DP00064 | southern bean mosaic virus capsid protein | 1 | 36
P08047 | 24.59 | DP00378 | Sp1 | 2 | 9875
x | x | x | (SPA OR septal pore associated proteins) | 1 | 8059
P07214 | 15.23 | DP00052 | SPARC | 1 | 2696
A1Z0H7 | 38.52 | x | Spatzle | 1 | 127
G2TRL7 | 41.18 | x | spd2 | 1 | 22
Q10585 | 52.42 | x | spd1 | 3 | 42


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
P15340 | 96.77 | DP00057 | Sperm histone | 2 | 1664
Q96SB3 | 46.38 | x | spinophilin | 1 | 256
x | x | x | split intein | 1 | 171
O34800 | 58.93 | x | SpoIISB | 1 | 5
x | x | x | src family kinase | 1 | 11278
P0A7L8 | 45.88 | DP00140 | 50S ribosomal protein L27 | 2 | 6
P0A7S3 | 22.58 | DP00145 | 30S ribosomal protein S12 | 1 | 17
P0A7T7 | 18.67 | DP00146 | 30S ribosomal protein S18 | 1 | 7
P0A7U3 | 34.78 | DP00147 | 30S ribosomal protein S19 | 1 | 10
P09132 | 45.83 | DP00570 | SRP19 | 1 | 86
P0AGE2 | 38.76 | DP00722 | SSB AND (escherichia coli OR e. coli) AND (C-terminal OR C-terminus) | 3 | 66
Q9FD10 | 3.92 | x | (SseJ OR SseJ-H OR SseJ-L) AND salmonella | 1 | 31
Q9UWU0 | 0.0 | x | Sso Acp | 1 | 8
O75324 | 12.5 | DP00162 | Stannin | 1 | 21
A2VD23 | 100.0 | DP00584 | starmaker | 3 | 9
P02808 | 43.55 | x | statherin | 1 | 193
P16949 | 100.0 | DP00174 | stathmin | 5 | 867
Q7Z626 | 18.17 | x | STIL | 1 | 161
Q2M3R5 | 5.75 | x | Stim1 | 1 | 876
Q7CL96 | 13.62 | DP00302 | stringent starvation protein A | 1 | 7
P0AFZ3 | 42.42 | DP00194 | (stringent starvation protein B OR SspB) | 1 | 147
P04189 | 12.86 | DP00394 | subtilisin AND (propeptide OR pro-peptide) | 3 | 166
Q9UMX1 | 16.74 | x | SUFU | 1 | 241
P55789 | 34.63 | x | sulfhydryl oxidase ALR | 1 | 33
P38038 | 8.35 | DP00190 | ("sulfite reductase") AND NADPH | 1 | 94
P0DMM9 | 9.49 | DP00011 | Sulfotransferase 1A3/1A4 | 1 | 3
O00204, O00204-2 | 27.95, 27.71 | DP00404, DP00404_A002 | (SULT2B1 OR SULT2B1b OR Sulfotransferase family cytosolic 2B member 1) | 1 | 74
Q9UBE0 | 19.65 | DP00485 | SUMO-activating enzyme subunit 1 | 1 | 5
Q9UBT2 | 18.28 | DP00486 | SUMO-activating enzyme subunit 2 | 1 | 3
P00441 | 29.87 | DP00652 | ("Superoxide dismutase") | 1 | 58522
O31851 | 16.84 | DP00257 | superoxide dismutase-like yojM | 1 | 158
O95425 | 32.79 | x | supervillin | 1 | 35
O35718 | 28.89 | DP00446 | Suppressor of cytokine signaling | 1 | 3127


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
 |  |  | 3 |  | 
Q8NHG7 | 100.0 | x | SVIP | 1 | 13
P40688 | 33.94 | x | Swallow AND cytoplasmic dynein | 1 | 5
P31109 | 34.19 | DP00113 | Synaptobrevin homolog 1 | 1 | 27
P63045, P63027 | 43.1, 43.97 | DP00622, DP00069 | (synaptobrevin OR Vesicle-associated membrane protein 2) | 3 | 2159
A7UMX5 | 66.67 | DP00535 | Synaptopodin 2 variant OR Fesselin | 2 | 15
P60881 | 29.61 | DP00068 | Synaptosomal-associated protein 25 | 1 | 1413
P18827 | 50.32 | x | syndecan-1 | 1 | 1484
P32851 | 19.44 | DP00155 | Syntaxin-1A | 1 | 1104
P26039 | 20.31 | DP00653 | Talin-1 | 1 | 60
P10636-8 | 72.11 | DP00126 | Tau AND (protein OR alzheimer’s OR tauopathies OR neuronal) | 65 | 20253
P03409 | 13.31 | x | Tax AND HTLV transcriptional activator | 1 | 146
E7AIJ0 | 7.37 | x | TB1-C-Grx1 | 1 | 1
Q6SJ96 | 29.87 | x | TBPL2 | 1 | 12
P01730 | 8.52 | DP00123 | T-cell surface glycoprotein CD4 | 1 | 16415
P04234 | 5.85 | DP00505 | T-cell surface glycoprotein CD3 delta chain | 2 | 533
P07766 | 21.74 | DP00506 | T-cell surface glycoprotein CD3 epsilon chain | 2 | 321
P09693 | 15.93 | DP00508 | T-cell surface glycoprotein CD3 gamma chain | 2 | 1278
P20963 | 38.41 | DP00200 | T-cell surface glycoprotein CD3 zeta chain | 8 | 713
Q9C518 | 43.14 | x | TCP8 | 1 | 10
P19250 | 4.23 | DP00668 | (TDH OR thermostable direct hemolysin) AND Vibrio parahaemolyticus | 1 | 430
Q13148 | 18.36 | x | TDP-43 | 2 | 1497
B7T1D7 | 15.44 | DP00703 | Teg12 | 1 | 2
P16458 | 50.39 | DP00659 | Telomere-binding protein subunit beta | 2 | 20
A0A077Y0G0 | 18.42 | x | (Tex1 OR Trophozoite exported protein 1) | 1 | 26
P04867 | 23.67 | x | TGBp1 | 2 | 53
Q8MWS6 | 39.48 | DP00500 | TgDRE | 1 | 2
S8F3V2 | 19.19 | x | TgGCN5 AND "toxoplasma gondii" | 1 | 6
Q96EK4 | 32.48 | x | THAP11 | 1 | 17
P0AA25 | 10.09 | x | thioredoxin AND (escherichia coli OR e. coli) | 4 | 1921


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
Q9BRA2 | 9.76 | x | thioredoxin AND human | 1 | 4112
Q962Y6 | 7.19 | DP00642 | thioredoxin-glutathione reductase AND Schistosoma mansoni | 1 | 20
P40238 | 9.13 | x | thrombopoietin receptor MPL | 1 | 991
Q8GT36 | 82.52 | DP00532 | Thylakoid soluble phosphoprotein | 1 | 7
x | x | x | Thylakoid-soluble phosphoprotein AND Arabidopsis thaliana | 1 | 1
P03176 | 13.56 | DP00419 | Thymidine kinase | 2 | 13187
P04818 | 11.5 | DP00073 | thymidylate synthase | 1 | 4845
P62328 | 100.0 | DP00357 | Thymosin beta-4 | 2 | 673
O14925 | 3.83 | x | TIM23 | 1 | 204
P0DJ91 | 29.93 | x | Tir AND "Escherichia coli" | 1 | 447
Q8WZ42 | 17.39 | DP00072 | titin AND (SH3 OR PEVK) | 7 | 133
Q12000 | 52.46 | x | Tma46 | 1 | 3
Q9HCJ0 | 36.27 | x | tnrc6c | 1 | 14
Q14106 | 30.52 | x | Tob2 | 1 | 38
O81283 | 25.02 | DP00609 | Toc159 | 2 | 72
O94826 | 22.04 | x | Tom70 | 1 | 144
P23627 | 22.71 | x | tomato aspermy virus cucumovirus coat | 1 | 19
P02929 | 48.95 | DP00043 | tonB | 1 | 905
Q9BW30 | 47.73 | x | (Tppp3 OR tubulin polymerization-promoting protein family member 3) | 1 | 9
O94811 | 42.47 | x | TPPP/p25 | 5 | 34
x | x | x | TRAF3IP2 | 1 | 108
P33905 | 7.69 | DP00198 | Transcriptional activator protein traR | 1 | 34
P15884 | 48.88 | DP00224 | ("Transcription factor 4") AND human | 2 | 701
Q9NQB0 | 40.06 | DP00175 | ("Transcription factor 7-like 2") | 2 | 844
Q04207, Q04206 | 24.23, 29.22 | DP00085, DP00129 | (Transcription factor p65 OR NF-kappab p65) | 3 | 10649
P32773 | 40.56 | DP00104 | Transcription initiation factor IIA large subunit | 1 | 23
P32774 | 13.93 | DP00009 | Transcription initiation factor IIA small subunit | 1 | 32
P51123 | 34.62 | DP00081 | Transcription initiation factor TFIID subunit 1 | 1 | 33
Q8NER1 | 6.44 | x | (transient receptor potential OR TRPV1) | 1 | 11105
Q01853 | 13.4 | DP00435 | Transitional endoplasmic reticulum ATPase | 1 | 36


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
P0A707 | 34.44 | DP00197 | Translation initiation factor IF-3 | 1 | 50
Q9SLF3 | 28.28 | DP00610 | translocase of chloroplast 132 | 1 | 2
Q498R8 | 30.29 | x | (translocated promoter region protein OR Traf3ip2) | 1 | 481
Q53WY6 | 34.09 | x | Transthyretin | 1 | 6607
P96086 | 2.24 | DP00484 | Tricorn protease | 1 | 27
P00942 | 4.84 | DP00430 | triosephosphate isomerase yeast | 2 | 186
Q13061 | 90.26 | x | Trisk 95 | 1 | 8
P15502 | 13.23 | x | (tropoelastin OR elastin) | 3 | 9861
Q91006 | 25.63 | DP00131 | tropomodulin AND (N-terminal OR N-terminus) | 3 | 35
P63315 | 37.89 | DP00249 | Troponin C AND (slow skeletal OR cardiac muscles) | 1 | 730
P19429 | 44.29 | DP00166 | Troponin I cardiac muscle | 4 | 2759
I1RJZ4 | 6.21 | x | TRP channel AND (C-terminal or C-terminus) | 2 | 129
Q9H1D0 | 9.67 | x | TRPV6 | 1 | 423
Q9NQA5 | 7.68 | x | TRPV5 | 1 | 343
P47756 | 11.55 | x | TRTK-12 | 1 | 27
P82409 | 29.41 | DP00680 | Trypsin inhibitor 2 | 2 | 13106
P0A877 | 11.94 | DP00252 | Tryptophan synthase alpha chain | 1 | 84
x | x | DP00688 | (TTN-1 OR 2MDa_1) | 1 | 3
P04350 | 8.33 | DP00114 | Tubulin beta-4 chain | 3 | 11
P32882 | 10.79 | DP00169 | Tubulin beta-2 chain | 1 | 35
O15350 | 17.45 | DP00319 | Tumor protein p73 | 1 | 1283
P41851 | 17.58 | DP00725 | Type II secretion system protein M | 1 | 130
P04177 | 14.66 | DP00094 | Tyrosine 3-monooxygenase | 1 | 13435
Q5UPJ7 | 6.94 | DP00726 | Tyrosine-tRNA ligase | 1 | 427
P00952 | 7.4 | DP00095 | Tyrosyl-tRNA synthetase | 2 | 491
P09012 | 36.88 | x | U1A | 1 | 337
P26368 | 19.58 | x | U2AF65 | 1 | 180
P0ABJ1 | 13.97 | DP00089 | Ubiquinol oxidase subunit 2 | 4 | 24
P0ABI8 | 4.98 | DP00088 | Ubiquinol oxidase subunit 1 | 2 | 38
Q16222, Q16222-2 | 7.09, 6.73 | DP00363, DP00363_A002 | UDP-N-acetylhexosamine pyrophosphorylase | 1 | 4
O75385 | 26.38 | x | ULK1 | 1 | 276
P83949 | 19.54 | DP00623 | Ultrabithorax AND (N-terminal OR N-terminus OR C-terminal OR C-terminus) | 3 | 29
Q9GSG8 | 16.53 | x | ultraspiracle AND "aedes aegypti" | 1 | 20
P38293 | 31.08 | x | Ump1 | 2 | 37


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
P0AG11 | 6.47 | DP00626, DP00626_C001 | UmuD | 1 | 289
P60472 | 9.88 | DP00516 | Undecaprenyl pyrophosphate synthetase | 1 | 59
Q9HAU5 | 34.12 | x | UPF2 | 1 | 137
P23202 | 21.75 | DP00353 | URE2 | 3 | 265
Q9RP19 | 8.53 | DP00367 | (UreG OR urease accessory protein) | 7 | 154
P06132 | 10.9 | DP00308 | Uroporphyrinogen decarboxylase | 2 | 613
P54725 | 39.12 | DP00156 | UV excision repair protein RAD23 homolog A | 1 | 8
Q8QXJ6 | 22.02 | x | vanilla mosaic virus coat protein | 1 | 5
B4F3C5 | 7.93 | x | VapD | 1 | 29
P12003-1 | 28.24 | DP00556 | Vinculin | 1 | 3131
Q9MA75 | 45.75 | x | VIP1 | 1 | 154
Q98157 | 9.57 | DP00685 | Viral macrophage inflammatory protein 2 | 1 | 117
P11473 | 17.8 | DP00184 | ("Vitamin D3 receptor") | 1 | 200
P07293 | 5.55 | DP00228 | Voltage-dependent L-type calcium channel subunit alpha-1S | 1 | 9
Q14722-2 | 7.48 | DP00090 | Voltage-gated potassium channel subunit beta-1 | 1 | 57
P40337 | 34.74 | DP00287 | Von Hippel-Lindau disease tumor suppressor | 1 | 1064
Q05323 | 18.4 | DP00627 | VP30 | 1 | 72
P51610 | 25.06 | x | VP16 AND transcription factor activator | 1 | 439
P03305 | 11.41 | DP00573_C004 | VP2 foot-and-mouth disease virus | 1 | 123
P03305 | 11.41 | DP00573_C006 | VP1 foot-and-mouth disease virus | 1 | 888
P03305 | 11.41 | DP00573_C005 | VP3 foot-and-mouth disease virus | 1 | 140
Q85197 | 4.81 | x | VPg AND “potato virus” | 2 | 57
P03305 | 11.41 | DP00573_C003 | VP4 protein | 1 | 1453
Q730H9 | 10.58 | x | Vpr AND "Bacillus cereus" | 1 | 3
P03520 | 31.32 | x | (VSVP OR vesicular stomatitis virus phosphoprotein) | 1 | 327
Q9NZQ3 | 21.05 | x | (WASP interacting protein OR WIP(C)) | 2 | 87
P42768 | 44.42 | DP00215 | (WASP OR Wiskott-Aldrich syndrome) AND protein AND (C-terminus or C-terminal) | 6 | 220
O15213 | 32.13 | x | WDR46 | 1 | 4
P30291 | 33.59 | DP00611 | Wee1 | 1 | 772
Q14191 | 15.64 | DP00443 | Werner syndrome ATP-dependent helicase | 1 | 16


Table C1 (Continued)
UniProt | RAPID score(s) | Disprot | Search Term | num IDP PMIDs | num all PMIDs
P06935 | 1.75 | DP00673, DP00673_C001 | west nile virus AND (capsid protein C OR polyprotein) | 1 | 100
x | x | x | WSK3 | 1 | 1
P80457 | 5.71 | DP00450 | Xanthine AND (dehydrogenase OR oxidase) | 1 | 13452
Q23229 | 23.02 | DP00361 | XO lethal protein 1 | 1 | 10
P27088 | 37.83 | DP00091 | XPA AND DNA | 1 | 858
P0A8H9 | 38.46 | DP00202 | yacG | 1 | 6
Q6PKI6 | 59.02 | x | (YB-1 OR “Y Box binding protein 1” OR YBX1) | 2 | 915
P23292 | 28.02 | x | Yck2 kinase | 1 | 32
P69346 | 28.92 | x | YefM | 3 | 27
Q7CJN0 | 3.85 | DP00275 | Yersinia crystallin | 2 | 3
P31493 | 26.48 | x | YopE | 1 | 250
P46937 | 51.59 | DP00702 | (Yorkie homolog OR YAP1) | 2 | 672
P20963 | 38.41 | x | (zetacyt OR T-cell surface glycoprotein CD3 zeta chain) | 1 | 709
O95405 | 17.19 | DP00141 | Zinc finger FYVE domain containing protein 9 | 1 | 1
O00488 | 58.96 | DP00549 | Zinc finger protein 593 | 1 | 15
Q9H2S9 | 45.47 | DP00376 | Zinc finger protein Eos | 1 | 19
P77173 | 42.07 | DP00161 | ZipA | 2 | 110


Appendix D: Intrinsically Disordered Enzymes

Table D1 Intrinsically disordered enzymes

Name | PMIDs
2,4-dienoyl-CoA reductase | 15531764
5-Aminolevulinate synthase (ALAS) | 25240868
ABL tyrosine kinase | 17211892
acetylcholinesterase variant, AChE-R | 20173328, 23908786
acylphosphatase | 24893801, 20223823, 18832052, 18451804, 16287076, 14872538, 9790846
adenylate cyclase toxin | 19860484, 20096704, 21416544, 24145447
alphavirus capsid protease | 25100849, 9094737
aminoglycoside 6'-N-acetyltransferase type Ii | 16131761
anhydrin | 20805515
apo-DCpS | 15769464
ARG tyrosine kinase | 17211892
arginine kinase | 21075117
bacillus lipase | 25001212
bacterial luciferase mobile loop | 21156144
BCR-ABL tyrosine kinase | 22632137
calcineurin, CaN | 22100452
cathepsin F | 23684953
CCT, phosphocholine cytidylyltransferase | 21303909, 24397368, 23238251, 22988242
Cel7a, Trichoderma reesei family 7 cellobiohydrolase | 21112302, 23959893
colicin E9 | 15004032, 16114886, 16166265, 17375930, 18573254, 19021565, 22310049, 23672584, 23812713
c-Src kinase | 19520085, 25071818, 23744817
Dbp5p, DEAD-box protein 5 | 19281819, 24045937, 21884706
dihydroorotase, DHO | 19128030
Dnmt1 | 25533200, 20352123, 19923434
E2 enzymes sub-family 3R | 22507829
E2-C | 10350465
EcoO109I | 19348764


Table D1 (Continued)
Name | PMIDs
epidermal growth factor receptor, EGFR, EGFR kinase domain | 15840573, 22579287
ErbB2 receptor tyrosine kinase | 24815698
firefly luciferase | 19119851, 19492113, 20221465
FtsZ | 25305578, 23714328, 23692518, 23692518
glucokinase | 23271955
glycolate oxidase | 8706682, 18215067
hammerhead | 10802069, 21740954
HDAC4 | 18614528
Hef | 24947516
hepatitis B polymerase | 23202419
HIV-1 protease | 10739910, 15572155, 17243183
hsp90 atpase | 22660624
inteins | 16288917, 24236406
lombricine kinase, phosphagen kinase family | 15327979, 21212263, 20121101, 21212263
Lysyl oxidase | 20192271
Metallo beta lactamase | 19395380
mGIP/SCP1 | 24751520, 24925644
multidomain polymerase protein | 25297996
myosin II heavy chain kinase B | 20199682
NEIL1 | 22902625, 23542007
Nickel Superoxide dismutase | 25580509
NS3, non structural protein 3 from hepatitis C virus | 23803659, 21112306, 9223519, 24752801
Ntrc1 | 16169010
nuclease colicin | 15004032
nudix hydrolase | 20657662
ornithine decarboxylase | 10623504, 23684952
p1 protease | 24603811
p300 acetyltransferase | 17438265, 23133622, 23307074, 24253305
PAM, peptidylglycine alpha-amidating monooxygenase | 19635792
PfTIM, triosephosphate isomerase | 19914198, 15465054
Phototropin 2 | 21222437
PKS, modular polyketide synthase | 22282160, 9166770
polyphenol oxidase | 16332393
protein kinase A | 17222345, 23946424, 24192038, 25112875
protein kinase C alpha V5 domain | 23762412
protein kinase R, PKR | 19232355


Table D1 (Continued)
Name | PMIDs
PTP1B/PTPN1 | 17643420, 24845231
RelA/SPoT | 24717772
retinal phosphodiesterase gamma subunit | 12643535, 18230733, 19075750, 21393250, 21978030, 22514270
Ribonuclease A | 9689069
Ribonuclease P | 11258888, 11749217, 20476778
RNA polymerase | 25261014
Rnase E | 12947103, 15236960, 16094605, 16516921, 17447862, 19215771, 20952404, 25432321
Rnase P | 11258888, 20476778, 15518563
Rnase Y | 21803996
San1, PQC ubiquitin ligase | 21211726, 23363599, 21551067, 21941105
selenoprotein k | 22963794
selenoprotein methionine sulfoxide reductase B1, MsrB1 | 20605785
selenoprotein s, VIMP reductase | 23566202, 23914919, 22700979
sgs1 helicase | 24038467
sirt1, sirtuin family | 23497088, 23811471, 24020004
sortase | 22468560
src family kinase | 25071818
SRPK1 | 21600902
Sso Acp | 24893801
sulfhydryl oxidase ALR | 23207295
TAFI | 18722183
Tbsp1 | 23192346
TgGCN5 family histone acetyltransferases | 21055425
thrombin | 21782041
thymidylate synthase | 16259621, 14967037, 19797058, 20815815, 21878626, 23181752, 23684952
TPPP/p25 | 21995432
Type IA topoisomerases | 18186484
Ube2w E2 | 25436519
Upb10 | 26149687
UreG | 15542602, 17309280
Vpr, feather degrading minor extracellular protease | 19383694
xylanases | 25576604
yck2 kinase | 21653825


Appendix E: Copyright Permissions

Copyright permission for: DeForte S, Uversky VN: Intrinsically disordered proteins in PubMed: what can the tip of the iceberg tell us about what lies below? RSC Advances 2016, 6(14):11513-11521.

About our license to publish

In order to publish material the Royal Society of Chemistry must acquire the necessary legal rights from the author(s) of that material. In general, we must obtain from the original author(s) the right to publish the material in all formats, in all media (including specifically print and electronic), with the right to sublicense those rights. For all articles published in our journals, we require the author to accept a 'licence to publish'. This licence is normally requested on submission of the article. By signing this licence the author (who is either the copyright owner or who is authorised to sign on behalf of the copyright owner, for example his/her employer) grants to the Royal Society of Chemistry the exclusive right and licence throughout the world to edit, adapt, translate, reproduce and publish the manuscript in all formats, in all media and by all means (whether now existing or in future devised). The Royal Society of Chemistry thus acquires an exclusive licence to publish and all practical rights to the manuscript, except the copyright. The copyright of the manuscript remains with the copyright owner. The copyright owner also retains certain rights regarding the sharing and deposition of their article and the re-use of the published material. For short items in journals (news items, etc) we take a non-exclusive licence in the form of a brief 'terms and conditions for acceptance' document.

Assurances

In the licence to publish, the author provides the assurances that we need to publish the material, including assurances that the work is original to the author, that the work has not been published already and that permissions have been obtained if previously published material has been included. If the manuscript includes material that belongs to someone else (for example, a figure or diagram), we require the author to obtain all permissions that may be needed from third parties. If you wish to reuse material that was not published originally by the Royal Society of Chemistry please see Re-use permission requests.

Rights retained by authors

When the author accepts the exclusive licence to publish for a journal article, he/she retains certain rights that may be exercised without reference to the Royal Society of Chemistry.

Reproduce/republish portions of the article (including the abstract).

Photocopy the article and distribute such photocopies and distribute copies of the PDF of the article for personal or professional use only (the Royal Society of Chemistry makes this PDF available to the corresponding author of the article upon publication. Any such copies should not be offered for sale. Persons who receive or access the PDF mentioned above must be notified that this may not be made available further or distributed.).

Adapt the article and reproduce adaptations of the article for any purpose other than the commercial exploitation of a work similar to the original.

Reproduce, perform, transmit and otherwise communicate the article to the public in spoken presentations (including those that are accompanied by visual material such as slides, overheads and computer projections).

The author(s) must submit a written request to the Royal Society of Chemistry for any use other than those specified above. All cases of republication/reproduction must be accompanied by an acknowledgement of first publication of the work by the Royal Society of Chemistry, the wording of which depends on the journal in which the article was published originally. The acknowledgement should also include a hyperlink to the article on the Royal Society of Chemistry website. The author also has some rights concerning the deposition of the whole article.

Deposition and sharing rights

The following details apply only to authors accepting the standard licence to publish. Authors who have accepted one of the open access licences to publish, or are thinking of doing so, should refer to the details for open access deposition rights. Deposition of any form of the article is allowed in non-commercial repositories only.

Deposition by the Royal Society of Chemistry

The Royal Society of Chemistry shall deposit the accepted version of the submitted article in non-commercial repository(ies) as deemed appropriate, including but not limited to the funding agency repository(ies) of the author(s). There shall be an embargo to making the above deposited material available to the public that will be 12 months from the date of acceptance.

Deposition by the author(s)

When the author accepts the licence to publish for a journal article, he/she retains certain rights concerning the deposition of the whole article.

Deposit the accepted version of the submitted article in their institutional repository(ies). There shall be an embargo of 12 months from the date of acceptance, after which time the article will be made available to the public. There shall be a link from this article to the PDF of the version of record on the Royal Society of Chemistry's website, once this final version is available.

Make available the accepted version of the submitted article via the personal website(s) of the author(s) or via the intranet(s) of the organisation(s) where the author(s) work(s); no embargo period applies.

Make available the PDF of the version of record via the personal website(s) of the author(s) or via the intranet(s) of the organisation(s) where the author(s) work(s). No embargo period applies.

Republish the PDF of the version of record in theses of the author(s) in printed form and may make available this PDF in the theses of author(s) via any website(s) that the university(ies) of the author(s) may have for the deposition of theses. No embargo period applies.

Re-use permission requests

Material published by the Royal Society of Chemistry and other publishers is subject to all applicable copyright, database protection, and other rights. Therefore, for any publication, whether printed or electronic, permission must be obtained to use material for which the author(s) does not already own the copyright. This material may be, for example, a figure, diagram, table, photo or some other image.

Author reusing their own work published by the Royal Society of Chemistry

You do not need to request permission to reuse your own figures, diagrams, etc, that were originally published in a Royal Society of Chemistry publication. However, permission should be requested for use of the whole article or chapter except if reusing it in a thesis. If you are including an article or book chapter published by us in your thesis please ensure that your co-authors are aware of this. Reuse of material that was published originally by the Royal Society of Chemistry must be accompanied by the appropriate acknowledgement of the publication. The form of the acknowledgement is dependent on the journal in which it was published originally, as detailed in 'Acknowledgements'.

Material published by the Royal Society of Chemistry to be used in another of our publications

Authors contributing to our publications (journal articles, book or book chapters) do not need to formally request permission to reproduce material contained in another Royal Society of Chemistry publication. However, permission should be requested for use of a whole article or chapter. For all cases of reproduction the correct acknowledgement of the reproduced material should be given. The form of the acknowledgement is dependent on the journal in which it was published originally, as detailed in the 'Acknowledgements' section.

Using third party material in Royal Society of Chemistry publications

We must ensure that the material we publish does not infringe the copyright of others. We require the author(s) to obtain, at the earliest opportunity, the relevant permissions that might be needed from third parties to include material that belongs to someone else.

Please contact the publisher/copyright owner of the third party material to check how they wish to receive permission requests. Please plan to submit your request well ahead of publication of your material. The most common procedures for permission requests are outlined below.

A number of publishers have opted out of receiving express permissions as long as they fall under the rules of the STM Permission Guidelines.

If they do not fall into the category above, the majority of publishers now use RightsLink from the Copyright Clearance Center (CCC) to process their requests.

Other publishers have their own permission request forms and/or specify what information they need to process any permission request.

If the publisher/copyright owner does not have a specific procedure please complete and submit the ‘permission request form for non-RSC material’ form. Send the form to the permission administrator or editor of the relevant publication.

If the copyright owner has opted to publish under a Creative Commons licence, licensees are required to obtain permission to do any of the things with a work that the law reserves exclusively to a licensor and that the licence does not expressly allow. Licensees must credit the licensor, keep copyright notices intact on all copies of the work, and link to the license from copies of the work.

In all cases the following rights need to be obtained. Permission is required to include the specified material in the work described and in all subsequent editions of the work to be published by the Royal Society of Chemistry for distribution throughout the world, in all media including electronic and microfilm and to use the material in conjunction with computer-based electronic and information retrieval systems, to grant permissions for photocopying, reproductions and reprints, to translate the material and to publish the translation, and to authorise document delivery and abstracting and indexing services.


Please note that the Royal Society of Chemistry is also a signatory to the STM Permission Guidelines.

Using material published by the Royal Society of Chemistry in material for another publisher

If you require permission to use material from one of our publications or website in a publication not owned by us, and you are not the author of our publication, the following procedures should be followed. Before sending in any request you should check that the material you wish to reproduce is not credited to a source other than the Royal Society of Chemistry. The credit for an image will be given in the caption of the image or sometimes in the list of references. Please plan to submit your request well ahead of publication of your material. Please note that we are unable to supply artwork for the material you may wish to reproduce.

Reproducing material from a Royal Society of Chemistry journal

To request permission to reproduce material from a Royal Society of Chemistry journal please use RightsLink.

Reproducing material from other Royal Society of Chemistry publications

If you are reproducing material from a Royal Society of Chemistry book, education or science policy publication or a Royal Society of Chemistry website you must complete and submit the online permission request form (or the PDF version of the permission request form for Royal Society of Chemistry material). Requests are usually for use of a figure or diagram, but they may also be for use of the entire article or chapter. Requests to use individual figures or diagrams are invariably granted. Permission for another publisher to print an entire Royal Society of Chemistry article or chapter may be granted in special circumstances. The permission form should only be used to request permission to reproduce material from Royal Society of Chemistry books, Chemistry World, Education in Chemistry, and other non-journal publications of the Royal Society of Chemistry. For these requests please complete and send the form to our publishing services team.


Copyright permission for: Sotomayor-Pérez AC, Ladant D, Chenal A: Disorder-to-order transition in the CyaA toxin RTX domain: implications for toxin secretion. Toxins (Basel) 2015, 7(1):1-20.

and

Larion M, Salinas RK, Bruschweiler-Li L, Miller BG, Brüschweiler R: Order-disorder transitions govern kinetic cooperativity and allostery of monomeric human glucokinase. PLoS Biol 2012, 10(12):e1001452.

and

Aït-Bara S, Carpousis AJ, Quentin Y: RNase E in the γ-Proteobacteria: conservation of intrinsically disordered noncatalytic region and molecular evolution of microdomains. Mol Genet Genomics 2014.

Creative Commons Corporation (“Creative Commons”) is not a law firm and does not provide legal services or legal advice. Distribution of Creative Commons public licenses does not create a lawyer-client or other relationship. Creative Commons makes its licenses and related information available on an “as-is” basis. Creative Commons gives no warranties regarding its licenses, any material licensed under their terms and conditions, or any related information. Creative Commons disclaims all liability for damages resulting from their use to the fullest extent possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and conditions that creators and other rights holders may use to share original works of authorship and other material subject to copyright and certain other rights specified in the public license below. The following considerations are for informational purposes only, are not exhaustive, and do not form part of our licenses.

Considerations for licensors: Our public licenses are intended for use by those authorized to give the public permission to use material in ways otherwise restricted by copyright and certain other rights. Our licenses are irrevocable. Licensors should read and understand the terms and conditions of the license they choose before applying it. Licensors should also secure all rights necessary before applying our licenses so that the public can reuse the material as expected. Licensors should clearly mark any material not subject to the license. This includes other CC-licensed material, or material used under an exception or limitation to copyright. More considerations for licensors.

Considerations for the public: By using one of our public licenses, a licensor grants the public permission to use the licensed material under specified terms and conditions. If the licensor’s permission is not necessary for any reason–for example, because of any applicable exception or limitation to copyright–then that use is not regulated by the license. Our licenses grant only permissions under copyright and certain other rights that a licensor has authority to grant. Use of the licensed material may still be restricted for other reasons, including because others have copyright or other rights in the material. A licensor may make special requests, such as asking that all changes be marked or described. Although not required by our licenses, you are encouraged to respect those requests where reasonable. More considerations for the public.

Creative Commons Attribution 4.0 International Public License

By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.

Section 1 – Definitions.

a. Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
c. Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
d. Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
e. Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
f. Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
g. Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
h. Licensor means the individual(s) or entity(ies) granting rights under this Public License.
i. Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
j. Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
k. You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.

Section 2 – Scope.

a. License grant.
   1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:
      A. reproduce and Share the Licensed Material, in whole or in part; and
      B. produce, reproduce, and Share Adapted Material.
   2. Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.
   3. Term. The term of this Public License is specified in Section 6(a).
   4. Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material.
   5. Downstream recipients.
      A. Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
      B. No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
   6. No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i).
b. Other rights.
   1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.
   2. Patent and trademark rights are not licensed under this Public License.
   3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties.

Section 3 – License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the following conditions.

a. Attribution.
   1. If You Share the Licensed Material (including in modified form), You must:
      A. retain the following if it is supplied by the Licensor with the Licensed Material:
         i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
         ii. a copyright notice;
         iii. a notice that refers to this Public License;
         iv. a notice that refers to the disclaimer of warranties;
         v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
      B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
      C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
   2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.
   3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable.
   4. If You Share Adapted Material You produce, the Adapter's License You apply must not prevent recipients of the Adapted Material from complying with this Public License.

Section 4 – Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:

a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database;
b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material; and
c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.

Section 5 – Disclaimer of Warranties and Limitation of Liability.

a. Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.
b. To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.
c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.

Section 6 – Term and Termination.

a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
   1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or
   2. upon express reinstatement by the Licensor.
   For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License.

Section 7 – Other Terms and Conditions.

a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.

Section 8 – Interpretation.

a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.

Creative Commons is not a party to its public licenses. Notwithstanding, Creative Commons may elect to apply one of its public licenses to material it publishes and in those instances will be considered the “Licensor.” The text of the Creative Commons public licenses is dedicated to the public domain under the CC0 Public Domain Dedication. Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at creativecommons.org/policies, Creative Commons does not authorize the use of the trademark “Creative Commons” or any other trademark or logo of Creative Commons without its prior written consent including, without limitation, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses.

Copyright permission for: Kardani J, Roy I: Understanding Caffeine's Role in Attenuating the Toxicity of α-Synuclein Aggregates: Implications for Risk of Parkinson's Disease. ACS Chem Neurosci 2015, 6(9):1613-1625.

Title: Understanding Caffeine’s Role in Attenuating the Toxicity of α-Synuclein Aggregates: Implications for Risk of Parkinson’s Disease
Author: Jay Kardani, Ipsita Roy
Publication: ACS Chemical Neuroscience
Publisher: American Chemical Society
Date: Sep 1, 2015
Copyright © 2015, American Chemical Society

PERMISSION/LICENSE IS GRANTED FOR YOUR ORDER AT NO CHARGE This type of permission/license, instead of the standard Terms & Conditions, is sent to you because no fee is being charged for your order. Please note the following:

• Permission is granted for your request in both print and electronic formats, and translations.
• If figures and/or tables were requested, they may be adapted or used in part.
• Please print this page for your records and send a copy of it to your publisher/graduate school.
• Appropriate credit for the requested material should be given as follows: "Reprinted (adapted) with permission from (COMPLETE REFERENCE CITATION). Copyright (YEAR) American Chemical Society." Insert appropriate information in place of the capitalized words.
• One-time permission is granted only for the use specified in your request. No additional uses are granted (such as derivative works or other editions). For any other uses, please submit a new request.

If credit is given to another source for the material you requested, permission must be obtained from that source.

Copyright permission for: Krishnan N, Koveal D, Miller DH, Xue B, Akshinthala SD, Kragelj J, Jensen MR, Gauss CM, Page R, Blackledge M et al: Targeting the disordered C terminus of PTP1B with an allosteric inhibitor. Nat Chem Biol 2014, 10(7):558-566.

NATURE PUBLISHING GROUP LICENSE TERMS AND CONDITIONS Apr 11, 2016

This is a License Agreement between Shelly M DeForte ("You") and Nature Publishing Group ("Nature Publishing Group") provided by Copyright Clearance Center ("CCC"). The license consists of your order details, the terms and conditions provided by Nature Publishing Group, and the payment terms and conditions.

All payments must be made in full to CCC. For payment instructions, please see information listed at the bottom of this form.

License Number: 3838230488626
License date: Mar 29, 2016
Licensed content publisher: Nature Publishing Group
Licensed content publication: Nature Chemical Biology
Licensed content title: Targeting the disordered C terminus of PTP1B with an allosteric inhibitor
Licensed content author: Navasona Krishnan, Dorothy Koveal, Daniel H Miller, Bin Xue, Sai Dipikaa Akshinthala, Jaka Kragelj
Licensed content date: May 20, 2014
Volume number: 10
Issue number: 7
Type of Use: reuse in a dissertation / thesis
Requestor type: academic/educational
Format: print and electronic
Portion: figures/tables/illustrations
Number of figures/tables/illustrations: 1
High-res required: no
Figures: Figure 3c
Author of this NPG article: no
Your reference number: None
Title of your thesis / dissertation: Intrinsic Disorder Where You Least Expect It: The Incidence and Functional Relevance of Intrinsic Disorder in Enzymes and the Protein Data Bank
Expected completion date: Jun 2016
Estimated size (number of pages): 100
Total: 0.00 USD

Terms and Conditions

Terms and Conditions for Permissions

Nature Publishing Group hereby grants you a non-exclusive license to reproduce this material for this purpose, and for no other use, subject to the conditions below:

1. NPG warrants that it has, to the best of its knowledge, the rights to license reuse of this material. However, you should ensure that the material you are requesting is original to Nature Publishing Group and does not carry the copyright of another entity (as credited in the published version). If the credit line on any part of the material you have requested indicates that it was reprinted or adapted by NPG with permission from another source, then you should also seek permission from that source to reuse the material.

2. Permission granted free of charge for material in print is also usually granted for any electronic version of that work, provided that the material is incidental to the work as a whole and that the electronic version is essentially equivalent to, or substitutes for, the print version. Where print permission has been granted for a fee, separate permission must be obtained for any additional, electronic re-use (unless, as in the case of a full paper, this has already been accounted for during your initial request in the calculation of a print run). NB: In all cases, web-based use of full-text articles must be authorized separately through the 'Use on a Web Site' option when requesting permission.

3. Permission granted for a first edition does not apply to second and subsequent editions and for editions in other languages (except for signatories to the STM Permissions Guidelines, or where the first edition permission was granted for free).

4. Nature Publishing Group's permission must be acknowledged next to the figure, table or abstract in print. In electronic form, this acknowledgement must be visible at the same time as the figure/table/abstract, and must be hyperlinked to the journal's homepage.

5. The credit line should read:

Reprinted by permission from Macmillan Publishers Ltd: [JOURNAL NAME] (reference citation), copyright (year of publication)

For AOP papers, the credit line should read:

Reprinted by permission from Macmillan Publishers Ltd: [JOURNAL NAME], advance online publication, day month year (doi: 10.1038/sj.[JOURNAL ACRONYM].XXXXX)

Note: For republication from the British Journal of Cancer, the following credit lines apply.

Reprinted by permission from Macmillan Publishers Ltd on behalf of Cancer Research UK: [JOURNAL NAME] (reference citation), copyright (year of publication)

For AOP papers, the credit line should read:

Reprinted by permission from Macmillan Publishers Ltd on behalf of Cancer Research UK: [JOURNAL NAME], advance online publication, day month year (doi: 10.1038/sj.[JOURNAL ACRONYM].XXXXX)

6. Adaptations of single figures do not require NPG approval. However, the adaptation should be credited as follows:

Adapted by permission from Macmillan Publishers Ltd: [JOURNAL NAME] (reference citation), copyright (year of publication)

Note: For adaptation from the British Journal of Cancer, the following credit line applies. Adapted by permission from Macmillan Publishers Ltd on behalf of Cancer Research UK: [JOURNAL NAME] (reference citation), copyright (year of publication)

7. Translations of 401 words up to a whole article require NPG approval. Please visit http://www.macmillanmedicalcommunications.com for more information. Translations of up to 400 words do not require NPG approval. The translation should be credited as follows:

Translated by permission from Macmillan Publishers Ltd: [JOURNAL NAME] (reference citation), copyright (year of publication).

Note: For translation from the British Journal of Cancer, the following credit line applies. Translated by permission from Macmillan Publishers Ltd on behalf of Cancer Research UK: [JOURNAL NAME] (reference citation), copyright (year of publication)

Copyright permission for: DeForte S, Uversky VN: Resolving the ambiguity: Making sense of intrinsic disorder when PDB structures disagree. Protein Sci 2016, 25(3):676-688.

JOHN WILEY AND SONS LICENSE TERMS AND CONDITIONS Apr 11, 2016

This Agreement between Shelly M DeForte ("You") and John Wiley and Sons ("John Wiley and Sons") consists of your license details and the terms and conditions provided by John Wiley and Sons and Copyright Clearance Center.

License Number: 3834750065650
License date: Mar 23, 2016
Licensed Content Publisher: John Wiley and Sons
Licensed Content Publication: Protein Science
Licensed Content Title: Resolving the ambiguity: Making sense of intrinsic disorder when PDB structures disagree
Licensed Content Author: Shelly DeForte, Vladimir N. Uversky
Licensed Content Date: Jan 9, 2016
Pages: 13
Type of use: Dissertation/Thesis
Requestor type: Author of this Wiley article
Format: Print and electronic
Portion: Full article
Will you be translating?: No
Title of your thesis / dissertation: Intrinsic Disorder Where You Least Expect It: The Incidence and Functional Relevance of Intrinsic Disorder in Enzymes and the Protein Data Bank
Expected completion date: Jun 2016
Expected size (number of pages): 100
Requestor Location: Shelly M DeForte, 6007 N. Ithmar Ave, TAMPA, FL 33604, United States, Attn: Shelly M DeForte
Billing Type: Invoice
Billing Address: Shelly M DeForte, 6007 N. Ithmar Ave, TAMPA, FL 33604, United States, Attn: Shelly M DeForte
Total: 0.00 USD

Terms and Conditions

TERMS AND CONDITIONS

This copyrighted material is owned by or exclusively licensed to John Wiley & Sons, Inc. or one of its group companies (each a "Wiley Company") or handled on behalf of a society with which a Wiley Company has exclusive publishing rights in relation to a particular work (collectively "WILEY"). By clicking "accept" in connection with completing this licensing transaction, you agree that the following terms and conditions apply to this transaction (along with the billing and payment terms and conditions established by the Copyright Clearance Center Inc., ("CCC's Billing and Payment terms and conditions"), at the time that you opened your RightsLink account (these are available at any time at http://myaccount.copyright.com).

Terms and Conditions

• The materials you have requested permission to reproduce or reuse (the "Wiley Materials") are protected by copyright.

• You are hereby granted a personal, non-exclusive, non-sublicensable (on a stand-alone basis), non-transferable, worldwide, limited license to reproduce the Wiley Materials for the purpose specified in the licensing process. This license, and any CONTENT (PDF or image file) purchased as part of your order, is for a one-time use only and limited to any maximum distribution number specified in the license. The first instance of republication or reuse granted by this license must be completed within two years of the date of the grant of this license (although copies prepared before the end date may be distributed thereafter). The Wiley Materials shall not be used in any other manner or for any other purpose, beyond what is granted in the license. Permission is granted subject to an appropriate acknowledgement given to the author, title of the material/book/journal and the publisher. You shall also duplicate the copyright notice that appears in the Wiley publication in your use of the Wiley Material. Permission is also granted on the understanding that nowhere in the text is a previously published source acknowledged for all or part of this Wiley Material. Any third party content is expressly excluded from this permission.

• With respect to the Wiley Materials, all rights are reserved. Except as expressly granted by the terms of the license, no part of the Wiley Materials may be copied, modified, adapted (except for minor reformatting required by the new Publication), translated, reproduced, transferred or distributed, in any form or by any means, and no derivative works may be made based on the Wiley Materials without the prior permission of the respective copyright owner. For STM Signatory Publishers clearing permission under the terms of the STM Permissions Guidelines only, the terms of the license are extended to include subsequent editions and for editions in other languages, provided such editions are for the work as a whole in situ and does not involve the separate exploitation of the permitted figures or extracts. You may not alter, remove or suppress in any manner any copyright, trademark or other notices displayed by the Wiley Materials. You may not license, rent, sell, loan, lease, pledge, offer as security, transfer or assign the Wiley Materials on a stand-alone basis, or any of the rights granted to you hereunder to any other person.

• The Wiley Materials and all of the intellectual property rights therein shall at all times remain the exclusive property of John Wiley & Sons Inc, the Wiley Companies, or their respective licensors, and your interest therein is only that of having possession of and the right to reproduce the Wiley Materials pursuant to Section 2 herein during the continuance of this Agreement. You agree that you own no right, title or interest in or to the Wiley Materials or any of the intellectual property rights therein. You shall have no rights hereunder other than the license as provided for above in Section 2. No right, license or interest to any trademark, trade name, service mark or other branding ("Marks") of WILEY or its licensors is granted hereunder, and you agree that you shall not assert any such right, license or interest with respect thereto.

• NEITHER WILEY NOR ITS LICENSORS MAKES ANY WARRANTY OR REPRESENTATION OF ANY KIND TO YOU OR ANY THIRD PARTY, EXPRESS, IMPLIED OR STATUTORY, WITH RESPECT TO THE MATERIALS OR THE ACCURACY OF ANY INFORMATION CONTAINED IN THE MATERIALS, INCLUDING, WITHOUT LIMITATION, ANY IMPLIED WARRANTY OF MERCHANTABILITY, ACCURACY, SATISFACTORY QUALITY, FITNESS FOR A PARTICULAR PURPOSE, USABILITY, INTEGRATION OR NON-INFRINGEMENT AND ALL SUCH WARRANTIES ARE HEREBY EXCLUDED BY WILEY AND ITS LICENSORS AND WAIVED BY YOU.

• WILEY shall have the right to terminate this Agreement immediately upon breach of this Agreement by you.

• You shall indemnify, defend and hold harmless WILEY, its Licensors and their respective directors, officers, agents and employees, from and against any actual or threatened claims, demands, causes of action or proceedings arising from any breach of this Agreement by you.

• IN NO EVENT SHALL WILEY OR ITS LICENSORS BE LIABLE TO YOU OR ANY OTHER PARTY OR ANY OTHER PERSON OR ENTITY FOR ANY SPECIAL, CONSEQUENTIAL, INCIDENTAL, INDIRECT, EXEMPLARY OR PUNITIVE DAMAGES, HOWEVER CAUSED, ARISING OUT OF OR IN CONNECTION WITH THE DOWNLOADING, PROVISIONING, VIEWING OR USE OF THE MATERIALS REGARDLESS OF THE FORM OF ACTION, WHETHER FOR BREACH OF CONTRACT, BREACH OF WARRANTY, TORT, NEGLIGENCE, INFRINGEMENT OR OTHERWISE (INCLUDING, WITHOUT LIMITATION, DAMAGES BASED ON LOSS OF PROFITS, DATA, FILES, USE, BUSINESS OPPORTUNITY OR CLAIMS OF THIRD PARTIES), AND WHETHER OR NOT THE PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THIS LIMITATION SHALL APPLY NOTWITHSTANDING ANY FAILURE OF ESSENTIAL PURPOSE OF ANY LIMITED REMEDY PROVIDED HEREIN.

• Should any provision of this Agreement be held by a court of competent jurisdiction to be illegal, invalid, or unenforceable, that provision shall be deemed amended to achieve as nearly as possible the same economic effect as the original provision, and the legality, validity and enforceability of the remaining provisions of this Agreement shall not be affected or impaired thereby.

• The failure of either party to enforce any term or condition of this Agreement shall not constitute a waiver of either party's right to enforce each and every term and condition of this Agreement. No breach under this agreement shall be deemed waived or excused by either party unless such waiver or consent is in writing signed by the party granting such waiver or consent. The waiver by or consent of a party to a breach of any provision of this Agreement shall not operate or be construed as a waiver of or consent to any other or subsequent breach by such other party.

• This Agreement may not be assigned (including by operation of law or otherwise) by you without WILEY's prior written consent.

 Any fee required for this permission shall be non-refundable after thirty (30) days from receipt by the CCC.

 These terms and conditions together with CCC's Billing and Payment terms and conditions (which are incorporated herein) form the entire agreement between you and WILEY concerning this licensing transaction and (in the absence of fraud) supersedes all prior agreements and representations of the parties, oral or written. This Agreement may not be amended except in writing signed by both parties. This Agreement shall be binding upon and inure to the benefit of the parties' successors, legal representatives, and authorized assigns.

 In the event of any conflict between your obligations established by these terms and conditions and those established by CCC's Billing and Payment terms and conditions, these terms and conditions shall prevail.


 WILEY expressly reserves all rights not specifically granted in the combination of (i) the license details provided by you and accepted in the course of this licensing transaction, (ii) these terms and conditions and (iii) CCC's Billing and Payment terms and conditions.

 This Agreement will be void if the Type of Use, Format, Circulation, or Requestor Type was misrepresented during the licensing process.

 This Agreement shall be governed by and construed in accordance with the laws of the State of New York, USA, without regards to such state's conflict of law rules. Any legal action, suit or proceeding arising out of or relating to these Terms and Conditions or the breach thereof shall be instituted in a court of competent jurisdiction in New York County in the State of New York in the United States of America and each party hereby consents and submits to the personal jurisdiction of such court, waives any objection to venue in such court and consents to service of process by registered or certified mail, return receipt requested, at the last known address of such party.

WILEY OPEN ACCESS TERMS AND CONDITIONS

Wiley Publishes Open Access Articles in fully Open Access Journals and in Subscription journals offering Online Open. Although most of the fully Open Access journals publish open access articles under the terms of the Creative Commons Attribution (CC BY) License only, the subscription journals and a few of the Open Access Journals offer a choice of Creative Commons Licenses. The license type is clearly identified on the article.

The Creative Commons Attribution License

The Creative Commons Attribution License (CC-BY) allows users to copy, distribute and transmit an article, adapt the article and make commercial use of the article. The CC-BY license permits commercial and non-

Creative Commons Attribution Non-Commercial License

The Creative Commons Attribution Non-Commercial (CC-BY-NC) License permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes. (see below)

Creative Commons Attribution-Non-Commercial-NoDerivs License

The Creative Commons Attribution Non-Commercial-NoDerivs License (CC-BY-NC-ND) permits use, distribution and reproduction in any medium, provided the original work is properly cited, is not used for commercial purposes and no modifications or adaptations are made. (see below)

Use by commercial "for-profit" organizations

Use of Wiley Open Access articles for commercial, promotional, or marketing purposes requires further explicit permission from Wiley and will be subject to a fee. Further details can be found on Wiley Online Library http://olabout.wiley.com/WileyCDA/Section/id-410895.html


Copyright permission for: Ye Q, Feng Y, Yin Y, Faucher F, Currie MA, Rahman MN, Jin J, Li S, Wei Q, Jia Z: Structural basis of calcineurin activation by calmodulin. Cell Signal 2013, 25(12):2661-2667.

ELSEVIER LICENSE TERMS AND CONDITIONS Apr 19, 2016

This is a License Agreement between Shelly M DeForte ("You") and Elsevier ("Elsevier") provided by Copyright Clearance Center ("CCC"). The license consists of your order details, the terms and conditions provided by Elsevier, and the payment terms and conditions.

All payments must be made in full to CCC. For payment instructions, please see information listed at the bottom of this form.

Supplier: Elsevier Limited, The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, UK
Registered Company Number: 1982084
Customer name: Shelly M DeForte
Customer address: 6007 N. Ithmar Ave, TAMPA, FL 33604
License number: 3852650424335
License date: Apr 19, 2016
Licensed content publisher: Elsevier
Licensed content publication: Cellular Signalling
Licensed content title: Structural basis of calcineurin activation by calmodulin
Licensed content author: Qilu Ye, Yedan Feng, Yanxia Yin, Frédérick Faucher, Mark A. Currie, Mona N. Rahman, Jin Jin, Shanze Li, Qun Wei, Zongchao Jia
Licensed content date: December 2013
Licensed content volume number: 25
Licensed content issue number: 12
Number of pages: 7
Start Page: 2661
End Page: 2667
Type of Use: reuse in a thesis/dissertation
Portion: figures/tables/illustrations
Number of figures/tables/illustrations: 1
Format: both print and electronic
Are you the author of this Elsevier article? No
Will you be translating? No
Original figure numbers: 5
Title of your thesis/dissertation: Intrinsic Disorder Where You Least Expect It: The Incidence and Functional Relevance of Intrinsic Disorder in Enzymes and the Protein Data Bank
Expected completion date: Jun 2016
Estimated size (number of pages): 100
Elsevier VAT number: GB 494 6272 12
Permissions price: 0.00 USD
VAT/Local Sales Tax: 0.00 USD / 0.00 GBP

Total: 0.00 USD

Terms and Conditions

INTRODUCTION

1. The publisher for this copyrighted material is Elsevier. By clicking "accept" in connection with completing this licensing transaction, you agree that the following terms and conditions apply to this transaction (along with the Billing and Payment terms and conditions established by Copyright Clearance Center, Inc. ("CCC"), at the time that you opened your Rightslink account and that are available at any time at http://myaccount.copyright.com).

GENERAL TERMS

2. Elsevier hereby grants you permission to reproduce the aforementioned material subject to the terms and conditions indicated.

3. Acknowledgement: If any part of the material to be used (for example, figures) has appeared in our publication with credit or acknowledgement to another source, permission must also be sought from that source. If such permission is not obtained then that material may not be included in your publication/copies. Suitable acknowledgement to the source must be made, either as a footnote or in a reference list at the end of your publication, as follows: "Reprinted from Publication title, Vol /edition number, Author(s), Title of article / title of chapter, Pages No., Copyright (Year), with permission from Elsevier [OR APPLICABLE SOCIETY COPYRIGHT OWNER]." Also Lancet special credit - "Reprinted from The Lancet, Vol. number, Author(s), Title of article, Pages No., Copyright (Year), with permission from Elsevier."

4. Reproduction of this material is confined to the purpose and/or media for which permission is hereby given.

5. Altering/Modifying Material: Not Permitted. However figures and illustrations may be altered/adapted minimally to serve your work. Any other abbreviations, additions, deletions and/or any other alterations shall be made only with prior written authorization of Elsevier Ltd. (Please contact Elsevier at [email protected])

6. If the permission fee for the requested use of our material is waived in this instance, please be advised that your future requests for Elsevier materials may attract a fee.

7. Reservation of Rights: Publisher reserves all rights not specifically granted in the combination of (i) the license details provided by you and accepted in the course of this licensing transaction, (ii) these terms and conditions and (iii) CCC's Billing and Payment terms and conditions.

8. License Contingent Upon Payment: While you may exercise the rights licensed immediately upon issuance of the license at the end of the licensing process for the transaction, provided that you have disclosed complete and accurate details of your proposed use, no license is finally effective unless and until full payment is received from you (either by publisher or by CCC) as provided in CCC's Billing and Payment terms and conditions. If full payment is not received on a timely basis, then any license preliminarily granted shall be deemed automatically revoked and shall be void as if never granted. Further, in the event that you breach any of these terms and conditions or any of CCC's Billing and Payment terms and conditions, the license is automatically revoked and shall be void as if never granted. Use of materials as described in a revoked license, as well as any use of the materials beyond the scope of an unrevoked license, may constitute copyright infringement and publisher reserves the right to take any and all action to protect its copyright in the materials.

9. Warranties: Publisher makes no representations or warranties with respect to the licensed material.

10. Indemnity: You hereby indemnify and agree to hold harmless publisher and CCC, and their respective officers, directors, employees and agents, from and against any and all claims arising out of your use of the licensed material other than as specifically authorized pursuant to this license.

11. No Transfer of License: This license is personal to you and may not be sublicensed, assigned, or transferred by you to any other person without publisher's written permission.

12. No Amendment Except in Writing: This license may not be amended except in a writing signed by both parties (or, in the case of publisher, by CCC on publisher's behalf).

13. Objection to Contrary Terms: Publisher hereby objects to any terms contained in any purchase order, acknowledgment, check endorsement or other writing prepared by you, which terms are inconsistent with these terms and conditions or CCC's Billing and Payment terms and conditions. These terms and conditions, together with CCC's Billing and Payment terms and conditions (which are incorporated herein), comprise the entire agreement between you and publisher (and CCC) concerning this licensing transaction. In the event of any conflict between your obligations established by these terms and conditions and those established by CCC's Billing and Payment terms and conditions, these terms and conditions shall control.

14. Revocation: Elsevier or Copyright Clearance Center may deny the permissions described in this License at their sole discretion, for any reason or no reason, with a full refund payable to you. Notice of such denial will be made using the contact information provided by you. Failure to receive such notice will not alter or invalidate the denial. In no event will Elsevier or Copyright Clearance Center be responsible or liable for any costs, expenses or damage incurred by you as a result of a denial of your permission request, other than a refund of the amount(s) paid by you to Elsevier and/or Copyright Clearance Center for denied permissions.

LIMITED LICENSE

The following terms and conditions apply only to specific license types:

15. Translation: This permission is granted for non-exclusive world English rights only unless your license was granted for translation rights. If you licensed translation rights you may only translate this content into the languages you requested. A professional translator must perform all translations and reproduce the content word for word preserving the integrity of the article.

16. Posting licensed content on any Website: The following terms and conditions apply as follows: Licensing material from an Elsevier journal: All content posted to the web site must maintain the copyright information line on the bottom of each image; A hyper-text must be included to the Homepage of the journal from which you are licensing at http://www.sciencedirect.com/science/journal/xxxxx or the Elsevier homepage for books at http://www.elsevier.com; Central Storage: This license does not include permission for a scanned version of the material to be stored in a central repository such as that provided by Heron/XanEdu. Licensing material from an Elsevier book: A hyper-text link must be included to the Elsevier homepage at http://www.elsevier.com. All content posted to the web site must maintain the copyright information line on the bottom of each image.

Posting licensed content on Electronic reserve: In addition to the above the following clauses are applicable: The web site must be password-protected and made available only to bona fide students registered on a relevant course. This permission is granted for 1 year only. You may obtain a new license for future website posting.

17. For journal authors: the following clauses are applicable in addition to the above:

Preprints: A preprint is an author's own write-up of research results and analysis, it has not been peer-reviewed, nor has it had any other value added to it by a publisher (such as formatting, copyright, technical enhancement etc.). Authors can share their preprints anywhere at any time. Preprints should not be added to or enhanced in any way in order to appear more like, or to substitute for, the final versions of articles however authors can update their preprints on arXiv or RePEc with their Accepted Author Manuscript (see below). If accepted for publication, we encourage authors to link from the preprint to their formal publication via its DOI. Millions of researchers have access to the formal publications on ScienceDirect, and so links will help users to find, access, cite and use the best available version. Please note that Cell Press, The Lancet and some society-owned have different preprint policies. Information on these policies is available on the journal homepage.

Accepted Author Manuscripts: An accepted author manuscript is the manuscript of an article that has been accepted for publication and which typically includes author-incorporated changes suggested during submission, peer review and editor-author communications. Authors can share their accepted author manuscript:

- immediately
  - via their non-commercial personal homepage or blog
  - by updating a preprint in arXiv or RePEc with the accepted manuscript
  - via their research institute or institutional repository for internal institutional uses or as part of an invitation-only research collaboration work-group
  - directly by providing copies to their students or to research collaborators for their personal use
  - for private scholarly sharing as part of an invitation-only work group on commercial sites with which Elsevier has an agreement
- after the embargo period
  - via non-commercial hosting platforms such as their institutional repository
  - via commercial sites with which Elsevier has an agreement

In all cases accepted manuscripts should:

- link to the formal publication via its DOI
- bear a CC-BY-NC-ND license - this is easy to do
- if aggregated with other manuscripts, for example in a repository or other site, be shared in alignment with our hosting policy
- not be added to or enhanced in any way to appear more like, or to substitute for, the published journal article.
Published journal article (JPA): A published journal article (PJA) is the definitive final record of published research that appears or will appear in the journal and embodies all value-adding publishing activities including peer review co-ordination, copy-editing, formatting, (if relevant) pagination and online enrichment. Policies for sharing publishing journal articles differ for subscription and gold open access articles: Subscription Articles: If you are an author, please share a link to your article rather than the full-text. Millions of researchers have access to the formal publications on ScienceDirect, and so links will help your users to find, access, cite, and use the best available version. Theses and dissertations which contain embedded PJAs as part of the formal submission can be posted publicly by the awarding institution with DOI links back to the formal publications on ScienceDirect. If you are affiliated with a library that subscribes to ScienceDirect you have additional private sharing rights for others' research accessed under that agreement. This includes use for classroom teaching and internal training at the institution (including use in course packs and courseware programs), and inclusion of the article for grant funding purposes. Gold Open Access Articles: May be shared according to the author-selected end-user license and should contain a CrossMark logo, the end user license, and a DOI link to the formal publication on ScienceDirect. Please refer to Elsevier's posting policy for further information. 18. For book authors the following clauses are applicable in addition to the above: Authors are permitted to place a brief summary of their work online only. You are not allowed to download and post the published electronic version of your chapter, nor may you scan the printed edition to create an electronic version. Posting to a repository: Authors are permitted to post a summary of their chapter only in their institution's repository. 19. 
Thesis/Dissertation: If your license is for use in a thesis/dissertation your thesis may be submitted to your institution in either print or electronic form. Should your thesis be published commercially, please reapply for permission. These requirements include permission for the Library and Archives of Canada to supply single copies, on demand, of the complete thesis and include permission for Proquest/UMI to supply single copies, on demand, of the complete thesis. Should your thesis be published commercially, please reapply for permission. Theses and dissertations which contain embedded PJAs as part of the formal submission can be posted publicly by the awarding institution with DOI links back to the formal publications on ScienceDirect.


Elsevier Open Access Terms and Conditions You can publish open access with Elsevier in hundreds of open access journals or in nearly 2000 established subscription journals that support open access publishing. Permitted third party re-use of these open access articles is defined by the author's choice of Creative Commons user license. See our open access license policy for more information. Terms & Conditions applicable to all Open Access articles published with Elsevier: Any reuse of the article must not represent the author as endorsing the adaptation of the article nor should the article be modified in such a way as to damage the author's honour or reputation. If any changes have been made, such changes must be clearly indicated. The author(s) must be appropriately credited and we ask that you include the end user license and a DOI link to the formal publication on ScienceDirect. If any part of the material to be used (for example, figures) has appeared in our publication with credit or acknowledgement to another source it is the responsibility of the user to ensure their reuse complies with the terms and conditions determined by the rights holder. Additional Terms & Conditions applicable to each Creative Commons user license: CC BY: The CC-BY license allows users to copy, to create extracts, abstracts and new works from the Article, to alter and revise the Article and to make commercial use of the Article (including reuse and/or resale of the Article by commercial entities), provided the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, indicates if changes were made and the licensor is not represented as endorsing the use made of the work. The full details of the license are available at http://creativecommons.org/licenses/by/4.0. 
CC BY NC SA: The CC BY-NC-SA license allows users to copy, to create extracts, abstracts and new works from the Article, to alter and revise the Article, provided this is not done for commercial purposes, and that the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, indicates if changes were made and the licensor is not represented as endorsing the use made of the work. Further, any new works must be made available on the same conditions. The full details of the license are available at http://creativecommons.org/licenses/by-nc-sa/4.0.

CC BY NC ND: The CC BY-NC-ND license allows users to copy and distribute the Article, provided this is not done for commercial purposes and further does not permit distribution of the Article if it is changed or edited in any way, and provided the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, and that the licensor is not represented as endorsing the use made of the work. The full details of the license are available at http://creativecommons.org/licenses/by-nc-nd/4.0.

Any commercial reuse of Open Access articles published with a CC BY NC SA or CC BY NC ND license requires permission from Elsevier and will be subject to a fee. Commercial reuse includes:

- Associating advertising with the full text of the Article
- Charging fees for document delivery or access
- Article aggregation
- Systematic distribution via e-mail lists or share buttons
- Posting or linking by commercial companies for use by customers of those companies.

20. Other Conditions: v1.8


Copyright permission for: Vittal V, Shi L, Wenzel DM, Scaglione KM, Duncan ED, Basrur V, Elenitoba-Johnson KS, Baker D, Paulson HL, Brzovic PS et al: Intrinsic disorder drives N-terminal ubiquitination by Ube2w. Nat Chem Biol 2015, 11(1):83-89.

NATURE PUBLISHING GROUP LICENSE TERMS AND CONDITIONS Apr 19, 2016

This is a License Agreement between Shelly M DeForte ("You") and Nature Publishing Group ("Nature Publishing Group") provided by Copyright Clearance Center ("CCC"). The license consists of your order details, the terms and conditions provided by Nature Publishing Group, and the payment terms and conditions.

All payments must be made in full to CCC. For payment instructions, please see information listed at the bottom of this form.

License Number: 3852681388675
License date: Apr 19, 2016
Licensed content publisher: Nature Publishing Group
Licensed content publication: Nature Chemical Biology
Licensed content title: Intrinsic disorder drives N-terminal ubiquitination by Ube2w
Licensed content author: Vinayak Vittal, Lei Shi, Dawn M Wenzel, K Matthew Scaglione, Emily D Duncan, Venkatesha Basrur
Licensed content date: Dec 1, 2014
Volume number: 11
Issue number: 1
Type of Use: reuse in a dissertation / thesis
Requestor type: academic/educational
Format: print and electronic
Portion: figures/tables/illustrations
Number of figures/tables/illustrations: 1
High-res required: no
Figures: 4
Author of this NPG article: no
Your reference number: None
Title of your thesis / dissertation: Intrinsic Disorder Where You Least Expect It: The Incidence and Functional Relevance of Intrinsic Disorder in Enzymes and the Protein Data Bank
Expected completion date: Jun 2016
Estimated size (number of pages): 100
Total: 0.00 USD

Terms and Conditions

Terms and Conditions for Permissions

Nature Publishing Group hereby grants you a non-exclusive license to reproduce this material for this purpose, and for no other use, subject to the conditions below:

1. NPG warrants that it has, to the best of its knowledge, the rights to license reuse of this material. However, you should ensure that the material you are requesting is original to Nature Publishing Group and does not carry the copyright of another entity (as credited in the published version). If the credit line on any part of the material you have requested indicates that it was reprinted or adapted by NPG with permission from another source, then you should also seek permission from that source to reuse the material.

2. Permission granted free of charge for material in print is also usually granted for any electronic version of that work, provided that the material is incidental to the work as a whole and that the electronic version is essentially equivalent to, or substitutes for, the print version. Where print permission has been granted for a fee, separate permission must be obtained for any additional, electronic re-use (unless, as in the case of a full paper, this has already been accounted for during your initial request in the calculation of a print run). NB: In all cases, web-based use of full-text articles must be authorized separately through the 'Use on a Web Site' option when requesting permission.

3. Permission granted for a first edition does not apply to second and subsequent editions and for editions in other languages (except for signatories to the STM Permissions Guidelines, or where the first edition permission was granted for free).

4. Nature Publishing Group's permission must be acknowledged next to the figure, table or abstract in print. In electronic form, this acknowledgement must be visible at the same time as the figure/table/abstract, and must be hyperlinked to the journal's homepage.

5. The credit line should read: Reprinted by permission from Macmillan Publishers Ltd: [JOURNAL NAME] (reference citation), copyright (year of publication) For AOP papers, the credit line should read: Reprinted by permission from Macmillan Publishers Ltd: [JOURNAL NAME], advance online publication, day month year (doi: 10.1038/sj.[JOURNAL ACRONYM].XXXXX)

Note: For republication from the British Journal of Cancer, the following credit lines apply. Reprinted by permission from Macmillan Publishers Ltd on behalf of Cancer Research UK: [JOURNAL NAME] (reference citation), copyright (year of publication) For AOP papers, the credit line should read: Reprinted by permission from Macmillan Publishers Ltd on behalf of Cancer Research UK: [JOURNAL NAME], advance online publication, day month year (doi: 10.1038/sj.[JOURNAL ACRONYM].XXXXX)

6. Adaptations of single figures do not require NPG approval. However, the adaptation should be credited as follows:


Adapted by permission from Macmillan Publishers Ltd: [JOURNAL NAME] (reference citation), copyright (year of publication)

Note: For adaptation from the British Journal of Cancer, the following credit line applies. Adapted by permission from Macmillan Publishers Ltd on behalf of Cancer Research UK: [JOURNAL NAME] (reference citation), copyright (year of publication)

7. Translations of 401 words up to a whole article require NPG approval. Please visit http://www.macmillanmedicalcommunications.com for more information. Translations of up to 400 words do not require NPG approval. The translation should be credited as follows:

Translated by permission from Macmillan Publishers Ltd: [JOURNAL NAME] (reference citation), copyright (year of publication).

Note: For translation from the British Journal of Cancer, the following credit line applies. Translated by permission from Macmillan Publishers Ltd on behalf of Cancer Research UK: [JOURNAL NAME] (reference citation), copyright (year of publication)

We are certain that all parties will benefit from this agreement and wish you the best in the use of this material. Thank you.

Special Terms: v1.1

Questions? [email protected] or +1-855-239-3415 (toll free in the US) or +1-978-646-2777.


Copyright permission for: Hegde ML, Tsutakawa SE, Hegde PM, Holthauzen LM, Li J, Oezguen N, Hilser VJ, Tainer JA, Mitra S: The disordered C-terminal domain of human DNA glycosylase NEIL1 contributes to its stability via intramolecular interactions. J Mol Biol 2013, 425(13):2359-2371.

ELSEVIER LICENSE TERMS AND CONDITIONS Apr 20, 2016

This is a License Agreement between Shelly M DeForte ("You") and Elsevier ("Elsevier") provided by Copyright Clearance Center ("CCC"). The license consists of your order details, the terms and conditions provided by Elsevier, and the payment terms and conditions.

All payments must be made in full to CCC. For payment instructions, please see information listed at the bottom of this form.

Supplier: Elsevier Limited, The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, UK
Registered Company Number: 1982084
Customer name: Shelly M DeForte
Customer address: 6007 N. Ithmar Ave, TAMPA, FL 33604
License number: 3853101379918
License date: Apr 20, 2016
Licensed content publisher: Elsevier
Licensed content publication: Journal of Molecular Biology
Licensed content title: The Disordered C-Terminal Domain of Human DNA Glycosylase NEIL1 Contributes to Its Stability via Intramolecular Interactions
Licensed content author: Muralidhar L. Hegde, Susan E. Tsutakawa, Pavana M. Hegde, Luis Marcelo F. Holthauzen, Jing Li, Numan Oezguen, Vincent J. Hilser, John A. Tainer, Sankar Mitra
Licensed content date: 10 July 2013
Licensed content volume number: 425
Licensed content issue number: 13
Number of pages: 13
Start Page: 2359
End Page: 2371
Type of Use: reuse in a thesis/dissertation
Portion: figures/tables/illustrations
Number of figures/tables/illustrations: 1
Format: both print and electronic
Are you the author of this Elsevier article? No
Will you be translating? No
Original figure numbers: 7
Title of your thesis/dissertation: Intrinsic Disorder Where You Least Expect It: The Incidence and Functional Relevance of Intrinsic Disorder in Enzymes and the Protein Data Bank
Expected completion date: Jun 2016
Estimated size (number of pages): 100
Elsevier VAT number: GB 494 6272 12
Permissions price: 0.00 USD
VAT/Local Sales Tax: 0.00 USD / 0.00 GBP
Total: 0.00 USD

Terms and Conditions

INTRODUCTION

1. The publisher for this copyrighted material is Elsevier. By clicking "accept" in connection with completing this licensing transaction, you agree that the following terms and conditions apply to this transaction (along with the Billing and Payment terms and conditions established by Copyright Clearance Center, Inc. ("CCC"), at the time that you opened your Rightslink account and that are available at any time at http://myaccount.copyright.com).

GENERAL TERMS

2. Elsevier hereby grants you permission to reproduce the aforementioned material subject to the terms and conditions indicated.

3. Acknowledgement: If any part of the material to be used (for example, figures) has appeared in our publication with credit or acknowledgement to another source, permission must also be sought from that source. If such permission is not obtained then that material may not be included in your publication/copies. Suitable acknowledgement to the source must be made, either as a footnote or in a reference list at the end of your publication, as follows: "Reprinted from Publication title, Vol /edition number, Author(s), Title of article / title of chapter, Pages No., Copyright (Year), with permission from Elsevier [OR APPLICABLE SOCIETY COPYRIGHT OWNER]." Also Lancet special credit - "Reprinted from The Lancet, Vol. number, Author(s), Title of article, Pages No., Copyright (Year), with permission from Elsevier."

4. Reproduction of this material is confined to the purpose and/or media for which permission is hereby given.

5. Altering/Modifying Material: Not Permitted. However figures and illustrations may be altered/adapted minimally to serve your work. Any other abbreviations, additions, deletions and/or any other alterations shall be made only with prior written authorization of Elsevier Ltd. (Please contact Elsevier at [email protected])

6. If the permission fee for the requested use of our material is waived in this instance, please be advised that your future requests for Elsevier materials may attract a fee.

7. Reservation of Rights: Publisher reserves all rights not specifically granted in the combination of (i) the license details provided by you and accepted in the course of this licensing transaction, (ii) these terms and conditions and (iii) CCC's Billing and Payment terms and conditions.

8. License Contingent Upon Payment: While you may exercise the rights licensed immediately upon issuance of the license at the end of the licensing process for the transaction, provided that you have disclosed complete and accurate details of your proposed use, no license is finally effective unless and until full payment is received from you (either by publisher or by CCC) as provided in CCC's Billing and Payment terms and conditions. If full payment is not received on a timely basis, then any license preliminarily granted shall be deemed automatically revoked and shall be void as if never granted. Further, in the event that you breach any of these terms and conditions or any of CCC's Billing and Payment terms and conditions, the license is automatically revoked and shall be void as if never granted. Use of materials as described in a revoked license, as well as any use of the materials beyond the scope of an unrevoked license, may constitute copyright infringement and publisher reserves the right to take any and all action to protect its copyright in the materials.

9. Warranties: Publisher makes no representations or warranties with respect to the licensed material.

10. Indemnity: You hereby indemnify and agree to hold harmless publisher and CCC, and their respective officers, directors, employees and agents, from and against any and all claims arising out of your use of the licensed material other than as specifically authorized pursuant to this license.

11. No Transfer of License: This license is personal to you and may not be sublicensed, assigned, or transferred by you to any other person without publisher's written permission.

12. No Amendment Except in Writing: This license may not be amended except in a writing signed by both parties (or, in the case of publisher, by CCC on publisher's behalf).

13. Objection to Contrary Terms: Publisher hereby objects to any terms contained in any purchase order, acknowledgment, check endorsement or other writing prepared by you, which terms are inconsistent with these terms and conditions or CCC's Billing and Payment terms and conditions. These terms and conditions, together with CCC's Billing and Payment terms and conditions (which are incorporated herein), comprise the entire agreement between you and publisher (and CCC) concerning this licensing transaction. In the event of any conflict between your obligations established by these terms and conditions and those established by CCC's Billing and Payment terms and conditions, these terms and conditions shall control.

14. Revocation: Elsevier or Copyright Clearance Center may deny the permissions described in this License at their sole discretion, for any reason or no reason, with a full refund payable to you. Notice of such denial will be made using the contact information provided by you. Failure to receive such notice will not alter or invalidate the denial. In no event will Elsevier or Copyright Clearance Center be responsible or liable for any costs, expenses or damage incurred by you as a result of a denial of your permission request, other than a refund of the amount(s) paid by you to Elsevier and/or Copyright Clearance Center for denied permissions.

LIMITED LICENSE

The following terms and conditions apply only to specific license types:


15. Translation: This permission is granted for non-exclusive world English rights only unless your license was granted for translation rights. If you licensed translation rights you may only translate this content into the languages you requested. A professional translator must perform all translations and reproduce the content word for word preserving the integrity of the article.
16. Posting licensed content on any Website: The following terms and conditions apply as follows:
Licensing material from an Elsevier journal: All content posted to the web site must maintain the copyright information line on the bottom of each image; A hyper-text link must be included to the Homepage of the journal from which you are licensing at http://www.sciencedirect.com/science/journal/xxxxx or the Elsevier homepage for books at http://www.elsevier.com; Central Storage: This license does not include permission for a scanned version of the material to be stored in a central repository such as that provided by Heron/XanEdu.
Licensing material from an Elsevier book: A hyper-text link must be included to the Elsevier homepage at http://www.elsevier.com. All content posted to the web site must maintain the copyright information line on the bottom of each image.

Posting licensed content on Electronic reserve: In addition to the above the following clauses are applicable: The web site must be password-protected and made available only to bona fide students registered on a relevant course. This permission is granted for 1 year only. You may obtain a new license for future website posting.
17. For journal authors: the following clauses are applicable in addition to the above:
Preprints: A preprint is an author's own write-up of research results and analysis; it has not been peer-reviewed, nor has it had any other value added to it by a publisher (such as formatting, copyright, technical enhancement etc.). Authors can share their preprints anywhere at any time. Preprints should not be added to or enhanced in any way in order to appear more like, or to substitute for, the final versions of articles; however, authors can update their preprints on arXiv or RePEc with their Accepted Author Manuscript (see below). If accepted for publication, we encourage authors to link from the preprint to their formal publication via its DOI. Millions of researchers have access to the formal publications on ScienceDirect, and so links will help users to find, access, cite and use the best available version. Please note that Cell Press, The Lancet and some society-owned journals have different preprint policies. Information on these policies is available on the journal homepage.
Accepted Author Manuscripts: An accepted author manuscript is the manuscript of an article that has been accepted for publication and which typically includes author-incorporated changes suggested during submission, peer review and editor-author communications.
Authors can share their accepted author manuscript:
- immediately
  o via their non-commercial personal homepage or blog
  o by updating a preprint in arXiv or RePEc with the accepted manuscript
  o via their research institute or institutional repository for internal institutional uses or as part of an invitation-only research collaboration work-group
  o directly by providing copies to their students or to research collaborators for their personal use
  o for private scholarly sharing as part of an invitation-only work group on commercial sites with which Elsevier has an agreement
- after the embargo period
  o via non-commercial hosting platforms such as their institutional repository
  o via commercial sites with which Elsevier has an agreement
In all cases accepted manuscripts should:
- link to the formal publication via its DOI
- bear a CC-BY-NC-ND license - this is easy to do
- if aggregated with other manuscripts, for example in a repository or other site, be shared in alignment with our hosting policy
- not be added to or enhanced in any way to appear more like, or to substitute for, the published journal article.
Published journal article (PJA): A published journal article (PJA) is the definitive final record of published research that appears or will appear in the journal and embodies all value-adding publishing activities including peer review co-ordination, copy-editing, formatting, (if relevant) pagination and online enrichment. Policies for sharing published journal articles differ for subscription and gold open access articles:
Subscription Articles: If you are an author, please share a link to your article rather than the full-text. Millions of researchers have access to the formal publications on ScienceDirect, and so links will help your users to find, access, cite, and use the best available version. Theses and dissertations which contain embedded PJAs as part of the formal submission can be posted publicly by the awarding institution with DOI links back to the formal publications on ScienceDirect. If you are affiliated with a library that subscribes to ScienceDirect you have additional private sharing rights for others' research accessed under that agreement. This includes use for classroom teaching and internal training at the institution (including use in course packs and courseware programs), and inclusion of the article for grant funding purposes.
Gold Open Access Articles: May be shared according to the author-selected end-user license and should contain a CrossMark logo, the end user license, and a DOI link to the formal publication on ScienceDirect. Please refer to Elsevier's posting policy for further information.
18. For book authors the following clauses are applicable in addition to the above: Authors are permitted to place a brief summary of their work online only. You are not allowed to download and post the published electronic version of your chapter, nor may you scan the printed edition to create an electronic version. Posting to a repository: Authors are permitted to post a summary of their chapter only in their institution's repository.
19. Thesis/Dissertation: If your license is for use in a thesis/dissertation your thesis may be submitted to your institution in either print or electronic form. Should your thesis be published commercially, please reapply for permission. These requirements include permission for the Library and Archives of Canada to supply single copies, on demand, of the complete thesis and include permission for Proquest/UMI to supply single copies, on demand, of the complete thesis. Theses and dissertations which contain embedded PJAs as part of the formal submission can be posted publicly by the awarding institution with DOI links back to the formal publications on ScienceDirect.


Elsevier Open Access Terms and Conditions

You can publish open access with Elsevier in hundreds of open access journals or in nearly 2000 established subscription journals that support open access publishing. Permitted third party re-use of these open access articles is defined by the author's choice of Creative Commons user license. See our open access license policy for more information.

Terms & Conditions applicable to all Open Access articles published with Elsevier: Any reuse of the article must not represent the author as endorsing the adaptation of the article nor should the article be modified in such a way as to damage the author's honour or reputation. If any changes have been made, such changes must be clearly indicated. The author(s) must be appropriately credited and we ask that you include the end user license and a DOI link to the formal publication on ScienceDirect. If any part of the material to be used (for example, figures) has appeared in our publication with credit or acknowledgement to another source it is the responsibility of the user to ensure their reuse complies with the terms and conditions determined by the rights holder.

Additional Terms & Conditions applicable to each Creative Commons user license:

CC BY: The CC-BY license allows users to copy, to create extracts, abstracts and new works from the Article, to alter and revise the Article and to make commercial use of the Article (including reuse and/or resale of the Article by commercial entities), provided the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, indicates if changes were made and the licensor is not represented as endorsing the use made of the work. The full details of the license are available at http://creativecommons.org/licenses/by/4.0.
CC BY NC SA: The CC BY-NC-SA license allows users to copy, to create extracts, abstracts and new works from the Article, to alter and revise the Article, provided this is not done for commercial purposes, and that the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, indicates if changes were made and the licensor is not represented as endorsing the use made of the work. Further, any new works must be made available on the same conditions. The full details of the license are available at http://creativecommons.org/licenses/by-nc-sa/4.0.

CC BY NC ND: The CC BY-NC-ND license allows users to copy and distribute the Article, provided this is not done for commercial purposes and further does not permit distribution of the Article if it is changed or edited in any way, and provided the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, and that the licensor is not represented as endorsing the use made of the work. The full details of the license are available at http://creativecommons.org/licenses/by-nc-nd/4.0.

Any commercial reuse of Open Access articles published with a CC BY NC SA or CC BY NC ND license requires permission from Elsevier and will be subject to a fee. Commercial reuse includes:
- Associating advertising with the full text of the Article
- Charging fees for document delivery or access
- Article aggregation
- Systematic distribution via e-mail lists or share buttons
- Posting or linking by commercial companies for use by customers of those companies.


Copyright permission for: Christensen LC, Jensen NW, Vala A, Kamarauskaite J, Johansson L, Winther JR, Hofmann K, Teilum K, Ellgaard L: The human selenoprotein VCP-interacting membrane protein (VIMP) is non-globular and harbors a reductase function in an intrinsically disordered region. J Biol Chem 2012, 287(31):26388-26399.

And

Aachmann FL, Sal LS, Kim HY, Marino SM, Gladyshev VN, Dikiy A: Insights into function, catalytic mechanism, and fold evolution of selenoprotein methionine sulfoxide reductase B1 through structural analysis. J Biol Chem 2010, 285(43):33315-33323.

To whom it may concern,

It is the policy of the American Society for Biochemistry and Molecular Biology to allow reuse of any material published in its journals (the Journal of Biological Chemistry, Molecular & Cellular Proteomics and the Journal of Lipid Research) in a thesis or dissertation at no cost and with no explicit permission needed. Please see our copyright permissions page on the journal site for more information.

Best wishes,

Sarah Crespi

American Society for Biochemistry and Molecular Biology
11200 Rockville Pike, Suite 302
Rockville, MD
240-283-6616
JBC | MCP | JLR

Copyright Permission Policy

These guidelines apply to the reuse of articles, figures, charts and photos in the Journal of Biological Chemistry, Molecular & Cellular Proteomics and the Journal of Lipid Research.

For authors reusing their own material:
Authors need NOT contact the journal to obtain rights to reuse their own material. They are automatically granted permission to do the following:
- Reuse the article in print collections of their own writing.
- Present a work orally in its entirety.
- Use an article in a thesis and/or dissertation.
- Reproduce an article for use in the author's courses. (If the author is employed by an academic institution, that institution also may reproduce the article for teaching purposes.)
- Reuse a figure, photo and/or table in future commercial and noncommercial works.
- Post a copy of the paper in PDF that you submitted via BenchPress.
- Link to the journal site containing the final edited PDFs created by the publisher.
EXCEPTION: If authors select the Author’s Choice publishing option:
- The final version of the manuscript will be covered under the Creative Commons Attribution license (CC BY), the most accommodating of licenses offered.
- The final version of the manuscript will be released immediately on the publisher’s website and PubMed Central.
Please note that authors must include the following citation when using material that appeared in an ASBMB journal: "This research was originally published in Journal Name. Author(s). Title. Journal Name. Year; Vol:pp-pp. © the American Society for Biochemistry and Molecular Biology."

For other parties using material for noncommercial use:
Other parties are welcome to copy, distribute, transmit and adapt the work, at no cost and without permission, for noncommercial use as long as they attribute the work to the original source using the citation above. Examples of noncommercial use include:
- Reproducing a figure for educational purposes, such as schoolwork or lecture presentations, with attribution.
- Appending a reprinted article to a Ph.D. dissertation, with attribution.

For other parties using material for commercial use:
Navigate to the article of interest and click the "Request Permissions" button on the middle navigation bar. It will walk you through the steps for obtaining permission for reuse. Examples of commercial use by parties other than authors include:
- Reproducing a figure in a book published by a commercial publisher.
- Reproducing a figure in a journal article published by a commercial publisher.

Updated April 18, 2016
