The Dynamics of Protein Interaction Networks

By Chi Nam Ignatius Pang

A thesis submitted for the degree of

Doctor of Philosophy in Biotechnology

January 2010

School of Biotechnology and Biomolecular Sciences

The University of New South Wales

Table of Contents

Originality Statement

Copyright Statement

Authenticity Statement

Acknowledgements

Publications

Book chapter:

Abstract

Dedication

1 Introduction

1.1 Protein-protein interactions

1.2 Technologies for the discovery of protein-protein interactions

1.3 Protein interaction networks

1.4 Integrating protein-protein interaction data with genomics and proteomics data

1.5 Proteomic technology to quantify protein absolute abundance, half-lives, and translation rate

1.6 Post-translational modifications

1.7 Different methods of detecting post-translational modifications

1.8 Aims of this thesis

2 Are protein complexes made of cores, modules and attachments?

3 High throughput protein-protein interaction data: clues for the architecture of protein complexes

4 Surface accessibility of protein post-translational modifications

5 Identification of arginine- and lysine-methylation in the proteome of Saccharomyces cerevisiae and its functional implications

6 Proteins deleterious on overexpression are associated with high intrinsic disorder, specific interaction domains and low abundance

7 Discussion

7.1 The role of domain-domain interactions in the formation of stable protein complexes

7.2 Using high-throughput interaction data to determine the architecture of protein complexes

7.3 The role of surface accessibility in post-translational modifications

7.4 Identification of arginine- and lysine-methylation using peptide mass spectra

7.5 Exploring the function of arginine and lysine methylation in the proteome of S. cerevisiae

7.6 Interaction domains and high intrinsic disorder favours interaction promiscuity

7.7 Tight regulation of protein abundance prevents promiscuous protein-protein interactions

8 Conclusions

9 References

10 Appendices

10.1 Appendix I - Publications

10.2 Appendix II - Book Chapters

10.3 Appendix III - Publication highlights

Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Chi Nam Ignatius Pang

School of Biotechnology and Biomolecular Sciences

University of New South Wales

Sydney NSW 2052

Australia

January 2010


Copyright Statement

I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorize University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only).

I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.

Chi Nam Ignatius Pang

School of Biotechnology and Biomolecular Sciences

University of New South Wales

Sydney NSW 2052

Australia

January 2010


Authenticity Statement

I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.

Chi Nam Ignatius Pang

School of Biotechnology and Biomolecular Sciences

University of New South Wales

Sydney NSW 2052

Australia

January 2010


Acknowledgements

I would like to sincerely thank the following people, who have assisted me throughout my thesis:

Professor Marc Wilkins, my supervisor, for being an inspirational teacher and an incredible scientist. The thesis has certainly improved from your ideas and critical reviews, and I have learnt a lot from you. Thank you for proofreading my thesis so many times, which would have been an arduous task for anyone. Thank you for supporting me as a friend, and helping me through the many difficulties I encountered throughout the past 4 years.

Dr. Rohan Williams, my co-supervisor, for being a helpful mentor throughout my studies. Your challenging questions and critical comments have helped sharpen many ideas in my thesis. Thank you for your constant support.

Dr. Mark Tanaka, for being a panel member for my annual reviews, and taking precious time to chair these important meetings.

Dr. Elisabeth Gasteiger, for building the FindMod bulk submission webpage. It is the most critical software for chapter 5, and without it the project would not have come so far. Thank you very much.

Dr. Anne-Claude Gavin, for personal correspondence and facilitating access to the Cellzome peptide mass data.

Dr. Andrew Hayen, for providing technical help on statistics and programming in R.

Dr. Mark Cowley, for giving me countless pieces of advice on Linux and R programming. Thank you for assisting me with the illustration of Figure 2 in Chapter 4.

Liang Ma, for charging ahead with the analysis of proteins deleterious upon overexpression. I like this paper very much. Thank you!

Angela Lek, for writing the Perl scripts for acquiring the domain-domain interactions from iPfam.

James Krycer, for contributing to two chapters of my thesis. Your help was important for my thesis. Thank you!

Simone Li, for your help in visualizing protein-protein interactions in GEOMI. Thank you for verifying the GO terms analysis in chapter 5. It was just what we needed to tackle the difficult reviewers' comments.

Yose Widjaja and Dr. Tim Lambert, for giving me the opportunity to contribute to the paper on the Interactorium. The visualization is very high-tech and it is like science fiction becoming reality.

Dr. Merridee Wouters and Dr. Richard George, for your supervision at the Victor Chang Cardiac Research Institute, and for your help with writing the Scooby-Domain paper.

Weit-Tse Hsu, for the opportunity to work with you on the paper.


Tim Couttas, for being a wonderful lab mate and for sharing your ideas.

Daniel Yagoub, for working on a challenging Honours thesis project. I certainly learned much from your work.

Tim and Daniel have found experimental evidence for arginine- and lysine-methylation sites in yeast. Although the data were not used in my thesis, these data allowed me to eliminate doubts on whether my results could be real. Thank you both!

Adam Lee, for working on the Honours research project on lysine- and arginine- methylation with Marc. I have learnt much from your work.

Dr. Tim Salmon, for IT helpdesk support.

Gordon Kondo, for illustrating Figure 4 of Chapter 2.

Edward Kondo, for building me a powerful Quad Core computer, and for your help in proof reading my thesis.

Catharina Cheung, for editing parts of Chapter 6 and preparing it for publication.

I would also like to thank Mark, Liang, Angela, Yose, Simone, Timothy, Daniel, James, Melissa, Jason, Apruv, Clare, Natalie, David, Edwin, Jack, Boer, Ricky, Emily, Kathlen, Chen Ying, Michael, Florian, Anke, Giulia, Sheila, John, Aileen, Rowena, Shaun, and anyone else that I may have missed, for being fantastic and helpful lab mates.

I was the recipient of an Australian Postgraduate Award. I thank UNSW and the School of Biotechnology and Biomolecular Sciences for supporting my travel to the Interactome Networks conference in Hinxton, August 2007.

The thesis was printed using double-sided printing to save the environment. This is permitted by the University of New South Wales.


Publications

The following publications form the results chapters of my thesis:

Chapter 2: Pang, CN, Krycer, JR, Lek, A & Wilkins, MR: Are protein complexes made of cores, modules and attachments? Proteomics 2008, 8, 425-34.

Chapter 3: Krycer, JR, Pang, CN & Wilkins, MR: High throughput protein-protein interaction data: clues for the architecture of protein complexes. Proteome Sci 2008, 6, 32.

Chapter 4: Pang, CN, Hayen, A & Wilkins, MR: Surface accessibility of protein post-translational modifications. J Proteome Res 2007, 6, 1833-45.

Chapter 5: Pang, CN, Gasteiger, E & Wilkins, MR: Identification of arginine- and lysine-methylation in the proteome of Saccharomyces cerevisiae and its functional implications. BMC Genomics, 2010, 11, 29.

Chapter 6: Liang, M, Pang, CN, Li, SS & Wilkins, MR: Proteins deleterious on overexpression are associated with high intrinsic disorder, specific interaction domains and low abundance. J Proteome Res, 2010, 9, 1218-25.


Other publications, not central to investigations described in this thesis:

Publications:

Hsu, WT, Pang, CN, Sheetal, J & Wilkins, MR: Protein-protein interactions and disease: use of S. cerevisiae as a model system. Biochim Biophys Acta 2007, 1774, 838-47.

Pang, CN, Lin, K, Wouters, MA, Heringa, J & George, RA: Identifying foldable regions in protein sequence from the hydrophobic signal. Nucleic Acids Res 2008, 36, 578-88.

Widjaja, YY, Pang, CN, Li, SS, Wilkins, MR & Lambert, TD: The Interactorium: visualising proteins, complexes and interaction networks in a virtual 3D . Proteomics 2009, 9, 5309-15.

Book chapter:

Pang, CN & Wilkins, MR: Online resources for the molecular contextualization of disease. Methods Mol Med 2008, 141, 287-308.


Abstract

An important challenge of systems biology is to understand how proteins interact with each other to dynamically orchestrate biological functions. This thesis focused on understanding the dynamics of protein interaction networks in the yeast Saccharomyces cerevisiae; this model organism has the most comprehensive set of genomics, proteomics, and protein-protein interaction datasets publicly available for bioinformatic and systems biology analyses. I investigated several different aspects of dynamics in protein interaction networks. Firstly, I asked whether protein complexes are made of core, module and attachment proteins. Analysis suggested that interactions between core proteins were most likely to be mediated by stable domain-domain interactions, followed by those of module and attachment proteins. Furthermore, we proposed that some protein complexes are likely to be tightly regulated, by only expressing core proteins ‘just-in-time’ to activate the complex when it is needed.

Secondly, we asked whether high-throughput protein-protein interaction data could be used to provide clues on the architecture of protein complexes. Pairwise interaction data was shown to help in defining complex membership, while cores and modules of protein complexes could help determine the spatial proximity of proteins. Predicted domain-domain interactions could explain some interactions within protein complexes, but false positives complicated the analysis.

Thirdly, I showed that post-translational modifications involved in protein-protein interactions are likely to be on the surface of proteins, while artifactual modifications are not preferentially found in coils and helices. Parts of protein structures that mediate transient interactions tend to be intrinsically disordered, and can contain interaction motifs and post-translational modifications that could be recognized and bound by domains.


Fourthly, using peptide mass fingerprinting data, I have found 83 arginine and lysine methylation sites in 66 proteins. Evidence from this dataset suggests lysine methylation could block the action of ubiquitin ligase. Fifthly, we asked the question whether overexpression of certain proteins could affect the dynamics of interaction. Proteins deleterious upon overexpression tend to be low in abundance, high in intrinsic disorder, and have a high number of interaction partners. Finally, the investigations above are discussed to show how the sequence-based effects, the abundance-based effects, and conditional binding effects influence the dynamics of protein interaction networks.


Dedication

I dedicate my thesis to my family, especially, my mum (Linda), my dad (Chris), my brother-in-law (Edward), my sister (Rosa), and Edward’s brother Gordon.

I thank my parents, Chris and Linda, for their love and nurture. Thank you for your upbringing, for providing food, shelter, education, and Tender Loving Care! It was hard work and I hope it is my turn to take care of you.

Thank you Rosa and Edward for your endless support while I was writing my thesis!

Thank you Gordon for all the fun and laughter we shared.

I also dedicate my thesis to Catharina, who has been extremely supportive of me. You tolerated me so much while I was busy writing my thesis.

I thank everyone in my family, and all of my friends who have prayed for me and supported me all this time.

Most of all, I thank you Lord, Jesus Christ, for giving your life for us. I also dedicate my thesis to you.

I would like to share a quote with everyone who reads my thesis:

“To live without a faith, without a patrimony to defend, without a steady struggle for the truth, that is not living, but just existing” – Blessed Pier Giorgio Frassati


1 Introduction

Systems biology involves the integration of complex biological data from various large-scale studies, to study the cell as an integrated system. An important challenge of systems biology is to understand how proteins interact together dynamically to orchestrate biological function, and this is largely assisted by the field of proteomics, which collects much of the data needed for analysis. The proteome is defined as the entire set of proteins expressed by the genome, including all proteins produced from alternatively spliced genes and any co- or post-translational modifications of these proteins.1 The study of protein interaction networks has been made possible by recent large-scale protein-protein interaction studies, and together with recent technologies that collect data from genomics and proteomics experiments, it is possible to collate the necessary parts list to construct a dynamic map of protein interaction networks. Here, we will describe experimental methods that collect the various data necessary to construct and analyse the protein interaction networks. These will be discussed with respect to the recent advances in technologies and their advantages and disadvantages. We will study the dynamics of protein interaction networks with a focus on Saccharomyces cerevisiae, commonly called baker’s yeast, since this eukaryotic model organism has the most comprehensive genomics, proteomics, and protein-protein interaction data available to date for systems biology analysis. We will formulate hypotheses on what affects the dynamics of protein interaction networks, and test these hypotheses using the available data and bioinformatics analyses.


1.1 Protein-protein interactions

Proteins seldom act alone; instead, they work in a concerted manner, acting as molecular machines.2,3 These molecular machines are protein complexes containing subunits that bind stably and in stoichiometric proportion with each other.2,3 Formation of protein complexes allows proteins of related biological function to work together, co-ordinated in expression and localization.2,3 There are also higher-level interactions connecting protein complexes together to form cellular pathways or networks, formed by proteins that bind transiently and dynamically depending on the biological context.2-6

Together, all the interactions that can occur between proteins in the cell are called the protein interactome.7 Different proteins can interact using a diverse range of interaction interfaces, each with different physicochemical and structural properties that work together to deliver a variety of cellular functions.8 These protein-protein interactions are characterised by physical and biochemical attributes such as van der Waals interactions, hydrogen bonds, salt-bridges, electrostatic interactions, and hydrophobic interactions. In addition to the physical forces involved, the binding affinity between two proteins also varies according to the shape, conformational flexibility, and surface area of the contact region. Interfaces of stable domain-domain interactions are mostly larger than 1500 Å²,9 whereas transient interactions usually have a small interface of less than 1500 Å².10 Larger interaction interfaces tend to be hydrophobic, and smaller interaction interfaces tend to have more polar and charged amino acids that are involved in electrostatic interactions.11 There are three broad categories of protein-protein interaction interfaces, namely, domain-domain interaction,12,13 domain-motif interaction,14 and binding of a post-translational modification by a recognition domain, also known as domain-PTM interaction.13,15 These three types of protein-protein interaction interfaces have different binding strengths and specificities, and have different roles in static and dynamic protein-protein interactions.


Domain-domain interactions (DDIs) tend to be stable interactions mediated by large hydrophobic interfaces.16,17 They are often found in stable protein complexes.18,19 These interactions have an affinity in the nM to pM range.20,21 An example of domain-domain interaction is the homodimeric interactions of the domain.22 Several interaction partners can bind to the same interaction domain of a protein. Because these partners share the same interface, their interactions are mutually exclusive; only one partner can bind to the interface at any one time.23 Usually there are several proteins that can bind to the same mutually exclusive interface, which can facilitate the dynamics of protein-protein interactions. A protein can bind to different partners through the same domain-domain interaction interface and engage in various functions in response to changes in the spatial-temporal conditions.

Domain-motif interactions involve the binding of a specific recognition domain with an interaction motif. They usually mediate transient and dynamic interactions in signalling and regulatory networks. The consensus binding motif is usually 3 to 10 amino acids in length.24 An example of domain-motif interaction is the binding of the SH3 domain to the RXXK consensus motif.25 Interaction motifs are often found in coils, loops, or intrinsically disordered regions of proteins, which are accessible for binding by specific interaction domains.26 The flexible region that contains the motif will often undergo a disorder-to-order transition upon binding the recognition domain, thus stabilizing the interaction.26 Domain-motif interactions have a low affinity of 0.5-10 μM.14 The amino acid sequences for domain-motif interactions typically evolve faster than those for domain-domain interactions,27 and mutation in the motif sequence could lead to the gain or loss of the interaction in a short evolutionary time-scale. Domain-domain interactions and domain-motif interactions are dependent on the amino acid sequences of the interacting proteins; these interactions are regulated by the sequence-based effect described by Wilkins and Kummerfeld (2008).28
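To put the affinity ranges quoted above on a common scale, a dissociation constant can be converted to a standard binding free energy using ΔG° = RT ln(Kd / 1 M). The following is a minimal sketch only: the temperature of 298.15 K, the 1 M standard state, the function name and the example Kd values (picked from within the quoted nM to pM and 0.5-10 μM ranges) are assumptions for illustration and are not taken from the cited studies.

```python
import math

GAS_CONSTANT = 8.314462618e-3  # kJ/(mol*K)
ASSUMED_TEMPERATURE = 298.15   # K (25 degrees C); an assumption for this sketch


def binding_free_energy(kd_molar, temperature=ASSUMED_TEMPERATURE):
    """Standard binding free energy in kJ/mol for a dissociation constant
    given in mol/L, relative to a 1 M standard state."""
    return GAS_CONSTANT * temperature * math.log(kd_molar)


# Illustrative Kd values chosen from the ranges quoted in the text.
example_kd = {
    "domain-domain, 1 pM": 1e-12,
    "domain-domain, 1 nM": 1e-9,
    "domain-motif, 0.5 uM": 0.5e-6,
    "domain-motif, 10 uM": 10e-6,
}

for label, kd in example_kd.items():
    print(f"{label}: {binding_free_energy(kd):.1f} kJ/mol")
```

A more negative value corresponds to tighter binding, so under these assumptions the domain-domain examples sit well below the domain-motif examples on the free energy scale.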


Domain-PTM interaction involves the binding of a domain to a post-translationally modified amino acid, and is known to be involved in a variety of signalling and regulatory pathways.13,15 An example of domain-PTM interaction is the binding of the SH2 domain to phosphorylated tyrosine residues, but not to the corresponding unmodified tyrosine. The presence or absence of a reversible modification, such as phosphorylation, can therefore act as a switch to turn domain-PTM interactions on or off,13,15 and this is also known as the conditional-binding effect.28 Modifications that can be bound by a recognition domain, such as lysine acetylation and serine phosphorylation, are often accessible on the surface of proteins or within intrinsically disordered regions.29 There are many types of modifications and recognition domains, and they work in combination to dynamically change the way proteins interact and deliver a variety of cellular functions.13,15 The classifications of the domain-PTM and domain-motif interactions are not distinct; many PTM-based recognition domains recognise PTMs found in a sequence motif. Some variation of the sequence within post-translationally modified motifs occurs for fine-tuning the binding specificity of the modified motif to different modification enzymes and recognition domains; this is the case for phosphorylation.30 In sum, information on domain-motif, domain-domain, and domain-PTM interactions can provide molecular details on how to assemble the protein interaction networks, and can be useful for understanding how proteins interact with different partners to deliver a wide range of cellular processes.13,14

1.2 Technologies for the discovery of protein-protein interactions

There are two high-throughput approaches that measure protein-protein interactions: the first type detects pairwise protein-protein interactions, and the second type purifies protein complexes and identifies their constituent proteins. These two types of technologies will be briefly introduced below, and their advantages and disadvantages will be discussed with respect to lessons learnt from several large-scale protein-protein interaction studies in S. cerevisiae. While there are other technologies that measure protein-protein interactions, such as surface plasmon resonance,31,32 co-crystallisation followed by X-ray crystallography, and native gel electrophoresis,33-35 high-throughput methods such as yeast two-hybrid, and affinity purification followed by mass spectrometry, can produce the large amount of protein-protein interaction data required for systems biology analysis.

Yeast two-hybrid (Y2H) is a technique that detects pairwise protein-protein interactions. It was developed by Fields and Song (1989).36 In this system, an interaction between a ‘bait’ protein and a ‘prey’ protein reconstitutes a transcription factor, which activates the transcription of a reporter gene. Each of the two proteins in a Y2H experiment is fused with one of the two fragments of the transcription factor; the bait protein is fused with a DNA binding domain and the prey protein is fused with the transcription activation domain. Y2H is primarily used for measuring binary interactions between two proteins. It can be used to measure binary interactions that are weak or transient, and can detect some PTM-dependent interactions.37 There are some inherent limitations of Y2H techniques. For example, approximately 5-10% of the bait proteins can auto-activate the reporter gene without binding to a prey protein, and vice versa.38 False positives can also arise from the overexpression of the hybrid proteins, the fact that interactions must occur in the nucleus,39 the presence of endogenous proteins that bridge the ‘gap’ between bait and prey proteins,38 and interactions mediated by homodomain interactions that may not occur normally.40 Y2H has approximately 25% sensitivity,37 which means that many interactions are expected to go undetected; these are commonly regarded as false negatives. Y2H also removes spatial and temporal boundaries between proteins,39 which may contribute to the false-positive rate of this assay.


Another example of protein-protein interaction detection techniques is the protein-fragment complementation assay (PCA).41 PCA describes any assay in which the sequence of a reporter protein is split into two or more fragments, which are then fused with proteins that are assayed for their ability to interact. The reporter protein is reconstituted when the proteins carrying the fused fragments interact. For example, in a manner similar to the split transcription factor used in Y2H, a mutant version of the murine dihydrofolate reductase (mDHFR) protein is divided into two fragments, F[1,2] and F[3]. F[1,2] is fused to the prey protein and F[3] is fused to the bait protein. If the bait and prey proteins interact physically, the two complementary fragments of the mDHFR enzyme are brought together, which reactivates the mDHFR catalytic activity. The mDHFR enzyme remains fully active in the presence of the DHFR inhibitor methotrexate, and protein-protein interaction is measured by selecting for cells that survive in the presence of methotrexate.41 As the two fragments of the mDHFR enzyme must bind together to activate the enzyme, the assay can only detect interactions within 8 nm. This reduces the false positives resulting from indirect interactions mediated by one or more proteins that bridge the pair of proteins.41 The protein-fragment complementation assay has additional advantages. It can monitor changes in protein-protein interactions in response to changes in environmental conditions in real-time, and is better at detecting interactions of transmembrane proteins than other assays described here.41 However, false negatives can also occur in a PCA assay. This happens when two proteins interact but the two fragments of the reporter protein are too far apart to be reconstituted. There are many other variants of Y2H and three of these are briefly described below. First is the split luciferase system,42,43 in which the split luciferase becomes luminous upon detecting an interaction. Second is bimolecular fluorescence complementation (BiFC), which uses split Green Fluorescent Protein (GFP) as the reporter protein.44,45 BiFC allows in vivo imaging of interactions when the split GFP becomes fluorescent as a result of an interaction. Third is the split ubiquitin system,46 designed to detect interactions of transmembrane proteins.

Affinity purification involves using an affinity reagent with strong affinity to the protein under investigation, such as an antibody specific to the protein. The affinity reagent binds to the protein of interest and allows it to be concentrated while other proteins are washed away or separated. For the affinity purification of a protein complex, the protein of interest is fused to an affinity tag such that it can be efficiently captured with an antibody or affinity resin. Proteins that associate with the tagged protein are co-purified, and members of the purified protein complexes are identified using mass spectrometric techniques. Two main tagging approaches have been used for the large-scale mapping of the S. cerevisiae interactome: the FLAG-tag and the TAP-tag. The first technique involves proteins engineered with a FLAG tag, from which protein complexes are purified in a single purification step.47 The FLAG tag has the hydrophilic sequence ‘DYKDDDDK’ and can be bound by the commercially available M1 and M2 antibodies that have specific affinity for this peptide.48,49

Purification of the FLAG-tagged protein involves inserting a recombinant plasmid containing the FLAG-tagged gene into the cell, followed by immunoprecipitation of the protein complex using an antibody that binds specifically to the FLAG tag. To produce a large-scale map of the yeast interactome, Ho et al. (2002)47 overexpressed the FLAG-tagged proteins from a plasmid containing the tagged gene sequence, but the recombinant gene was not integrated into the genome. A second related technique is the tandem affinity purification (TAP) technique,50,51 which includes two purification steps. The TAP tag is composed of an immunoglobulin (Ig)G-binding domain and a calmodulin-binding peptide tag, separated by a tobacco etch virus (TEV) protease-specific cleavage site. The TAP tag uses two rounds of purification, which should reduce the number of contaminants as compared to the FLAG tag.50,51 The TAP tag is inserted into a plasmid or integrated into the genome, so the endogenous promoter controls expression of the TAP-tagged protein.38 If it is integrated into the genome, no ‘un-tagged’ copies of the targeted protein are present to compete with the TAP-tagged protein, which should lead to a high yield of the purified complex.38 To take advantage of this, Gavin et al.2,52 integrated the gene sequence for the TAP-tagged protein into the genome for their large-scale AP/MS screen of the interactome. The proteins belonging to the complex can be identified via a number of mass spectrometric approaches. For example, the purified proteins can be identified by a combination of liquid chromatography in-line with tandem mass spectrometry (LC-MS/MS).3 Alternatively, the proteins can be separated by sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE), the gels cut into small bands, and the proteins within each band identified with MS or MS/MS.2

Unlike Y2H or PCA, which measure interactions between two proteins, techniques that purify and analyse protein complexes measure co-operative interactions between proteins. Affinity purification may sometimes purify a mixture of protein complexes, in which protein complexes that share the same protein-protein interaction partners are purified together. It is often unclear which proteins are in direct contact in a protein complex captured by affinity purification. Without accurate information on the topology and architecture of the protein complex, interactions amongst members of the complex are often modelled by the ‘matrix’ or ‘spoke’ model.53 The matrix model assumes each protein interacts with all other proteins in the complex, but can overestimate the number of protein-protein interactions in the complex. This is because not all proteins in a complex contact each other physically, due to obstruction. On the other hand, the spoke model only assumes interaction between the tagged protein and its interaction partners, which tends to ignore the effect of co-operative interactions and proteins that link two or more proteins in the same complex.
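The difference between the two models can be illustrated with a short sketch. The protein names and the helper functions spoke_model and matrix_model are hypothetical and are not taken from the cited studies; the code simply enumerates the pairwise interactions that each model would infer from a single purification.

```python
from itertools import combinations


def spoke_model(bait, preys):
    """Infer interactions only between the tagged (bait) protein and each
    co-purified protein."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_model(bait, preys):
    """Infer an interaction between every pair of proteins in the purification."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical purification: bait "A" co-purifies with "B", "C" and "D".
bait = "A"
preys = ["B", "C", "D"]

spoke = spoke_model(bait, preys)
matrix = matrix_model(bait, preys)

print(len(spoke))   # 3 inferred interactions
print(len(matrix))  # 6 inferred interactions
```

For a purification with n proteins, the spoke model yields n - 1 pairs while the matrix model yields n(n - 1)/2, which is why the matrix model can overestimate the number of interactions in large complexes.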


Different technologies for detecting protein-protein interactions have different strengths and weaknesses (Table 1). The binding affinity of transient interactions is often weak, which makes them difficulty to study. Y2H and PCA can detect some transient or weak interactions, since they do not involve any of the steps used in affinity purification that can disrupt interactions.54 Affinity purification techniques favour the detection of stable protein complexes, though their sensitivity for transient interactions can be increased with minimal washing of captured proteins and differentiation of non- specific interaction partners using spectral counting.55,56 Y2H and PCA approaches are essentially performed within yeast-cells and are therefore easily scalable to higher- eukaryotic species for large-scale protein-protein interaction studies.38,57,58 Affinity purification requires the tagged protein to be expressed in the cell of the organism under investigation, and are presently not performed on a large-scale in higher- eukaryotic organisms. PCA, TAP tag and FLAG tag do not typically alter the original localization of the proteins, which also allows modification enzymes in the corresponding cellular compartment to preserve the original post-translational modification of the proteins under study. Y2H has difficulty in detecting interactions involving long proteins, unstructured proteins, or proteins with high isoelectric point.59

Affinity purification is slightly biased towards the detection of cytosolic interactions and abundant proteins.2,3 In sum, different technologies for detecting protein-protein interactions have complementary strengths and weaknesses, and a combination of several approaches is required to comprehensively map the interactome.


Table 1. Comparison of technologies that measure protein-protein interactions.

[Table 1 body not recoverable from the source text.]

a: Depends entirely on how the experiment was set up. In theory, either TAP-tag or FLAG-tag could be integrated into the genome, which would maintain the original abundance of the tagged protein. The table indicates how the experiment was set up in the corresponding references.

1.3 Protein interaction networks

Protein interaction networks can be built from experimental data, and usually involve integrating data from several large-scale protein-protein interaction analyses.4,5,54,62 Before the network is analysed, data that are likely to be artifactual need to be removed, keeping interactions that are reliable and likely to be biologically relevant. This is usually achieved by using a consensus approach where interactions found in the majority of experiments are kept for further analysis.63,64 The data are often stored in a database and built or visualized with software.65-68 The resulting networks can be analysed to understand how biological functions are regulated and what properties of the network facilitate these functions.

Protein interaction networks are known to be scale-free. In a scale-free network, there are many proteins that have low connectivity with other parts of the network, and a small number of proteins with a high number of interaction partners.63,69 Evidence for this can be found in the S. cerevisiae63 and human58 protein interaction networks. The scale-free nature of protein interaction networks has important implications for the function of proteins in the network, since proteins with large numbers of interactions, known as network ‘hubs’, are thought to be more important to the network. Although earlier studies in S. cerevisiae showed that network hubs were more likely to be essential for cell survival,63,70 this was later found to be an artefact of the sampling methods.54 Using a combined dataset of all known binary interactions in yeast and results from global phenotypic profiling, Yu et al. (2008)54 showed that the number of binary interactions a protein has correlates positively with the number of phenotypes it is associated with on knockout or mutation. This means ‘hub’ proteins are likely to be involved in multiple phenotypes, a property known as pleiotropy.54
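As an illustration of how connectivity and hubs can be read from a list of binary interactions, the following minimal sketch counts the interaction partners of each protein and tabulates the degree distribution. The toy edge list is hypothetical, and the hub cut-off of five partners is an assumption chosen for illustration; other studies may use different thresholds.

```python
from collections import Counter


def degree_counts(edges):
    """Number of distinct interaction partners for each protein."""
    unique_pairs = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for pair in unique_pairs:
        for protein in pair:
            degree[protein] += 1
    return degree


def degree_distribution(degree):
    """Map k -> number of proteins that have exactly k partners."""
    return Counter(degree.values())


def find_hubs(degree, min_partners=5):
    """Proteins with at least min_partners partners (assumed cut-off)."""
    return sorted(p for p, k in degree.items() if k >= min_partners)


# Hypothetical toy network: H is highly connected, the rest are not.
toy_edges = [("H", "P1"), ("H", "P2"), ("H", "P3"), ("H", "P4"),
             ("H", "P5"), ("P1", "P2"), ("P2", "P1")]

toy_degree = degree_counts(toy_edges)
print(dict(degree_distribution(toy_degree)))
print(find_hubs(toy_degree))
```

In a scale-free network the distribution produced by `degree_distribution` is dominated by low values of k, with only a few proteins at high k. The toy network is far too small to test this, but it shows the shape of the calculation.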

The S. cerevisiae protein interaction network is the most complete to date.71 In 2006, Hart et al.71 estimated that there are approximately 37,800 to 75,500 protein-protein interactions in yeast, including all binary interactions and interactions within protein complexes. This estimate was calculated using the number of common interactions between three datasets: Gavin et al.2, Krogan et al.3, and Ho et al.47. The number of common interactions was corrected for the estimated false-positive rate of 72%71 and scaled up to the size of the yeast interactome of 5,800 proteins (around 5,800²/2 possible protein pairs) to obtain the final estimate. While Hart et al. (2006)71 estimated that more than 50% of the yeast proteome has been mapped, Yu et al. (2008)54 found that only ~20% of the binary interactome, comprising 2,930 interactions amongst 2,018 proteins, has been mapped by combining data from all large-scale binary interaction studies.54,60,61 In contrast, the human protein interaction network is estimated to have approximately 154,000 to 369,000 interactions, of which 10% had been mapped by 2006.71 There are an estimated ~800 protein complexes in yeast,2 of which approximately 500 have been characterized by large-scale purification and analytical studies.2,3,5 For yeast cells grown in rich media, the discovery of protein complexes using techniques like TAP tag and FLAG tag is approaching saturation in terms of protein coverage.59,72 However, many more protein complexes are likely to be found when yeast is grown in a variety of environmental conditions such as starvation and exposure to chemicals, since protein-protein interactions and thus protein complexes will change in response to specific environments.72 The underrepresentation of membrane proteins in protein-protein interaction datasets contributes to the incompleteness of current datasets. This is mainly due to the difficulty in detecting interactions involving membrane proteins, which requires specialized protocols.73 Methods such as the split-ubiquitin system,46 which were designed to detect interactions involving transmembrane proteins, will help address the lack of data on interactions involving transmembrane proteins.

Using a database of protein-protein interactions and network building software, the yeast interactome can be visualized by representing each protein as a node and each binary interaction as an edge in the graph. Protein interaction networks can be built by integrating all available protein-protein interaction data. Most approaches to integrating protein-protein interaction data rely on scoring systems, which weight the reliability of an interaction by the number of times it has been observed across all protein-protein interaction datasets. The more frequently the interaction has been observed across different datasets, the greater its reliability score. Interactions with low reliability scores are usually removed from further analyses. While this scoring scheme could potentially favour stable interactions, a fairer scoring system is to count the number of experimental techniques with which the interaction has been observed. Interactions that have been detected by more experimental techniques are considered more reliable,74 and this scoring system can be used independently or in combination with other methods. There are three main approaches for building protein-protein interaction datasets: the first integrates data and represents them as pairwise interactions,63,64 the second involves literature-curation of interactions from small-scale studies,75,76 and the third integrates protein-protein interaction data and represents them as protein complexes.4-6,62 In practice, most studies use a combination of these three methods, but they are introduced separately in this review for the sake of simplicity.
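The two scoring schemes described above can be sketched as follows. The record layout (protein pair, dataset name, technique name), the dataset labels and the cut-off of two techniques are all assumptions made for illustration; they do not reproduce the scoring system of any particular study.

```python
from collections import defaultdict


def score_interactions(observations):
    """Return {pair: (number of datasets, number of techniques)}.

    observations: iterable of (protein_a, protein_b, dataset, technique).
    """
    datasets = defaultdict(set)
    techniques = defaultdict(set)
    for a, b, dataset, technique in observations:
        pair = frozenset((a, b))  # A-B and B-A are the same interaction
        datasets[pair].add(dataset)
        techniques[pair].add(technique)
    return {pair: (len(datasets[pair]), len(techniques[pair]))
            for pair in datasets}


def filter_by_techniques(scores, min_techniques=2):
    """Keep interactions seen with at least min_techniques techniques."""
    return {pair for pair, (_, n_tech) in scores.items()
            if n_tech >= min_techniques}


# Hypothetical observations from three screens.
observations = [
    ("A", "B", "screen1", "Y2H"),
    ("B", "A", "screen2", "TAP tag"),
    ("A", "C", "screen1", "Y2H"),
    ("A", "C", "screen3", "Y2H"),
]

scores = score_interactions(observations)
reliable = filter_by_techniques(scores)
print(scores[frozenset(("A", "B"))], scores[frozenset(("A", "C"))])
print(reliable == {frozenset(("A", "B"))})
```

Both interactions were observed in two datasets, so a count of datasets alone would rank them equally. Only A-B was observed with two different techniques, so it alone passes the technique-based filter.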

The first type of approach uses integrated data to generate a network composed of pairwise interactions. In one of the first comprehensive yeast interactome studies, Han et al. (2004)63 integrated data from various sources: Y2H screens,60,61,77,78 a FLAG tag screen,47 TAP tag screens,52 in silico predictions,39 and curated data from the Munich Information Center for Protein Sequences (MIPS),79 to generate the filtered yeast interactome (FYI). Each interaction in the FYI was verified by at least two different methods, which generated a network with 2,493 interactions amongst 1,379 proteins. Analysis of this network revealed that there are two types of hubs: the ‘party’ hubs and the ‘date’ hubs. Party hubs interact with all of their partners at the same place and time, and date hubs interact with different partners at a different place and/or time.63 A follow-up study by Bertin et al. (2007)64 used similar approaches to build a high-confidence network with the addition of literature-curated data76 and data from further TAP tag screens.2,3 This network included 2,561 proteins with 5,996 interactions,64 and the distinction between ‘date’ and ‘party’ hubs could still be observed. Krycer et al. (2008)40 suggested that high-quality pairwise interaction data, like FYI, are useful for informing whether proteins are likely to form a complex or sub-complexes, but are not useful for interpreting the architecture of complexes due to false-positives mediated by homo-domain-domain interactions. This often leads to an assumption that almost all proteins in a complex are in direct contact with each other, which is not possible due to spatial obstruction.

The second type of approach involves the systematic collection of protein-protein interaction data from small-scale studies in the literature, and this approach was used by Reguly et al.75 and Batada et al.76. It was shown that by changing the thresholds at which a literature-curated interaction is accepted, the network can become vastly different, which casts doubt on whether literature-curated interactions are as reliable as previously thought.80 Problems with the reliability of literature-curated interactions may be due to sampling bias, where more biologically significant proteins are studied more often,80 and the difficulty in extracting accurate information from long free-text documents.74 Initiatives that standardize the submission of protein-protein interaction data, such as the minimal information about a molecular interaction experiment (MIMIx),81 could make the extraction of protein-protein interaction data from literature more reliable.74 It is important to note that incomplete coverage of the network with available data and subtle changes in the method by which data are filtered can change the structure of networks and affect our interpretation.64,80,82

The third type of approach is the detection of protein complexes from multiple protein-protein interaction datasets. While Gavin et al.2 and Krogan et al.3 used large-scale TAP-tag to generate data on protein complexes, neither dataset covers the whole yeast interactome and their data on purified protein complexes can be merged to produce a bigger dataset. There have been a number of computational efforts to map the S. cerevisiae ‘complexome’ using existing protein-protein interaction data. These include work by Pu et al.,6 Collins et al.,62 Hart et al.,4 and Wang et al.,5 who used various algorithms to cluster protein-protein interactions into protein complexes, but these algorithms will not be discussed further here. A common finding between these studies is that integrating data from several large-scale screens of the yeast interactome2,3,47,52,60,61 can considerably improve the coverage and sensitivity of detecting protein complexes.5 In addition, computational strategies can be used to deal with the issue of contaminating proteins in large-scale screens of protein complexes using affinity purification and mass spectrometry. The most common strategy involves assigning a weight to each pairwise interaction, which represents the probability that the interaction is correct.3 Based on the weight of the interactions, proteins are clustered into mutually exclusive protein complexes using the Markov cluster algorithm3,83 or similar procedures.4,6,62 This can remove contaminating proteins that bind promiscuously to multiple complexes. However, some approaches, such as the Markov cluster algorithm, can mistakenly combine some complexes which are known to contain shared subunits into one complex.4
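The general idea of clustering weighted interactions with a Markov cluster procedure can be sketched as below. This is a minimal textbook-style version that alternates expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation). It is not the implementation used in the studies cited above, and the inflation value, self-loop weight, convergence tolerance and toy weights are all assumptions made for illustration.

```python
def normalise_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalise columns."""
    return normalise_columns([[v ** power for v in row] for row in m])


def markov_cluster(proteins, weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster proteins from {(a, b): weight}; returns sorted member lists."""
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in weights.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = w
    for i in range(n):
        m[i][i] = 1.0  # self-loops (assumed weight of 1)
    m = normalise_columns(m)
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        change = max(abs(new[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = new
        if change < tol:
            break
    # Each row that retains non-negligible entries defines a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(proteins[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical weighted interactions forming two separate groups.
toy_weights = {("A", "B"): 0.9, ("A", "C"): 0.9, ("B", "C"): 0.9,
               ("D", "E"): 0.8, ("D", "F"): 0.8, ("E", "F"): 0.8}
complexes = markov_cluster(list("ABCDEF"), toy_weights)
print(complexes)
```

On real data, a weakly weighted contaminant loses its share of the random walk at each inflation step and tends to drop out of clusters it does not belong to. Two complexes that share strongly weighted subunits may instead be drawn into a single cluster.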

A protein complex consists of two or more proteins that interact to form a molecular machine.2 Proteins of the same protein complex are usually involved in similar biological processes and they can have synergistic or regulatory effects on each other’s function.2,3 These complexes usually contain a group of proteins that interact extensively with each other, which can be identified computationally using clustering approaches that group the proteins together.3 Analysis of the complexome has generated insights into the ‘modularity’ and ‘hierarchy’ of the protein interaction network. Gavin et al.2 used an iterative clustering algorithm to classify proteins within complexes into “cores”, “modules”, and “attachment” proteins. “Core” proteins are found in almost all purifications of the same complex. “Module” proteins are proteins that are shared by two or more protein complexes, and can be seen as ‘modular’ components of complexes. “Attachment” proteins are found only in some of the purifications, may not be essential for complex formation,72 and are more likely to be false positive interactions.40 Gavin et al.2 have described 147 “modules” in S. cerevisiae that are shared amongst 491 protein complexes, which supports the hypothesis that sub-complexes can act as shared modules in protein interaction networks. An example of protein complexes with a shared sub-complex is the TFIID transcription factor complex and the SAGA complex, which share the proteins Taf5, 6, 9, 10, and 12.2,4 This is also an example of two different complexes formed by mutually exclusive interactions, in which the TFIID transcription factor “attachment” proteins and the SAGA “attachment” proteins bind mutually exclusively to the “core” proteins of Taf5, 6, 9, 10, and 12. By contrast, Hart et al.4 and Wang et al.5 suggested that the complexome has a ‘hierarchical’ structure. In this view, proteins are individual units which form stable complexes,5 and complexes are connected to other parts of the network through interactions between complexes. Proteins that form pairwise interactions with multiple complexes or other proteins in the network can mediate interactions between protein complexes, and these interactions can be transient.4,5,67,68 An example of interactions between complexes is the set of pairwise interactions between histone proteins and several chromatin-modifying complexes such as the ISW1 complex, HAT1 complex and the RSC complex.5

Integrating structural data with protein-protein interaction data could improve the reliability of the protein-protein interaction data, and could lead to better understanding of interaction dynamics. To build the structural interaction network, Kim et al. (2006)23 incorporated protein structural data with a filtered yeast protein interaction network by mapping Pfam domains84 to each protein. A pairwise interaction was kept in the network only if it could be explained by a structure of a protein complex, or a homologue of the complex inferred by using domain-domain interactions from iPfam.85 The protein-protein interaction interface for each pairwise interaction was found using three-dimensional (3-D) exclusion. Hub proteins, those with 5 or more interaction partners, could be classified into two categories: singlish-interface hubs and multiple-interface hubs.23 Singlish-interface hubs can bind to multiple partners in a mutually exclusive manner; only one protein can bind to the same interface at any one time. The interaction partners of singlish-interface proteins can change dynamically to mediate different functions.23 Multiple interaction partners of singlish-interface hubs arise mainly through gene duplication. While there are other possible mechanisms such as the gain of a novel interaction, these will not be explored here. In contrast, many proteins can bind to different interfaces on a multiple-interface hub simultaneously, and these may also correspond to protein complexes containing a higher proportion of essential proteins.23

Using network visualization tools, protein interaction networks containing thousands of proteins and edges can be represented as a map. We note that this representation is only one of several possible representations. For example, another visualization format is the matrix layout. A ‘1’ in a cell of the matrix represents an interaction between the two proteins labelled in the corresponding column and row, and a ‘0’ represents the lack of interaction. The advantage of visualization is that it can show complicated interactions simultaneously and intuitively. The layout of the network represented by nodes and edges can be adjusted automatically using layout algorithms, such that densely connected nodes are placed closer together and nodes less central to the network are placed closer to the outer perimeter of the map.86 Many different types of proteomics parameters can be co-visualized in the network by using various representations for the nodes and edges, such as different colour, label, size, and shape.86 Applications such as Cytoscape65 and VisANT66 can visualize interaction networks in two dimensions, but it is difficult to interpret high numbers of overlapping and intersecting edges in a two-dimensional graph.67 Ho et al. (2008)67 used GEOMI24 to represent the graph in three dimensions. To clearly visualize interactions between complexes,67 each protein complex can be simplified as a node in the network and interactions between individual subunits of each complex can be ignored to reduce visual complexity. This can provide a higher-level perspective on how protein complexes co-ordinate their function with each other.4,5,67,68 Widjaja et al. (2009)68 developed the Interactorium, which has extended visualization approaches by visualizing the cell in several levels of detail. Interactions can be visualized at the level of cell organelles, as higher-level interactions between protein complexes, as interactions within a complex, and, at the most detailed level, as the structure of a protein if its structure has been solved.68 Visualization tools can act as a platform to integrate and map biological data to a network,24 and they can be coupled with a database engine to perform instant queries and generate hypothesis-driven views on the network. The next challenge will be to visualize the dynamics of protein interaction networks. This includes visualization of changes in protein abundance, localization, and protein-protein interaction over time, and under different environmental conditions.86
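The matrix layout and the simplification of complexes into single nodes can both be expressed in a few lines. The sketch below is illustrative only: the four proteins, their edges and their assignment to two hypothetical complexes (X and Y) are invented, and the functions do not correspond to the interface of Cytoscape, VisANT, GEOMI or the Interactorium.

```python
def adjacency_matrix(proteins, edges):
    """Symmetric 0/1 matrix; rows and columns follow the order of proteins."""
    index = {p: i for i, p in enumerate(proteins)}
    matrix = [[0] * len(proteins) for _ in proteins]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix


def collapse_complexes(edges, complex_of):
    """Replace each protein by its complex and drop within-complex edges."""
    collapsed = set()
    for a, b in edges:
        node_a = complex_of.get(a, a)  # proteins outside a complex stay as-is
        node_b = complex_of.get(b, b)
        if node_a != node_b:
            collapsed.add(frozenset((node_a, node_b)))
    return collapsed


proteins = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
complex_of = {"A": "X", "B": "X", "C": "Y", "D": "Y"}  # hypothetical

matrix = adjacency_matrix(proteins, edges)
for row in matrix:
    print(row)
print(collapse_complexes(edges, complex_of))
```

After collapsing, the three protein-level edges reduce to a single edge between complexes X and Y, which is the reduction in visual complexity described above.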

1.4 Integrating protein-protein interaction data with genomics and proteomics data

Integration of protein-protein interaction data with genomics and proteomics data is important for understanding the dynamics of protein-protein interactions. Gene expression data, together with information on transcriptional networks,87 can help explain the regulation of gene expression. Temporal gene expression data, such as when genes are expressed in the cell cycle88 and time series analysis of microarray data,89 can help determine whether the timing of gene expression plays an important role in the function of biological pathways and protein-protein interactions of interest. Proteins need to be co-localized for them to interact. Information on the sub-cellular localization of proteins,90 including membrane localization and proteins present in multiple locations or involved in intracellular trafficking, provides information on their spatial constraints. Protein abundance,91,92 half-lives93 and translation rate94 can provide insight into how the concentration of proteins may affect protein-protein interactions. The structure of proteins is also important for understanding the function of protein-protein interactions. Information on domain-domain interactions23,85 and domain-motif interactions14 is valuable for understanding how proteins interact. Data on post-translational modifications can provide information on signalling pathways, and provide insights into the conditional binding effect. The high coverage of the yeast protein interaction networks and the wealth of genomics, proteomics, and functional data available for S. cerevisiae make it an ideal model for exploring the dynamics of the interactome. Nevertheless, this task remains a challenge due to a lack of data on the dynamics of the interactome and the difficulty in capturing spatial-temporal change in protein-protein interactions in experiments.

1.5 Proteomic technology to quantify protein absolute abundance, half-lives, and translation rate

The function of a cell can be regulated by changes in protein abundance. For example, a cell will express proteins that control growth and division when environmental conditions are favourable, but will down-regulate these proteins and up-regulate proteins involved in various stress responses when the cell is exposed to environmental stressors. Wilkins and Kummerfeld (2008)28 introduced the term ‘abundance-based effect’ to describe how changes in the abundance of proteins can affect the dynamics of protein-protein interactions or the dynamic composition of protein complexes. This may lead to changes in cellular processes. Testing of the abundance-based effect can be assisted by technologies that measure protein abundance, protein half-life and translation rate. For example, these data could help us explore whether proteins in the same complex have similar stoichiometry, whether some proteins are more tightly regulated as a means of controlling protein-protein interactions, and whether the composition of protein complexes changes under different conditions.

Protein abundance, protein translation rate and half-life are important aspects of the dynamics of the proteome. The range of intracellular protein abundance spans several orders of magnitude and in eukaryotes can range from less than 50 copies/cell to more than 10⁶ copies/cell.91,92 For both small-scale and large-scale studies on intracellular protein abundances, the presence of highly abundant proteins often obscures the detection of proteins with low abundance. While this problem can be solved with the latest mass spectrometry technology and genetic techniques developed for model organisms such as S. cerevisiae,92 adapting these methods to study the dynamics of protein interaction networks for other organisms and under many different environmental conditions remains a challenge.

In the past several years, there have been major developments in technology for the analysis of protein abundance, protein half-lives and translation rates for S. cerevisiae. In a pioneering study, Ghaemmaghami et al. (2003)91 performed a proteome-wide study of protein abundance under normal growth conditions. TAP-tag insertion was performed on approximately 6,200 proteins of the yeast proteome, and was successful for 98% of these. Protein products were observed for approximately 3,400 of the TAP-tagged ORFs under normal growth conditions. Protein abundance was measured for each TAP-tagged protein using western blotting and chemiluminescence detection of a TAP-specific anti-calmodulin binding peptide (anti-CBP) antibody.91 Protein abundance data were collected for 3,868 proteins. On average, there were 4,800 copies of each protein expressed per mRNA transcript,91 when protein abundance was compared to microarray analysis of transcripts.95 It is important to note that these results may be inaccurate in some cases, since the TAP-tag could significantly affect protein expression and/or the half-lives of some proteins.2 Whilst previous attempts at measuring protein abundance using mass spectrometry were biased towards highly abundant proteins, Ghaemmaghami et al. (2003)91 detected several classes of proteins with low abundance, such as transcription factors and cell-cycle proteins.

Newman et al. (2008)96 used GFP-tagged proteins and flow cytometry to measure protein abundance with single-cell resolution. A GFP-tag was integrated into 4,159 yeast proteins with one GFP-tagged protein in each strain. The fluorescence level of cells of each strain was then measured using flow cytometry. The intensity of the fluorescence is proportional to the abundance of the protein, and this was used as a measure of protein abundance. Since fluorescence intensity does not directly measure the number of copies of the protein per cell, protein abundance was represented in arbitrary units. The use of flow cytometry allows the selection of cells of similar size, which minimizes the effect of heterogeneity in cell size on protein abundance measurement. The use of GFP-tagged proteins and flow cytometry also enabled quantitative measurement of protein expression variance (or noise) amongst a population of cells. There was a large difference in noise, represented as the coefficient of variation (CV), for proteins with similar levels of expression. Proteins with a high level of noise were involved in stress response, amino acid biosynthesis, and heat shock. In contrast, proteins with low noise were involved in translation and protein degradation. The advantage of GFP technology is that it measures protein abundance in a single cell, and this contrasts with other technologies that measure levels from a population of cells.
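The coefficient of variation is the standard deviation of the single-cell measurements divided by their mean. A minimal sketch of the calculation is given below, assuming hypothetical fluorescence intensities in arbitrary units and using the population standard deviation; whether a population or sample standard deviation is more appropriate depends on the study design.

```python
from statistics import mean, pstdev


def coefficient_of_variation(intensities):
    """Noise as population standard deviation divided by the mean."""
    return pstdev(intensities) / mean(intensities)


# Hypothetical single-cell intensities (arbitrary units) for two proteins
# with the same mean expression but different cell-to-cell variation.
quiet_protein = [100, 102, 98, 100]
noisy_protein = [50, 150, 100, 100]

cv_quiet = coefficient_of_variation(quiet_protein)
cv_noisy = coefficient_of_variation(noisy_protein)
print(round(cv_quiet, 4), round(cv_noisy, 4))
```

Both hypothetical proteins have a mean intensity of 100, yet their CVs differ by more than an order of magnitude, which is the kind of difference in noise at similar expression levels described above.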

In a related study, Cohen et al. (2008)97 used a fluorescent labelling strategy and time-lapse fluorescence video microscopy to visualise and measure protein expression of human cancer cells in response to a drug. Using an engineered cancer cell line in which a cytoplasmic protein was tagged with a red fluorescent protein (mCherry), each cell could be distinguished from the background using automated software. Each targeted protein was tagged with an enhanced yellow fluorescent protein (eYFP), whose fluorescence intensity, in arbitrary units, was then used to measure protein abundance for each cell. These measurements can be used to estimate the noise, or cell-to-cell variance, in protein expression. This method can also track temporal changes in protein abundance. On the whole, methods developed by Newman et al. (2008)96 and Cohen et al. (2008)97 are major advances because they can measure protein abundance at the level of a single cell. This reveals that protein abundance can vary within a population of cells, even though they have the same genes and are grown in similar conditions.

Absolute protein expression (APEX) is a mass-spectrometric method for measuring protein abundances in a mixture of proteins.98 APEX relies on the use of multidimensional protein identification technology (MudPIT), which is 2-D liquid chromatography coupled with online tandem mass spectrometry. The analysis involves the identification of proteins from peptide fragmentation spectra, accounting for the mass spectrometry sampling depth, and counting the number of times each peptide has been observed. In APEX, the observed peptide counts for a protein are used to estimate the absolute protein abundance in copies per cell, based on a set of formulae that convert observed peptide counts to protein abundance. Protein abundance is calculated by accounting for the likelihood that a peptide would be observed by the mass spectrometer, based on a number of properties of the peptide such as molecular weight, solubility, and differential ionization. When Lu et al.98 compared protein expression data measured using APEX with mRNA expression data from several types of microarray analysis,95,99-103 it was shown that an average of ~5,600 copies of each protein were expressed per mRNA, which is close to the earlier estimate by Ghaemmaghami et al.,91 and that mRNA abundance explains 73% of the variance in protein expression in yeast.98
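The principle of converting peptide counts into copies per cell can be sketched as follows. The formulae of Lu et al. are not reproduced in this review, so the normalisation below is a simplified assumption. Each protein's observed count is divided by the number of peptides expected to be observed for that protein, and the resulting ratios are scaled so that they sum to an assumed total number of protein molecules per cell. All input values are hypothetical.

```python
def count_based_abundance(observed, expected, total_molecules_per_cell):
    """Scale observed/expected peptide-count ratios to copies per cell.

    observed: {protein: observed peptide count}
    expected: {protein: expected number of observable peptides}, which in
              practice would be predicted from peptide properties
    total_molecules_per_cell: assumed total protein molecules per cell
    """
    ratios = {p: observed[p] / expected[p] for p in observed}
    total = sum(ratios.values())
    return {p: total_molecules_per_cell * r / total
            for p, r in ratios.items()}


# Hypothetical two-protein mixture.
observed_counts = {"P1": 30, "P2": 10}
expected_peptides = {"P1": 3.0, "P2": 2.0}

abundance = count_based_abundance(observed_counts, expected_peptides, 1500)
print(abundance)
```

Dividing by the expected number of observable peptides is what corrects for proteins that yield many, or easily detected, peptides. Without that correction, raw counts would overstate the abundance of such proteins.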

In the two previous studies, approximately 20% of the predicted yeast proteome was not detected. This may be due to many different issues, but one important issue is the difficulty in detecting proteins of low abundance.92 More recently, selected reaction monitoring (SRM) has been used to analyse the yeast proteome, including low abundance proteins with less than 50 copies/cell and highly abundant proteins.92 This covers the full dynamic range of protein abundance in the yeast cell. This is currently the most accurate method of protein abundance measurement, and is a significant improvement in sensitivity and accuracy over other MS-based approaches.104-106 This method relies on the detection of proteotypic peptides, which are peptides preferentially detected for any particular protein during an LC-MS/MS experiment.107 The abundance of a targeted protein is measured by monitoring the signal intensity of its proteotypic peptide, in comparison to the signal intensity for a known quantity of the same peptide that has been isotopically labelled and mixed into the sample as an internal reference. For example, Picotti et al. (2009)92 selected up to 5 proteotypic peptides per protein by screening PeptideAtlas,108 a large proteomic data repository that includes yeast-specific data. For proteins with fewer than 5 proteotypic peptides found, more were predicted using the computational tool PeptideSieve. This predicts whether tryptic peptides are likely to be proteotypic based on their sequence and physico-chemical properties.107 However, the accuracy of this technique is reduced when proteins do not have any reliable proteotypic peptides. This method can be applied to study changes in protein expression for a biological protein network under different environmental conditions, and at different time points. The above are only a few of the MS-based approaches that are in use for measuring protein abundance. Other possible methods include using proteomics databases such as PeptideAtlas to help plan and improve MS-based quantitative analysis,109 and the use of the highly accurate and sensitive Velos Orbitrap mass spectrometer to quantitate protein abundance.110
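The arithmetic behind quantification against an isotopically labelled reference peptide can be sketched as below. The sketch assumes that the endogenous (light) and labelled (heavy) peptides give the same signal per molecule, that digestion is complete, and that the number of cells in the sample is known; all numerical values are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_cell(light_intensity, heavy_intensity, spiked_fmol, n_cells):
    """Estimate copies per cell from a light/heavy signal ratio.

    light_intensity: signal of the endogenous proteotypic peptide
    heavy_intensity: signal of the isotopically labelled reference peptide
    spiked_fmol: known amount of reference peptide added, in femtomoles
    n_cells: number of cells that contributed to the sample
    """
    endogenous_fmol = (light_intensity / heavy_intensity) * spiked_fmol
    molecules = endogenous_fmol * 1e-15 * AVOGADRO
    return molecules / n_cells


# Hypothetical measurement: the endogenous peptide gives twice the signal
# of 10 fmol of reference peptide, in a sample prepared from 1e8 cells.
estimate = copies_per_cell(2.0e5, 1.0e5, spiked_fmol=10.0, n_cells=1e8)
print(round(estimate, 1))
```

Because the result depends only on the ratio of the two signals, variation in injection volume or instrument response affects both peptides equally and cancels out.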

Protein half-lives and translation rate will influence protein abundance in a cell. The half-lives of proteins are affected by endogenous protease degradation, and degradation via the ubiquitin/proteasome pathway. After treating yeast cells with the protein translation inhibitor cycloheximide, Belle et al. (2006)93 quantified the half-life of TAP-tagged proteins by measuring protein abundance at several time points, using western blotting and chemiluminescence detection of a TAP-specific anti-CBP antibody.91 The half-life of the protein was then quantified by measuring the rate of change of protein abundance. Other methods to measure protein abundance, such as MS-based approaches, can also be used to measure protein half-lives when cycloheximide is used to inhibit protein translation. For example, Doherty et al. (2009)111 used stable isotope labelling with amino acids in cell culture (SILAC) to measure protein half-lives in human cells. Belle et al.93 reported the relationships between protein half-life,93 protein production rate estimated from ribosome density,112,113 and protein abundance.91 It was found that proteins with similar abundance, half-lives, and protein production rate also had similar functions.93 For example, ribosomal proteins and enzymes had long half-lives, whilst transcription factors and cell cycle proteins were rapidly degraded. From these data, Belle et al. (2006)93 used half-lives and protein production rate to predict protein abundance in copies per cell using differential equations. Elsewhere, protein translation rate has been measured using polysome profiling and deep sequencing of ribosome-protected mRNA fragments.94 Translation rate is estimated from the density of ribosomes bound to the mRNA sequence for each protein, and was found to provide better estimates of protein abundance than mRNA level alone. This method could also find the position at which the ribosome binds to the mRNA with single-codon precision, and could be used to find protein translation starting at non-AUG start sites. Ribosome densities were generally found to be greater at the 5' end of the mRNA than at the 3' end, thus explaining why shorter genes were better translated.94 Since this method is high-throughput and can be multiplexed, it can easily be adapted to monitor translation rate in different cell types, disease states, or under different perturbed conditions.
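The link between a degradation time course, half-life and predicted abundance can be sketched under a simple assumption of first-order decay. Once translation is blocked, abundance is assumed to fall exponentially, so the decay rate is the negative slope of ln(abundance) against time and the half-life is ln 2 divided by that rate. At steady state, the differential equation dP/dt = production rate - decay rate × P gives P = production rate / decay rate. This model, which ignores dilution by cell growth, and the values below are illustrative and are not taken from Belle et al.

```python
import math


def decay_rate(times, abundances):
    """Least-squares slope of ln(abundance) versus time, as a positive rate."""
    logs = [math.log(a) for a in abundances]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    numerator = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    denominator = sum((t - t_mean) ** 2 for t in times)
    return -numerator / denominator


def half_life(rate):
    """Half-life of a first-order decay process."""
    return math.log(2) / rate


def steady_state_abundance(production_rate, rate):
    """Solve dP/dt = production_rate - rate * P = 0 for P."""
    return production_rate / rate


# Hypothetical time course after blocking translation (minutes, copies/cell).
times = [0, 15, 30, 45]
abundances = [1000, 500, 250, 125]

k = decay_rate(times, abundances)
t_half = half_life(k)
predicted = steady_state_abundance(100.0, k)  # assumed 100 copies/min
print(round(t_half, 2), round(predicted, 1))
```

In this hypothetical time course the abundance halves every 15 minutes. A longer half-life at the same production rate would give a proportionally higher predicted steady-state abundance.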

Whilst a number of methods have been developed for quantifying parameters related to protein expression, they have different strengths and weaknesses. Several attributes that enable these technologies to be used in high-throughput studies in different organisms are shown in Table 2. Whilst techniques that utilise tagging, such as TAP tags or GFP tags, can be targeted to a protein of particular interest, they require substantial preliminary work to engineer a tag onto that protein, which makes these techniques difficult to adapt for other organisms. Once a library of tagged strains is created, each tagged strain needs to be maintained, catalogued and stored. In addition, for TAP-tagged proteins, abundance measurements can only be performed on one protein at a time in a serial manner, and the amount of manual labour required to perform these experiments makes the TAP-tag less cost effective.114 For comparison across multiple samples, a known quantity of a TAP-tagged protein is needed to act as an internal reference, and the resolution of this method makes it only semi-quantitative. Methods which do not require genetic modifications for the analysis of protein abundance are much more adaptable to higher eukaryotic organisms. Several methods allow multiplexed measurements of protein abundance, so the abundance of hundreds of proteins can be measured in a single experiment. These methods are particularly suitable for high-throughput analysis. Whilst each of the high-throughput approaches discussed above has some weaknesses, a combination of several methods could be used to improve the accuracy of protein abundance quantification. The study by Malmström et al. (2009)115 combined three methods, namely SRM,92,116 APEX,98 and the estimation of protein abundance from the average mass spectrum peak intensity of the top three predicted peptides,117 to improve the accuracy and coverage of protein abundance measurement. The method presented by Malmström et al. (2009),115 being based on mass spectrometry, should also be adaptable to other organisms.
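The third of these estimates, the average peak intensity of the top three peptides, can be sketched in a few lines. This is a minimal illustration only: it assumes that the three most intense observed peptides stand in for the 'top three' peptides of each protein, and the function name, protein names and intensity values are hypothetical.

```python
def top3_abundance_index(peptide_intensities):
    """Average intensity of the three most intense peptides of one protein.

    Returns None when fewer than three peptides were observed, since the
    estimate is then not defined in this simplified sketch.
    """
    if len(peptide_intensities) < 3:
        return None
    top_three = sorted(peptide_intensities, reverse=True)[:3]
    return sum(top_three) / 3.0

# Hypothetical peak intensities (arbitrary units) keyed by protein name
observed = {
    "protA": [9.0e6, 4.0e6, 2.0e6, 1.0e5],
    "protB": [3.0e5, 1.0e5],
}
index = {name: top3_abundance_index(v) for name, v in observed.items()}
```

An index of this kind is relative rather than absolute, so in practice it would need to be calibrated against proteins of known abundance before being expressed in copies per cell.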

Table 2. Comparison of methods for quantifying protein abundance, half-lives, and translation rate.

| Parameter | Technology | Targeted (a) | Genetic engineering (b) | Multiplexed (c) | Accuracy for low abundance proteins (d) | Adaptability (e) | Differential protein expression studies (f) | Paper |
|---|---|---|---|---|---|---|---|---|
| Protein abundance | Western blot, TAP-tagged proteins, and TAP-specific anti-CBP antibody | Yes | Yes | No | Low | Difficult | Less suitable (g) | Ghaemmaghami et al. (2003)91 |
| Protein abundance | MudPIT, LC-MS/MS, and calculation of protein abundance using observed peptide counts | No | No | Yes | Medium | Easy | Yes | Lu et al. (2007)98 |
| Protein abundance | Selected reaction monitoring (SRM) of proteotypic peptides | Yes | No | Yes | Very high | Easy | Yes | Picotti et al. (2009)92 |
| Translation rate | Polysome profiling and deep sequencing of ribosome-protected mRNA fragments | No | No | Yes | Medium | Easy | Yes | Ingolia et al. (2009)94 |
| Protein half-life | Western blot, TAP-tagged proteins, TAP-specific anti-CBP antibody, and protein translation inhibitor cycloheximide | Yes | Yes | No | Low | Difficult | Less suitable (g) | Belle et al. (2006)93 |
| Protein expression variance (noise) | GFP-tagged proteins and high-throughput flow cytometry | Yes | Yes (h) | No (i) | Low | Medium | Yes | Newman et al. (2006)96 |

a: Whether the method can be used to target measurement to a particular protein
b: Whether genetic engineering of the organism is required for protein abundance measurement
c: Whether the measurements can be multiplexed and can be made for 100+ proteins in a single run of the experiment
d: Whether the method can accurately detect the abundance of low abundance proteins
e: Whether the method can be easily adapted for organisms other than S. cerevisiae
f: Whether it is possible to compare two or more samples of proteins to analyse changes in protein expression under different environmental conditions
g: This method is only semi-quantitative for comparing protein expression between two samples
h: This is cost effective since a library of GFP-tagged proteins is now available for general use
i: Whilst not multiplexed, the method is very high-throughput due to the use of high-throughput flow cytometry

1.6 Post-translational modifications

Post-translational modifications are an important way in which the function of many proteins is dynamically regulated. There are many different types of modification, for example phosphorylation, acetylation, methylation, ubiquitination, and sumoylation, and it has been suggested by Yang et al. (2005)118 and Hoffmann et al. (2008)119 that there are at least 200 types of modification. There are two main types of post-translational modification (PTM): the first involves the proteolytic cleavage of the polypeptide chain, and the second involves the covalent addition of chemical groups to amino acid side chains. Some commonly known modifications are restricted to certain residues. For example, phosphorylation is limited to serine, threonine, tyrosine, and histidine120 in higher eukaryotic organisms. Modifications can be either reversible or irreversible. Reversible modifications are important to the dynamic function of proteins, as the presence or absence of the modification can turn the function of the protein on or off. Reversible modifications will be the focus of this review, as they are more likely to be involved in the conditional binding effects that regulate protein-protein interactions.15 We will also limit the review of modifications in this thesis to those of intracellular proteins, since they are involved in the dynamics of protein interaction networks. Most N-linked or O-linked glycosylation is found on extracellular proteins and is not reviewed here. The exception is O-linked N-acetylglucosamine (O-GlcNAc), which is a reversible modification found on intracellular proteins and has a role in controlling transcription and cellular signalling.121

Modifications including but not limited to phosphorylation, acetylation, methylation, and ubiquitination can be found on multiple residues along the N-terminal tail of histones. This affects their interaction with other proteins, controlling chromatin compaction and the up- or down-regulation of gene expression.122 This mode of regulation of histone proteins is dubbed the ‘histone-code’ and describes how the combination of different modifications on multiple residues of histones can control their function.122 Yang (2005)118 suggested that ‘multisite protein modification’, which involves two or more modification sites on the same protein, is not limited to histones but can be found in other eukaryotic proteins. Examples include the tumour suppressor protein p53,123 the signalling protein 14-3-3,123 and RNA polymerase II.124,125 For example, phosphorylation at different positions on the C-terminal domain (CTD) of the largest subunit of RNA polymerase II regulates different biological processes. The CTD contains multiple repeats of the heptamer sequence motif Tyr1-Ser2-Pro3-Thr4-Ser5-Pro6-Ser7, which can be highly phosphorylated. Phosphorylation on Ser5 of the motif is known to be involved in early transcription termination, mRNA capping, and histone H3K4 methylation, while phosphorylation on Ser7 is known to be involved in snRNA processing.124,125 Modifications such as acetylation, methylation, neddylation, ubiquitination and sumoylation can be found on lysine residues, meaning that they can potentially compete with each other for the same lysine residues.118 In addition, up to three methyl groups can be added to the ε-amino group of a lysine residue, and different methylation states can have distinct effects on a protein’s function.126 Whilst it is possible that several kinases can target the same phosphorylation site, suggesting that there is redundancy in phosphorylation pathways, evidence for redundancy has largely been gathered from in vitro assays, which may not accurately represent the specificity of the kinase.127

The function of proteins can be affected by their post-translational modifications. The conformation of a protein can be regulated by a modification, which in turn affects its biological function. Take the case of eukaryotic glycogen phosphorylase, where the phosphorylation of residue Ser 14 by phosphorylase kinase changes the conformation of the enzyme to an activated form.128,129 This enzyme is deactivated when it is dephosphorylated by phosphatase-1.130 Modification can also affect the sub-cellular localization of the modified proteins. One example is mono-ubiquitination of the yeast protein Rnr2p, which is required for export of this protein from the nucleus to the cytoplasm.131 Similarly, methylation of the nuclear localization signal of RNA helicase A is required for nuclear import.132 Some modifications are known to affect protein half-life: poly-ubiquitination on lysines serves as a signal to target the modified protein for degradation by the ubiquitin/proteasome pathway; the 19S proteasome cap can bind to poly-ubiquitin and target the poly-ubiquitinated protein to be degraded by the proteasome.133 Whereas some modifications encourage protein-protein interactions, such as the binding of acetyllysine by bromodomains,13,15 other modifications can block protein-protein interactions. For instance, methylation is known to inhibit the interaction between heterogeneous nuclear ribonucleoprotein K and c-Src.134 Furthermore, signal transduction in the cell is largely mediated by PTMs such as phosphorylation, where multiple proteins interact through modification-binding domains and modification sites to form signalling pathways that transmit signals within the cell.13,15 Since PTMs can affect the function of proteins dynamically and regulate a diverse range of biological processes, they are crucial for our understanding of the dynamics of protein interaction networks.

Methylation is an important post-translational modification that has been widely studied on histone proteins, and recent studies suggest it could be quite widespread.135-137 Methylation is commonly found on lysine and arginine residues, and will be the focus of study in part of this thesis. Methylation of lysine is a reversible modification that involves the addition of one to three methyl groups to the amino acid’s ε-amine group, to form mono-, di- or tri-methyllysine. While studies of lysine methylation have focused on histone proteins, lysine methylation is also found on several non-histone proteins. These are mainly ribosomal proteins or proteins involved in protein translation, such as eukaryotic elongation factor 1-α, and ribosomal proteins of S. cerevisiae,138-142 S. pombe,143 A. thaliana,144 and human.145-148 Arginine methylation involves the addition of one or two methyl groups to the amino acid’s guanidino group, forming mono- or di-methylarginine. There are two types of di-methylarginine. The first type involves the addition of one methyl group to each of the two terminal nitrogen atoms of the guanidino group of arginine and is called symmetrical di-methylarginine.149 The second type involves the addition of two methyl groups to one of the terminal nitrogen atoms of the guanidino group, and is called asymmetrical di-methylarginine.149 Arginine methylation is often found in arginine- and glycine-rich motifs,150 and although arginine methylation is not directly involved in translation,151 it is involved in RNA regulation and processing,152 and also the transport of various components of ribosome assembly between the cytoplasm and nucleus.153,154 It has been suggested that monomethylated arginine can be converted to citrulline via deimination by protein arginine deiminase 4.155,156 Whilst this could be a pathway for reversing arginine methylation, it remains a controversial issue. In addition to lysine and arginine methylation, methylation can also be found on C-terminal leucine residues,157,158 and on glutamine in the conserved ‘Gly-Gly-Gln’ (GGQ) sequence motif,159 though these types of methylation are not common in the eukaryotic cell.

1.7 Different methods of detecting post-translational modifications

There are various methods for detecting the presence of post-translational modifications on a protein. Some can detect whether a protein is modified but cannot resolve the exact residue to which the modification is attached. Firstly, post-translational modifications on proteins can be observed as ‘trains’ of spots on a 2-D electrophoresis gel, caused by shifts in mass or charge due to the addition or removal of chemical groups from the protein.160 Western blotting of such protein ‘trains’ can detect whether they are indicators of modification or whether they are composed of different proteins with similar pI and molecular mass.137 The use of Western blots is discussed in more detail below. A more accurate method of identifying modified proteins is radioisotopic labelling of the chemical substrates for the modification. For example, [3H]-S-adenosyl methionine may be used for identifying protein methylation,161 and 32P can be used to identify sites of phosphorylation.162 Modified proteins containing the radioactive substrate can be identified on 2-D electrophoresis gels or 1-D SDS-PAGE gels using radiosensitive photography. In addition, modified proteins can be identified using Western blots and modification-specific antibodies.163 Antibodies that bind specifically to lysine- or arginine-methylated proteins are available for detection of methylation using Western blots,135-137 and have been used to identify lysine methylation in Mus musculus.137 The above methods are parallel but not high-throughput, and can be complemented with LC-MS/MS to identify modified proteins and their modification sites. On the other hand, large-scale screening for modified proteins can be achieved by using proteome chip technology, in which a large number of proteins are immobilized onto a surface. By incubating a proteome chip with the enzyme that catalyses the modification and a radiolabelled substrate, the modified proteins can be identified.164 While this method can identify the proteins modified by an enzyme, the proteome chip is an in vitro assay in which the specificity of the enzyme may not accurately reflect in vivo conditions. It is important to note that the methods described above can only be used to identify modified proteins; they cannot be used to locate the position of the modified amino acid in the protein sequence.

Although tandem mass spectrometry is commonly used to detect modification sites,165 peptide mass fingerprinting can also be used to search for new modifications.166 The FindMod program166 caters for this approach. It requires peptide mass spectra from a mostly pure protein, for example a spot from a 2-D gel, and examines experimental peptide masses for differences in mass from the theoretical peptides for that protein that correspond to a post-translational modification. Peptides that are potentially modified are checked to see whether they contain amino acids that can carry the modification, whether the modification is known in the organism, and whether the modified residue lies in a motif specific for the modification. Using peptide mass spectra, the FindMod program can find modified peptides, but it does not usually pinpoint the exact location of the modification site. In this thesis, we have improved the existing FindMod method in order to locate the modification site with approximately 90% accuracy.167 We developed a strategy to identify high-confidence methylation sites using five stringent filters, which take advantage of a number of replicate analyses for each protein and the presence of overlapping peptides. We used this method to find methylation sites on a global scale in S. cerevisiae, by analysing a proteome-scale set of MALDI-ToF mass spectra2 for putative methylated peptides.
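The mass-difference search that underlies this approach can be sketched as follows. This is a simplified illustration and not the FindMod implementation or the filtering strategy developed in this thesis: it only checks whether an observed peptide mass equals a theoretical peptide mass plus one to three methyl groups (14.01565 Da each, monoisotopic), and whether the peptide contains lysine or arginine. The function name, the tolerance, and the peptide masses (placeholders, not computed from the sequences) are hypothetical.

```python
METHYL = 14.01565  # monoisotopic mass added per methyl group (CH2), in Da
MASS_SHIFTS = {"monomethyl": METHYL, "dimethyl": 2 * METHYL, "trimethyl": 3 * METHYL}
METHYLATABLE = set("KR")  # lysine and arginine, the residues considered here

def find_methyl_candidates(observed_mass, theoretical, tol_ppm=50.0):
    """Return (sequence, modification) pairs consistent with the observed mass.

    theoretical: dict mapping peptide sequence -> unmodified peptide mass (Da)
    tol_ppm:     hypothetical mass tolerance in parts per million
    """
    hits = []
    for sequence, base_mass in theoretical.items():
        if not METHYLATABLE & set(sequence):
            continue  # peptide has no residue that could carry the modification
        for name, shift in MASS_SHIFTS.items():
            expected = base_mass + shift
            if abs(observed_mass - expected) / expected * 1e6 <= tol_ppm:
                hits.append((sequence, name))
    return hits

# Placeholder masses: both peptides are given the same mass on purpose
theoretical = {"AGLTK": 1000.5000, "GGSDE": 1000.5000}
hits = find_methyl_candidates(1000.5000 + 2 * METHYL, theoretical)
```

In this toy example only the lysine-containing peptide is reported as a dimethylation candidate, although both peptides match by mass. A match of this kind identifies a modified peptide but not the modified residue within it, which is why additional filters are needed to localise the site.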

Post-translational modification sites can be identified using liquid chromatography in line with tandem mass spectrometry. This approach is widespread but also has limitations. Some modified peptides are present in sub-stoichiometric amounts and require steps to increase their yield. Affinity chromatography, such as immobilized metal affinity chromatography (IMAC)168 and TiO2,127,169-171 can be used to enrich for modifications carrying a negative charge, for example phosphorylation. However, affinity chromatography is not suitable for the identification of protein methylation, in which the methyl moiety does not have a specific charge. In contrast, modification-specific antibodies can be used to enrich for modified proteins or peptides from complex mixtures prior to mass spectrometry.172 Ong et al. (2004) identified lysine- and arginine-methylation using anti-methyllysine and anti-methylarginine antibody precipitation techniques.136 This was used in combination with stable isotope labelling with amino acids in cell culture (SILAC), in which substrates labelled with heavy isotopes are added to the cell culture such that the labelled substrates are incorporated into the biochemical products of the cell, such as intermediate substrates and proteins. The biochemical products that contain the heavy isotope can be determined using mass spectrometry-based approaches. For the detection of protein N-methylation, [13CD3]methionine was added to the cell culture and converted to [13CD3]S-adenosyl methionine within the cell, a heavy isotope labelled substrate for arginine and lysine methylation.136 SILAC can improve the identification of methylation sites,136 since methylation sites must contain the heavy isotope. The isotopic label can also be used to distinguish between trimethylation and acetylation, which are near-isobaric. In addition, SILAC can be used to quantify the relative change in modification status of a protein between two samples. For example, this technique was used to quantitate the differential regulation of phosphopeptides in pheromone-treated S. cerevisiae haploid cells as compared to control cells, in which phosphorylation sites were labelled with SILAC.173 While the methods above identify only one type of modification in any one experiment, improvements in technical strategies and mass spectrometry have made possible the proteome-wide study of specific or many modifications; recent examples include the analysis of a bacterial proteome174 and a plant proteome for many modifications at once.175
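The reasoning behind the heavy methyl label can be made concrete with a short calculation. The sketch below uses standard monoisotopic masses to derive the mass gap between a 13CD3 methyl group and an unlabelled CH3 group (about 4.022 Da), and then checks whether a pair of peaks is separated by the gap expected for a given number of methyl groups. The function names and the tolerance are hypothetical, and the sketch makes no claim about how peak pairs were actually scored by Ong et al.

```python
# Monoisotopic masses in Da
H, D = 1.007825, 2.014102
C12, C13 = 12.000000, 13.003355

LIGHT_METHYL = C12 + 3 * H                      # CH3
HEAVY_METHYL = C13 + 3 * D                      # 13CD3
SHIFT_PER_METHYL = HEAVY_METHYL - LIGHT_METHYL  # about 4.022 Da

def heavy_light_shift(n_methyl_groups):
    """Mass gap expected between heavy and light forms of a methylated peptide."""
    return n_methyl_groups * SHIFT_PER_METHYL

def looks_methylated(light_mass, heavy_mass, n_methyl_groups, tol_da=0.01):
    """True if a peak pair is separated by the shift for n methyl groups.

    tol_da is a hypothetical absolute tolerance in Da.
    """
    gap = heavy_mass - light_mass
    return abs(gap - heavy_light_shift(n_methyl_groups)) <= tol_da
```

A trimethylated peptide would therefore be expected to show a heavy partner about 12.07 Da above the light form, whereas an acetylated peptide of nearly the same mass would not carry the methyl-derived label, which is how the two near-isobaric modifications can be told apart.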

Software that utilizes machine learning algorithms can be used to predict post-translational modification sites.176,177 Such programs use algorithms such as neural networks and support vector machines to learn to recognize specific patterns and physicochemical parameters associated with previously known modification sites, such as the nearby amino acids, secondary structure, hydrophobicity, surface accessibility,178 and propensity for structural disorder.179 Using the amino acid sequence as input, these programs can predict sites for a specific type of modification, which can then become targets for experimental verification. For example, prediction software specific for methylation sites includes MeMo,177 MASA,178 and BPB-PPMS.180 Some software can also predict the enzymes that catalyse the modification event; for example, the phosphorylation site prediction software which accompanies the PHOSIDA database will predict the phosphorylation site and the corresponding kinase.127 However, these prediction methods vary in accuracy and need to be used with caution. For example, the lysine acetylation site prediction software developed by Gnad et al. (2010)181 has a true-positive rate of 78%, while other programs such as LysAcet182 and PredMod183 have true-positive rates of 53% and 25% respectively.181 This suggests there can be a large difference in accuracy between different modification-site prediction programs.

1.8 Aims of this thesis

The aim of this thesis is to understand how abundance-based effects, sequence-based effects, and conditional binding effects28 can work together to influence the dynamics of protein interaction networks. This aim is addressed in five chapters in which I used bioinformatics, integrative data analysis of genomics, proteomics and functional data, and hypothesis testing to explore the dynamics of protein interaction networks in S. cerevisiae. These chapters consist of papers published or submitted for publication to peer-reviewed journals. The hypotheses explored (or tested) in each chapter, the contributions I made to each publication, and the contributions of each co-investigator, are detailed in the précis to each results chapter.

2 Are protein complexes made of cores, modules and attachments?

An important aim of this thesis is to understand whether the structure of proteins, in particular their domains, is important for the formation of protein complexes. Gavin et al. (2006)2 proposed that protein complexes are made of cores, modules, and attachments. I therefore used a series of statistical tests to determine whether interactions involving core, module, or attachment proteins from the Gavin et al. (2006)2 dataset are more likely to be mediated by domain-domain interactions. In addition, I explored whether protein abundance and protein half-lives are important in controlling the dynamics of protein complexes. I also analysed changes in protein expression under different environmental conditions, and asked whether core, module, or attachment proteins are more tightly regulated than other proteins under these conditions. These three types of proteins are expected to be regulated differently through the abundance-based effect, and therefore to affect the stability, composition, and dynamics of protein complexes in different ways.
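As an illustration only, the question of whether one class of interactions is enriched for domain-domain interactions can be framed as a one-sided test on a 2x2 table. The sketch below computes the upper-tail hypergeometric probability (a one-sided Fisher's exact test) using only the Python standard library. Whether this is one of the tests used in the chapter is not stated here, and the function name and any counts passed to it are hypothetical.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X >= a) for a 2x2 table under the hypergeometric null.

    a: core-core interactions with a domain-domain interaction
    b: core-core interactions without one
    c: other interactions with a domain-domain interaction
    d: other interactions without one
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    # Sum the probabilities of tables at least as extreme as the observed one
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / denom
```

A small probability would indicate that interactions in the first class are supported by domain-domain interactions more often than expected by chance, given the overall proportion in the dataset.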

This chapter was published as Pang, CN, Krycer, JR, Lek, A & Wilkins, MR: Are protein complexes made of cores, modules and attachments? Proteomics 2008, 8, 425-34. I designed the experiments, performed the bioinformatics and statistical analyses, and wrote the manuscript. James Robert Krycer performed preliminary analysis on protein abundance and protein half-lives, and critically evaluated the manuscript. Angela Lek wrote the Perl scripts to collect domain-domain interaction data from iPfam.85 Prof. Wilkins directed the project, contributed to the experimental designs, and critically reviewed the manuscript. A copy of this publication cannot be included here due to copyright restrictions, but it can be accessed via the following website: http://onlinelibrary.wiley.com/doi/10.1002/pmic.200700801/abstract.

3 High throughput protein-protein interaction data: clues for the architecture of protein complexes

In the third chapter,40 having previously found that domain-domain interactions are central for the formation of many complexes, I explore whether high-throughput protein-protein interaction data can provide clues for the architecture of protein complexes.40 The high-throughput protein-protein interaction data used include pairwise interaction data from the filtered yeast interactome (FYI),63 protein complexes data from Gavin et al. (2006),2 and predicted domain-domain interaction data from iPfam.85 The 19S and 20S proteasome, mediator, and SAGA complexes are analysed as case studies.40 Structural or structure-associated data in the literature for these protein complexes are compared to high-throughput data, to explore the potential of high-throughput techniques in determining the constituent proteins of complexes, the spatial proximity of proteins in complexes, and how proteins interact within the complex.

This chapter is the publication Krycer, JR, Pang, CN & Wilkins, MR: High throughput protein-protein interaction data: clues for the architecture of protein complexes. Proteome Sci, 2008, 6, 32. I generated and processed interaction datasets, generated and analysed domain-domain interaction data, provided preliminary 2-D visualization of protein complexes, drafted some sections of the manuscript, and critically revised the document. James R. Krycer built the 2-D representations of the structures, mapped data onto these, generated the figures in the manuscript and co-drafted the manuscript. Prof. Marc R. Wilkins designed and directed the project, co-drafted and critically revised the manuscript.

Proteome Science BioMed Central

Research Open Access

High throughput protein-protein interaction data: clues for the architecture of protein complexes

James R Krycer1, Chi Nam Ignatius Pang1 and Marc R Wilkins*1,2

Address: 1School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia and 2Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia Email: James R Krycer - [email protected]; Chi Nam Ignatius Pang - [email protected]; Marc R Wilkins* - [email protected] * Corresponding author

Published: 26 November 2008 Received: 18 July 2008 Accepted: 26 November 2008 Proteome Science 2008, 6:32 doi:10.1186/1477-5956-6-32 This article is available from: http://www.proteomesci.com/content/6/1/32 © 2008 Krycer et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: High-throughput techniques are becoming widely used to study protein-protein interactions and protein complexes on a proteome-wide scale. Here we have explored the potential of these techniques to accurately determine the constituent proteins of complexes and their architecture within the complex.

Results: Two-dimensional representations of the 19S and 20S proteasome, mediator, and SAGA complexes were generated and overlaid with high quality pairwise interaction data, core-module-attachment classifications from affinity purifications of complexes and predicted domain-domain interactions. Pairwise interaction data could accurately determine the members of each complex, but was unexpectedly poor at deciphering the topology of proteins in complexes. Core and module data from affinity purification studies were less useful for accurately defining the member proteins of these complexes. However, these data gave strong information on the spatial proximity of many proteins. Predicted domain-domain interactions provided some insight into the topology of proteins within complexes, but was affected by a lack of available structural data for the co-activator complexes and the presence of shared domains in paralogous proteins.

Conclusion: The constituent proteins of complexes are likely to be determined with accuracy by combining data from high-throughput techniques. The topology of some proteins in the complexes will be able to be clearly inferred. We finally suggest strategies that can be employed to use high throughput interaction data to define the membership and understand the architecture of proteins in novel complexes.

Introduction

Most cellular processes involve multiprotein complexes [1-3]. Proteins interact either transiently or stably within these complexes, with identical or different subunits. Many proteins are also members of more than one complex, and thus part of a protein-protein interaction network inside the cell [2-4]. The interaction networks are highly dynamic, allowing for rapid changes in the proteome such as to external stimuli [4]. Despite the contribution of protein complexes and interactions to the regulation and execution of biological processes, relatively few complexes are well-understood in terms of structure and function [5].


High-throughput techniques, such as yeast-two hybrid (Y2H) [6] and affinity-purification/mass-spectrometry (AP-MS), have accelerated the generation of protein-protein interaction (PPI) data on a large scale. Following pioneering studies on the interactome [7], several large-scale studies have been undertaken in yeast and other species (e.g. [3,8-10]). These have led to the development of some high quality datasets of pairwise PPIs. For instance, the filtered yeast interactome (FYI) is an intersection of different datasets, including Y2H data, AP-MS data, in silico predictions, Munich Information Centre for Protein Sequences physical interactions, and protein complexes reported in the literature (see Han et al. [11] for details). It contains 2,493 high-confidence interactions for 1,379 proteins.

An alternative to studying the interactions of individual proteins is to define all complexes in the cell (the 'complexome') and their constituent proteins. In a study by Gavin et al. [12], tandem affinity purification tagging [13] was used to define 491 protein complexes in yeast, 257 of which were novel. Multiple replicate purifications revealed that within each complex, proteins could be classified as core, module, or attachment proteins, according to the frequency of their appearance in the various forms of that complex. Core proteins were present in most purifications of a complex, whilst attachment proteins were dynamic members present only some of the time. Module proteins were two or more attachment proteins, found together in more than one complex [14]. We have recently elucidated the structural basis of Gavin et al.'s classification [12], finding that interactions between core proteins and between two or more module proteins are likely to be mediated by domain-domain interactions. Interactions within and between attachment proteins were less likely to occur in this manner [14].

A novel avenue of investigation made possible with the Y2H technique has been to build low-resolution models of complexes. This is achieved by determining the subunit architecture of complexes and the manner in which subunits interact. The yeast RNA polymerase (pol) III [15], ribonuclease P (RNase P) [16] and protein complexes involved in human mRNA degradation [17], for example, have been investigated by these means although in a focused, low-throughput way. Despite the steady accumulation of protein-protein interaction and protein complex data generated by large-scale screens, these data have not been widely used to understand the detailed architecture of protein complexes. A few studies have compared high throughput data to well-characterised complexes to define any limitations [2,18,19]. For instance, Edwards et al. [19] compared past interaction datasets to known three-dimensional structures of RNA polymerase II, Arp2/3, and the proteasome. They found that the interactions defined by individual high throughput methods were inconsistent with published information about these complexes. However, when integrated together, data from high throughput studies provided higher accuracy of interactions and greater insight into the structure of complexes [19]. Since this study, improved high-throughput datasets such as FYI [11] and Gavin et al.'s [12] data on the yeast complexome have been published. These have not been examined in detail to reveal their correlation with the architecture of known complexes nor have these data been examined to understand which can best define the members of a complex and provides clues to structural associations.

In this study, we have investigated the manner in which high throughput data or combinations of this data reflect the architecture of proteins in three large, well-defined complexes – the proteasome, the mediator and the SAGA complexes. We show that high throughput data concerning the interactions of individual proteins, particularly in combination, can accurately define the members of a protein complex. However, the same data were surprisingly poor in accurately predicting physical proximity of proteins. Data from HTP studies of protein complexes were weaker in accurately defining constituent subunits, but the core and module proteins were useful to help understand the architecture or topology of protein complexes.

Methods

Two-dimensional structural representation of protein complexes

Two-dimensional structural representations of protein complexes were generated, derived from structural or structure-associated data in the literature. A representation of the 20S core particle (CP) of the proteasome was derived from its X-ray crystal structure [20]. As the structure of the yeast 19S regulatory particle (RP) has not been elucidated, a model of the 19S RP was adapted from Ferrell and coworkers [21], based on genetic and biochemical studies, and Sharon and colleagues [22] based on high range mass-spectrometry. A structural representation of the mediator complex was derived from models proposed using electron microscopy [23] and an interaction network based on genetic and biochemical data [24]. Med19 (Rox3) was recently shown to be part of the middle module (instead of the head module) [25], and this was taken into account in our representation. A structural representation of the SAGA complex was adapted from a model derived using electron microscopy, immunolabelling, and mutant complexes [26]. This study proposed that the SAGA complex contained five modules (Domains I-V), and others have proposed that additional proteins are likely to be part of these modules [27]. SAGA-associated proteins were obtained from Daniel and Grant [28].

Page 2 of 9 (page number not for citation purposes) Proteome Science 2008, 6:32 http://www.proteomesci.com/content/6/1/32

Protein-protein interaction datasets

Experimental data for protein-protein interactions were from several sources. Data on protein complexes were from Gavin et al. [12]. Complexes 2 and 75 described the 20S and 19S proteasomal subcomplexes respectively, while Complexes 81 and 445 described the SAGA and mediator complexes respectively. Filtered yeast interactome (FYI) data were sourced from Han et al. [11], and domain-domain interaction (DDI) data were extracted from iPfam release 20.0 [29] using custom Perl scripts, version 5.8.7, as described previously [14].

Mapping interaction data onto the structures of protein complexes

To understand how data from different high-throughput analytical techniques can help accurately define the protein membership of complexes and elucidate topology, protein-protein interaction data were mapped onto our 2-D representations of the structures of complexes. Experimental pairwise interactions from the FYI dataset and DDI were represented by lines between protein nodes, whilst membership of complexes, according to Gavin et al. [12], was represented by node shading.

Results and discussion

Two-dimensional representations of protein complexes

Our investigation involved analysing the proteasome and two transcriptional coactivator complexes. These were chosen as they are large, multisubunit complexes that have well-characterised structures. The proteasome complex, responsible for the degradation of most proteins in the cell, is composed of a 20S cylindrical core particle (CP) flanked by two 19S regulatory particles (RPs), each of which contains a base and a lid [30]. The mediator complex passes information from gene-specific activators and repressors to core transcriptional machinery via its interaction with TFIIH and RNA polymerase II (RNAP II) [31,32]. It is comprised of 25 subunits, found in the head, middle, tail, and CDK modules [23,33]. The SAGA complex regulates transcription of stress-induced and highly-regulated genes via histone acetylation and direct interaction with the TATA-binding protein and other transcription factors [34,35]. It has three distinct functional modules [26,28,35,36] along with other associated proteins. We constructed 2-D representations of the proteasome (Figure 1A), mediator complex (Figure 2A) and the SAGA complex (Figure 2B) according to the Methods. Note that our representation of the proteasome considers just one half of the structure, as the proteasome is symmetrical. We sought to explore two key issues for these complexes. Firstly, whether high throughput protein-protein interaction datasets could accurately determine the protein members of the complexes and secondly, whether pairwise protein interactions, those from protein complexes or those predicted from the presence of

Figure 1. Two-dimensional structural representations of the proteasome, overlaid with filtered yeast interactome (FYI) and predicted domain-domain interaction (DDI) data. (A) The 2-D representation of the 19S RP is built from genetic, biochemical and mass-spectrometric data (see Methods) and the representation of the 20S CP is built from the known 3-D structure. Note that the proteasome is symmetrical and thus we show only the top half here. (B) Pairwise protein interaction data from FYI clearly group proteins of the 19S RP and 20S CP into two separate groups but show interactions between non-adjacent proteins. (C) Predicted domain-domain interactions are seen between members of the proteasome base. They are also seen within and between the α and β rings of the 20S CP. (D) The intersection of FYI pairwise interactions and predicted domain-domain interactions.
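The intersection shown in panel (D) can be thought of as a set operation on undirected edges. The following is a minimal sketch under that assumption; the protein names and pairs are placeholders rather than entries from the FYI dataset or iPfam, and the helper `edge` is hypothetical.

```python
def edge(a, b):
    """Undirected interaction: the order of the two protein names is irrelevant."""
    return frozenset((a, b))


# Placeholder pairs for illustration only, not data from FYI or iPfam.
fyi_edges = {edge("P1", "P2"), edge("P2", "P3"), edge("P1", "P4")}
ddi_edges = {edge("P2", "P1"), edge("P3", "P4")}

# Interactions supported by both data types (the panel D style overlay).
supported_by_both = fyi_edges & ddi_edges
# Interactions seen experimentally but with no predicted domain-domain basis.
fyi_only = fyi_edges - ddi_edges

print(sorted(sorted(pair) for pair in supported_by_both))
print(sorted(sorted(pair) for pair in fyi_only))
```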


Figure 2. Two-dimensional structural representations of the mediator and SAGA complexes, overlaid with filtered yeast interactome (FYI) and predicted domain-domain interaction (DDI) data. (A) and (B) The 2-D representations of the mediator and SAGA complexes are built from structural and interaction data, but note that 3-D structures of these complexes are not known. (C) Pairwise protein interaction data from FYI clearly defined the mediator and SAGA complexes. Interactions between many structurally non-adjacent proteins, however, are seen. (D) Very few domain-domain interactions could be predicted for the mediator or SAGA complexes.

certain domains can provide clues to the topology or architecture of a complex.

Can pairwise interactions clearly define protein complexes?

The FYI dataset is an intersection of different interaction datasets (see Han et al. [11] for details), and is enriched for high confidence, pairwise protein interactions. We investigated how these pairwise interactions, when mapped as lines connecting proteins on our 2-D representations, reflected the features of candidate complexes. For the proteasome, interactions in the FYI dataset clearly defined the 19S RP as one complex and the 20S CP as an independent complex (Figure 1B). Protein membership was also very clear, with 100% of subunits showing at least one interaction with other subunits in the same complex. Proteins in the 19S RP showed a large number of interactions with other proteins in the same complex, but far fewer interactions were seen between members of the 20S CP. It was noted, however, that the subcomplexes in the proteasome, for instance the 19S RP lid and base, could not be discerned from these data. The SAGA and mediator complexes were also clearly defined by the FYI data. All proteins of domains I to V of the SAGA complex showed multiple interactions, with most proteins showing evidence of interaction with every other protein in the complex (Figure 2C). However, it was striking that the SAGA-associated proteins showed no interactions with proteins in the SAGA complex itself; this may be due to these interactions being transient or perhaps refractory to analysis using one or more interaction-measuring techniques. In the mediator complex, proteins showed a large range in the number of their interactions, with some proteins showing 1 interaction but others showing >10. Interestingly, the CDK module subunits and protein Med1 showed no interactions with the rest of the mediator complex (Figure 2C). The CDK module is known to be a temporary inhibitor of the mediator, thus a lack of interactions may reflect the transient nature of its interactions with the mediator [32]. For the complexes examined here, it was apparent that high quality pairwise interactions, from the FYI database, could accurately indicate whether proteins are likely to form a protein complex and which proteins are the constituent subunits.

Can high throughput analyses of complexes define their constituent proteins?

To understand if data from high throughput AP-MS of complexes could accurately define the members of protein complexes, we mapped complexes from Gavin et al. [12] onto our 2-D representations. For the proteasome, the 20S and 19S subcomplexes corresponded to Gavin et al. [12] complexes 2 and 75 respectively. Encouragingly, a total of 91% of the known proteasomal subunits were seen in the high-throughput-defined complexes (see shading, Figures 3B,C). Similarly the Gavin et al. [12]


complexes 81 and 445, which correspond to the SAGA and mediator complexes respectively, showed 85% of SAGA and 80% of mediator subunits (see shading, Figures 4C,D). This indicates that AP-MS has a high true positive identification rate for the constituent proteins of these complexes. However, it was also seen that high throughput AP-MS suggested a large number of other, false positive proteins as members of these complexes [see Additional files 1, 2, 3]. An additional 10 non-proteasome proteins (29%) and 23 non-mediator/SAGA proteins (49%) were observed. This highlights that high throughput AP-MS alone may not accurately define the members of a complex, and is weaker than data from the FYI database for the clear definition of complexes and membership thereof. This contrasts with expectations that AP-MS, which seeks to purify protein complexes to homogeneity, should provide unambiguous data in this regard. It also contrasts with comments elsewhere [2,18,19] to this effect.

Pairwise interactions do not accurately reflect the architecture of complexes

We next sought to understand the degree to which protein interaction data, represented as pairwise interactions in the FYI dataset, could provide clues into the architecture or topology of proteins in complexes. We examined whether FYI pairwise interactions were likely or indeed possible, by reference to our structural representations. In the proteasome 19S RP, the FYI data described interactions between many proteins that are known to be structurally associated (e.g. those that form the rings in the base such as Rpt4 and Rpt5, see Figure 1A,B). However, a very large number of interactions were seen for many proteins that are unlikely to be structurally associated (e.g. the FYI data suggest Rpn12 to have >10 interactions, many of which are with proteins that are a considerable distance away in 3-D space, Figure 1B). In the proteasome 20S CP, there were fewer pairwise interactions documented, and many of these reflected structural associations (Figure 1B). For example, Pup3 is known to be adjacent to Pup1 and Pre1 in the β ring and proximal to Pre8 in the neighbouring α ring (Figure 1A) and this was seen in the FYI protein-protein interactions. However, the Pre1 protein showed 10 interactions, many of which were not consistent with the structural topology of the complex. For the SAGA and mediator complexes, a similar trend was observed. The FYI data described interactions between some proteins in the SAGA or mediator complexes that are structurally adjacent; however, many proteins showed an excessively large number of interactions that were not consistent with the likely positions of protein subunits in 3-D space.

There are a number of possible explanations for FYI data being weak in reflecting the structural topology of protein complexes. Yeast two-hybrid experiments, which are a key part of the FYI dataset, are known to generate false positive interactions. These can arise due to the overexpression of proteins as part of the technique, their requirement to interact in the nucleus [18] and the possible involvement of one or more endogenous subunits that bridge the 'gap' between bait and prey proteins. This bridging could explain why, along with others, Gcn5 and Spt3 apparently

Figure 3. Two-dimensional structural representation of the proteasome 19S RP and 20S CP, overlaid with high-throughput data from affinity purified protein complexes. (A) 2-D representation as in Figure 1A. (B) Core, module and attachment proteins from Complex 2 [12] overlaid on the 2-D representation. Note that the core proteins are mostly those in the 20S CP. (C) Core, module and attachment proteins from Complex 75 [12] overlaid on the 2-D representation. Note that the core proteins here are mostly those in the proteasome lid and hinge.
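An overlay of this kind amounts to comparing the set of known subunits with the set of proteins reported in a purified complex. The sketch below shows one way such a comparison could be tabulated; the function name, the placeholder protein names and the choice of denominators are assumptions for illustration, and may differ from how the percentages in this paper were calculated.

```python
def membership_summary(known_subunits, observed_members):
    """Compare known subunits of a complex with members reported by AP-MS.

    recovered_fraction: share of known subunits that were observed.
    extra_fraction: share of observed proteins that are not known subunits
    (assumed denominator: all observed proteins).
    """
    known = set(known_subunits)
    observed = set(observed_members)
    if not known or not observed:
        raise ValueError("both protein sets must be non-empty")
    recovered = known & observed
    extra = observed - known
    return {
        "recovered": recovered,
        "extra": extra,
        "recovered_fraction": len(recovered) / len(known),
        "extra_fraction": len(extra) / len(observed),
    }


# Placeholder names, not real complex data.
summary = membership_summary({"A", "B", "C", "D"}, {"A", "B", "C", "X"})
print(summary["recovered_fraction"], summary["extra_fraction"])
```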


Figure 4. Two-dimensional structural representations of the SAGA and mediator complexes, overlaid with high-throughput data from affinity purified complexes. (A) and (B) 2-D representations as in Figures 2A and 2B. (C) Core, module and attachment proteins from Complex 81 [12] overlaid on the 2-D representation. Core proteins correspond to the Spt and Ada proteins in the complex. (D) Core, module and attachment proteins from Complex 445 [12] overlaid on the 2-D representation. Core proteins correspond to almost all proteins in the mediator complex.

interact in the SAGA complex when these subunits are spatially isolated [26] (Figure 2B). Yeast two-hybrid may also detect interactions mediated by homo-domain-domain interactions that may not normally occur (see later evaluation of domain-domain interactions); this may explain some of the unexpected interactions in the proteasome 20S CP. This is an important consideration, albeit usually ignored, in the interpretation of pairwise interactions in the FYI database.

Core and module proteins reflect the architecture of a complex

From their high throughput screen of the complexome, Gavin et al. [12] proposed that complexes can have core proteins, present all the time, as well as module and attachment proteins which interact less strongly and/or only in certain conditions. To understand the relationship of core, module and attachment proteins to the architecture of protein complexes, we overlaid all protein types from Complexes 2, 75, 81, and 445 onto our 2-D representations (Figures 3, 4). Data are also given in tabular form in Additional file 2. In our analysis, we have ignored attachment proteins; this is because they are 'singletons' that were seen to occasionally interact with the cores of complexes [12]. They thus provided no information concerning the topology of a complex.

Core proteins in the proteasome, mediator and SAGA complexes (black shading, Figures 3B,C and 4C,D) were clearly seen to be interacting proteins in these complexes, particularly for the mediator complex, the proteasome 19S RP and 20S CP. It should be noted that whilst not all 20S CP core proteins were adjacent in our 2-D representations, they do in fact interact as part of the stacked α and β rings [37]. For the SAGA complex, the core proteins were seen in three groups. Interestingly, the SAGA complex is yet to be crystallised and our model was built by considering data from electron microscopy, immunolabelling studies and mutant complexes [26-28,38]. The core protein data suggest these 3 protein groups are almost certainly physically associated in the topology of the SAGA complex. Thus, for the proteins examined here and others in the Gavin et al. [12] dataset, core proteins are likely to be spatially grouped together and define aspects of the topology of complexes.

Modules, representing two or more proteins, were defined as those that are sometimes associated with the core of a complex [12]. It was expected that they reflect topological features of the complexes of interest. For the proteasome, it was clear that modules 57 and 141 were comprised of proteins that were structurally associated, being adjacent or near-adjacent in our models (see numbered proteins, Figure 3C). However, module 93, comprised of proteins Rpn3 and Nas6, was not consistent with our 2-D structural representation. Note that our model of the 19S RP was adapted from Ferrell and coworkers [21] and Sharon and colleagues [22], and was built in the absence of a 3-D


crystal structure. Thus, it remains possible that either Rpn3 or Nas6 may be incorrectly positioned or that module 93 is not true. Evidence that Nas6 interacts with Rpn3 in the literature [29] and FYI (see Figure 1B), and FYI evidence that Rpn3 interacts with its expected neighbours (Figure 1B), suggests the latter may be the case. Examining the SAGA complex, two modules were seen: Modules 84 and 146 (see numbered proteins, Figure 4C). The proteins in Module 146 are adjacent in our model and are likely to be structurally associated; this is consistent with their classification as a module. The proteins of Module 84 (Gcn5 and Taf10), whilst proposed to both interact with Spt7 [26], are not yet known to interact directly. Their existence as a module provides some evidence that this may be the case. To summarise, modules described proteins that in many cases were physically associated in a complex. This provides useful clues to the topology of proteins in a quaternary structure.

Predicted domain-domain interactions can identify structural features of a complex

Finally, predicted domain-domain interactions (DDIs) between proteins were mapped onto our 2-D representations of complexes. Domain-domain interactions are not high throughput data per se, but were examined to understand how this data type can help in the interpretation of high throughput protein-protein interactions. The predicted DDIs in the proteasome were numerous and reflected many aspects of the 3-D positions of proteins. All proteins in the 19S RP base showed domain-domain interactions with each other (see lines connecting proteins in Figure 1C). The proteins Rpt1-6 were predicted to interact with each other by a common AAA-ATPase domain (Pfam: PF00004). This is structurally accurate as they interact with each other in a hexameric ring in the order Rpt1/2/6/4/5/3 [21,39]. However, cross-ring DDIs were also predicted due to the presence of the same domain, but are unlikely to occur as they are inconsistent with the proteasome's 3-D structure. In the 20S CP, DDIs were seen within and between all proteins of the α and β rings (Figure 1C). Around-ring DDIs were seen in the 20S CP, as would be expected. Some incorrect cross-ring DDIs were also predicted, for example between Pre1 and Pre3, which are not adjacent proteins within or between the α and β rings [20,37] (Figure 1C). These were observed due to the presence of the proteasome domain (Pfam: PF00227) in all 20S CP subunits. This domain is a putative homomeric interaction domain and thus each subunit could theoretically bind to the other subunits in the complex. Interestingly, this might explain some of the FYI interactions that were inconsistent with the proteasome topology (see Figure 1B); this is further highlighted by the overlap of many proteasome 20S CP DDIs with interactions documented in the FYI database (see Figure 1D).

In the co-activator complexes, only 6 domain-domain interactions were predicted (Figure 2D). Many of these, for example Med7-Med9 and Spt7-Gcn5, are between proteins that we expect to be structurally adjacent. In contrast to the proteasome, hetero-domain-domain interactions were seen to feature here. For example, Med7 and Med9 were predicted to interact via the Med7 (Pfam: PF05983) and RNA polymerase II transcription mediator (Pfam: PF07544) domains. Part of the reason for this difference is that the interactions were extracted from the iPfam database, which is based on structural data. The structure of the proteasome is known from X-ray crystallography and mass spectrometry [22,37], whilst the structures of the SAGA and mediator complexes are mostly based on interaction studies [24] and electron microscopy [23,26]. Accordingly, the domain-domain interactions found for the co-activator complexes exist only due to the observation of similar domain pairs between other proteins in crystallised complexes.

A comparison of predicted DDIs in the proteasome and the co-activator complexes suggests that where a complex contains many proteins that are paralogs or of similar domain content, such as complexes with ring structures, DDIs are unlikely to help understand the precise structural association of subunits within a complex. However, the accurate prediction of structural associations from DDI data may be possible in complexes that contain essentially unrelated proteins whose interactions are mediated by hetero-domain-domain interactions.

Conclusion

In this study, we have generated 2-D structural representations of three well-characterised protein complexes. We compared high-throughput experimental data and DDI data against these 2-D representations to determine the degree to which the data reflect true structural associations of proteins. Whilst the 2-D representations, we believe, are a useful means to approach this analysis, it should be noted that we did not consider the stoichiometry within the complexes. Further, the 3-D structures of the co-activator complexes are inferred but unknown. Nevertheless, numerous interactions reflected structural features of the protein complexes and these are discussed below.

Complexes described by Gavin et al. [12] were useful for understanding structural associations of proteins in the proteasome, mediator and SAGA complexes. Core proteins in these complexes reflected the true interactions and associations of many proteins, whilst module proteins captured small groups of proteins that are physically co-associated. Attachment proteins were not anticipated to provide strong insight into the structure of complexes and indeed many of these were false positive interactors or interactors that, due to weak or transient interaction, are


yet to be conclusively associated with the complexes studied here.

The FYI dataset of pairwise interactions was the most useful and accurate means to determine membership of the proteasome, mediator and SAGA complexes. However, these data did not provide clear insight into the structural topology of complexes due to an over-representation of false positive interactions. This is an important observation as the FYI dataset, which might be expected to have a reduced degree of false positives as any interaction needs to be seen at least twice [11], still overestimated the degree of true interactions. Interestingly, the more widespread use of iterative Y2H interactions as pioneered in Rain et al. [7] could address this issue in the future. Our study of predicted domain-domain interactions for the proteasome and co-activator complexes revealed a variable number of such interactions, being influenced by the lack of available structural data for the co-activator complexes and the presence of shared domains in paralogous proteasomal proteins. Thus, whilst we have shown elsewhere [14] that DDIs can explain the mechanism of interaction of core and module proteins in Gavin complexes [12], the utility of DDIs to predict the 3-D topology of proteins in many complexes will require a far greater number of complexes to be studied with NMR or X-ray crystallography to better populate domain-domain interaction databases.

Having examined the insights that high throughput analyses can provide, we may ask: how can HTP data be used to help predict the topology of complexes in cases where complexes are not well characterised? Aloy et al. [40] explored this issue prior to the availability of protein core-module-attachment descriptions of complexes [12] and without reference to the FYI dataset [11,41], but managed to structurally model 42 complexes and partially model a further 12. We expect that use of these new resources in the following way would help expand on this, at least for the topology of complexes. Gavin et al. [12] complexes can be used as an initial template. An overlay of high quality pairwise interaction data, such as the FYI dataset [11,41], should be useful to eliminate spurious interactors and thus confirm protein membership of the complex. The core proteins can be considered as a group of structurally associated proteins and the modules as groups of proteins for which physical interaction (particularly for 2-protein modules) is highly likely. In many cases, the examination of domain-domain interactions in core and module proteins will assist in understanding the likely pairwise interactions that occur within the core and module of complexes; whilst this was not seen in the three complexes we have examined here, we have recently shown that this is the case for many Gavin-defined complexes in the yeast cell [14]. Finally, the resulting pairwise interaction data might be projected into a 3-D space using tools such as GEOMI [42] to construct a possible representation of the complex. The approach will be most accurate where high-quality data are available, although the lack of information from high-throughput analyses on the stoichiometry of proteins in each complex may complicate the resulting predictions.

Abbreviations

2-D: two-dimensional; HTP: high throughput; Y2H: yeast two hybrid; FYI: filtered yeast interactome; PPI: protein-protein interaction; DDI: domain-domain interaction; SAGA: Spt-Ada-Gcn5-acetyltransferase; RP: regulatory particle; CP: core particle; TFIID: transcription factor II D; NMR: nuclear magnetic resonance.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JRK built the 2-D representations of the structures, mapped data onto these, generated the figures in the manuscript and co-drafted the manuscript. CNIP generated and processed interaction data sets, generated and analysed domain-domain interaction data, drafted some sections of the manuscript and critically revised the document. MRW designed and directed the project, co-drafted and critically revised the manuscript. All authors read and approved the manuscript.

Additional material

Additional file 1. Compilation of high-throughput data for the 19S RP and 20S CP proteasomal subunits, as well as other proteins in Complexes 2 and 75 in Gavin et al. [12]. A table that illustrates the degree to which high throughput complexes reflect the proteins of the proteasome. [http://www.biomedcentral.com/content/supplementary/1477-5956-6-32-S1.pdf]

Additional file 2. Compilation of high-throughput data for proteins in the mediator and SAGA complexes and associated proteins described in Complex 81 and 445 in Gavin et al. [12]. A table that illustrates how high throughput complexes reflect the protein subunits of the mediator and SAGA complexes. [http://www.biomedcentral.com/content/supplementary/1477-5956-6-32-S2.pdf]

Additional file 3. Classification of proteasomal and coactivator protein subunits from Gavin et al. [12]. A table showing core, module and attachment classification of proteins in the proteasomal and coactivator protein complexes. [http://www.biomedcentral.com/content/supplementary/1477-5956-6-32-S3.pdf]


Acknowledgements

JRK was a recipient of a BABS University of New South Wales Scholarship. CNIP was the recipient of an Australian Postgraduate Award.

References

1. Alberts B: The cell as a collection of protein machines: preparing the next generation of molecular biologists. Cell 1998, 92(3):291-4.
2. Aloy P, Russell RB: The third dimension for protein interactions and complexes. Trends Biochem Sci 2002, 27(12):633-8.
3. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141-7.
4. Tucker CL, Gera JF, Uetz P: Towards an understanding of complex protein networks. Trends Cell Biol 2001, 11(3):102-6.
5. Goll J, Uetz P: The elusive yeast interactome. Genome Biol 2006, 7(6):223.
6. Fields S, Song O: A novel genetic system to detect protein-protein interactions. Nature 1989, 340(6230):245-6.
7. Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, et al.: The protein-protein interaction map of Helicobacter pylori. Nature 2001, 409(6817):211-5.
8. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98(8):4569-74.
9. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al.: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-36.
10. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al.: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437(7062):1173-8.
11. Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJ, Cusick ME, Roth FP, et al.: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 2004, 430(6995):88-93.
12. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-6.
13. Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Seraphin B: A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol 1999, 17(10):1030-2.
14. Pang CN, Krycer JR, Lek A, Wilkins MR: Are protein complexes made of cores, modules and attachments? Proteomics 2008, 8(3):425-34.
15. Flores A, Briand JF, Gadal O, Andrau JC, Rubbi L, Van Mullem V, Boschiero C, Goussot M, Marck C, Carles C, et al.: A protein-protein interaction map of yeast RNA polymerase III. Proc Natl Acad Sci USA 1999, 96(14):7815-20.
16. Houser-Scott F, Xiao S, Millikin CE, Zengel JM, Lindahl L, Engelke DR: Interactions among the protein and RNA subunits of Saccharomyces cerevisiae nuclear RNase P. Proc Natl Acad Sci USA 2002, 99(5):2684-9.
17. Lehner B, Sanderson CM: A protein interaction framework for human mRNA degradation. Genome Res 2004, 14(7):1315-23.
18. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403.
19. Edwards AM, Kus B, Jansen R, Greenbaum D, Greenblatt J, Gerstein M: Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends Genet 2002, 18(10):529-36.
20. Groll M, Ditzel L, Lowe J, Stock D, Bochtler M, Bartunik HD, Huber R: Structure of 20S proteasome from yeast at 2.4 A resolution. Nature 1997, 386(6624):463-71.
21. Ferrell K, Wilkinson CR, Dubiel W, Gordon C: Regulatory subunit interactions of the 26S proteasome, a complex problem. Trends Biochem Sci 2000, 25(2):83-8.
22. Sharon M, Taverner T, Ambroggio XI, Deshaies RJ, Robinson CV: Structural organization of the 19S proteasome lid: insights from MS of intact complexes. PLoS Biol 2006, 4(8):e267.
23. Chadick JZ, Asturias FJ: Structure of eukaryotic Mediator complexes. Trends Biochem Sci 2005, 30(5):264-71.
24. Guglielmi B, van Berkum NL, Klapholz B, Bijma T, Boube M, Boschiero C, Bourbon HM, Holstege FC, Werner M: A high resolution protein interaction map of the yeast Mediator complex. Nucleic Acids Res 2004, 32(18):5379-91.
25. Baidoobonso SM, Guidi BW, Myers LC: Med19(Rox3) regulates Intermodule interactions in the Saccharomyces cerevisiae mediator complex. J Biol Chem 2007, 282(8):5551-9.
26. Wu PY, Ruhlmann C, Winston F, Schultz P: Molecular architecture of the S. cerevisiae SAGA complex. Mol Cell 2004, 15(2):199-208.
27. Timmers HT, Tora L: SAGA unveiled. Trends Biochem Sci 2005, 30(1):7-10.
28. Daniel JA, Grant PA: Multi-tasking on chromatin with the SAGA coactivator complexes. Mutat Res 2007, 618(1-2):135-48.
29. Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21(3):410-2.
30. Walz J, Erdmann A, Kania M, Typke D, Koster AJ, Baumeister W: 26S proteasome structure revealed by three-dimensional electron microscopy. J Struct Biol 1998, 121(1):19-29.
31. Woychik NA, Hampsey M: The RNA polymerase II machinery: structure illuminates function. Cell 2002, 108(4):453-63.
32. Bjorklund S, Gustafsson CM: The yeast Mediator complex and its regulation. Trends Biochem Sci 2005, 30(5):240-4.
33. Collins SR, Miller KM, Maas NL, Roguev A, Fillingham J, Chu CS, Schuldiner M, Gebbia M, Recht J, Shales M, et al.: Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature 2007, 446(7137):806-10.
34. Huisinga KL, Pugh BF: A genome-wide housekeeping role for TFIID and a highly regulated stress-related role for SAGA in Saccharomyces cerevisiae. Mol Cell 2004, 13(4):573-85.
35. Sterner DE, Grant PA, Roberts SM, Duggan LJ, Belotserkovskaya R, Pacella LA, Winston F, Workman JL, Berger SL: Functional organization of the yeast SAGA complex: distinct components involved in structural integrity, nucleosome acetylation, and TATA-binding protein interaction. Mol Cell Biol 1999, 19(1):86-98.
36. Zapater M, Sohrmann M, Peter M, Posas F, de Nadal E: Selective requirement for SAGA in Hog1-mediated gene expression depending on the severity of the external osmostress conditions. Mol Cell Biol 2007, 27(11):3900-10.
37. Baumeister W, Walz J, Zuhl F, Seemuller E: The proteasome: paradigm of a self-compartmentalizing protease. Cell 1998, 92(3):367-80.
38. Wu PY, Winston F: Analysis of Spt7 function in the Saccharomyces cerevisiae SAGA coactivator complex. Mol Cell Biol 2002, 22(15):5367-79.
39. Schmidt M, Hanna J, Elsasser S, Finley D: Proteasome-associated proteins: regulation of a proteolytic machine. Biol Chem 2005, 386(8):725-37.
40. Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, et al.: Structure-based assembly of protein complexes in yeast. Science 2004, 303(5666):2026-9.
41. Bertin N, Simonis N, Dupuy D, Cusick ME, Han JD, Fraser HB, Roth FP, Vidal M: Confirmation of organized modularity in the yeast interactome. PLoS Biol 2007, 5(6):e153.
42. Ho E, Webber R, Wilkins MR: Interactive three-dimensional visualization and contextual analysis of protein interaction networks. J Proteome Res 2008, 7(1):104-12.

4 Surface accessibility of protein post-translational modifications

While the first two results chapters focused on protein complexes, the fourth and fifth chapters focus on post-translational modifications. Here, I explore whether post-translational modifications are likely to be within surface-accessible regions, and discuss how surface accessibility of post-translational modifications can be important for protein-protein interaction. This study is the most comprehensive analysis of the structural environment of post-translational modifications to date, in which 8,378 incidences of 44 types of post-translational modifications were analysed using 19 different bioinformatic tools.

This chapter has been published: Pang, CN, Hayen, A & Wilkins, MR. Surface accessibility of protein post-translational modifications. J Proteome Res, 2007, 6, 1833-45. I designed the experiments, performed the bioinformatics and statistical analyses, and wrote the manuscript. Dr. Andrew Hayen provided important technical advice on the use of statistical tests and the R programming language. Prof. Marc R. Wilkins contributed to the experimental designs, directed the project, and critically reviewed the manuscript.

Reproduced with permission from Pang, CN, Hayen, A & Wilkins, MR. Surface accessibility of protein post-translational modifications. J Proteome Res, 2007, 6, 1833-45. Copyright 2010 American Chemical Society.

Surface Accessibility of Protein Post-Translational Modifications

Chi Nam Ignatius Pang,† Andrew Hayen,‡ and Marc Ronald Wilkins*,†

Systems Biology Group, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia, and School of Mathematics and Statistics, University of New South Wales, Sydney, NSW, 2052, Australia

Received December 14, 2006

Protein post-translational modifications are crucial to the function of many proteins. In this study, we have investigated the structural environment of 8378 incidences of 44 types of post-translational modifications with 19 different approaches. We show that modified amino acids likely to be involved in protein-protein interactions, such as ester-linked phosphorylation, methylarginine, acetyllysine, sulfotyrosine, hydroxyproline, and hydroxylysine, are clearly surface associated. Other modifications, including O-GlcNAc, phosphohistidine, 4-aspartylphosphate, methyllysine, and ADP-ribosylarginine, are either not surface associated or are in a protein’s core. Artifactual modifications were found to be randomly distributed throughout the protein. We discuss how the surface accessibility of post-translational modifications can be important for protein-protein interactivity.

Keywords: post-translational modifications • protein-protein interaction • surface accessibility • intrinsic disorder • domains and linkers

Introduction

There are strong links between the biological type of a post-translational modification and its surface accessibility. An amino acid side chain that undergoes enzymatic post-translational modification (PTM) needs to be accessible on the surface of the protein. Surface-accessible regions of a protein must have more free backbone hydrogen bonds for associating with the enzymes that catalyze the modification and the reverse reaction, if required [1]. A PTM recognition domain may specifically recognize and bind to a modified amino acid only if it is on the surface of the protein or polypeptide. Modified amino acid side chains packed within the structured and ordered region of the protein would typically be inaccessible to any PTM recognition domain due to steric hindrance [2].

There are certain protein structural properties that correlate with the surface accessibility of amino acid residues in a folded polypeptide chain. These include protein linker regions, coils or loops, and disordered regions. Although PTMs may not always be in surface-accessible regions, surface-accessible amino acids would have a higher likelihood of being modified. For instance, the increased flexibility of disordered regions allows them to fold such that the amino acid side chains fit easily into a modifying enzyme’s catalytic site. This is one reason why PTMs such as acetylation, methylation, phosphorylation, and ADP-ribosylation occur mainly within regions of intrinsic disorder [1,2].

Protein interaction domains may bind and recognize post-translational modifications. There is a diverse variety of PTM recognition domains: for example, the Src homology 2 (SH2) domain binds phosphotyrosine (pTyr), the 14-3-3 domain binds phosphoserine (pSer), and acetylation and methylation of lysine residues in histones create binding sites for bromo- and chromo-domains, respectively [3]. Association of a modified protein with these interaction domains may be controlled and switched dynamically by the addition or removal of the PTM. The main criterion for this dynamic switching is the reversibility of the modification. Phosphorylation is a well-known example of this. Different combinations of post-translational modification sites and interaction domains can control a protein’s interaction partners and how they work in concert to orchestrate protein-protein interaction networks [3,4].

Several studies of protein-protein interaction networks have shown the importance of intrinsic disorder in the topology of these networks. Consensus results showed that hub proteins were enriched for intrinsically disordered regions as compared to non-hub proteins [5-7]. These disordered regions confer flexibility for hub proteins to interact with a diverse number of partners, with high specificity and low affinity. This has important implications for the reversibility of these protein-protein interactions and their role in regulating the protein-protein interaction network [7]. The function of a protein can be controlled by conformational changes upon phosphorylation of protein loops, causing a disorder-order transition and allowing phosphorylation-mediated protein-protein interaction to occur [8,9]. These interactions are mediated by the hydrogen bonding of the phosphoryl group of the phosphorylated residues by the binding domain, for example, the interaction between 14-3-3 protein and 14-3-3 binding partners [10]. Furthermore, proteins with more interaction partners have a greater likelihood of being phosphorylated [11].

* To whom correspondence should be addressed. E-mail: m.wilkins@unsw.edu.au. Phone: +612 9385 3633. Fax: +612 9385 1483.
† School of Biotechnology and Biomolecular Sciences.
‡ School of Mathematics and Statistics.

10.1021/pr060674u CCC: $37.00 © 2007 American Chemical Society. Journal of Proteome Research 2007, 6, 1833-1845. Published on Web 04/12/2007.

This manuscript asks fundamental questions about where modifications are found on proteins. As there is relatively little structural information for native, post-translationally modified proteins, it is not known which modifications are surface associated and which are not. Accordingly, here we aim to determine whether post-translationally modified amino acids are more likely to be within surface-accessible regions as compared to unmodified amino acids. First, we selected a set of proteins that contain post-translational modifications. Second, various scoring systems were used to predict molecular properties of the protein sequences, such as the protein’s surface accessibility and regions of intrinsic disorder. Third, the residues from each sequence were separated into the modified residue dataset and the unmodified residue dataset. Finally, hypothesis testing was performed to test whether there are any significant differences between the structural environment of the modified and unmodified residues. For all modified proteins documented in Swiss-Prot, we strikingly show that most reversible modifications are found on the surface of proteins. We discuss how the surface accessibility of post-translational modifications can be important for protein-protein interactivity.

Methods

Database of Modified Proteins. Sequences with PTMs of interest were extracted from the UniProt/Swiss-Prot database, release 49.7 [12], using Swissknife [13]. Modification information was taken from the MOD_RES field. The MOD_RES field includes modifications such as phosphorylation, methylation, and acetylation but does not include fatty-acid modifications or glycosylation. Swiss-Prot entries containing the O-linked N-acetylglucosamine (O-GlcNAc) modification were retrieved from Swiss-Prot release 50.4, also using Swissknife. The entries with O-GlcNAc modifications were merged with entries containing MOD_RES data to obtain the final sequence and PTM dataset for this study. Unless otherwise stated, the term post-translational modification used in this study refers only to the MOD_RES and O-GlcNAc entries.

Filtering of Modified Protein Database. As we were to subsequently examine structural properties of proteins, and had to ensure that our results were unbiased, the sequence database was filtered to remove homologues, short sequences, and membrane-bound proteins. UniRef90 merges sequences with 90% or more sequence identity from UniProt into sequence clusters [14]. Members of relevant sequence clusters of interest were extracted using custom Perl scripts and BioPerl and grouped together into FASTA files containing multiple sequences [15]. Multiple sequence alignments for these clusters were performed with ProbCons [16], with the exception of long sequences, which were aligned with ClustalW [17]. The sequence dataset was then filtered further. Sequences with fewer than 37 amino acid residues were removed from the database. This number was chosen because it approximates the length of the smallest protein structural domain in the CATH database [18]. Membrane-associated proteins, as defined by the presence of the keyword “membrane” or variants thereof in UniProt entries, were also removed, as were any proteins that were in the same UniRef90 cluster as a membrane protein. This removed all proteins containing transmembrane domains from the modified protein database. The list of Swiss-Prot accession numbers of the proteins in the final sequence database is in Table S1 (see Supporting Information).

Modified Residue Dataset. Amino acids from all proteins in our modified protein database were classified into two datasets: post-translationally modified residues and unmodified residues. The modified residue dataset contained the following information: the type and position of the modified amino acid, the Swiss-Prot accession number of the source protein, and the identification number of the source protein. Each entry in the modified residue dataset included comments on the reliability of the modification: whether it was experimentally validated, predicted by sequence alone, or assigned by taxonomic similarity. The modified residue dataset was classified by modified residue type, for example, a group for phosphoserine and another group for phosphotyrosine.

Removal of Homologous Residues from the Modified Residue Dataset. An issue with the modified residue dataset was that there were a number of modified residues documented in orthologous proteins. If all of these modified residues were kept, this would introduce bias into our subsequent analyses. Accordingly, the modified residue dataset was filtered to remove redundant entries when the same modified residue was found to be invariant in two or more protein orthologues. Only one modified residue was kept. When done for all modified residues from all modified proteins, this yielded a nonredundant modified residue dataset. In cases where modified residues came from proteins with different numbers of modifications, modified residues from the sequence with more modifications were retained. For example, the human 60S acidic ribosomal protein P0 (RLA0_HUMAN, P05388) has an experimentally verified phosphoserine at position 304. The mouse orthologue (RLA0_MOUSE, P14869) also has an experimentally verified phosphoserine at the invariant position 304. In this case, the modified residue from the human protein was retained, as the human protein was known to have a greater number of experimentally determined modifications. The result of the above classification was that some sequences in the modified protein database only contained redundant modified amino acid residues. These sequences were subsequently removed from the modified protein database to ensure that there was minimal redundancy in the unmodified residue dataset. So, returning to our example, RLA0_MOUSE was ultimately removed from the modified protein database.

Unmodified Residue Dataset. The unmodified residue dataset contains groups of the 20 unmodified amino acids, extracted from a subset of the sequences used to generate the modified protein dataset. The structure of this dataset was identical to that for the modified residue dataset. Although residues in the modified residue dataset were derived from more than one sequence in a UniRef90 cluster, only one sequence was chosen from each UniRef90 cluster to generate the unmodified residue dataset. Experimentally verified and potentially modified residues, including those assigned by taxonomic similarity, were excluded from the dataset. Because the N- and C-termini of proteins tend to be more solvent accessible than the rest of the protein, the first 5 and last 5 residues of each protein sequence were also excluded from the unmodified residue dataset. To reduce the size of the unmodified residue dataset, 10 000 residues were randomly sampled for each amino acid type. This became the final unmodified residue dataset.

Protein Surface Accessibility Prediction. The solvent or surface accessibility of modified and unmodified residues in protein sequences was predicted with four different approaches. The first three methods utilize propensity scores for each amino acid to be on the surface or buried within the core. These are: average surface accessibility (ASA) [19] with window sizes of 3, 5, 7, and 9; Kyte-Doolittle’s hydropathy scale [20] with
window size of 9; and a hydropathy scale (GOR) developed by Naderi-Manesh et al. (2001) [21] with a window size of 17. For the latter, a threshold accessibility of 9% was chosen because it has the best prediction accuracy out of the thresholds provided. The fourth method, RVP-net [22,23], uses a neural network to predict the relative solvent accessibility of amino acids.

The amino acid propensity scores for the ambiguous amino acid codes B (aspartic acid or asparagine), Z (glutamic acid or glutamine), and X (any of the 20 common amino acids) were estimated by the arithmetic mean of the scores of the possible amino acids. This estimation method was used for methods that utilize amino acid propensity scores, including DomCut, domain linker index, and George and Heringa’s linker propensity scores. The exception was that the geometric mean, rather than the arithmetic mean, was used for ASA’s smoothing window and ambiguous amino acid score estimation. The arithmetic mean of the residues’ propensity scores within a smoothing window was calculated, and this score was allocated to the central residue. Only odd-number window sizes were utilized. When the window extended beyond the N- or C-terminus of the sequence, it was shortened on the side that exceeded the sequence.

Prediction of Intrinsically Disordered Regions. The natively disordered or unstructured regions of sequences containing modified and unmodified residues were predicted with four neural network methods. The first predictor was RONN [24], which is a neural network method based on the bio-basis function. Three other neural networks from the DisEMBL package were also used [25]: the coils predictor, the hot loops predictor, and the X-ray structure missing co-ordinates predictor. The coils predictor (DCOILS) proposes loops and coils as defined by Kabsch and Sander [26]. The hot loops predictor (HOTLOOPS) estimates regions that may have a high degree of mobility and a high B-factor if the protein’s structure is elucidated with X-ray crystallography. The X-ray structure missing co-ordinates predictor (REMARK465) is trained with REMARK 465 entries in the PDB database.

Prediction of Linker versus Domain Regions. The propensity of modified and unmodified amino acids to be in a linker region was predicted with 5 different scoring systems. Linker (Linker), helical linker (Helical), and non-helical linker (Non-Helical), by George and Heringa (2003) [27], were used with window sizes of 5, 7, and 9 residues. The fourth and fifth scoring systems, Domcut [28] and domain linker index (DLI) [29], both used a window size of 15 residues.

Secondary Structure Prediction. Secondary structure of sequence regions containing modified and unmodified residues was predicted by PSIPRED to be coiled, helical, or extended [30]. Default parameters were employed, and the Swiss-Prot sequence database release 49.5 was used. The most frequently predicted secondary structure containing each type of modified residue was recorded.

Hypothesis Testing. Hypothesis tests between the modified residue dataset and unmodified residue dataset were performed to determine whether there were significant differences between the structural environments of residues in the two sets. The hypothesis tests were used to compare each structural property for each modified residue type, such as phosphoserine, to the relevant unmodified residue type, in this case unmodified serine. Statistical analyses were performed using the R statistical package [31].

Kolmogorov-Smirnov tests (K-S tests) were performed to verify if modified and unmodified residue datasets could be approximated by a normal distribution (p < 0.05). Levene’s test was performed to test whether there were approximately equal variances between the modified and unmodified residue datasets (p < 0.05).

Depending on the result of Levene’s test, two-sided paired Student’s t-tests were performed, assuming equal variance or unequal variance (p < 0.05). Where data were not normally distributed, the two-sided Wilcoxon rank sum test was used (p < 0.05). The density plot for each pair of data was drawn to graphically explore the qualitative differences between the two datasets.

Summarizing Linker and Domain Region Prediction Results. The propensity for modified and unmodified amino acids to be found in helical or non-helical linkers was determined with a decision tree. The decision tree was manually created and is shown in Figure 1. Results of statistical tests from Linker, with window sizes of 5, 7, and 9 (Linker_5,7,9), and from Domcut scores and DLI scores, both with a window size of 15 (Domcut_15 and DLI_15, respectively), were first evaluated. A vote was cast from these general linker prediction systems to determine whether a residue is likely to be within a linker region, a domain region, or a region where no significant predictions could be made. If a residue was predicted to be found in a linker region, it was further evaluated as a helical or non-helical linker region. Statistical test results from George and Heringa’s helical and non-helical scoring systems (Helical_5,7,9 and Non-Helical_5,7,9, respectively) were used to cast a vote to decide whether the residue was likely to be within a helical or non-helical linker region. Then, to ensure consistency with secondary structure prediction results, results from PSIPRED were used to determine if helical regions were helical linkers or general linkers and if coiled and extended regions were non-helical linkers or general linkers.

Figure 1. Decision tree for summarizing linker and domain region prediction results. Refer to text for more details regarding the decision tree.

Results

Databases of Proteins and Residues. To allow the surface accessibility of modified amino acids to be determined, we first prepared a database of modified and unmodified residues. Great care was taken in the preparation of this database. Regular protein tertiary structure is more likely to be absent in short polypeptides; therefore, short sequences were removed from the database. The surface accessibility of residues in membrane proteins requires more sophisticated methods to predict accurately; consequently, they were also excluded from this study. The number of sequences remaining after each step of filtering is shown in the Supporting Information (Table S2). After the removal of sequences from the database according to the methods, there were 4022 protein sequences from which modified residues were then sourced and 3796 protein sequences from which unmodified residues were then sourced. The number of modified residues resulting from this process ranged from 3255 (phosphoserine) to 2 (1-thioglycine). Where fewer than 17 incidences of a modification were found, these modifications were ignored. This ensured that only modifications with a reliable number of datapoints were analyzed.

Interpretation of Statistical Tests. Statistical tests were used to compare the structural environments of modified and unmodified residues. In all cases, we assumed that the unmodified residues represented a mixed population of both modified and unmodified amino acids, present in a variety of structural environments. The reason why modified amino acids are likely to be present in the unmodified residue dataset is that we know only a very small proportion of all post-translational modifications. If the structural environment of a population of modified residues was significantly different to their corresponding unmodified residues, the modification was classified as either surface associated or within the protein core. We illustrate this concept in Figure 2. For the subsequent presentation of results (Tables 1-4), we use orange to show a modification is significantly surface associated, yellow to show a modification is significantly protein core, and white to show that a modification is not significantly different to unmodified residues in its structural environment. Hypothesis testing results for all the prediction methods are available as Supporting Information (Tables S3-S4).

Figure 2. Graphical interpretation of statistical tests. The density estimate (white) at the center represents the distribution of structural environment for the unmodified residues. As the modification status of most amino acids is unknown, it represents a mixed population of both modified and unmodified residues. The density plots at the left and right side represent the two possible scenarios for the modified residues; they are discriminated as surface associated (orange) or buried within protein core (yellow), respectively. Density plots that overlap markedly with the density estimate for the unmodified residues (white) indicate no significant differences in their structural environment to unmodified residues. The color scheme used here is identical to that for Tables 1 to 4.

Reversible Modifications. Enzyme-mediated reversible modifications were expected to show a preference for surface-accessible and disordered region environments. We found strong evidence that phosphoserine, phosphothreonine, phosphotyrosine, and N6-acetyllysine were more likely to be found within surface-accessible and disordered regions than their unmodified counterparts (Table 1). As an example, density plots showing differences between the structural environments of phosphoserine and unmodified serine are shown (Figures 3-5). They graphically illustrate that phosphoserine is more surface accessible than unmodified serine and that it has a greater propensity to be found in regions of intrinsic disorder and coils. The structural environment predictions for phosphoserine were consistent among the various methods used.

Phosphohistidine and 4-aspartylphosphate were predicted to be in different structural environments to the other phosphoamino acids (Table 1). Phosphohistidine demonstrated no specific preference for surface or buried regions as compared to unmodified residues and no significant preferences for disordered or ordered regions. In comparison to unmodified aspartic acid, 4-aspartylphosphate was clearly predicted to be buried within the core of proteins and within ordered regions.

O-GlcNAc is a reversible modification and one which exists interchangeably with phosphoserine and phosphothreonine [32]. However, we found that there was only weak evidence that O-GlcNAc-serine or -threonine was within regions of intrinsic
disorder in comparison to the unmodified amino acid type. Further, O-GlcNAc-serine and -threonine do not show any preference in terms of surface accessibility as compared to unmodified serine or threonine. Note that the number of known O-GlcNAc modified residues was low and this reduced the statistical power of our analysis.

Table 1. Structural Environment of Reversible Modifications. Footnotes: (b) C, coils or loops; E, extended (β-sheets); H, helices. (c) L, general linker; HL, helical linker; NHL, non-helical linker; D, domain; NS, not significant.

Methylation. Methylarginine is a modification that is enzyme-mediated and may be involved in protein-protein interactions [33]. It showed a propensity to be surface accessible and in disordered regions, as compared to unmodified arginine (Table 2). The results showed striking similarity with the reversible ester-linked phosphorylation and acetyllysine that are mediated by enzyme catalysis and recognized by interaction domains. Interestingly, asymmetric dimethylarginine showed a very high average number of modifications per protein. This was due to the proteins nucleolin (P08199), RNA-binding protein EWS (Q07666), and polyadenylate binding protein 2 (Q28165) having 10, 29, and 14 asymmetric dimethylarginines, respectively.

Methylation of lysine is enzyme-mediated and is commonly found in proteins associated with nucleic acids [4]. In contrast to the strong surface accessibility results for methylarginine, there were conflicting results for mono-, di- and trimethyllysine. Most prediction methods for mono- and di-methyllysine did not show any significant results for surface accessibility or a propensity for order or intrinsic disorder. For trimethyllysine, the results for RVP-net predicted it to be within surface-accessible regions, whereas ASA, KD, and GOR methods predicted the opposite to be true. Therefore, trimethyllysine does not seem to have any clear preference for surface accessibility, due to the inconsistent results. This contrasts with our observation that trimethyllysine in histones is on surface-accessible tail regions [4].

Table 2. Structural Environment of Methylation. Footnotes: (b) C, coils or loops; E, extended (β-sheets); H, helices. (c) L, general linker; HL, helical linker; NHL, non-helical linker; D, domain; NS, not significant.

Irreversible Modifications. We grouped together a number of diverse but irreversible modifications (Table 3). These showed some differences in their structural environment. All acetylated residues, with the exception of N-terminal acetylalanine, were predicted to be surface associated and found in regions of intrinsic disorder. Interestingly, N-acetylalanine does not show a preference for surface accessibility, but there was some evidence to suggest that it is found in intrinsically disordered regions. Note that although the majority of N-acetylation modifications in the dataset are found at the N-terminus, a small number of N-acetylated residues are found within the sequence of proteins but at sites of protein cleavage. Sulfotyrosine was shown to prefer surface-accessible and disordered regions as compared to unmodified tyrosine. Pyruvic acid (serine) and ADP-ribosylarginine are unusual as they are suggested to be found within the protein’s core and in regions of structural order. Allysine and 4-carboxyglutamate did not show any conclusive result.

Hydroxylation. It is not yet known whether hydroxylation is reversible or irreversible [4]. It tends to be found multiple times in repeated protein sequences, for example in the Gly-X-X repeat of collagen. We found most hydroxylations to be clearly within surface-accessible and intrinsically disordered regions, as compared to unmodified residues (Table 4). Note, however, that the results for this modification may be biased by the high number found on proteins such as human collagen alpha-1(V) chain (P20908), which is known to have 59 hydroxylated residues of which 45 are hydroxyprolines.

Amidation. Most incidences of amidation, particularly those associated with hydrophobic residues, were preferentially found within surface-accessible and disordered regions of proteins (Table 4). The exception to this was amidated arginine and
lysine, which were not significantly more surface accessible or buried as compared to their unmodified residue counterparts. They did, however, show some propensity to be within disordered regions. The results for neutral hydrophobic residues were supported by a larger number of datapoints than for charged hydrophilic residues. Table 4 only presents the structural environment for selected amidation types; more amidation types are available as Supporting Information (Table S5).

Table 3. Structural Environment of Irreversible Modifications. Footnotes: (b) C, coils or loops; E, extended (β-sheets); H, helices. (c) L, general linker; HL, helical linker; NHL, non-helical linker; D, domain; NS, not significant.

Spontaneous or Artifactual Modifications. Spontaneous or artifactual modifications did not show any strong preferences for surface accessibility or propensity for regions of order or intrinsic disorder (Table 4). However, pyrrolidone carboxylic acid showed some evidence to be within regions of intrinsic disorder.

Linker, Domains and Preferred Secondary Structure. The linker region predictions produced by the decision tree for phosphoserine, phosphothreonine, and phosphotyrosine appeared to give high quality predictions, as they were all found to be within non-helical linker regions. This was consistent with the surface accessibility and intrinsic disorder predictions for these modifications. However, for some other modification types, such as methylarginine, there was disagreement between linker/domain predictions and other structural environment parameters. Domains are typically regions of order rather than intrinsic disorder, but this was not shown for methylarginine. Of the 44 post-translational modification types presented in Tables 1-4 (and Supporting Information Table S5), 23 were predicted as domain-associated, 8 predictions were non-helical linkers, 2 were predicted as general linkers, and 11 predictions produced statistically insignificant results. For secondary structures, the majority of post-translational modifications were predicted to be within coiled regions. Only 5 out of 44 post-translational modifications presented in Tables 1-4 and Table S5 were not predicted to be within coil regions, instead being predicted as either in helices or extended regions (β-sheets). Those five modifications were phosphohistidine, 4-aspartylphosphate, ADP-ribosylarginine, isoleucine amide, and methionine amide.

Accuracy of Predictions. Although the purpose of this manuscript is to understand the surface-accessible nature of PTMs, it is also valuable to briefly compare the results of the predictions obtained from the different methods used here. Some surface accessibility prediction methods provide consistent and sensitive hypothesis testing results, for example, ASA and RVP-net. Unexpectedly, there is relatively little difference in results from different window sizes for ASA. However, ASA_3 predictions tended to be weaker than ASA with larger window sizes. The KD_9 and GOR_17 methods also tended to make weaker predictions.

Intrinsic disorder prediction methods accurately predicted the structural environment of reversible post-translational modifications, because these modifications are likely to be within regions of intrinsic disorder. There were 18 out of 44 modification types where all four order/disorder region prediction methods agreed with each other. In these cases, the predictions are of very high confidence. In most other cases (20 out of 44), three out of four prediction methods were in agreement. These predictions are also likely to be of high confidence.

For secondary structure prediction, we used a single method, PSIPRED [30]. Although we did not use another method to provide
1838 Journal of Proteome Research • Vol. 6, No. 5, 2007 Surface Accessibility of Protein Modifications research articles

Table 4. Structural Environment of Hydroxylation, Amidation, and Spontaneous or Artifactual Modifications

Table footnotes: (b) C, coils or loops; E, extended (β-sheets); H, helices. (c) L, general linker; HL, helical linker; NHL, non-helical linker; D, domain; NS, not significant.

confirmatory predictions, PSIPRED is widely accepted as being one of the most accurate.34,35 For the prediction of linker/domain, our decision tree (Figure 1) helped make the results more succinct and interpretable. Interestingly, of the 44 modifications studied, none were predicted to be found in helical linkers. This observation also holds true for all 117 modification types we studied (see Table S3, Supporting Information).

Discussion

Here we have used a number of approaches to predict the structural environment of protein post-translational modifications. These approaches have clearly revealed that modified amino acids, with the exception of those that are spontaneous or artifactual, are found nonrandomly in proteins. As we have studied 44 types of modification, and in most cases have used more than one method to predict protein structural environments, we believe this is the most comprehensive study of modification-associated structural environments to date.

In other studies, the surface accessibility and degree of intrinsic disorder have been used to aid the prediction of PTMs. Lee et al. (2006)36 used RVP-net23 to predict the surface accessibility of proteins and used this to predict phosphorylation and sulfation sites. Any residues with surface accessibility above a threshold were predicted as modified.36 Arthur et al. (2006)37 used homology modeling of protein tertiary structure and surface-accessibility calculation of predicted structure to predict phosphorylation sites. This method, however, is limited by a lack of structural information.37 Intrinsic disorder has been used in the prediction of phosphorylation1 and methylation sites.38 Our results, in which we studied modifications in addition to those above, should be useful in the prediction of many more types of modifications on proteins.

Phosphorylation. Ester-bonded phosphorylated residues such as phosphoserine, phosphothreonine, and phosphotyrosine were strongly predicted to be within surface accessible, intrinsically disordered, and coiled linker regions. This is consistent with their involvement in phosphorylation-mediated conformational change in proteins and in phosphoamino acid-mediated protein interactions.8

Phosphohistidine and 4-aspartylphosphate were not surface associated, and 4-aspartylphosphate was clearly predicted to be within the core of proteins. Interestingly, proteins containing phosphohistidine and 4-aspartylphosphate are both involved in the bacterial two-component system. The more complex variant of the two-component system is called the phosphorelay system (reviewed in refs 39 and 40), which involves protein-protein interactions between a histidine protein kinase, which acts as a transmembrane sensor, and a cognate


Figure 3. Density estimates of surface accessibility prediction scores for phosphoserine and unmodified serine, using the ASA method. The numbers of datapoints per graph are shown within the legend. The x-axis represents the prediction scores obtained from using each of the window sizes. A higher score means that the residue is more likely to be surface accessible and vice versa. (A) window size of 3, (B) window size of 5, (C) window size of 7, and (D) window size of 9.

response regulator. Histidine kinase uses ATP to autophosphorylate a conserved histidine residue within itself.8 Subsequently, the phosphoryl group of histidine kinase is transferred to a conserved aspartic acid residue in the response regulator. In our study, phosphohistidine was predicted by PSIPRED to be within a helical region. The tertiary structure of phosphotransferase Spo0B confirms this.40 The residues surrounding phosphohistidine form an α-helix that participates in the hydrophobic interaction with the response regulator; this hydrophobic interface is well conserved throughout evolution.40 This may explain why we found phosphohistidine to be slightly hydrophobic by ASA_5, KD_9, and GOR_17. There are reasons why 4-aspartylphosphate residues are predicted to be buried within the core of proteins. In the unphosphorylated form of the relevant proteins, such as Spo0F, the unmodified aspartic acid residue is solvent exposed. Spo0F can form a complex with the histidine kinase, KinA, which allows the transfer of the phosphoryl moiety from the histidine kinase to the aspartic acid residue. Phosphorylation of Spo0F leads to a conformational change by mediating an intra-protein interaction. This causes the aspartic acid residue to become less solvent exposed and prevents hydrolysis of the high-energy acyl phosphate bond.40,41 The conformational change also exposes a previously buried surface that may participate in new protein-protein interactions. For example, upon phosphorylation of the aspartic acid at the N-terminal domain, the protein Spo0A exposes a previously buried DNA binding site at its C-terminus.8,42,43

O-Linked GlcNAc. O-GlcNAc is interchangeable with phosphoserine and phosphothreonine.32 In contrast to phosphoamino acids, O-GlcNAc showed a lack of preference for intrinsic disorder regions and was not clearly surface associated. In fact, serine O-GlcNAc seemed to prefer hydrophobic regions, as shown by the results for GOR_17. The preference of O-GlcNAc transferases for hydrophobic and amphipathic regions has been documented,44 and it has been proposed that the presence of O-GlcNAc within hydrophobic domains is responsible for disrupting protein-protein interactions.44 This differs from other modifications, which are required for protein-protein interactions to take place.

N-Acetyllysine. Almost half of the N-acetyllysine residues (37 out of 77) were from histone proteins. N-acetyllysine was clearly predicted to be surface associated in our study. The surface accessibility of acetyllysine, particularly in association with histone tails, allows it to be modified by acetyltransferase and


Figure 4. Density estimates of surface accessibility and coil region predictions for phosphoserine and unmodified serine using 4 different methods. The numbers of datapoints are shown within the legend. The x-axis values represent the prediction scores. (A) Hydropathy score developed by Kyte and Doolittle (1982) using a window size of 9 (KD_9); a higher score is associated with protein core. (B) Naderi-Manesh et al. (2001) method, using window size of 17 (GOR_17); a higher score is associated with protein core. (C) Neural network prediction method (RVP-net); a score of 100% means the residue is fully surface accessible. (D) Coils prediction from PSIPRED; a high score predicts a residue is likely to be within protein coils.

subsequently reversed by a deacetylase. It also allows N-acetyllysine to be recognized by, and interact with, bromodomains. Bromodomains are present in many eukaryotic proteins which are involved in the regulation of gene expression.4

Arginine Methylation. Arginine methylation regulates RNA processing, transcriptional regulation, signal transduction, and DNA repair.45 It is known to be found on proteins, such as histones, that are subject to multisite modification. We showed that it is likely to be surface associated and found in regions of intrinsic disorder. Surface-associated modifications, such as phosphoserine and phosphotyrosine, tend to be reversible by enzymes. Arginine methylation also meets these criteria, which supports the hypothesis that arginine methylation is reversible. Although arginine methylases are known, the demethylases are yet to be discovered. Members of the same protein family as the amine oxidase LSD1, a known lysine demethylase, may also be possible candidates for intracellular demethylase.33,45 Recent in vivo studies such as those performed by Cuthbert et al. (2004)46 and Wang et al. (2004)47 have suggested that monomethylarginine is converted to citrulline via deimination by protein arginine deiminase 4 (PAD4). Although this may be part of the pathway for reversing arginine methylation,33 a review of the in vitro studies shows that the aforementioned enzymatic activity is unlikely;48 therefore, arginine demethylation remains a controversial issue. The structural environment of methylarginine also suggests it may be associated with specific recognition domains. However, the involvement of methylarginine in protein-protein interactions remains unknown.33 A final observation is that the consistency of the structural environment of methylarginine indicates it may be associated with a specialized sequence motif. This is supported by Daily et al. (2005),38 who showed the enrichment of glycine, and the depletion of glutamic acid and glutamine, within ±11 residues around methylated arginine.

Irreversible Modifications. Irreversible modifications, such as 4-carboxyglutamate, allysine, pyruvic acid (serine), and ADP-ribosylarginine, were not expected to prefer surface-accessible or disordered regions for at least two reasons. First, they are not recognized by modification-dependent binding domains, requiring the modification to be physically


Figure 5. Density estimates of intrinsic disorder region prediction results for phosphoserine and unmodified serine using various neural network disorder prediction methods. The numbers of datapoints are shown within the legend. The x-axis values represent the prediction scores. (A) The bio-basis neural network predictor (RONN). Three neural network predictors from DisEMBL: (B) coils predictor (DCOILS), (C) X-ray crystallography protein structure missing co-ordinate predictor (REMARK465), and (D) the hot loops predictor (HOTLOOPS). For these prediction methods, the results are expressed as the probability that the residue is on the surface of the protein. For all graphs, a value close to 1 predicts that the residue is surface accessible.

accessible on a protein’s surface. Second, the irreversible nature of these modifications means that the removal of the modification by an enzyme, through binding and catalysis, does not occur. This is consistent with our results.

Hydroxylation. We found hydroxylated residues to be in regions of coiled secondary structure, in regions of high surface accessibility and intrinsic disorder. This is consistent with the high presence of hydroxylation in collagen molecules and other proteins in the extracellular matrix. The hydroxylation of proline allows additional hydrogen bonding to occur between molecules of the collagen triple helix49 and bestows collagen fibers with structural rigidity. In effect, it makes possible a structurally necessary protein-protein interaction. Although the structural environment predictions for hydroxylation are strong, the presence of repetitive hydroxylation motifs within the modified residue dataset may have introduced bias in estimating surface accessibility and intrinsic disorder.

Sulfotyrosine. Sulfotyrosine is thought to be irreversible. It modulates protein-protein interactions of secreted or membrane bound proteins by providing them with hydrogen bonds to bind other proteins.50 This similarity to the role of hydroxyproline is striking. Tyrosine sulfation is also involved in optimal receptor-ligand interactions, optimal proteolytic processing, and proteolytic activation of extracellular proteins.50,51 The importance of tyrosine sulfation for protein-protein interaction explains why this modification needs to be highly surface accessible, as predicted by our results. Note, however, that our study mostly analyzed secreted sulfated proteins because we removed membrane associated proteins from our sequence database. There are typically 3-4 acidic amino acids within ±5 residues of the tyrosine sulfation site.50 These charged residues will cause tyrosine sulfation sites to be in highly surface-accessible and disordered regions.

Amidation. Amidation occurs in proteins after they are cleaved by endoproteinases. If the cleavage site is C-terminal to a glycine, the enzymes peptidylglycine α-hydroxylating monooxygenase and peptidyl-α-hydroxyglycine α-amidating lyase will remove this glycine and amidate the adjacent upstream amino acid.52 If the upstream amino acid is neutral and hydrophobic, it will tend to favor amidation, whereas charged

hydrophilic residues are less reactive (InterPro: IPR000134, ref 53). To allow access of all enzymes, it is intuitive that this occurs in disordered and surface-accessible regions. This was supported by our results. Usually there are at least two basic residues after the glycine cleavage site,54 which contributes to this being a surface-associated, intrinsically disordered region.

Conformational Change Is Required for ADP-Ribosylation of G-Proteins. ADP-ribosylation is one of the few modifications that we predicted to be buried in the core of proteins. All incidences of ADP-ribosylation in our dataset involved the modification of the G-protein’s Gαs subunit by the cholera enterotoxin subunit A1. It is known that the GDP-bound form of the Gαs subunit cannot be ADP-ribosylated, in contrast to the GTP-bound form.55 The binding of GTP causes a conformational change and may expose a normally buried arginine to ADP-ribosylation. This locks the G-protein in the GTP-bound form, which causes the production of cAMP and the ultimate induction of diarrhoea.55 ADP-ribosylation is a relatively bulky PTM in comparison to other modifications listed in this study; the bulky size may enable it to work as a molecular “wedge”.

N-Terminal Acetylation. Most acetylated residues in our modified residue dataset, except for lysine, were found only at the N-terminal end of a protein. N-acetylation is an irreversible co-translational modification.41 Its function is largely undefined, and it is expected that there are many more N-terminally acetylated amino acids on as-yet uncharacterised proteins.41 The N-terminal region of proteins tends to be unfolded, surface accessible, and disordered, which agrees with our structural environment predictions. N-acetylalanine does not have any preferences for surface accessibility or inaccessibility; this might indicate that N-acetylalanine has a different function to other N-terminal acetylated amino acids or may just reflect its hydrophobic nature.

Spontaneous and Artifactual Modifications. Because spontaneous and artifactual modifications occur randomly and ubiquitously, they are expected to occur in any position along a protein sequence, regardless of surface accessibility or order/disorder. This was seen in our results. Pyrrolidone carboxylic acid spontaneously forms by the cyclization of N-terminal glutamine residues. It is known to be an experimental artifact.56,57 Within our modified residues dataset, 149 out of 438 pyrrolidone carboxylic acid modifications were at the N-terminus of polypeptides. A further 91 and 7 were C-terminal to arginine and lysine, respectively, potentially representing an experimental artifact formed spontaneously on trypsin cleavage of peptides during protein characterization experiments. The deamidation of asparagine may occur as a result of enzymatic modification,58 may form spontaneously under physiological conditions,59 or may be an experimental artifact.59,60 It typically forms in an N-G motif, which can be found anywhere in a protein.

Other Statistical Considerations. There are a number of aspects of the statistical tests that warrant discussion. The clustering of modifications by their biological function or position in protein, prior to statistical tests, may further improve reliability and interpretability of results. N-terminal acetylation is one example where the position of the residue in the protein is probably a more important factor than its immediate sequence environment. Repeated sequence motifs, such as those found in collagen, should also be considered to ensure minimal structural bias. However, this requires a balance of minimizing dataset redundancy with having enough data to produce statistically reliable results. With regard to the quantity of data, the number of significant results increased with the incidence of the modification. This can be clearly observed in the Supporting Information (Table S3). Although this study successfully estimated surface accessibility for 44 types of post-translational modification, similar studies for other modifications will not be possible until a greater number of modifications are experimentally found.

Evolutionary Rates of Surface-Accessible Regions and Surface-Associated Modifications. There are many proteins within a cell that contain modified amino acids and many proteins with modification-specific binding domains. The question raised is then how does the cell prevent random interactions from occurring? It is believed, for example, that the assembly of phosphorylated proteins with their corresponding protein complex takes place “just-in-time” for the required biological activity.11,61 The gene for the phosphorylated protein is regulated and is only expressed just as it is needed. When the required job is completed, the phosphorylated protein is then quickly degraded via the ubiquitin-proteasome pathway. Evidence suggests that dynamically transcribed proteins are more likely to be phosphorylated and regulated by targeted degradation.62 Moreover, proteins with more interaction partners are more likely to be phosphorylated, and these proteins are more likely to be actively targeted for degradation.11 Most importantly, these mechanisms were shown to have co-evolved independently in the cell-cycle regulation of humans, S. cerevisiae, S. pombe, and A. thaliana.62 Jensen et al. (2006)62 have shown that the evolutionary loss or gain of transcriptional regulation and post-translational modifications are highly correlated and tend to occur in unison within a short evolutionary time scale. It is tempting to mention that surface-accessible regions may have faster evolutionary rates,63,64 and similarly for disordered regions, especially those within alternatively spliced exons.65-67 These regions may evolve faster due to fewer structural constraints and more freedom for nonsynonymous amino acid substitution.66 They may have contributed to the faster evolutionary speed of phosphorylation sites and possibly of other post-translational modifications, required for the above-mentioned co-evolution.

Conclusion

Here we have shown that reversible post-translational modifications, particularly those specifically bound by recognition domains, are mainly found within surface-accessible regions of proteins. They are also more likely to be within regions of intrinsic disorder. Some irreversible modifications show strong preference for surface-association, others show preference for protein core, and some show no clear preference for structural environment. This manuscript has also shown the power of using the combination of many surface accessibility, intrinsic disorder region, and structural prediction methods to generate a consensus prediction.

Acknowledgment. C.N.I.P. is the recipient of an Australian Postgraduate Award. This research is supported in part by a University of New South Wales Faculty Research Grant. We thank Robert M. Esnouf for providing the standalone version of RONN software.

Supporting Information Available: Table S1 contains the Swiss-Prot accession number for the sequences utilized in the construction of the modified residue dataset and the unmodified residue dataset. Table S2 contains the number of sequences through each filtering step of creating the final

sequence database. Table S3 contains the summary hypothesis testing results, including the modification types with fewer than 17 datapoints, and is similar to Tables 1-4 in the text. Table S4 is similar to Table S3, but with information regarding the statistical testing results. It displays the p-value of the hypothesis test and whether the mean for the modified residue dataset was above, equal to, or below the mean for the unmodified residue dataset. This is followed by the mean for the unmodified dataset, the 95% confidence interval for the hypothesis test, whether the t-test or Wilcoxon rank sum test was used, and whether the variance was equal or unequal as determined by Levene’s test. Table S5 contains the structural environment of amidation and is similar to Tables 1-4 in the text. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Iakoucheva, L. M.; Radivojac, P.; Brown, C. J.; O’Connor, T. R.; Sikes, J. G.; Obradovic, Z.; Dunker, A. K. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 2004, 32(3), 1037-1049.
(2) Dunker, A. K.; Brown, C. J.; Lawson, J. D.; Iakoucheva, L. M.; Obradovic, Z. Intrinsic disorder and protein function. Biochemistry 2002, 41(21), 6573-6582.
(3) Pawson, T.; Nash, P. Assembly of cell regulatory systems through protein interaction domains. Science 2003, 300(5618), 445-452.
(4) Seet, B. T.; Dikic, I.; Zhou, M. M.; Pawson, T. Reading protein modifications with interaction domains. Nat. Rev. Mol. Cell Biol. 2006, 7(7), 473-483.
(5) Patil, A.; Nakamura, H. Disordered domains and high surface charge confer hubs with the ability to interact with multiple proteins in interaction networks. FEBS Lett. 2006, 580(8), 2041-2045.
(6) Ekman, D.; Light, S.; Bjorklund, A. K.; Elofsson, A. What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol. 2006, 7(6), R45.
(7) Dunker, A. K.; Cortese, M. S.; Romero, P.; Iakoucheva, L. M.; Uversky, V. N. Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J. 2005, 272(20), 5129-5148.
(8) Johnson, L. N.; Lewis, R. J. Structural basis for control by phosphorylation. Chem. Rev. 2001, 101(8), 2209-2242.
(9) Groban, E. S.; Narayanan, A.; Jacobson, M. P. Conformational changes in protein loops and helices induced by post-translational phosphorylation. PLoS Comput. Biol. 2006, 2(4), e32.
(10) Bustos, D. M.; Iglesias, A. A. Intrinsic disorder is a key characteristic in partners that bind 14-3-3 proteins. Proteins 2006, 63(1), 35-42.
(11) Batada, N. N.; Hurst, L. D.; Tyers, M. Evolutionary and physiological importance of hub proteins. PLoS Comput. Biol. 2006, 2(7), e88.
(12) Bairoch, A.; Apweiler, R.; Wu, C. H.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Natale, D. A.; O’Donovan, C.; Redaschi, N.; Yeh, L. S. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005, 33(Database issue), D154-159.
(13) Hermjakob, H.; Fleischmann, W.; Apweiler, R. Swissknife - “lazy parsing” of SWISS-PROT entries. Bioinformatics 1999, 15(9), 771-772.
(14) Wu, C. H.; Apweiler, R.; Bairoch, A.; Natale, D. A.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Mazumder, R.; O’Donovan, C.; Redaschi, N.; Suzek, B. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34(Database issue), D187-191.
(15) Stajich, J. E.; Block, D.; Boulez, K.; Brenner, S. E.; Chervitz, S. A.; Dagdigian, C.; Fuellen, G.; Gilbert, J. G.; Korf, I.; Lapp, H.; Lehvaslaiho, H.; Matsalla, C.; Mungall, C. J.; Osborne, B. I.; Pocock, M. R.; Schattner, P.; Senger, M.; Stein, L. D.; Stupka, E.; Wilkinson, M. D.; Birney, E. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12(10), 1611-1618.
(16) Do, C. B.; Mahabhashyam, M. S.; Brudno, M.; Batzoglou, S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005, 15(2), 330-340.
(17) Thompson, J. D.; Higgins, D. G.; Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22(22), 4673-4680.
(18) Pearl, F.; Todd, A.; Sillitoe, I.; Dibley, M.; Redfern, O.; Lewis, T.; Bennett, C.; Marsden, R.; Grant, A.; Lee, D.; Akpor, A.; Maibaum, M.; Harrison, A.; Dallman, T.; Reeves, G.; Diboun, I.; Addou, S.; Lise, S.; Johnston, C.; Sillero, A.; Thornton, J.; Orengo, C. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res. 2005, 33(Database issue), D247-251.
(19) Moelbert, S.; Emberly, E.; Tang, C. Correlation between sequence hydrophobicity and surface-exposure pattern of database proteins. Protein Sci. 2004, 13(3), 752-762.
(20) Kyte, J.; Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 1982, 157(1), 105-132.
(21) Naderi-Manesh, H.; Sadeghi, M.; Arab, S.; Moosavi Movahedi, A. A. Prediction of protein surface accessibility with information theory. Proteins 2001, 42(4), 452-459.
(22) Ahmad, S.; Gromiha, M. M.; Sarai, A. Real value prediction of solvent accessibility from amino acid sequence. Proteins 2003, 50(4), 629-635.
(23) Ahmad, S.; Gromiha, M. M.; Sarai, A. RVP-net: online prediction of real valued accessible surface area of proteins from single sequences. Bioinformatics 2003, 19(14), 1849-1851.
(24) Yang, Z. R.; Thomson, R.; McNeil, P.; Esnouf, R. M. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 2005, 21(16), 3369-3376.
(25) Linding, R.; Jensen, L. J.; Diella, F.; Bork, P.; Gibson, T. J.; Russell, R. B. Protein disorder prediction: implications for structural proteomics. Structure 2003, 11(11), 1453-1459.
(26) Kabsch, W.; Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12), 2577-2637.
(27) George, R. A.; Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 2002, 15(11), 871-879.
(28) Suyama, M.; Ohara, O. DomCut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics 2003, 19(5), 673-674.
(29) Dumontier, M.; Yao, R.; Feldman, H. J.; Hogue, C. W. Armadillo: domain boundary prediction by amino acid composition. J. Mol. Biol. 2005, 350(5), 1061-1073.
(30) Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999, 292(2), 195-202.
(31) R Development Core Team. R: A language and environment for statistical computing; R Foundation for Statistical Computing: Vienna, Austria, 2005. http://www.R-project.org.
(32) Slawson, C.; Hart, G. W. Dynamic interplay between O-GlcNAc and O-phosphate: the sweet side of protein regulation. Curr. Opin. Struct. Biol. 2003, 13(5), 631-636.
(33) Bannister, A. J.; Kouzarides, T. Reversing histone methylation. Nature 2005, 436(7054), 1103-1106.
(34) Eyrich, V. A.; Marti-Renom, M. A.; Przybylski, D.; Madhusudhan, M. S.; Fiser, A.; Pazos, F.; Valencia, A.; Sali, A.; Rost, B. EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics 2001, 17(12), 1242-1243.
(35) Montgomerie, S.; Sundararaj, S.; Gallin, W. J.; Wishart, D. S. Improving the accuracy of protein secondary structure prediction using structural alignment. BMC Bioinformatics 2006, 7, 301.
(36) Lee, T. Y.; Huang, H. D.; Hung, J. H.; Huang, H. Y.; Yang, Y. S.; Wang, T. H. dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 2006, 34(Database issue), D622-627.
(37) Arthur, J. W.; Sanchez-Perez, A.; Cook, D. I. Scoring of predicted GRK2 phosphorylation sites in Nedd4-2. Bioinformatics 2006, 22(18), 2192-2195.
(38) Daily, K. M.; Radivojac, P.; Dunker, A. K. In Intrinsic disorder and protein modifications: building an SVM predictor for methylation; IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB: San Diego, CA, November 2005; pp 475-481.
(39) Hoch, J. A.; Varughese, K. I. Keeping signals straight in phosphorelay signal transduction. J. Bacteriol. 2001, 183(17), 4941-4949.


(40) Varughese, K. I. Molecular recognition of bacterial phosphorelay proteins. Curr. Opin. Microbiol. 2002, 5(2), 142-148.
(41) Walsh, C. T. Posttranslational Modifications of Proteins: Expanding Nature’s Inventory, 1st ed.; Roberts and Co. Publishers: Colorado, 2006.
(42) Cho, H. S.; Pelton, J. G.; Yan, D.; Kustu, S.; Wemmer, D. E. Phosphoaspartates in bacterial signal transduction. Curr. Opin. Struct. Biol. 2001, 11(6), 679-684.
(43) Kern, D.; Volkman, B. F.; Luginbuhl, P.; Nohaile, M. J.; Kustu, S.; Wemmer, D. E. Structure of a transiently phosphorylated switch in bacterial signal transduction. Nature 1999, 402(6764), 894-898.
(44) Yang, X.; Zhang, F.; Kudlow, J. E. Recruitment of O-GlcNAc transferase to promoters by corepressor mSin3A: coupling protein O-GlcNAcylation to transcriptional repression. Cell 2002, 110(1), 69-80.
(45) Bedford, M. T.; Richard, S. Arginine methylation: an emerging regulator of protein function. Mol. Cell 2005, 18(3), 263-272.
(46) Cuthbert, G. L.; Daujat, S.; Snowden, A. W.; Erdjument-Bromage, H.; Hagiwara, T.; Yamada, M.; Schneider, R.; Gregory, P. D.; Tempst, P.; Bannister, A. J.; Kouzarides, T. Histone deimination antagonizes arginine methylation. Cell 2004, 118(5), 545-553.
(47) Wang, Y.; Wysocka, J.; Sayegh, J.; Lee, Y. H.; Perlin, J. R.; Leonelli, L.; Sonbuchner, L. S.; McDonald, C. H.; Cook, R. G.; Dou, Y.; Roeder, R. G.; Clarke, S.; Stallcup, M. R.; Allis, C. D.; Coonrod, S. A. Human PAD4 regulates histone arginine methylation levels via demethylimination. Science 2004, 306(5694), 279-283.
(48) Thompson, P. R.; Fast, W. Histone citrullination by protein arginine deiminase: is arginine methylation a green light or a roadblock? ACS Chem. Biol. 2006, 1(7), 433-441.
(49) Mizuno, K.; Hayashi, T.; Bachinger, H. P. Hydroxylation-induced stabilization of the collagen triple helix. Further characterization of peptides with 4(R)-hydroxyproline in the Xaa position. J. Biol. Chem. 2003, 278(34), 32373-32379.
(50) Moore, K. L. The biology and enzymology of protein tyrosine O-sulfation. J. Biol. Chem. 2003, 278(27), 24243-24246.
(51) Monigatti, F.; Hekking, B.; Steen, H. Protein sulfation analysis - A primer. Biochim. Biophys. Acta 2006, 1764(12), 1904-1913.
(52) Martinez, A.; Treston, A. M. Where does amidation take place? Mol. Cell. Endocrinol. 1996, 123(2), 113-117.
(53) Mulder, N. J.; Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Binns, D.; Bradley, P.; Bork, P.; Bucher, P.; Cerutti, L.; Copley, R.; Courcelle, E.; Das, U.; Durbin, R.; Fleischmann, W.; Gough, J.; Haft, D.; Harte, N.; Hulo, N.; Kahn, D.; Kanapin, A.; Krestyaninova, M.; Lonsdale, D.; Lopez, R.; Letunic, I.; Madera, M.; Maslen, J.; McDowall, J.; Mitchell, A.; Nikolskaya, A. N.; Orchard, S.; Pagni, M.; Ponting, C. P.; Quevillon, E.; Selengut, J.; Sigrist, C. J.; Silventoinen, V.; Studholme, D. J.; Vaughan, R.; Wu, C. H. InterPro, progress and status in 2005. Nucleic Acids Res. 2005, 33(Database issue), D201-205.
(54) Bradbury, A. F.; Smyth, D. G. Biosynthesis of the C-terminal amide in peptide hormones. Biosci. Rep. 1987, 7(12), 907-916.
(55) Enomoto, K.; Gill, D. M. Cholera toxin activation of adenylate cyclase. Roles of nucleoside triphosphates and a macromolecular factor in the ADP ribosylation of the GTP-dependent regulatory component. J. Biol. Chem. 1980, 255(4), 1252-1258.
(56) Sanger, F.; Thompson, E. O.; Kitai, R. The amide groups of insulin. Biochem. J. 1955, 59(3), 509-518.
(57) Awade, A. C.; Cleuziat, P.; Gonzales, T.; Robert-Baudouy, J. Pyrrolidone carboxyl peptidase (Pcp): an enzyme that removes pyroglutamic acid (pGlu) from pGlu-peptides and pGlu-proteins. Proteins 1994, 20(1), 34-51.
(58) Hochstrasser, D. F. Proteome in perspective. Clin. Chem. Lab. Med. 1998, 36(11), 825-836.
(59) Sarioglu, H.; Lottspeich, F.; Walk, T.; Jung, G.; Eckerskorn, C. Deamidation as a widespread phenomenon in two-dimensional polyacrylamide gel electrophoresis of human blood plasma proteins. Electrophoresis 2000, 21(11), 2209-2218.
(60) Wright, H. T. Nonenzymatic deamidation of asparaginyl and glutaminyl residues in proteins. Crit. Rev. Biochem. Mol. Biol. 1991, 26(1), 1-52.
(61) de Lichtenberg, U.; Jensen, L. J.; Brunak, S.; Bork, P. Dynamic complex formation during the yeast cell cycle. Science 2005, 307(5710), 724-727.
(62) Jensen, L. J.; Jensen, T. S.; de Lichtenberg, U.; Brunak, S.; Bork, P. Co-evolution of transcriptional and post-translational cell-cycle regulation. Nature 2006, 443(7111), 594-597.
(63) Goldman, N.; Thorne, J. L.; Jones, D. T. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 1998, 149(1), 445-458.
(64) Pal, C.; Papp, B.; Lercher, M. J. An integrated view of protein evolution. Nat. Rev. Genet. 2006, 7(5), 337-348.
(65) Brown, C. J.; Takayama, S.; Campen, A. M.; Vise, P.; Marshall, T. W.; Oldfield, C. J.; Williams, C. J.; Dunker, A. K. Evolutionary rate heterogeneity in proteins with long disordered regions. J. Mol. Evol. 2002, 55(1), 104-110.
(66) Romero, P. R.; Zaidi, S.; Fang, Y. Y.; Uversky, V. N.; Radivojac, P.; Oldfield, C. J.; Cortese, M. S.; Sickmeier, M.; LeGall, T.; Obradovic, Z.; Dunker, A. K. Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proc. Natl. Acad. Sci. U.S.A. 2006, 103(22), 8390-8395.
(67) Chen, F. C.; Wang, S. S.; Chen, C. J.; Li, W. H.; Chuang, T. J. Alternatively and constitutively spliced exons are subject to different evolutionary forces. Mol. Biol. Evol. 2006, 23(3), 675-682.

5 Identification of arginine- and lysine-methylation in the proteome of Saccharomyces cerevisiae and its functional implications

In this fifth chapter, I use peptide mass spectra from S. cerevisiae to search for novel arginine- and lysine-methylated residues, and discuss their functional implications. I also examine the accuracy of using the FindMod tool166 for discovering methylation sites, and investigate whether arginine and lysine methylations are more widespread in S. cerevisiae than previously thought. I test if methylation is found in specific sequence motifs, whether methylation on non-histone proteins is involved in multisite modification, and in particular, whether lysine-methylation could possibly block ubiquitination and could therefore increase the half-lives of proteins.

This chapter was published online on 5th February, 2010: Pang, CN, Gasteiger, E & Wilkins, MR. Identification of arginine- and lysine-methylation in the proteome of Saccharomyces cerevisiae and its functional implications. BMC Genomics, 2010, 11, 92. I designed the experiments, performed the bioinformatics and statistical analyses, and wrote the manuscript. Elisabeth Gasteiger developed the FindMod bulk submission interface, and critically reviewed the manuscript. Prof. Marc R. Wilkins contributed to the experimental designs, directed the project, and critically reviewed the manuscript.

Pang et al. BMC Genomics 2010, 11:92 http://www.biomedcentral.com/1471-2164/11/92

RESEARCH ARTICLE Open Access

Identification of arginine- and lysine-methylation in the proteome of Saccharomyces cerevisiae and its functional implications

Chi Nam Ignatius Pang1,2, Elisabeth Gasteiger3, Marc R Wilkins1,2*

Abstract

Background: The methylation of eukaryotic proteins has been proposed to be widespread, but this has not been conclusively shown to date. In this study, we examined 36,854 previously generated peptide mass spectra from 2,607 Saccharomyces cerevisiae proteins for the presence of arginine and lysine methylation. This was done using the FindMod tool and 5 filters that took advantage of the high number of replicate analyses per protein and the presence of overlapping peptides.

Results: A total of 83 high-confidence lysine and arginine methylation sites were found in 66 proteins. Motif analysis revealed that many methylated sites were associated with MK, RGG/RXG/RGX or WXXXR motifs. Functionally, methylated proteins were significantly enriched for protein translation, ribosomal biogenesis and assembly, and organellar organisation, and were predominantly found in the cytoplasm and ribosome. Intriguingly, methylated proteins were seen to have significantly longer half-lives than proteins for which no methylation was found. Some 43% of methylated lysine sites were predicted to be amenable to ubiquitination, suggesting that methyl-lysine might block the action of ubiquitin ligase.

Conclusions: This study suggests protein methylation to be quite widespread, albeit associated with specific functions. Large-scale tandem mass spectrometry analyses will help to further confirm the modifications reported here.

* Correspondence: [email protected]
1 School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia

© 2010 Pang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The methylation of proteins is of increasing biological interest. It is predominantly found on lysine and arginine residues, but has also been found on histidine, glutamic acid and on the carboxyl groups of proteins (reviewed in Grillo and Colombatto 2005) [1]. Methylation of lysine involves the addition of one to three methyl groups on the amino acid’s ε-amine group, to form mono-, di- or tri-methyllysine. Its function is best understood in histones. Methylation on the tails of histone proteins, in conjunction with acetylation and phosphorylation, controls their interaction with other proteins, affects chromatin compaction and the up- or down-regulation of gene expression [2]. For S. cerevisiae, lysine methylation is found in histone H3 and histone H4 [3]. Tri-methylation at H3K4 and H3K36 is positively correlated with gene activity [4], while methylation at H3K79 is involved in gene silencing [5,6]. Histone H3K79 methylation is evolutionarily conserved and is involved in several pathways, including Sir protein-mediated heterochromatic gene silencing [7], meiotic checkpoint control [8] and the G1 and S phase DNA damage checkpoint functions of Rad9p [9,10]. While studies of lysine methylation have mainly focused on histone proteins, several non-histone proteins are also known to be lysine-methylated. They are mainly ribosomal proteins or proteins involved in protein translation [11], and include Rpl12p [12,13], Rpl23p [12,14], Rpl42p [15], and eEF1Ap [16].

The methylation of arginine involves the addition of one or two methyl groups to the amino acid’s guanidino group, forming mono- or di-methylarginine. It is predominantly known to be associated with RNA regulation and processing [17]. In S. cerevisiae, Hmt1p is a type 1 arginine methyltransferase that catalyses the formation of mono- and asymmetric di-methylarginine. This enzyme is known to methylate a number of proteins that contain an RGG-motif; these include Npl3p, Hrp1p, Nab2p, Gar1p, Nop1p, Nsr1p, Yra1p, Sbp1p, and Hrb1p. These proteins have been implicated in poly(A)+ mRNA binding, processing and export [17], ribosome biogenesis [18-20] and gene silencing [21]. Moreover, methylation is required for the nuclear export of RNA binding proteins Npl3p, Hrp1p, and Nab2p [22,23]. The repeated RGG-motif was known as an RNA-binding motif [24], and this also supports the role of arginine methylation in the regulation of mRNA binding [25]. The methylation of nuclear shuttling proteins is suggested to weaken their binding with cargo proteins and disrupt their export from the nucleus [26]. Arginine methylation is also known to facilitate or block protein-protein interactions. Arginine methylation of SmB protein facilitates the binding of tudor domains in SMN, SPF30, and TDRD3 proteins [27]. In contrast, arginine methylation of Sam68 blocks the interaction of a nearby proline-rich motif with an SH3 domain, but not with a WW domain [28]. More examples of methylarginine-regulated interactions are reviewed in McBride and Silver (2001) [29] and Bedford and Clarke (2009) [30].

There have been several studies to identify arginine- or lysine-methylated proteins on a proteome-wide scale. In the first of these studies, arginine-methylated protein complexes were purified from HeLa cell extracts using anti-methylarginine antibodies specific against RG-rich sequences [31]. This resulted in the identification of over 200 arginine-methylated proteins, involved in pre-mRNA processing, protein translation, and DNA transcription. However, the actual methylation sites on these proteins remain unknown [32]. The second study utilised stable isotope labelling by amino acid in cell culture (SILAC), in which [13CD3]methionine was converted to [13CD3]S-adenosyl methionine, the substrate for arginine and lysine methylation [32]. Advantages of this method included increased confidence of identification, a capacity to distinguish between tri-methylation and acetylation, which are near-isobaric, and the ability to quantify the relative changes in methylation status of a protein between two samples. In combination with anti-methyllysine and anti-methylarginine antibody immunoprecipitation techniques, Ong et al. (2004) [32] were able to identify methylation on histones from HeLa cell extracts, such as on histone H3K27. Around 30 other proteins were also found to be methylated at RG-rich motifs and most of these proteins are RNA binding or associated with mRNA processing pathways. The third study used anti-methyllysine antibodies to search for organ-specific lysine methylation in Mus musculus [33]. Proteomic analysis of brain tissue extract by 2-D PAGE, western blotting, and MALDI-ToF peptide mass fingerprinting identified the following lysine-methylated proteins: neurofilament triplet-I protein, Hsc70 protein, creatine kinase, α-tubulin, α-actin, β-actin, and γ-actin. Furthermore, α-actin and creatine kinase were found to be methylated in muscle tissue.

The use of tandem mass spectrometry to discover new protein post-translational modifications is common [34]. However, peptide mass fingerprinting can also be used to search for new PTM sites [35]. The FindMod program [35] caters for this approach. It requires peptide mass spectra from a mostly pure protein, for example a spot from a 2-D gel, and examines experimental peptide masses for differences in mass from the theoretical peptides of that protein that correspond to a post-translational modification. Peptides that are potentially modified are checked to see if they contain amino acids that can carry the modification. Where very high accuracy peptide mass measurements can be made, for example with new instruments like the prOTOF2000, high confidence predictions are possible. Parent-ion masses from tandem mass spectrometry data can also be used in FindMod, where it may serve as an initial screen for PTMs before employing more sophisticated and computationally expensive methods [36,37].

Here we describe a strategy for the discovery of methylation on a global scale, using peptide mass fingerprinting data, and implement this to search for methylated lysine and arginine residues in the yeast proteome. A proteome-scale set of MALDI-ToF mass spectra [38] was analysed for putative methylated peptides. The application of 5 filters yielded high-confidence methylation sites that were then further investigated to understand where they are found in protein sequences and their likely function.

Results

Large-scale methylation discovery in yeast peptide mass spectra

FindMod was used to analyse peptide mass spectra for 2,607 yeast proteins out of a total ~6,500 (representing 40% of the total proteome) for the presence of mono- and di-methylation. A tailor-made mass tolerance was calculated for each spectrum to reduce spurious peptide matches; the average of this for all spectra was ± 0.04 Da. Of all the 24,105 FindMod queries, there were 17,471 matches to potentially methylated peptides (Figure 1). Five filtering strategies, used sequentially, were then applied to this set to find methylation sites of very high confidence. The first filter removed peptides that matched to unmodified peptide sequences, as these peptide masses are likely to be unmodified peptides. Conversely, peptide masses that did not match to unmodified peptide sequences are likely to be modified, and these were analysed with the second filter. The second filter removed any peptides that contained D or E residues, as artifactual methylation may result from partial methyl esterification of D or E residues [39]. The third filter was designed to take advantage of redundancy within each FindMod output, by removing one-off or spurious mass spectra. It searched for modifications that were found in two or more overlapping peptides (Figure 2), and took advantage of the reduced efficiency of tryptic cleavage at methylated residues [32], where overlapping peptides with missed cleavages were likely to be found. The fourth filter reduced FindMod false positives by considering whether modifications found by FindMod were unambiguous or ambiguous. An unambiguous modification had only one FindMod match against one query peptide mass (Additional file 1), whereas an ambiguous modification had more than one match against a query mass (Additional file 1). For the peptide to be included in the final set of methylated peptides, at least one peptide in the overlapping peptides had to be an unambiguous peptide match. The use of these 4 filters resulted in 169 high confidence methylated peptides, from 17,471 initial low confidence matches (Figure 1).

While overlapping peptides helped localise methylation sites to one or more peptides, they did not necessarily localise the methylation to one amino acid. To address this, we used a fifth filter. When two or more modified peptides that passed filters 1-4 were also found to overlap and share the same modification site, the modification was classified as high confidence and kept. Note that any results for lysine trimethylation were discarded from the study since it is near-isobaric to lysine acetylation. From this filtering process, we found 40 lysine-methylated proteins with 45 lysine methylation sites: 25 with mono- and 20 with di-methylation. Similarly, we found 31 arginine-methylated proteins with 38 arginine methylation sites: 20 with mono- and 18 with di-methylation. There were 5 proteins that contained both arginine and lysine methylation. The list of high confidence methylated proteins and methylation sites is shown in Table 1; additional information on these high confidence methylated peptides and methylation sites is shown in Additional file 2 and Additional file 3, respectively.

Confirmation of FindMod protein methylation

To establish the accuracy of our methylation discovery approach, we theoretically digested all known methylated proteins in Swiss-Prot and analysed the resulting peptides with our FindMod approach. We supplemented this with a larger set of theoretically methylated proteins. The average true positive rate for FindMod at 0.04 Da was 89%. For methylation sites in Swiss-Prot, FindMod had a true positive rate of 100% for monomethyl-K, 98% for dimethyl-K, and 76% for dimethyl-R (Table 2a). The true positive rate for monomethyl-R could not be accurately estimated since the number of test cases was insufficient for accurate evaluation. Similarly, the true positive rate for the artificial methylation set was 78% for monomethyl-K, 89% for dimethyl-K, and 90% for both monomethyl-R and dimethyl-R (Table 2b). Additional results for the evaluation of the true positive rate of FindMod are shown in Additional file 4.

To further assess the accuracy of the FindMod approach, methylation sites discovered by FindMod were cross-referenced with known methylation sites in the literature and databases. Whilst only a small number of proteins are documented as methylated in the literature, we confirmed 3 proteins (Ssb1p, Ssb2p, Tub2p) as methylated (Table 3). If we included methyl-lysine sites in peptides containing D and E, we also confirmed the methylation of Tef1p and Rpl23p. This included 3 lysine methylation sites (K30, K79, and K390) from Tef1p, and 1 lysine dimethylation site (K110) from Rpl23p [14] (Additional file 5). Furthermore, we found 15 methylated ribosomal proteins in S. cerevisiae, consistent with the presence of methylation sites in ribosomal proteins of eukaryotes, such as S. cerevisiae [12-15,40], S. pombe [41], A. thaliana [42], and human [43-46].

Discovery rate of methylated peptides, unmodified peptides, and lysine and arginine-methylated residues

The discovery rate of a peptide is the frequency of protein identifications in which a particular peptide is observed. Methylated peptides with low discovery rates are likely to be sub-stoichiometric and partially methylated. It was predicted that there should be many more unmodified peptides than methylated peptides, and that methylated peptides will have a lower discovery rate since they are likely to be sub-stoichiometric. The discovery rate of high confidence methylated peptides was found to be significantly lower than that of unmodified peptides (p < 0.0001). The median discovery rate for unmodified peptides was 0.50, and the median value for arginine and lysine methylated peptides was 0.03. To check that the lower discovery rate of methylated residues was not due to differences in peptide ionisation efficiency, we examined if there was a correlation between the discovery rates of methylated and unmodified residues. In the set of results, there were 69 methylated residues for which the corresponding unmodified residues were also seen. The discovery rate of methylated residues was significantly but weakly correlated with the discovery rate of matching unmodified residues (Kendall’s τ = 0.22, p < 0.01), as expected. A list of methylated proteins and the methylation sites discovered by FindMod is shown in Table 1. The discovery rates of all high confidence methylated peptides and methylation sites are shown in Additional file 2 and Additional file 3, respectively.
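The mass-difference test that underlies the peptide mass fingerprinting search described above can be illustrated with a short sketch. This is not the FindMod implementation: the function names, the use of neutral monoisotopic masses, and the fixed ± 0.04 Da tolerance (the average tolerance reported above, whereas the study calculated one per spectrum) are assumptions made for illustration only. Trimethylation is left out, in line with its exclusion from the study.

```python
# Illustrative sketch only; not the FindMod code.
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

# Mass added by one or two methyl groups (each methylation adds CH2).
METHYL_SHIFTS = {"monomethyl": 14.01565, "dimethyl": 28.03130}


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


def candidate_methylations(sequence, observed_mass, tolerance=0.04):
    """Return the methylation states consistent with the observed mass.

    A peptide is only considered if it contains K or R, the residues
    that can carry the modification in this study.
    """
    if not any(residue in sequence for residue in "KR"):
        return []
    delta = observed_mass - peptide_mass(sequence)
    return [name for name, shift in METHYL_SHIFTS.items()
            if abs(delta - shift) <= tolerance]
```

For singly protonated MALDI ions, the proton mass would have to be subtracted from the measured m/z before calling `candidate_methylations`; that step is omitted here for brevity.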

Figure 1 Finding high confidence modified peptides using 5 filtering strategies. By removing peptides that matched to both methylated peptides and unmodified peptides (filter 1) and peptides which contain D or E residues (filter 2), 2,641 peptides were left. By filtering for overlapping peptides (filter 3), and where at least one of the overlapping peptides is an unambiguous peptide match (filter 4), 163 peptides remained. These included 108 unambiguous peptide matches, 43 ambiguous peptide matches, and 12 peptides in both categories. Found in these peptides are 45 lysine methylation sites and 38 arginine methylation sites, all of which are of high confidence (filter 5).
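The sequential use of filters summarised in Figure 1 can be sketched as a small pipeline that records how many candidate peptides survive each stage. The record layout (`sequence`, `matches_unmodified`) and the function name are hypothetical, and only filters 1 and 2 are shown, since filters 3 to 5 require grouping peptides by overlap rather than testing each peptide on its own.

```python
# Illustrative sketch only; field names are assumptions, not FindMod output.

def apply_filters(matches, filters):
    """Apply (name, predicate) filters in order.

    Returns the surviving matches and the number of matches left
    after each stage, starting with the input size.
    """
    counts = [("input", len(matches))]
    for name, keep in filters:
        matches = [match for match in matches if keep(match)]
        counts.append((name, len(matches)))
    return matches, counts


FILTERS = [
    # Filter 1: drop masses that are also explained by an unmodified peptide.
    ("filter 1", lambda match: not match["matches_unmodified"]),
    # Filter 2: drop peptides containing D or E (possible methyl esterification).
    ("filter 2", lambda match: not set(match["sequence"]) & {"D", "E"}),
]
```

Keeping the per-stage counts makes it straightforward to report the attrition at each step in the same way as the figure does.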

Figure 2 Filter 3: using two or more overlapping peptides to improve modification confidence. A peptide can have no missed cleavage or one missed cleavage (MC) at either the N-terminus or C-terminus of the peptide. The modification site has to be found in two or more overlapping peptides for it to be accepted for subsequent analyses. There are a few scenarios in which these modification sites can be found. a) There can be one peptide with no missed cleavage overlapping with another peptide with one missed cleavage. b) The modification site can be found in two overlapping peptides, each with one missed cleavage. c) The modification site may be found in three peptides: one peptide with no missed cleavage, and two peptides with one missed cleavage each. In all of the above cases, at least one peptide in the overlapping peptides has to be an unambiguous peptide match. Cases where this was not seen were excluded from all subsequent analyses.
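The overlap requirement illustrated in Figure 2 can be sketched in two steps: generating tryptic peptides with up to one missed cleavage, and keeping only those modification sites that are covered by at least two distinct peptides, one of which is an unambiguous match. This is a simplified illustration under stated assumptions: cleavage is taken to occur after every K or R (the proline rule is ignored), coordinates are 1-based and inclusive, and the record fields (`start`, `end`, `site`, `unambiguous`) are hypothetical.

```python
import re

# Illustrative sketch only; not the procedure's actual code.

def tryptic_peptides(protein, max_missed=1):
    """Return (start, end, sequence) tuples for tryptic peptides.

    Cleavage is assumed after every K or R. Coordinates are 1-based
    and inclusive; up to `max_missed` missed cleavages are allowed.
    """
    cuts = [0] + [m.end() for m in re.finditer(r"[KR]", protein)]
    if cuts[-1] != len(protein):
        cuts.append(len(protein))
    peptides = []
    for i in range(len(cuts) - 1):
        for missed in range(max_missed + 1):
            j = i + 1 + missed
            if j < len(cuts):
                peptides.append((cuts[i] + 1, cuts[j], protein[cuts[i]:cuts[j]]))
    return peptides


def supported_sites(matches, min_peptides=2):
    """Keep sites seen in >= min_peptides distinct overlapping peptides,
    at least one of which is an unambiguous match."""
    by_site = {}
    for match in matches:
        if match["start"] <= match["site"] <= match["end"]:
            by_site.setdefault(match["site"], []).append(match)
    kept = []
    for site, group in sorted(by_site.items()):
        distinct = {(m["start"], m["end"]) for m in group}
        if len(distinct) >= min_peptides and any(m["unambiguous"] for m in group):
            kept.append(site)
    return kept
```

Scenario a) of the figure corresponds to a site covered by one peptide from `tryptic_peptides` with no missed cleavage and another with one missed cleavage.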

Biological function, sub-cellular localization, abundance and half-life of methylated proteins

Methylated proteins are known to be involved in several pathways, such as translation [11] and RNA processing [49]. To investigate the function of the methylated proteins from yeast, gene ontology (GO) annotations for all yeast methylated proteins from FindMod analysis and Swiss-Prot were compared to non-methylated yeast proteins (Table 4). It was found that a number of biological processes were enriched with very high statistical significance, specifically translation, ribosome biogenesis and assembly, RNA metabolic process, and organelle organization and biogenesis. The molecular functions of structural activity, RNA binding, and translation regulator activity were also significantly enriched. As may be expected from the above, methylated proteins were significantly enriched in the cellular components of the ribosome and cytoplasm.

Protein abundance data from Ghaemmaghami et al. (2003) [50] was used to compare the abundance of methylated proteins to non-methylated proteins. This was used to determine if lower abundance proteins, more likely to be involved in signal transduction and regulation [51], are methylated. Methylated proteins were found to have a higher median abundance of 11500, as compared to non-methylated proteins, which had a median abundance of 2220 (p < 0.0001, Figure 3a). Despite this, several methylated proteins of low abundance were seen, including 5 proteins of less than 1000 molecules per cell. These included Snf2p (217 copies/cell), Snu114p (300 copies/cell), Mrpl20p (358 copies/cell), and Rpl3p (450 copies/cell). Examples of proteins with high abundance are Rpl1Bp (265,000 copies/cell) and Tdh3p (169,000 copies/cell).

The methylation of lysine residues has been suggested to block their ubiquitination, leading to a longer protein half-life [52]. To investigate this possibility, protein half-life data from Belle et al. (2006) [53] was used to compare the half-life of lysine-methylated proteins to non-methylated proteins. Interestingly, we found methylated proteins had a longer median half-life of 66 minutes, as compared to 43 minutes for non-methylated proteins (p = 0.012, Figure 3b). A striking difference between the methylated and non-methylated proteins was the absence of a group of proteins with very short half-life (see arrow in Figure 3b). Despite this, our approach also identified 18 methylated proteins with half-life less than 60 minutes. Examples of methylated proteins with shorter half-lives are Rrp5p (15 minutes), Ski3p (32 minutes), and Snu114p (52 minutes). Examples of proteins with long half-life are Utp22p (13,266 minutes) and Atp2p (6,627 minutes), although we note that these numbers may be erroneous estimations in the Belle et al. study (2006) [53]. Although the abundance and half-lives of methylated proteins could be analysed more precisely by comparing methylated proteins to other proteins from the same GO slim biological process, this approach was limited by the relatively small number of methylated proteins (66 proteins) in the dataset. Methylated proteins mapped to 33 gene ontology biological process categories, with an average of 2 proteins per category, which was unsuitable for appropriate statistical analyses.

Table 1 List of methylated proteins and methylation sites discovered by FindMod.
Gene name | Swiss-Prot accession | Description | Methylated residues (a)
ARC40 | P38328 | Actin-related protein 2/3 complex subunit 1 | mK121
ATP2 | P00830 | ATP synthase subunit beta, mitochondrial precursor | mK196
CDC11 | P32458 | Cell division control protein 11 | dR35
DIP2 | Q12220 | U3 small nucleolar RNA-associated protein 12 | mK350
DNF1 | P32660 | Probable phospholipid-transporting ATPase DNF1 | mK541
ECM29 | P38737 | Proteasome component ECM29 | mR542, dR1112
EDE1 | P34216 | EH domain-containing and endocytosis protein 1 | mR252
EMG1 | Q06287 | Essential for mitotic growth 1 | mK147
ERB1 | Q04660 | Ribosome biogenesis protein ERB1 | mK577, mK581
FKS1 | P38631 | 1,3-beta-glucan synthase component FKS1 | mR946, dR946, dR952, mR962, dR962, dR1527
GCD10 | P41814 | tRNA (adenine-N(1)-)-methyltransferase non-catalytic subunit TRM6 | mK436, mR447
GCN1 | P33892 | Translational activator GCN1 | dK1446
GCN20 | P43535 | Protein GCN20 | dK656
GUS1 | P46655 | Glutamyl-tRNA synthetase, cytoplasmic | mR371
HAS1 | Q03532 | ATP-dependent RNA helicase HAS1 | dK444
IMD3 | P50095 | Probable inosine-5’-monophosphate dehydrogenase IMD3 | mR168
ISW1 | P38144 | ISWI chromatin-remodeling complex ATPase ISW1 | mK14
LEA1 | Q08963 | U2 small nuclear ribonucleoprotein A’ | dR141
MPG1 | P41940 | Mannose-1-phosphate guanyltransferase | dK299
MRPL17 | P36528 | 54S ribosomal protein L17, mitochondrial precursor | mK70
MRPL20 | P22354 | 54S ribosomal protein L20, mitochondrial precursor | mK104
NIP1 | P32497 | Eukaryotic translation initiation factor 3 subunit C | mK514
NOC2 | P39744 | Nucleolar complex protein 2 | mK384
NOG2 | P53742 | Nucleolar GTP-binding protein 2 | dR336
NOT1 | P25655 | General negative regulator of transcription subunit 1 | mR256
POL12 | P38121 | DNA polymerase alpha subunit B | mK84
PRP43 | P53131 | Pre-mRNA-splicing factor ATP-dependent RNA helicase PRP43 | dK662
PRT1 | P06103 | Eukaryotic translation initiation factor 3 subunit B | mR572
PSD2 | P53037 | Phosphatidylserine decarboxylase proenzyme 2 precursor | mR252
PYK1 | P00549 | Pyruvate kinase 1 | mR216, dR216
RPA2 | P22138 | DNA-directed RNA polymerase I subunit RPA2 | dK513
RPB2 | P08518 | DNA-directed RNA polymerase II subunit RPB2 | mR496
RPL16A | P26784 | 60S ribosomal protein L16-A | dK148
RPL18A, RPL18B | P07279 | 60S ribosomal protein L18 | mR105
RPL1A, RPL1B | P53030 | 60S ribosomal protein L1 | dK207
RPL20A, RPL20B | P0C2I0 | 60S ribosomal protein L20 | dK47
RPL27A, RPL27B | P0C2H6, P0C2H7 | 60S ribosomal protein L27 | dR15, dK133
RPL2A, RPL2B | P05736 | 60S ribosomal protein L2 | dR21
RPL3 | P14126 | 60S ribosomal protein L3 | mR275, dK384
RPL4A | P10664 | 60S ribosomal protein L4-A | dR84, mK104
RPL7A | P05737 | 60S ribosomal protein L7-A | mR218
RPL7B | Q12213 | 60S ribosomal protein L7-B | mR218
RPL8B | P29453 | 60S ribosomal protein L8-B | mK15, mK43, dK241
RPN2 | P32565 | 26S proteasome regulatory subunit RPN2 | mK376
RPS11A, RPS11B | P26781 | 40S ribosomal protein S11 | mR67
RPS13 | P05756 | 40S ribosomal protein S13 | mK140
RPS17A | P02407 | 40S ribosomal protein S17-A | dK59
RRP5 | Q05022 | rRNA biogenesis protein RRP5 | dR215, dK769
RSC1 | P53236 | Chromatin structure-remodeling complex subunit RSC1 | mR454
RSC30 | P38781 | Chromatin structure-remodeling complex protein RSC30 | mR692
RVB2 | Q12464 | RuvB-like protein 2 | mK412
SKI3 | P17883 | Superkiller protein 3 | dK1088
SMB1 | P40018 | Small nuclear ribonucleoprotein-associated protein B | mK138, mK145
SNF2 | P22082 | Transcription regulatory protein SNF2 | dK1028
SNU114 | P36048 | 114 kDa U5 small nuclear ribonucleoprotein component | dK356, mK935
SSB1 | P11484 | Heat shock protein SSB1 | dR513
SSB2 | P40150 | Heat shock protein SSB2 | dR513
TDH3 | P00359 | Glyceraldehyde-3-phosphate dehydrogenase 3 | dR11
TIF32 | P38249 | Eukaryotic translation initiation factor 3 subunit A | dK192
TUB2 | P02557 | Tubulin beta chain | dR318
URA7 | P28274 | CTP synthase 1 | dK28
USO1 | P25386 | Intracellular protein transport protein USO1 | mK119
UTP22 | P53254 | U3 small nucleolar RNA-associated protein 22 | mK1158
VPS52 | P39904 | Vacuolar protein sorting-associated protein 52 | mR224
YKU70 | P32807 | ATP-dependent DNA helicase II subunit 1 | dR549
YPR097W | Q06839 | PX domain-containing protein YPR097W | dK249
a: m, monomethylated; d, dimethylated

Interplay of methylation and other post-translational modifications

To see if lysine methylation might block ubiquitination, the Ubipred software [54] was used to predict if known methylated lysine sites are also subject to ubiquitination. The Ubipred software has an accuracy of 84.4% and is thus sufficiently reliable for this test. It was found that 43% of high-confidence lysine methylation sites were also predicted to be ubiquitination sites. This result lends support to the hypothesis that methylation might block ubiquitination, potentially prolonging the half-life of lysine-methylated proteins.

It has recently been reported that the methylation of arginine can regulate the phosphorylation (or dephosphorylation) of some proteins [55-61]. To investigate whether there is evidence of interplay between arginine methylation and phosphorylation in S. cerevisiae, we examined the proportion of arginine-methylated proteins that are known to be phosphorylated in databases and in the literature. It was found that 94% (30/32) of arginine-methylated yeast proteins are known to be phosphorylated. This is a considerable increase over the 38% (2,548/6,709) of all S. cerevisiae proteins known to be phosphorylated and suggests a possible interplay of arginine methylation and phosphorylation [55-61].

Table 2 True positive rate at mass tolerance of 0.04 Da.
Type of methylation | True positive rate (%) | No. of methylation sites correctly matched | No. of methylation sites tested | No. of peptides tested
a) Non-redundant known mono- and di-methylation
Monomethyl-K | 100 | 27 | 466 | 1,201
Dimethyl-K | 98 | 89 | 280 | 699
Monomethyl-R | N.D.a | N.D.a | 36
Dimethyl-R | 76 | 28 | 137 | 495
b) Artificial mono- and di-methylation
Monomethyl-K | 78 | 217 | 11,377 | 30,782
Dimethyl-K | 89 | 3,673 | 11,377 | 30,691
Monomethyl-R | 90 | 453 | 6,941 | 18,728
Dimethyl-R | 90 | 2,140 | 6,902 | 18,594
a: N.D. - not determined

Arginine and lysine methylation motifs

To determine if methylation sites are enriched in specific sequence-motifs, all yeast methylation sites from FindMod analysis and the Swiss-Prot database were analysed to find enriched sequence-motifs. Methionine was found to be at position -1 from lysine methylation in 5 FindMod sites and two additional methylation sites previously documented in S. cerevisiae (Table 5). This presence of methionine was of very high statistical significance (p = 1.18 × 10⁻⁶) as compared to that expected in any random sequence of yeast proteins. By contrast, residues found to be significantly enriched adjacent to arginine methylation included W at position -4 (p = 1.50 × 10⁻⁷), and G at position -3 (p = 6.08 × 10⁻⁶). While it was previously known that arginine methylation is found in RGG motifs, Wooderchak et al. (2008) [62] showed that arginine methylation is also found in RXG and RGX motifs. No known S. cerevisiae methylation sites documented in Swiss-Prot contained the RGG, RGX, or RXG-motifs. However, FindMod found 7 methylation sites with the RXG or RGX motifs. Two methylation sites, Tdh3p dimethyl-R11 and Rpl4Ap dimethyl-R84, matched to the GXXRXG motif, which conforms with the known RXG motif and the additional GXXR motif found in this study. Three methylation sites had the novel WXXXR motif.

Discussion

Large-scale discovery of lysine and arginine methylation sites

In this study, 45 lysine methylation sites and 38 arginine methylation sites were identified in 66 proteins in the S. cerevisiae proteome. These include 4 proteins previously known to be methylated in yeast or in other organisms and 15 proteins that are functionally related to others known to be methylated. Our findings support earlier studies [31-33] that suggested methylation to be quite widespread. Whilst many of our methylation sites are novel and have not been confirmed by MS-MS, the filters and replicate analyses we used in association with the FindMod tool provided a robust means by which protein methylation could be detected. The false positive rate was estimated to be 11% at 0.04 Da mass error. Notwithstanding this, it should be noted that whilst we did study 2,607 proteins from yeast, this is only ~40% of the total yeast proteome. Therefore, we expect that up

Table 3 Methylated proteins identified independently by both FindMod and described in the literature.
Ordered locus name (Swiss-Prot accession) | Protein name | Literature (a)
YDL229W (P11484) | Ssb1p | Wang and Lazarides [48], Wang et al. [47], Iwabata et al. [33]
YNL209W (P40150) | Ssb2p | Wang and Lazarides [48], Wang et al. [47], Iwabata et al. [33]
YPR080W, YBR118W (P02994) | Tef1p/Tef2p/eEF1ap | Cavallius et al. [16], Iwabata et al. [33]
YBL087C, YER117W (P04451) | Rpl23A/Rpl23B | Porras-Yakushi et al. [14]
YFL037W (P02557) | Tub2p | Iwabata et al. [33]
a: Proteins previously known to be methylated in yeast and in other organisms

Table 4 Methylated proteins from yeast are enriched in specific processes, functions and components.
Rank | Term (GO ID) | n1,1 (a) | n1,2 (b) | n2,1 (c) | n2,2 (d) | Corrected p-value
Biological process
1 | Translation (6412) | 37 | 48 | 305 | 5712 | 3.57e-23
2 | Ribosome biogenesis and assembly (42254) | 20 | 65 | 311 | 5706 | 5.03e-7
3 | RNA metabolic process (16070) | 21 | 64 | 664 | 5353 | 0.01
4 | Organelle organization and biogenesis (6996) | 30 | 55 | 1230 | 4787 | 0.04
Molecular function
1 | Structural molecule activity (5198) | 31 | 54 | 304 | 5712 | 3.95e-17
2 | Translation regulator activity (45182) | 5 | 80 | 47 | 5969 | 0.01
3 | RNA binding (3723) | 10 | 75 | 225 | 5791 | 0.03
Cellular component
1 | Ribosome (5840) | 33 | 52 | 307 | 5710 | 5.56e-19
2 | Cytoplasm (5737) | 59 | 26 | 2701 | 3316 | 1.05e-4
a: Number of methylated proteins with this GO slim term
b: Number of non-methylated proteins with this GO term
c: Number of methylated proteins with other GO slim term
d: Number of non-methylated proteins with other GO slim term

to 60% of methylated proteins would have been missed. Further methylation sites may have been missed due to difficulties in mass spectrometric detection; an example is methylarginine, which is often found in arginine- and glycine-rich regions that produce tryptic peptides that are too small for routine MALDI-ToF analysis.

Discovery rates may reflect the sub-stoichiometric nature of methylation
Previous research has highlighted that methylated peptides are difficult to discover [32] and this is made more difficult because methylation is sub-stoichiometric [34]. For example, sub-stoichiometric levels of methylations were observed in the human heterogeneous nuclear ribonucleoprotein K (hnRNP K), in which < 33% of hnRNP K were asymmetrically dimethylated at R303, and < 10% were monomethylated at R287 [56]. Our results from FindMod analysis support these observations since the proportion of methylated peptides seen for any protein was very low. The sub-stoichiometric nature of methylation events was also supported by a weak but significant correlation between the discovery rates of modified and unmodified paired peptides.
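The enrichment values in Table 4 are derived from 2×2 contingency tables. As an illustration only, the following sketch computes a one-sided Fisher's exact test for the 'Translation' row of Table 4 and applies a Bonferroni correction. The function names, the choice of a one-sided alternative, and the value of `n_tests` are assumptions made for this example; the published analysis was run in R and its exact settings are not restated here, so the printed values are not expected to reproduce the corrected p-values in the table.

```python
from math import comb


def fisher_enrichment_p(n11, n12, n21, n22):
    """One-sided Fisher's exact test: P(top-left cell >= n11) with fixed margins."""
    row1 = n11 + n12          # first row total
    col1 = n11 + n21          # first column total
    col2 = n12 + n22          # second column total
    total = col1 + col2
    upper = min(row1, col1)   # largest value the top-left cell can take
    # Sum the hypergeometric numerators over all tables at least as extreme.
    tail = sum(comb(col1, k) * comb(col2, row1 - k) for k in range(n11, upper + 1))
    return tail / comb(total, row1)


def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


# 'Translation' row of Table 4: n1,1 = 37, n1,2 = 48, n2,1 = 305, n2,2 = 5712.
p_translation = fisher_enrichment_p(37, 48, 305, 5712)
# n_tests is a hypothetical placeholder; the number of GO slim terms tested is not given here.
p_corrected = bonferroni(p_translation, n_tests=50)
print(f"raw p = {p_translation:.3g}, Bonferroni-corrected p = {p_corrected:.3g}")
```

Because Fisher's exact test is unchanged when the table is transposed, the result does not depend on which of n1,2 and n2,1 is read as the row or column partner of n1,1.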

Figure 3 Distribution of abundance of methylated proteins and half-life of lysine-methylated proteins, versus non-methylated proteins. a) The x-axis represents protein abundance, in copies per cell, as log base 10. The abundance of methylated proteins is represented with a solid line, while the abundance of non-methylated proteins is represented with a dotted line. b) The x-axis represents protein half-life, minutes, in log base 10. The half-life of lysine-methylated proteins is represented with a solid line, while the half-life of non-methylated proteins is represented with a dotted line. The arrow points to a group of proteins with very short half-life, seen only in ‘other’ proteins, which are likely to be unmethylated. Note that in both figures, ‘other’ proteins are those for which methylation was not found; this group may, however, contain some methylated proteins. Note also that abundance data and half-life data was not available for all yeast proteins in Belle et al. (2006) [53] and Ghaemmaghami et al. (2003) [50].

Table 5 Lysine and arginine methylation motifs.
Gene name | Swiss-Prot accession | Methylation site | Evidence (a) | Motif (b)
MK motif
Gcd10p | P41814 | meth-K436 | 2 | RGKLHPLMTMKGGGGYLMWCH
Pfk2p | P16862 | meth-K180 | 1 | HSYTDLAYRMKTTDTYPSLPK
Rpl23Ap | P04451 | dimeth-K110 | 1 | GVIANPKGEMKGSAITGPVGK
Rps17Ap | P02407 | dimeth-K59 | 2 | KIAGYTTHLMKRIQKGPVRGI
Rvb2p | Q12464 | meth-K412 | 2 | LISVAQQIAMKRKNNTVEVED
Ura7p | P28274 | dimeth-K28 | 2 | VLASSTGMLMKTLGLKVTSIK
Uso1p | P25386 | meth-K119 | 2 | NGKYPSPLVMKQEKEQVDQFS
RGx or RxG motif
Ecm29p | P38737 | dimeth-R1112 | 2 | LAKSSALWSSRKGIAFGLGAI
Gus1p | P46655 | dimeth-R371 | 2 | IYRCNLTPHHRTGSTWKIYPT
Rpl27Bp | P38706 | dimeth-R15 | 2 | LKAGKVAVVVRGRYAGKKVVI
Rpl4Ap | P10664 | dimeth-R84 | 2 | IPRVGGGGTGRSGQGAFGNMC
Rps11Bp | P26781 | dimeth-R67 | 2 | KCPFTGLVSIRGKILTGTVVS
Tdh3p | P00359 | dimeth-R11 | 2 | MVRVAINGFGRIGRLVMRIAL
Tub2p | P02557 | dimeth-R318 | 2 | GRYLTVAAFFRGKVSVKEVED
WxxxR and/or GxxR motif
Cdc11p | P32458 | dimeth-R35 | 2 | VMIVGQSGSGRSTFINTLCGQ
Ecm29p | P38737 | meth-R542 | 2 | ARLFNIWGTVRTNRFDIIEES
Fks1p | P38631 | meth-R946, dimeth-R946 | 2, 2 | TLRTRIWASLRSQTLYRTISG
Fks1p | P38631 | dimeth-R1527 | 2 | YHRNSWIGYVRMSRARITGFK
Rpl4Ap | P10664 | dimeth-R84 | 2 | IPRVGGGGTGRSGQGAFGNMC
Rpl7Ap, Rpl7Bp | P05737 | meth-R218 | 2 | SNPSGGWGVPRKFKHFIQGGS
Rsc30p | P38781 | meth-R692 | 2 | SIKSFSSGNNRFHSNGKEFLF
Tdh3p | P00359 | dimeth-R11 | 2 | MVRVAINGFGRIGRLVMRIAL
a: Evidence for the presence of the methylation site on this protein. 1: Swiss-Prot, 2: methylation site confirmed by FindMod analysis
b: Methylation site matching the specified motif is underlined, the methylation site is highlighted in bold.
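The motif assignments in Table 5 can be checked directly from the listed sequence windows. The sketch below is illustrative only: it assumes that each 21-residue window is centred on the methylated residue (index 10) and that the final digit of each flattened site label is the evidence footnote, so that, for example, the Gcd10p site is K436. The dictionary and function names are hypothetical.

```python
# 21-residue windows copied from Table 5; the methylated residue is assumed to sit at index 10.
CENTER = 10

LYSINE_WINDOWS = {
    "Gcd10p K436": "RGKLHPLMTMKGGGGYLMWCH",
    "Pfk2p K180": "HSYTDLAYRMKTTDTYPSLPK",
    "Rpl23Ap K110": "GVIANPKGEMKGSAITGPVGK",
    "Rps17Ap K59": "KIAGYTTHLMKRIQKGPVRGI",
    "Rvb2p K412": "LISVAQQIAMKRKNNTVEVED",
    "Ura7p K28": "VLASSTGMLMKTLGLKVTSIK",
    "Uso1p K119": "NGKYPSPLVMKQEKEQVDQFS",
}

ARGININE_WINDOWS = {
    "Ecm29p R1112": "LAKSSALWSSRKGIAFGLGAI",
    "Gus1p R371": "IYRCNLTPHHRTGSTWKIYPT",
    "Rpl27Bp R15": "LKAGKVAVVVRGRYAGKKVVI",
    "Rpl4Ap R84": "IPRVGGGGTGRSGQGAFGNMC",
    "Rps11Bp R67": "KCPFTGLVSIRGKILTGTVVS",
    "Tdh3p R11": "MVRVAINGFGRIGRLVMRIAL",
    "Tub2p R318": "GRYLTVAAFFRGKVSVKEVED",
    "Cdc11p R35": "VMIVGQSGSGRSTFINTLCGQ",
    "Ecm29p R542": "ARLFNIWGTVRTNRFDIIEES",
    "Fks1p R946": "TLRTRIWASLRSQTLYRTISG",
    "Fks1p R1527": "YHRNSWIGYVRMSRARITGFK",
    "Rpl7Ap R218": "SNPSGGWGVPRKFKHFIQGGS",
    "Rsc30p R692": "SIKSFSSGNNRFHSNGKEFLF",
}


def residue_at(window, offset):
    """Residue at a signed offset from the methylated residue."""
    return window[CENTER + offset]


def sites_with(windows, offset, residue):
    """Sorted site labels whose window has `residue` at `offset`."""
    return sorted(name for name, w in windows.items() if residue_at(w, offset) == residue)


mk_sites = sites_with(LYSINE_WINDOWS, -1, "M")        # M immediately before K
wxxxr_sites = sites_with(ARGININE_WINDOWS, -4, "W")   # W at position -4
gxxr_sites = sites_with(ARGININE_WINDOWS, -3, "G")    # G at position -3
# RGx or RxG: glycine at position +1 or +2 relative to the arginine.
rgx_or_rxg_sites = sorted(
    name for name, w in ARGININE_WINDOWS.items()
    if "G" in (residue_at(w, 1), residue_at(w, 2))
)
# GxxRxG: glycine at both -3 and +2.
gxxrxg_sites = sorted(set(gxxr_sites) & set(sites_with(ARGININE_WINDOWS, 2, "G")))

print(len(mk_sites), len(rgx_or_rxg_sites), wxxxr_sites, gxxrxg_sites)
```

Under these assumptions the counts agree with the text: seven MK sites, seven RGx or RxG sites, three WxxxR sites, and two GxxRxG sites (Tdh3p R11 and Rpl4Ap R84).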

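The significance of a residue at a given motif position was assessed with a one-sided test of proportions. The sketch below is a minimal stand-in for that kind of test, written as a one-sample normal approximation with a continuity correction in the spirit of R's prop.test; it is not claimed to reproduce the published p-values. The counts and the background frequency in the example are hypothetical placeholders, and with expected counts this small the normal approximation is rough.

```python
from math import erfc, sqrt


def one_sided_proportion_test(successes, trials, expected_fraction):
    """P-value for 'observed fraction > expected fraction' (normal approximation)."""
    expected = trials * expected_fraction
    diff = successes - expected
    # Continuity correction, capped so it never exceeds the observed difference.
    correction = min(0.5, abs(diff))
    variance = trials * expected_fraction * (1 - expected_fraction)
    chi_sq = (abs(diff) - correction) ** 2 / variance
    z = sqrt(chi_sq) if diff >= 0 else -sqrt(chi_sq)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * erfc(z / sqrt(2))


# Hypothetical inputs: W seen at position -4 in 3 of 38 arginine sites,
# against an assumed background W frequency of 1%.
p_w = one_sided_proportion_test(3, 38, 0.01)
print(f"one-sided p = {p_w:.2g}")
```

A Bonferroni correction would then multiply each such p-value by the number of residue and position combinations tested.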
However, there may be explanations, other than biological, for the lower discovery rate of modified peptides. These include inefficient trypsin cleavage, which occurs C-terminal to methylated lysine and arginine residues [32], and differences in MALDI-ToF ionisation of the methylated peptides as seen with different proteotypic peptides [63].

Methylated proteins are involved in specific biological functions and processes, are higher in abundance and have longer half-life
Methylated proteins were found to be enriched for specific biological processes, molecular functions and sub-cellular localizations. Firstly, methylated proteins were enriched in translation, ribosome biogenesis and assembly. This is consistent with previous studies in which methylated proteins have been linked to translation in Escherichia coli, S. cerevisiae, and Schizosaccharomyces pombe [11]. Ribosomal proteins are also known to show lysine or arginine methylation, for example the ribosomal proteins L10a, L12, and L26a of Arabidopsis [42]. Secondly, the methylated proteins described here were found to be involved in RNA metabolic processes and are involved in RNA binding. This is consistent with the function of several proteins known to be methylated at RG-rich motifs [49]. The methylation of arginine in RG-rich motifs is conserved in humans, and their RNA binding activity is also conserved [32]. One such example is the fragile X mental retardation protein (FMRP) [25]. Thirdly, our methylated proteins were enriched in the ribosome and the cytoplasm. This is consistent with the sites of translation and association with RNA inside the cell [22,23]. Whilst the lack of methylated proteins enriched in the nucleus and nucleolus was not expected, these may have arisen due to our reduced set of proteins for analysis (40% of the yeast proteome). In addition, nuclear proteins such as histone and Npl3p are known to have peptides with multiple modification sites but these were not searched for in this study. Methylated proteins found in this study were significantly higher in abundance than proteins currently known to be non-methylated. This is partly explained by ribosomal proteins and proteins involved in translation, some

of which we found to be methylated, being of very high abundance [50,64]. Methylated proteins were also found to be of longer average half-life. This may be due to their role in translation [11], where ribosomal proteins are generally stable [53].

Interplay of methylation and other post-translational modifications
The methylation of lysine is known to block the action of ubiquitin ligase [65], preventing proteins from degradation via the ubiquitin/proteasome system [52,66]. Our observation of a distinct group of low half-life proteins in S. cerevisiae, none of which were methylated, suggests that lysine methylation might be on many proteins and prevent their ubiquitination. The limited number of ubiquitination sites currently known on yeast proteins [67,68] makes it difficult to check if lysine methylation, as found in this study, is found on residues that can also be poly-ubiquitinated. However, our prediction of putative ubiquitination sites [54] showed that 43% of the lysine methylation sites in 40 proteins may be ubiquitinated.

Several studies suggested that there is interplay between arginine methylation and phosphorylation of some proteins [55-61]. Arginine methylation may antagonise phosphorylation [56,57], act as a switch to enable the binding of phosphatase to encourage dephosphorylation [60], or encourage phosphorylation [59]. On the other hand, phosphorylation can either interfere with arginine methylation [58,61], or promote the recruitment of arginine methyltransferase [55]. We found that the majority of arginine-methylated proteins in our study (30 out of 32 or 94%) are known from the literature to be phosphorylated, suggesting an interplay between arginine methylation and phosphorylation in these proteins. However, these arginine methylation and phosphorylation sites were not necessarily directly adjacent in the protein sequence.

Arginine and lysine methylation motifs
Motif analysis showed that many methylation sites described here conform with previously known motifs. For example, 7 arginine methylation sites discovered by FindMod conformed with the known RXG and RGX motifs [62]. Arginine methylation sites were also enriched in GXXR motifs, which correlated with the enrichment of glycine residues near arginine methylation sites [69]. In addition, two experimentally verified methylated sites in Pfk2p and Rpl23Ap annotated in Swiss-Prot, along with 5 FindMod sites, suggest the existence of an MK lysine methylation motif. The discovery of the novel enriched methylation motif WXXXR supports the possibility that there are more methylation sites to be found in S. cerevisiae. These also raise an interesting question concerning which motifs are methylated by specific methyltransferases. Methyltransferases responsible for most methylation sites are also unknown (e.g. Tef1p K30, Pfk2p K180), and the function of several methyltransferase proteins in S. cerevisiae remains poorly characterized [13]. Therefore, more experiments are required to elucidate the function of methylation in S. cerevisiae.

Conclusions
This study is a step towards the definition of the methyl proteome of S. cerevisiae. It will be useful to guide future experiments on its predominance and role in the cell. For example, experiments are needed to elucidate the function of methylation and how each site is regulated, which with the exception of histone methylation is largely unknown. Secondly, experiments to investigate whether methylation sites overlap with poly-ubiquitination sites, and therefore prevent protein degradation via the ubiquitin/proteasome pathway, could be undertaken. Thirdly, it will be important to understand whether the functions of methylated proteins are co-regulated by ubiquitination, phosphorylation or other post-translational modifications. Finally, the ultimate goal in studying methylation should be to build networks of methylated proteins, their interaction partners and modifying enzymes to elucidate their dynamics as a system, similar to previous work on protein phosphorylation [70-72].

Methods
MALDI-ToF mass spectra for S. cerevisiae
This study employed MALDI-ToF peptide mass fingerprinting spectra from the large-scale characterization of protein complexes in S. cerevisiae [38]. There were 36,854 peptide mass spectra containing 1.2 million empirical masses, with an average mass error of 0.02 Da. These were from 2,607 proteins out of ~6,500 proteins (40%) in the yeast proteome, whereby each protein had an average of 11 spectra or at least 3 spectra. Peptide masses corresponding to unmodified peptides or tryptic peptides of porcine trypsin were removed, as were peptides less than 500 Da.

Tailor-made mass tolerance for each empirical spectrum
An error threshold was calculated for each of the 36,854 spectra; this was possible as the identity of all proteins was known. For each spectrum, the mass differences between the empirical and theoretical mass of all known unmodified peptides were calculated. The average and median mass tolerance was 0.04 Da. To ensure high accuracy of methylation discovery, only spectra with a mass error (Additional file 6) that was lower than 0.1 Da were used for the identification of methylation sites.

FindMod analysis of yeast proteins
Each peptide mass spectrum was analysed with FindMod [35]. A bulk submission web interface to FindMod was

developed http://ca.expasy.org/tools/findmod/findmod_batch.html. Each FindMod query used the UniProt accession number for the protein identified through peptide mass fingerprinting (from Gavin et al., 2006) [38], the experimental peptide masses for this protein and the tailor-made mass tolerance in Da. Other FindMod parameters included the use of monoisotopic mass, a maximum of 1 missed cleavage by trypsin, no amino acid substitutions, that the peptides were [M+H]+ and could contain oxidised methionine or tryptophan. The peptide masses were matched to theoretical peptides generated from the precursor sequence. The program searched for 71 types of post-translational modifications in all experimental peptide masses [35], including mono-, di-, and tri-methylation http://www.expasy.ch/tools/findmod/findmod_masses.html. Matches to 6 types of modifications were removed from the analyses, as they are not found in S. cerevisiae or may lead to many false positives due to their low mass; for more details see Additional file 6. The Swiss-Prot database version 51.6 and TrEMBL version 34.6 [73] were used for the FindMod matches.

Filters to remove low quality methylation sites
For the methylated peptides to be included in the analysis, they needed to pass the following 5 filters. The peptides 1) cannot be an unmodified peptide, 2) had to contain no Asp or Glu residues, and 3) have no or one missed tryptic cleavage. In addition, 4) the peptide must have two or more overlapping peptides and at least one peptide in the overlapping peptides had to be an unambiguous peptide match. 5) When two or more modified peptides that passed filters 1-4 were also found to overlap and share the same modification site, the modification was classified as high confidence and kept. The use of overlapping peptides to improve the reliability of methylation sites is facilitated by methylation sites found at the C-terminus of peptides. Trypsin cleavage at methylated arginine and lysine has been observed in many LC-MS/MS experiments [32,74-77], and is less efficient than at non-methylated residues. A list of tryptic peptides with C-terminal methylated amino acids, identified by LC-MS-MS, is shown in Additional file 7.

Calculation of discovery rate
The discovery rate for an unmodified peptide was calculated as the fraction of protein identifications in which the unmodified peptide is observed. In the case of duplicated genes, the counts of protein identifications were summed together because peptide mass fingerprinting cannot distinguish between proteins that do not differ in primary sequence. The discovery rate for a particular unmodified residue in the protein was calculated as the sum of the discovery rate of all the unmodified peptides that contain the residue. Discovery rates were also calculated for modified methylated peptides and methylated residues using the method as described above. Partially methylated peptides are likely to have a low discovery rate. While mass spectra with a maximum mass tolerance of 0.1 Da were used for finding the methylation sites to limit the false positive rate, all available mass spectra with a mass tolerance of up to 1.5 Da were used for the calculation of discovery rate. That is because more mass spectra were needed to increase the sample size for discovery rate calculation.

Evaluation of the true positive and false positive rate
Swiss-Prot entries with known lysine and arginine methylation sites were obtained from Swiss-Shop http://au.expasy.org/swiss-shop/, for Swiss-Prot release 57.2 [73], by searching the MOD_RES field using the keywords ‘methyllysine’ and ‘methylarginine’. S. cerevisiae protein sequences were downloaded from Swiss-Prot by using the query ‘organism:4932’. The annotations of known methylation sites were obtained from the MOD_RES field of the Swiss-Prot entry, and the type of methylation was determined from the standard RESID nomenclature [78]. The proteins were processed into mature forms where appropriate; these contain no signal peptides, propeptides, or intein regions, and consist only of protein chains annotated by the ‘CHAIN’ field of the Swiss-Prot entry. For each M or W residue in a peptide, the mass of methionine and tryptophan oxidation was added to the total mass of the peptide. Only methylated peptides with a maximum of one missed cleavage and with masses between 500 and 3,000 Da were used. Since lysine trimethylation is near-isobaric to lysine acetylation, trimethylation was not included in the analysis. Two in silico test sets, the known methylation set and the artificial methylation set, were used to evaluate the true positive rate of FindMod for the discovery of mono- and di-methylation on arginine and lysine residues. The known methylation test set contained known lysine and arginine methylation sites from Swiss-Prot. The set of sequences from which methylation sites were found was non-redundant at the 90% identity level, generated using UniRef90 [79]. This test set included 883 known mono- and di-methylation sites. The artificial mono- and di-methylation sites on lysine and arginine residues were generated by simulated methylation on theoretical unmodified peptides. The artificial test set has more data than the known methylation test set, to allow more accurate estimation of the true positive rate. Approximately 6% of lysine residues from S. cerevisiae protein sequences were randomly sampled to generate artificially methylated peptides for monomethyl-K. The sampling procedure was repeated for dimethyl-K, monomethyl-R, and dimethyl-R. The second test set was referred to as the artificial methylation set, and contained 36,594 artificial mono- and di-methylation sites.
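The discovery rate calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: peptides are represented as hypothetical (start, end, times observed) tuples, the protein is assumed to have been identified ten times, and the function names are invented for the example.

```python
def peptide_discovery_rate(times_observed, n_identifications):
    """Fraction of protein identifications in which the peptide was observed."""
    return times_observed / n_identifications


def residue_discovery_rate(position, peptides, n_identifications):
    """Sum of the discovery rates of all peptides that contain the residue."""
    return sum(
        peptide_discovery_rate(observed, n_identifications)
        for start, end, observed in peptides
        if start <= position <= end
    )


# Hypothetical protein identified 10 times, with three observed tryptic peptides.
# Each tuple is (first residue, last residue, number of identifications showing the peptide).
N_IDENTIFICATIONS = 10
PEPTIDES = [(1, 8, 4), (5, 12, 2), (13, 20, 5)]

for residue in (6, 15, 25):
    rate = residue_discovery_rate(residue, PEPTIDES, N_IDENTIFICATIONS)
    print(f"residue {residue}: discovery rate {rate:.2f}")
```

Residue 6 lies in two overlapping peptides, so its rate is the sum of both peptide rates; a residue covered by no observed peptide has a rate of zero.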

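The exclusion of trimethylation from the test sets follows from simple mass arithmetic. The sketch below, which is illustrative and not part of the original analysis, computes the monoisotopic mass shifts for mono-, di- and tri-methylation and compares trimethylation with acetylation. The atomic masses are standard monoisotopic values; the helper names are hypothetical.

```python
# Monoisotopic atomic masses in Da (standard values for the lightest isotopes).
MONO = {"C": 12.0, "H": 1.00782503207, "O": 15.99491461956}

# Each methyl group replaces one H with CH3, a net gain of CH2.
METHYL_SHIFT = MONO["C"] + 2 * MONO["H"]
# Acetylation replaces one H with COCH3, a net gain of C2H2O.
ACETYL_SHIFT = 2 * MONO["C"] + 2 * MONO["H"] + MONO["O"]


def methylation_shift(n_methyl_groups):
    """Mass added by n methyl groups, in Da."""
    return n_methyl_groups * METHYL_SHIFT


def distinguishable(shift_a, shift_b, tolerance_da):
    """True if two mass shifts differ by more than the mass tolerance."""
    return abs(shift_a - shift_b) > tolerance_da


gap = methylation_shift(3) - ACETYL_SHIFT
for n, label in ((1, "mono"), (2, "di"), (3, "tri")):
    print(f"{label}-methylation: +{methylation_shift(n):.5f} Da")
print(f"acetylation: +{ACETYL_SHIFT:.5f} Da, gap to trimethylation: {gap:.5f} Da")
print("separable at 0.04 Da:", distinguishable(methylation_shift(3), ACETYL_SHIFT, 0.04))
```

The gap of roughly 0.036 Da is smaller than the 0.04 Da median tolerance reported for these spectra, which is consistent with treating the two modifications as indistinguishable by peptide mass alone.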
The true positive rate of FindMod, with the 5 filters described above, was evaluated using known methylation sites and artificial methylation sites. Removal of peptides containing D or E residues was not required since no artifactual methylation on D or E residues was introduced to the in silico test sets. The true positive rate was evaluated at the mass tolerance of 0.04 Da, since this was the median mass tolerance for all empirical peptide masses [38]. For each test set, a true positive FindMod match requires the residue, sequence position and the type of methylation to be correctly matched. The true positive rate of FindMod was calculated as the number of true positive matches divided by the sum of the number of false positive matches and the number of true positives, represented as a percentage.

Arginine and lysine methylation motif analysis
Ten amino acid residues N-terminal and C-terminal to each methylation site were included in the motif analysis. The number of times each amino acid occurs at each of these positions was counted. For any methylation site less than 10 residues from the N- or C-terminus of the protein, positions beyond the limit of the sequence were disregarded. To measure whether an amino acid was significantly enriched at each position, a p-value was calculated using the prop.test function in the R statistical package. A one-sided statistical test was used, with an alternative hypothesis that there was an enrichment of amino acid frequency over the average frequency. Bonferroni’s correction was used to correct the p-value calculated by prop.test to reduce false positives.

Functional analysis and statistical tests
Functional data co-analysed with modifications were protein abundance [50], protein half-life [53], Gene Ontology (GO) slim (from Saccharomyces Genome Database, ftp://ftp.yeastgenome.org/yeast/) [80] and protein complexes [81]. Nonparametric tests were used for all statistical analyses. Protein abundance data, in copies per cell, were from Ghaemmaghami et al. (2003) [50]. Protein half-life data, in minutes, were from Belle et al. (2006) [53]. To investigate if lysine methylation might block ubiquitination, the Ubipred software [54] was used to predict if known methylated lysine sites are also subject to ubiquitination. To investigate if arginine methylated proteins were co-regulated by phosphorylation, the Swiss-Prot database release 57.2 [73] was examined to see if methylated proteins also had experimentally determined protein threonine, serine, and tyrosine phosphorylation sites. Mann-Whitney tests, a non-parametric substitute for Student’s t-test, were used to compare between two samples. Kendall’s correlation coefficient, a non-parametric substitute for Pearson’s correlation coefficient, was used to measure the significance of the correlation between two samples. GO slim term enrichment was assessed using Fisher’s exact test and Bonferroni correction [82]. All statistical analyses were performed using the R statistical package version 2.2.1 [83].

Additional file 1: Examples of ambiguous and unambiguous peptide matches. This file contains examples of ambiguous and unambiguous peptide matches. [http://www.biomedcentral.com/content/supplementary/1471-2164-11-92-S1.DOC]
Additional file 2: List of lysine- and arginine-methylated peptides. This file contains the list of all high confidence arginine- and lysine-methylated peptides, and their corresponding discovery rates. [http://www.biomedcentral.com/content/supplementary/1471-2164-11-92-S2.XLS]
Additional file 3: List of lysine and arginine methylation sites. This file contains the list of all high confidence methylated residues and their corresponding discovery rates. [http://www.biomedcentral.com/content/supplementary/1471-2164-11-92-S3.XLS]
Additional file 4: Additional benchmarking results. Evaluation of the true positive rates of FindMod using different ranges of mass tolerance from 0.01 to 0.10 Da, and the known non-redundant methylation test set and the artificial methylation test set. [http://www.biomedcentral.com/content/supplementary/1471-2164-11-92-S4.DOC]
Additional file 5: List of methylated peptides of Tef1p and Rpl23p discovered by FindMod. This file contains the list of methylated peptides of Tef1p and Rpl23p found by FindMod. These peptides may contain E and D residues, as E and D residues were not filtered for the analysis. [http://www.biomedcentral.com/content/supplementary/1471-2164-11-92-S5.DOC]
Additional file 6: Supplementary methods. This file describes how the tailor-made error tolerances were calculated, and also provides a list of low-quality post-translational modifications that were excluded from FindMod’s analysis. [http://www.biomedcentral.com/content/supplementary/1471-2164-11-92-S6.DOC]
Additional file 7: List of methylation at C-terminus of peptides. This file is a list of peptides with methylation at C-terminus of peptides, collected from literature. [http://www.biomedcentral.com/content/supplementary/1471-2164-11-92-S7.XLS]

Abbreviations
Da: Daltons; GO: gene ontology; LC-MS/MS: liquid chromatography tandem mass spectrometry; MALDI-ToF: Matrix assisted laser desorption ionisation - time of flight; PTM: post-translational modification.

Acknowledgements
CNIP was the recipient of Australian Postgraduate Awards. EG was supported by the Swiss Federal Government through the Federal Office of Education and Science. This research was supported in part by a University of New South Wales Faculty Research Grant, by the University of New South Wales Goldstar Scheme and the NSW State Government Science Leveraging Fund. The author thanks Timothy A. Couttas, Daniel Yagoub, Simone S. Li, and

Adam Lee for their helpful discussions on this manuscript. We thank A.C. Gavin for facilitating access to the Cellzome peptide mass data.

Author details
1 School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia. 2 Systems Biology Initiative, University of New South Wales, Sydney, NSW, 2052, Australia. 3 Swiss Institute of Bioinformatics, Swiss-Prot Group, CMU - 1, rue Michel Servet, CH-1211 Geneva 4, Switzerland.

Authors’ contributions
CNIP designed the method for searching arginine and lysine methylation sites using FindMod, performed the statistical and bioinformatics analyses, and wrote the manuscript. EG implemented the FindMod bulk submission program. MRW supervised the project and critically reviewed the manuscript. All authors read and approved the manuscript.

Received: 12 November 2009 Accepted: 5 February 2010 Published: 5 February 2010

References
1. Grillo MA, Colombatto S: S-adenosylmethionine and protein methylation. Amino Acids 2005, 28(4):357-362.
2. Strahl BD, Allis CD: The language of covalent histone modifications. Nature 2000, 403(6765):41-45.
3. Garcia BA, Hake SB, Diaz RL, Kauer M, Morris SA, Recht J, Shabanowitz J, Mishra N, Strahl BD, Allis CD, Hunt DF: Organismal differences in post-translational modifications in histones H3 and H4. The Journal of biological chemistry 2007, 282(10):7641-7655.
4. Pokholok DK, Harbison CT, Levine S, Cole M, Hannett NM, Lee TI, Bell GW, Walker K, Rolfe PA, Herbolsheimer E, Zeitlinger J, Lewitter F, Gifford DK, Young RA: Genome-wide map of nucleosome acetylation and methylation in yeast. Cell 2005, 122(4):517-527.
5. Briggs SD, Xiao T, Sun ZW, Caldwell JA, Shabanowitz J, Hunt DF, Allis CD, Strahl BD: Gene silencing: trans-histone regulatory pathway in chromatin. Nature 2002, 418(6897):498.
6. Frederiks F, Tzouros M, Oudgenoeg G, van Welsem T, Fornerod M, Krijgsveld J, van Leeuwen F: Nonprocessive methylation by Dot1 leads to functional redundancy of histone H3K79 methylation states. Nature structural & molecular biology 2008, 15(6):550-557.
7. van Leeuwen F, Gafken PR, Gottschling DE: Dot1p modulates silencing in yeast by methylation of the nucleosome core. Cell 2002, 109(6):745-756.
8. San-Segundo PA, Roeder GS: Role for the silencing protein Dot1 in meiotic checkpoint control. Mol Biol Cell 2000, 11(10):3601-3615.
9. Giannattasio M, Lazzaro F, Plevani P, Muzi-Falconi M: The DNA damage checkpoint response requires histone H2B ubiquitination by Rad6-Bre1 and H3 methylation by Dot1. The Journal of biological chemistry 2005, 280(11):9879-9886.
10. Wysocki R, Javaheri A, Allard S, Sha F, Cote J, Kron SJ: Role of Dot1-dependent histone H3 methylation in G1 and S phase DNA damage checkpoint functions of Rad9. Molecular and cellular biology 2005, 25(19):8430-8443.
11. Polevoda B, Sherman F: Methylation of proteins involved in translation. Mol Microbiol 2007, 65(3):590-606.
12. Lhoest J, Lobet Y, Costers E, Colson C: Methylated proteins and amino acids in the ribosomes of Saccharomyces cerevisiae. Eur J Biochem 1984, 141(3):585-590.
13. Porras-Yakushi TR, Whitelegge JP, Clarke S: A novel SET domain methyltransferase in yeast: Rkm2-dependent trimethylation of ribosomal protein L12ab at lysine 10. The Journal of biological chemistry 2006, 281(47):35835-35845.
14. Porras-Yakushi TR, Whitelegge JP, Clarke S: Yeast ribosomal/cytochrome c SET domain methyltransferase subfamily: identification of Rpl23ab methylation sites and recognition motifs. The Journal of biological chemistry 2007, 282(17):12368-12376.
15. Itoh T, Wittmann-Liebold B: The primary structure of protein 44 from the large subunit of yeast ribosomes. FEBS Lett 1978, 96(2):399-402.
16. Cavallius J, Zoll W, Chakraburtty K, Merrick WC: Characterization of yeast EF-1 alpha: non-conservation of post-translational modifications. Biochimica et biophysica acta 1993, 1163(1):75-80.
17. Xu C, Henry PA, Setya A, Henry MF: In vivo analysis of nucleolar proteins modified by the yeast arginine methyltransferase Hmt1/Rmt1p. RNA 2003, 9(6):746-759.
18. Russell ID, Tollervey D: NOP3 is an essential yeast protein which is required for pre-rRNA processing. J Cell Biol 1992, 119(4):737-747.
19. Kondo K, Inouye M: Yeast NSR1 protein that has structural similarity to mammalian nucleolin is involved in pre-rRNA processing. The Journal of biological chemistry 1992, 267(23):16252-16258.
20. Lee WC, Zabetakis D, Melese T: NSR1 is required for pre-rRNA processing and for the proper maintenance of steady-state levels of ribosomal subunits. Molecular and cellular biology 1992, 12(9):3865-3871.
21. Loo S, Laurenson P, Foss M, Dillin A, Rine J: Roles of ABF1, NPL3, and YCL54 in silencing in Saccharomyces cerevisiae. Genetics 1995, 141(3):889-902.
22. Green DM, Marfatia KA, Crafton EB, Zhang X, Cheng X, Corbett AH: Nab2p is required for poly(A) RNA export in Saccharomyces cerevisiae and is regulated by arginine methylation via Hmt1p. The Journal of biological chemistry 2002, 277(10):7752-7760.
23. Shen EC, Henry MF, Weiss VH, Valentini SR, Silver PA, Lee MS: Arginine methylation facilitates the nuclear export of hnRNP proteins. Genes & development 1998, 12(5):679-691.
24. Kiledjian M, Dreyfuss G: Primary structure and binding activity of the hnRNP U protein: binding RNA through RGG box. The EMBO journal 1992, 11(7):2655-2664.
25. Dolzhanskaya N, Merz G, Aletta JM, Denman RB: Methylation regulates the intracellular protein-protein and protein-RNA interactions of FMRP. J Cell Sci 2006, 119(Pt 9):1933-1946.
26. McBride AE, Cook JT, Stemmler EA, Rutledge KL, McGrath KA, Rubens JA: Arginine methylation of yeast mRNA-binding protein Npl3 directly affects its function, nuclear export, and intranuclear protein interactions. The Journal of biological chemistry 2005, 280(35):30888-30898.
27. Cote J, Richard S: Tudor domains bind symmetrical dimethylated arginines. The Journal of biological chemistry 2005, 280(31):28476-28483.
28. Bedford MT, Frankel A, Yaffe MB, Clarke S, Leder P, Richard S: Arginine methylation inhibits the binding of proline-rich ligands to Src homology 3, but not WW, domains. The Journal of biological chemistry 2000, 275(21):16030-16036.
29. McBride AE, Silver PA: State of the arg: protein methylation at arginine comes of age. Cell 2001, 106(1):5-8.
30. Bedford MT, Clarke SG: Protein arginine methylation in mammals: who, what, and why. Molecular cell 2009, 33(1):1-13.
31. Boisvert FM, Cote J, Boulanger MC, Richard S: A proteomic analysis of arginine-methylated protein complexes. Mol Cell Proteomics 2003, 2(12):1319-1330.
32. Ong SE, Mittler G, Mann M: Identifying and quantifying in vivo methylation sites by heavy methyl SILAC. Nat Methods 2004, 1(2):119-126.
33. Iwabata H, Yoshida M, Komatsu Y: Proteomic analysis of organ-specific post-translational lysine-acetylation and -methylation in mice by use of anti-acetyllysine and -methyllysine mouse monoclonal antibodies. Proteomics 2005, 5(18):4653-4664.
34. Mann M, Jensen ON: Proteomic analysis of post-translational modifications. Nat Biotechnol 2003, 21(3):255-261.
35. Wilkins MR, Gasteiger E, Gooley AA, Herbert BR, Molloy MP, Binz PA, Ou K, Sanchez JC, Bairoch A, Williams KL, Hochstrasser DF: High-throughput mass spectrometric discovery of protein post-translational modifications. J Mol Biol 1999, 289(3):645-657.
36. Bandeira N, Tsur D, Frank A, Pevzner PA: Protein identification by spectral networks analysis. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(15):6140-6145.
37. Tsur D, Tanner S, Zandi E, Bafna V, Pevzner PA: Identification of post-translational modifications by blind search of mass spectra. Nat Biotechnol 2005, 23(12):1562-1567.
38. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, et al: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636.
39. Jung SY, Li Y, Wang Y, Chen Y, Zhao Y, Qin J: Complications in the assignment of 14 and 28 Da mass shift detected by mass spectrometry

as in vivo methylation from endogenous proteins. Anal Chem 2008, 60. Zhu W, Mustelin T, David M: Arginine methylation of STAT1 regulates its 80(5):1721-1729. dephosphorylation by T cell protein tyrosine phosphatase. The Journal of 40. Lee SW, Berger SJ, Martinovic S, Pasa-Tolic L, Anderson GA, Shen Y, Zhao R, biological chemistry 2002, 277(39):35787-35790. Smith RD: Direct mass spectrometric analysis of intact proteins of the 61. Hsu Ia W, Hsu M, Li C, Chuang TW, Lin RI, Tarn WY: Phosphorylation of yeast large ribosomal subunit using capillary LC/FTICR. Proceedings of the Y14 modulates its interaction with proteins involved in mRNA National Academy of Sciences of the United States of America 2002, metabolism and influences its methylation. The Journal of biological 99(9):5942-5947. chemistry 2005, 280(41):34507-34512. 41. Sadaie M, Shinmyozu K, Nakayama J: A conserved SET domain 62. Wooderchak WL, Zang T, Zhou ZS, Acuna M, Tahara SM, Hevel JM: methyltransferase, Set11, modifies ribosomal protein Rpl12 in fission Substrate profiling of PRMT1 reveals amino acid sequences that extend yeast. The Journal of biological chemistry 2008, 283(11):7185-7195. beyond the “RGG” paradigm. Biochemistry 2008, 47(36):9456-9466. 42. Carroll AJ, Heazlewood JL, Ito J, Millar AH: Analysis of the Arabidopsis 63. Mallick P, Schirle M, Chen SS, Flory MR, Lee H, Martin D, Ranish J, Raught B, cytosolic ribosome proteome provides detailed insights into its Schmitt R, Werner T, Kuster B, Aebersold R: Computational prediction of components and their post-translational modification. Mol Cell Proteomics proteotypic peptides for quantitative proteomics. Nat Biotechnol 2007, 2008, 7(2):347-369. 25(1):125-131. 43. Goldenberg CJ, Eliceiri GL: Methylation of ribosomal proteins in HeLa 64. Newman JR, Ghaemmaghami S, Ihmels J, Breslow DK, Noble M, DeRisi JL, cells. Biochimica et biophysica acta 1977, 479(2):220-234. Weissman JS: Single-cell proteomic analysis of S. cerevisiae reveals the 44. 
Shin HS, Jang CY, Kim HD, Kim TS, Kim S, Kim J: Arginine methylation of architecture of biological noise. Nature 2006, 441(7095):840-846. ribosomal protein S3 affects ribosome assembly. Biochemical and 65. Michalek MT, Grant EP, Rock KL: Chemical denaturation and modification biophysical research communications 2009, 385(2):273-278. of ovalbumin alters its dependence on ubiquitin conjugation for class I 45. Scolnik PA, Eliceiri GL: Methylation sites in HeLa cell ribosomal proteins. antigen presentation. J Immunol 1996, 157(2):617-624. Eur J Biochem 1979, 101(1):93-101. 66. Chuikov S, Kurash JK, Wilson JR, Xiao B, Justin N, Ivanov GS, McKinney K, 46. Swiercz R, Person MD, Bedford MT: Ribosomal protein S2 is a substrate for Tempst P, Prives C, Gamblin SJ, Barlev NA, Reinberg D: Regulation of p53 mammalian PRMT3 (protein arginine methyltransferase 3). The activity through lysine methylation. Nature 2004, 432(7015):353-360. Biochemical journal 2005, 386(Pt 1):85-91. 67. Lu JY, Lin YY, Qian J, Tao SC, Zhu J, Pickart C, Zhu H: Functional dissection 47. Wang C, Lin JM, Lazarides E: Methylations of 70,000-Da heat shock of a HECT ubiquitin E3 ligase. Mol Cell Proteomics 2008, 7(1):35-45. proteins in 3T3 cells: alterations by arsenite treatment, by different 68. Gupta R, Kus B, Fladd C, Wasmuth J, Tonikian R, Sidhu S, Krogan NJ, stages of growth and by virus transformation. Arch Biochem Biophys 1992, Parkinson J, Rotin D: Ubiquitination screen using protein microarrays for 297(1):169-175. comprehensive identification of Rsp5 substrates in yeast. Molecular 48. Wang C, Lazarides E: Arsenite-induced changes in methylation of the systems biology 2007, 3:116. 70,000 dalton heat shock proteins in chicken embryo fibroblasts. 69. Daily KM, Radivojac P, Dunker AK: Intrinsic disorder and protein Biochemical and biophysical research communications 1984, 119(2):735-743. modifications: building an SVM predictor for methylation. IEEE 49. 
Yu MC, Bachand F, McBride AE, Komili S, Casolari JM, Silver PA: Arginine Symposium on Computational Intelligence in Bioinformatics and methyltransferase affects interactions and recruitment of mRNA Computational Biology, CIBCB 2005: November 2005 2005; San Diego, processing and export factors. Genes & development 2004, California, USA 2005, 475-481. 18(16):2024-2035. 70. Ptacek J, Devgan G, Michaud G, Zhu H, Zhu X, Fasolo J, Guo H, Jona G, 50. Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, Breitkreutz A, Sopko R, McCartney RR, Schmidt MC, Rachidi N, Lee SJ, O’Shea EK, Weissman JS: Global analysis of protein expression in yeast. Mah AS, Meng L, Stark MJ, Stern DF, De Virgilio C, Tyers M, Andrews B, Nature 2003, 425(6959):737-741. Gerstein M, Schweitzer B, Predki PF, Snyder M: Global analysis of protein 51. Bedford MT, Richard S: Arginine methylation an emerging regulator of phosphorylation in yeast. Nature 2005, 438(7068):679-684. protein function. Molecular cell 2005, 18(3):263-272. 71. Linding R, Jensen LJ, Ostheimer GJ, van Vugt MA, Jorgensen C, Miron IM, 52. Desiere F, Deutsch EW, Nesvizhskii AI, Mallick P, King NL, Eng JK, Aderem A, Diella F, Colwill K, Taylor L, Elder K, Metalnikov P, Nguyen V, Pasculescu A, Boyle R, Brunner E, Donohoe S, Fausto N, Hafen E, Hood L, Katze MG, Jin J, Park JG, Samson LD, Woodgett JR, Russell RB, Bork P, Yaffe MB, Kennedy KA, Kregenow F, Lee H, Lin B, Martin D, Ranish JA, Rawlings DJ, Pawson T: Systematic discovery of in vivo phosphorylation networks. Cell Samelson LE, Shiio Y, Watts JD, Wollscheid B, Wright ME, Yan W, Yang L, 2007, 129(7):1415-1426. Yi EC, Zhang H, et al: Integration with the human genome of peptide 72. Fiedler D, Braberg H, Mehta M, Chechik G, Cagney G, Mukherjee P, Silva AC, sequences obtained by high-throughput mass spectrometry. Genome Shales M, Collins SR, van Wageningen S, Kemmeren P, Holstege FC, biology 2005, 6(1):R9. Weissman JS, Keogh MC, Koller D, Shokat KM, Krogan NJ: Functional 53. 
Belle A, Tanay A, Bitincka L, Shamir R, O’Shea EK: Quantification of protein organization of the S. cerevisiae phosphorylation network. Cell 2009, half-lives in the budding yeast proteome. Proc Natl Acad Sci USA 2006, 136(5):952-963. 103(35):13004-13009. 73. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic 54. Tung CW, Ho SY: Computational identification of ubiquitylation sites acids research 2009, , 37 Database: D169-174. from protein sequences. BMC Bioinformatics 2008, 9:310. 74. Couttas TA, Raftery MJ, Bernardini G, Wilkins MR: Immonium ion scanning 55. Gupta P, Ho PC, Huq MD, Khan AA, Tsai NP, Wei LN: PKCepsilon for the discovery of post-translational modifications and its application stimulated arginine methylation of RIP140 for its nuclear-cytoplasmic to histones. J Proteome Res 2008, 7(7):2632-2641. export in adipocyte differentiation. PLoS One 2008, 3(7):e2658. 75. Beck HC, Nielsen EC, Matthiesen R, Jensen LH, Sehested M, Finn P, 56. Ostareck-Lederer A, Ostareck DH, Rucknagel KP, Schierhorn A, Moritz B, Grauslund M, Hansen AM, Jensen ON: Quantitative proteomic analysis of Huttelmaier S, Flach N, Handoko L, Wahle E: Asymmetric arginine post-translational modifications of human histones. Mol Cell Proteomics dimethylation of heterogeneous nuclear ribonucleoprotein K by protein- 2006, 5(7):1314-1325. arginine methyltransferase 1 inhibits its interaction with c-Src. The 76. Dave KA, Hamilton BR, Wallis TP, Furness SGB, Whitelaw ML, Gorman JJ: Journal of biological chemistry 2006, 281(16):11115-11125. Identification of N, Nepsilon-dimethyl-lysine in the murine dioxin 57. Yamagata K, Daitoku H, Takahashi Y, Namiki K, Hisatake K, Kako K, Mukai H, receptor using MALDI-TOF/TOF- and ESI-LTQ-Orbitrap-FT-MS. Int J Mass Kasuya Y, Fukamizu A: Arginine methylation of FOXO transcription factors Spec 2007, 268(2-3):168-180. inhibits their phosphorylation by Akt. Molecular cell 2008, 32(2):221-231. 77. 
Wisniewski JR, Zougman A, Kruger S, Mann M: Mass spectrometric 58. Yun CY, Fu XD: Conserved SR protein kinase functions in nuclear import mapping of linker histone H1 variants reveals multiple acetylations, and its action is counteracted by arginine methylation in methylations, and phosphorylation as well as differences between cell Saccharomyces cerevisiae. J Cell Biol 2000, 150(4):707-718. culture and tissue. Mol Cell Proteomics 2007, 6(1):72-87. 59. Chen W, Daines MO, Hershey GK: Methylation of STAT6 modulates STAT6 78. Garavelli JS: The RESID Database of Protein Modifications as a resource phosphorylation, nuclear translocation, and DNA-binding activity. J and annotation tool. Proteomics 2004, 4(6):1527-1533. Immunol 2004, 172(11):6744-6750. 79. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH: UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics (Oxford, England) 2007, 23(10):1282-1288. Pang et al. BMC Genomics 2010, 11:92 Page 16 of 16 http://www.biomedcentral.com/1471-2164/11/92

80. Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, Binkley G, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Krieger CJ, Livstone MS, Miyasato SR, Nash RS, Oughtred R, Skrzypek MS, Weng S, Wong ED, Zhu KK, Dolinski K, Botstein D, Cherry JM: Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic acids research 2008, , 36 Database: D577-581. 81. Hart GT, Lee I, Marcotte ER: A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinformatics 2007, 8:236. 82. Rivals I, Personnaz L, Taing L, Potier MC: Enrichment or depletion of a GO category within a class of genes: which test?. Bioinformatics (Oxford, England) 2007, 23(4):401-407. 83. R Development Core Team: R: A language and environment for statistical computing. Vienna: R Foundation for statistical computing 2005.

doi:10.1186/1471-2164-11-92

Cite this article as: Pang et al.: Identification of arginine- and lysine-methylation in the proteome of Saccharomyces cerevisiae and its functional implications. BMC Genomics 2010 11:92.

6 Proteins deleterious on overexpression are associated with

high intrinsic disorder, specific interaction domains and

low abundance

The sixth chapter focuses on abundance-based effects. In it, I analyse a dataset generated by systematic overexpression of S. cerevisiae proteins,184 and explore what makes a protein deleterious when it is overexpressed. The hypothesis tested was that overexpression of low-abundance proteins is more likely to cause deleterious phenotypes than overexpression of highly abundant proteins, since overexpressing an already abundant protein has little relative impact on its overall abundance. We also hypothesise that overexpressed proteins may have a greater chance of binding promiscuously to other proteins in the network, which could cause deleterious phenotypes.

This chapter has been published: Ma, L, Pang, CN, Li, SS & Wilkins, MR. Proteins deleterious on overexpression are associated with high intrinsic disorder, specific interaction domains and low abundance. J Proteome Res, 2010, 9, 1218-25.

I generated and processed domain-domain interaction datasets, analysed intrinsically disordered regions, analysed the functions of interaction domains, and drafted some sections of the manuscript. Liang Ma analysed the physicochemical parameters, analysed the protein complex data, analysed domain-domain interactions and their types, and co-drafted the manuscript. Simone S. Li provided data on protein pI, performed preliminary analysis on protein complex data, and prepared figures for publication. Prof. Marc R. Wilkins designed and directed the project, and co-drafted the manuscript.

Reproduced with permission from Ma, L, Pang, CN, Li, SS & Wilkins, MR. (2010) Proteins deleterious on overexpression are associated with high intrinsic disorder, specific interaction domains and low abundance. J Proteome Res, 9, 1218-25.

Copyright 2010 American Chemical Society.

Proteins Deleterious on Overexpression Are Associated with High Intrinsic Disorder, Specific Interaction Domains, and Low Abundance

Liang Ma, Chi Nam Ignatius Pang, Simone S. Li, and Marc R. Wilkins*

Systems Biology Initiative and School of Biotechnology and Biomolecular Sciences, University of New South Wales, NSW, Australia

Received August 4, 2009

In proteomics, there is a major challenge in how the functional significance of overexpressed proteins can be interpreted. This is particularly the case when examining proteins in cells or tissues. Here we have analyzed the physicochemical parameters, abundance level, half-life and degree of intrinsic disorder of proteins previously overexpressed in the yeast Saccharomyces cerevisiae. We also examined the interaction domains present and the manner in which overexpressed proteins are, or are not, associated with known complexes. We found a number of protein characteristics were strongly associated with deleterious phenotypes. These included protein abundance (where low-abundance proteins tend to be deleterious on overexpression), intrinsic disorder (where a striking association was seen between percent disorder and degree of deleterious effect), and the number of likely domain-domain interactions. Furthermore, we found a number of domain types, for example, DUF221 and the ubiquitin interaction motif, that were present predominantly in proteins that are deleterious on overexpression. Together, these results provide strong evidence that particular types of proteins are deleterious on overexpression whereas others are not. These factors can be considered in the interpretation of protein expression differences in proteomic experiments.

Keywords: Intrinsic disorder • domain-domain interactions • protein overexpression • proteomics

* Address for Correspondence: Prof. Marc Wilkins, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia. E-mail: [email protected], Fax: +61-2-9385-1483.

Journal of Proteome Research 2010, 9, 1218–1225. 10.1021/pr900693e © 2010 American Chemical Society. Published on Web 01/07/2010.

Introduction

Proteomic analysis is frequently applied to case-control studies. An aim, and indeed a result of many of these studies, is the detection of proteins that display significant changes in expression in association with a phenotype. A major challenge with over- or underexpressed proteins, notably those found in cells or tissues, has been the interpretation of their functional significance. It remains difficult to predict if a change in protein expression, while associated with a phenotype, is actually deleterious and/or causative and if it will affect cellular homeostasis.

Recently, Sopko et al.1 explored this issue but from a different perspective. Instead of examining a phenotype with proteomic approaches, they systematically overexpressed 5032 proteins, one per strain, in Saccharomyces cerevisiae and examined each overexpressor for phenotypic effect. One of the most striking observations is that 15% of proteins (768 proteins out of 5032 tested) generated a discernible deleterious phenotype on overexpression. All phenotypes showed slow growth in the presence of galactose, and included those with abnormal morphology or showing cell cycle arrest. As a group, they contained a high number of signaling molecules, transcription factors and cell cycle regulators. An exploration of whether deleterious proteins were overrepresented in known complexes suggested no significant difference, compared to proteins that had no discernible phenotype on overexpression.

Here we have further analyzed the systematic overexpression experiments from Sopko et al.1 We show that protein abundance level, some physicochemical characteristics, the type of interaction domains and the degree of intrinsic disorder of proteins are clearly associated with deleterious phenotype on overexpression. This suggests that these factors should be considered as part of the interpretation of proteomic analyses, especially for proteins that are overexpressed.

Materials and Methods

Deleterious and Neutral Proteins. A total of 5032 unique genes were overexpressed by Sopko et al.1 and of these, 768 were deleterious on overexpression while 4264 showed no discernible negative phenotype. The deleterious proteins were further classified as ‘lethal’, ‘abnormal morphology’ and ‘cell cycle arrest’. While it was of interest to analyze these separately, there was considerable overlap among these classes; for example, 111 proteins were classified as both ‘abnormal morphology’ and ‘cell cycle arrest’ and 21 proteins fell into all three classes. Statistical analysis relies on independence between the samples; however, disregarding proteins shared among classes left only 61 ‘lethal’ proteins, 60 ‘abnormal morphology’ proteins and 7 ‘cell cycle arrest’ proteins. Owing to these small sample sizes and thus weak statistical power, all proteins that were toxic on overexpression were analyzed as one group of ‘deleterious’ proteins and compared to the ‘neutral’ proteins of no discernible phenotype. The two groups are referred to by these names throughout this study. We note that, since the publication of Sopko et al.,1 three deleterious genes (ordered locus names YKL157W, YCL014W, YLL016W) have been merged with three other genes in the Saccharomyces Genome Database2 and Swiss-Prot3 databases (APE2, BUD3, SDC25, respectively). To keep our analyses consistent with Sopko et al.,1 we have maintained these as separate genes, as previously described.

Databases and Calculation of Protein Parameters. For the analysis of all proteins, sequence data was from Swiss-Prot release 54.3 Prior to calculation of protein parameters, protein sequences were processed to their mature forms according to Swiss-Prot annotation. Protein isoelectric point was calculated according to Bjellqvist et al.4 and grand average hydropathy (GRAVY) calculated according to Kyte and Doolittle.5 Protein domains and interaction domains were from iPfam release 20;6 for proteins carrying interaction domains, putative domain-domain interactions were calculated by determining the number of proteins which carry complementary interaction domains in the proteome, as in Pang et al.7 Protein abundance data, in copies per cell, was from Ghaemmaghami et al.,8 protein half-life estimates from Belle et al.9 and protein complex data from Gavin et al.10 Estimates of protein structural disorder, also known as intrinsic disorder, were from Kim et al.11 whereby a score and classifier were assigned to each residue in a protein and the percent disorder computed by dividing the number of disordered residues by protein length. Structural disorder is the tendency of a protein to lack a unique 3-D structure, instead existing in a dynamic ensemble of conformations.
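Three of the parameters described above are simple enough to illustrate in a few lines. The following is a minimal Python sketch, not the code used in the study (which relied on R and on the published tools cited above). It assumes that GRAVY is the mean Kyte-Doolittle hydropathy value per residue, that the disorder classifier output is available as one boolean per residue, and that a putative interaction partner is any other protein carrying at least one domain complementary to one of the query protein's domains. The function names, the input formats and the decision to exclude a protein from its own partner count are illustrative assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


def percent_disorder(disorder_flags):
    """Percent disorder: disordered residues divided by protein length, times 100.

    `disorder_flags` is an assumed input format: one boolean per residue,
    True where the classifier called the residue disordered.
    """
    if not disorder_flags:
        raise ValueError("protein must have at least one residue")
    return 100.0 * sum(disorder_flags) / len(disorder_flags)


def count_putative_partners(protein_domains, interacting_pairs):
    """Count, for each protein, the other proteins carrying a complementary domain.

    `protein_domains` maps a protein name to the set of interaction domains it
    carries; `interacting_pairs` is an iterable of (domain, domain) pairs known
    to interact. Both formats are assumptions made for this sketch.
    """
    # Build a symmetric lookup: domain -> set of domains it can bind.
    complementary = {}
    for first, second in interacting_pairs:
        complementary.setdefault(first, set()).add(second)
        complementary.setdefault(second, set()).add(first)

    counts = {}
    for protein, domains in protein_domains.items():
        # All domains that could bind any domain of this protein.
        wanted = set()
        for domain in domains:
            wanted |= complementary.get(domain, set())
        # Count other proteins carrying at least one of those domains
        # (self-matches are excluded here by assumption).
        counts[protein] = sum(
            1
            for other, other_domains in protein_domains.items()
            if other != protein and wanted & other_domains
        )
    return counts
```

For example, a protein whose disorder flags are true for 2 of 4 residues has 50% disorder, and in a toy proteome where hypothetical domains DomA and DomB interact, a protein carrying DomA has as many putative partners as there are other proteins carrying DomB.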

Data Transformation. Log10 transformations were applied to protein molecular weight, abundance, half-life and data such as the number of domain-domain interactions. Inverse-sine (arcsin) transformations were applied to percentage data. Neither log transformations nor inverse-sine transformations affect the outcome of hypothesis testing,12 and in most cases, they were used only to assist in visualization of data distributions. The log transformation was also used to improve linearity prior to regression analysis.

Statistical Analyses and Graphs. Wilcoxon rank-sum tests were used to compare the distributions of protein molecular weight, pI, GRAVY scores, abundance, half-life, intrinsic disorder and the number of domain-domain interactions between deleterious proteins and neutral proteins. The Wilcoxon rank-sum test is a nonparametric test for assessing whether two independent samples of observations come from the same distribution. All tests were performed with a significance level of 5%. Covariances of pairs of parameters were studied through calculating Pearson correlation coefficients. For covariance analysis, a ρ-value of less than 0.3 was considered to indicate little or no association between pairs of parameters. All statistical analyses, as well as box and whisker plots, density estimates, scatter plots and bar plots were undertaken with R.13 Density estimates of all distributions were based on a Gaussian kernel.14

Figure 1. Physicochemical properties of neutral and deleterious proteins. (a) Protein molecular weight, log scale (b) protein pI and (c) protein grand average hydropathy (GRAVY). Vertical lines in all graphs indicate median values.

Results

In their generation and analysis of the deleterious and neutral overexpressed proteins, Sopko et al.1 focused their investigation predominantly on the functions of proteins. Here, we are instead investigating the characteristics of the deleterious proteins and the manner in which they interact with other proteins. We hypothesize that the characteristics of the proteins that are deleterious on overexpression are likely to be different to those that are neutral on overexpression in the yeast cell.

Proteins Deleterious on Overexpression Show Different Physicochemical Characteristics to Those of Neutral Proteins. We examined protein size, isoelectric point and hydropathy to determine if deleterious proteins are different in these parameters to neutral proteins. It was found that deleterious proteins tended to be larger than neutral proteins. They showed fewer proteins of mass 10-20 kDa (Figure 1a), with median values of 54 and 39 kDa, respectively. This is a significant difference (Wilcoxon rank-sum test, p = 2.2 × 10^-16). In both cases, a bimodal distribution of protein size was evident. For isoelectric point, it was found that deleterious proteins showed a reduced degree of very acidic proteins, in association with a slightly larger proportion of proteins of neutral pI (Figure 1b). The distribution of pI was not significantly different between neutral and deleterious overexpressors (Wilcoxon rank-sum test, p = 0.12). Examination of protein hydropathy (Figure 1c) showed that deleterious proteins as a group are more hydrophilic than neutral proteins, having median GRAVY scores of -0.47 and -0.36, respectively. This is a significant difference between these two groups of proteins (Wilcoxon rank-sum test, p = 2.9 × 10^-9).

Proteins Deleterious on Overexpression Are of Lower Abundance in the Wild-Type Cell. To understand if deleterious effects are associated with the degree of overexpression of a protein, we investigated the wild-type abundance levels of deleterious and neutral proteins. Abundance data was available for a total of 504 deleterious and 2601 neutral proteins from a large-scale study.8 The abundance of deleterious proteins ranged from 57 to 524 000 copies per cell with a median of 1970 copies per cell, while abundance of neutral proteins ranged from 41 to 883 000 copies per cell with a median of 2350 copies per cell. This difference was statistically significant (Wilcoxon rank-sum test on log-transformed data, p = 0.003). The most abundant deleterious protein was the histone H4 protein, while a ketol-acid reductoisomerase ILV5 was the most abundant neutral protein. There was substantial overlap between the distributions of deleterious and neutral proteins (Figure 2); however, higher abundance proteins tended to have a neutral overexpression phenotype. For example, of the 59 proteins in yeast with >100 000 copies per cell, only 6 out of an expected 11 were deleterious. On the other hand, of the 24 proteins with <100 copies per cell, double the number of proteins (8 rather than the expected 4) were deleterious on overexpression. Overall, the degree of overexpression of protein in a cell compared to its normal wild-type level is associated with the likelihood of a protein being deleterious.

Figure 2. Box and whisker plot of protein abundance, for proteins that are deleterious on overexpression and those that are neutral on overexpression. The abundance of proteins that are deleterious is lower than that of proteins that are neutral on overexpression. Note that copies per cell is represented on a logarithmic scale.

Proteins Deleterious on Overexpression Have Shorter Half-Lives. Sopko et al.1 noted that many proteins deleterious on overexpression were cell-cycle associated, implying that the mistiming of protein expression can be of consequence in the cell. To explore whether the dynamics of protein turnover is also different between neutral and overexpressed proteins, we studied protein half-lives using data from Belle et al.9 Proteins that are deleterious on overexpression (total 452) were found to have a shorter half-life (median of 34 min) as compared to the half-life of neutral proteins (total 2267, median of 45 min) (Figure 3). This difference is significant (Wilcoxon rank-sum test, p = 1.7 × 10^-7) and suggests that proteins which show tight regulation have a tendency to be deleterious when overexpressed.

Figure 3. Proteins deleterious on overexpression have shorter half-lives than neutral overexpressed proteins. Note that half-life in this graph is plotted on a logarithmic scale.

Proteins Deleterious on Overexpression Show a Higher Degree of Intrinsic Disorder. Gsponer et al.15 noted that proteins with short half-life are often high in structural disorder. Having observed that deleterious proteins tend to be of shorter half-life, this prompted us to investigate if they have similar or different degrees of structural disorder to proteins that are neutral on overexpression. We found that deleterious proteins show a much higher degree of structural disorder than those that were neutral on overexpression (Figure 4a); the presence of deleterious proteins with 40-60% of structural disorder was particularly noteworthy. This difference was significant (Wilcoxon rank-sum test, p = 2.2 × 10^-16). We further investigated whether the degree of toxicity for proteins that are deleterious on overexpression, as documented by Sopko et al.1 on a scale of 1 (lethal on overexpression) to 5 (wild-type), showed any relationship with the degree of structural disorder. A striking relationship was seen between the percent disorder of proteins and their degree of toxicity when overexpressed (Figure 4b). Proteins that were lethal on overexpression showed the highest median percent disorder (36%) and there was a consistent drop in median percent disorder in proteins that were less toxic on overexpression, with the exception of proteins of toxicity level 3. Proteins that had no detectable phenotypic effect on overexpression (i.e., apparently wild-type) showed a median of 18% intrinsic disorder.

Figure 4. Analysis of intrinsic disorder in overexpressed proteins. (a) Deleterious proteins show a higher degree of intrinsic disorder, as compared to neutral proteins. (b) Box and whisker plot showing that the degree of intrinsic disorder is highest in proteins that are most toxic on overexpression (group 1) but lowest in proteins that have no phenotypic effect on overexpression (group 5). With the exception of group 3, there is a clear increase in median disorder in association with increasing toxicity.

Proteins Deleterious on Overexpression Have More Domain-Domain Interactions and Show Bias in Their Types of Interaction Domains. Sopko et al.1 explored whether there was a different proportion of deleterious proteins, as opposed to neutral, annotated as subunits of complexes in the MIPS database. They concluded that there was no association between membership of complexes and toxicity, implying that protein-protein interactions of deleterious overexpressors are not likely to mediate any toxic effect. To further investigate if this is the case, we undertook a more detailed analysis of protein-protein interactions, based on domain-domain interactions.7 This serves as a structurally based means to measure interactions and avoids issues with high false positives in nonstructurally measured interactions. The number of possible iPfam domain-domain interactions of proteins that are deleterious or neutral on overexpression was compared (Figure 5a). Deleterious proteins had 1-238 possible domain-domain interactions with a median of 23, while neutral proteins had 1-245 domain-domain interactions with a median of 12. This difference in the number of domain-domain interactions (Figure 5a) was significant (Wilcoxon rank-sum test, p = 3.8 × 10^-8).

Having found differences in the numbers of domain-domain interactions of deleterious and neutral proteins, we wished to then understand if there were also differences in the types of interaction domains present in these proteins. Each type of iPfam interaction domain was tallied for the deleterious and neutral proteins. A total of 357 deleterious proteins and 1658 neutral proteins had interaction domains; 70 deleterious and 299 neutral proteins had more than one kind of domain. A noteworthy observation was that numerous interaction domains were found uniquely on deleterious proteins and were not found in any neutral proteins (Table 1). These interaction domains included the ubiquitin interaction motif, protein domains involved in COPII vesicle coat complex formation, the response regulator receiver domain involved in signal transduction, gelsolin repeat domains involved in actin depolymerisation and domains found on cation transporters. A further domain of unknown function, DUF221, was also uniquely found in deleterious proteins. These interaction domains are not widespread in the yeast proteome. For example, the ubiquitin interaction motif is found on 5 proteins in the yeast proteome; 4 of these were deleterious on overexpression. The DUF221 domain is present in 4 proteins in the yeast proteome and all 4 of these proteins are deleterious on overexpression.

Figure 5. Deleterious proteins have a much higher number of domain-domain interactions than nondeleterious proteins and these have particular characteristics. (a) Domain-domain interactions of deleterious and neutral proteins in the proteome; deleterious proteins clearly show a greater number of domain-domain interactions. (b) Biomolecular interaction-associated domains overrepresented in deleterious proteins, represented by fold enrichment over neutral proteins. The horizontal dotted line is the 1:1 ratio equivalent to zero enrichment.

Table 1. Interaction Domains That Appeared Uniquely in Deleterious Proteins, but Not in Neutral Proteins (a)

interaction domain (Pfam ID) | occurrence in deleterious proteins | occurrence in neutral proteins | occurrence in yeast proteome
Ubiquitin interaction motif (PF02809) | 4 | 0 | 5
DUF221 (domain of unknown function) (PF02714) | 4 | 0 | 4
Cation transporter/ATPase C-terminus (PF00689) | 3 | 0 | 5
Cation transporter/ATPase N-terminus (PF00690) | 3 | 0 | 7
Gelsolin repeat domain (PF00626) | 3 | 0 | 4
Response regulator receiver domain (PF00072) | 3 | 0 | 4
Sec23 beta sandwich (PF08033) and helical (PF04815) domain | 3 | 0 | 4
Zinc finger Sec23, Sec24 (PF04810) | 3 | 0 | 4

(a) This table is limited to domains found in at least 3 deleterious proteins and includes a domain of unknown function (DUF221). A full list of domains enriched in deleterious proteins is in Supplementary Table 1.

compared to those that are neutral, show differences in many characteristics. A final question we asked is whether there is any correlation between these characteristics and whether these correlations are different in proteins that are deleterious or neutral on overexpression. The parameters of protein mass, pI, average hydropathy, abundance, half-life, disorder and domain-domain interactions were compared in pairs and correlation coefficients for each pairwise comparison were calculated (see Supplementary Figure 1). Only two comparisons showed a correlation of ρ > 0.3. A weak positive correlation was seen between protein abundance and half-life (ρ-value 0.34 and 0.29 for deleterious and neutral proteins, respectively). A strong negative correlation was seen between protein hydr-
opathy and structural disorder (F-value -0.68 and -0.61 for In addition to interaction domains found uniquely associated deleterious and neutral proteins, respectively). The latter with deleterious proteins, there were other interaction domains correlation shows that highly hydrophilic proteins have a that were more commonly seen in deleterious proteins. Figure tendency to be structurally disordered and that highly hydro- 5b shows the degree of overrepresentation of interaction phobic proteins have a tendency to be structurally ordered domains that had at least 4 occurrences among deleterious (Figure 7). For both correlations, F-values were similar for proteins. It can be seen that a variety of interaction domain proteins deleterious or neutral on overexpression. types, ranging from those involved in transcriptional regulation Discussion (homeobox) through to others involved in protein-protein interaction (F-box, ankyrin, SNARE domain) are overrepre- In their landmark study, Sopko et al.1 overexpressed 5032 sented up to 28-fold in deleterious proteins over those that are proteins in S. cerevisiae and showed that 15% of these were of neutral. The complete list of domains, and their degree of deleterious phenotype. They noted that the deleterious proteins overrepresentation, is in Supplementary Table 1. were enriched for proteins of specific function, namely, those Some Complexes Have No Proteins That Are Deleterious involved in signaling, regulation of transcription or under cell- on Overexpression. Having found a clear bias in many aspects cycle control. Here, we have extended the analysis of results of proteins that are deleterious on overexpression, we wished from this large overexpression experiment to better understand to understand if deleterious proteins are randomly found in why certain proteins were deleterious on overexpression. 
We protein complexes in the proteome or if they are associated believe this is of relevance to proteomic researchers who, with some complexes but not others. Overexpressed proteins having found overexpressed proteins, are seeking to understand mapped to a total of 482 known yeast complexes.10 There were if this might lead to deleterious effects in the system they are 178 complexes (37%) which had no proteins deleterious on studying. overexpression and 304 (63%) that contained one or more Physicochemical Properties of Deleterious Proteins. The deleterious proteins (Figure 6a). Complexes that had no size of proteins deleterious on overexpression was, as a group, deleterious members included mitochondrial ribosomal large larger than those that were neutral on overexpression. A lower subunit (30 subunits), exosome 3′-5′ exoribonuclease complex quantity of proteins of mass 10-20 kDa was clearly seen, (19 subunits) and the SKI complex (17 subunits). By contrast, corresponding to those composed of a single domain.16 A complexes that contained proteins deleterious on overexpres- higher quantity of deleterious proteins with two or more sion had up to 11 deleterious subunits (Figure 6b). In such domains was evident. It was also seen that deleterious proteins complexes, a strong positive correlation was seen between the were collectively more hydrophilic than neutral proteins. While number of overexpressed proteins in a complex and the average hydropathy is an imprecise measure of protein type, number of these which were deleterious (r ) 0.72). Some small where some membrane proteins are of high hydrophobicity complexes had all subunits deleterious on overexpression; these but others are not,17 this observation suggests that overexpres- included the Vps27/Hse1 protein complex and complex #43110 sion-associated deleterious effects are not due to an overabun- which contains Rli1p and Rps7Ap proteins. 
dance of hydrophobic proteins interacting promiscuously in Correlation Analysis of Protein Parameters. The above the cell. We also noted that there was no ‘spike’ of proteins analyses revealed that proteins deleterious on overexpression, deleterious on overexpression of pI 6-6.5, corresponding to


the approximate pH of the yeast cytoplasm.18 This suggests that precipitation of proteins at their pI or formation of inclusion bodies, documented in association with heterologous overexpression in S. cerevisiae,19 is unlikely to be present or a reason for deleterious overexpression. Together, these data suggest that, while physicochemical parameters are clearly biased, they are not clear indicators which can be used in isolation to predict the functional impact of an overexpressed protein.

Deleterious Proteins Show a High Degree of Intrinsic Disorder. A striking observation in our study was that deleterious proteins, as a group, had shorter half-life and showed enrichment in predicted structural disorder. A relationship of half-life and intrinsic disorder has been recently reported15,20,21 and intrinsic disorder is also reported to be associated with proteins under tight regulation.15,22 However, we also found that the severity of deleterious phenotype showed a strong positive association with percent of disorder. Intrinsic disorder is thus likely to be an informative protein parameter when investigating the functional impact of any overexpressed protein. It can be easily calculated with bioinformatics tools (e.g., Kim et al.11) for proteins of interest.

Our observations on protein disorder and half-life would suggest that many deleterious proteins are regulatory. This was noted by Sopko et al.1 and is supported by other observations that regulatory proteins have short half-life,23 are enriched in disordered regions,24 are likely to cause deleterious phenotypes when overexpressed25 and are low in abundance.8 Indeed, we did note that deleterious proteins were, as a group, of lower abundance than those that were neutral on overexpression. For low-abundance proteins, overexpression would cause a dramatic fold increase in copies per cell and could disrupt cellular homeostasis. By contrast, proteins that are normally of high abundance would show a smaller fold increase in copies per cell when overexpressed. This is more likely to be accommodated by compensatory mechanisms in the cell. Interestingly, recent quantitative analysis of protein abundances in individual yeast or human cells has shown that a range of expression levels is seen and can be tolerated for many, but certainly not all proteins.26,27

Figure 6. Some complexes contain no proteins that are deleterious on overexpression, while others contain many. (a) The number of known complexes which contain no deleterious proteins versus the number of complexes that contain one or more deleterious proteins. (b) For complexes that contain proteins which are deleterious on overexpression, the number of deleterious proteins tends to increase with the size of a complex (line of best fit, r = 0.72).

The Impact of Overexpression on Protein Complexes. There is considerable debate surrounding the effect of protein overexpression on complexes. Previous studies have suggested that overexpression of the components of complexes has a limited role in deleterious phenotypes.1,28,29 Neither the core or attachment units of yeast protein complexes10 were reported to be enriched for deleterious proteins29 and the topology of the protein complex was reported to have a limited role in determining which protein would be deleterious upon overexpression.28 However, we have found that proteins deleterious on overexpression were not present in 37% of the yeast complexes recently defined by Gavin et al.10 in their systematic study. This included some complexes with large numbers of subunits such as the exosome (17 subunits). It is interesting to consider that the exosome complex contains symmetrical components30 which can self-associate and form alternative species of the protein complex upon overexpression. This may reduce the likelihood of promiscuous binding with other proteins and a deleterious phenotype. It also raises the question

of whether proteins that self-associate, to form homodimers or larger homomultimers, are less likely to be deleterious on overexpression. A lack of data for the propensity of proteins to self-multimerise makes it difficult to currently answer this question.

In considering the likelihood that protein overexpression will be deleterious on a complex, we must also consider if temporal patterns of expression will be altered. Recently, it was proposed that many protein complexes contain static proteins which are expressed in a constitutive manner along with some dynamically regulated proteins.31 The dynamic proteins are expressed ‘just-in-time’ to activate complexes, but only when the function of the whole complex is required. This was illustrated for the cell cycle, involving many cyclin proteins. This ‘just-in-time’ activation model can explain why deleterious proteins are enriched for many cell-cycle proteins;1 overexpression of the dynamic component activates the complex even when it is not required to act. In contrast, the overexpression of static subunits cannot activate the complex. This strongly suggests that the cell will tolerate ‘noise’ in expression levels for constitutive proteins of a complex but not their tightly regulated ‘just-in-time’ protein subunits. This observation is critical for the interpretation of proteomic expression data; however, it is acknowledged that there is relatively little information of this type available in databases.

Domain-Domain Interactions and Types. Sopko et al.1 examined overexpressed proteins for enrichment of any domain type. We, alternatively, focused on potential domain-domain interactions and showed that deleterious proteins have a much higher number of these than neutral proteins. The reason why proteins with a large number of domain-domain interactions are deleterious on overexpression is that they could form undesirable interactions15 with proteins that have a compatible interaction domain but usually interact with other proteins.25 The binding of an overexpressed protein with the native binding partners can become saturated, permitting proteins with the next highest affinity to interact with the overexpressed protein.32 Dissociation constants dictate that the more highly overexpressed a protein becomes, the more likely it is to show this effect. We further showed that some domains involved in protein-protein or protein-nucleic acid interactions were uniquely or almost uniquely associated with deleterious proteins in the yeast proteome. Interestingly, this has highlighted specific essential processes other than the cell cycle that are susceptible to overexpression. For example, the gelsolin domain and Sec23/Sec24 related domains are present in Sec23p, Sec24p and Sfb3p; all proteins are associated with the formation of the COPII vesicle coat, essential for retrograde vesicular transport between ER and Golgi.33 Proteins containing the SNARE domain, also strongly overrepresented in deleterious proteins, also function in the same retrograde vesicle transport pathway. In a further example, the deleterious proteins Snl1p, Skn7p and Ssk1p all contain the response regulator receiver domain and act in the branched two-component osmosensing pathway.34

Figure 7. The average hydropathy of proteins (GRAVY) shows a strong negative correlation with the percent of protein disorder. Highly hydrophilic proteins have a tendency to be structurally disordered and highly hydrophobic proteins have a tendency to be structurally ordered. This trend is similar for proteins which are deleterious (a) or neutral (b) on overexpression, and is common to all proteins (c).

Conclusions

This study has shown that S. cerevisiae proteins which are deleterious on overexpression have specific characteristics, as compared to proteins of neutral phenotype. However, there appear to be different reasons why certain proteins are deleterious, reflecting their roles in different pathways and cellular functions. Some proteins are deleterious on overexpression in that they carry specific domains that by themselves, or through their interactions, are toxic. Other proteins show high intrinsic disorder, which is likely to be associated with their deleterious effect. Yet others are tightly regulated, and are likely to perturb cellular homeostasis when the dynamics of their expression is disrupted. It is suggested that these particular features, where known, might serve to prioritize ‘proteins of interest’ in proteomic studies when linking these to a deleterious phenotype. However, it must equally be kept in mind that combinations of many protein characteristics,

reflecting the diversity of protein functions in the cell, might need to be co-considered to make strong predictions. Artificial intelligence approaches, such as hidden Markov models or neural networks, could be of use in future studies, although any predictions will need careful interpretation to ensure they are biologically relevant.

Acknowledgment. C.N.I.P. is the recipient of an Australian Postgraduate Award. M.R.W. and the New South Wales Systems Biology Initiative acknowledge financial support from the New South Wales State Government Office for Science and Medical Research and the University of New South Wales.

Supporting Information Available: A full list of domains enriched in deleterious proteins and figure of pairwise comparisons. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Sopko, R.; Huang, D.; Preston, N.; Chua, G.; Papp, B.; Kafadar, K.; Snyder, M.; Oliver, S. G.; Cyert, M.; Hughes, T. R.; Boone, C.; Andrews, B. Mapping pathways and phenotypes by systematic gene overexpression. Mol. Cell 2006, 21 (3), 319–30.
(2) Hong, E. L.; Balakrishnan, R.; Dong, Q.; Christie, K. R.; Park, J.; Binkley, G.; Costanzo, M. C.; Dwight, S. S.; Engel, S. R.; Fisk, D. G.; Hirschman, J. E.; Hitz, B. C.; Krieger, C. J.; Livstone, M. S.; Miyasato, S. R.; Nash, R. S.; Oughtred, R.; Skrzypek, M. S.; Weng, S.; Wong, E. D.; Zhu, K. K.; Dolinski, K.; Botstein, D.; Cherry, J. M. Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res. 2008, 36 (Database issue), D577–81.
(3) UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2007, 35 (Database issue), D193–7.
(4) Bjellqvist, B.; Hughes, G. J.; Pasquali, C.; Paquet, N.; Ravier, F.; Sanchez, J. C.; Frutiger, S.; Hochstrasser, D. The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis 1993, 14 (10), 1023–31.
(5) Kyte, J.; Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 1982, 157 (1), 105–32.
(6) Finn, R. D.; Marshall, M.; Bateman, A. iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21 (3), 410–2.
(7) Pang, C. N.; Krycer, J. R.; Lek, A.; Wilkins, M. R. Are protein complexes made of cores, modules and attachments. Proteomics 2008, 8 (3), 425–34.
(8) Ghaemmaghami, S.; Huh, W. K.; Bower, K.; Howson, R. W.; Belle, A.; Dephoure, N.; O’Shea, E. K.; Weissman, J. S. Global analysis of protein expression in yeast. Nature 2003, 425 (6959), 737–41.
(9) Belle, A.; Tanay, A.; Bitincka, L.; Shamir, R.; O’Shea, E. K. Quantification of protein half-lives in the budding yeast proteome. Proc. Natl. Acad. Sci. U.S.A. 2006, 103 (35), 13004–9.
(10) Gavin, A. C.; Aloy, P.; Grandi, P.; Krause, R.; Boesche, M.; Marzioch, M.; Rau, C.; Jensen, L. J.; Bastuck, S.; Dumpelfeld, B.; Edelmann, A.; Heurtier, M. A.; Hoffman, V.; Hoefert, C.; Klein, K.; Hudak, M.; Michon, A. M.; Schelder, M.; Schirle, M.; Remor, M.; Rudi, T.; Hooper, S.; Bauer, A.; Bouwmeester, T.; Casari, G.; Drewes, G.; Neubauer, G.; Rick, J. M.; Kuster, B.; Bork, P.; Russell, R. B.; Superti-Furga, G. Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440 (7084), 631–6.
(11) Kim, P. M.; Sboner, A.; Xia, Y.; Gerstein, M. The role of disorder in interaction networks: a structural analysis. Mol. Syst. Biol. 2008, 4, 179.
(12) Quinn, G.; Keough, M. Experimental Design and Data Analysis for Biologists; Cambridge University Press: Cambridge, 2002.
(13) R Development Core Team, R: A language and environment for statistical computing. http://www.R-project.org.
(14) Sheather, S. J.; Jones, M. C. A reliable data-based bandwidth selection method for kernel density estimation. J. R. Statist. Soc. B 1991, 53 (3), 683–90.
(15) Gsponer, J.; Futschik, M. E.; Teichmann, S. A.; Babu, M. M. Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science 2008, 322 (5906), 1365–8.
(16) Burley, S. K. An overview of structural genomics. Nat. Struct. Biol. 2000, 7, 932–4.
(17) Hennig, L. WinGene/WinPep: user-friendly software for the analysis of amino acid sequences. BioTechniques 1999, 26 (6), 1170–2.
(18) Calahorra, M.; Martinez, G. A.; Hernandez-Cruz, A.; Pena, A. Influence of monovalent cations on yeast cytoplasmic and vacuolar pH. Yeast 1998, 14 (6), 501–15.
(19) Cousens, L. S.; Shuster, J. R.; Gallegos, C.; Ku, L. L.; Stempien, M. M.; Urdea, M. S.; Sanchez-Pescador, R.; Taylor, A.; Tekamp-Olson, P. High level expression of proinsulin in the yeast, Saccharomyces cerevisiae. Gene 1987, 61 (3), 265–75.
(20) Tompa, P.; Prilusky, J.; Silman, I.; Sussman, J. L. Structural disorder serves as a weak signal for intracellular protein degradation. Proteins 2008, 71 (2), 903–9.
(21) Doherty, M. K.; Hammond, D. E.; Clague, M. J.; Gaskell, S. J.; Beynon, R. J. Turnover of the human proteome: determination of protein intracellular stability by dynamic SILAC. J. Proteome Res. 2009, 8 (1), 104–12.
(22) Edwards, Y. J.; Lobley, A.; Pentony, M. M.; Jones, D. T. Insights into the regulation of intrinsically disordered proteins in the human proteome by analysing sequence and gene expression data. Genome Biol. 2009, 10 (5), R50.
(23) Belle, A.; Tanay, A.; Bitincka, L.; Shamir, R.; O’Shea, E. K. Quantification of protein half-lives in the budding yeast proteome. Proc. Natl. Acad. Sci. U.S.A. 2006, 103 (35), 13004–9.
(24) Lobley, A.; Swindells, M. B.; Orengo, C. A.; Jones, D. T. Inferring function using patterns of native disorder in proteins. PLoS Comput. Biol. 2007, 3 (8), e162.
(25) Niu, W.; Li, Z.; Zhan, W.; Iyer, V. R.; Marcotte, E. M. Mechanisms of cell cycle control revealed by a systematic and quantitative overexpression screen in S. cerevisiae. PLoS Genet. 2008, 4 (7), e1000120.
(26) Cohen, A. A.; Geva-Zatorsky, N.; Eden, E.; Frenkel-Morgenstern, M.; Issaeva, I.; Sigal, A.; Milo, R.; Cohen-Saidon, C.; Liron, Y.; Kam, Z.; Cohen, L.; Danon, T.; Perzov, N.; Alon, U. Dynamic proteomics of individual cancer cells in response to a drug. Science 2008, 322, 1511–6.
(27) Newman, J. R.; Ghaemmaghami, S.; Ihmels, J.; Breslow, D. K.; Noble, M.; DeRisi, J. L.; Weissman, J. S. Single-cell proteomic analysis of S. cerevisiae reveals the architecture of biological noise. Nature 2006, 441 (7095), 840–6.
(28) Oberdorf, R.; Kortemme, T. Complex topology rather than complex membership is a determinant of protein dosage sensitivity. Mol. Syst. Biol. 2009, 5, 253.
(29) Semple, J. I.; Vavouri, T.; Lehner, B. A simple principle concerning the robustness of protein complex activity to changes in gene expression. BMC Syst. Biol. 2008, 2, 1.
(30) Pereira-Leal, J. B.; Levy, E. D.; Kamp, C.; Teichmann, S. A. Evolution of protein complexes by duplication of homomeric interactions. Genome Biol. 2007, 8 (4), R51.
(31) de Lichtenberg, U.; Jensen, L. J.; Brunak, S.; Bork, P. Dynamic complex formation during the yeast cell cycle. Science 2005, 307 (5710), 724–7.
(32) Wilkins, M. R.; Kummerfeld, S. K. Sticking together? Falling apart? Exploring the dynamics of the interactome. Trends Biochem. Sci. 2008, 33 (5), 195–200.
(33) Stagg, S. M.; LaPointe, P.; Razvi, A.; Gurkan, C.; Potter, C. S.; Carragher, B.; Balch, W. E. Structural basis for cargo regulation of COPII coat assembly. Cell 2008, 134 (3), 474–84.
(34) Li, S.; Ault, A.; Malone, C. L.; Raitt, D.; Dean, S.; Johnston, L. H.; Deschenes, R. J.; Fassler, J. S. The yeast histidine protein kinase, Sln1p, mediates phosphotransfer to two response regulators, Ssk1p and Skn7p. EMBO J. 1998, 17 (23), 6952–62.

PR900693E

7 Discussion

In this thesis, I have investigated many aspects of the dynamics of protein interaction networks. In chapter 2, I asked whether protein complexes are made of cores, modules, and attachments. Differences were found between core, module, and attachment proteins in protein abundance and half-life, along with evidence suggesting that core proteins are tightly regulated. Interactions of core or module proteins were more likely to be mediated by domain-domain interactions than those of attachment proteins, supporting their role in forming stable complexes. In chapter 3, I asked whether high-throughput protein-protein interaction data could provide clues on the architecture of protein complexes. Pairwise interaction data was shown to help in defining complex membership, while cores and modules of protein complexes could help determine the spatial proximity of proteins. Although predicted domain-domain interactions could provide useful clues on interactions within each protein complex, false positive interactions could complicate these analyses. In chapter 4, I asked whether post-translational modifications are surface accessible and whether this is associated with their protein-protein interactivity. Post-translational modifications known to be involved in protein-protein interactions, such as phosphoserine and methylarginine, were found to be clearly surface associated, whilst artifactual modifications were randomly distributed on proteins. In chapter 5, I undertook a proteome-wide screen for methylation in the yeast S. cerevisiae. Using peptide mass spectra and the FindMod tool, I found 83 high-confidence lysine and arginine methylation sites on 66 proteins. Methylated proteins were mainly involved in translation and ribosomal biogenesis and assembly, and methylation sites were found to be associated with specific sequence motifs. Preliminary evidence suggests lysine methylation could possibly block the action of ubiquitin ligase. In chapter 6, we asked what properties make proteins deleterious when overexpressed. Deleterious proteins were associated with high intrinsic disorder, specific interaction domains, and low abundance. The implications of all results in this thesis for the dynamics of protein interaction networks will now be discussed. The relevance of the results to the abundance-based effects, sequence-based effects, and conditional binding effects,28 and how these effects contribute to the dynamics of protein interaction networks, are also discussed.

7.1 The role of domain-domain interactions in the formation of stable protein complexes

An important issue regarding protein complexes is whether domain-domain interactions are important for their formation and stability. As domain-domain interactions tend to be stable, they are likely to contribute to high affinity interactions within a complex. The core, module, and attachment model of protein complexes, defined by Gavin et al.,2 is of particular interest as it attempts to model the ‘modularity’ and dynamic organisation of proteins in complexes. The model implies that a hierarchy of interactions exists in complexes, where interactions have varying degrees of stability. The results in chapter 2 confirm this expectation: interactions involving core proteins have a higher proportion of stable domain-domain interactions than interactions involving module proteins, and interactions involving attachment proteins have the lowest proportion of domain-domain interactions. These observations are also supported by Itzhaki et al. (2006),17 who showed that domain-domain interactions are frequently found in protein complexes, and they are most frequently found amongst core proteins, followed by module and attachment proteins. Using the yeast structural interaction network, Kim et al. (2006)23 similarly showed that proteins with multiple domain-domain interaction interfaces are more likely to form protein complexes.
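The comparison described above can be sketched as a simple tally. The snippet below is an illustrative sketch only: the records are made-up toy values, not data from chapter 2, and the function name `ddi_proportion_by_class` is hypothetical. It shows how the proportion of interactions mediated by a known domain-domain interaction could be computed for each protein class and then ranked.

```python
from collections import defaultdict

# Hypothetical toy records (NOT thesis data): each tuple is
# (class of the protein involved, is the interaction backed by a known
#  domain-domain interaction?)
interactions = [
    ("core", True), ("core", True), ("core", True), ("core", False),
    ("module", True), ("module", True), ("module", False), ("module", False),
    ("attachment", True), ("attachment", False),
    ("attachment", False), ("attachment", False),
]


def ddi_proportion_by_class(records):
    """Return the fraction of interactions with a domain-domain interaction per class."""
    totals = defaultdict(int)
    with_ddi = defaultdict(int)
    for protein_class, has_ddi in records:
        totals[protein_class] += 1
        if has_ddi:
            with_ddi[protein_class] += 1
    return {cls: with_ddi[cls] / totals[cls] for cls in totals}


proportions = ddi_proportion_by_class(interactions)
# Classes ordered from the highest to the lowest proportion
ranked = sorted(proportions, key=proportions.get, reverse=True)
print(proportions, ranked)
```

With real data, each record would come from mapping an observed interaction onto a structural domain-domain interaction resource; the toy values here were chosen only so that the ordering mirrors the hierarchy discussed in the text.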

Insights into the hierarchy of interactions in complexes can be gained from another perspective, via the mass spectrometry of intact protein complexes.185-187 Interactions within purified, intact complexes can be perturbed using ionic solutions of increasing strength and/or increasing the collision energy within a tandem mass spectrometer.185-187 These cause sub-complexes or individual proteins with weak affinity to dissociate from the rest of the complex, leaving the stable core intact. This generates smaller sub-complexes, which can be further analysed using tandem mass spectrometry. Studies by Robinson et al.185-187 used the abovementioned method to identify protein-protein interactions within the yeast exosome,185 yeast 19S proteasome lid,186 eukaryotic translation factor eIF3,187 and the yeast ribosomal stalk.188 Results from these studies185-187 showed that multi-subunit complexes are likely to consist of stable sub-complexes similar to core and module proteins, and weakly binding subunits similar to attachment proteins, which supports the core, module, and attachment model.

The results from chapter 2 suggest that domain-domain interactions are important for stable interactions in complexes.19 Yet there are other types of interactions that connect complexes from different biological processes to other parts of the interaction network, and act as scaffolds.4,5,67,68,189 Interactions that connect protein complexes are likely to be weak or transient interactions, since these interactions can form and dissociate quickly and dynamically in response to external stimuli.190 Therefore, instead of domain-domain interactions, which involve large interfaces and form stable interactions, interactions that connect protein complexes are likely to be characterized by small interaction interfaces190 with weak affinities14,191 but moderate specificity.18,190,191 These include domain-motif or domain-PTM interactions. Although domain-motif and domain-PTM interactions can be affected by dynamic expression of proteins,192,193 co-localization of interacting proteins,26 or post-translational modifications,192 the specificity of the interaction can also be determined by the sequence of the interaction domain and interaction motif.191,194 This highlights the importance of the sequence-based effect in contributing to the dynamics of interaction networks. With regard to the core, module, and attachment model of protein complexes, the interactions of attachment proteins are seldom mediated by domain-domain interactions,19 and therefore are likely to be involved in transient interactions that contribute to interactions between complexes. It is challenging to find interaction motifs and motifs specific for PTMs, which are usually short and degenerate sequences that occur at high frequency by chance in protein sequences.14 The consideration of structural environments, as undertaken in chapter 4, can improve the specificity of identifying short motifs and modification sites29,178,195 because they are likely to be found in coils, surface accessible regions, and intrinsically disordered regions.29,195
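The point that short, degenerate motifs match frequently by chance can be made concrete with a back-of-the-envelope estimate. The sketch below rests on loudly simplifying assumptions that are not from the thesis: residues are treated as independent and uniformly distributed over the 20 amino acids, the proteome size of 3,000,000 residues is only an order-of-magnitude figure, the four-residue pattern is a made-up example, and the 30% "searchable" fraction standing in for coils, exposed and disordered regions is hypothetical.

```python
ALPHABET_SIZE = 20  # the 20 standard amino acids


def match_probability(motif):
    """Chance that a random window matches the motif.

    `motif` is a list with one entry per position: a set of allowed
    residues, or None when any residue is accepted. Assumes uniform,
    independent residues (a simplification).
    """
    probability = 1.0
    for allowed in motif:
        probability *= 1.0 if allowed is None else len(allowed) / ALPHABET_SIZE
    return probability


def expected_chance_matches(motif, n_residues, searchable_fraction=1.0):
    """Expected number of chance matches, ignoring edge effects at protein ends."""
    windows = n_residues * searchable_fraction
    return match_probability(motif) * windows


# Hypothetical degenerate pattern: [RK]-x-x-[ST]
motif = [{"R", "K"}, None, None, {"S", "T"}]

whole = expected_chance_matches(motif, 3_000_000)
restricted = expected_chance_matches(motif, 3_000_000, searchable_fraction=0.3)
print(round(whole), round(restricted))
```

Under these assumptions the pattern matches a random window with probability 0.01, giving on the order of 30,000 chance matches across the assumed proteome. Restricting the search to the assumed 30% of residues in suitable structural environments removes a proportional share of these chance hits, which is the sense in which structural context can improve specificity.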

As domain-domain interactions are important for the stable binding of protein complexes, predicted domain-domain interactions were analysed to show whether these could be used to understand the architecture of protein complexes. In chapter 3,40 we showed that accurate prediction of domain-domain interactions might be possible for complexes that contain essentially unrelated proteins, whose interactions are mediated by hetero-domain-domain interactions. The domain-domain interaction thus may provide clues to structure. However, this relies on the availability of known structural information and domain-domain interactions. For example, there were few predicted domain-domain interactions within the SAGA and mediator complexes, as few 3-D structures have been elucidated for these two complexes. Where the interactions within a protein complex are not known, experiments such as X-ray crystallography and mass spectrometric analysis of intact complexes, reviewed in Robinson et al. (2007),196 could help fill the knowledge gap. For protein complexes that contain multiple paralogous proteins that interact by homologous interaction domains, many false positive homo-domain-domain interactions are expected to be predicted. An example of this was the 20S proteasome.40 The presence of paralogous proteins in protein complexes may arise due to the duplication of proteins,23 and duplication of homomeric interactions197-199 as a part of the evolution of protein complexes. Similarly, homo-domain-domain interactions may contribute to the very high proportion of domain-domain interactions observed for interactions involving core proteins, as illustrated in figure 3 of chapter 2. Despite this, my results highlight the importance of protein structural data in understanding how proteins interact with each other within protein complexes, and contribute to our knowledge on the architecture of protein complexes.
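The reason paralogous subunits inflate predicted interactions can be illustrated with a minimal sketch of domain-based prediction. This is not the pipeline used in chapter 3: the function `predict_interactions`, the protein labels and the domain names are all hypothetical, and the rule applied (two proteins are predicted to interact if any pair of their domains appears in a list of known interacting domain pairs) is a deliberately simplified assumption.

```python
from itertools import combinations


def predict_interactions(protein_domains, interacting_domain_pairs):
    """Predict protein pairs that share at least one known interacting domain pair.

    protein_domains: dict mapping protein name -> set of domain names.
    interacting_domain_pairs: iterable of (domain, domain) tuples; a pair
    such as ("DomA", "DomA") represents a homo-domain-domain interaction.
    """
    known = {frozenset(pair) for pair in interacting_domain_pairs}
    predicted = set()
    for a, b in combinations(sorted(protein_domains), 2):
        if any(frozenset((da, db)) in known
               for da in protein_domains[a]
               for db in protein_domains[b]):
            predicted.add((a, b))
    return predicted


# Hypothetical complex of four paralogs that all carry the same domain
paralog_complex = {"P1": {"DomA"}, "P2": {"DomA"}, "P3": {"DomA"}, "P4": {"DomA"}}
paralog_predictions = predict_interactions(paralog_complex, [("DomA", "DomA")])

# Hypothetical complex of unrelated proteins with distinct domains
hetero_complex = {"Q1": {"DomB"}, "Q2": {"DomC"}, "Q3": {"DomD"}}
hetero_predictions = predict_interactions(hetero_complex, [("DomB", "DomC")])

print(len(paralog_predictions), hetero_predictions)
```

In the paralog example every one of the six possible protein pairs is predicted to interact, even though in a real assembly only some subunits would be in contact, so the prediction cannot distinguish true neighbours from false positives. In the hetero-domain example only the single pair supported by a known domain pair is returned.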

7.2 Using high-throughput interaction data to determine the architecture of protein complexes

With the large amount of high-throughput protein-protein interaction data available, it is important to know whether this experimental data can be used to deduce the topology of multi-subunit complexes. In chapter 3,40 the high-throughput interaction data for several protein complexes were analysed in detail, including the 19S and 20S proteasome, mediator and SAGA complexes. These analyses have shown that current high-throughput interaction data, without support from other sources of experimental data, cannot be used to determine the precise topology of protein complexes.

However, high-throughput interaction data can provide valuable clues on the membership of protein complexes and the spatial proximity of proteins in complexes.

This is supported by a number of lines of evidence. Firstly, pairwise interaction data from the filtered yeast interactome (FYI)63 accurately predicted the membership of protein complexes. Members of each complex are connected by pairwise interactions, and a lack of interactions can be observed between sub-complexes. The 26S proteasome is an example of a complex that is known to dissociate into two smaller sub-complexes, the 19S proteasome and the 20S proteasome,200 and no pairwise interactions were observed between the two sub-complexes in the FYI dataset.40 Secondly, the core proteins defined in Gavin et al.2 also identified these two sub-complexes, in which different core proteins were assigned to the 20S proteasome and 19S proteasome respectively. Thirdly, core and module proteins in all four complexes studied tended to form physical contacts in the complex, which provided clues on the spatial proximity of proteins. Fourthly,19 Sardiu et al.201 analysed interaction data for the S. cerevisiae Rpd3 histone deacetylase complexes201 and the human interaction network centred on AAA+ ATPases Tip49a and Tip49b.202 These complexes were shown to form clear cores, modules, and attachments.2,19 The incompleteness of high-throughput interaction datasets remains an obstacle for detailed analysis of interactions within protein complexes, and it seems that neither the pairwise interaction data nor the affinity purification data were complete. This is evident in that none of the high-throughput interaction datasets studied in Krycer et al.40 was able to identify the SAGA-associated sub-complex. The use of other relevant information, such as protein abundance, network topology or interaction density, may help increase the resolution of mapping the architecture of protein complexes.5
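The first line of evidence above can be illustrated with a minimal sketch, assuming that pairwise interactions are treated as an undirected graph and that sub-complex membership is read off as connected components. This is not the procedure used in chapter 3; the protein labels and edges are hypothetical placeholders rather than data from the FYI dataset.

```python
from collections import defaultdict

def connected_components(edges):
    """Group proteins into sets linked by at least one chain of pairwise interactions."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first walk from an unvisited protein collects one component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical toy edges: two groups with no interaction between them,
# analogous to two sub-complexes that share no pairwise interactions.
toy_edges = [("A1", "A2"), ("A2", "A3"), ("B1", "B2"), ("B2", "B3"), ("B3", "B1")]
components = connected_components(toy_edges)
```

In this toy case the two groups come out as separate components, which is the pattern described for the 19S and 20S proteasome in the FYI dataset.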

7.3 The role of surface accessibility in post-translational modifications

The structural environment in which post-translational modifications are found can influence their role in protein-protein interactions. Modification sites need to be accessible on the surface of proteins for them to be recognized by modification-specific interaction domains and contribute to the conditional binding effect. Prior to the study presented here, it was unclear which types of modifications were likely to be surface accessible, and whether there were any exceptions. To address this issue, in Pang et al.,29 I investigated the structural environment of 8378 incidences of 44 types of post-translational modifications.29 To improve the robustness of the results, 19 different algorithms were used to predict the structural environment of modified residues, and this provided a consensus view on which PTM is likely to be on the surface of proteins.
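One simple way to picture a consensus across predictors is a majority vote. The sketch below assumes, as a simplification, that each predictor gives a binary exposed/buried call for a residue; the 19 algorithms in Pang et al.29 predicted several aspects of structural environment, and the way their outputs were actually combined is not restated here. The function name, threshold and example calls are hypothetical.

```python
def consensus_exposed(predictions, threshold=0.5):
    """Return True if more than `threshold` of the predictors call the residue exposed."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    exposed_votes = sum(1 for call in predictions if call)
    return exposed_votes / len(predictions) > threshold

# Hypothetical calls from 19 predictors for one modified residue (True = exposed).
calls = [True] * 14 + [False] * 5
is_exposed = consensus_exposed(calls)
```

Using an odd number of predictors, as here, avoids ties under a simple majority rule.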

It was found that modifications likely to be involved in protein-protein interactions,13,15 such as phosphorylation and arginine methylation, are surface associated. Arginine methylation is commonly found in arginine-glycine-rich motifs and can be recognised and bound by the Tudor domain, which supports this observation.203 Some modifications found at the core of proteins or at buried residues, including 4-phosphoaspartic acid and ADP-ribosylation, can affect protein-protein interactions through allosteric regulation.15,29 Artifactual modifications, such as deamidated asparagine and pyrrolidone carboxylic acid, were found to be randomly distributed throughout the protein, and are therefore not likely to be involved in protein-protein interactions. While ordered domains can mediate stable binding of proteins in complexes, some transient contacts can be mediated by domain-PTM interactions.204 In these, the modified residues that mediate the interaction tend to be found in surface accessible and disordered regions. As surface accessibility and propensity for disorder were predicted from the protein sequence, this also suggests that a sequence-based effect is associated with the conditional binding effect.

Other studies have also supported the importance of intrinsically disordered regions in the role of post-translational modifications. Both Kim et al. (2008)205 and Gsponer et al. (2008)206 have shown that proteins with a high proportion of disordered regions are more likely to be kinase targets. Kim et al.205 showed that many kinases contain disordered regions that can be phosphorylated by another kinase, which allows kinases to interact with each other transiently in a cascading fashion. This provides a structural model of how kinases interact with each other to form signalling cascades.

Whereas phosphorylation is associated with intrinsically disordered regions, acetylation is found in ordered secondary structures. Using a comprehensive dataset of 3600 acetylation sites from human,29 Choudhary et al. (2009)207 have shown that lysine acetylation sites are typically found in helices and β-sheets. This is in contrast to the results presented in chapter 4,29 which show that acetylated lysine residues are preferentially found in coils. Even though lysine-acetylated residues are associated with ordered secondary structure, Kim et al. (2006)208 suggested that the side chain of lysine tends to be surface-exposed. Recent evidence also supports the importance of surface accessibility in post-translational modifications. Gnad et al. (2009)127 used statistical tests to compare surface accessibility of phosphorylated and non-phosphorylated sites for the same amino acid, and showed that phosphorylation sites in S. cerevisiae are preferably found in surface accessible regions.

At the start of my doctoral research, there was uncertainty on whether algorithms that predict surface accessibility and intrinsically disordered regions can provide structural insights into the biological function of proteins. The use of methods that predict intrinsically disordered regions was mostly confined to prediction of regions of proteins that could not be resolved in X-ray crystallography.209 The exceptions were a few studies in which intrinsically disordered regions were associated with interaction motifs,210 phosphorylation,179 methylation,211 and protein-protein interactions.210,212 My study29 has shown that it is possible to use algorithms that predict structural environment to gain insight on whether post-translational modifications are accessible on the surface of proteins, and whether this has an effect on their biological function. Pang et al. (2007)29 was published in the same issue of Journal of Proteome Research alongside three articles by Xie et al.213,214 and Vucetic et al.,215 which investigated the importance of intrinsic disorder in protein function. These papers were published at a time that is perhaps the beginning of an explosion of interest in intrinsically disordered regions and their role in cellular function.206,216-218

7.4 Identification of arginine- and lysine-methylation using peptide mass spectra

The development of a method that identifies arginine and lysine methylation using peptide mass spectra presents a number of technical challenges. In chapter 5, I introduced the FindMod program, developed by Wilkins et al. (1999),166 that was designed to search for modified peptides from peptide mass spectra. The original FindMod program has several shortcomings, which made the identification of high confidence post-translational modification sites from peptide mass spectra a difficult task. Firstly, even though the accuracy of the FindMod program was not formally benchmarked, it is likely that FindMod produces more false positives than a tandem mass spectrometry experiment.219-221 Secondly, unlike the use of tandem mass spectrometry, which can discover new post-translational modification sites with high accuracy, FindMod cannot determine the exact position of the modification site when multiple amino acids that can carry the modification are present in a peptide.166 Despite these disadvantages, FindMod was used because a large and comprehensive set of tandem mass spectrometry data for purified S. cerevisiae proteins was not publicly available. Peptide mass spectra were, however, available from Gavin et al. (2006).2 This is one of the most comprehensive peptide mass spectra datasets, which contains data for 2,607 of the ~6,500 proteins (40%) in the yeast proteome, and also contains a high number of replicate analyses per protein. Therefore, I decided to search for post-translational modifications in this large peptide mass spectra dataset. The decision to focus on the identification of arginine and lysine methylation was due to a number of factors, but mainly because it was shown in chapter 4 that arginine methylation tends to be surface accessible and is implicated in protein-protein interactions.

As the original FindMod algorithm has a high false positive rate and there was a reasonably high degree of noise in the ~30,000 peptide mass spectra from Gavin et al.,2 identification of arginine and lysine methylation from peptide mass spectra required improving the accuracy of FindMod. This led to the development of five filters that took advantage of the high number of replicate analyses per protein and the presence of overlapping peptides to improve the true positive rate of FindMod. These filters allowed the identification of methylation sites with an average true positive rate of 89% at 0.04 Da mass error. The results of this analysis provided a large number of high confidence methylation sites which can be further validated by experiments, and it is an example of how bioinformatics tools can generate new hypotheses which can then be tested and/or refined using hypothesis-driven analyses.222 This is especially relevant for questions that need to be tested on a global scale. One fascinating question that arose is whether lysine methylation could block the action of ubiquitin ligase, and therefore prevent the degradation of lysine-methylated proteins via the ubiquitin/proteasome pathway. The results from chapter 5 also suggest that careful re-analyses of publicly available tandem mass spectrometry data from repositories such as PeptideAtlas108 can possibly be a useful means of generating new biological insights.
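The mass-matching idea behind this kind of search can be sketched as follows. This is a minimal illustration and not the FindMod implementation or any of the five filters: it assumes a monoisotopic shift of 14.01565 Da per added methyl group, allows up to three methyl groups on lysine and two on arginine, and uses the 0.04 Da tolerance quoted above. The peptide sequence and both masses are hypothetical.

```python
METHYL_SHIFT_DA = 14.01565       # monoisotopic mass of one added methyl group (CH2)
MAX_METHYL = {"K": 3, "R": 2}    # lysine up to tri-, arginine up to di-methylation

def methyl_counts_matching(observed_mass, unmodified_mass, peptide, tolerance_da=0.04):
    """Return the numbers of methyl groups that explain the observed mass within tolerance."""
    max_groups = sum(MAX_METHYL.get(residue, 0) for residue in peptide)
    matches = []
    for n_methyl in range(1, max_groups + 1):
        predicted = unmodified_mass + n_methyl * METHYL_SHIFT_DA
        if abs(observed_mass - predicted) <= tolerance_da:
            matches.append(n_methyl)
    return matches

# Hypothetical peptide and masses, chosen only to illustrate the matching step.
peptide = "AKGRK"
matches = methyl_counts_matching(1028.5320, 1000.5000, peptide)

# Positions (1-based) of residues that could carry the modification.
candidate_sites = [i + 1 for i, residue in enumerate(peptide) if residue in MAX_METHYL]
```

Here the mass difference is consistent with two methyl groups, but three residues could carry them, which illustrates why a peptide mass alone cannot assign the modification to an exact position.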

Since the concept of the proteome was developed, there has been increasing interest in the analysis of post-translational modifications, which often involves large-scale discovery using tandem mass spectrometry.114 Phosphorylation has been the most widely studied modification and large-scale analysis of phosphorylation often discovers hundreds to thousands of phosphorylation sites at a time.127,173,223-226 For example, Gnad et al. (2009)127 found 3,620 phosphorylation sites mapped to 1,118 proteins in yeast. However, there are few large-scale studies on other types of modifications.227 Some recent examples include the discovery of sumoylation228 and acetylation207 in proteins from human cell lines. The number of analyses of methylation in non-histone proteins is also growing.135-137 The number of methylated proteins found in this thesis is comparable to other analyses that used tandem mass spectrometry. In chapter 5, I have described 66 methylated proteins in yeast, whereas Ong et al. (2004)136 and Boisvert et al. (2003)135 described 33 and 200 methylated proteins respectively in human cell lines. Boisvert et al.135 did not, however, locate the exact position of the methylation site. By comparing the number of methylated and phosphorylated proteins found in large-scale studies, and the numbers of kinases (122 in yeast)229 versus methyltransferases (9 in yeast),139,230,231 it is likely that phosphorylation is more prevalent than methylation.

The false discovery rate can be used to compare the FindMod approach with the tandem mass spectrometry approach. The comparison is independent of the number and types of modification in different studies. The false discovery rate (FDR) of using tandem mass spectrometry to search for methylation sites is less than 1%,127,207 whereas the FindMod approach is less accurate and has an FDR of 11%. This suggests the FindMod approach is more suitable for screening lysine or arginine residues that are potentially methylated, whereas tandem mass spectrometry is necessary for unambiguous identification of the methylation site. Whereas most studies analyse one type of modification at a time, whole-proteome analysis of multiple types of post-translational modifications has recently been attempted in Shewanella oneidensis. This experiment, by Gupta et al. (2007),174 characterized 4,037 chemical modifications in 1,673 proteins with 5% FDR, and is a promising approach towards characterizing all modifications in the proteome of a cell in a single analysis pipeline.174

Analysis similar to that of Gupta et al.174 will no doubt advance our understanding of multi-site modification, especially on how different post-translational modifications co-operate or compete with each other.118
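The 11% FDR quoted for the FindMod approach appears to follow from the 89% average true positive rate reported for the filtered results, if that rate is read as the proportion of accepted identifications that are correct. A minimal sketch of that arithmetic is given below; the counts are hypothetical and chosen only to reproduce the percentages.

```python
def false_discovery_rate(true_positives, false_positives):
    """FDR = false positives / all accepted identifications."""
    accepted = true_positives + false_positives
    if accepted == 0:
        raise ValueError("no accepted identifications")
    return false_positives / accepted

# Hypothetical counts: 89 correct and 11 incorrect identifications out of 100 accepted.
fdr = false_discovery_rate(true_positives=89, false_positives=11)
```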

7.5 Exploring the function of arginine and lysine methylation in the proteome of S. cerevisiae

In chapter 5,167 I showed that methylation is likely to be quite common.135-137 Unlike phosphorylation226,232 and lysine acetylation,207 which are involved in a wide variety of biological processes, methylation was found to be associated with specific functions such as translation and ribosome biogenesis and assembly.167 This confirms what is known in the literature.151 In addition, methylation sites were found in specific sequence motifs, such as MK, RxG, RGx, GxxR, and WxxxR, which associate the sequence-based effect with methylation. Motif-based analysis, reviewed in Tan and Linding (2009),233 can be used to discover motifs specific for methyltransferases and methylation binding domains, which can then be validated with experiments. It is unclear whether these motifs are involved in domain-PTM interactions, and therefore contribute to the conditional binding effect. I investigated whether arginine and lysine methylation are involved in protein-protein interactions, but the results for these investigations were omitted from chapter 5 during the revision process of the paper.

Nevertheless, I did find that in protein Tdh3p, residue R11 maps to the interface involved in homodimerization. This is inferred from a homologous structure with 67% sequence identity to Tdh3p (Figure 1, PDB: 1IHX).234 This suggests that dimethylation at R11 may affect Tdh3p’s dimerization, and the overall assembly of the Tdh3p homotetramer.235 Although not directly on an interacting residue, dimethylation on Gcn1p K1446 might also affect its interaction with Gcn20p as the nearby G1444D mutation reduces Gcn1p:Gcn20p complex formation.236 The role of methylation in the protein-protein interactions described above will need to be verified by experiments.
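The sequence motifs listed earlier in this section (MK, RxG, RGx, GxxR, and WxxxR, where x is any residue) can be located in a protein sequence with a simple pattern scan. The sketch below is illustrative only and is not the motif analysis used in chapter 5; the function names and the example sequence are hypothetical.

```python
import re

# Motifs quoted in the text; "x" stands for any amino acid.
MOTIFS = ["MK", "RxG", "RGx", "GxxR", "WxxxR"]

def motif_to_regex(motif):
    """Translate a motif such as 'GxxR' into a regex, with 'x' matching any residue."""
    body = "".join("[A-Z]" if ch == "x" else ch for ch in motif)
    # A lookahead is used so that overlapping occurrences are all reported.
    return re.compile(f"(?=({body}))")

def find_motifs(sequence):
    """Return (motif, 1-based start position, matched residues) for every occurrence."""
    hits = []
    for motif in MOTIFS:
        for match in motif_to_regex(motif).finditer(sequence):
            hits.append((motif, match.start() + 1, match.group(1)))
    return hits

# Hypothetical sequence, not a real protein.
hits = find_motifs("AMKGRGGWAAARQ")
```

Because such motifs are short and degenerate, a scan like this is expected to return many chance matches in real proteomes, so hits would need to be combined with other evidence such as structural environment.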

Figure 1. Arginine methylation could affect homodimerization of glyceraldehyde-3-phosphate dehydrogenase, isozyme 3 (Tdh3p). The yeast Tdh3p protein has a homologue with 67% sequence identity in Palinurus versicolor, commonly called the South China Sea lobster. The protein dimerizes in both organisms, and the dimer for P. versicolor is represented as the green chain and blue chain (PDB:1IHX). The S. cerevisiae R11 residue (yellow) and D187 residue (red) are conserved in P. versicolor, and these two residues form hydrogen bonds in the homologous structure. The R11 residue, which was found to be dimethylated in chapter 5, is present at the interaction interface and therefore methylation at R11 may affect Tdh3p dimerization.

Several types of post-translational modifications can attach mutually exclusively to lysine residues so are effectively in competition. These types of modifications include acetylation, ubiquitination, sumoylation, neddylation, biotinylation, and methylation.118

For example, evidence shows that acetylation can compete with sumoylation and thus regulate nuclear transport of proteins.237 Mono-, di- and tri-methylation on the same lysine are also known to have different functions in histone proteins.238-240 For example, histone H3K36 mono- and di-methylation is sufficient for maintaining cell growth, but H3K36 tri-methylation is needed for proper gene silencing.239 Poly-ubiquitination can act as a signal to target a protein for degradation via the ubiquitin/proteasome pathway,133 but poly-ubiquitination on lysine residues can be blocked by other modifications at the same residue. There are two lines of evidence that suggest that acetylation and methylation can block ubiquitination. Firstly, poly-ubiquitination of p53 by Mdm2 can be blocked by mutually exclusive acetylation of the same lysine residue.241 Without poly-ubiquitination, p53 is not targeted for degradation by the proteasome and the half-life of p53 is extended.241 Secondly, I presented preliminary evidence in chapter 5 that lysine methylation can possibly block ubiquitination. Lysine methylation of proteins may prevent their degradation via the ubiquitin/proteasome pathway and extend their half-lives.167 The above pieces of evidence suggest different modifications on lysine residues are unique, in that they may act as a switch to dynamically control the function of proteins including sub-cellular localization and protein half-life.

7.6 Interaction domains and high intrinsic disorder favours interaction promiscuity

It was hypothesised that overexpression of some proteins can cause deleterious phenotypes if they can bind promiscuously to other proteins. When a protein is overexpressed, mass action encourages it to bind to low affinity interaction partners.217

If a protein has a large number of potential interactions, overexpression of the protein can encourage promiscuous interactions that can disrupt essential cellular processes and lead to deleterious phenotypes. Therefore, several research groups have tested whether overexpression of proteins in complexes can cause deleterious phenotypes.2,184,242,243 They reported that proteins deleterious on overexpression are not preferentially found in protein complexes,184,242,243 and that neither the core nor the attachment units of protein complexes243 are enriched for deleterious proteins.206 This is in contrast to the enrichment of essential genes in protein complexes.4,5
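The mass-action argument at the start of this section can be made concrete with a minimal sketch. It assumes simple 1:1 equilibrium binding and that the free concentration of the overexpressed protein is approximately its total concentration; the dissociation constant and concentrations are hypothetical values in arbitrary units.

```python
def fraction_bound(protein_conc, kd):
    """Fraction of a partner occupied at equilibrium for 1:1 binding: [P] / (Kd + [P])."""
    if protein_conc < 0 or kd <= 0:
        raise ValueError("concentration must be non-negative and Kd positive")
    return protein_conc / (kd + protein_conc)

# Hypothetical low-affinity partner (large Kd relative to the normal concentration).
low_affinity_kd = 100.0
normal = fraction_bound(1.0, low_affinity_kd)           # partner is almost entirely unbound
overexpressed = fraction_bound(100.0, low_affinity_kd)  # half of the partner is now bound
```

Under these assumptions a hundred-fold increase in concentration moves a low-affinity partner from roughly 1% to 50% occupancy, which is the sense in which overexpression can favour otherwise negligible interactions.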

Despite the lack of association between deleterious proteins and protein complexes, we found that deleterious proteins have a significantly higher number of domain-domain interaction partners than neutral proteins.244 This observation is similar to that of Vavouri et al. (2009),217 who reported that deleterious proteins have a significantly higher number of short linear motifs that mediate domain-motif or domain-PTM interactions. It is interesting to note that interaction motifs and modification sites are more likely to be found in intrinsically disordered regions.217 This could explain why the proportion of intrinsic disorder is associated with the toxicity of the protein on overexpression, as shown in chapter 6. Tompa et al. (2009)245 suggest that ordered interaction domains and disordered regions in each interacting protein can co-operate to mediate interactions, which could explain why both interaction domains and disordered regions can be found in proteins deleterious on overexpression. Thus while our study has found that proteins deleterious on overexpression do show a bias in their interaction, they also tend to be low in abundance, or are found in protein complexes involved in cell-cycle regulation.244,246

7.7 Tight regulation of protein abundance prevents promiscuous protein-protein interactions

It has been hypothesised that the expression of highly intrinsically disordered proteins is tightly regulated. This might prevent them from functioning inadvertently at the wrong time or interacting promiscuously with other proteins, causing deleterious function.206,217,218 It was shown in chapter 6 and in Vavouri et al.217 that overexpression of proteins with high intrinsic disorder in yeast tends to cause a deleterious phenotype.

Moreover, Gsponer et al.,206 Edwards et al.,218 and Vavouri et al.217 have all shown that many genomics and proteomics parameters, such as abundance and half-life of transcripts and proteins, are significantly correlated with intrinsic disorder in proteins from lower to higher eukaryotes. With post-translational modifications, phosphorylation tends to be found in disordered regions, and phosphorylation can mediate the formation of protein complexes, or control the co-localization of protein complex subunits.247 Phosphorylation of PEST motifs,206,248 which tend to be found in disordered proteins,206 can target proteins for degradation via the ubiquitin/proteasome pathway.

This suggests intrinsically disordered proteins are tightly regulated from the level of transcripts to protein degradation, and that this regulation can be a means of preventing promiscuous interactions that could lead to deleterious phenotypes within eukaryotic cells.206,217,218,249

In chapter 2, I have shown that core proteins in complexes tend to be under tighter regulation than module or attachment proteins. It was difficult to reconcile how core proteins could be tightly regulated yet form the stable core of protein complexes simultaneously. However, de Lichtenberg et al. (2005)88 suggested that the function of some protein complexes is tightly regulated. The model suggests the function of a protein complex is activated only when the last subunit is added to a complex, whereby the last subunit acts as a ‘key’ to turn on the complex. The last subunit is dynamically expressed and joins with the rest of the complex ‘just-in-time’, and is degraded promptly when the function is no longer needed. This can prevent the complex from unnecessarily disrupting other cellular functions. The ‘just-in-time’ model of protein complex assembly193 could explain the tight regulation of some core proteins (chapter 2), and could also explain why many cell-cycle proteins cause deleterious phenotypes when they are overexpressed (chapter 6). An important area of research will be to understand how ‘just-in-time’ assembly of protein complexes and tight regulation of intrinsically disordered proteins can work together to regulate the dynamics of protein interaction networks.

8 Conclusions

In conclusion, I have investigated different aspects of the dynamics of protein interaction networks using advanced bioinformatics, and the results demonstrate how powerful data mining and analysis can be for interactome research. Firstly, I have analysed how different aspects of protein structure, such as interaction domains and intrinsically disordered regions, can be important for the dynamics of protein interaction networks. Domain-domain interactions are important for the stable formation of protein complexes, and can be used to explain hetero-domain-domain interactions within protein complexes. On the other hand, transient interactions are important for facilitating interactions that connect protein complexes from different parts of the network. Parts of protein structures that participate in transient interactions tend to be intrinsically disordered, and can contain interaction motifs and post-translational modifications which mediate domain-motif and domain-PTM interactions.

Secondly, I have focused on the core, module, and attachment model of complexes since it attempts to capture the dynamic modularity of protein complexes. Evidence from high-throughput interaction data and from analyses of the quaternary structure of protein complexes suggests interactions of different stability are likely to exist.

Interactions involving core proteins tend to be the most stable, followed by module and attachment proteins. Core proteins of some complexes were tightly regulated, which suggests core proteins can act as a key that activates a complex ‘just-in-time’.

Thirdly, I explored how surface accessibility of post-translational modifications influences their role in protein-protein interactions. Modifications involved in protein-protein interactions are likely to be on the surface of proteins, highlighting the importance of structural information for providing insights on the role of post-translational modifications.

Fourthly, I have found preliminary evidence that suggests lysine modifications, such as ubiquitination and methylation, can be present mutually exclusively on lysine and act in different cellular processes. My results suggest that lysine methylation could block the action of ubiquitin ligase, which could protect lysine-methylated proteins from degradation via the ubiquitin/proteasome pathway. Fifthly, evidence from our analysis and from the literature suggests proteins deleterious upon overexpression tend to be low in abundance, high in intrinsic disorder, and have a higher number of potential interaction partners. Finally, the examples above show how abundance-based effects, sequence-based effects, and conditional binding effects can influence the dynamics of protein interaction networks.

9 References

1. Wasinger, VC, Cordwell, SJ, Cerpa-Poljak, A, Yan, JX, Gooley, AA, Wilkins, MR, Duncan, MW, Harris, R, Williams, KL & Humphery-Smith, I: Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis 1995, 16, 1090-4.

2. Gavin, AC, Aloy, P, Grandi, P, Krause, R, Boesche, M, Marzioch, M, Rau, C, Jensen, LJ, Bastuck, S, Dumpelfeld, B, Edelmann, A, Heurtier, MA, Hoffman, V, Hoefert, C, Klein, K, Hudak, M, Michon, AM, Schelder, M, Schirle, M, Remor, M, Rudi, T, Hooper, S, Bauer, A, Bouwmeester, T, Casari, G, Drewes, G, Neubauer, G, Rick, JM, Kuster, B, Bork, P, Russell, RB & Superti-Furga, G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440, 631-6.

3. Krogan, NJ, Cagney, G, Yu, H, Zhong, G, Guo, X, Ignatchenko, A, Li, J, Pu, S, Datta, N, Tikuisis, AP, Punna, T, Peregrin-Alvarez, JM, Shales, M, Zhang, X, Davey, M, Robinson, MD, Paccanaro, A, Bray, JE, Sheung, A, Beattie, B, Richards, DP, Canadien, V, Lalev, A, Mena, F, Wong, P, Starostine, A, Canete, MM, Vlasblom, J, Wu, S, Orsi, C, Collins, SR, Chandran, S, Haw, R, Rilstone, JJ, Gandi, K, Thompson, NJ, Musso, G, St Onge, P, Ghanny, S, Lam, MH, Butland, G, Altaf-Ul, AM, Kanaya, S, Shilatifard, A, O'Shea, E, Weissman, JS, Ingles, CJ, Hughes, TR, Parkinson, J, Gerstein, M, Wodak, SJ, Emili, A & Greenblatt, JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440, 637-43.

4. Hart, GT, Lee, I & Marcotte, ER: A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinformatics 2007, 8, 236.

5. Wang, H, Kakaradov, B, Collins, SR, Karotki, L, Fiedler, D, Shales, M, Shokat, KM, Walther, TC, Krogan, NJ & Koller, D: A complex-based reconstruction of the Saccharomyces cerevisiae interactome. Mol Cell Proteomics 2009, 8, 1361-81.

6. Pu, S, Vlasblom, J, Emili, A, Greenblatt, J & Wodak, SJ: Identifying functional modules in the physical interactome of Saccharomyces cerevisiae. Proteomics 2007, 7, 944-60.

7. Sanchez, C, Lachaize, C, Janody, F, Bellon, B, Roder, L, Euzenat, J, Rechenmann, F & Jacq, B: Grasping at molecular interactions and genetic networks in Drosophila melanogaster using FlyNets, an Internet database. Nucleic Acids Res 1999, 27, 89-94.

8. Keskin, O, Gursoy, A, Ma, B & Nussinov, R: Principles of protein-protein interactions: what are the preferred ways for proteins to interact? Chem Rev 2008, 108, 1225-44.

9. Nooren, IM & Thornton, JM: Diversity of protein-protein interactions. EMBO J 2003, 22, 3486-92.

10. Nooren, IM & Thornton, JM: Structural characterisation and functional significance of transient protein-protein interactions. J Mol Biol 2003, 325, 991-1018.

11. Kundrotas, PJ & Alexov, E: Electrostatic properties of protein-protein complexes. Biophys J 2006, 91, 1724-36.

12. Kim, WK, Henschel, A, Winter, C & Schroeder, M: The many faces of protein-protein interactions: A compendium of interface geometry. PLoS Comput Biol 2006, 2, e124.

13. Pawson, T & Nash, P: Assembly of cell regulatory systems through protein interaction domains. Science 2003, 300, 445-52.

14. Neduva, V, Linding, R, Su-Angrand, I, Stark, A, de Masi, F, Gibson, TJ, Lewis, J, Serrano, L & Russell, RB: Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol 2005, 3, e405.

15. Seet, BT, Dikic, I, Zhou, MM & Pawson, T: Reading protein modifications with interaction domains. Nat Rev Mol Cell Biol 2006, 7, 473-83.

16. Bordner, AJ & Abagyan, R: Statistical analysis and prediction of protein-protein interfaces. Proteins 2005, 60, 353-66.

17. Itzhaki, Z, Akiva, E, Altuvia, Y & Margalit, H: Evolutionary conservation of domain-domain interactions. Genome Biol 2006, 7, R125.

18. Bader, S, Kuhner, S & Gavin, AC: Interaction networks for systems biology. FEBS Lett 2008, 582, 1220-4.

19. Pang, CN, Krycer, JR, Lek, A & Wilkins, MR: Are protein complexes made of cores, modules and attachments? Proteomics 2008, 8, 425-34.

20. Schreiber, G: Kinetic studies of protein-protein interactions. Curr Opin Struct Biol 2002, 12, 41-7.

21. Reichmann, D, Rahat, O, Cohen, M, Neuvirth, H & Schreiber, G: The molecular architecture of protein-protein binding sites. Curr Opin Struct Biol 2007, 17, 67-76.

22. Sackett, DL & Lippoldt, RE: Thermodynamics of reversible monomer-dimer association of tubulin. Biochemistry 1991, 30, 3511-7.

23. Kim, PM, Lu, LJ, Xia, Y & Gerstein, MB: Relating three-dimensional structures to protein networks provides evolutionary insights. Science 2006, 314, 1938-41.

24. Ahmed, A, Dwyer, T, Forster, M, Fu, X, Ho, J, Hong, S-H, Koschutzki, D, Murray, C, Nikolov, NS, Taib, R, Tarassov, A & Xu, K. (2005). GEOMI: GEOmetry for Maximum Insight. In Proceeding of 13th International Symposium on Graph Drawing, Limerick, Ireland, September 2005, pp. 468-479.

25. Liu, Q, Berry, D, Nash, P, Pawson, T, McGlade, CJ & Li, SS: Structural basis for specific binding of the Gads SH3 domain to an RxxK motif-containing SLP-76 peptide: a novel mode of peptide recognition. Mol Cell 2003, 11, 471-81.

26. Puntervoll, P, Linding, R, Gemund, C, Chabanis-Davidson, S, Mattingsdal, M, Cameron, S, Martin, DM, Ausiello, G, Brannetti, B, Costantini, A, Ferre, F, Maselli, V, Via, A, Cesareni, G, Diella, F, Superti-Furga, G, Wyrwicz, L, Ramu, C, McGuigan, C, Gudavalli, R, Letunic, I, Bork, P, Rychlewski, L, Kuster, B, Helmer-Citterich, M, Hunter, WN, Aasland, R & Gibson, TJ: ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 2003, 31, 3625-30.

27. Beltrao, P & Serrano, L: Specificity and evolvability in eukaryotic protein interaction networks. PLoS Comput Biol 2007, 3, e25.

28. Wilkins, MR & Kummerfeld, SK: Sticking together? Falling apart? Exploring the dynamics of the interactome. Trends Biochem Sci 2008, 33, 195-200.

29. Pang, CN, Hayen, A & Wilkins, MR: Surface accessibility of protein post-translational modifications. J Proteome Res 2007, 6, 1833-45.

30. Mok, J, Kim, PM, Lam, HY, Piccirillo, S, Zhou, X, Jeschke, GR, Sheridan, DL, Parker, SA, Desai, V, Jwa, M, Cameroni, E, Niu, H, Good, M, Remenyi, A, Ma, JL, Sheu, YJ, Sassi, HE, Sopko, R, Chan, CS, De Virgilio, C, Hollingsworth, NM, Lim, WA, Stern, DF, Stillman, B, Andrews, BJ, Gerstein, MB, Snyder, M & Turk, BE: Deciphering protein kinase specificity through large-scale analysis of yeast phosphorylation site motifs. Sci Signal 2010, 3, ra12.

31. Szabo, A, Stolz, L & Granzow, R: Surface plasmon resonance and its use in biomolecular interaction analysis (BIA). Curr Opin Struct Biol 1995, 5, 699-705.

32. Huber, W & Mueller, F: Biomolecular interaction analysis in drug discovery using surface plasmon resonance technology. Curr Pharm Des 2006, 12, 3999-4021.

33. Camacho-Carvajal, MM, Wollscheid, B, Aebersold, R, Steimle, V & Schamel, WW: Two-dimensional Blue native/SDS gel electrophoresis of multi-protein complexes from whole cellular lysates: a proteomics approach. Mol Cell Proteomics 2004, 3, 176-82.

34. Swamy, M, Siegers, GM, Minguet, S, Wollscheid, B & Schamel, WW:

Blue native polyacrylamide gel electrophoresis (BN-PAGE) for the

identification and analysis of multiprotein complexes. Sci STKE

2006, 2006, pl4.

35. Schägger, H & von Jagow, G: Blue native electrophoresis for

isolation of membrane protein complexes in enzymatically active

form. Anal Biochem 1991, 199, 223-231.

36. Fields, S & Song, O: A novel genetic system to detect protein-protein

interactions. Nature 1989, 340, 245-6.

37. Braun, P, Tasan, M, Dreze, M, Barrios-Rodiles, M, Lemmens, I, Yu, H,

Sahalie, JM, Murray, RR, Roncari, L, de Smet, AS, Venkatesan, K, Rual,

JF, Vandenhaute, J, Cusick, ME, Pawson, T, Hill, DE, Tavernier, J,

Wrana, JL, Roth, FP & Vidal, M: An experimentally derived

confidence score for binary protein-protein interactions. Nat

Methods 2009, 6, 91-7.

38. Drewes, G & Bouwmeester, T: Global approaches to protein-protein

interactions. Curr Opin Cell Biol 2003, 15, 199-205.

39. von Mering, C, Krause, R, Snel, B, Cornell, M, Oliver, SG, Fields, S &

Bork, P: Comparative assessment of large-scale data sets of

protein-protein interactions. Nature 2002, 417, 399-403.

40. Krycer, JR, Pang, CN & Wilkins, MR: High throughput protein-protein

interaction data: clues for the architecture of protein complexes.

Proteome Sci 2008, 6, 32.

41. Tarassov, K, Messier, V, Landry, CR, Radinovic, S, Serna Molina, MM,

Shames, I, Malitskaya, Y, Vogel, J, Bussey, H & Michnick, SW: An in

vivo map of the yeast protein interactome. Science 2008, 320, 1465-

70.

42. Ozawa, T, Kaihara, A, Sato, M, Tachihara, K & Umezawa, Y: Split

luciferase as an optical probe for detecting protein-protein

interactions in mammalian cells based on protein splicing. Anal

Chem 2001, 73, 2516-21.

43. Paulmurugan, R, Umezawa, Y & Gambhir, SS: Noninvasive imaging of

protein-protein interactions in living subjects by using reporter

protein complementation and reconstitution strategies. Proc Natl

Acad Sci U S A 2002, 99, 15608-13.

44. Hu, CD, Chinenov, Y & Kerppola, TK: Visualization of interactions

among bZIP and Rel family proteins in living cells using bimolecular

fluorescence complementation. Mol Cell 2002, 9, 789-98.

45. Hu, CD & Kerppola, TK: Simultaneous visualization of multiple

protein interactions in living cells using multicolor fluorescence

complementation analysis. Nat Biotechnol 2003, 21, 539-45.

46. Stagljar, I, Korostensky, C, Johnsson, N & te Heesen, S: A genetic

system based on split-ubiquitin for the analysis of interactions

between membrane proteins in vivo. Proc Natl Acad Sci U S A 1998,

95, 5187-92.

47. Ho, Y, Gruhler, A, Heilbut, A, Bader, GD, Moore, L, Adams, SL, Millar, A,

Taylor, P, Bennett, K, Boutilier, K, Yang, L, Wolting, C, Donaldson, I,

Schandorff, S, Shewnarane, J, Vo, M, Taggart, J, Goudreault, M,

Muskat, B, Alfarano, C, Dewar, D, Lin, Z, Michalickova, K, Willems, AR,

Sassi, H, Nielsen, PA, Rasmussen, KJ, Andersen, JR, Johansen, LE,

Hansen, LH, Jespersen, H, Podtelejnikov, A, Nielsen, E, Crawford, J,

Poulsen, V, Sorensen, BD, Matthiesen, J, Hendrickson, RC, Gleeson, F,

Pawson, T, Moran, MF, Durocher, D, Mann, M, Hogue, CW, Figeys, D &

Tyers, M: Systematic identification of protein complexes in

Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415,

180-3.

48. Hopp, TP, Gallis, B & Prickett, KS: A short polypeptide marker

sequence useful for recombinant protein identification and

purification. BioTechnology 1988, 6, 1204-10.

49. Einhauer, A & Jungbauer, A: The FLAG peptide, a versatile fusion tag

for the purification of recombinant proteins. J Biochem Biophys

Methods 2001, 49, 455-65.

50. Rigaut, G, Shevchenko, A, Rutz, B, Wilm, M, Mann, M & Seraphin, B: A

generic protein purification method for protein complex

characterization and proteome exploration. Nat Biotechnol 1999, 17,

1030-2.

51. Puig, O, Caspary, F, Rigaut, G, Rutz, B, Bouveret, E, Bragado-Nilsson,

E, Wilm, M & Seraphin, B: The tandem affinity purification (TAP)

method: a general procedure of protein complex purification.

Methods 2001, 24, 218-29.

52. Gavin, AC, Bosche, M, Krause, R, Grandi, P, Marzioch, M, Bauer, A,

Schultz, J, Rick, JM, Michon, AM, Cruciat, CM, Remor, M, Hofert, C,

Schelder, M, Brajenovic, M, Ruffner, H, Merino, A, Klein, K, Hudak, M,

Dickson, D, Rudi, T, Gnau, V, Bauch, A, Bastuck, S, Huhse, B, Leutwein,

C, Heurtier, MA, Copley, RR, Edelmann, A, Querfurth, E, Rybin, V,

Drewes, G, Raida, M, Bouwmeester, T, Bork, P, Seraphin, B, Kuster, B,

Neubauer, G & Superti-Furga, G: Functional organization of the yeast

proteome by systematic analysis of protein complexes. Nature 2002,

415, 141-7.

53. Bader, GD & Hogue, CW: Analyzing yeast protein-protein interaction

data obtained from different sources. Nat Biotechnol 2002, 20, 991-7.

54. Yu, H, Braun, P, Yildirim, MA, Lemmens, I, Venkatesan, K, Sahalie, J,

Hirozane-Kishikawa, T, Gebreab, F, Li, N, Simonis, N, Hao, T, Rual, JF,

Dricot, A, Vazquez, A, Murray, RR, Simon, C, Tardivo, L, Tam, S,

Svrzikapa, N, Fan, C, de Smet, AS, Motyl, A, Hudson, ME, Park, J, Xin,

X, Cusick, ME, Moore, T, Boone, C, Snyder, M, Roth, FP, Barabasi, AL,

Tavernier, J, Hill, DE & Vidal, M: High-quality binary protein

interaction map of the yeast interactome network. Science 2008,

322, 104-10.

55. Levy, ED, Landry, CR & Michnick, SW: Cell signaling. Signaling

through cooperation. Science 2010, 328, 983-4.

56. Breitkreutz, A, Choi, H, Sharom, JR, Boucher, L, Neduva, V, Larsen, B,

Lin, ZY, Breitkreutz, BJ, Stark, C, Liu, G, Ahn, J, Dewar-Darch, D,

Reguly, T, Tang, X, Almeida, R, Qin, ZS, Pawson, T, Gingras, AC,

Nesvizhskii, AI & Tyers, M: A global protein kinase and phosphatase

interaction network in yeast. Science 2010, 328, 1043-6.

57. Rual, JF, Venkatesan, K, Hao, T, Hirozane-Kishikawa, T, Dricot, A, Li, N,

Berriz, GF, Gibbons, FD, Dreze, M, Ayivi-Guedehoussou, N, Klitgord, N,

Simon, C, Boxem, M, Milstein, S, Rosenberg, J, Goldberg, DS, Zhang,

LV, Wong, SL, Franklin, G, Li, S, Albala, JS, Lim, J, Fraughton, C,

Llamosas, E, Cevik, S, Bex, C, Lamesch, P, Sikorski, RS, Vandenhaute,

J, Zoghbi, HY, Smolyar, A, Bosak, S, Sequerra, R, Doucette-Stamm, L,

Cusick, ME, Hill, DE, Roth, FP & Vidal, M: Towards a proteome-scale

map of the human protein-protein interaction network. Nature 2005,

437, 1173-8.

58. Stelzl, U, Worm, U, Lalowski, M, Haenig, C, Brembeck, FH, Goehler, H,

Stroedicke, M, Zenkner, M, Schoenherr, A, Koeppen, S, Timm, J,

Mintzlaff, S, Abraham, C, Bock, N, Kietzmann, S, Goedde, A, Toksoz, E,

Droege, A, Krobitsch, S, Korn, B, Birchmeier, W, Lehrach, H & Wanker,

EE: A human protein-protein interaction network: a resource for

annotating the proteome. Cell 2005, 122, 957-68.

59. Jensen, LJ & Bork, P: Biochemistry. Not comparable, but

complementary. Science 2008, 322, 56-7.

60. Ito, T, Chiba, T, Ozawa, R, Yoshida, M, Hattori, M & Sakaki, Y: A

comprehensive two-hybrid analysis to explore the yeast protein

interactome. Proc Natl Acad Sci U S A 2001, 98, 4569-74.

61. Uetz, P, Giot, L, Cagney, G, Mansfield, TA, Judson, RS, Knight, JR,

Lockshon, D, Narayan, V, Srinivasan, M, Pochart, P, Qureshi-Emili, A, Li,

Y, Godwin, B, Conover, D, Kalbfleisch, T, Vijayadamodar, G, Yang, M,

Johnston, M, Fields, S & Rothberg, JM: A comprehensive analysis of

protein-protein interactions in Saccharomyces cerevisiae. Nature

2000, 403, 623-7.

62. Collins, SR, Kemmeren, P, Zhao, XC, Greenblatt, JF, Spencer, F,

Holstege, FC, Weissman, JS & Krogan, NJ: Toward a comprehensive

atlas of the physical interactome of Saccharomyces cerevisiae. Mol

Cell Proteomics 2007, 6, 439-50.

63. Han, JD, Bertin, N, Hao, T, Goldberg, DS, Berriz, GF, Zhang, LV, Dupuy,

D, Walhout, AJ, Cusick, ME, Roth, FP & Vidal, M: Evidence for

dynamically organized modularity in the yeast protein-protein

interaction network. Nature 2004, 430, 88-93.

64. Bertin, N, Simonis, N, Dupuy, D, Cusick, ME, Han, JD, Fraser, HB, Roth,

FP & Vidal, M: Confirmation of organized modularity in the yeast

interactome. PLoS Biol 2007, 5, e153.

65. Shannon, P, Markiel, A, Ozier, O, Baliga, NS, Wang, JT, Ramage, D,

Amin, N, Schwikowski, B & Ideker, T: Cytoscape: a software

environment for integrated models of biomolecular interaction

networks. Genome Res 2003, 13, 2498-504.

66. Hu, Z, Mellor, J, Wu, J, Yamada, T, Holloway, D & Delisi, C: VisANT:

data-integrating visual framework for biological networks and

modules. Nucleic Acids Res 2005, 33, W352-7.

67. Ho, E, Webber, R & Wilkins, MR: Interactive three-dimensional

visualization and contextual analysis of protein interaction

networks. J Proteome Res 2008, 7, 104-12.

68. Widjaja, YY, Pang, CN, Li, SS, Wilkins, MR & Lambert, TD: The

Interactorium: visualising proteins, complexes and interaction

networks in a virtual 3D cell. Proteomics 2009, 9, 5309-15.

69. Barabasi, AL & Oltvai, ZN: Network biology: understanding the cell's

functional organization. Nat Rev Genet 2004, 5, 101-13.

70. Jeong, H, Mason, SP, Barabasi, AL & Oltvai, ZN: Lethality and

centrality in protein networks. Nature 2001, 411, 41-2.

71. Hart, GT, Ramani, AK & Marcotte, EM: How complete are current

yeast and human protein-interaction networks? Genome Biol 2006,

7, 120.

72. Goll, J & Uetz, P: The elusive yeast interactome. Genome Biol 2006,

7, 223.

73. Kuhner, S, van Noort, V, Betts, MJ, Leo-Macias, A, Batisse, C, Rode, M,

Yamada, T, Maier, T, Bader, S, Beltran-Alvarez, P, Castano-Diez, D,

Chen, WH, Devos, D, Guell, M, Norambuena, T, Racke, I, Rybin, V,

Schmidt, A, Yus, E, Aebersold, R, Herrmann, R, Bottcher, B, Frangakis,

AS, Russell, RB, Serrano, L, Bork, P & Gavin, AC: Proteome

organization in a genome-reduced bacterium. Science 2009, 326,

1235-40.

74. Cusick, ME, Yu, H, Smolyar, A, Venkatesan, K, Carvunis, AR, Simonis,

N, Rual, JF, Borick, H, Braun, P, Dreze, M, Vandenhaute, J, Galli, M,

Yazaki, J, Hill, DE, Ecker, JR, Roth, FP & Vidal, M: Literature-curated

protein interaction datasets. Nat Methods 2009, 6, 39-46.

75. Reguly, T, Breitkreutz, A, Boucher, L, Breitkreutz, BJ, Hon, GC, Myers,

CL, Parsons, A, Friesen, H, Oughtred, R, Tong, A, Stark, C, Ho, Y,

Botstein, D, Andrews, B, Boone, C, Troyanskaya, OG, Ideker, T, Dolinski,

K, Batada, NN & Tyers, M: Comprehensive curation and analysis of

global interaction networks in Saccharomyces cerevisiae. J Biol

2006, 5, 11.

76. Batada, NN, Reguly, T, Breitkreutz, A, Boucher, L, Breitkreutz, BJ, Hurst,

LD & Tyers, M: Stratus not altocumulus: a new view of the yeast

protein interaction network. PLoS Biol 2006, 4, e317.

77. Fromont-Racine, M, Rain, JC & Legrain, P: Toward a functional

analysis of the yeast genome through exhaustive two-hybrid

screens. Nat Genet 1997, 16, 277-82.

78. Fromont-Racine, M, Mayes, AE, Brunet-Simon, A, Rain, JC, Colley, A,

Dix, I, Decourty, L, Joly, N, Ricard, F, Beggs, JD & Legrain, P: Genome-

wide protein interaction screens reveal functional networks

involving Sm-like proteins. Yeast 2000, 17, 95-110.

79. Mewes, HW, Frishman, D, Guldener, U, Mannhaupt, G, Mayer, K,

Mokrejs, M, Morgenstern, B, Munsterkotter, M, Rudd, S & Weil, B: MIPS:

a database for genomes and protein sequences. Nucleic Acids Res

2002, 30, 31-4.

80. Hakes, L, Pinney, JW, Robertson, DL & Lovell, SC: Protein-protein

interaction networks and biology-what's the connection? Nat

Biotechnol 2008, 26, 69-72.

81. Orchard, S, Salwinski, L, Kerrien, S, Montecchi-Palazzi, L, Oesterheld,

M, Stumpflen, V, Ceol, A, Chatr-aryamontri, A, Armstrong, J, Woollard,

P, Salama, JJ, Moore, S, Wojcik, J, Bader, GD, Vidal, M, Cusick, ME,

Gerstein, M, Gavin, AC, Superti-Furga, G, Greenblatt, J, Bader, J, Uetz,

P, Tyers, M, Legrain, P, Fields, S, Mulder, N, Gilson, M, Niepmann, M,

Burgoon, L, De Las Rivas, J, Prieto, C, Perreau, VM, Hogue, C, Mewes,

HW, Apweiler, R, Xenarios, I, Eisenberg, D, Cesareni, G & Hermjakob,

H: The minimum information required for reporting a molecular

interaction experiment (MIMIx). Nat Biotechnol 2007, 25, 894-8.

82. Batada, NN, Reguly, T, Breitkreutz, A, Boucher, L, Breitkreutz, BJ, Hurst,

LD & Tyers, M: Still stratus not altocumulus: further evidence

against the date/party hub distinction. PLoS Biol 2007, 5, e154.

83. Enright, AJ, Van Dongen, S & Ouzounis, CA: An efficient algorithm for

large-scale detection of protein families. Nucleic Acids Res 2002, 30,

1575-84.

84. Finn, RD, Tate, J, Mistry, J, Coggill, PC, Sammut, SJ, Hotz, HR, Ceric,

G, Forslund, K, Eddy, SR, Sonnhammer, EL & Bateman, A: The Pfam

protein families database. Nucleic Acids Res 2008, 36, D281-8.

85. Finn, RD, Marshall, M & Bateman, A: iPfam: visualization of protein-

protein interactions in PDB at domain and amino acid resolutions.

Bioinformatics 2005, 21, 410-2.

86. Merico, D, Gfeller, D & Bader, GD: How to visually interpret biological

data using networks. Nat Biotechnol 2009, 27, 921-4.

87. Luscombe, NM, Babu, MM, Yu, H, Snyder, M, Teichmann, SA &

Gerstein, M: Genomic analysis of regulatory network dynamics

reveals large topological changes. Nature 2004, 431, 308-12.

88. de Lichtenberg, U, Jensen, LJ, Brunak, S & Bork, P: Dynamic complex

formation during the yeast cell cycle. Science 2005, 307, 724-7.

89. Spellman, PT, Sherlock, G, Zhang, MQ, Iyer, VR, Anders, K, Eisen, MB,

Brown, PO, Botstein, D & Futcher, B: Comprehensive identification of

cell cycle-regulated genes of the yeast Saccharomyces cerevisiae

by microarray hybridization. Mol Biol Cell 1998, 9, 3273-97.

90. Huh, WK, Falvo, JV, Gerke, LC, Carroll, AS, Howson, RW, Weissman,

JS & O'Shea, EK: Global analysis of protein localization in budding

yeast. Nature 2003, 425, 686-91.

91. Ghaemmaghami, S, Huh, WK, Bower, K, Howson, RW, Belle, A,

Dephoure, N, O'Shea, EK & Weissman, JS: Global analysis of protein

expression in yeast. Nature 2003, 425, 737-41.

92. Picotti, P, Bodenmiller, B, Mueller, LN, Domon, B & Aebersold, R: Full

dynamic range proteome analysis of S. cerevisiae by targeted

proteomics. Cell 2009, 138, 795-806.

93. Belle, A, Tanay, A, Bitincka, L, Shamir, R & O'Shea, EK: Quantification

of protein half-lives in the budding yeast proteome. Proc Natl Acad

Sci U S A 2006, 103, 13004-9.

94. Ingolia, NT, Ghaemmaghami, S, Newman, JR & Weissman, JS:

Genome-wide analysis in vivo of translation with nucleotide

resolution using ribosome profiling. Science 2009, 324, 218-23.

95. Holstege, FC, Jennings, EG, Wyrick, JJ, Lee, TI, Hengartner, CJ, Green,

MR, Golub, TR, Lander, ES & Young, RA: Dissecting the regulatory

circuitry of a eukaryotic genome. Cell 1998, 95, 717-28.

96. Newman, JR, Ghaemmaghami, S, Ihmels, J, Breslow, DK, Noble, M,

DeRisi, JL & Weissman, JS: Single-cell proteomic analysis of S.

cerevisiae reveals the architecture of biological noise. Nature 2006,

441, 840-6.

97. Cohen, AA, Geva-Zatorsky, N, Eden, E, Frenkel-Morgenstern, M,

Issaeva, I, Sigal, A, Milo, R, Cohen-Saidon, C, Liron, Y, Kam, Z, Cohen,

L, Danon, T, Perzov, N & Alon, U: Dynamic proteomics of individual

cancer cells in response to a drug. Science 2008, 322, 1511-6.

98. Lu, P, Vogel, C, Wang, R, Yao, X & Marcotte, EM: Absolute protein

expression profiling estimates the relative contributions of

transcriptional and translational regulation. Nat Biotechnol 2007, 25,

117-24.

99. Allen, TE, Herrgard, MJ, Liu, M, Qiu, Y, Glasner, JD, Blattner, FR &

Palsson, BO: Genome-scale analysis of the uses of the Escherichia

coli genome: model-driven analysis of heterogeneous data sets. J

Bacteriol 2003, 185, 6392-9.

100. Corbin, RW, Paliy, O, Yang, F, Shabanowitz, J, Platt, M, Lyons, CE, Jr.,

Root, K, McAuliffe, J, Jordan, MI, Kustu, S, Soupene, E & Hunt, DF:

Toward a protein profile of Escherichia coli: comparison to its

transcription profile. Proc Natl Acad Sci U S A 2003, 100, 9232-7.

101. Covert, MW, Knight, EM, Reed, JL, Herrgard, MJ & Palsson, BO:

Integrating high-throughput and computational data elucidates

bacterial networks. Nature 2004, 429, 92-6.

102. Wang, Y, Liu, CL, Storey, JD, Tibshirani, RJ, Herschlag, D & Brown, PO:

Precision and functional specificity in mRNA decay. Proc Natl Acad

Sci U S A 2002, 99, 5860-5.

103. Velculescu, VE, Zhang, L, Zhou, W, Vogelstein, J, Basrai, MA, Bassett,

DE, Jr., Hieter, P, Vogelstein, B & Kinzler, KW: Characterization of the

yeast transcriptome. Cell 1997, 88, 243-51.

104. Chen, EI, Hewel, J, Felding-Habermann, B & Yates, JR, 3rd: Large

scale protein profiling by combination of protein fractionation and

multidimensional protein identification technology (MudPIT). Mol

Cell Proteomics 2006, 5, 53-6.

105. de Godoy, LM, Olsen, JV, Cox, J, Nielsen, ML, Hubner, NC, Frohlich, F,

Walther, TC & Mann, M: Comprehensive mass-spectrometry-based

proteome quantification of haploid versus diploid yeast. Nature

2008, 455, 1251-4.

106. de Godoy, LM, Olsen, JV, de Souza, GA, Li, G, Mortensen, P & Mann,

M: Status of complete proteome analysis by mass spectrometry:

SILAC labeled yeast as a model system. Genome Biol 2006, 7, R50.

107. Mallick, P, Schirle, M, Chen, SS, Flory, MR, Lee, H, Martin, D, Ranish, J,

Raught, B, Schmitt, R, Werner, T, Kuster, B & Aebersold, R:

Computational prediction of proteotypic peptides for quantitative

proteomics. Nat Biotechnol 2007, 25, 125-31.

108. Desiere, F, Deutsch, EW, King, NL, Nesvizhskii, AI, Mallick, P, Eng, J,

Chen, S, Eddes, J, Loevenich, SN & Aebersold, R: The PeptideAtlas

project. Nucleic Acids Res 2006, 34, D655-8.

109. Deutsch, EW: The PeptideAtlas Project. Methods Mol Biol 2010, 604,

285-96.

110. Kocher, T, Pichler, P, Schutzbier, M, Stingl, C, Kaul, A, Teucher, N,

Hasenfuss, G, Penninger, JM & Mechtler, K: High precision

quantitative proteomics using iTRAQ on an LTQ Orbitrap: a new

mass spectrometric method combining the benefits of all. J

Proteome Res 2009, 8, 4743-52.

111. Doherty, MK, Hammond, DE, Clague, MJ, Gaskell, SJ & Beynon, RJ:

Turnover of the human proteome: determination of protein

intracellular stability by dynamic SILAC. J Proteome Res 2009, 8,

104-12.

112. Arava, Y, Wang, Y, Storey, JD, Liu, CL, Brown, PO & Herschlag, D:

Genome-wide analysis of mRNA translation profiles in

Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 2003, 100, 3889-

94.

113. Beyer, A, Hollunder, J, Nasheuer, HP & Wilhelm, T: Post-

transcriptional expression regulation in the yeast Saccharomyces

cerevisiae on a genomic scale. Mol Cell Proteomics 2004, 3, 1083-92.

114. Wilkins, MR: Hares and tortoises: the high- versus low-throughput

proteomic race. Electrophoresis 2009, 30 Suppl 1, S150-5.

115. Malmstrom, J, Beck, M, Schmidt, A, Lange, V, Deutsch, EW &

Aebersold, R: Proteome-wide cellular protein concentrations of the

human pathogen Leptospira interrogans. Nature 2009, 460, 762-5.

116. Lange, V, Picotti, P, Domon, B & Aebersold, R: Selected reaction

monitoring for quantitative proteomics: a tutorial. Mol Syst Biol 2008,

4, 222.

117. Silva, JC, Denny, R, Dorschel, C, Gorenstein, MV, Li, GZ, Richardson, K,

Wall, D & Geromanos, SJ: Simultaneous qualitative and quantitative

analysis of the Escherichia coli proteome: a sweet tale. Mol Cell

Proteomics 2006, 5, 589-607.

118. Yang, XJ: Multisite protein modification and intramolecular

signaling. Oncogene 2005, 24, 1653-62.

119. Hoffman, MD, Sniatynski, MJ & Kast, J: Current approaches for global

post-translational modification discovery and mass spectrometric

analysis. Anal Chim Acta 2008, 627, 50-61.

120. Gamble, RL, Coonfield, ML & Schaller, GE: Histidine kinase activity of

the ETR1 ethylene receptor from Arabidopsis. Proc Natl Acad Sci U S

A 1998, 95, 7825-9.

121. Vosseller, K, Sakabe, K, Wells, L & Hart, GW: Diverse regulation of

protein function by O-GlcNAc: a nuclear and cytoplasmic

carbohydrate post-translational modification. Curr Opin Chem Biol

2002, 6, 851-7.

122. Strahl, BD & Allis, CD: The language of covalent histone

modifications. Nature 2000, 403, 41-5.

123. Lavin, MF & Gueven, N: The complexity of p53 stabilization and

activation. Cell Death Differ 2006, 13, 941-50.

124. Buratowski, S: Progression through the RNA polymerase II CTD

cycle. Mol Cell 2009, 36, 541-6.

125. Egloff, S & Murphy, S: Cracking the RNA polymerase II CTD code.

Trends Genet 2008, 24, 280-8.

126. Sims, RJ, 3rd, Nishioka, K & Reinberg, D: Histone lysine methylation:

a signature for chromatin function. Trends Genet 2003, 19, 629-39.

127. Gnad, F, de Godoy, LM, Cox, J, Neuhauser, N, Ren, S, Olsen, JV &

Mann, M: High-accuracy identification and bioinformatic analysis of

in vivo protein phosphorylation sites in yeast. Proteomics 2009, 9,

4642-4652.

128. Krebs, EG & Fischer, EH: The phosphorylase b to a converting

enzyme of rabbit skeletal muscle. Biochim Biophys Acta 1956, 20,

150-7.

129. Krebs, EG, Kent, AB & Fischer, EH: The muscle phosphorylase b

kinase reaction. J Biol Chem 1958, 231, 73-83.

130. Krebs, EG & Stull, JT: Protein phosphorylation and metabolic

control. Ciba Found Symp 1975, 355-67.

131. Lu, JY, Lin, YY, Qian, J, Tao, SC, Zhu, J, Pickart, C & Zhu, H:

Functional dissection of a HECT ubiquitin E3 ligase. Mol Cell

Proteomics 2008, 7, 35-45.

132. Smith, WA, Schurter, BT, Wong-Staal, F & David, M: Arginine

methylation of RNA helicase A determines its subcellular

localization. J Biol Chem 2004, 279, 22795-8.

133. Ciechanover, A: Proteolysis: from the lysosome to ubiquitin and the

proteasome. Nat Rev Mol Cell Biol 2005, 6, 79-87.

134. Ostareck-Lederer, A, Ostareck, DH, Rucknagel, KP, Schierhorn, A,

Moritz, B, Huttelmaier, S, Flach, N, Handoko, L & Wahle, E: Asymmetric

arginine dimethylation of heterogeneous nuclear ribonucleoprotein

K by protein-arginine methyltransferase 1 inhibits its interaction

with c-Src. J Biol Chem 2006, 281, 11115-25.

135. Boisvert, FM, Cote, J, Boulanger, MC & Richard, S: A proteomic

analysis of arginine-methylated protein complexes. Mol Cell

Proteomics 2003, 2, 1319-30.

136. Ong, SE, Mittler, G & Mann, M: Identifying and quantifying in vivo

methylation sites by heavy methyl SILAC. Nat Methods 2004, 1, 119-

26.

137. Iwabata, H, Yoshida, M & Komatsu, Y: Proteomic analysis of organ-

specific post-translational lysine-acetylation and -methylation in

mice by use of anti-acetyllysine and -methyllysine mouse

monoclonal antibodies. Proteomics 2005, 5, 4653-64.

138. Lee, SW, Berger, SJ, Martinovic, S, Pasa-Tolic, L, Anderson, GA, Shen,

Y, Zhao, R & Smith, RD: Direct mass spectrometric analysis of intact

proteins of the yeast large ribosomal subunit using capillary

LC/FTICR. Proc Natl Acad Sci U S A 2002, 99, 5942-7.

139. Porras-Yakushi, TR, Whitelegge, JP & Clarke, S: A novel SET domain

methyltransferase in yeast: Rkm2-dependent trimethylation of

ribosomal protein L12ab at lysine 10. J Biol Chem 2006, 281, 35835-

45.

140. Porras-Yakushi, TR, Whitelegge, JP & Clarke, S: Yeast

ribosomal/cytochrome c SET domain methyltransferase subfamily:

identification of Rpl23ab methylation sites and recognition motifs. J

Biol Chem 2007, 282, 12368-76.

141. Lhoest, J, Lobet, Y, Costers, E & Colson, C: Methylated proteins and

amino acids in the ribosomes of Saccharomyces cerevisiae. Eur J

Biochem 1984, 141, 585-90.

142. Itoh, T & Wittmann-Liebold, B: The primary structure of protein 44

from the large subunit of yeast ribosomes. FEBS Lett 1978, 96, 399-

402.

143. Sadaie, M, Shinmyozu, K & Nakayama, J: A conserved SET domain

methyltransferase, Set11, modifies ribosomal protein Rpl12 in

fission yeast. J Biol Chem 2008, 283, 7185-95.

144. Carroll, AJ, Heazlewood, JL, Ito, J & Millar, AH: Analysis of the

Arabidopsis cytosolic ribosome proteome provides detailed

insights into its components and their post-translational

modification. Mol Cell Proteomics 2008, 7, 347-69.

145. Goldenberg, CJ & Eliceiri, GL: Methylation of ribosomal proteins in

HeLa cells. Biochim Biophys Acta 1977, 479, 220-34.

146. Shin, HS, Jang, CY, Kim, HD, Kim, TS, Kim, S & Kim, J: Arginine

methylation of ribosomal protein S3 affects ribosome assembly.

Biochem Biophys Res Commun 2009, 385, 273-8.

147. Scolnik, PA & Eliceiri, GL: Methylation sites in HeLa cell ribosomal

proteins. Eur J Biochem 1979, 101, 93-101.

148. Swiercz, R, Person, MD & Bedford, MT: Ribosomal protein S2 is a

substrate for mammalian PRMT3 (protein arginine

methyltransferase 3). Biochem J 2005, 386, 85-91.

149. Bedford, MT & Clarke, SG: Protein arginine methylation in mammals:

who, what, and why. Mol Cell 2009, 33, 1-13.

150. Wooderchak, WL, Zang, T, Zhou, ZS, Acuna, M, Tahara, SM & Hevel,

JM: Substrate profiling of PRMT1 reveals amino acid sequences

that extend beyond the "RGG" paradigm. Biochemistry 2008, 47,

9456-66.

151. Polevoda, B & Sherman, F: Methylation of proteins involved in

translation. Mol Microbiol 2007, 65, 590-606.

152. Xu, C, Henry, PA, Setya, A & Henry, MF: In vivo analysis of nucleolar

proteins modified by the yeast arginine methyltransferase

Hmt1/Rmt1p. RNA 2003, 9, 746-59.

153. Gary, JD & Clarke, S: RNA and protein interactions modulated by

protein arginine methylation. Prog Nucleic Acid Res Mol Biol 1998, 61,

65-131.

154. McBride, AE & Silver, PA: State of the arg: protein methylation at

arginine comes of age. Cell 2001, 106, 5-8.

155. Cuthbert, GL, Daujat, S, Snowden, AW, Erdjument-Bromage, H,

Hagiwara, T, Yamada, M, Schneider, R, Gregory, PD, Tempst, P,

Bannister, AJ & Kouzarides, T: Histone deimination antagonizes

arginine methylation. Cell 2004, 118, 545-53.

156. Wang, Y, Wysocka, J, Sayegh, J, Lee, YH, Perlin, JR, Leonelli, L,

Sonbuchner, LS, McDonald, CH, Cook, RG, Dou, Y, Roeder, RG, Clarke,

S, Stallcup, MR, Allis, CD & Coonrod, SA: Human PAD4 regulates

histone arginine methylation levels via demethylimination. Science

2004, 306, 279-83.

157. Xie, H & Clarke, S: Protein phosphatase 2A is reversibly modified by

methyl esterification at its C-terminal leucine residue in bovine

brain. J Biol Chem 1994, 269, 1981-4.

158. Kalhor, HR, Luk, K, Ramos, A, Zobel-Thropp, P & Clarke, S: Protein

phosphatase methyltransferase 1 (Ppm1p) is the sole activity

responsible for modification of the major forms of protein

phosphatase 2A in yeast. Arch Biochem Biophys 2001, 395, 239-45.

159. Polevoda, B, Span, L & Sherman, F: The yeast translation release

factors Mrf1p and Sup45p (eRF1) are methylated, respectively, by

the methyltransferases Mtq1p and Mtq2p. J Biol Chem 2006, 281,

2562-71.

160. Gygi, SP, Corthals, GL, Zhang, Y, Rochon, Y & Aebersold, R:

Evaluation of two-dimensional gel electrophoresis-based proteome

analysis technology. Proc Natl Acad Sci U S A 2000, 97, 9390-5.

161. Siebel, CW & Guthrie, C: The essential yeast RNA binding protein

Npl3p is methylated. Proc Natl Acad Sci U S A 1996, 93, 13641-6.

162. Kinzel, V & Mueller, GC: Phosphorylation of surface proteins of HeLa

cells using an exogenous protein kinase and (gamma-32P)ATP.

Biochim Biophys Acta 1973, 322, 337-51.

163. Cavallius, J, Zoll, W, Chakraburtty, K & Merrick, WC: Characterization

of yeast EF-1 alpha: non-conservation of post-translational

modifications. Biochim Biophys Acta 1993, 1163, 75-80.

164. Ptacek, J, Devgan, G, Michaud, G, Zhu, H, Zhu, X, Fasolo, J, Guo, H,

Jona, G, Breitkreutz, A, Sopko, R, McCartney, RR, Schmidt, MC,

Rachidi, N, Lee, SJ, Mah, AS, Meng, L, Stark, MJ, Stern, DF, De Virgilio,

C, Tyers, M, Andrews, B, Gerstein, M, Schweitzer, B, Predki, PF &

Snyder, M: Global analysis of protein phosphorylation in yeast.

Nature 2005, 438, 679-84.

165. Mann, M & Jensen, ON: Proteomic analysis of post-translational

modifications. Nat Biotechnol 2003, 21, 255-61.

166. Wilkins, MR, Gasteiger, E, Gooley, AA, Herbert, BR, Molloy, MP, Binz,

PA, Ou, K, Sanchez, JC, Bairoch, A, Williams, KL & Hochstrasser, DF:

High-throughput mass spectrometric discovery of protein post-

translational modifications. J Mol Biol 1999, 289, 645-57.

167. Pang, CN, Gasteiger, E & Wilkins, MR: Identification of arginine- and

lysine-methylation in the proteome of Saccharomyces cerevisiae

and its functional implications. BMC Genomics 2010, 11, 92.

168. Nuhse, TS, Stensballe, A, Jensen, ON & Peck, SC: Large-scale

analysis of in vivo phosphorylated membrane proteins by

immobilized metal ion affinity chromatography and mass

spectrometry. Mol Cell Proteomics 2003, 2, 1234-43.

169. Pinkse, MW, Uitto, PM, Hilhorst, MJ, Ooms, B & Heck, AJ: Selective

isolation at the femtomole level of phosphopeptides from

proteolytic digests using 2D-NanoLC-ESI-MS/MS and titanium oxide

precolumns. Anal Chem 2004, 76, 3935-43.

170. Thingholm, TE, Jorgensen, TJ, Jensen, ON & Larsen, MR: Highly

selective enrichment of phosphorylated peptides using titanium

dioxide. Nat Protoc 2006, 1, 1929-35.

171. Larsen, MR, Thingholm, TE, Jensen, ON, Roepstorff, P & Jorgensen, TJ:

Highly selective enrichment of phosphorylated peptides from

peptide mixtures using titanium dioxide microcolumns. Mol Cell

Proteomics 2005, 4, 873-86.

172. Scanff, P, Yvon, M & Pelissier, JP: Immobilized Fe3+ affinity

chromatographic isolation of phosphopeptides. J Chromatogr 1991,

539, 425-32.

173. Gruhler, A, Olsen, JV, Mohammed, S, Mortensen, P, Faergeman, NJ,

Mann, M & Jensen, ON: Quantitative phosphoproteomics applied to

the yeast pheromone signaling pathway. Mol Cell Proteomics 2005, 4,

310-27.

174. Gupta, N, Tanner, S, Jaitly, N, Adkins, JN, Lipton, M, Edwards, R,

Romine, M, Osterman, A, Bafna, V, Smith, RD & Pevzner, PA: Whole

proteome analysis of post-translational modifications: applications

of mass-spectrometry for proteogenomic annotation. Genome Res

2007, 17, 1362-77.

175. Baerenfaller, K, Grossmann, J, Grobei, MA, Hull, R, Hirsch-Hoffmann, M,

Yalovsky, S, Zimmermann, P, Grossniklaus, U, Gruissem, W & Baginsky,

S: Genome-scale proteomics reveals Arabidopsis thaliana gene

models and proteome dynamics. Science 2008, 320, 938-41.

176. Ingrell, CR, Miller, ML, Jensen, ON & Blom, N: NetPhosYeast:

prediction of protein phosphorylation sites in yeast. Bioinformatics

2007, 23, 895-7.

177. Chen, H, Xue, Y, Huang, N, Yao, X & Sun, Z: MeMo: a web tool for

prediction of protein methylation modifications. Nucleic Acids Res

2006, 34, W249-53.

178. Shien, DM, Lee, TY, Chang, WC, Hsu, JB, Horng, JT, Hsu, PC, Wang,

TY & Huang, HD: Incorporating structural characteristics for

identification of protein methylation sites. J Comput Chem 2009, 30,

1532-43.

179. Iakoucheva, LM, Radivojac, P, Brown, CJ, O'Connor, TR, Sikes, JG,

Obradovic, Z & Dunker, AK: The importance of intrinsic disorder for

protein phosphorylation. Nucleic Acids Res 2004, 32, 1037-49.

180. Shao, J, Xu, D, Tsai, SN, Wang, Y & Ngai, SM: Computational

identification of protein methylation sites through bi-profile Bayes

feature extraction. PLoS One 2009, 4, e4920.

181. Gnad, F, Ren, S, Choudhary, C, Cox, J & Mann, M: Predicting post-

translational lysine acetylation using support vector machines.

Bioinformatics 2010, 26, 1666-8.

182. Li, S, Li, H, Li, M, Shyr, Y, Xie, L & Li, Y: Improved prediction of lysine

acetylation by support vector machines. Protein Pept Lett 2009, 16,

977-83.

183. Basu, A, Rose, KL, Zhang, J, Beavis, RC, Ueberheide, B, Garcia, BA,

Chait, B, Zhao, Y, Hunt, DF, Segal, E, Allis, CD & Hake, SB: Proteome-

wide prediction of acetylation substrates. Proc Natl Acad Sci U S A

2009, 106, 13785-90.

184. Sopko, R, Huang, D, Preston, N, Chua, G, Papp, B, Kafadar, K, Snyder,

M, Oliver, SG, Cyert, M, Hughes, TR, Boone, C & Andrews, B: Mapping

pathways and phenotypes by systematic gene overexpression. Mol

Cell 2006, 21, 319-30.

185. Hernandez, H, Dziembowski, A, Taverner, T, Seraphin, B & Robinson,

CV: Subunit architecture of multimeric complexes isolated directly

from cells. EMBO Rep 2006, 7, 605-10.

186. Sharon, M, Taverner, T, Ambroggio, XI, Deshaies, RJ & Robinson, CV:

Structural organization of the 19S proteasome lid: insights from MS

of intact complexes. PLoS Biol 2006, 4, e267.

187. Zhou, M, Sandercock, AM, Fraser, CS, Ridlova, G, Stephens, E,

Schenauer, MR, Yokoi-Fong, T, Barsky, D, Leary, JA, Hershey, JW,

Doudna, JA & Robinson, CV: Mass spectrometry reveals modularity

and a complete subunit interaction map of the eukaryotic

translation factor eIF3. Proc Natl Acad Sci U S A 2008, 105, 18139-44.

188. Grela, P, Krokowski, D, Gordiyenko, Y, Krowarsch, D, Robinson, CV,

Otlewski, J, Grankowski, N & Tchorzewski, M: Biophysical properties

of the eukaryotic ribosomal stalk. Biochemistry 2010, 49, 924-33.

189. Pawson, T & Scott, JD: Signaling through scaffold, anchoring, and

adaptor proteins. Science 1997, 278, 2075-80.

190. Stein, A & Aloy, P: Contextual specificity in peptide-mediated protein

interactions. PLoS One 2008, 3, e2524.

191. Stiffler, MA, Chen, JR, Grantcharova, VP, Lei, Y, Fuchs, D, Allen, JE,

Zaslavskaia, LA & MacBeath, G: PDZ domain binding selectivity is

optimized across the mouse proteome. Science 2007, 317, 364-9.

192. Komurov, K & White, M: Revealing static and dynamic modular

architecture of the eukaryotic protein interaction network. Mol Syst

Biol 2007, 3, 110.

193. Jensen, LJ, Jensen, TS, de Lichtenberg, U, Brunak, S & Bork, P: Co-

evolution of transcriptional and post-translational cell-cycle

regulation. Nature 2006, 443, 594-7.

194. Ingham, RJ, Colwill, K, Howard, C, Dettwiler, S, Lim, CS, Yu, J, Hersi, K,

Raaijmakers, J, Gish, G, Mbamalu, G, Taylor, L, Yeung, B, Vassilovski,

G, Amin, M, Chen, F, Matskova, L, Winberg, G, Ernberg, I, Linding, R,

O'Donnell, P, Starostine, A, Keller, W, Metalnikov, P, Stark, C & Pawson,

T: WW domains provide a platform for the assembly of multiprotein

networks. Mol Cell Biol 2005, 25, 7092-106.

195. Via, A, Gould, CM, Gemund, C, Gibson, TJ & Helmer-Citterich, M: A

structure filter for the Eukaryotic Linear Motif Resource. BMC

Bioinformatics 2009, 10, 351.

196. Robinson, CV, Sali, A & Baumeister, W: The molecular sociology of

the cell. Nature 2007, 450, 973-82.

197. Pereira-Leal, JB, Levy, ED, Kamp, C & Teichmann, SA: Evolution of

protein complexes by duplication of homomeric interactions.

Genome Biol 2007, 8, R51.

198. Levy, ED, Boeri Erba, E, Robinson, CV & Teichmann, SA: Assembly

reflects evolution of protein complexes. Nature 2008, 453, 1262-5.

199. Pereira-Leal, JB & Teichmann, SA: Novel specificities emerge by

stepwise duplication of functional modules. Genome Res 2005, 15,

552-9.

200. Glickman, MH, Rubin, DM, Fried, VA & Finley, D: The regulatory

particle of the Saccharomyces cerevisiae proteasome. Mol Cell Biol

1998, 18, 3149-62.

201. Sardiu, ME, Gilmore, JM, Carrozza, MJ, Li, B, Workman, JL, Florens, L &

Washburn, MP: Determining protein complex connectivity using a

probabilistic deletion network derived from quantitative proteomics.

PLoS One 2009, 4, e7310.

202. Sardiu, ME, Florens, L & Washburn, MP: Evaluation of clustering

algorithms for protein complex and protein interaction network

assembly. J Proteome Res 2009, 8, 2944-52.

203. Cote, J & Richard, S: Tudor domains bind symmetrical dimethylated

arginines. J Biol Chem 2005, 280, 28476-83.

204. Koch, CA, Anderson, D, Moran, MF, Ellis, C & Pawson, T: SH2 and SH3

domains: elements that control interactions of cytoplasmic

signaling proteins. Science 1991, 252, 668-74.

205. Kim, PM, Sboner, A, Xia, Y & Gerstein, M: The role of disorder in

interaction networks: a structural analysis. Mol Syst Biol 2008, 4,

179.

206. Gsponer, J, Futschik, ME, Teichmann, SA & Babu, MM: Tight

regulation of unstructured proteins: from transcript synthesis to

protein degradation. Science 2008, 322, 1365-8.

207. Choudhary, C, Kumar, C, Gnad, F, Nielsen, ML, Rehman, M, Walther,

TC, Olsen, JV & Mann, M: Lysine acetylation targets protein

complexes and co-regulates major cellular functions. Science 2009,

325, 834-40.

208. Kim, SC, Sprung, R, Chen, Y, Xu, Y, Ball, H, Pei, J, Cheng, T, Kho, Y,

Xiao, H, Xiao, L, Grishin, NV, White, M, Yang, XJ & Zhao, Y: Substrate

and functional diversity of lysine acetylation revealed by a

proteomics survey. Mol Cell 2006, 23, 607-18.

209. Bordoli, L, Kiefer, F & Schwede, T: Assessment of disorder

predictions in CASP7. Proteins 2007, 69 Suppl 8, 129-36.

210. Dunker, AK, Cortese, MS, Romero, P, Iakoucheva, LM & Uversky, VN:

Flexible nets. The roles of intrinsic disorder in protein interaction

networks. FEBS J 2005, 272, 5129-48.

211. Daily, KM, Radivojac, P & Dunker, AK. (2005). IEEE Symposium on

Computational Intelligence in Bioinformatics and Computational Biology,

CIBCB 2005, San Diego, California, U.S.A.

212. Ekman, D, Light, S, Bjorklund, AK & Elofsson, A: What properties

characterize the hub proteins of the protein-protein interaction

network of Saccharomyces cerevisiae? Genome Biol 2006, 7, R45.

213. Xie, H, Vucetic, S, Iakoucheva, LM, Oldfield, CJ, Dunker, AK, Obradovic,

Z & Uversky, VN: Functional anthology of intrinsic disorder. 3.

Ligands, post-translational modifications, and diseases associated

with intrinsically disordered proteins. J Proteome Res 2007, 6, 1917-

32.

214. Xie, H, Vucetic, S, Iakoucheva, LM, Oldfield, CJ, Dunker, AK, Uversky,

VN & Obradovic, Z: Functional anthology of intrinsic disorder. 1.

Biological processes and functions of proteins with long disordered

regions. J Proteome Res 2007, 6, 1882-98.

215. Vucetic, S, Xie, H, Iakoucheva, LM, Oldfield, CJ, Dunker, AK, Obradovic,

Z & Uversky, VN: Functional anthology of intrinsic disorder. 2.

Cellular components, domains, technical terms, developmental

processes, and coding sequence diversities correlated with long

disordered regions. J Proteome Res 2007, 6, 1899-916.

216. Noivirt-Brik, O, Prilusky, J & Sussman, JL: Assessment of disorder

predictions in CASP8. Proteins 2009, 77 Suppl 9, 210-6.

217. Vavouri, T, Semple, JI, Garcia-Verdugo, R & Lehner, B: Intrinsic protein

disorder and interaction promiscuity are widely associated with

dosage sensitivity. Cell 2009, 138, 198-208.

218. Edwards, YJ, Lobley, A, Pentony, MM & Jones, DT: Insights into the

regulation of intrinsically disordered proteins in the human

proteome by analysing sequence and gene expression data.

Genome Biol 2009, 10, R50.

219. Barsnes, H, Mikalsen, SO & Eidhammer, I: Blind search for post-

translational modifications and amino acid substitutions using

peptide mass fingerprints from two proteases. BMC Res Notes 2008,

1, 130.

220. Bhaskaran, N, Iwahana, H, Bergquist, J, Hellman, U & Souchelnytskyi, S:

Novel post-translational modifications of Smad2 identified by mass

spectrometry. Cent Eur J Biol 2008, 3, 359-70.

221. Eidhammer, I, Barsnes, H & Mikalsen, SO: MassSorter: peptide mass

fingerprinting data analysis. Methods Mol Biol 2008, 484, 345-59.

222. Nabel, GJ: Philosophy of science. The coordinates of truth. Science

2009, 326, 53-4.

223. Sugiyama, N, Nakagami, H, Mochida, K, Daudi, A, Tomita, M, Shirasu, K

& Ishihama, Y: Large-scale phosphorylation mapping reveals the

extent of tyrosine phosphorylation in Arabidopsis. Mol Syst Biol

2008, 4, 193.

224. Li, X, Gerber, SA, Rudner, AD, Beausoleil, SA, Haas, W, Villen, J, Elias,

JE & Gygi, SP: Large-scale phosphorylation analysis of alpha-factor-

arrested Saccharomyces cerevisiae. J Proteome Res 2007, 6, 1190-7.

225. Bose, R, Molina, H, Patterson, AS, Bitok, JK, Periaswamy, B, Bader, JS,

Pandey, A & Cole, PA: Phosphoproteomic analysis of Her2/neu

signaling and inhibition. Proc Natl Acad Sci U S A 2006, 103, 9773-8.

226. Chi, A, Huttenhower, C, Geer, LY, Coon, JJ, Syka, JE, Bai, DL,

Shabanowitz, J, Burke, DJ, Troyanskaya, OG & Hunt, DF: Analysis of

phosphorylation sites on proteins from Saccharomyces cerevisiae

by electron transfer dissociation (ETD) mass spectrometry. Proc

Natl Acad Sci U S A 2007, 104, 2193-8.

227. Zhao, Y & Jensen, ON: Modification-specific proteomics: strategies

for characterization of post-translational modifications using

enrichment techniques. Proteomics 2009, 9, 4632-41.

228. Golebiowski, F, Matic, I, Tatham, MH, Cole, C, Yin, Y, Nakamura, A,

Cox, J, Barton, GJ, Mann, M & Hay, RT: System-wide changes to

SUMO modifications in response to heat shock. Sci Signal 2009, 2,

ra24.

229. Zhu, H, Klemic, JF, Chang, S, Bertone, P, Casamayor, A, Klemic, KG,

Smith, D, Gerstein, M, Reed, MA & Snyder, M: Analysis of yeast

protein kinases using protein chips. Nat Genet 2000, 26, 283-9.

230. van Leeuwen, F, Gafken, PR & Gottschling, DE: Dot1p modulates

silencing in yeast by methylation of the nucleosome core. Cell 2002,

109, 745-56.

231. McBride, AE, Cook, JT, Stemmler, EA, Rutledge, KL, McGrath, KA &

Rubens, JA: Arginine methylation of yeast mRNA-binding protein

Npl3 directly affects its function, nuclear export, and intranuclear

protein interactions. J Biol Chem 2005, 280, 30888-98.

232. Fiedler, D, Braberg, H, Mehta, M, Chechik, G, Cagney, G, Mukherjee, P,

Silva, AC, Shales, M, Collins, SR, van Wageningen, S, Kemmeren, P,

Holstege, FC, Weissman, JS, Keogh, MC, Koller, D, Shokat, KM &

Krogan, NJ: Functional organization of the S. cerevisiae

phosphorylation network. Cell 2009, 136, 952-63.

233. Tan, CS & Linding, R: Experimental and computational tools useful

for (re)construction of dynamic kinase-substrate networks.

Proteomics 2009, 9, 5233-42.

234. Shen, YQ, Song, SY & Lin, ZJ: Structures of D-glyceraldehyde-3-

phosphate dehydrogenase complexed with coenzyme analogues.

Acta Crystallogr D Biol Crystallogr 2002, 58, 1287-97.

235. Fenselau, A: Structure-function studies on glyceraldehyde 3-

phosphate dehydrogenase. IV. Subunit interactions of the rabbit

muscle and yeast enzymes. J Biol Chem 1972, 247, 1074-9.

236. Marton, MJ, Vazquez de Aldana, CR, Qiu, H, Chakraburtty, K &

Hinnebusch, AG: Evidence that GCN1 and GCN20, translational

regulators of GCN4, function on elongating ribosomes in activation

of eIF2alpha kinase GCN2. Mol Cell Biol 1997, 17, 4474-89.

237. Mahajan, R, Delphin, C, Guan, T, Gerace, L & Melchior, F: A small

ubiquitin-related polypeptide involved in targeting RanGAP1 to

nuclear pore complex protein RanBP2. Cell 1997, 88, 97-107.

238. Frederiks, F, Tzouros, M, Oudgenoeg, G, van Welsem, T, Fornerod, M,

Krijgsveld, J & van Leeuwen, F: Nonprocessive methylation by Dot1

leads to functional redundancy of histone H3K79 methylation

states. Nat Struct Mol Biol 2008, 15, 550-7.

239. Fingerman, IM, Wu, CL, Wilson, BD & Briggs, SD: Global loss of Set1-

mediated H3 Lys4 trimethylation is associated with silencing

defects in Saccharomyces cerevisiae. J Biol Chem 2005, 280, 28761-

5.

240. Xu, L, Zhao, Z, Dong, A, Soubigou-Taconnat, L, Renou, JP, Steinmetz, A

& Shen, WH: Di- and tri- but not monomethylation on histone H3

lysine 36 marks active transcription of genes involved in flowering

time regulation and other processes in Arabidopsis thaliana. Mol

Cell Biol 2008, 28, 1348-60.

241. Li, M, Luo, J, Brooks, CL & Gu, W: Acetylation of p53 inhibits its

ubiquitination by Mdm2. J Biol Chem 2002, 277, 50607-11.

242. Oberdorf, R & Kortemme, T: Complex topology rather than complex

membership is a determinant of protein dosage sensitivity. Mol Syst

Biol 2009, 5, 253.

243. Semple, JI, Vavouri, T & Lehner, B: A simple principle concerning the

robustness of protein complex activity to changes in gene

expression. BMC Syst Biol 2008, 2, 1.

244. Ma, L, Pang, CN, Li, SS & Wilkins, MR: Proteins deleterious on

overexpression are associated with high intrinsic disorder, specific

interaction domains and low abundance. J Proteome Res 2010, 9,

1218-25.

245. Tompa, P, Fuxreiter, M, Oldfield, CJ, Simon, I, Dunker, AK & Uversky,

VN: Close encounters of the third kind: disordered domains and the

interactions of proteins. Bioessays 2009, 31, 328-35.

246. Niu, W, Li, Z, Zhan, W, Iyer, VR & Marcotte, EM: Mechanisms of cell

cycle control revealed by a systematic and quantitative

overexpression screen in S. cerevisiae. PLoS Genet 2008, 4,

e1000120.

247. de Lichtenberg, U, Jensen, TS, Brunak, S, Bork, P & Jensen, LJ:

Evolution of cell cycle control: same molecular machines, different

regulation. Cell Cycle 2007, 6, 1819-25.

248. Rechsteiner, M & Rogers, SW: PEST sequences and regulation by

proteolysis. Trends Biochem Sci 1996, 21, 267-71.

249. Midic, U, Oldfield, CJ, Dunker, AK, Obradovic, Z & Uversky, VN: Protein

disorder in the human diseasome: unfoldomics of human genetic

diseases. BMC Genomics 2009, 10 Suppl 1, S12.

10 Appendices

10.1 Appendix I - Publications

Hsu, WT, Pang, CN, Sheetal, J & Wilkins, MR: Protein-protein interactions and disease: use of S. cerevisiae as a model system. Biochim Biophys Acta 2007, 1774,

838-47.

Pang, CN, Lin, K, Wouters, MA, Heringa, J & George, RA: Identifying foldable regions in protein sequence from the hydrophobic signal. Nucleic Acids Res 2008,

36, 578-88.

Widjaja, YY, Pang, CN, Li, SS, Wilkins, MR & Lambert, TD: The Interactorium: visualising proteins, complexes and interaction networks in a virtual 3D cell.

Proteomics 2009, 9, 5309-15. Copyright Wiley-VCH Verlag GmbH & Co. KGaA.

Reproduced with permission.

Biochimica et Biophysica Acta 1774 (2007) 838–847 www.elsevier.com/locate/bbapap

Protein–protein interactions and disease: Use of S. cerevisiae as a model system

Wei-Tse Hsu, Chi Nam Ignatius Pang, Josefa Sheetal, Marc R. Wilkins ⁎

School of Biotechnology and Biomolecular Science, University of New South Wales, NSW 2052, Australia

Received 6 December 2006; received in revised form 27 April 2007; accepted 27 April 2007

Available online 5 May 2007

Abstract

Disease-causing mutations are increasingly being studied to see if they cause the loss or gain of protein–protein interactions. Because the interaction network of humans is poorly understood and difficult to investigate, here we propose the use of Saccharomyces cerevisiae as a model system for understanding the impact of disease-causing mutations on protein–protein interactions. Alignments of human disease-associated proteins and 379 yeast orthologs showed that 124 of these proteins have >40% sequence identity, with some orthologs having up to 89% identity. A total of 1826 amino acid mutations associated with human disease were found to map to invariant amino acids in yeast. These mutations were proportionately more likely to be non-conservative than non-disease associated polymorphisms for the same proteins (p=0.016). Importantly, 73 of the mutations mapped to protein–protein interaction domains, implying a direct link between mutation and changes in protein interactivity. In the manuscript, all alignment information and tables that map mutations and diseases to yeast orthologs are given. This will help researchers experimentally test the impact of mutations on protein–protein interactions in S. cerevisiae and, by homology, explore the role of such mutations in the genesis of human disease.

Crown Copyright © 2007 Published by Elsevier B.V. All rights reserved.

Keywords: Protein–protein interactions; Disease-associated mutations; Disease orthologs; Protein interaction domains; Single nucleotide polymorphisms; Saccharomyces cerevisiae
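The alignment-based analysis summarised in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the published work used ProbCons alignments and custom perl scripts, whereas the function names (`percent_identity`, `map_human_position`) and the two toy sequences below are hypothetical. Percent identity follows the definition given in the Materials and methods (identical aligned residues divided by the length of the yeast protein), and a human residue is called invariant when the yeast residue aligned to it is the same amino acid.

```python
"""Toy sketch: percent identity and mutation mapping on a gapped pairwise alignment."""

GAP = "-"


def percent_identity(aligned_human: str, aligned_yeast: str) -> float:
    """Identical aligned residues divided by the ungapped yeast length, in percent."""
    if len(aligned_human) != len(aligned_yeast):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(
        1 for h, y in zip(aligned_human, aligned_yeast) if h == y and h != GAP
    )
    yeast_length = sum(1 for y in aligned_yeast if y != GAP)
    return 100.0 * identical / yeast_length


def map_human_position(aligned_human: str, aligned_yeast: str, human_pos: int):
    """Map a 1-based human residue number onto the aligned yeast residue.

    Returns (yeast_pos, yeast_residue, is_invariant), or None when the human
    residue is aligned to a gap or the position is out of range.
    """
    h_count = 0  # human residues seen so far (gaps excluded)
    y_count = 0  # yeast residues seen so far (gaps excluded)
    for h, y in zip(aligned_human, aligned_yeast):
        if h != GAP:
            h_count += 1
        if y != GAP:
            y_count += 1
        if h != GAP and h_count == human_pos:
            if y == GAP:
                return None
            return (y_count, y, h == y)
    return None


# Hypothetical toy alignment, invented for illustration (not real protein sequences).
HUMAN = "MKLG-RTWAC"
YEAST = "M-LGARSWAC"

if __name__ == "__main__":
    print(round(percent_identity(HUMAN, YEAST), 1))  # 77.8
    print(map_human_position(HUMAN, YEAST, 4))       # (3, 'G', True)
```

In this toy case human residue 4 maps to an identical yeast residue and would count as invariant, whereas human residue 2 is aligned to a gap and cannot be mapped at all.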

Abbreviations: 2-DE, two-dimensional polyacrylamide gel electrophoresis; LC-MS-MS, liquid chromatography tandem mass spectrometry; TAP, tandem affinity purification; OMIM, Online Mendelian Inheritance in Man; UniProt, universal protein resource; dbSNP, single nucleotide polymorphism database; SGD, Saccharomyces genome database; NDK, nucleoside diphosphate kinase; dRTA, distal renal tubular acidosis; NF1, neurofibromin; IRA2, inhibitory regulator protein; RASK, GTPase KRas; MSH2, DNA mismatch repair protein

⁎ Corresponding author. Tel.: +61 2 9385 3633; fax: +61 2 9385 1483. E-mail address: [email protected] (M.R. Wilkins).

1570-9639/$ – see front matter. Crown Copyright © 2007 Published by Elsevier B.V. All rights reserved. doi:10.1016/j.bbapap.2007.04.014

1. Introduction

Technology for the systematic mapping of protein–protein interactions is providing a high-order view of the proteome. The yeast two-hybrid technique, particularly strong for the elucidation of pairwise and transient protein–protein interactions, was initially used to confirm and investigate defined biological processes [1,2]. Its potential to be applied on a massive scale was, however, soon recognised by many groups, and was used to generate large-scale views of protein–protein interactions in Helicobacter pylori, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Plasmodium falciparum [3–7]. Recently, these approaches have also been used to generate partial protein–protein interaction networks for human cells [8,9].

Affinity purification techniques have provided an alternative means to study protein complexes and protein–protein interaction networks. Specific tags are engineered onto individual proteins, and affinity chromatography used to purify the tagged proteins and any others that are physically co-associated. All proteins in the complex can then be identified by mass spectrometry. Two different affinity purification methods, TAP-tag and Flag-tag, have been described to date. Both have been applied to yeast on a global scale, by the systematic tagging and analysis of thousands of different proteins [10–13]. These approaches have, however, yet to be applied to the large-scale study of protein–protein interactions in the human cell.

Global protein interaction research is providing dramatic insights into the structure and function of the cell, but is yet to be used to explore the molecular basis of disease. This is surprising, as whilst disease often has a basis in germ-line or somatic mutations, the genesis of many diseases is likely to be in the corresponding loss or gain of one or more protein–protein interactions [14]. There are a number of scenarios that are possible in this regard. Mutations in a protein may:

• Alter an interaction domain, causing a stable protein complex to fail to form. This may reduce or eliminate enzymatic or other function.

• Alter a localisation or target sequence, which reduces or removes a proteins' capacity to interact with translocator proteins. This may result in incorrect protein localisation.

• Alter an amino acid that is usually post-translationally modified. The lack of a modification on a particular amino acid (e.g. phosphorylation) may prevent that protein from interacting with a modification-specific interaction domain (e.g. SH2 domain).

• Alter amino acids that form a sequence motif associated with protein post-translational modification. The mutation or loss of a sequence motif will prevent the interaction of the protein with enzymes which usually add the post-translational modification.

The above points concern situations where mutations result in the loss of protein–protein interactions. However, it is equally possible that mutations in proteins result in the gain of new, inappropriate protein–protein interactions or interactions of increased affinity.

There are a number of human diseases known to be associated with the loss or gain of protein–protein interactions (reviewed in [14]). Familial melanoma, hereditary nonpolyposis colorectal cancer and bare lymphocyte syndrome are known to involve the loss of protein–protein interactions [15–17]. By contrast, mutations associated with Apert's syndrome, Werner's syndrome, prostate cancer and CADASIL (Cerebral Autosomal Dominant Arteriopathy with Subcortical Infarcts and Leukoencephalopathy) show increased interaction affinities or new, inappropriate interactions [18–20]. The textbook example of HbS mutations in haemoglobin, leading to the polymerisation of hemoglobin and sickle cell anaemia, is another extremely well-characterised example of this. Correct protein–protein interactions are clearly required for the normal operation of the human cell.

Whilst the association of protein–protein interactions with human disease is becoming clear, our poor understanding of the human protein–protein interaction network [8,9] and our relatively limited capacity to manipulate human cells means that progress in the investigation of protein–protein interactions associated with disease will be slow. Whilst some lower eukaryotes have well-researched protein–protein interaction networks, they are yet to be exploited as model systems of protein interaction-associated disease. Of all the eukaryotes, the interaction network of S. cerevisiae is the best understood and most complete. It has been the subject of two global studies using yeast two-hybrid methods [4,5], four global studies using protein tagging and complex purification approaches [10–13] and metadata studies to produce high-quality networks [12,21]. Other model eukaryotes have only a fraction of these data available and have been analysed only using the two-hybrid technique, known to have a relatively high false positive rate [22]. Accordingly, can S. cerevisiae serve as a model for protein–protein interaction studies?

There are 383 proteins in S. cerevisiae that have been suggested as direct orthologs for proteins associated with, or responsible for, human disease [23–26]. These proteins can be categorically associated with cancer, neurological and malformation syndromes, endocrine, renal, haematological, immune and metabolic disorders [26]. Studies of yeast orthologs to human proteins have already generated significant insight into human disease, although not in association with studies of protein–protein interactions. Studies that consider fundamental eukaryotic cellular and metabolic processes have been of particular relevance [27]. For example, human disease mutations in cytochrome b, when engineered into orthologous proteins in S. cerevisiae allowed the molecular basis of the associated respiratory defects to be understood [28]. When expressed in S. cerevisiae, the human alpha-synuclein protein (implicated in neurodegenerative disorders including Parkinson's disease) revealed molecular insights into the pathways underlying normal alpha-synuclein biology and the pathogenic consequences of its misfolding [29]. Through studies in S. cerevisiae and other yeasts, the critical proteins in human peroxisomal disorders have also been discovered [30]. In these cases and many others (see >50 examples in [27]), working with S. cerevisiae proved advantageous as it is biochemically and genetically well understood, it is easily manipulated, it grows quickly to quantities required for detailed studies, and it is one of the few organisms for which many of the -omics techniques have been developed and applied on a global scale (e.g. mRNA expression analysis, protein–protein interaction studies, expression proteomics, systematic deletions). We believe that the depth of knowledge about the S. cerevisiae protein interaction network, in combination with a body of work which illustrates the utility of S. cerevisiae as a model system for understanding fundamental processes in eukaryotes, raises the possibility that S. cerevisiae may be useful as a model to study changes in protein–protein interactions associated with disease. Here we explore the possibility that S. cerevisiae can be used for this purpose.

2. Materials and methods

2.1. Selection of human proteins and their yeast orthologs

Human and S. cerevisiae proteins used in this study were from the orthodisease database [23]. To be in this database, the authors stipulated that human proteins had to be clearly associated with disease (as documented in the OMIM and/or UniProt database), and all yeast proteins had to be clear orthologs to their human counterpart.

2.2. Sequences, sequence variants, protein domains and hub status

All sequences for human and yeast proteins were from release 48.0 or higher of the SWISS-PROT database (http://ch.expasy.org/swissprot). Mutations for human proteins were computer-extracted from the VARIANT field of SWISS-PROT or were manually extracted from the OMIM (Online Mendelian Inheritance in Man) database (http://www.ncbi.nlm.nih.gov/omim/). Single nucleotide polymorphisms from dbSNP (http://www.ncbi.nlm.nih.gov/SNP/), thought to have no association with disease, were also extracted from the VARIANT field of SWISS-PROT. Yeast protein–protein interaction data was extracted from the Saccharomyces genome database SGD [31], and the classification of yeast proteins into hub or non-hub proteins was as described by Han et al. [21]. Protein interaction domains were extracted from the DOMAIN field of the SWISS-PROT database.

2.3. High accuracy pairwise alignment of human disease proteins and yeast orthologs

ProbCons is documented to be the most accurate means of pairwise sequence alignment [32]. Accordingly, ProbCons [33] was used to align human protein sequences with their S. cerevisiae orthologs. Parameters for ProbCons included the BLOSUM62 matrix [34] for the hidden Markov model, default consistency repetitions of 2, iterative refinements repetitions of 100 and no pretraining repetitions. The quality of all 379 sequence alignments was assessed manually. The percent sequence identity of all alignments was calculated by dividing the number of conserved amino acids between yeast and human sequences by the total number of amino acids in the yeast proteins.

2.4. Mapping of mutations from human proteins to yeast orthologs

Only single amino acid changes have been studied here; other mutations (insertions, deletions, truncations) have been ignored. Human disease-associated mutations and non-disease associated SNPs were mapped onto the yeast–human sequence alignment to determine if the mutations occurred on invariant or variant amino acids between the two proteins. This gave us the capacity to measure 4 separate aspects (a) the number of disease-causing single amino acid mutations that map to invariant amino acids (b) the number of disease-causing single amino acid mutations that map to variant amino acids (c) the number of non-disease associated polymorphisms that mapped to invariant amino acids, and (d) the number of non-disease associated polymorphisms that mapped to variant amino acids. The number and type of amino acid in each group, above, was collated.

2.5. Protein–protein interactions of disease-associated orthologs and mapping of mutations onto protein interaction domains

The protein–protein interactions (number and type) known for each yeast protein was investigated. We considered information from high-throughput studies (yeast two-hybrid, flag-tag, TAP-tag) as well as from other low-throughput studies, as collated in the SGD. For proteins known to interact with others, the status of the protein as a hub or non-hub protein was determined through reference to Han et al. [21]. Disease-associated protein mutations, known to map to invariant amino acids between yeast and human proteins, were mapped to protein interaction domains. Custom perl scripts were used for SWISS-PROT associated mutations and manual alignment was used in the case of OMIM-derived mutation data.

3. Results and discussion

To evaluate S. cerevisiae as a model to study protein–protein interactions in human disease, we asked a number of questions. Firstly, what proteins are sufficiently similar between humans and yeast for them to be clear orthologs? Secondly, are these proteins known to be associated with human disease? In essence, these questions had previously been explored by O'Brien et al. [23] who proposed that there are 383 proteins in S. cerevisiae that are direct orthologs to human proteins, whereby these proteins were documented in the online Mendelian inheritance in man (OMIM) database to be mutated in association with human pathology. O'Brien et al. [23] termed these proteins ‘orthodisease’ proteins.

We then explored the question: do human disease-causing mutations map to invariant amino acids on their yeast orthologs? This question is of importance because if human mutations map to invariant amino acids in yeast, the engineering of such ‘mutations’ into yeast orthodisease proteins may help us understand the effect they have on protein interactions. Where mutations were found to map directly between human and yeast proteins, we then asked: do these mutations disrupt the structure of the affected proteins, and are the mutations known to be part of a protein–protein interaction domain? For the sake of illustration and clarity, we will first present one example. We then present the results of the study of all 379 yeast orthodisease proteins.

3.1. The example of vacuolar H+ ATPase

Mutations in the vacuolar H+ ATPase (vH+ ATPase) are responsible for distal renal tubular acidosis in humans [35–37]. The vH+ ATPase contains two interconnected complexes: the V1 cytoplasmic complex (made of subunits A to H in stoichiometry A3B3C1D1E1F1G2H1), and the V0 integral membrane complex (composed of 5 subunits a, c, c′, c″, d in stoichiometry a1c6c′6c″1d1) [38]. The complexes are identical in yeast and man. Sequence comparison of human and yeast orthodisease proteins showed that the V1 B subunit of the vH+ ATPase was extremely highly conserved, having 72% sequence identity. The V0 a subunit is less highly conserved, having 37% sequence identity. All 10 amino acids mutated in the human V1 B subunit in dRTA are invariant in the yeast ortholog, and 5 of the 7 amino acids subject to mutation in the V0 a subunit are invariant in the yeast ortholog (see Table 1). The degree of evolutionary conservation of these amino acids, and their contribution to disease if mutated, suggests that they are critical for the correct function and interactions of proteins in the vH+ ATPase. It also raises the prospect that these mutations could be studied in yeast. The effect of engineering four human mutations into the yeast vacuolar ATPase was explored recently by Ochotny et al. [39]. One mutation, W519L, was found to affect the assembly and stability of the vH+ ATPase. This was proposed to be due to changes in affinity between the V0 a subunit and the Vma12p–Vma22p complex, resulting in aberrant interactions. This is one of the first examples of how disease-associated mutations in yeast can provide insight into their effects on protein–protein interactions.

Table 1
Human mutations from OMIM, dbSNP and the SWISS-PROT database for the V1 B and V0 a subunits for the vH+ ATPase, mapped to the sequence of the S. cerevisiae ortholog

vH+ ATPase V1 B subunit                       vH+ ATPase V0 a subunit
Human mutation       Orthologous amino acid   Human mutation       Orthologous amino acid
Leu 81 to Pro        Leu 68                   Gly 175 to Asp       Gly 188
Gly 123 to Val       Gly 110                  Lys 237 deletion     Lys 250
Arg 124 to Trp       Arg 111                  Arg 449 to His       Arg 461
Arg 157 to Cys       Arg 144                  Pro 524 to Leu       Trp 519
Glu 161 to Lys (a)   Glu 148                  Met 580 to Thr       Leu 575
Met 174 to Arg       Met 161                  Arg 807 to Gln       Arg 799
Thr 275 to Pro       Thr 262                  Gly 820 to Arg       Gly 811
Gly 316 to Glu       Gly 303
Pro 346 to Arg       Pro 333
Gly 364 to Ser       Gly 351
Arg 465 to His       Arg 352

(a) This mutation is a polymorphism that does not cause dRTA.

3.2. Many orthodisease proteins have high sequence identity with human proteins

Whilst O'Brien et al. [23] proposed 383 orthodisease proteins, they did not investigate the sequence identity of these proteins but used BLAST scores to assess similarity. As it was important for us to understand the precise sequence similarity of yeast and human orthologs, we undertook high-quality pairwise sequence alignments. The sequence identity between human and yeast orthologs was up to 89%, with 124 proteins having >40% sequence identity. Considering the degree of evolutionary distance between yeast and man, this is quite striking and indicates that many of these proteins are involved in fundamental cellular processes in the eukaryotic cell. The top 10 alignments, by percent sequence identity, are shown in Table 2 and the percent sequence similarity for all 379 proteins is given in supplementary Table S1. This high degree of sequence homology between orthologs and human proteins also suggests why mutations in many of these proteins may have such deleterious effect. Note that 4 proteins studied in O'Brien et al. [23], initially defined as paralogs, have since been redefined in the SWISS-PROT database as the same protein. This accounts for the differences in the number of proteins studied by O'Brien (383 proteins) and us (379 proteins).

If we are to use yeast as a model to study the effects of mutation on protein–protein interactions, one approach would be to engineer small mutations into yeast orthologs by homologous recombination.

appropriately TAP-tagged proteins [10], this would present an immediate means to purify mutated protein complexes of interest. In line with this strategy, we determined which human proteins are known to cause disease from single amino acid mutations. Of the 379 human proteins, 232 of these were found to have one or more single amino acid mutations known to be associated with disease. For these 232 proteins, each protein had an average of 7.8 disease-causing single amino acid mutations (see supplementary Table S1). The remaining proteins contained mutations that were either insertions or deletions, or were effectively null mutants due to truncations or chromosomal aberrations. These types of mutations could also be used to investigate the effects, if any, on protein–protein interactions in orthologs, however, they are slightly more difficult to engineer into proteins and their effect on proteins (as explored below) are more difficult to predict. To provide a means of comparison, we also considered the number of human disease proteins that are known to have non-deleterious SNPs. A total of 138 of the 232 proteins are documented to have SNPs (see supplementary Table S1); these can serve as control ‘mutations’ for some of the above experiments.

3.3. Many human mutations map to invariant amino acids on orthodisease proteins

Having determined that many mutations are known for proteins that are highly conserved between man and yeast, we then mapped the human amino acid mutations onto the yeast orthodisease proteins. This yielded a most striking result. A total of 1826 single amino acid mutations in human proteins could be mapped directly to invariant amino acids on 232 yeast orthologs. Between 1 and 88 mutations were mapped per yeast ortholog (average 7.8) with higher numbers usually indicating numerous different mutations mapping to a single
If this is done into strains with site rather than a large number of different mutation sites in the protein. Some of these mutations mapped to known protein– protein interaction domains; this is explored later in this Table 2 Orthodisease proteins that have highest percent sequence identity to their human manuscript. counterpart 3.4. Are disease-causing mutations on human proteins Yeast ortholog % Human protein Disease b ID a conservative or non-conservative? ACT_YEAST 89 ACTG_HUMAN Deafness, autosomal dominant BMH2_YEAST 73 1433E_HUMAN Miller–Dieker lissencephaly Mutations in proteins can be conservative or non-conserva- VATB_YEAST 72 VATB1_HUMAN Renal tubular acidosis with tive. The latter may affect the three dimensional structure of a deafness protein, possibly changing its capacity to interact with its usual BMH1_YEAST 69 1433E_HUMAN Miller–Dieker lissencephaly partners [40]. To understand if disease-associated mutations are UBC2_YEAST 69 UBE2B_HUMAN Male infertility likely to affect protein structure and interaction, we explored the METK_YEAST 68 METK1_HUMAN Hypermethioninemia, persistent, autosomal dominant types of single amino acid mutation that are found in the human METL_YEAST 67 METK1_HUMAN Hypermethioninemia, persistent, disease-associated proteins. Importantly, we restricted our study autosomal dominant to amino acid mutations that occur on invariant amino acids RAD51_YEAST 66 RAD51_HUMAN Breast cancer, susceptibility to between the human protein and their yeast ortholog, as these are DHSB_YEAST 65 DHSB_HUMAN Paraganglioma, familial obvious candidate mutations that may ultimately be made to the malignant PGK_YEAST 65 PGK1_HUMAN Haemolytic anaemia and orthologous yeast proteins. To assist our interpretation of this, myoglobinuria/hemolysis and to provide a basis for comparison between disease- and Protein names are SWISS-PROT accession numbers. 
non-disease causing mutations, we also studied the nature of all a Percent sequence identity from ProbCons parwise sequence alignment. non-disease associated single amino acid polymorphisms from b From the OMIM database. dbSNP for the same human proteins. As with the disease- 842 W.-T. Hsu et al. / Biochimica et Biophysica Acta 1774 (2007) 838–847

Table 3 Disease-associated single amino acid mutations in human proteins that are found on invariant amino acids with their yeast ortholog To D E H K R C G N Q S T Y A F I L M P V W X Total From D 0 6 14 0 0 0 19 39 0 0 0 14 7 0 0 0 0 0 13 0 0 112 E 900600080600060000020697 H 3000180021100610030500049 K 01400 70010500000102000443 R 0 0 77 5 0 58 22 0 58 12 4 0 0 0 0 17 0 30 0 60 17 360 C 000013010020905000003033 G 42 18 0 0 100 12 0 0 2 41 0 0 13 0 0 0 0 0 33 6 2 269 N 2001400000244300500000052 Q 032212000000000030200933 S 0000113412000001221801001073 T 0006 5001050011013033800082 Y 3090 02605050003000000657 A 6300 003005280000008350088 F 0000 031001401000210050045 I 0004 200402210060010040053 L 0020190007800016401571111127 M0003 600000700011200130042 P 00301200010164012005300000110 V 5100 008000001035629000067 W0000182100000000200001134 Amino acids are grouped into 4 fundamental types: acidic, basic, polar, nonpolar. X denotes a mutation to a stop codon in the protein. A total of 1826 mutations are documented here. associated changes, only polymorphisms on invariant amino aspartic acid and proline, with a total of 360, 269, 127, 112 and acids between human and yeast were included. This had the 110 changes respectively. For the non-disease polymorphisms, a effect of reducing the total number of SNPs for study from 647 slightly different group of amino acids were most often to 144, however this was necessary to provide an appropriate changed. Arginine, glutamic acid, isoleucine, leucine and control and basis for comparison. glycine showed the highest frequency of change with a total A total of 1826 single amino acid disease-associated of 23, 15, 13, 12 and 11 polymorphisms respectively (Table 4). mutations were considered (Table 3). The amino acids that We further considered what the modified amino acids were were most often changed were arginine, glycine, leucine, changed to in each case. Here we have focused on the nature of

Table 4 Non-disease associated single amino acid polymorphisms in human proteins, which map to invariant amino acids between human and yeast orthologs To D EHK RCGNQSTYAFI LMPVWTotal From D 0 1 1 0 0 0 03000000000010 6 E0 00100010300000000010 15 H0 00 00002100100000000 4 K0 40 02000000000000000 6 R0 03 00740500000030001 23 C0 00 00000000000000000 0 G4 00 04000010000000020 11 N2 00 00000030000000000 5 Q0 01 01000000000000000 2 S 0 00 00102001011010100 8 T0 00 10002000000000000 3 Y0 03 00100000001000000 5 A0 00 00000003000000130 7 F 0 00 00000000000100000 1 I 0 00 00000005000002060 13 L0 00 00000300003001230 12 M0 00 00000000000200030 5 P 0 00 00000020010040000 7 V1 10 00000000010501000 9 W0 00 00010000000010000 2 Amino acids are grouped into 4 fundamental types: acidic, basic, polar, nonpolar. A total of 144 polymorphisms are documented here. W.-T. Hsu et al. / Biochimica et Biophysica Acta 1774 (2007) 838–847 843 the change, to see if any amino acids were changed between the between disease and non-disease associated mutations were broad classes of acidic, basic, polar or non-polar (Table 5A, B). basic to acidic (0.4 fold), acidic to basic (0.5 fold), nonpolar to From this, it was apparent that a greater number of non- basic (∞ due to this class being absent in nondisease muta- conservative amino acid changes were seen in disease- tions) and polar to nonpolar (2.1 fold). Accordingly, different associated proteins (60.6%) over the non-disease associated types of changes occur in disease-associated mutations that in polymorphisms (50%). When compared by a Chi-squared test, non-disease associated polymorphisms. the proportion of non-conservative amino acid changes in disease-associated proteins was found to be significantly higher 3.5. Some disease-causing mutations on human proteins map than those from non-disease associated polymorphisms (p= to protein interaction domains 0.016). 
Notwithstanding this difference, it is perhaps surprising that the number of non-conservative substitutions associated Finally, we sought to understand if any of the mutations we with SNPs is so high. This may be because these substitutions see in human proteins map to known protein–protein interaction are to non-critical parts of proteins that are tolerant to amino domains. As we were interested to see if yeast may present a acid changes, whereas disease-associated mutations are to means to study these interactions, we focused only on mutations structurally critical parts of proteins. As increasing numbers of that were not just disease-associated, but were also found on SNPs are mapped to coding sequences, it will be important to invariant amino acids between the human proteins and their determine that this observation, seen here with a relatively small yeast orthologs. In total, 73 mutations were found to map to 11 number of SNPs, holds true. interaction domains on 10 different proteins. Seven of the We also evaluated the degree of difference, if any, in the interaction domains are for protein–protein interaction, and 4 types of changes in class of amino acid found in disease- are for domains that interact with amino acid motifs that include associated mutations versus non-disease associated poly- phosphoamino acids. The proteins, mutations, and interaction morphism (Table 5C). For example, 6.7% of single amino domains that appear to be affected by the disease-associated acid changes in disease modified basic amino acids to nonpolar mutations are shown in Table 6. The total number of mutations amino acids. However, only 2.8% of single amino acid, non- that mapped to known interaction domains was low, however it disease associated polymorphisms caused basic amino acids to must be noted that our focus was only on mutations that mapped be substituted with nonpolar amino acids. The degree of this to invariant amino acids from human to yeast. 
This reduced the particular change between diseased mutations versus non- number of mutations available for this study. It is also likely that disease polymorphism is 2.4-fold. Notably, other changes in many protein–protein interaction domains exist that are yet to amino acid class that were more than two-fold different be described. The future mapping of known and new mutations to new protein–protein interaction domains will reveal the full Table 5 extent to which mutations may alter the interactivity of proteins. Summary of the types of amino acid mutations, and the difference between We also examined the percent sequence identity of the 10 disease and non-disease associated mutations orthologs, as identified above. A number of the proteins in this To: group, such as NF1_HUMAN and its ortholog IRA2_YEAST, did not have high sequence homology and are unlikely to be Acidic Basic Polar Nonpolar good models for examining protein–protein interactions. A However, the yeast and human orthologs of RasK and MSH2 From: Acidic 0.8 4.2 4.9 1.6 Basic 1.0 6.0 10.6 6.7 were of particularly high sequence homology, having 55% and Polar 3.8 9.8 9.0 10.2 40% sequence identity respectively. The latter proteins are of Nonpolar 0.8 4.0 8.2 18.4 particular interest for study in yeast. B From: Acidic 0.7 7.6 4.9 1.4 3.6. Many yeast orthodisease proteins have interaction Basic 2.8 3.5 13.9 2.8 Polar 4.2 6.9 7.6 4.9 partners that are known Nonpolar 1.4 0.0 9.7 27.8 C Finally, the 232 yeast disease orthologs were examined to From: Acidic 1.1 0.5 1.0 1.2 determine the number of interaction partner(s) that are known, Basic 0.4 1.7 0.8 2.4 and thus find proteins of greatest relevance for this study. From Polar 0.9 1.4 1.2 2.1 Nonpolar 0.6 * 0.8 0.7 the interactions field of the SGD database, a total of 3199 redundant interactions were found to exist for a set of 205 of (A) Summary of all disease-associated mutations, expressed as a percent of all 1770 mutations. 
Amino acids denoted as X (Table 3) have been omitted. (B) these proteins, ranging from 115 to 1 interaction per protein Summary of all non-disease associated polymorphisms, expressed as a percent (average of 15.6). 27 proteins have no protein–protein of all 144 polymorphisms. (C) Differences between the types of disease and non- interactions described to date. The number of known interaction disease associated polymorphisms. Values in part A of this table have been partners per yeast protein is shown in supplementary Table S1. ∼ divided by values in part B. A ratio of 1 implies no difference in the type of It was noted that many human disease-associated proteins did mutation. Ratios that are 2 fold or more changed are shaded and show appreciable differences between the types of disease and non-disease associated not have the same degree of known protein interactions (data polymorphisms. The asterisk * represents a mathematically infinite difference not shown). This probably reflects our lack of knowledge of between disease and non-disease associated mutations (4 divided by 0). protein–protein interactions in humans. 844 W.-T. Hsu et al. / Biochimica et Biophysica Acta 1774 (2007) 838–847
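The class-level summaries in Table 5 can be illustrated with a short sketch. The Python below is an assumption-laden illustration, not the authors' Perl scripts: it collapses the non-zero cells of Table 4 into the four amino acid classes (giving part B of Table 5) and divides the published part A percentages by them (giving part C). The names `class_percentages` and `fold_differences` are hypothetical, the Glu→Lys count of 10 is inferred from the row total of 15 and the acidic-to-basic entry of Table 5B, and ratios computed from rounded percentages may differ slightly from the published values.

```python
import math

CLASSES = ("acidic", "basic", "polar", "nonpolar")
# Class membership follows the column grouping used in Tables 3 and 4.
MEMBERS = {"acidic": "DE", "basic": "HKR", "polar": "CGNQSTY", "nonpolar": "AFILMPVW"}
CLASS_OF = {aa: cls for cls, residues in MEMBERS.items() for aa in residues}

# Non-zero cells of Table 4 as {from: {to: count}}; E->K = 10 is an inferred value.
TABLE4 = {
    "D": {"E": 1, "H": 1, "N": 3, "V": 1}, "E": {"K": 10, "G": 1, "Q": 3, "V": 1},
    "H": {"N": 2, "Q": 1, "Y": 1}, "K": {"E": 4, "R": 2},
    "R": {"H": 3, "C": 7, "G": 4, "Q": 5, "L": 3, "W": 1},
    "G": {"D": 4, "R": 4, "S": 1, "V": 2}, "N": {"D": 2, "S": 3},
    "Q": {"H": 1, "R": 1},
    "S": {"C": 1, "N": 2, "T": 1, "A": 1, "F": 1, "L": 1, "P": 1},
    "T": {"K": 1, "N": 2}, "Y": {"H": 3, "C": 1, "F": 1},
    "A": {"T": 3, "P": 1, "V": 3}, "F": {"I": 1}, "I": {"T": 5, "M": 2, "V": 6},
    "L": {"Q": 3, "F": 3, "M": 1, "P": 2, "V": 3}, "M": {"I": 2, "V": 3},
    "P": {"S": 2, "A": 1, "L": 4}, "V": {"D": 1, "E": 1, "A": 1, "I": 5, "M": 1},
    "W": {"G": 1, "L": 1},
}

# Table 5A (disease mutations, percent); tuple order matches CLASSES.
TABLE5A = {
    "acidic": (0.8, 4.2, 4.9, 1.6), "basic": (1.0, 6.0, 10.6, 6.7),
    "polar": (3.8, 9.8, 9.0, 10.2), "nonpolar": (0.8, 4.0, 8.2, 18.4),
}


def class_percentages(counts):
    """Collapse residue-level counts into a 4x4 class table of percentages."""
    totals = {src: dict.fromkeys(CLASSES, 0) for src in CLASSES}
    for src, targets in counts.items():
        for dst, n in targets.items():
            totals[CLASS_OF[src]][CLASS_OF[dst]] += n
    grand_total = sum(sum(row.values()) for row in totals.values())
    pct = {s: {d: 100.0 * totals[s][d] / grand_total for d in CLASSES} for s in CLASSES}
    return pct, grand_total


def fold_differences(disease_pct, snp_pct):
    """Divide disease percentages by SNP percentages; infinite when the SNP class is empty."""
    fold = {}
    for s in CLASSES:
        fold[s] = {}
        for d, a in zip(CLASSES, disease_pct[s]):
            b = snp_pct[s][d]
            fold[s][d] = math.inf if b == 0 else a / b
    return fold


snp_pct, n_snps = class_percentages(TABLE4)
fold = fold_differences(TABLE5A, snp_pct)
print(n_snps, round(snp_pct["basic"]["polar"], 1), round(fold["basic"]["nonpolar"], 1))
# prints: 144 13.9 2.4
```

Keeping only the non-zero cells makes the residue-level data easy to check against the row totals of Table 4, and the empty nonpolar-to-basic class is reported as infinity, matching the asterisk in Table 5C.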

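The Chi-squared comparison of non-conservative changes reported in Section 3.4 can be sketched for a 2×2 table as follows. This is a hedged illustration only: the counts are back-calculated from the quoted percentages (60.6% of 1770 disease mutations and 50% of 144 polymorphisms), the choice of 1770 as denominator and the use of Yates' continuity correction are assumptions, and the function name `chi_squared_2x2` is hypothetical. The source does not state how the test was actually run.

```python
import math


def chi_squared_2x2(a, b, c, d, yates=True):
    """Chi-squared test (1 degree of freedom) for the 2x2 table [[a, b], [c, d]].

    Returns the statistic and its p-value. With yates=True the continuity
    correction is applied, which is an assumption rather than the paper's
    stated method.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2.0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Illustrative counts reconstructed from the quoted percentages (assumed).
disease_total, snp_total = 1770, 144
disease_noncons = round(0.606 * disease_total)
snp_noncons = round(0.50 * snp_total)

chi2, p_value = chi_squared_2x2(
    disease_noncons, disease_total - disease_noncons,
    snp_noncons, snp_total - snp_noncons,
)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```

Using `math.erfc` avoids a dependency on a statistics library; for tables larger than 2×2 a general chi-squared survival function would be needed instead.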
Table 6
Disease-associated mutations in human proteins, from SWISS-PROT and OMIM, that were found to map to interaction domains in the human proteins

Each entry gives the human protein, interaction domain, reference, domain position and yeast ortholog, followed by the mutations as: mutation position, mutation type, amino acid position in the yeast ortholog.

CHK2_HUMAN, FHA [41], 113–175, RAD53_YEAST
  117 R→G 70; 167 G→R 108
MLH1_HUMAN, Interaction with EXO1 [42], 410–650, MLH1_YEAST
  622 L→H 627; 648 P→L 661; 542 Q→L 552; 549 L→P 559; 559 L→R 569; 648 P→S 661
MSH2_HUMAN, Interaction with EXO1 [42], 410–650, MSH2_YEAST
  622 P→L 640; 639 H→Y 658
MTMR2_HUMAN, SET-interaction (SID) [43], 477–529, YMR1_YEAST
  482 Q→X 525
MTMR2_HUMAN, Tyrosine–protein phosphatase [43], 386–463, YMR1_YEAST
  426 Q→X 406
NF1_HUMAN, Ras-GAP [44], 1235–1451, IRA1_YEAST
  1444 K→E 1891; 1444 K→N 1891; 1250 R→P 1724; 1276 R→P 1750; 1276 R→Q 1750; 1444 K→R 1891; 1412 R→S 1860; 1605 I→V 2059; 1444 K→E 1883; 1444 K→N 1883; 1243 L→P 1709; 1250 R→P 1716; 1276 R→P 1742; 1276 R→Q 1742; 1444 K→R 1883; 1412 R→S 1852; 1605 I→V 2051
NF1_HUMAN, CRAL-TRIO [45], 1580–1738, IRA1_YEAST
  1583 W→X 2037
PCSK9_HUMAN (a), Peptidase S8 [46], 161–431, PRTB_YEAST
  253 L→F 384
PCSK9_HUMAN (a), Peptidase S8 [46], 161–431, YSP3_YEAST
  253 L→F 272
PEX13_HUMAN, SH3 [47], 276–334, PEX13_YEAST
  326 I→T 362
PTEN_HUMAN, Phosphatase tensin-type [48], 14–185, TEP1_YEAST
  173 R→C 243; 61 H→D 81; 20 G→E 39; 129 G→E 198; 165 G→E 235; 47 R→G 67; 130 R→G 199; 173 R→H 243; 130 R→L 199; 119 V→L 188; 19 D→N 37; 170 S→N 240; 174 Y→N 244; 112 L→P 181; 173 R→P 243; 167 T→P 237; 130 R→Q 199; 124 C→R 193; 129 G→R 198; 165 G→R 235; 61 H→R 81; 123 H→R 192; 67 I→R 88; 112 L→R 181; 170 S→R 240; 124 C→S 193; 27 Y→S 46; 165 G→V 235; 57 L→W 77; 93 H→Y 162; 123 H→Y 192
RASK_HUMAN, Effector region [49], 32–40, RAS2_YEAST
  34 P→R 40
WASP_HUMAN, WH1 [50], 38–147, LAS17_YEAST
  96 W→C 76; 132 E→K 112; 83 F→L 62; 57 P→L 36; 74 V→M 53; 133 A→T 113; 55 A→V 34; 69 G→W 48

The precise mutation is shown, as is the corresponding invariant amino acid in the yeast ortholog. Mutations in human proteins that did not map to an invariant amino acid in the yeast ortholog were ignored.
(a) Protein PCSK9_HUMAN (proprotein convertase subtilisin/kexin type 9) has two orthologs in yeast. PRTB_YEAST (cerevisin) and YSP3_YEAST (subtilisin-like protease 3) have 18% and 23% sequence identity with PCSK9_HUMAN respectively.
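The position pairs listed in Table 6 come from mapping a mutated human residue, through a pairwise alignment, onto the yeast ortholog and then checking whether the site is invariant and falls inside a known interaction domain. A minimal sketch of that bookkeeping is given below, assuming gapped alignment strings with "-" for gaps and 1-based residue numbering. The sequences are toy examples, and `map_position`, `MappedSite` and `in_domain` are hypothetical names rather than part of the authors' scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MappedSite:
    human_pos: int      # 1-based position in the ungapped human sequence
    human_res: str
    yeast_pos: int      # 1-based position in the ungapped yeast sequence
    yeast_res: str

    @property
    def invariant(self) -> bool:
        """True when the aligned residues are identical."""
        return self.human_res == self.yeast_res


def map_position(aligned_human: str, aligned_yeast: str, human_pos: int) -> Optional[MappedSite]:
    """Map a human residue number onto the yeast ortholog via a pairwise alignment.

    Returns None when the human residue is aligned to a gap in yeast.
    """
    if len(aligned_human) != len(aligned_yeast):
        raise ValueError("aligned sequences must have equal length")
    h = y = 0
    for h_res, y_res in zip(aligned_human, aligned_yeast):
        if h_res != "-":
            h += 1
        if y_res != "-":
            y += 1
        if h_res != "-" and h == human_pos:
            if y_res == "-":
                return None
            return MappedSite(h, h_res, y, y_res)
    raise ValueError("human_pos lies beyond the end of the human sequence")


def in_domain(pos: int, start: int, end: int) -> bool:
    """Check whether a residue number lies within an inclusive domain range."""
    return start <= pos <= end


# Toy alignment (hypothetical sequences, for illustration only).
human = "MK-TAYIAK"
yeast = "MKQTA-IGK"
print(map_position(human, yeast, 3))   # invariant Thr: human 3 -> yeast 4
print(map_position(human, yeast, 5))   # aligned to a gap -> None
print(in_domain(117, 113, 175))        # CHK2 residue 117 within the FHA domain range
```

Only sites for which `invariant` is true and `in_domain` holds for the human domain range would be retained under the selection criteria described for Table 6.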

We also examined the yeast disease orthologs to determine which proteins, if any, are known to be ‘hub’ proteins. These are known to have a large number of interactions that have been observed by more than one technique or by more than one lab using the same technique [21]. Approximately 65% of hub proteins are essential in the yeast cell, highlighting their importance in the interaction network. Interestingly, from the set of 379 proteins, 10 hub proteins were found (see supplementary Table S1). These included the TATA-box binding protein, guanine nucleotide-binding protein subunit beta and cell division control protein 10. Hub proteins are likely to be particularly susceptible to the loss of protein interactions due to the effects of mutations and are therefore of great interest for further study.

3.7. Summary and conclusions

In this study, we have comprehensively investigated yeast disease orthologs to determine if they can help understand the impact of human mutations on protein–protein interactions. We have shown that many yeast orthodisease proteins have high sequence identity with their human counterparts and that a large number of mutations occur on invariant amino acids. Some mutations map to known protein interaction domains, and others are likely to map to interaction domains that are yet to be described. The engineering of human disease-associated mutations into yeast orthodisease proteins thereby presents a means of understanding the likely impact of these mutations on protein–protein interactions.

This study has also produced a new resource where researchers can find protein orthologs associated with a human disease, see the degree of identity they have with the human disease-associated protein, the number of single amino acid changes that are documented (disease and non-disease associated) and how these changes map between human and yeast proteins. The resource also notes the number of yeast interaction partners that are known and which orthodisease proteins are available as TAP-tag fusion proteins to facilitate their analysis. Together, this resource can open many avenues for the experimental investigation of disease-causing mutations in an easily manipulated model system.

Acknowledgement

Chi Nam Ignatius Pang is the recipient of an Australian Postgraduate Award.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.bbapap.2007.04.014.

References

[1] T. Yasugi, J.D. Benson, H. Sakai, M. Vidal, P.M. Howley, Mapping and characterization of the interaction domains of human papillomavirus type 16 E1 and E2 proteins, J. Virol. 71 (1997) 891–899.
[2] A.J. Walhout, R. Sordella, X. Lu, J.L. Hartley, G.F. Temple, M.A. Brasch, N. Thierry-Mieg, M. Vidal, Protein interaction mapping in C. elegans using proteins involved in vulval development, Science 287 (2000) 116–122.
[3] J.C. Rain, L. Selig, H. De Reuse, V. Battaglia, C. Reverdy, S. Simon, G. Lenzen, F. Petel, J. Wojcik, V. Schachter, Y. Chemama, A. Labigne, P. Legrain, The protein–protein interaction map of Helicobacter pylori, Nature 409 (2001) 211–215.
[4] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, Y. Sakaki, A comprehensive two-hybrid analysis to explore the yeast protein interactome, Proc. Natl. Acad. Sci. U. S. A. 98 (2001) 4569–4574.
[5] P. Uetz, L. Giot, G. Cagney, T.A. Mansfield, R.S. Judson, J.R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, J.M. Rothberg, A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae, Nature 403 (2000) 623–627.
[6] S. Li, C.M. Armstrong, N. Bertin, H. Ge, S. Milstein, M. Boxem, P.O. Vidalain, J.D. Han, A. Chesneau, T. Hao, D.S. Goldberg, N. Li, M. Martinez, J.F. Rual, P. Lamesch, L. Xu, M. Tewari, S.L. Wong, L.V. Zhang, G.F. Berriz, L. Jacotot, P. Vaglio, J. Reboul, T. Hirozane-Kishikawa, Q. Li, H.W. Gabel, A. Elewa, B. Baumgartner, D.J. Rose, H. Yu, S. Bosak, R. Sequerra, A. Fraser, S.E. Mango, W.M. Saxton, S. Strome, S. Van Den Heuvel, F. Piano, J. Vandenhaute, C. Sardet, M. Gerstein, L. Doucette-Stamm, K.C. Gunsalus, J.W. Harper, M.E. Cusick, F.P. Roth, D.E. Hill, M. Vidal, A map of the interactome network of the metazoan C. elegans, Science 303 (2004) 540–543.
[7] L. Giot, J.S. Bader, C. Brouwer, A. Chaudhuri, B. Kuang, Y. Li, Y.L. Hao, C.E. Ooi, B. Godwin, E. Vitols, G. Vijayadamodar, P. Pochart, H. Machineni, M. Welsh, Y. Kong, B. Zerhusen, R. Malcolm, Z. Varrone, A. Collis, M. Minto, S. Burgess, L. McDaniel, E. Stimpson, F. Spriggs, J. Williams, K. Neurath, N. Ioime, M. Agee, E. Voss, K. Furtak, R. Renzulli, N. Aanensen, S. Carrolla, E. Bickelhaupt, Y. Lazovatsky, A. DaSilva, J. Zhong, C.A. Stanyon, R.L. Finley, K.P. White, M. Braverman, T. Jarvie, S. Gold, M. Leach, J. Knight, R.A. Shimkets, M.P. McKenna, J. Chant, J.M. Rothberg, A protein interaction map of Drosophila melanogaster, Science 302 (2003) 1727–1736.
[8] U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. Brembeck, H. Goehler, M. Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, A human protein–protein interaction network: a resource for annotating the proteome, Cell 122 (2005) 957–968.
[9] J.F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G.F. Berriz, F.D. Gibbons, M. Drezel, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon, M. Boxem, S. Milstein, J. Rosenberg, D.S. Goldberg, L.V. Zhang, S.L. Wong, G. Franklin, S. Li, J.S. Albala, J. Lim, C. Fraughton, E. Llamosas, S. Cevik, C. Bex, P. Lamesch, R.S. Sikorski, J. Vandenhaute, H.Y. Zoghbi, A. Smolyar, S. Bosak, R. Sequerra, L. Doucette-Stamm, M.E. Cusick, D.E. Hill, F.P. Roth, M. Vidal, Towards a proteome-scale map of the human protein–protein interaction network, Nature 437 (2005) 1173–1178.
[10] A.C. Gavin, M. Bösche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J.M. Rick, A.M. Michon, C.-M. Cruciat, M. Remor, C. Höfert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M.-A. Heurtier, R.R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, G. Superti-Furga, Functional organization of the yeast proteome by systematic analysis of protein complexes, Nature 415 (2002) 141–147.
[11] Y. Ho, A. Gruhler, A. Heilbut, G.D. Bader, L. Moore, S.-L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A.R. Willems, H. Sassi, P.A. Nielsen, K.J. Rasmussen, J.R. Andersen, L.E. Johansen, L.H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B.D. Sørensen, J. Matthiesen, R.C. Hendrickson, F. Gleeson, T. Pawson, M.F. Moran, D. Durocher, M. Mann, C.W.V. Hogue, D. Figeys, M. Tyers, Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry, Nature 415 (2002) 180–183.
[12] A.C. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L.J. Jensen, S. Bastuck, B. Dümpelfeld, A. Edelmann, M.-A. Heurtier, V. Hoffman, C. Hoefert, K. Klein, M. Hudak, A.-M. Michon, M. Schelder, M. Schirle, M. Remor, T. Rudi, S. Hooper, A. Bauer, T. Bouwmeester, G. Casari, G. Drewes, G. Neubauer, J.M. Rick, B. Kuster, P. Bork, R.B. Russell, G. Superti-Furga, Proteome survey reveals modularity of the yeast cell machinery, Nature 440 (2006) 631–636.
[13] N.J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A.P. Tikuisis, T. Punna, J.M. Peregrín-Alvarez, M. Shales, X. Zhang, M. Davey, M.D. Robinson, A. Paccanaro, J.E. Bray, A. Sheung, B. Beattie, D.P. Richards, V. Canadien, A. Lalev, F. Mena, P. Wong, A. Starostine, M.M. Canete, J. Vlasblom, S. Wu, C. Orsi, S.R. Collins, S. Chandran, R. Haw, J.J. Rilstone, K. Gandi, N.J. Thompson, G. Musso, P. St Onge, S. Ghanny, M.H.Y. Lam, G. Butland, A.M. Altaf-Ul, S. Kanaya, A. Shilatifard, E. O'Shea, J.S. Weissman, C.J. Ingles, T.R. Hughes, J. Parkinson, M. Gerstein, S.J. Wodak, A. Emili, J.F. Greenblatt, Global landscape of protein complexes in the yeast Saccharomyces cerevisiae, Nature 440 (2006) 637–643.
[14] A.C. Gavin, Protein–protein interactions, in: M.R. Wilkins, R.D. Appel, K.L. Williams, D.F. Hochstrasser (Eds.), Proteome Research: strategies, techniques and practice. Springer, Berlin (in press).
[15] A. Reymond, R. Brent, p16 proteins from melanoma-prone families are deficient in binding to Cdk4, Oncogene 11 (1995) 1173–1178.
[16] X. Sun, L. Zheng, B. Shen, Functional alterations of human exonuclease 1 mutants identified in atypical hereditary nonpolyposis colorectal cancer syndrome, Cancer Res. 62 (2002) 6026–6030.
[17] W. Wiszniewski, M.C. Fondaneche, P. Louise-Plence, A. Prochnicka-Chalufour, F. Selz, C. Picard, F. Le Deist, J.F. Eliaou, A. Fischer, B. Lisowska-Grospierre, Novel mutations in the RFXANK gene: RFX complex containing in-vitro-generated RFXANK mutant binds the promoter without transactivating MHC II, Immunogenetics 54 (2003) 747–755.
[18] J. Anderson, H.D. Burns, P.E. Harris, A.O.M. Wilkie, J.K. Heath, Apert syndrome mutations in fibroblast growth factor receptor 2 exhibit increased affinity for FGF ligand, Hum. Mol. Genet. 7 (1998) 1475–1483.
[19] G. Buchanan, M. Yang, J.M. Harris, H.S. Nahm, G. Han, N. Moore, J.M. Bentel, R.J. Matusik, D.J. Horsfall, V.R. Marshall, N.M. Greenberg, W.D. Tilley, Mutations at the boundary of the hinge and ligand binding domain of the androgen receptor confer increased transactivation function, Mol. Endocrinol. 15 (2001) 46–56.
[20] J.C. Shen, Y. Lao, A. Kamath-Loeb, M.S. Wold, L.A. Loeb, The N-terminal domain of the large subunit of human replication protein A binds to Werner syndrome protein and stimulates helicase activity, Mech. Ageing Dev. 124 (2003) 921–930.
[21] J.D. Han, N. Bertin, T. Hao, D.S. Goldberg, G.F. Berriz, L.V. Zhang, D. Dupuy, A.J. Walhout, M.E. Cusick, F.P. Roth, M. Vidal, Evidence for dynamically organized modularity in the yeast protein–protein interaction network, Nature 430 (2004) 88–93.
[22] A. Dziembowski, B. Seraphin, Recent developments in the analysis of protein complexes, FEBS Lett. 556 (2004) 1–6.
[23] K.P. O'Brien, I. Westerlund, E.L. Sonnhammer, OrthoDisease: a database of human disease orthologs, Hum. Mutat. 24 (2004) 112–119.
[24] F. Foury, Human genetic diseases: a cross-talk between man and yeast, Gene 195 (1997) 1–10.
[25] M.A. Andrade, C. Sander, A. Valencia, Updated catalogue of homologues to human disease-related proteins in the yeast genome, FEBS Lett. 426 (1998) 7–16.
[26] G.M. Rubin, M.D. Yandell, J.R. Wortman, G.L.G. Miklos, C.R. Nelson, I.K. Hariharan, M.E. Fortini, P.W. Li, R. Apweiler, W. Fleischmann, J.M. Cherry, S. Henikoff, M.P. Skupski, S. Misra, M. Ashburner, E. Birney, M.S. Boguski, T. Brody, P. Brokstein, S.E. Celniker, S.A. Chervitz, D. Coates, A. Cravchik, A. Gabrielian, R.F. Galle, W.M. Gelbart, R.A. George, L.S.B. Goldstein, F. Gong, P. Guan, N.L. Harris, B.A. Hay, R.A. Hoskins, J. Li, Z. Li, R.O. Hynes, S.J.M. Jones, P.M. Kuehl, B. Lemaitre, J.T. Littleton, D.K. Morrison, C. Mungall, P.H. O'Farrell, O.K. Pickeral, C. Shue, L.B. Vosshall, J. Zhang, Q. Zhao, X.H. Zheng, F. Zhong, W. Zhong, R. Gibbs, J.C. Venter, M.D. Adams, S. Lewis, Comparative genomics of the eukaryotes, Science 287 (2000) 2204–2215.
[27] W.H. Mager, J. Winderickx, Yeast as a model for medical and medicinal research, Trends Pharmacol. Sci. 26 (2005) 265–273.
[28] N. Fisher, C.K. Castleden, I. Bourges, G. Brasseur, G. Dujardin, B. Meunier, Human disease-related mutations in cytochrome b studied in yeast, J. Biol. Chem. 279 (2004) 12951–12958.
[29] T.F. Outeiro, S. Lindquist, Yeast cells provide insight into alpha-synuclein biology and pathobiology, Science 302 (2003) 1772–1775.
[30] R.J. Wanders, H.R. Waterham, Peroxisomal disorders I: biochemistry and genetics of peroxisome biogenesis disorders, Clin. Genet. 67 (2004) 107–133.
[31] S.S. Dwight, R. Balakrishnan, K.R. Christie, M.C. Costanzo, K. Dolinski, S.R. Engel, B. Feierbach, D.G. Fisk, J. Hirschman, E.L. Hong, L. Issel-Tarver, R.S. Nash, A. Sethuraman, B. Starr, C.L. Theesfeld, R. Andrada, G. Binkley, Q. Dong, C. Lane, M. Schroeder, S. Weng, D. Botstein, J.M. Cherry, Saccharomyces genome database: underlying principles and organisation, Brief. Bioinform. 5 (2004) 9–22.
[32] R.C. Edgar, S. Batzoglou, Multiple sequence alignment, Curr. Opin. Struct. Biol. 16 (2006) 368–373.
[33] C.B. Do, M.S. Mahabhashyam, M. Brudno, S. Batzoglou, ProbCons: probabilistic consistency-based multiple alignment of amino acid sequences, Genome Res. 15 (2005) 330–340.
[34] S. Henikoff, J.G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci. U. S. A. 89 (1992) 10915–10919.
[35] S.L. Alper, Genetic diseases of acid–base transporters, Annu. Rev. Physiol. 64 (2002) 899–923.
[36] F.E. Karet, K.E. Finberg, R.D. Nelson, A. Nayir, H. Mocan, S.A. Sanjad, J. Rodriguez-Soriano, F. Santos, C.W. Cremers, A. Di Pietro, B.I. Hoffbrand, J. Winiarski, A. Bakkaloglu, S. Ozen, R. Dusunsel, P. Goodyer, S.A. Hulton, D.K. Wu, A.B. Skvorak, C.C. Morton, M.J. Cunningham, V. Jha, R.P. Lifton, Mutations in the gene encoding B1 subunit of H+-ATPase cause renal tubular acidosis with sensorineural deafness, Nat. Genet. 21 (1999) 84–90.
[37] A.N. Smith, J. Skaug, K.A. Choate, A. Nayir, A. Bakkaloglu, S. Ozen, S.A. Hulton, S.A. Sanjad, E.A. Al-Sabban, R.P. Lifton, S.W. Scherer, F.E. Karet, Mutations in ATP6N1B, encoding a new kidney vacuolar proton pump 116-kD subunit, cause recessive distal renal tubular acidosis with preserved hearing, Nat. Genet. 26 (2000) 71–75.
[38] W.K. Beyenbach, H. Wieczorek, The V type H+ ATPase molecular structure and function, physiological roles and regulation, J. Exp. Biol. 209 (2006) 577–589.
[39] N. Ochotny, A. Van Vliet, N. Chan, Y. Yao, M. Morel, N. Kartner, H.P. von Schroeder, J.N. Heersche, M.F. Manolson, Effects of human a3 and a4 mutations that result in osteopetrosis and distal renal tubular acidosis on yeast V-ATPase expression and activity, J. Biol. Chem. 281 (2006) 26102–26111.
[40] M. Bueno, L.A. Campos, J. Estrada, J. Sancho, Energetics of aliphatic deletions in protein cores, Protein Sci. 15 (2006) 1858–1872.
[41] J. Li, B.L. Williams, L.F. Haire, M. Goldberg, E. Wilker, D. Durocher, M.B. Yaffe, S.P. Jackson, S.J. Smerdon, Structural and functional versatility of the FHA domain in DNA-damage signaling by the tumor suppressor kinase Chk2, Mol. Cell 9 (2002) 1045–1054.
[42] X. Sun, L. Zheng, B. Shen, Functional alterations of human exonuclease 1 mutants identified in atypical hereditary nonpolyposis colorectal cancer syndrome, Cancer Res. 62 (2002) 6026–6030.
[43] F.L. Robinson, J.E. Dixon, The phosphoinositide-3-phosphatase MTMR2 associates with MTMR13, a membrane-associated pseudophosphatase also mutated in type 4B Charcot–Marie–Tooth disease, J. Biol. Chem. 280 (2005) 31699–31707.
[44] H.R. Bourne, D.A. Sanders, F. McCormick, The GTPase superfamily: conserved structure and molecular mechanism, Nature 349 (1991) 117–127.
[45] S. Zimmer, A. Stocker, M.N. Sarbolouki, S.E. Spycher, J. Sassoon, A. Azzi, A novel human tocopherol-associated protein: cloning, in vitro expression, and characterization, J. Biol. Chem. 275 (2000) 25672–25680.
[46] S. Naureckiene, L. Ma, K. Sreekumar, U. Purandare, C.F. Lo, Y. Huang, L.W. Chiang, J.M. Grenier, B.A. Ozenberger, J.S. Jacobsen, J.D. Kennedy, P.S. DiStefano, A. Wood, B. Bingham, Functional characterization of Narc 1, a novel proteinase related to proteinase K, Arch. Biochem. Biophys. 420 (2003) 55–67.
[47] M. Fransen, T. Wylin, C. Brees, G.P. Mannaerts, P.P. Van Veldhoven, Human pex19p binds peroxisomal integral membrane proteins at regions distinct from their sorting sequences, Mol. Cell. Biol. 21 (2001) 4413–4424.
[48] M.M. Georgescu, K.H. Kirsch, T. Akagi, T. Shishido, H. Hanafusa, The tumor-suppressor activity of PTEN is regulated by its carboxyl-terminal region, Proc. Natl. Acad. Sci. U. S. A. 96 (1999) 10182–10187.
[49] M.D. Vos, C.A. Ellis, C. Elam, A.S. Ulku, B.J. Taylor, G.J. Clark, RASSF2 is a novel K-Ras-specific effector and potential tumor suppressor, J. Biol. Chem. 278 (2003) 28045–28051.
[50] K.E. Prehoda, D.J. Lee, W.A. Lim, Structure of the enabled/VASP homology 1 domain–peptide complex: a key component in the spatial control of actin assembly, Cell 97 (1999) 471–480.

578–588 Nucleic Acids Research, 2008, Vol. 36, No. 2
Published online 1 December 2007
doi:10.1093/nar/gkm1070

Identifying foldable regions in protein sequence from the hydrophobic signal

Chi N.I. Pang1, Kuang Lin2, Merridee A. Wouters1,3, Jaap Heringa4 and Richard A. George1,*

1Structural & Computational Biology Program, Victor Chang Cardiac Research Institute, Sydney, Australia, 2Biomathematics and Statistics Scotland, JCMB, The King’s Buildings, Edinburgh, EH9 3JZ, Scotland, UK, 3Schools of Biotechnology & Biomolecular Sciences and Medical Sciences, University of New South Wales, Sydney, Australia and 4Centre for Integrative Bioinformatics, Faculty of Sciences and Faculty of Earth & Life Sciences, Vrije Universiteit, De Boelelaan 1081a, 1081HV Amsterdam, The Netherlands

Received August 21, 2007; Revised October 25, 2007; Accepted November 13, 2007

ABSTRACT

Structural genomics initiatives aim to elucidate representative 3D structures for the majority of protein families over the next decade, but many obstacles must be overcome. The correct design of constructs is extremely important since many proteins will be too large or contain unstructured regions and will not be amenable to crystallization. It is therefore essential to identify regions in protein sequences that are likely to be suitable for structural study. Scooby-Domain is a fast and simple method to identify globular domains in protein sequences. Domains are compact units of protein structure and their correct delineation will aid structural elucidation through a divide-and-conquer approach. Scooby-Domain predictions are based on the observed lengths and hydrophobicities of domains from proteins with known tertiary structure. The prediction method employs an A*-search to identify sequence regions that form a globular structure and those that are unstructured. On a test set of 173 proteins with consensus CATH and SCOP domain definitions, Scooby-Domain has a sensitivity of 50% and an accuracy of 29%, which is better than current state-of-the-art methods. The method does not rely on homology searches and, therefore, can identify previously unknown domains.

INTRODUCTION

Completion of 200 genome-sequencing projects has led to an astronomical growth in sequence data, leaving the massive task of structural and functional annotation to be addressed. The vast and growing gap between protein sequence and structural data has motivated structural genomics initiatives, which aim to elucidate representative 3D structures for the majority of protein families. A major bottleneck in structural studies is the correct design of constructs: many proteins are either too large or contain unstructured regions, and are thus unsuitable for structural solution (1). It is therefore essential to identify regions in protein sequences likely to be amenable to structural elucidation (2).

The component of globularity in proteins is the domain: a compact, semi-independent, structural unit (3). Wetlaufer (4) first proposed the concept: defining domains as stable units of protein structure that can fold autonomously. Nature often brings several domains together to form multidomain and multifunctional proteins with the possibility of a vast number of combinations. Because domains mostly fold independently, large proteins that may not be amenable to structural solution may yield to a divide-and-conquer approach.

Methods for domain prediction can be divided into three groups: homology searches, analysis of sequence features and de novo structure prediction. Domain assignment methods that are based on homology searches include Domaination (5) and PASS (6). Other such methods include those used to generate domain databases such as Pfam (7–10). Both Domaination and PASS identify domains using the positions of the N- and C-termini of aligned homologous sequences. While effective at identifying distant family members, homology-based methods will not identify the exact structural limits of a domain, which is essential for structural elucidation (11) and will fail to identify domains that have not been rearranged during evolution (5).

Many methods have been developed to delineate domains using sequence features. The amino acids that make up inter-domain linking peptides are distinct from those in domain or loop regions (12). This signal has been

*To whom correspondence should be addressed. Tel: +61 (0)2 9295 8508; Fax: +61 (0)2 9295 8501; Email: [email protected]

© 2007 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

utilized by several groups. Armadillo (13) and Domcut (14) predict linkers using a table of likelihood scores for each amino acid to be within a linker region. Bae et al. (15) uses a hidden Markov model and linker index with combined Gibbs sampling and Markov Chain Monte Carlo to estimate the parameters and posterior probabilities. Miyazaki et al. (16), Dong et al. (17) and Sikder and Zomaya (18) utilize position-specific scoring matrices (PSSM) generated from PSI-BLAST to predict domain boundaries. Miyazaki et al. (16) employs a neural network to identify signals between linkers and domains while Dong et al. (17) applies a linker propensity index and Sikder and Zomaya (18) utilizes a support vector machine and linker predictions made by Domcut.

Other methods apply predicted secondary structure and multiple sequence alignment to predict domain location (19–22). DomSSEA (20) applies a simple threading protocol while CHOPnet (19) utilizes a neural network using amino acid flexibility, secondary structure, solvent accessibility and amino acid composition.

Many of these sequence-feature-based methods are no better than a simple guess based on the predicted number of domains, as applied in Domain Guess by Size (DGS) (23). Furthermore, few methods tackle the prediction of discontinuous domains. Discontinuous domain prediction is an important problem because 45% of multidomain proteins have one or more domains wound from non-contiguous sequence in the polypeptide chain (24).

Finally, SnapDragon (25) and Ginzu-RosettaDOM (26) are programs that utilize de novo protein structure prediction to delineate domains. Although these methods showed some success, exact domain boundary placement is often limited to proteins with two or three domains and the time required for prediction is too long for genome-scale assignment.

The Scooby-Domain (SequenCe hydrOphOBicitY predicts DOMAINs) web application was recently introduced to visually identify foldable regions in a protein sequence (27). Here we present benchmark performance of a new algorithm to automatically predict domain boundaries. Scooby-Domain uses the distribution of observed lengths and hydrophobicities in domains with known 3D structure to predict novel domains and their boundaries in a protein sequence. It utilizes a multilevel smoothing window to determine the percentage of hydrophobic amino acids within a putative domain-sized region in a sequence. Using the observed distribution of domain lengths and percentage hydrophobicities, the probability that the region can fold into a domain or be unfolded is then calculated. A novel algorithm is then applied to calculate the most likely domain architecture of the protein.

Scooby-Domain was benchmarked on proteins with known 3D structure and defined domain architecture. Precise domain definition, even with a known structure, is a difficult problem and several databases with alternative definitions exist. To fairly test our methodology we have used two databases, CATH (28) and SCOP (29), as well as a consensus definition. The benchmark sets contain proteins with a range of domain number and domain–domain connectivity and is a challenging test for domain prediction algorithms.

MATERIALS AND METHODS

Domain size and percentage hydrophobicity

The distribution of domain size and hydrophobicity was calculated using the S35 domain representatives from the CATH domain database version 3.0.0 (28). No domain in this set has >35% sequence identity with any other domain. Only the first three classes of the CATH classification were used since class-four proteins have few secondary structures and are unlikely to be comprised of globular domains. Full-sequence data was taken from the corresponding CATH COMBS file. Unlike the ATOM sequences, which may have missing residues, the COMBS sequences attempt to provide the full sequence, by filling in any missing residues in the PDB atom fields with those in the PDB SEQRES fields. The length of the domain sequences was restricted to between 34 and 251 residues. Domains outside this range are unlikely to have a single hydrophobic core (30). For each domain, percentage hydrophobicity was calculated using a simple binary hydrophobicity scale, where 11 amino acid types are considered as hydrophobic: Ala, Cys, Phe, Gly, Ile, Leu, Met, Pro, Val, Trp and Tyr (31). Other scales were trialled but were found to produce poorer results in benchmarking.

A 3D histogram of the distribution of domain sequence lengths versus their corresponding hydrophobicities was created using a square averaging window. The window sums the number of domain sequences that it encapsulates within the distribution. The resulting value is then placed at the central position of the window. The window moves along the distribution one unit at a time, covering the entire dataset. The square window has a size of 19 residues by 1 unit percentage hydrophobicity, which means that it captures all domain sequences within a length of 19 residues and with an average hydrophobicity resolution of 1%. Each position in the final 3D histogram was then scaled to a value between 0 and 1, where 1 is the highest point in the distribution corresponding to the most frequent observation (Figure 1a). These values were used as a reference to judge whether a sequence fragment can form a domain based on its length and average hydrophobicity.

Generating the domain probability matrix

Scooby-Domain uses a multilevel smoothing window to predict the location of domains in a novel sequence (Figure 1b). The window size, representing the length of a putative domain, is incremented starting from the smallest domain size observed in the database to the largest domain size. The window size must be an odd number and the size is incremented by two each time. Each smoothing window calculates the fraction of hydrophobic residues it encapsulates along a sequence, and places the value at its central position. This leads to a 2D matrix, where the value at cell (i, j) is the average hydrophobicity encapsulated by a window of size j that is centred at residue position i. The matrix has a triangular shape with the apex corresponding to a window size equal to the length of the sequence.
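The multilevel smoothing window can be illustrated with a minimal Python sketch. It is not the published implementation: the function name `hydrophobicity_matrix`, the dictionary-of-rows layout and the default smallest window of 35 (the first odd size at or above the 34-residue lower limit) are assumptions made for illustration. Only the 11-residue binary scale, the odd window sizes and the step of two come from the text.

```python
# Binary hydrophobicity scale from the text: Ala, Cys, Phe, Gly, Ile, Leu,
# Met, Pro, Val, Trp and Tyr count as hydrophobic (one-letter codes).
HYDROPHOBIC = frozenset("ACFGILMPVWY")

def hydrophobicity_matrix(seq, min_window=35, max_window=251):
    """Return {window_size j: row}, where row[i] is the fraction of
    hydrophobic residues in the window of size j centred at position i
    (0-based), or None where the window does not fit in the sequence.

    The default window limits are illustrative assumptions.
    """
    if min_window % 2 == 0:
        raise ValueError("window sizes must be odd")
    n = len(seq)
    # Prefix sums give each window count in constant time.
    prefix = [0]
    for aa in seq.upper():
        prefix.append(prefix[-1] + (aa in HYDROPHOBIC))
    matrix = {}
    for j in range(min_window, min(max_window, n) + 1, 2):
        half = j // 2
        row = [None] * n
        for i in range(half, n - half):
            row[i] = (prefix[i + half + 1] - prefix[i - half]) / j
        matrix[j] = row
    return matrix
```

Because larger windows fit at fewer centre positions, the filled cells narrow towards the largest window, which gives the triangular shape described above.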

[Figure 1 image: panel (a) plots hydrophobicity (%) against sequence length; panels (b) and (c) plot window size j against sequence position i; colour scale from 0.0 to 1.0.]

Figure 1. (a) Domain probability matrix. CATH domains as a function of their sequence length and percentage hydrophobicity. The red areas represent regions that have a high frequency of domain occurrence, while the blue areas represent regions that have a low frequency of domain occurrence. (b) Multilevel smoothing window. Smoothing windows of increasing length are used to calculate the average hydrophobicity along the sequence. The horizontal axis corresponds to the sequence position, i, and the vertical axis represents the window length, j. Hydrophobicity values are plotted at the position representing the sequence position of the centre point of the smoothing window and the window length (i, j). (c) Domain prediction. For each position in the multilevel smoothing (b) the length of the smoothing window and calculated average hydrophobicity is converted to a probability that it will fold into a domain, based on the lengths and hydrophobicities observed in the distribution of CATH domains (a).
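The reference distribution in panel (a) and the lookup in panel (c) can be sketched as follows. This is an illustrative reading of the method rather than the authors' code: the function names, the centred placement of the 19-residue window and the rounding of hydrophobicity to whole-percent bins are assumptions. The values returned are scaled frequencies, which the paper uses as probabilities.

```python
from collections import defaultdict

def build_probability_table(domains, length_window=19):
    """Smooth (length, % hydrophobicity) counts and scale them to [0, 1].

    domains: iterable of (sequence_length, percent_hydrophobic) pairs.
    Returns {(length, percent_bin): scaled frequency}, where 1.0 marks the
    most frequently observed combination.
    """
    half = length_window // 2
    smoothed = defaultdict(int)
    for length, pct in domains:
        pct_bin = int(pct + 0.5)  # 1% hydrophobicity bins (assumed rounding)
        # Each domain is counted by every window centre within +/- half residues.
        for centre in range(length - half, length + half + 1):
            smoothed[(centre, pct_bin)] += 1
    if not smoothed:
        return {}
    peak = max(smoothed.values())
    return {cell: count / peak for cell, count in smoothed.items()}

def domain_probability(table, window_size, fraction_hydrophobic):
    """Score a window from its size and hydrophobic fraction (0 to 1)."""
    pct_bin = int(100 * fraction_hydrophobic + 0.5)
    return table.get((window_size, pct_bin), 0.0)
```

Combinations of length and hydrophobicity never observed in the reference set score 0.0 in this sketch, so such windows would never be selected as domains.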

Matrix values are converted to probability scores by referring to the observed distribution of domain sizes and hydrophobicities described earlier, i.e. given an average hydrophobicity and window length, the probability that it can fold into a domain is found directly from the observed data. Visualization of Scooby-Domain plots can be used to effectively identify regions that are likely to fold into domains, as well as unstructured regions (27).

Automatic domain boundary assignment

Scooby-Domain employs an A*-search algorithm to search through a large number of alternative domain annotations. The top ten highest probabilities in the Scooby-Domain plot are identified, each one becoming the first predicted domain in a set of alternative predictions (Figure 2). To encourage alternative predictions that are distinct, a new start site must not be within a diamond-shaped region, of width 17 residues, surrounding an old start site.

The corresponding sequence stretch for the first predicted domain is removed from the sequence (Figure 2a). Therefore, the first predicted domain will always have a continuous sequence and further domain predictions can encompass discontinuous domains. If the excised domain occupies an interior position in the sequence, the resulting N- and C-terminal fragments are rejoined and a new probability matrix is recalculated (Figure 2b).

Upon rejoining the sequence fragments, once a domain has been removed, it is important that the probabilities on either side of a join are down-weighted to avoid small fragments being involved in subsequent domain delineations. To enforce this, a minimum discontinuous segment size of 15 residues is applied (Figure 2c).

The search process is repeated until there are <34 residues remaining (the size of the smallest domain), or until there are no probabilities greater than 0.33 (an arbitrary cutoff to prevent non-domain-like regions from being predicted as a domain).

Figure 2. Protocol for domain assignment. (a) The highest scoring window (first predicted domain) is identified in the probability matrix and the sequence region it encapsulates (dark grey triangle) is removed from the sequence. (b) The resulting sequence fragments are rejoined and the probability matrix recalculated. (c) The smoothing windows that encapsulate the last 15 residues of the N-terminal fragment and the first 15 residues of the C-terminal fragment have their probabilities set to zero (white bands). If the next highest scoring region is found in the light grey region, then the excised domain will be discontinuous, otherwise it will be continuous.

The A*-search algorithm considers combinations of different domain sizes, using a heuristic function to guide the search (Figure 3). Instead of just considering the domain prediction with the highest score for each step of the algorithm, A*-search memorizes a list of up to ten possible domain predictions, and each of these are represented as a node in the tree-like search space. Each possible domain solution will be a path or branch in the search tree. The heuristic score of new predictions is compared with the heuristic scores from domains predicted in the previous step. Consideration of these alternative paths to other possible solutions would avoid the search being trapped in a local maximum. Since A*-search is a generic algorithm, its description can be found in other texts that cover artificial intelligence, for example, the original paper by Hart et al. (32). The implementation details specific to domain prediction are discussed below.

The heuristic score, h, is defined by the following equation:

$$h = \frac{l(L - l)}{L^{2}} + \frac{\sum P}{b + 1} \qquad (1)$$

where ΣP is the sum of probabilities for each domain predicted so far; b represents the number of boundaries assigned; L is the length of the original sequence and l is the remaining length of the sequence in which no domain has yet been predicted.

To prevent large numbers of connections between domains, a penalty is applied when a discontinuous domain segment is assigned: b + 1 equals the number of protein domains if all domains are continuous, otherwise this value will be larger and effectively lowers the score. Other heuristic measures were trialled, but the one described here had the best benchmark results. The heuristic increases the likelihood of a boundary being close to the middle of the sequence, but this had no detrimental effect on discontinuous domain predictions, where boundaries are often not in the middle of the sequence.

In an optimal A*-search, an admissible heuristic function would be used, which means the heuristic must never overestimate the actual cost of reaching the optimal solution; otherwise the optimal solution is not guaranteed. Since the equation used is not proven to be admissible, there is no guarantee that an optimal solution will always be reached (33).

Figure 3. Different stages in the A*-search algorithm. (a) The top-most triangle represents the Scooby-Domain domain-probability matrix for a protein sequence. The search for protein domains in a query sequence is like travelling through a maze, the centre of the maze being the best domain prediction. In this figure, each triangle is like a different path through the maze, and each level below the first triangle represents one more domain region being predicted. Each ‘hotspot’ in the triangular matrix is used to locate the exact region of the sequence with highest probability of a globular domain being formed. Three highest scoring hotspots in the first matrix are identified and highlighted with a dot in the figure, with scores of 7, 3 and 6, respectively. This leads to the addition of three new paths, with each one being the recalculated matrix for the remaining sequence, after the first domain region was predicted and removed from the original sequence. (b) Each triangle also represents a node in the search tree, where each node could branch to a different path that may lead to the solution. The highest scoring triangle (7) is searched for new hotspots, which have scores of 5, 2 and 2. (c) Regardless of level, the node with the next highest score would be searched upon, until no further domain regions can be predicted. In this example, it is the node with a score of 6. This allows the algorithm to consider different parallel paths through the ‘maze’, covering a larger area, and avoiding the search being confined to a ‘dead end’ path. (d) The next node to search following the highest scoring predictions has a score of 5.

Integration of multiple sequence alignment

The performance of Scooby-Domain was assessed with the inclusion of homology information. Homologues of the query sequence were detected using PSI-BLAST (34) searches of the SWISS-PROT database (35) and multiple sequence alignments (MSA) were generated using PRALINE (36). Only those sequences with <90% sequence identity and >70% coverage of the query sequence were kept for alignment to the query sequence. All sequences in the alignment were trimmed such that they matched the start and end points of the query sequence.

A domain-probability matrix was constructed for each sequence in the MSA and the scoring matrices from each of the multiply-aligned sequences were summed and cumulated into a master array for tallying scores. Each value in the master matrix is divided by the number of sequences in the MSA. The positions in the master matrix that correspond to gaps in the query sequence are removed, resulting in a matrix with the same width as the length of the query sequence. This final matrix is used for automatic domain-boundary assignment as discussed earlier.

Integration of linker propensities

Two linker prediction scoring systems, Domcut (14) and PDLI (17), were used independently to complement Scooby-Domain’s prediction. A negative number represents a higher propensity for a linker in both of these scoring schemes. Therefore, the scores were multiplied by −1 to reverse the polarity of the scoring. Scaling was performed on these scoring schemes such that the range of the scores is within 0.0 and 0.5. The Scooby-Domain multi-dimensional smoothing window adds the linker prediction scores at its N- and C-termini to the domain probability matrix (the combined linker prediction will have a maximum score of 1.0). To avoid increasing the chance of assigning a domain in a large unstructured region, the scores were added to the triangular matrix only if the domain probability value exceeds a threshold probability score. A threshold of 0.25 was found to be best after assessing a number of test cases.

To test the added value of the combined approach, the Domcut and PDLI methods were re-implemented and their domain-prediction performances were compared with Scooby-Domain: as a stand-alone predictor; or with complementary predictions made by Domcut or PDLI. When Scooby-Domain was combined with Domcut, the raw Domcut score, rather than the normalized score, was used for Scooby-Domain predictions. For the Scooby-Domain, Domcut and MSA combination, Domcut predictions were obtained for the query sequence only, and not for each sequence in the MSA. This is because when Domcut is applied to all sequences in the MSA prediction, results for Domcut are close to random. When PDLI and MSAs are combined with Scooby-Domain, PDLI uses all sequences in the MSA to make a prediction.
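The way linker scores are folded into the domain probability matrix can be sketched in Python. This is a hedged illustration, not the published code: the function names are hypothetical, min-max scaling is an assumed way of bringing scores into the 0.0 to 0.5 range, and the matrix layout (a dictionary of rows keyed by odd window size, with `None` where a window does not fit) is chosen for the example. The sign flip, the 0.5 ceiling per terminus and the 0.25 threshold come from the text.

```python
def scale_linker_scores(raw_scores, top=0.5):
    """Flip the sign (negative raw score = likely linker) and scale to [0, top].

    Min-max scaling is an assumption; the text only states the target range.
    """
    flipped = [-s for s in raw_scores]
    lo, hi = min(flipped), max(flipped)
    if hi == lo:
        return [0.0] * len(flipped)
    return [top * (s - lo) / (hi - lo) for s in flipped]

def add_linker_scores(prob_matrix, linker, threshold=0.25):
    """Add linker scores at both window termini to sufficiently likely cells.

    prob_matrix: {odd window size j: list of probabilities (or None) per
    centre position i}; linker: scaled per-residue linker scores.
    """
    combined = {}
    for j, row in prob_matrix.items():
        half = j // 2
        new_row = list(row)
        for i, p in enumerate(row):
            start, end = i - half, i + half  # N- and C-terminal residues
            if p is None or p <= threshold or start < 0 or end >= len(linker):
                continue
            new_row[i] = p + linker[start] + linker[end]
        combined[j] = new_row
    return combined
```

Since each terminus contributes at most 0.5, a window flanked by two strongly predicted linkers gains at most 1.0, matching the maximum combined linker score stated above.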

Benchmarking

Predictions were assessed using a set of proteins with known structural domain assignments. A non-redundant list of protein sequences, with known 3D structures, was obtained from the VAST non-redundant PDB chain set (www.ncbi.nlm.nih.gov/Structure/VAST/nrpdb.html). This was matched with the corresponding entries in CATH (28) and SCOP (29), to create two test sets: a non-redundant CATH test set and a non-redundant SCOP test set. Where a domain boundary consists of several residues, the central position between the start and end of the boundary is used. Full-length sequences were taken from the PDB SEQRES fields of the ASTRAL database (37). Our test sets are much more rigorous than those used by other methods, as they contain sequences with three or more domains and sequences with discontinuous domains. These sequences were often under-represented or omitted by other groups, for example Liu et al. (19) and Dong et al. (17).

Both the CATH and SCOP databases define a domain as a particular core structure of secondary structure elements. Both of these databases allow some degree of elaboration upon domain definition; however, CATH’s definition is more flexible and delineates smaller domains in comparison to SCOP (38). Direct comparisons of the two databases showed that they agree on the majority of domain annotations (39, 40). As an additional test, the intersection of CATH and SCOP was also used as an optimal test set, using a consensus approach (CATH ∩ SCOP). For this set, only proteins that had boundary assignments corresponding to within 10 residues between the two definitions were used.

Scooby-Domain performance was compared to PDLI, Domcut and an equal-cut method. Equal-cut is a naive method, similar to DGS (23), that is used as an experimental control. First, a rough estimate of the number of domains in a sequence is calculated by dividing the sequence length by the average domain size of 100 residues, and rounding to the closest integer. The sequence was then chopped as evenly as possible based on the number of domains. It was impossible to fairly assess other methods on our test sets. For example, DomSSEA (20) uses a threading procedure that would identify the original query and others apply domain-profile searches in an initial attempt to find known domains.

Predictions were assessed using various error-window sizes around the known domain boundary. A correct boundary prediction is one that falls within the error window. Two measures of performance for the predictions were utilized. The first measure is sensitivity, which is the percentage of boundaries, out of all the boundaries collected from all proteins, that were correctly predicted:

$$S = \frac{TP}{TP + FN} \qquad (2)$$

where TP is the number of true positive boundary predictions and FN the number of false negatives. The second measure of performance is positive predictive value (PPV), which is the percentage of all boundary predictions that are correct:

$$PPV = \frac{TP}{TP + FP} \qquad (3)$$

where FP is the number of false positive boundary predictions. PPV is called accuracy for the purpose of this study.

Table 1. Number of proteins in each dataset

                                    CATH   SCOP   CATH ∩ SCOP
All                                  611    496           173
Continuous                           336    418           150
Discontinuous                        275     78            23
Total number of unique sequences     789

RESULTS

Comparison of different test sets

Three test sets were used in this study: CATH, SCOP and CATH ∩ SCOP. The number of sequences in each test set is shown in Table 1. Accuracies for all methods tested were around 10% higher in the CATH test set compared to the SCOP set (Figure 4 and Table 2). This could be attributed to the higher proportion of linkers in the CATH set, which makes it easier to predict boundaries by chance. The equal-cut method achieved its highest accuracy on this set. CATH assigns more domains and linkers in a protein than SCOP, and has an average of 2.25 linkers per protein (1376 linkers in 611 proteins) while SCOP has an average of 1.50 linkers per protein (746 linkers in 496 proteins). However, Scooby-Domain achieved similar prediction accuracies in both CATH and CATH ∩ SCOP sets. CATH ∩ SCOP has the smallest average number of linkers per protein, 1.47 (255 linkers in 173 proteins), suggesting that Scooby-Domain prediction accuracies are not artificially improved by a larger number of linkers in CATH.

CATH assigns domains purely on the basis of structure, whereas SCOP domains are assigned on the basis of inherited functional units. Therefore, SCOP domains can often be made up of two or more CATH domains. The Scooby-Domain algorithm tries to predict domains based on sequence characteristics related to structural principles and should therefore perform better on the CATH set compared to the SCOP set, which it does. Interestingly, Scooby-Domain had the best overall performance on the consensus CATH ∩ SCOP set, suggesting that Scooby-Domain can successfully identify domains that qualify as being both structural and functional.

Enhanced sensitivity with linker predictions

Predictions made by utilizing domain-boundary predictions produced from other sources are compared in Figure 4 and Table 2. A window size of ±20 residues (41 residues) is used for the results presented below. Alone, Scooby-Domain scored a sensitivity of 37.3%

[Figure 4 image: for each of panels (a), (b) and (c), sensitivity (%) (top) and accuracy (%) (bottom) are plotted against performance window size (10 to 50 residues).]

Figure 4. Sensitivity and accuracy versus different performance window size around the domain boundary. Window size is the total number of residues making up the window. Sensitivity (top) and accuracy (bottom) are shown for the CATH dataset (a), SCOP dataset (b), and the CATH ∩ SCOP dataset (c). Continuous line, Scooby+Domcut; diamond with dashed line, PDLI only; triangle with dashed line, Scooby only; inverted triangle with dashed line, Scooby+PDLI+MSA; plus sign with dotted line, Domcut only; crossed square with continuous line, Scooby+Domcut+MSA; cross mark with dashed dotted line, Equal cut.

Table 2. Sensitivity and accuracy (%) at performance window size ±20 residues

                         CATH                     SCOP                     CATH ∩ SCOP
Methods                  Sensitivity  Accuracy    Sensitivity  Accuracy    Sensitivity  Accuracy
Scooby+Domcut+MSA        38.7         31.3        45.2         23.2        50.2         28.5
Scooby+Domcut            36.8         31.6        41.7         22.7        45.5         27.9
Scooby only              32.3         30.4        37.9         23.3        37.3         25.5
Domcut only              19.3         29.3        20.8         20.1        25.5         24.6
Equal cut                29.5         31.7        33.2         22.2        33.3         27.0
PDLI only                51.5         27.3        56.7         18.4        52.2         19.7
Scooby+PDLI+MSA          34.5         29.7        36.7         20.5        39.2         23.5

on the CATH ∩ SCOP test set. Domcut is the least sensitive method amongst those tested, but addition of the Domcut predictions into Scooby-Domain surprisingly increases Scooby-Domain’s overall score. The sensitivity for Domcut alone is 25.5%, less than the equal-cut method (33.3%). The combination of Scooby-Domain and Domcut achieved a sensitivity of 45.5% and an accuracy of 27.9%. The combined Scooby-Domain and Domcut method was determined to be the best of the three in terms of sensitivity and accuracy.

Homology information enhances prediction

We further determined whether improvements in performance could be obtained if combined Scooby-Domain and Domcut methodology was used in combination with homology information. Domcut benchmark results were close to random when Domcut was applied to all sequences in the MSAs. Therefore, when combined with Scooby-Domain, Domcut was applied to the query sequence only, while Scooby-Domain was applied to all sequences in the MSA. Sensitivity was improved from 45.5% to 50.2% and accuracy was improved from 27.9% to 28.5%.

PDLI, which makes linker predictions based on MSA, has the best sensitivity (52.2%) but the worst accuracy (19.7%). Because PDLI overpredicts linkers, it finds more linkers in the CATH set (Figure 4), but consistently has the lowest accuracy compared to other methods assessed. The combination of the PDLI method with Scooby-Domain significantly reduces the sensitivity and accuracy.

Scooby-Domain (Scooby+Domcut+MSA) has the highest sensitivity, 50.2%, in the CATH ∩ SCOP set in comparison to the other two datasets, with an accuracy of 28.5%. For the CATH dataset, it achieved a sensitivity of

[Figure 5 image: bar charts of (a) sensitivity (%) and (b) accuracy (%) against number of domains, with separate bars for proteins with continuous and discontinuous domains.]

Figure 5. Sensitivity (a) and accuracy (b) of Scooby-Domcut (Scooby+Domcut+MSA), at window size of 20, for proteins with more than one domain. The white bars represent proteins with continuous domains only. The grey bars represent proteins containing discontinuous domains. There are two numbers above each bar. The top number has a different meaning for each graph: in the sensitivity graph (a) it represents the number of real linkers for each domain; and in the accuracy graph (b) it represents the number of linkers predicted for the binned proteins. For both graphs, the bottom number is the number of proteins with the corresponding number of domains.

38.7% and the highest accuracy, 31.3%, of the three datasets. For the SCOP dataset, Scooby-Domain achieved a sensitivity of 45.1%, but a lower accuracy of 23.1%.

Results for different domain numbers and types

The sensitivity and accuracy of prediction for multi-domain proteins as a function of domain number is shown in Figure 5. Proteins are divided into two groups: continuous and discontinuous. In the latter group, at least one domain is wound from non-contiguous portions of the polypeptide chain.

The method is equally sensitive at delineating proteins with either continuous or discontinuous domains. Sensitivity is 50.4% and 47.1% for two-domain proteins with continuous and discontinuous domains respectively and 54.6% and 50.0% for three domains.

The method more accurately delineates proteins with discontinuous domains. The accuracy for two-domain proteins is 22.0% and 33.3%, for proteins with continuous and discontinuous domains respectively; and for three domain proteins, 30.7% and 46.7% (Figure 5b). The majority of domain prediction methods have not been developed to identify discontinuous domains and ignore such proteins in their benchmarking tests.

Sensitivity and accuracy for proteins with four or more domains is not statistically significant due to the lack of proteins in the test data. However, it is interesting to note that 100% sensitivity and accuracy is scored for a protein which includes one discontinuous domain (PDB 1dq3A).

The results show that Scooby-Domain delineates proteins with discontinuous domains with a sensitivity and accuracy as good as for proteins with continuous domains. This is important to the structural genomics initiative because the presence of discontinuous domains in the protein sequence would not confound prediction results, thus the predictions are more reliable, and will aid the discovery of previously unknown protein domains.

Examples of predictions for proteins with both continuous and discontinuous domains are shown in Figure 6 and 7. The corresponding protein structures are shown and coloured by the predicted domain region. Each is bounded by the predicted middle position of a probable linker region. In Figure 6a, two distinctive hotspots representing the two larger domains of 1LK5 chain A (Figure 6b) are discernible. Scooby-Domain had difficulty predicting the exact domain boundary (junction of red and green region) at the end of a β-strand. Figure 7a shows an example of a successful discontinuous domain prediction by Scooby-Domain. The two segments of the discontinuous domain are coloured in red and pink, respectively. Similar to the previous example, Scooby-Domain did not precisely match the inter-domain region, however, the boundary is within the 20 residue window. Figure 7b demonstrates how Scooby-Domain accurately delineated the monoclonal antibody heavy chain for Mus musculus (PDB 1IGT, chain B).

In summary, Scooby-Domain can successfully identify discontinuous regions and can easily delineate distinct domains separated by long linker regions. Precise domain boundary placement is a very difficult problem, even when a structure is known. For example, the CATH domain database uses a consensus of computational methods, combined with a manual assessment when automatic methods do not agree (24). Scooby-Domain, therefore, performs very well at identifying domains and their boundaries using only sequence information.

DISCUSSION

Domain prediction based on hydrophobicity

The globular structure of a protein cannot be achieved by any combination of amino acids, as certain principles of structure must be obeyed. Previous studies have shown that there is a required ratio of hydrophobic to hydrophilic


Figure 6. The Scooby-Domain (Scooby+Domcut+MSA) prediction for the hyperthermostable D-ribose-5-phosphate isomerase from Pyrococcus horikoshii (PDB 1LK5, chain A). (a) The structure of 1LK5, coloured according to the linker prediction by Scooby-Domain. A discontinuous domain is predicted at residues 1–136 (green) and 207–229 (blue). A second domain is predicted at residues 137–206 (red). The CATH domain annotation consists of two domains, a discontinuous domain made of two segments 1–128 and 208–229; and the continuous domain 129–207. (b) The Scooby-Domain probability plot for 1LK5.


Figure 7. The Scooby-Domain (Scooby+Domcut+MSA) predictions mapped to structures. (a) Transcription factor NusG from Aquifex aeolicus (PDB 1M1G, chain C), coloured according to the linker prediction by Scooby-Domain. A discontinuous domain is predicted at residues 1–55 (red) and 131–174 (pink), two continuous domains are predicted at residues 56–130 (green) and 175–249 (blue). The CATH domain annotation consists of three domains; a discontinuous domain (1–49, 131–190) and two continuous domains (50–131 and 191–249). (b) IgG2 monoclonal antibody heavy chain from Mus musculus (PDB 1IGT, chain B), coloured according to the linker prediction by Scooby-Domain. Four continuous domains are predicted at residues 1–117 (red), 118–228 (blue), 229–360 (green) and 361–474 (orange). The corresponding CATH domain annotation consists of four continuous domains at residues 1–114, 115–236, 250–360 and 361–474.
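
The window-based scoring used throughout this comparison can be illustrated with the 1IGT boundaries quoted in the Figure 7 caption. The sketch below is only an illustration under stated assumptions, not the authors' scoring code: it assumes a predicted boundary counts as correct when it lies within `tolerance` residues of an annotated boundary, that sensitivity is the fraction of annotated boundaries recovered, and that accuracy is the fraction of predicted boundaries that are correct. The function name `boundary_scores` and the choice of 243 as the midpoint of the 237–249 linker are hypothetical.

```python
def boundary_scores(predicted, actual, tolerance):
    """Return (sensitivity, accuracy) for predicted domain boundaries.

    Assumption: a boundary is matched when it lies within `tolerance`
    residues of a boundary in the other list.
    """
    # Annotated boundaries that have at least one nearby prediction.
    found = sum(
        1 for a in actual if any(abs(a - p) <= tolerance for p in predicted)
    )
    # Predicted boundaries that have at least one nearby annotation.
    correct = sum(
        1 for p in predicted if any(abs(p - a) <= tolerance for a in actual)
    )
    sensitivity = found / len(actual) if actual else 0.0
    accuracy = correct / len(predicted) if predicted else 0.0
    return sensitivity, accuracy


# Boundaries for PDB 1IGT chain B, taken from the Figure 7 caption.
# Predicted domains end at residues 117, 228 and 360.
predicted_1igt = [117, 228, 360]
# CATH domains end at 114 and 360; the 237-249 gap is represented by its
# midpoint, 243 (an assumption made for this illustration).
cath_1igt = [114, 243, 360]

print(boundary_scores(predicted_1igt, cath_1igt, tolerance=20))
print(boundary_scores(predicted_1igt, cath_1igt, tolerance=10))
```

Under these assumptions all three boundaries match at a tolerance of 20 residues, while tightening the tolerance to 10 drops the middle boundary, which shows why the reported figures depend so strongly on window size.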

residues. Molecules with too many hydrophobic residues aggregate in solution, and largely hydrophilic proteins fail to form a stable hydrophobic core (41,42).

Scooby-Domain is a domain prediction method based on the observed properties of proteins with known 3D structure. Smaller domains are found to have a higher proportion of hydrophilic residues, while larger domains that maintain a single hydrophobic core are constrained by their length, with an average size of 100 residues (Figure 1). Scooby-Domain takes advantage of the above criteria to identify local deviations in hydrophobicity to predict the protein domain architecture.

Given the simplicity of our method, Scooby-Domain is surprisingly powerful. It is likely that its performance can be further improved by incorporating other information, for example, secondary structure prediction (20). Furthermore, using information from methods that predict transmembrane regions is likely to improve Scooby-Domain’s ability to delineate solvent-exposed domains in membrane proteins.

Comparison with other methods

At a window size of 20 residues, Scooby-Domain has a sensitivity of 50.2% and an accuracy of 28.5% (CATH ∩ SCOP set). Performance is similar to CHOPnet (19), which has a sensitivity of 46–51%. The accuracy of CHOPnet was not computed.

Armadillo has a sensitivity of 27 ± 3% and an accuracy of 35 ± 4% (13) and the PDLI method (17) has a sensitivity of 71% and specificity of 34%. However, Armadillo and PDLI assess a linker as a region consisting of multiple residues, rather than as a single residue position as applied here, which implicitly makes the window size larger and the predictions easier. On our dataset, PDLI has a sensitivity of 52.2% and accuracy of 19.7%. Domcut (14) is reported by Dong et al. (17) to have low sensitivity (23%) and specificity (9%) in comparison to other methods, which is consistent with our observation.

DomainDiscovery (18), which also applies linker predictions from Domcut, has a sensitivity (termed recall in their paper) and accuracy (termed precision in their paper) of 31% and 9%, respectively at a window size of 15 residues. At this window size, Scooby-Domain with Domcut and MSA has a sensitivity of 42.0% and accuracy of 22.9%.

DomSSEA (20) used the CATH database for their test set, therefore, its performance will be compared with Scooby-Domain tested with our CATH test set. For multi-domain proteins, DomSSEA has a sensitivity of 24.7% and Scooby-Domain has a sensitivity of 38.7%. For proteins with two continuous domains, DomSSEA has a sensitivity of 49% compared with Scooby-Domain’s 50.4%. For proteins with two domains and at least one discontinuous domain, Scooby-Domain has a higher sensitivity (35.4%) than DomSSEA (33.1%), but a lower accuracy (36.0%) than DomSSEA (49.7%). It can be observed from these rough comparisons that the performance of Scooby-Domain is comparable, and often better, than other sequence-feature-based methods.

Domaination (5) is an example of a method that uses homology searches to predict domains. We applied Domaination to our test set and added the predictions to Scooby-Domain. The combined method has a similar sensitivity (50.6%) and accuracy (29.3%) to Scooby+Domcut+MSA, and has a higher sensitivity but lower accuracy than Domaination alone. However, Domaination is significantly more computationally expensive, therefore, its use is restricted to small datasets. In addition, homology methods cannot identify seldom encountered domains. Combining Domaination and Scooby-Domain would likely improve Domaination’s homology detection.

The ab initio methods SnapDRAGON (25) and RosettaDOM (26) currently have the best sensitivities and accuracies for boundary prediction. Both these methods employ protein-fold prediction to identify domain boundaries. For a window size of 10 residues, RosettaDOM has lower sensitivity (28.6%) than SnapDRAGON (42.3%). However, RosettaDOM is more accurate (54.6%) than SnapDRAGON (39.8%). Both of these methods are more sensitive and accurate than Scooby-Domain for this window size, but much more computationally expensive.

It is important to note that different protein test sets and assessment criteria were used in the above comparisons. Therefore, these comparisons only provide a ballpark figure of how each of these methods perform in relation to each other. For example, the test set used for this study contains multi-domain sequences and sequences with one or more discontinuous domains, whereas only sequences with two continuous domains were used by Dong et al. (17).

To further assess our methods against others we have applied the Scooby-Domain algorithm to benchmark 2 from Holland et al. (43), which was used in their assessment of methods that assign domains using 3D structure. The dataset is built using a similar methodology as applied in our consensus set, i.e. looking for a consensus between CATH and SCOP definitions. However, while we ensure that boundaries between domains are at equivalent positions in CATH and SCOP, the consensus in benchmark 2 is based on the number of domains assigned and unlike our set, benchmark 2 contains single domain proteins.

Scooby-Domain, using Domcut and MSA, scored a sensitivity of 41.6% and accuracy of 29.0% on the benchmark 2 set, and correctly predicted single domain proteins in 59.3% of cases. Predictions for all proteins can be found in Supplementary Table 1. Scooby-Domain predicted the exact domain number for nearly half of the proteins (71/156), but often overpredicted domain number (65/156). Interestingly, many structure based methods also tend to overpredict domain number on this set (43).

The multidomain proteins in benchmark 2 are particularly hard to predict, as nearly half are made up of discontinuous domains, but Scooby-Domain performs well on the discontinuous subset with a sensitivity of 39.4% and accuracy of 34.8%.

We also tested Scooby-Domain on a set of proteins used by Sikder and Zomaya (18) in their analysis of seven other state-of-the-art methods. This set is based on 50 randomly selected proteins from the benchmark 2 dataset. It is unclear whether these other methods were fairly tested as most use AI algorithms that learn from proteins with known 3D structure, and performances could be artificially enhanced by making predictions on the proteins used to train them. Furthermore, the webservers for some of these methods will perform an initial homology search to first identify any known structures with domain definitions, again leading to an unfair assessment. Nevertheless, predictions by Scooby-Domain with Domcut and MSA (Supplementary Table 2) are comparable to the other seven methods and scores the highest domain placement accuracy, 4.04.

Application of A*-search in structure prediction

Reinert et al. (44) previously used A*-search to efficiently perform near optimal MSA, but to our knowledge, Scooby-Domain is the first method that uses an A*-search in protein structure prediction. A*-search is a very flexible method, and it may be easily adapted and improved to include more sophistication in its predictions. For example, a more biologically accurate heuristic function could be developed by incorporating more feature-based parameters, such as the flexibility of the peptide backbone and the presence of possible disulphide bonds.

Another future area of research and development is to adapt the A*-search algorithm to predict protein-folding pathways. An obvious but interesting property of A*-search is that it explores the hypothetical folding space in a tree-like search pattern. Because Scooby-Domain predictions rely on the hydrophobicity of the protein sequence, it is possible, therefore, to simulate hydrophobic collapse and protein-folding pathways by backtracking through the search tree. Finally, the A*-search algorithm, or similar heuristics, could in theory be incorporated into a protein tertiary structure-prediction algorithm to simulate and predict folding pathways.

CONCLUSION

Percentage hydrophobicity and domain size are good variables for domain prediction and have been successfully applied to predict domain boundaries in the Scooby-Domain algorithm. Precise boundary positioning is still a difficult problem. Domains that are connected by small linkers may not be identifiable by Scooby-Domain, because window averaging may lose any signal at the linker. Scooby-Domain is therefore more useful when identifying modules separated by clear linker regions in large proteins. However, Scooby-Domain does produce encouraging results. Predictions made from the Scooby-Domcut combination are better than other previously described sequence-feature-based methods. Unlike other methods, it achieves similar prediction sensitivity and accuracy regardless of whether the domain is discontinuous or continuous.

Scooby-Domain stands out from other prediction methods because it is able to predict discontinuous domains and successful predictions are not limited by the length of the query sequence, which can be too complex or time-consuming for other methods to calculate.

The inclusion of difficult targets for benchmarking domain prediction, such as discontinuous domains, is essential to drive future developments in this area. Our test sets are available as Supplementary Data.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.

ACKNOWLEDGEMENTS

We would like to thank Dr Bernd Brandt for maintaining the Scooby-Domain server. R.A.G. acknowledges funding from the Ronald Geoffrey Arnott Foundation. Funding to pay the Open Access publication charges for this article was provided by Victor Chang Cardiac Research Institute.

Conflict of interest statement. None declared.

REFERENCES

1. Edwards,A.M., Arrowsmith,C.H., Christendat,D., Dharamsi,A., Friesen,J.D., Greenblatt,J.F. and Vedadi,M. (2000) Protein production: feeding the crystallographers and NMR spectroscopists. Nat. Struct. Biol., 7(Suppl), 970–972.
2. Bateman,A. and Valencia,A. (2006) Structural genomics meets computational biology. Bioinformatics, 22, 2319.
3. Richardson,J.S. (1981) The anatomy and taxonomy of protein structure. Adv. Protein Chem., 34, 167–339.
4. Wetlaufer,D.B. (1973) Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl Acad. Sci. USA, 70, 697–701.
5. George,R.A. and Heringa,J. (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins, 48, 672–681.
6. Kuroda,Y., Tani,K., Matsuo,Y. and Yokoyama,S. (2000) Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics. Protein Sci., 9, 2313–2321.
7. Gracy,J. and Argos,P. (1998) Automated protein sequence database classification. ii. Delineation of domain boundaries from sequence similarities. Bioinformatics, 14, 174–187.
8. Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263–266.
9. Corpet,F., Servant,F., Gouzy,J. and Kahn,D. (2000) Prodom and Prodom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res., 28, 267–269.
10. Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) Smart: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231–234.
11. Castiglione Morelli,M.A., Stier,G., Gibson,T., Joseph,C., Musco,G., Pastore,A. and Trave,G. (1995) The KH module has an alpha beta fold. FEBS Lett., 358, 193–198.
12. George,R.A. and Heringa,J. (2002) An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng., 15, 871–879.
13. Dumontier,M., Yao,R., Feldman,H.J. and Hogue,C.W. (2005) Armadillo: domain boundary prediction by amino acid composition. J. Mol. Biol., 350, 1061–1073.
14. Suyama,M. and Ohara,O. (2003) Domcut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics, 19, 673–674.
15. Bae,K., Mallick,B.K. and Elsik,C.G. (2005) Prediction of protein interdomain linker regions by a hidden Markov model. Bioinformatics, 21, 2264–2270.
16. Miyazaki,S., Kuroda,Y. and Yokoyama,S. (2006) Identification of putative domain linkers by a neural network - application to a large sequence database. BMC Bioinformatics, 7, 323.
17. Dong,Q., Wang,X., Lin,L. and Xu,Z. (2006) Domain boundary prediction based on profile domain linker propensity index. Comput. Biol. Chem., 30, 127–133.
18. Sikder,A.R. and Zomaya,A.Y. (2006) Improving the performance of DomainDiscovery of protein domain boundary assignment using inter-domain linker index. BMC Bioinformatics, 7(Suppl. 5), S6.
19. Liu,J. and Rost,B. (2004) Sequence-based prediction of protein domains. Nucleic Acids Res., 32, 3522–3530.
20. Marsden,R.L., McGuffin,L.J. and Jones,D.T. (2002) Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Sci., 11, 2814–2824.
21. Gewehr,J.E. and Zimmer,R. (2006) SSEP-domain: protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics, 22, 181–187.
22. Joshi,R.R. and Samant,V.V. (2006) Fast prediction of protein domain boundaries using conserved local patterns. J. Mol. Model., 12, 943–952.
23. Wheelan,S.J., Marchler-Bauer,A. and Bryant,S.H. (2000) Domain size distributions can predict domain boundaries. Bioinformatics, 16, 613–618.
24. Jones,S., Stewart,M., Michie,A., Swindells,M.B., Orengo,C. and Thornton,J.M. (1998) Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Sci., 7, 233–242.

25. George,R.A. and Heringa,J. (2002) Snapdragon: a method to delineate protein structural domains from sequence data. J. Mol. Biol., 316, 839–851.
26. Kim,D.E., Chivian,D., Malmstrom,L. and Baker,D. (2005) Automated prediction of domain boundaries in CASP6 targets using Ginzu and RosettaDOM. Proteins, 61(Suppl. 7), 193–200.
27. George,R.A., Lin,K. and Heringa,J. (2005) Scooby-domain: prediction of globular domains in protein sequence. Nucleic Acids Res., 33, W160–W163.
28. Pearl,F., Todd,A., Sillitoe,I., Dibley,M., Redfern,O., Lewis,T., Bennett,C., Marsden,R., Grant,A. et al. (2005) The CATH domain structure database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res., 33, D247–D251.
29. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
30. Garel,J. (1992) Folding of large proteins: multidomain and multisubunit proteins. In Creighton,T. (ed.), Protein folding, W.H. Freeman and Company, New York, pp. 405–454.
31. White,S.H. and Jacobs,R.E. (1990) Statistical distribution of hydrophobic residues along the length of protein chains. Implications for protein folding and evolution. Biophys. J., 57, 911–921.
32. Hart,P., Nilsson,N. and Raphael,B. (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, SSC4, pp. 100–107.
33. Russell,S. and Norvig,P. (2003) Artificial Intelligence: A Modern Approach. 2nd edn. Prentice Hall, Englewood Cliffs, New Jersey.
34. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
35. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.
36. Simossis,V.A. and Heringa,J. (2005) Praline: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res., 33, W289–W294.
37. Chandonia,J.M., Hon,G., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,S.E. and Brenner,S.E. (2004) The ASTRAL compendium in 2004. Nucleic Acids Res., 32, D189–D192.
38. Chu,C.K., Feng,L.L. and Wouters,M.A. (2005) Comparison of sequence and structure-based datasets for nonredundant structural data mining. Proteins, 60, 577–583.
39. Day,R., Beck,D.A., Armen,R.S. and Daggett,V. (2003) A consensus view of fold space: combining SCOP, CATH, and the Dali domain dictionary. Protein Sci., 12, 2150–2160.
40. Hadley,C. and Jones,D.T. (1999) A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure, 7, 1099–1112.
41. Fisher,H.F. (1964) A limiting law relating the size and shape of protein molecules to their composition. Proc. Natl Acad. Sci. USA, 51, 1285–1291.
42. Dill,K.A. (1985) Theory for the folding and stability of globular proteins. Biochemistry, 24, 1501–1509.
43. Holland,T.A., Veretnik,S., Shindyalov,I.N. and Bourne,P.E. (2006) Partitioning protein structures into domains: why is it so difficult. J. Mol. Biol., 361, 562–590.
44. Reinert,K., Stoye,J. and Will,T. (2000) An iterative method for faster sum-of-pairs multiple sequence alignment. Bioinformatics, 16, 808–814.

Proteomics 2009, 9, 5309–5315 DOI 10.1002/pmic.200900260

RESEARCH ARTICLE

The Interactorium: Visualising proteins, complexes and interaction networks in a virtual 3-D cell

Yose Y. Widjaja1, Chi Nam Ignatius Pang2, Simone S. Li2, Marc R. Wilkins2 and Tim D. Lambert1

1 School of Computer Science and Engineering, University of New South Wales, NSW, Australia 2 Systems Biology Initiative and School of Biotechnology and Biomolecular Sciences, University of New South Wales, NSW, Australia

Here, we describe the Interactorium, a tool in which a Virtual Cell is used as the context for the seamless visualisation of the yeast protein interaction network, protein complexes and protein 3-D structures. The tool has been designed to display very complex networks of up to 40 000 proteins or 6000 multiprotein complexes and has a series of toolboxes and menus to allow real-time data manipulation and control the manner in which data are displayed. It incorporates new algorithms that reduce the complexity of the visualisation by the generation of putative new complexes from existing data and by the reduction of edges through the use of protein “twins” when they occur in multiple locations. Since the Interactorium permits multi-level viewing of the molecular biology of the cell, it is a considerable advance over existing approaches. We illustrate its use for Saccharomyces cerevisiae but note that it will also be useful for the analysis of data from simpler prokaryotes and higher eukaryotes, including humans. The Interactorium is available for download at http://www.interactorium.net.

Received: April 22, 2009
Revised: July 8, 2009
Accepted: August 26, 2009

Keywords: Complexome / Networks / Protein–protein interactions / Systems biology / Virtual Cell

Correspondence: Professor Marc R. Wilkins, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
E-mail: [email protected]
Fax: +61-2-9385-1483

1 Introduction

The visualisation of biological data is an ongoing challenge. Analytical data, including protein structures, interactions and other parameters of proteins can be generated with increasingly automated technology. However, the meaning of such data can often not be appreciated without sophisticated visualisation, and scientific insights can be difficult to obtain without strong biological context. To date, there have been many efforts to visualise biological data in a non-contextual way. Protein structure viewing tools, perhaps the first sophisticated use of computer visualisation of biochemical data, can provide a means to interactively see the shapes and nuances of proteins in 3-D. Tools for visualising protein interaction networks, such as Cytoscape [1] and VisANT [2], are in increasingly wide use but are 2-D, which limits the amount of information that can be displayed at once. These tools also provide limited interactivity [3]. A 3-D visualisation can increase the size of a graph that can be readily understood, particularly when coupled with user interaction [4]. Some recently introduced tools such as GEOMI [5] have adopted this approach. However, they have not been designed to visualise networks with very high numbers of protein nodes. In contrast to the above, other projects have attempted to build a cellular context by artificially visualising and representing the cell in a holistic manner. These include visualisation projects such as The Cell Visualisation Project (http://www.kenneth-eward.com/cvp/cvpindex.html) and The Visible Cell Project [6] and modelling and simulation projects including the E-Cell Project [7] and The Virtual Cell Modelling and Simulation Framework [8]. Interestingly, there have been no projects to date designed to integrate and work at multiple levels inside the cell, for example that of protein structure, protein complex, protein interaction network and organellar localisation.

Here, we describe the adaptation of the Skyrails Visualisation System (http://www.skyrails.net) to create the Interactorium. This is an interactive tool in which a virtual yeast cell is used as the context for the visualisation of the

yeast protein interaction network, protein complexes and protein 3-D structures. The tool has been designed to display very complex networks of up to 6000 complexes containing 40 000 proteins and has a series of toolboxes and menus to allow real-time data manipulation and control the manner in which data are displayed. Since the Interactorium permits multi-level viewing of the molecular biology of the cell, it is a considerable advance over existing approaches. It will be applicable to the future analysis of similar data from simpler prokaryotes as well as higher eukaryotes, including that for human cells.

2 Materials and methods

2.1 Protein localisation, abundance and half-life data

Cellular localisation data were obtained from Huh et al. [9], protein abundance data from Ghaemmaghami et al. [10] and protein half-life data from Belle et al. [11].

2.2 Interaction data

Protein–protein interaction data were from Bertin et al. [12]. This is a high-quality metadata set that considers interactions from yeast two-hybrid experiments, affinity purification studies as well as those from the literature.

The different types of interaction detection methods were used to weight the edges in the interaction networks. Interactions detected using affinity capture and co-crystal structure techniques were given a weighting of 3, and those from high-throughput two-hybrid experiments were given a weighting of 2. Other observations, listed in Table 1, were given a weighting of 1. The total weighting assigned to an interaction was calculated as the sum of all documented observations multiplied by the respective weight of the detection method. This weighting system was used to reflect the likelihood that these observations represent a true interaction. The weightings were then used to define the edge strength, which determined the thickness of the edge displayed, as well as the strength of the spring used to pull interacting proteins together in the layout algorithm. Non-interacting proteins, by contrast, were repelled from each other by the algorithm. Together, this created a layout that arranges related proteins together in 3-D space. Weighting was also used in the determination of putative protein complexes (Section 3).

2.3 Protein complex data

Protein complex data used in this visualisation were based on the high-accuracy consensus map by Hart et al. [13]. We also devised a grouping algorithm to automatically generate protein complexes, based on the weightings of the protein–protein interactions. It involves first assigning every ungrouped protein as a one-protein group. Groups are then merged together based on their “synergy” values:

$$\mathrm{synergy}_{ab} = \frac{D_{ab}^{2}}{D_a D_b Z_a Z_b}$$

where $D_a$ and $D_b$ are the degrees of any two groups A and B, $Z_a$ and $Z_b$ are the node counts of groups A and B and $D_{ab}$ is the sum of weightings for all edges that connect groups A and B. The degree of a group indicates the total weight of interactions of group members, in this case, proteins. The synergy value roughly models the likelihood that two groups of proteins will form a complex based on the amount of interactions between the two groups, while also taking into

© 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com

Table 1. Interaction types and their weighting in the Interactorium, sourced from Bertin et al. [12]

Interaction detection method   Total interactions   Observation frequency   Weighting

Affinity capture (MS)          3305                 7438                    3
Affinity capture Western       3029                 4721                    3
Biochemical activity           414                  539                     1
Co-crystal structure           69                   92                      3
Co-fractionation               228                  252                     1
Co-localisation                95                   109                     1
Co-purification                854                  1114                    3
Far Western                    38                   40                      1
Fret                           26                   26                      1
In silico                      170                  170                     1
Protein-peptide                126                  186                     1
Protein-RNA                    1                    2                       1
Reconstituted complex          1914                 2542                    1
Two-hybrid                     2131                 3378                    2
Unknown                        1427                 1738                    1

The second column indicates the number of unique interactions between proteins, whereas the third column indicates the total number of times each type of observation was recorded.
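
The edge weighting of Section 2.2 and the synergy value of Section 2.3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Interactorium's own code: the function names, the example proteins P1 to P4 and their observations are hypothetical, only a subset of the Table 1 weights is included, and the degree of a group is taken here to count each edge touching the group once, using first-degree interactions only.

```python
from collections import defaultdict

# Method weights copied from Table 1 (subset, for illustration only).
METHOD_WEIGHT = {
    "Affinity capture (MS)": 3,
    "Co-crystal structure": 3,
    "Two-hybrid": 2,
    "Reconstituted complex": 1,
}


def edge_weights(observations):
    """Sum the method weight of every documented observation per protein pair."""
    weights = defaultdict(int)
    for protein_a, protein_b, method in observations:
        weights[frozenset((protein_a, protein_b))] += METHOD_WEIGHT[method]
    return dict(weights)


def degree(group, weights):
    """Total weight of all edges touching any member of the group (assumed definition)."""
    return sum(w for pair, w in weights.items() if pair & group)


def synergy(group_a, group_b, weights):
    """synergy_ab = D_ab^2 / (D_a * D_b * Z_a * Z_b) for two disjoint groups."""
    d_ab = sum(
        w for pair, w in weights.items() if pair & group_a and pair & group_b
    )
    d_a, d_b = degree(group_a, weights), degree(group_b, weights)
    if d_a == 0 or d_b == 0:
        return 0.0  # a group with no interactions cannot merge with anything
    return d_ab ** 2 / (d_a * d_b * len(group_a) * len(group_b))


# Hypothetical observations: (protein, protein, detection method).
observations = [
    ("P1", "P2", "Affinity capture (MS)"),
    ("P1", "P2", "Two-hybrid"),
    ("P2", "P3", "Two-hybrid"),
    ("P3", "P4", "Reconstituted complex"),
]
weights = edge_weights(observations)
print(synergy({"P1"}, {"P2"}, weights))
print(synergy({"P2"}, {"P3"}, weights))
```

In this toy network P1 and P2 share a heavily weighted edge and few outside interactions, so their synergy is higher than that of P2 and P3, which matches the intent that exclusive, strongly supported interactions favour grouping.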

account how exclusive those interactions are. Two groups are much more likely to form a complex when the groups only have interactions with each other, as opposed to situations when groups have interactions with proteins of many other groups. These relationships are measured by the synergy values. In the use of synergy values, pairs of groups with the highest synergy were merged first until there were no more high synergy pairs available. This scheme enables complexes that have many or many heavily weighted interactions to be merged together, and complexes with lower synergy with other complexes to remain by themselves.

When calculating the degree between two groups $D_{ab}$, we also considered the second-degree relationships (neighbours of neighbours) of each protein. To ensure that the values of first-degree relationships are higher than the second-degree values, thus preserving the topology of the network, first-degree values were multiplied by the maximum edge weight of the graph and an arbitrary constant, in this case, 10. Second-degree values, on the other hand, were calculated as the product of the weight of the two connecting edges (see Supporting Information Fig. 1).

2.4 Protein structure data

Protein structure data were obtained from the RCSB Protein Data Bank (PDB) [14]. A small sample of PDB files is included in the current distribution of the Interactorium.

2.5 The Interactorium

The Interactorium is a 3-D network rendering program built from Skyrails, an experimental graph visualisation platform. In all graphs, data entities and their relationships are represented as nodes and edges, respectively. The spring-embedded algorithm [15] is used to position the nodes in 3-D space. The Interactorium is a C++/OpenGL program, which runs on Linux, Mac and Windows. It can support up to 6000 individual proteins or up to 40 000 proteins if they are grouped into complexes. Provided that there are a maximum of 5000 nodes, the Interactorium can run at 20 frames per second, when run on a quad-core Pentium computer with an NVIDIA GeForce 8600GT video card or equivalent. The Interactorium layout engine runs continuously, which allows changes to layout and selection of features to be made in real time. The Interactorium is available for download at http://www.interactorium.net. Demonstration videos and detailed release notes are also provided.

3 Results

The Interactorium is a Virtual Cell, containing organelles and other cellular localizations. It provides three levels of visualisations of protein interaction networks and proteins: an interaction network view, a protein complex view and a protein structure view (Fig. 1). These are navigable in real time in three dimensions.

3.1 The Virtual Cell

The Virtual Cell presents a cell-like representation of the proteome. Proteins are visualised in the context of the organelles and compartments that hold the proteins. Organelles are represented as large circles inside the cell and have their own specific colours and decorations. Figure 2 shows the virtual Golgi apparatus and virtual mitochondrion. Most localisations such as the nucleus, the nucleolus and the bud are given their own compartments, a 3-D volume where the localised proteins are forced to stay by the layout algorithm. Localisations such as actin and the cytoplasm are not given a compartment, as these localisations can be thought of as existing throughout the whole cell. They are visualised outside of the organelles but inside the boundaries of the display space, which acts as a cell wall. Therefore, proteins belonging to these localisations are not bound to any particular 3-D compartment. Full details on the organelles in the Virtual Cell, and their visual properties, are given in the Release Notes (Supporting Information). An alternative that was considered during the design was to use 3-D shapes for the compartments, such as an elongated

Figure 1. The three levels of visualisation provided by the Virtual Cell. The highest level (left panel) shows the whole interactome in the context of cellular organelles. Zooming in reveals the network interactions between proteins in the context of their complexes (middle panel), while the lowest level of the visualisation shows the 3-D structures of proteins or protein complexes (right panel). The plus sign in the right panel is a bookmarking system: if clicked by the user, the node is stored in a list for later use.


spheroid for the mitochondrion. However, the use of a sphere for the volume and circles to visualise these compartments allowed a simpler technical implementation, and we found that the virtual decorations such as that shown in Fig. 2, combined with a distinguishing colour scheme, provided good differentiation between different compartments.

The Interactorium has an interactive menu system that allows users to highlight all the proteins that belong to a particular localisation; this generates a complete view of all relevant information for that compartment (that is, protein interactions, complexes and structures). Alternatively, the user can navigate the Virtual Cell by clicking on protein nodes; this immediately centres the visualisation on a protein of interest and displays relevant associated data. Detailed instructions on how to navigate through the Interactorium are given in the Release Notes (Supporting Information).

Figure 2. Organelles inside the Virtual Cell have specific colours and decorations to assist in their recognition. (A) The virtual Golgi apparatus. (B) The virtual mitochondrion. In these views, the names of the proteins in the organelles are shown in a colour specific for the organelle. The user can search by navigation or through the menu for a protein of interest inside any organelle or in the entire cell.

Figure 3. The display of protein nodes and their attributes. The description box appears when the user hovers the mouse over a node. The two circles of a node show the protein’s half-life and abundance values. The inner circle indicates half-life: a full circle indicates a 12 h half-life. Incomplete inner circles indicate half-life values less than 12 h, similar to the time frame of a clock’s hour hand. The outer circle represents the abundance value of the protein: the length and thickness of the line are proportional to the abundance value of the protein. For example, the abundance of protein ATP11 (right) is much lower than that of ATP synthase subunits a and b.

3.2 Interaction network

The Interactorium can show the interactions between the proteins themselves. Nodes are displayed as double-bordered circles if half-life and/or abundance data are available for the protein (Fig. 3). The interactions are represented as lines or edges, which connect the protein nodes. As mentioned in Section 2, an edge may represent an interaction detected by a single or multiple experiments. Weightings are used to reflect the relative confidence in the interaction, which is represented by edge thickness. Users can hover the mouse over edges to see what observations support the interactions.

3.2.1 Interaction filter toolbox

An interaction filter toolbox was built to allow interactions of interest to be shown, hidden or highlighted in different ways. The toolbox allows users to change the presence and colour of the edges by protein localisation, evidence type for an interaction or membership of a complex. For example, if the user wishes to highlight all the inter-compartment interactions that have been detected by co-purification experiments, they can first hide all edges, show all the inter-compartment edges and then change all the edges drawn on the basis of co-purification evidence. In this case, the visualisation would show all co-purification edges in the network in red (see Supporting Information Fig. 2).

3.2.2 Representation of protein interactions present in two or more parts of the cell

Most proteins in the Virtual Cell are associated with one localisation. However, it was important to be able to highlight the interactions of proteins known to be present in more than one compartment. One option would be to show nodes that represent such proteins at each of their localisations. However, this would have greatly increased the number of edges in the graph. As an alternative, we developed the notion of “twin” proteins, whereby interactions between two proteins are only displayed if the proteins share subcellular localisation (Fig. 4). Edges between proteins found in more than one cellular compartment are labelled with the word “twin” (Fig. 4C).

An example is the interaction between GYL1 and GYP5. GYL1 has three localisations, whereas GYP5 has four. If the interactions between these localised nodes were to be drawn naively, there would be 12 edges drawn in total. Using this edge reduction scheme, only three edges are drawn between

the three pairs of GYL1 and GYP5 proteins in the bud, the bud neck and the cytoplasm. When applied to the entire Virtual Cell, this technique reduced the number of edges from 11 192 to 7819, a 29% decrease.

3.3 Display of protein complexes

In the visualisations presented above, proteins have been represented as single nodes. Although this is a common representation of the interactome, it is also possible to incorporate a higher level of organisation into the display. To achieve this, we developed the Complex Viewer of the Interactorium.

Proteins have been grouped together into known complexes defined by affinity purification and mass spectrometric techniques. Complexes defined in a metadata analysis of two proteome-wide screens (Hart et al. [13]) are shown as double-bordered circles. Each complex contains smaller circles, which represent the member proteins. Interactions between constituent proteins of a complex and proteins outside the complex are shown as edges; the data for these were from Bertin et al. [12]. Visualisations that include complexes are important, as they can help generate new insights into the higher-level organisation of the cell. Additionally, the complexes serve as a way to simplify the otherwise cluttered graph. Members of protein complexes are automatically collapsed into one metanode when they are in the visual distance. Supporting Information Fig. 3 shows the display of known complexes in the Interactorium.

3.3.1 Autogeneration of complexes

The complexes from Hart et al. [13] contained only 1697 proteins out of the 2860 proteins we are visualising in the yeast proteome. As each of the proteins not described in the Hart et al. [13] data set has one or more interactions, it was possible to group these into new, potential complexes. This was achieved by calculating “synergy” values as mentioned in Section 2. This generated 285 new complexes and 48 doubletons (two-protein complexes). Table 2 summarizes the number of proteins in these complexes, along with their average degree of isolation (weighted proportion of intra-complex interactions of proteins, divided by the total number of intra-complex and inter-complex interactions of proteins in a complex). We observe that the automatic complexes generated have a similar average degree of isolation, although the complexes from Hart et al. [13] show a substantially greater number of interactions within complexes and are at least three times more likely to not have any interactions with external complexes (Supporting

Figure 4. Representation of interactions where proteins are present in more than one cellular location. (A) Explosion in the number of edges if protein nodes are replicated in more than one localisation. If proteins A and B were visualised in the nucleus and cytoplasm or bud neck and nucleolus, respectively, four interactions would be required in the visualisation. (B) The use of specific edges in the graphs to identify “twin” proteins (identical proteins present in more than one place) allows the number of interaction edges to be limited to instances where the interacting proteins exist in the same cellular compartment. (C) Labelled edges (containing the word “twin”) are used to warn the user that the protein is known to exist in more than one location in the cell.
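The edge reduction for “twin” proteins can be illustrated with a short sketch that compares the naive all-pairs drawing with drawing only edges in shared compartments. The function names are hypothetical, and the fourth GYP5 localisation is not named in the text, so a placeholder label is used for it here.

```python
from itertools import product


def naive_edges(locs_a, locs_b):
    """One edge for every pairing of a localised copy of A with one of B."""
    return list(product(sorted(locs_a), sorted(locs_b)))


def twin_edges(locs_a, locs_b):
    """One edge per compartment that both proteins share."""
    return sorted(set(locs_a) & set(locs_b))


# Localisations named in the text for the GYL1-GYP5 example.
gyl1 = {"bud", "bud neck", "cytoplasm"}
# "unnamed fourth site" is a PLACEHOLDER: the text only says GYP5 has four.
gyp5 = {"bud", "bud neck", "cytoplasm", "unnamed fourth site"}

print(len(naive_edges(gyl1, gyp5)), "edges when drawn naively")
print(len(twin_edges(gyl1, gyp5)), "edges with the twin scheme")
```

The counts follow directly from the set sizes: three localisations times four gives the 12 naive edges, while the intersection of the two sets gives the three edges that are actually drawn.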

Table 2. Types and numbers of complexes in the Complex View

Name               Groups   Proteins   Average degree of isolation   Interactions within complex   Complexes with no external interactions
Hart et al. [13]   398      1697       0.3                           2123                          73
Automatic          285      1067       0.16                          697                           15
Doubletons         48       96         N/A                           48                            48
Singletons a)      N/A      3332       N/A                           N/A                           N/A

a) Note that singletons are hidden in the Complex Viewer of the Interactorium.
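The degree of isolation reported in Table 2 can be read as the weighted share of a complex's interactions that stay inside the complex. The sketch below is one possible reading of that definition; the function name, the edge representation as `(protein_a, protein_b, weight)` tuples and the toy data are assumptions for illustration, not the authors' implementation.

```python
def degree_of_isolation(members, weighted_edges):
    """Weighted intra-complex interactions over intra- plus inter-complex ones.

    members        -- iterable of protein names in the complex
    weighted_edges -- iterable of (protein_a, protein_b, weight) tuples
    Returns None when the complex has no interactions at all.
    """
    members = set(members)
    intra = 0.0
    inter = 0.0
    for a, b, weight in weighted_edges:
        ends_inside = (a in members) + (b in members)
        if ends_inside == 2:
            intra += weight   # both partners belong to the complex
        elif ends_inside == 1:
            inter += weight   # one partner lies outside the complex
    total = intra + inter
    return intra / total if total else None


# Toy network (hypothetical proteins A-D).
edges = [("A", "B", 3), ("B", "C", 2), ("C", "D", 1)]
print(degree_of_isolation({"A", "B"}, edges))
```

In this toy case the complex {A, B} has an internal weight of 3 and an external weight of 2, so three fifths of its weighted interactions are internal.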


Information Table 1). Our autogenerated complexes defined numerous protein complexes that are biologically sensible, or already characterized (Fig. 5). These included a group of proteins associated with the peroxisome; a complex that contained translocase of the outer membrane proteins; and another complex that contained three out of the four known members of complex II of the mitochondrial electron transport chain. Note that proteins that were not described in Bertin et al. [12] or Hart et al. [13] have been omitted from our visualisations. Furthermore, we observe cases where the autogeneration algorithm has grouped together the large numbers of proteins that interact with only one common protein in a complex from Hart et al. [13], for example, proteins CDC28 and calmodulin (Supporting Information Fig. 4). Although it is unlikely that these proteins actually form a complex, the autogenerated complexes serve to co-visualize a series of related proteins as one group, instead of dispersed throughout the entire display space. A full list of complexes visualised, both autogenerated and from the Hart et al. [13] data set, is available as Supporting Information.

3.4 Structural display

In addition to displays of the Virtual Cell and interaction networks, the Interactorium can also build interactive 3-D structures of proteins. Within the visualisations described here, users can focus and zoom in on a protein to view its structure, which is automatically built and displayed. In cases where more than one structure has been determined for a protein, users can simultaneously view multiple structures by selecting the “display protein” icon on the protein nodes (Fig. 6). When structures have been determined for protein complexes, it will be possible for these to also be displayed in an identical manner. Currently, there are 618 structures in the PDB database, which pertain to 560 proteins in the yeast interactome. In the current distribution of the Interactorium, only a small number of files are included to keep the download size to a minimum. Users who wish to display structures for other proteins can put the associated PDB files into the appropriate directory of the software (see Release Notes, Supporting Information) and they will then be displayed automatically.

Figure 5. Automatically generated protein complexes in the Interactorium. These are clearly identifiable by their dashed border. (A) This complex contains 18 proteins, all of which are known to be part of the peroxisome. (B) Complex containing translocase of the outer membrane proteins of the mitochondrion. (C) Three proteins of complex II of the mitochondrial electron transport chain are grouped together here, along with the protein TCM62.
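The synergy-based grouping used to autogenerate complexes (Section 2) can be sketched as a greedy merge. This is a simplified illustration under several stated assumptions, not the Interactorium's code: the degree of a group is taken as the total weight of edges touching any member, second-degree relationships are ignored, and the stopping point ("no more high synergy pairs") is modelled by a hypothetical `threshold` parameter whose value the text does not give.

```python
from itertools import combinations


def group_degree(group, edges):
    """ASSUMED reading of 'degree': total weight of edges touching the group."""
    return sum(w for a, b, w in edges if a in group or b in group)


def connecting_weight(g1, g2, edges):
    """D_ab: summed weight of edges that join a member of g1 to a member of g2."""
    return sum(w for a, b, w in edges
               if (a in g1 and b in g2) or (a in g2 and b in g1))


def synergy(g1, g2, edges):
    """synergy = D_ab^2 / (D_a * D_b * Z_a * Z_b), with Z the group sizes."""
    d_ab = connecting_weight(g1, g2, edges)
    d_a = group_degree(g1, edges)
    d_b = group_degree(g2, edges)
    if d_ab == 0 or d_a == 0 or d_b == 0:
        return 0.0
    return d_ab ** 2 / (d_a * d_b * len(g1) * len(g2))


def merge_groups(proteins, edges, threshold):
    """Start from one-protein groups and merge the highest-synergy pair first.

    `threshold` is a HYPOTHETICAL cut-off standing in for the unspecified
    criterion that ends merging.
    """
    groups = [frozenset([p]) for p in proteins]
    while len(groups) > 1:
        g1, g2 = max(combinations(groups, 2),
                     key=lambda pair: synergy(pair[0], pair[1], edges))
        if synergy(g1, g2, edges) < threshold:
            break
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
    return groups


# Toy network (hypothetical proteins): two tight pairs joined by a weak edge.
toy_edges = [("A", "B", 3), ("C", "D", 3), ("B", "C", 1)]
result = merge_groups(["A", "B", "C", "D"], toy_edges, threshold=0.3)
print(sorted(sorted(g) for g in result))
```

In the toy network the pairs A–B and C–D each have a synergy of 0.75 and are merged, whereas the two resulting groups share only one weak edge (synergy 1/64) and therefore stay separate. This mirrors the behaviour described in the text: groups that interact mostly with each other merge, and groups with low synergy remain by themselves.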

4 Discussion

Figure 6. Viewing 3-D structures in the Interactorium. The VPS27 protein has three structures in the PDB database (entries 1Q0V, 1VFY and 2PJW), which can be viewed simultaneously. Here, they are seen from left to right, respectively, highlighting secondary structural elements with ribbons and showing other parts of the structures as wire frames. Structures can be manipulated on-screen (rotation and zoom) and visually compared.

The Interactorium is an interactive visualisation environment for the exploration and analysis of protein-associated data, in the context of the eukaryotic cell. It provides a unique, seamless multi-level view of these data, from protein to complex to interaction network and organelle, removing the need for a series of unconnected software packages. It should be noted that the Interactorium is not a modelling environment, such as E-CELL [7] or the Virtual Cell Modeling and Simulation Framework [8], nor is it an artistic or faithful representation of cellular processes. Instead, we have focused on finding a way in which complicated, independently derived analytical data can be represented on a proteome- or interactome-wide scale while preserving a cellular context. Additionally, we have designed the Interactorium such that information is presented with the levels of visual complexity that are within the capacity of human visual cognition. The use of automatically expanding and collapsing complexes, for

example, shows increased detail for complexes that are close up and of interest but less information for complexes that are far away. We hope that this novel perspective will help provide new insights into the molecular organisation of the cell, including aspects of the spatial organisation of the proteome which are typically overlooked.

In the process of developing this tool, new protein complex groupings were generated, independent of those described in Hart et al. [13]. Some of these reflected those known in databases and/or the literature (e.g. those shown in Fig. 5); all are given in the Supporting Information. The new complex groupings, in contrast to those in Hart et al. [13], were built by synthesising “leftover” interactions into complexes, as the higher-confidence interactions were already identified by Hart et al. [13]. To do this we used a merging heuristic which considered: the weight or evidence for any interaction between proteins, the number and weight of interactions of proteins within a nascent complex versus the number and weight of interactions with proteins external to the nascent complex, and second-degree interactions (those that were two steps away from proteins in a complex). This heuristic is an alternative means of generating complex groupings and considered many types of analytical evidence (e.g. yeast two-hybrid analyses, affinity-purified protein complexes, crystallographic as well as 13 other types). It contrasts with the previous approaches that have used probabilistic and/or machine learning algorithms to categorize interacting proteins into putative multiprotein complexes.

At the current time, the Interactorium is in its first public release and we anticipate further features will be incorporated in the future. Currently, a eukaryotic cell (namely S. cerevisiae) has formed the basis of our Virtual Cell. Transforming the Interactorium to support data from prokaryotes, archaea and plants is a near-term aim so that researchers with interaction data from other organisms can produce similar visualisations. Another aim is to incorporate tools for statistics and network analytics, to allow summary data as well as mathematical properties of the networks to be easily discerned. The incorporation of other aspects of the cell such as the metabolic pathways or signal transduction pathways is a further possibility. User feedback, which we welcome, will be useful to guide the direction of these developments.

Y. Y. W. and C. N. I. P. are the recipients of Australian Postgraduate Awards. M. R. W. and the NSW Systems Biology Initiative acknowledge support from the NSW State Government Science Leveraging Fund and the University of New South Wales.

The authors have no conflicts of interest to declare.

5 References

[1] Shannon, P., Markiel, A., Ozier, O., Baliga, N. S. et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13, 2498–2504.

[2] Hu, Z., Ng, D. M., Yamada, T., Chen, C. et al., VisANT 3.0: new modules for pathway visualization, editing, prediction and construction. Nucleic Acids Res. 2007, 35, W625–W632.

[3] Ho, E., Webber, R., Wilkins, M. R., Interactive three-dimensional visualization and contextual analysis of protein interaction networks. J. Proteome Res. 2008, 7, 104–112.

[4] Ware, C., Franck, G., Evaluating stereo and motion cues for visualizing information nets in three dimensions. ACM Trans. Graph. 1996, 15, 121–140.

[5] Ahmed, A., Dwyer, T., Forster, M., Fu, X. et al., GEOMI: GEOmetry for maximum insight. Proceedings of 13th International Symposium on Graph Drawing, 2005, 468–479.

[6] Marsh, B. J., Toward a “Visible Cell”… and beyond. Aust. Biochem. 2006, 37, 5–10.

[7] Tomita, M., Hashimoto, K., Takahashi, K., Shimizu, T. S. et al., E-CELL: software environment for whole cell simulation. Bioinformatics 1999, 15, 72–84.

[8] Loew, L. M., Schaff, J. C., The Virtual Cell: a software environment for computational cell biology. Trends Biotechnol. 2001, 19, 401–406.

[9] Huh, W., Falvo, J. V., Gerke, L. C., Carroll, A. S. et al., Global analysis of protein localization in budding yeast. Nature 2003, 425, 686–691.

[10] Ghaemmaghami, S., Huh, W., Bower, K., Howson, R. W. et al., Global analysis of protein expression in yeast. Nature 2003, 425, 737–741.

[11] Belle, A., Tanay, A., Bitincka, L., Shamir, R. et al., Quantification of protein half-lives in the budding yeast proteome. Proc. Natl. Acad. Sci. USA 2006, 103, 13004–13009.

[12] Bertin, N., Simonis, N., Dupuy, D., Cusick, M. E. et al., Confirmation of organized modularity in the yeast interactome. PLoS Biol. 2007, 5, e153.

[13] Hart, G. T., Lee, I., Marcotte, E. M., A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinform. 2007, 8, 236.

[14] Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G. et al., The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242.

[15] Eades, P., A heuristic for graph drawing. Congressus Numerantium 1984, 42, 149–160.

10.2 Appendix II - Book Chapters

Pang, CN & Wilkins, MR: Online resources for the molecular contextualization of disease. Methods Mol Med 2008, 141, 287-308.

16

Online Resources for the Molecular Contextualization of Disease

Chi N. I. Pang and Marc R. Wilkins

Summary

Searching online resources can provide medical researchers with an efficient means of gathering existing knowledge on the molecular causes of disease. The researcher may choose to explore areas such as genetic mutations associated with the disease, the function and cellular sub-localization of the associated protein(s), and their protein interaction partners. Using a small case study examining the disease retinoblastoma, this chapter guides the reader through the relevant information contained within these databases. It is shown that the integration of online biological knowledge with genomic and proteomic experimental data provides insights into the understanding of diseases in their molecular context.

Key Words: protein–protein interactions, disease, gene ontology, subcellular localization.

Abbreviations: GO – gene ontology; HPRD – Human Protein Reference Database; PTM – post-translational modification; RB1 – retinoblastoma

1. Introduction

Functional genomics and proteomics approaches are becoming increasingly widely used for the investigation and understanding of disease. Microarray or proteomic techniques are frequently used for the comparison of gene or protein expression between diseased and control tissues, resulting in the identification of genes or proteins that are aberrantly expressed. At this point, investigators typically ask a series of fundamental questions. These include

From: Methods in Molecular Medicine, Vol. 141: Clinical Bioinformatics. Edited by: R. J. A. Trent © Humana Press, Totowa, NJ

1. Is the function of the protein known?
2. Is the protein known to be associated with disease?
3. Are mutations already described for this gene or protein?
4. Is the subcellular location and/or tissue distribution of the protein known?

With increasing studies of protein-protein interactions and their association with disease, the researcher may also wish to ask:

5. Does the protein interact with any others?

This chapter will introduce a number of online resources which the medical researcher can use to contextualize their results from microarray and proteomics experiments. For the sake of clarity, we will write this chapter as a small case study, examining the RB1 gene and protein, associated with the disease retinoblastoma. Whilst this protein was not discovered using microarray or proteomics experiments, instead having been discovered through classical genetics and molecular biology (1), we note that the application of proteomics/microarray experiments has shown the RB1 protein to be under-expressed in association with the disease (2).

2. Methods

There are two starting points that are useful for the contextualization of a gene or protein of interest. These are either with the name of the gene or protein, or with the name of the disease itself. Here we assume that the starting point will be with a gene or protein of interest. Each section is organized under a subheading, being a question or questions which researchers may choose to ask.

2.1. Is the Function of the RB1 Protein Known? Is the Protein Associated with Disease? What Mutations are Known?

2.1.1. Searching the Swiss-Prot Database

Swiss-Prot is a curated protein sequence database. It provides useful annotation, such as the protein’s function, cellular sub-localization, post-translational modifications and amino acid sequence variants. The entry for each protein provides hypertext links to a wide range of useful biological databases (3,4). The Expasy website (http://ca.expasy.org/) allows the user to search for a protein by entering the protein or gene name. For the retinoblastoma-associated protein, the user would need to type “retinoblastoma-associated” into the search field on the top of the Expasy home page. The site then provides a list of proteins to select from. Selecting the appropriate protein name from the results list leads the user to the full Swiss-Prot entry for the retinoblastoma-associated protein, accession number P06400, entry name RB_HUMAN.

There are several parts of the database that are of interest to the reader, notably the comments field and the feature table. The comments field (Fig. 1) provides the reader with information on the protein’s function, subunit information if part of a complex, protein-protein interactions, cellular sub-localization, tissue specificity, post-translational modification (PTM) and any diseases associated with the protein. The feature table (Fig. 2) provides information on amino acid variants of the protein. The “VARIANTS” field gives the position of the amino acid variant in a protein, what amino acid it is mutated to, and the associated disease. The feature table also contains information on the location and type of PTM in the “MOD_RES” field. Information on the enzyme that catalyses the addition of a PTM may also be found here. Note, however, that fatty acid modifications are not recorded as “MOD_RES” but “LIPID,” glycosylation is recorded as “CARBOHYD,” molecular cross-linking as “CROSSLNK” and disulfide bonds as “DISULFID.”

Fig. 1. Comments field for the retinoblastoma-associated protein from the Swiss-Prot database. The accession number for the RB1 protein is P06400.

Fig. 2. Excerpt of the feature table for the retinoblastoma-associated protein from the Swiss-Prot database. The accession number for the RB1 protein is P06400.

There are four classes of PTM reliability in Swiss-Prot. The first and most reliable class concerns modifications that have strong experimental evidence. A further three classes concern PTMs which have not been experimentally verified. The most reliable of these are modifications inferred by taxonomic similarity, and are labeled “by similarity.” PTM information labeled as “probable” has some experimental evidence and should be found in the native protein. Modifications that have been predicted only by protein sequence analysis tools are denoted as “potential.”

In the case of RB1, the protein is annotated in Swiss-Prot as a tumor suppressor, which interacts preferentially with transcription factor E2F1. It is also known to interact with CDK2 and TAF1 proteins. The protein is localized in the nucleus, and may be phosphorylated in five different amino acid positions. The RB1 protein is associated with diseases such as the childhood cancer retinoblastoma, bladder cancer and osteogenic cancer. RB1 is also involved in pinealoma; this is described in the OMIM database discussed later.
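When many feature-table annotations need to be sorted by reliability, the qualifiers described above can be applied programmatically. The following is a minimal sketch only: the feature keys and qualifier wording come from the text, but the function name and the example description strings are hypothetical and do not reproduce any actual Swiss-Prot entry or file format.

```python
# Feature keys under which PTM-related annotations are recorded (from the text).
PTM_FEATURE_KEYS = {
    "MOD_RES": "modified residue",
    "LIPID": "fatty acid modification",
    "CARBOHYD": "glycosylation",
    "CROSSLNK": "molecular cross-link",
    "DISULFID": "disulfide bond",
}

# Qualifiers for annotations that are not experimentally verified (from the text).
QUALIFIERS = ("by similarity", "probable", "potential")


def ptm_reliability(description):
    """Return the reliability class implied by a feature description.

    A description carrying none of the three qualifiers is treated as
    experimentally supported, the first and most reliable class.
    """
    text = description.lower()
    for qualifier in QUALIFIERS:
        if qualifier in text:
            return qualifier
    return "experimental"


# HYPOTHETICAL description strings, for illustration only.
print(ptm_reliability("Phosphoserine (By similarity)"))
print(ptm_reliability("Phosphothreonine"))
```

A simple substring check such as this is enough for a first pass; a real parser would need to follow the database's documented flat-file or XML format.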

2.1.2. Searching the Online Mendelian Inheritance in Man Database

The Online Mendelian Inheritance in Man (OMIM) is a database of genetic disorders. It provides links to literature in the PubMed literature database, to gene sequences in the NCBI database and other data sources. It is a resource used primarily by physicians and other medical practitioners concerned with genetic disorders, and by genetics researchers (5). The OMIM website allows the user to search for a disease by typing multiple keywords in the search box at the top of the home page (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). In the case of RB1, the keyword “retinoblastoma” is sufficient. The identification number for the retinoblastoma database entry in OMIM is 180200.

There are a number of sections of each OMIM entry that are of particular interest. They include the brief description of the disease, its gene map locus, molecular genetics, pathogenesis, gene function and allelic variants. For RB1, it is recorded in the gene function section that RB1 is modulated by phosphorylation and dephosphorylation during different stages of the cell cycle. The RB1 protein is unphosphorylated in the G0/G1 phases, but it is mostly phosphorylated during S and G2 phases (6–8). The allelic variant entries include information on how the nucleic acid mutations affect the gene product.

2.2. Where is the Protein Found in the Body and Inside the Cell?

2.2.1. Searching the Human Protein Atlas

The human protein atlas (www.proteinatlas.org) is an ongoing project that aims to map the localization and expression of every human protein in all tissues of the body. The protein atlas database includes a large number of images of normal tissues and a variety of disease states (see Note 1). These are taken from immunohistochemically stained tissue sections, produced using specific antibodies raised against different antigens in the body. A brown-black color in an image highlights the location where the antibody has bound to its corresponding antigen. A blue stain is used for visualizing microscopic features in the same tissue samples. It may stain both cellular and extracellular materials. Each tissue type is represented by three images from unique patients. For most cancer tissue types, duplicate samples from 12 or more patients have been recorded, although some only have duplicate samples from 4 patients. All images have been analyzed by specialized image analysis software and validated by expert histopathologists. The reliability of each image is recorded in the database (see Note 2) (9).

To search for the retinoblastoma-associated protein in the human protein atlas, “RB1” or “retinoblastoma” is typed into the search field on the atlas home page (www.proteinatlas.org). In the search results page, click on the link under the “Antibody ID” column corresponding to the RB1 gene name. This brings up an overview of RB1 localization throughout the body (Fig. 3). The overview includes an “annotation summary”, describing the cellular sub-localization information for different tissues. For RB1, the summary showed strong and distinct staining in the nucleus of all tissue types, whether they were normal or malignant tissues. Fibroblasts, inflammatory cells and neuronal cells were also stained in the nuclei. In the protein atlas, users have to select images for tissue types most relevant to the disease under investigation. For example (Fig. 4), mutation in the retinoblastoma protein is associated with urothelial carcinoma, commonly called bladder cancer. Normal bladder tissue and urothelial cancer tissue would be appropriate to view for this investigation.

Fig. 3. Tissue distribution of the RB1 protein from the human protein atlas. The top panel shows where the protein is found in normal tissues, and the bottom panel shows where the protein is found in cancer tissues. The page can be accessed at www.proteinatlas.org/tissue_profile.php?antibody_id=95

Fig. 4. Histological view of RB1 protein expression in urothelial cancer. The expanded view can be accessed by clicking on the lower resolution image. This page can be accessed at www.hpr.se/cancer_unit.php?antibody_id=95&mainannotation_id=29007.

2.2.2. Searching a Repository of Gene Expression Data

SymAtlas (symatlas.gnf.org/SymAtlas/) is a database of results from microarray experiments. It contains expression levels for genes of interest from a wide selection of tissue types. Results are collated from human and mouse tissues. Expression levels of genes with no previously known function are included in this database (10). To search SymAtlas, enter one or more accession numbers, gene names or gene ontology (GO) identifiers, separated by a space (see Note 3). The resulting histogram of gene expression levels is of most interest to the user, as it shows which tissues or cell lines are expressing a gene and in which quantity (Fig. 5). The database contains expression information from a number of experiments and sources, and the user may select what is appropriate by using the dataset selection panel on the top of the page. The gene expression is measured by using Affymetrix microarrays. The different gene expression levels may be viewed as separate histograms by selecting from the panel on the top of the page.

A brief browse through the RB1 gene expression level in the Human GeneAtlas GNF1H, MAS5 dataset shows that it is under-expressed in normal bone marrow, as compared to other tissue types. RB1 was under-expressed, albeit detectably, for both the mRNA reporters 203132_at (HG-U133A) and 211540_s_at (HG-U133A). Expression of the RB1 gene was not detectable in retinoblastomas, osteosarcomas or soft tissue sarcomas (2). Therefore, the expression level of the RB1 gene could be used to help determine the exact cause of these cancer types.

Fig. 5. Expression of the RB1 gene in different tissues, as documented in the SymAtlas database (symatlas.gnf.org/SymAtlas/). Here the expression levels are shown from the Human GeneAtlas GNF1H MAS5 dataset using the 203132_at reporter.

2.2.3. Understanding Subcellular Protein Localization

The human protein atlas project, above, is generating some information on the subcellular localization of proteins. For example, the annotation summary section for the RB1 protein states that RB1 is localized in the nucleus for almost all tissues, whether they are normal or malignant. However, no large-scale experiment has been undertaken to date to accurately determine the precise sub-cellular localization of all human proteins. As an alternative, subcellular protein localization data can be obtained from studies on individual proteins. This is available in the Swiss-Prot database, and is systematized by the Gene Ontology. The Swiss-Prot database can be queried for RB_HUMAN. The GO terms in the Swiss-Prot database, under the cross-reference field, indicate that the protein is localized in the nucleus and the chromatin. The comments field of Swiss-Prot will sometimes also provide a description of the sub-cellular localization.

2.3. Where is the Protein Found in Biochemical Pathways, Cellular Reactions or the Reactome?

2.3.1. The Reactome Project

The reactome project (www.reactome.org) is a curated resource of pathways and reactions. It represents these pathways in graphical as well as tabular format. It draws on information from, and is hypertext linked to, other resources including KEGG (Kyoto Encyclopedia of Genes and Genomes), the Gene Ontology (GO) and the metabolite database (Chemical Entities of Biological Interest; ChEBI) (11,12). The user may search for the context of a protein in the reactome by typing the gene name in the text box at the top of the front page. From the summary results page, click through to the reaction section (see Note 4). The name RB1 can be used to search for the reactions described on that results page (see Note 5).

Searches of the reactome database pinpoint the exact reactions in which a protein participates. Each reaction is described both with a flow chart figure and in text. The inputs of the reaction are described as well as the products: these include small molecules that may be substrates or products in enzyme-mediated reactions. The hypertext links to databases like KEGG and GO help the researcher to learn more about the reaction pathway(s) of interest. For RB1, the reaction is described as “Replication initiation regulation by Rb1/E2F1” (see Fig. 6) and the detail of the reaction is described: “Rb1 is normally hyperphosphorylated by CycD/CDK4/CDK6 and Cyclin E/CDK2 for transition into S-phase. PP2A can then reverse this reaction, in this case, in response to DNA damage induced checkpoint.”

Fig. 6. Summary results page from the reactome project for protein RB1. Note that this page immediately contextualizes the protein into a pathway, and gives details on the reaction inputs and outputs.

2.3.2. The Gene Ontology

Gene ontology (GO) (www.geneontology.org) is a project aimed at providing a controlled and consistent vocabulary for the description of gene and protein function, and a means to classify these into biologically meaningful categories. It is a global system and is applicable to all species. There are three categories in GO: cellular component, biological process and molecular function. For any gene or gene product, cellular component describes the part of a cell where the entity is found, for example, rough endoplasmic reticulum or proteasome. A biological process is a series of molecular functions or processes performed by assemblies of biomolecular entities, for example, signal transduction or pyrimidine metabolism. A molecular function describes an activity at a molecular level, for example, a catalytic activity, a type of binding, and more specifically, Toll receptor binding. The GO definitions are like a hierarchy, with the exception that a child GO term may have more than one parent. For instance, the term hexose biosynthesis has two parents: hexose metabolism and monosaccharide biosynthesis. If the child GO definition is annotated to a gene, the annotation automatically cascades to the parental GO terms in a recursive manner (13).

Gene ontology is not a database of gene product names, or a database recording the attributes of sequences such as gene introns and exons. It does not describe protein tertiary structures or protein-protein interactions. Terms unrelated to the normal function of any gene, for example oncogenesis, are also not included. In addition, any descriptors that are above the level of cellular component, such as anatomical or histological features and cell types, are not described. Other broad categories such as gene evolution and gene expression are similarly not addressed.

A useful web-based tool for browsing GO is quickGO (www.ebi.ac.uk/ego/index.html). A general method of searching is to type in the UniProt accession number of the gene (see Note 6). For the retinoblastoma protein, the UniProt accession number is P06400. GO can provide insights into the molecular functions of a protein in one or more cellular processes. This helps the researcher find genes that are involved in similar molecular processes and functions, or are of the same cellular component. Genes with common GO terms also have a high chance of association through protein-protein interactions, or may be found in the same diseases or pathways. It is noteworthy that GO provides synonyms to terms of interest. This can provide researchers with appropriate keywords to effectively search other databases which do not use the GO vocabulary.

A search of the Swiss-Prot database for the RB1-associated GO terms reveals a set of classifications for RB1 cellular component, molecular function and biological process (see Table 1). Each of these classifications has an associated acyclic graph with the name of the GO term, all the GO term's parents and the “ancestral” GO terms associated with them. These graphs, which can be accessed from quickGO (see Fig. 7), give a clear view of the hierarchy of the ontology for RB1 and illustrate the assigned function(s). In the case of RB1, this is “regulation of progression through the cell cycle.” Note that it is common for one protein to map to more than one part of the gene ontology. In part, this is because proteins can be multifunctional, but also because a single protein can be involved in more than one process.
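The recursive cascade of annotations from a child GO term to its parents, described above, can be illustrated with a minimal Python sketch. Only the hexose biosynthesis term and its two parents come from the text; the shared grandparent term, the gene name `geneX` and the function names are assumptions made for illustration, and the sketch is not how the GO consortium or quickGO implement this.

```python
from collections import deque

# Hypothetical fragment of the GO graph: child term -> list of parent terms.
# Only the two parents of "hexose biosynthesis" come from the text; the
# grandparent term is an illustrative assumption.
PARENTS = {
    "hexose biosynthesis": ["hexose metabolism", "monosaccharide biosynthesis"],
    "hexose metabolism": ["monosaccharide metabolism"],
    "monosaccharide biosynthesis": ["monosaccharide metabolism"],
    "monosaccharide metabolism": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a term with several children is only visited once
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents):
    """Expand direct annotations so each gene also carries all ancestor terms."""
    expanded = {}
    for gene, terms in direct_annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded


# "geneX" is a made-up gene annotated directly with the child term only.
annotations = propagate({"geneX": ["hexose biosynthesis"]}, PARENTS)
print(sorted(annotations["geneX"]))
```

Because a term may have more than one parent, the graph is traversed with a visited set rather than as a simple tree, so a shared ancestor is added only once.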

2.3.3. Searching BioCarta

BioCarta is a curated database of biological pathways. Its particular strength is that it visualizes the pathway of interest and has legends that are easy to interpret. The icons for the protein provide links to other databases, such as OMIM and Swiss-Prot. It differs from the Gene Ontology in that it does not seek to provide a global context for a particular gene or protein, instead focusing on the local environment and pathways in which a protein participates. For each pathway, a detailed textual description is given, as well as contact details for a “pathway expert”.

To use BioCarta, the user first needs to access the main page at www.biocarta.com. Click on the “Pathways” tab at the top of the index page, and in the subsequent page, search by using the “gene name” text box under the section “search pathways by title”. For example, the gene name “RB1” would be used for the retinoblastoma protein. This produces a list of pathways in which the protein of interest is involved. Choose the pathway of interest for investigation by clicking on the pathway name. For the retinoblastoma tumour pathway, the pathway of most relevance is “RB Tumor Suppressor/Checkpoint Signaling in response to DNA damage.” This is shown in Fig. 8.

Table 1 Gene ontology (GO) terms assigned to the RB1 retinoblastoma-associated protein

GO Classification     Description

Cellular component    • chromatin
                      • nucleus

Molecular function    • androgen receptor binding
                      • protein binding
                      • transcription coactivator activity
                      • transcription factor activity

Biological process    • androgen receptor signaling pathway
                      • cell cycle checkpoint
                      • G1 phase
                      • M phase
                      • negative regulation of cell growth
                      • negative regulation of protein kinase activity
                      • negative regulation of transcription from RNA polymerase II promoter
                      • positive regulation of transcription, DNA-dependent

Fig. 7. Acyclic graph showing how the RB1 protein maps onto the gene ontology (GO). This page can be accessed at www.ebi.ac.uk/ego/DisplayGoTerm?id=GO:0000074. Note that there are six levels to this part of the ontology (biological process). The lowest level of the ontology here is “regulation of progression through the cell cycle”.

2.4. Are Any Protein-Protein Interactions Known?

Broadly speaking, there are two types of protein–protein interaction data. There are those from high throughput studies using techniques such as yeast two-hybrid or affinity purification of protein complexes and those that result from the study of individual proteins. For the latter, curation of literature can be used to generate a large dataset of interactions. For humans, there are two large, high throughput studies to date (see Note 7) (14,15). Whilst extensive, these studies together represent less than a few percent of the human interactome. Accordingly, when searching for interaction partners of human proteins, it is necessary to use resources that consider interactions documented in both small and large-scale studies. Databases such as the HPRD (www.hprd.org) and IntAct contain such information. The IntAct database will be discussed here, because of the extensive cross-linking with other EBI resources (16).

Fig. 8. The BioCarta illustration for the “RB Tumor Suppressor/Checkpoint Signalling in response to DNA damage” pathway, accessible from page www.biocarta.com/pathfiles/h_rbPathway.asp. Note that this pathway figure is accompanied by a detailed description.

2.4.1. The IntAct Database of Protein–Protein Interactions

IntAct (www.ebi.ac.uk/intact/site/) is a database for protein-protein interaction data. The interaction data result from user submission or from the curation of published literature. IntAct allows the user to search for interaction partners for human proteins. The user can search the IntAct database by entering a Swiss-Prot identification number in the search box on the front page. For retinoblastoma-associated protein, the RB_HUMAN identifier is used. The results of this type of search are the proteins which are known to directly interact with the protein of interest. In the case of RB_HUMAN, these are proteins Cdk2, Taf1, and Pa2g4 (see Fig. 9).

It is usually of interest to visualize the interactions that a protein participates in and how this fits into a local or global interaction network. IntAct allows interactions to be visualized as scale-free graphs with the Hierarch viewer. Proteins of interest are selected from the list of interacting proteins (Fig. 9) and the graph button then selected (see Note 8). The resulting scale-free graphs display the proteins as nodes (protein names) and the interactions between proteins as edges (lines). The protein of interest appears in the center of the graph. Figure 10 shows the local interaction network for the retinoblastoma-associated protein.

The Hierarch viewer, as described above, also provides additional contextual information for a protein and its interactors. A list of GO terms and protein functional domains from InterPro (see Note 9) (17) is given on the right-hand side of the viewer web page (see Note 10). Neighboring proteins are likely to be involved in similar functions to the protein of interest and may contribute to the molecular basis for the development of the disease. The importance of a protein may be related to the number of interactions in the protein-protein interaction network. Highly connected proteins are thought to be more highly associated with disease.

Fig. 9. Search results from the IntAct database for the retinoblastoma-associated protein. This shows that the protein directly interacts with three other proteins.

Fig. 10. Contextualization of the retinoblastoma-associated protein in the local protein-protein interaction network. The RB_HUMAN protein is at the center of the view, and other proteins up to two interactions away from it are shown. The right-hand side of the viewer provides access to links to the gene ontology and domain-based information.
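The local network view described in Section 2.4.1 (a protein of interest at the center, with other proteins up to two interactions away) amounts to a breadth-first search over an undirected graph. The following Python sketch illustrates the idea using only the interactions named in this chapter: the three direct partners of RB_HUMAN, and the indirect link to CDC2 through CDK2 mentioned in Note 11. The function names are assumptions, and the sketch is not the algorithm used by the Hierarch viewer.

```python
from collections import deque

# Interactions named in the text: RB_HUMAN binds Cdk2, Taf1 and Pa2g4
# directly, and CDC2 is connected to RB1 only through CDK2 (Note 11).
EDGES = [
    ("RB", "CDK2"),
    ("RB", "TAF1"),
    ("RB", "PA2G4"),
    ("CDK2", "CDC2"),
]


def build_graph(edges):
    """Build an undirected adjacency map: protein -> set of partners."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def neighborhood(graph, start, max_steps):
    """Breadth-first search returning protein -> distance, up to max_steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the requested radius
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


graph = build_graph(EDGES)
local = neighborhood(graph, "RB", 2)

# Degree (number of interaction partners) as a simple connectivity measure.
degree = {protein: len(partners) for protein, partners in graph.items()}

for protein, steps in sorted(local.items(), key=lambda item: item[1]):
    print(protein, steps, degree[protein])
```

The degree count is the simplest way to express how highly connected a protein is; whether it reflects disease association depends on how complete the underlying interaction data are.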

2.5. What Types of Post-Translational Modifications does the Protein Carry?

2.5.1. The Swiss-Prot Database

As a final consideration, it can be of interest to understand the post-translational modifications that are known to be carried by a protein. This can provide clues to protein localization and function. The Swiss-Prot database is an excellent source of post-translational modification information, where all modifications are collated from the literature. To find post-translational modifications in Swiss-Prot, the user needs to refer to the relevant MOD_RES annotations in the feature table of each database entry. Various modifications are described therein, including fatty acids, glycosylation and reversible modifications such as phosphorylation, methylation, and acetylation. Disulfide bridges of proteins are also documented in this part of the database. For the retinoblastoma-associated protein, Swiss-Prot documents phosphorylation at amino acid positions 249, 252, 373, 807, and 811. It also notes that this phosphorylation is mediated by the kinase CDC2 (see Note 11). Figure 11A shows the relevant portion of the feature table for the retinoblastoma-associated protein.

Fig. 11. Post-translational modifications documented for the retinoblastoma-associated protein: (A) Portion of the Swiss-Prot feature table for entry RB_HUMAN, detailing which amino acids are phosphorylated. (B) Similar entry from the HPRD database. See text for details.
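Reading the MOD_RES annotations of a feature table by eye works for a single protein; for several entries, the same step can be scripted. The Python sketch below is a minimal illustration under stated assumptions: the sample lines are hand-written to resemble a Swiss-Prot feature table and are not copied from the RB_HUMAN entry. The five positions are those quoted in the text, whereas the description wording, the extra DOMAIN line and the function name `parse_mod_res` are placeholders.

```python
# Hand-written sample lines (assumption): the positions are those quoted in
# the text; the layout and description wording only approximate a real entry.
SAMPLE_FEATURE_TABLE = """\
FT   MOD_RES     249    249       Phosphorylation; by CDC2.
FT   MOD_RES     252    252       Phosphorylation; by CDC2.
FT   MOD_RES     373    373       Phosphorylation; by CDC2.
FT   MOD_RES     807    807       Phosphorylation; by CDC2.
FT   MOD_RES     811    811       Phosphorylation; by CDC2.
FT   DOMAIN        1     10       Placeholder feature that is not a PTM.
"""


def parse_mod_res(feature_table):
    """Return (start, end, description) for every MOD_RES feature line."""
    sites = []
    for line in feature_table.splitlines():
        # Split into: line code, feature key, start, end, free-text description.
        fields = line.split(None, 4)
        if len(fields) < 5 or fields[0] != "FT" or fields[1] != "MOD_RES":
            continue  # skip other feature keys and malformed lines
        start, end, description = int(fields[2]), int(fields[3]), fields[4]
        sites.append((start, end, description))
    return sites


sites = parse_mod_res(SAMPLE_FEATURE_TABLE)
positions = [start for start, _end, _description in sites]
print(positions)
```

Splitting on whitespace keeps the sketch short; a parser for real entries would need to handle continuation lines and the exact column layout of the release being used.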

2.5.2. The Human Protein Reference Database

The Human Protein Reference Database (HPRD) (www.hprd.org) is a database which centralizes protein annotation. It contains information such as functional domains, co- and post-translational modifications, protein-protein interactions and associations with mutations and disease. The information is curated by experts who mine literature references and published data, and is accessible via a web-based interface (18). Unlike the Swiss-Prot database, the HPRD database only provides annotation and sequence information for human proteins. As a result of this focus, HPRD may provide more post-translational modification data than Swiss-Prot for human proteins. It also provides detailed information on the enzyme which catalyses the addition of the post-translational modification (PTM) on a protein, and useful links to the literature reference(s) that discovered the modification. This latter feature is not available in the Swiss-Prot database. Hypertext links to literature references for information such as tissue distribution, subcellular localization and disease association of proteins are also provided (18).

To access modification information in HPRD, the user first needs to access the query interface via the database's front page (www.hprd.org). This is done by selecting the query button on the top left-hand corner of the page. The user may then search HPRD for the protein of interest by using a Swiss-Prot accession number, in this case P06400 for the retinoblastoma-associated protein (see Note 12). At the results page, click on the tab for “PTMs and Substrates.” A small cartoon in the PTM annotation page illustrates approximately where each post-translational modification is found on the protein. The upstream enzymes thought to be responsible for the addition of the PTM are noted for each modified amino acid. Each modification is linked to a literature resource, which can be seen by clicking on the amino acid position number in the table. The RB1 protein is phosphorylated at 16 positions in the HPRD database (Fig. 11B), as compared to the five positions annotated in Swiss-Prot. It is mainly phosphorylated by proteins that regulate the cell division cycle: cell division cycle 2 (CDC2 or CDK1), Cyclin D2, and the two cyclin-dependent kinases CDK2 and CDK4.

2.5.3. Predicted Post-Translational Modifications

Databases of predicted post-translational modifications may be relevant and of interest to the researcher. Predictions in these databases must, however, be used with caution, as they may be of low quality: they may include false predictions and also miss real modification sites. Databases containing predicted modifications include dbPTM (http://dbptm.mbc.nctu.edu.tw/index.html), which contains predicted modifications for many proteins, including the retinoblastoma-associated protein (19).

3. Notes

1. Human Protein Atlas, normal tissue: It is often difficult to obtain normal human tissues, since they are derived from surgical material. Therefore, normal is defined in this context as close to normal, and samples would include alterations due to inflammation, degeneration and tissue remodeling.

2. Human Protein Atlas, how to interpret reliability of results: A number of colored circles are found beside the name of the tissue type, whether they are normal or malignant tissues. Each colored circle represents the specific tissue type from one individual. A different color code is assigned to annotate the intensity and abundance of immunoreactivity (red = strong, orange = moderate, yellow = weak, white = no staining, and black = missing tissue). The circle is divided evenly such that each section represents replicate samples from the same tissue type.

3. Usage of SymAtlas: Wildcard characters ? and *, which represent one and any number of characters respectively, may be used. The search results list appears on the side panel of the webpage. Click on the little picture icon next to the appropriate gene name, under the Homo sapiens section. This will link to the gene expression chart. Clicking on the gene name will lead to a list of annotations and hyperlinks to other databases. The gene expression histogram may be accessed by selecting a dataset under the functional data section.

4. Usage of Reactome: Upon searching using the gene name, the results page will be shown. The matches are classified into categories, for example: physical entity, reference entity, summation, reaction coordinates and reaction.

5. Reactome's page for RB1: The search for RB1 should get to this page: http://www.reactome.org/cgi-bin/eventbrowser?DB=gk_current&ID=113643& (20)

6. QuickGO search result list: The search result will show every match to the query within the categories of biological process, molecular function and cellular component. Each GO term is attached to an identification number in the format “GO:” followed by a 7 digit number. An example GO ID number for the RB1 protein is: “regulation of progression through cell cycle, GO:0000074.”

7. Protein-protein interactions: RB1 protein is part of the Rual et al. yeast two-hybrid study on human proteins (14). However, it is not present in the Stelzl et al. human yeast two-hybrid study (15).

8. IntAct search result list: You may need to select a number of interactor proteins to see the context of a protein in its interaction network. The “Select All” or “Clear All” button may be used to select and deselect the whole list of proteins. The “Path” button will show the minimal number of protein-protein interactions, or the minimally connecting network, for the selected set of proteins.

9. Interpreting IntAct results: A protein domain is a polypeptide chain that can fold autonomously into a structural unit. Some domains have a common evolutionary origin and molecular function (21).

10. Interpreting IntAct results: The list of GO terms can assist in understanding the biological function of the protein-protein interaction sub-network. The “show” button beside each GO and InterPro entry on the right-hand side of the web page allows the user to highlight proteins with that annotation in the graph. The count indicates the number of proteins in the graph to which the GO or InterPro entry applies. The user can click the hypertext links to browse the details of the relevant GO terms or access biological information about protein domains of interest.

11. Protein phosphorylation: CDC2 is shown to interact indirectly with RB1 through CDK2 in Figure 10. This may be due to a missing interaction not documented in IntAct. Another explanation would be that phosphorylation of RB1 by CDC2 is dependent on CDK2, such that phosphorylation would only occur if the three proteins are in a complex.

12. Using the HPRD database: There are many other methods of searching the database, for example, searching by PTMs, molecular class and cellular component. The user may browse the database by clicking the “Browse” button on the top left-hand corner of the web page, and accessing the browsing interface. This gives the list of all possible searches available for each category, for instance, functional domains, PTMs and cellular sub-localization.
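The wildcard behaviour described in Note 3 (? for exactly one character, * for any number of characters) can be tried out locally with a short Python sketch. It uses the standard-library `fnmatch` module, which follows the same two conventions; this is an analogy, not the SymAtlas search engine itself, and the list of gene symbols is an illustrative assumption rather than a SymAtlas result.

```python
from fnmatch import fnmatchcase

# Illustrative list of gene symbols (assumption); not a SymAtlas result set.
GENE_NAMES = ["RB1", "RBL1", "RBL2", "RBBP4", "E2F1"]


def search(pattern, names):
    """Return the names matching a pattern where ? is one character and * is any run."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(search("RB?", GENE_NAMES))   # names with exactly one character after "RB"
print(search("RB*", GENE_NAMES))   # names starting with "RB"
print(search("RBL?", GENE_NAMES))  # names with exactly one character after "RBL"
```

Note that `fnmatch` also gives square brackets a special meaning, which the note does not mention for SymAtlas, so patterns containing brackets may behave differently.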

References

1. Fung, Y. K., Murphree, A. L., T’Ang, A., Qian, J., Hinrichs, S. H., and Benedict, W. F. (1987) Structural evidence for the authenticity of the human retinoblastoma gene. Science 236, 1657–1661.
2. Weichselbaum, R. R., Beckett, M., and Diamond, A. (1988) Some retinoblastomas, osteosarcomas, and soft tissue sarcomas may share a common etiology. Proc. Natl. Acad. Sci. U.S.A. 85, 2106–2109.
3. Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res. 33, D154–D159.
4. Wu, C. H., Apweiler, R., Bairoch, A., Natale, D. A., Barker, W. C., Boeckmann, B., et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34, D187–D191.
5. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517.
6. Buchkovich, K., Duffy, L. A., and Harlow, E. (1989) The retinoblastoma protein is phosphorylated during specific phases of the cell cycle. Cell 58, 1097–1105.
7. Chen, P. L., Scully, P., Shew, J. Y., Wang, J. Y., and Lee, W. H. (1989) Phosphorylation of the retinoblastoma gene product is modulated during the cell cycle and cellular differentiation. Cell 58, 1193–1198.
8. DeCaprio, J. A., Ludlow, J. W., Lynch, D., Furukawa, Y., Griffin, J., Piwnica-Worms, H., et al. (1989) The product of the retinoblastoma susceptibility gene has properties of a cell cycle regulatory element. Cell 58, 1085–1095.
9. Uhlen, M., Bjorling, E., Agaton, C., Szigyarto, C. A., Amini, B., Andersen, E., et al. (2005) A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol. Cell. Proteomics 4, 1920–1932.
10. Su, A. I., Cooke, M. P., Ching, K. A., Hakak, Y., Walker, J. R., Wiltshire, T., et al. (2002) Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 99, 4465–4470.
11. Joshi-Tope, G., Gillespie, M., Vastrik, I., D’Eustachio, P., Schmidt, E., de Bono, B., et al. (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 33, D428–D432.
12. Joshi-Tope, G., Vastrik, I., Gopinath, G. R., Matthews, L., Schmidt, E., Gillespie, M., et al. (2003) The Genome Knowledgebase: a resource for biologists and bioinformaticists. Cold Spring Harb. Symp. Quant. Biol. 68, 237–243.
13. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29.
14. Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., et al. (2005) Towards a proteome-scale map of the human protein–protein interaction network. Nature 437, 1173–1178.
15. Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F. H., Goehler, H., et al. (2005) A human protein–protein interaction network: a resource for annotating the proteome. Cell 122, 957–968.
16. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., et al. (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res. 32, D452–D455.
17. Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res. 33, D201–D205.
18. Peri, S., Navarro, J. D., Amanchy, R., Kristiansen, T. Z., Jonnalagadda, C. K., Surendranath, V., et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 13, 2363–2371.
19. Lee, T. Y., Huang, H. D., Hung, J. H., Huang, H. Y., Yang, Y. S., and Wang, T. H. (2006) dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 34, D622–D627.
20. Gopinathrao, G. (2004) Replication initiation regulation by Rb1/E2F1 [Homo sapiens]. Reactome project. Viewed 20 September 2006 http://www.reactome.org/cgi-bin/eventbrowser?DB=gk_current&ID=113643&.
21. Mount, D. M. (2001) Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, New York, NY.

10.3 Appendix III - Publication highlights

Chapter 2 of this thesis was featured in the same issue of Proteomics in which it was published: In this issue: Proteomics 3/2008. Proteomics 2008, 8.

This article can be accessed via the following website: http://onlinelibrary.wiley.com/doi/10.1002/pmic.200890005/abstract.
