Research Collection

Doctoral Thesis

Modular organization of proteomes: new insights into tissue homeostasis and epigenetic control

Author(s): Hauri, Simon Karl

Publication Date: 2013

Permanent Link: https://doi.org/10.3929/ethz-a-010105188

Rights / License: In Copyright - Non-Commercial Use Permitted


DISS. ETH NO. 21312

Modular Organization of Proteomes: New Insights into Tissue Homeostasis and Epigenetic Control

A dissertation submitted to ETH ZURICH

for the degree of Doctor of Sciences

presented by
SIMON KARL HAURI
MSc, University of Basel
born January 4, 1983
citizen of Hirschthal AG

accepted on the recommendation of
Prof. Dr. Ruedi Aebersold, examiner
Dr. Matthias Gstaiger, co-examiner
Dr. Christian Beisel, co-examiner
Dr. Nic Tapon, co-examiner

2013

I. Zusammenfassung

In the past, the function of a protein was determined on the basis of genome-wide phenotypic analyses in model organisms. Today we know that most proteins carry out their function together with other proteins in macromolecular protein complexes. In many cases the formation of these complexes is a dynamic process, and the molecular function can be influenced by the association or dissociation of proteins. Depending on the cellular state, protein complexes can change their composition and adapt their function to the biological conditions. In many genetically caused diseases, such as cancer, the correct formation of protein complexes can be prevented by genetic mutations. It is therefore of great importance to better understand the molecular mechanisms of protein complex formation. Classical biochemical protein analysis is laborious, and the characterization of a single protein complex can often fill an entire doctoral thesis. To gain a global understanding of how proteins are organized in complexes, however, we need methods that can measure many protein interactions simultaneously. Proteomics is the study of all proteins and their properties within a cellular system. In recent years this field of research has made enormous progress, above all through protein identification by mass spectrometry. Coupled with protein affinity purification (affinity purification mass spectrometry, or AP-MS), this powerful technology makes it possible to study protein interactions systematically and quantitatively with unprecedented accuracy. During my doctoral work I used an established AP-MS method to study two extensive protein interaction networks: the human Hippo signaling pathway and the compendium of human Polycomb group (PcG) protein complexes. The Hippo signaling pathway controls tissue homeostasis and organ size in multicellular organisms. PcG proteins are chromatin regulators and are involved in epigenetic control mechanisms of gene expression. For both systems we found new and known protein interactions and were able to use the network topology to define new functionally enriched modules and protein complexes.

In summary, this doctoral thesis addresses the challenges of the mass spectrometric determination of protein complexes and presents two high-resolution interaction networks that are currently of the greatest biological interest. In addition, I was involved in two further AP-MS related studies: the assessment of the reproducibility of AP-MS data between different laboratories and the development of a database for the identification of non-specific contaminant proteins in AP-MS experiments.


II. Summary

In the past, the function of a protein was determined on the basis of genome-wide phenotypic screens in model organisms. Today we know that most proteins carry out their function in the context of protein complexes. In many cases, protein complexes are dynamic systems and their molecular function can be affected by the association and dissociation of proteins. Depending on the cellular state, protein complexes can change their composition and adapt their function to overcome biological challenges. In many genetic diseases, including cancer, the proper formation of protein complexes can be disturbed by genetic mutations of associating proteins. Therefore, it is of great importance to study the molecular mechanisms involved in protein complex formation. Classical biochemical analysis of proteins is a tedious task and, more often than not, the characterization of a single protein complex has been the topic of an entire PhD thesis. To reach a global comprehension of how proteins are organized in the cell, we need methods capable of measuring many protein interactions at the same time. Proteomics is the large-scale study of proteins and their properties within a living organism. In the last few years, proteomics has advanced tremendously thanks to protein identification by mass spectrometry. Combined with affinity purification (AP-MS), this potent technology allows systematic quantitative studies of protein interactions at near physiological conditions and at unprecedented resolution. During my PhD thesis I used an integrated AP-MS workflow to study two large protein interaction networks: the human Hippo growth signaling pathway and the human Polycomb Group (PcG) protein complexes. The Hippo pathway controls tissue homeostasis and organ size in developing organisms. PcG protein complexes are chromatin regulators involved in epigenetic control. For both systems we identified many novel and known protein interactions and were able to determine a network topology that allowed us to define functionally enriched modules and novel protein complex assemblies.

In summary, this PhD thesis discusses the challenges of mass spectrometry based interaction proteomics and presents two high-resolution protein interaction networks of great biological importance. I was also involved in two additional studies: the assessment of the inter-laboratory reproducibility of our AP-MS protocol and the assembly of a repository for common contaminant proteins in AP-MS experiments.


III. Abbreviations

ABCP  apico-basal cell polarity
AP-MS  affinity purification mass spectrometry
CID  collision induced dissociation
CMV  cytomegalovirus
co-IP  co-immunoprecipitation
ComPASS  Comparative Proteomics Analysis Software Suite
DIA  data independent acquisition
DLR  dual luciferase reporter assay
DUB  deubiquitinating enzyme
ESI  electrospray ionization
FDR  false discovery rate
FERM  Four-point-one, Ezrin, Radixin, Moesin domain
GFP  green fluorescent protein
H3K27me3  histone H3 lysine-27 trimethylation
HCIP  high confidence interacting protein
Hpo  Hippo
L27  Lin2 Lin7 domain
LC-MS/MS  liquid chromatography tandem mass spectrometry
LOD  limit of detection
LOQ  limit of quantification
LTQ  linear ion trap
m/z  mass to charge ratio
NSAF  normalized spectral abundance factor
PcG  Polycomb group
PCP  planar cell polarity
PDZ  PSD-95, Dlg, ZO-1 domain
PINA  protein interaction network analysis platform
PPI  protein-protein interaction
PRC  Polycomb repressive complex
QUBIC  Quantitative BAC InteraCtomics
RFP  red fluorescent protein
SAINT  Significance Analysis of INTeractome
SARAH  Salvador Rassf Hippo domain
SEC  size exclusion chromatography
SRM  selected reaction monitoring
STRIPAK  striatin-interacting phosphatase and kinase complex
TPP  trans-proteomic pipeline
Y2H  yeast two hybrid

V. Table of Contents

I. Zusammenfassung
II. Summary
III. Abbreviations
V. Table of Contents

Chapter 1 Introduction to Interaction Proteomics
1.1. Abstract
1.2. Introduction
1.3. Mass spectrometry based Interaction proteomics
1.3.1. Protein Identification
1.3.2. Protein quantification
1.3.3. Targeted proteomics by SRM and SWATH-MS
1.3.4. Analysis of Protein interactions by Affinity purification mass spectrometry
1.3.5. Data filtering strategies for unspecific interacting proteins
1.3.6. Generating large scale interaction maps to guide upcoming and targeted studies
1.3.7. Inference of Protein complex stoichiometry by absolute quantification
1.3.8. Profiling of dynamic changes in interaction proteomes
1.4. Cross-linking and AP-MS
1.5. Sub-fractionation of full cell lysates and purified complexes
1.6. Conclusions and Outlook
1.7. References

Chapter 2 Interaction Proteome of Human Hippo Signaling: Modular Control of the Transcriptional Co-activator YAP1
2.1. Abstract
2.2. Introduction
2.3. Results and Discussion
2.3.1. Characterization of the human Hippo interaction proteome by a systematic AP-MS approach
2.3.2. Hierarchical clustering assigns Hpo pathway components to interaction modules
2.3.3. Topology of the Hippo Core Kinase Complex
2.3.4. The PP1-ASPP module provides links to apico-basal and planar cell polarity
2.3.5. PP1/ASPP2 complexes promote YAP1 activity
2.3.6. A Cell Polarity Network linked to L2GL1, Kibra, Merlin and YAP1
2.4. Conclusions
2.5. Experimental Procedures
2.6. References
2.7. Supplementary Materials
2.7.1. Supplementary tables
2.7.2. Supplementary figures

Chapter 3 Characterization of the Human Polycomb Group Interaction Proteome
3.1. Abstract
3.2. Introduction
3.3. Results
3.3.1. Systematic profiling of the human PcG interaction proteome
3.3.2. Hierarchical clustering assigns HCIPs to PcG complexes
3.3.3. The PRC1 module
3.3.4. The PRC2 module
3.3.5. PR-DUB
3.4. Discussion
3.5. Experimental Procedures
3.6. References
3.7. Supplementary materials
3.7.1. Supplementary tables
3.7.2. Supplementary figures

Chapter 4 Interlaboratory Reproducibility of Large-scale Human Protein-complex Analysis by Standardized AP-MS
4.1. Abstract
4.2. Introduction
4.3. Results
4.3.1. Cell lines generation, sample preparation and MS analysis
4.3.3. Reproducibility analysis
4.3.4. Final network
4.3.5. A STE20 kinase family network
4.3.6. ILK network
4.4. Discussion
4.5. Online Methods
4.6. References
4.7. Supplementary Materials
4.7.1. Supplementary Figures and Tables
4.7.2. Supplementary Notes

Chapter 5 The CRAPome: a Contaminant Repository for Affinity Purification Mass Spectrometry Data
5.1. Abstract
5.2. Introduction
5.3. Results
5.3.1. Creation of the CRAPome repository
5.3.2. Graphical user interface
5.3.3. Characterization of the CRAPome
5.3.4. Using CRAPome to score interactions
5.3.5. A more stringent FC score helps capture “rare” contaminants
5.4. Concluding Remarks
5.5. Online Methods
5.6. References
5.7. Supplementary materials

Chapter 6 Conclusions

Acknowledgments

Curriculum vitae

Chapter 1 Introduction to Interaction Proteomics

Simon Hauri and Matthias Gstaiger

Contribution by SH: Manuscript writing


1.1. Abstract

The development of disease-causing genome alterations is a complex pathological process. Recent advances in genomic and transcriptomic sequencing of mutated organisms provide an enormous amount of data about the genetic patterns that correlate with phenotypes. To understand the molecular mechanisms that determine a disease phenotype, post-genomic analysis of the gene products - the proteins - has become an important but formidable task. Advances in proteomic technologies enable promising new concepts for characterizing complex non-linear genotype-phenotype relationships. In this review we provide an overview of current and upcoming mass spectrometry techniques, their inherent advantages and limitations, and discuss how the reliable and systematic analysis of protein complexes contributes to new insights into the biochemistry underlying complex diseases.


1.2. Introduction

Many disease-causing mutations have been identified and linked to human pathology using genome-wide phenotypic screens or sequencing of patient genomes. The phenotype manifests in a variety of observable traits and may or may not be a linear consequence of the genotype. The original hypothesis of “one gene - one enzyme” [1] does not sufficiently describe most genotype-phenotype relationships. Gene products - the proteins - exert their function in the context of protein complexes [2]. It is therefore crucial to understand how these cellular machines are organized and, more importantly, how disease-causing mutations may affect the composition and regulation of protein complexes. Of course, the systematic identification of disease-causing mutations by high-throughput sequencing projects remains an important step towards a comprehensive genetic compendium of human pathologies. The OMIM® database (http://omim.org/) provides access to such a compendium of human genotype-phenotype relationships. It contains almost 7500 phenotypes, of which half can so far be linked to an underlying set of gene mutations. Proteomic approaches have begun to successfully link the impact of genetic perturbations to protein behavior, such as post-translational modifications [3], protein degradation [4], and protein-protein interactions (PPI).

Here we will focus on interaction proteomics and how changes in protein complex formation can affect a macromolecular phenotype. An example of this concept is the Edgotyping Disease Initiative (Center for Cancer Systems Biology, Dana-Farber Cancer Institute, Boston MA, http://cegs2.dfci.harvard.edu/). The goal of the initiative is to systematically study how disease-causing mutations affect the protein interaction landscape of mutated proteins. A distinction is made between “node removal” (complete loss of a gene product) and “edgetic” alteration (edge-specific genetic alteration; loss of specific protein interactions). The hypothesis states that genetic mutations can cause distinct molecular defects in proteins and lead to phenotypic outcomes based on alterations of cellular interaction networks. For example, single amino acid substitutions in domains or binding sites of proteins will induce alterations in the interaction network that can cause changes in biochemical and biophysical processes [5, 6].

To undertake the endeavor of characterizing protein complexes, several large-scale PPI detection methods have been developed. Two widely used techniques are yeast two hybrid (Y2H) screens (as used in the edgotyping initiative) and affinity purification mass spectrometry (AP-MS) based proteomics. Y2H is a fast, easy to perform and large-scale compatible method that has been used extensively to map binary protein interactions. Over a third of the annotated interactions in public PPI databases were detected by Y2H (source: IntAct; http://www.ebi.ac.uk/intact/). Y2H relies on the physical interaction of two recombinant proteins (bait and prey) expressed in yeast. If an interaction occurs, the expression of a reporter gene is induced and generates a positive readout. The assay has some limitations, however: interactions have to occur in the nucleus; high concentrations of recombinant proteins may produce false positive interactions; and non-yeast proteins may not be folded or modified as they would be in their natural cellular environment. More conceptually, Y2H reports binary protein interactions and provides no information about protein complexes.

Thanks to advances in instrumentation, computational methods and full genome sequencing, mass spectrometry based proteomics has become the most widely used and reliable technology to measure proteins in complex biological samples. In combination with affinity purification, it became feasible for the first time to study the composition of native protein complexes at proteome-wide scale [7]. There are of course challenges and limitations that still have to be addressed (see below), but in principle the tools to generate high quality interaction data by mass spectrometry are available today and offer a powerful tool box to compare complex formation between wild type and disease-associated protein variants at large scale.
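The distinction between node removal and edgetic alteration described above can be made concrete with a toy network. The following is a minimal sketch, not taken from the Edgotyping Disease Initiative: the protein names A to D and the helper functions are hypothetical and serve only to show that the two perturbation types leave different interaction landscapes behind.

```python
# Toy interaction network; protein names A-D are hypothetical placeholders.
def build_network(edges):
    """Return an adjacency map {protein: set(partners)} from undirected edges."""
    network = {}
    for a, b in edges:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def remove_node(network, protein):
    """Node removal: the gene product is lost together with all its interactions."""
    return {p: partners - {protein}
            for p, partners in network.items() if p != protein}


def remove_edge(network, a, b):
    """Edgetic alteration: only the specific interaction a-b is lost."""
    altered = {p: set(partners) for p, partners in network.items()}
    altered[a].discard(b)
    altered[b].discard(a)
    return altered


wild_type = build_network([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
null_mutant = remove_node(wild_type, "C")          # C is absent from the network
edgetic_mutant = remove_edge(wild_type, "C", "D")  # C still binds A and B
```

In this sketch the edgetic mutant retains protein C with two of its three partners, whereas the null mutant loses every interaction that involved C.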

1.3. Mass spectrometry based Interaction proteomics

1.3.1. Protein Identification

Proposed by Mann and Wilm in 1994 [8], protein identification by liquid chromatography tandem mass spectrometry (LC-MS/MS) reaches its 20-year anniversary next year. The typical LC-MS/MS workflow (Figure 1.1), often dubbed “shotgun proteomics”, requires prior enzymatic digestion of a protein sample, usually by trypsin. The tryptic peptides are buffered at low pH to ensure positive charges at the N-terminus and at the C-terminal lysine or arginine residue, and are separated by reverse phase chromatography according to their hydrophobicity. At the tip of the chromatographic column, the peptides are subjected to electrospray ionization (ESI) and injected into the mass spectrometer, where their mass to charge ratio (m/z) is measured (MS1 scan). The most intense peptide ions (precursors) are fragmented at the peptide bonds by colliding with inert gas molecules (e.g. helium; collision induced dissociation, CID), which generates ions that reflect the amino acid sequence by characteristic mass shifts (MS2 scan). The MS1 scan followed by multiple MS2 scans is referred to as the scan cycle. After recording several thousand scan cycles, a search algorithm matches and scores the spectra from MS2 scans against theoretical spectra derived from in silico digested proteins. Protein identifications are inferred from peptide spectrum matches; this is not a trivial task and has been addressed by several different approaches [9].

Despite two decades of progress in LC-MS/MS technology and computational analysis of MS data, challenges remain. The large complexity and high dynamic range of cellular proteomes are the major factors that limit comprehensive protein identification. Due to the principle of shotgun proteomics, the likelihood of identifying proteins of very low abundance decreases with increasing sample complexity. The mass spectrometer selects only the most intense peptide ions from the MS1 scan for subsequent fragmentation. Therefore, in very complex samples, more high intensity ions will occupy the limited scan cycles and prevent ions of lower intensity from being fragmented. The same applies to samples with extreme dynamic range, where the protein concentration spans several orders of magnitude. Biological samples usually possess these characteristics: in yeast the dynamic range can reach up to four [10], in human cell lines even up to seven orders of magnitude [11]. The development of more efficient protein fractionation techniques in combination with the advances of high resolution MS instruments led to the comprehensive proteomic analysis of several genetic model organisms, including S. cerevisiae [12], D. melanogaster [13, 14], C. elegans [13], A. thaliana [15], M. tuberculosis [16], and H. sapiens [11, 17]. Upcoming new technologies in mass spectrometry, such as SWATH-MS mentioned below, will certainly extend this list and increase the coverage, especially for highly complex proteomes.
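The effect of sample complexity on precursor selection can be illustrated with a small sketch of intensity-based (“top N”) selection within one scan cycle. This is a simplified illustration under stated assumptions: the peak lists and the value of N are invented, the function name is hypothetical, and instrument features such as dynamic exclusion are deliberately left out.

```python
def select_precursors(ms1_peaks, top_n):
    """Pick the top_n most intense precursor ions of one MS1 scan for MS2.

    ms1_peaks is a list of (m/z, intensity) pairs.
    """
    ranked = sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _intensity in ranked[:top_n]]


# Invented example scans: the same low-intensity ion (m/z 512.8) is present in both.
simple_scan = [(445.2, 1e6), (512.8, 5e3), (630.3, 8e5)]
complex_scan = simple_scan + [(701.4, 9e5), (388.7, 7e5), (559.9, 6e5)]

picked_simple = select_precursors(simple_scan, top_n=3)
picked_complex = select_precursors(complex_scan, top_n=3)
```

With three MS2 slots per cycle, the low-intensity ion is fragmented in the simple scan but is crowded out by more intense ions in the complex scan.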


Figure 1.1: Typical protein identification workflow by tandem mass spectrometry. Proteins are enzymatically digested and separated by reverse phase high performance liquid chromatography (RP-HPLC). The peptide ions are measured (precursor ion spectrum, MS1), fragmented by collision induced dissociation (CID) and measured again (fragment ion spectrum, MS2).
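The in silico digestion used by search algorithms in this workflow can be sketched as follows. The sketch assumes the commonly used trypsin rule (cleavage C-terminal to lysine or arginine, but not before proline); the example sequence and the function name are hypothetical, and real search engines apply additional settings that are not modeled here.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """In silico tryptic digest: cleave after K or R unless followed by P.

    Returns the peptides in order of their start position, including
    peptides with up to `missed_cleavages` skipped cleavage sites.
    """
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal peptide without K/R

    peptides = []
    for first in range(len(fragments)):
        for extra in range(missed_cleavages + 1):
            last = first + extra + 1
            if last <= len(fragments):
                peptides.append("".join(fragments[first:last]))
    return peptides


fully_cleaved = tryptic_digest("AAKRPCCKDDR")
with_one_missed = tryptic_digest("AAKRPCCKDDR", missed_cleavages=1)
```

In the example, the K-R bond is cleaved while the R-P bond is not, so the second peptide starts with RP.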

1.3.2. Protein quantification

The ability to characterize changes in proteomes requires a technology that provides reliable and sensitive quantification of protein abundances. An important advantage of mass spectrometry based proteomics is the ability to infer protein abundance from mass spectral properties. The most commonly used abundance estimation method is spectral counting. As the name suggests, the spectral count is the sum of all acquired peptide spectra that can be assigned to a specific protein. It is based on the assumption that protein abundances correlate with the number of peptide fragmentation events [18, 19]. Spectral counting emerged as an easily accessible quantification method in a majority of proteomic studies, but several limitations have to be taken into account. Protein abundance is not the only variable contributing to the spectral count; the number and detectability of the corresponding peptides contribute as well. Shorter proteins generate fewer spectral counts than longer ones, and peptides which do not ionize well result in lower signals and are less likely to be fragmented. Furthermore, unique and non-proteotypic peptides have to be weighted differently, since non-proteotypic peptides could originate from multiple proteins of variable abundance. Finally, between samples of different complexity, spectral count values from less complex samples may lead to an overestimation of protein abundance, as the instrument had more opportunities to fragment the same peptides multiple times. To take these aspects into consideration, a computational method was introduced that converts spectral counts to normalized spectral abundance factor (NSAF) values [20, 21]. NSAF values are normalized to protein length and total sample abundance, and consider differences between unique and shared peptides.

An alternative quantification method in shotgun proteomics is the extraction of peptide precursor ion intensities [22, 23]. In theory, the sum of all precursor ion intensities should accurately describe the protein abundance. In contrast to spectral counting, the precursor intensity is largely unaffected by the sample complexity, provided the concentrations are equal. A major advantage of extracting MS1 features is the ability to map peptide identifications to quantified precursor intensities across independent LC-MS experiments. The criteria for feature extraction are usually based on charge state, isotope pattern and peak width; only after successful peak extraction are peptide identifications matched according to retention time and m/z coordinates. This allows the assignment of peptide identifications across different samples by creating a master map of peptide to feature assignments [24, 25]. The challenge of this approach lies in the high demand on technical reproducibility in the retention time dimension. A poor master map alignment will drastically affect the quantification result.

When choosing the ideal quantification method, the difference between limit of detection (LOD) and limit of quantification (LOQ) should be taken into consideration. The LOD in LC-MS/MS represents the lowest amount of a peptide ion necessary to trigger an MS2 scan. The LOD is usually lower than the LOQ, as it is possible for the instrument to fragment a low abundant peptide from a very noisy part of the mass spectrum, where a reliable extraction of the precursor peak intensity is not possible. As a consequence, in most cases only a fraction of the identified proteins can also be quantified by MS1 intensity measurements.
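The length and sample normalization behind NSAF values can be written as NSAF_i = (SpC_i / L_i) / Σ_j (SpC_j / L_j), where SpC is the spectral count and L the protein length. The following is a minimal sketch of that basic formula only; the protein names and numbers are invented, and the separate handling of unique and shared peptides mentioned above is not implemented.

```python
def nsaf(spectral_counts, protein_lengths):
    """Normalized spectral abundance factor for each protein in one sample.

    NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j)
    """
    saf = {protein: spectral_counts[protein] / protein_lengths[protein]
           for protein in spectral_counts}
    total = sum(saf.values())
    return {protein: value / total for protein, value in saf.items()}


# Invented example: the longer protein has more spectra but a lower NSAF.
counts = {"long_protein": 40, "short_protein": 10}
lengths = {"long_protein": 800, "short_protein": 100}
nsaf_values = nsaf(counts, lengths)
```

The example shows why raw counts can mislead: the long protein collects four times as many spectra, yet after length normalization the short protein has the higher abundance factor.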

1.3.3. Targeted proteomics by SRM and SWATH-MS

Targeted proteomics was developed to address the above mentioned shortcomings of quantification in shotgun based methods [26]. Instead of allowing the mass spectrometer to select which precursor ions should be fragmented, the instrument is programmed to selectively fragment peptide ions from a predefined list of target peptides. The most common quantification method in targeted proteomics is selected reaction monitoring (SRM, Figure 1.2). For each protein on the target list, a set of proteotypic peptides is selected. The subsequent validation and identification of the correct peptides are based on highly specific pairs of precursor and fragment ion signals (transitions). By knowing the m/z, signal intensity and retention time of the transitions, it becomes possible to acquire reliable quantitative data for the target proteins. This requires a high mass accuracy acquisition in the MS2 scan and is usually measured on a triple quadrupole or similarly performing instrument. Targeted proteomics by SRM is currently the most sensitive mass spectrometry based proteomics technique and allows the detection of proteins in the low attomolar range. Its limitations are the need for prior knowledge of the investigated peptides, time-consuming and labor-intensive assay development, and relatively low throughput.
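A transition, as described above, is a pair of precursor and fragment ion m/z values. The sketch below computes such pairs for y-type fragment ions from standard monoisotopic residue masses. It is an illustration only: the peptide PEPTIDE is an arbitrary example sequence rather than a real assay target, the choice of y-ions and the function names are assumptions, and modifications are not considered.

```python
# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def precursor_mz(peptide, charge):
    """m/z of the intact peptide ion carrying `charge` protons (selected in Q1)."""
    neutral = sum(MONOISOTOPIC[residue] for residue in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def y_ion_mz(peptide, n):
    """m/z of the singly charged y-ion containing the n C-terminal residues."""
    return sum(MONOISOTOPIC[residue] for residue in peptide[-n:]) + WATER + PROTON


def transitions(peptide, charge, y_ions):
    """List of (precursor m/z, fragment m/z) pairs for the requested y-ions."""
    q1 = precursor_mz(peptide, charge)
    return [(round(q1, 4), round(y_ion_mz(peptide, n), 4)) for n in y_ions]


# Arbitrary example: doubly charged precursor, fragments y3, y4 and y5.
srm_assay = transitions("PEPTIDE", charge=2, y_ions=[3, 4, 5])
```

Each tuple in `srm_assay` corresponds to one transition that an operator would enter into the instrument method; in practice the transitions are chosen from measured fragmentation patterns rather than computed blindly.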

Figure 1.2: Typical targeted proteomics approach (SRM). Selected reaction monitoring is usually performed on triple quadrupole instruments. Prior to the measurement, proteotypic ions of the protein(s) of interest (here in red) are selected by the user according to the fragmentation pattern of the peptides (transitions). The mass spectrometer will scan only for the transitions provided by the operator. Digested peptides are ionized by electrospray (ESI) and injected into the mass spectrometer. The proteotypic ion is selected in the first quadrupole Q1; it is fragmented by collision induced dissociation (CID) in the second quadrupole Q2, and the preselected transitions are measured in the third quadrupole Q3 over the full duration of the liquid chromatography gradient. The measured transitions are highly specific for the selected protein and provide reliable quantification values.


To combine the sequencing power of data-dependent acquisition (shotgun) with the sensitivity and reliable quantification of targeted proteomics, an alternative data independent acquisition (DIA) technology referred to as SWATH-MS is currently being developed. Conceptually, all precursor ions that enter the mass spectrometer are fragmented and measured in consecutive acquisition windows of defined mass width (“SWATH”) that progress over the full m/z range and retention time [27]. To achieve this, an instrument featuring very high acquisition speed and high mass accuracy is needed. The composite of all acquired SWATHs creates a digital record of all detectable fragment ions. An advantage of such digital records is that they can be queried by post-acquisition data extraction using either transitions from existing SRM assays or spectral libraries obtained from previous shotgun MS experiments. This also means that, unlike in SRM, there is no need to define transitions prior to the measurement, and the dataset can be revisited with a new set of transitions at any time, depending on the questions to be addressed. In summary, SWATH-MS allows the comprehensive acquisition of SRM-type data for accurate identification and quantification of complex proteomes.
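The idea of consecutive acquisition windows and post-acquisition querying can be sketched as follows. All numbers are assumptions made for illustration (a 400 to 1200 m/z range, 25 m/z wide windows, a 0.05 m/z extraction tolerance), as are the function names and the layout of the record as a dictionary from window index to (retention time, fragment m/z, intensity) tuples.

```python
def swath_windows(mz_start, mz_end, width):
    """Consecutive precursor isolation windows covering [mz_start, mz_end)."""
    windows = []
    lower = mz_start
    while lower < mz_end:
        windows.append((lower, min(lower + width, mz_end)))
        lower += width
    return windows


def window_for(precursor_mz, windows):
    """Index of the window that isolates the given precursor, or None."""
    for index, (lower, upper) in enumerate(windows):
        if lower <= precursor_mz < upper:
            return index
    return None


def extract_trace(record, windows, precursor_mz, fragment_mz, tolerance=0.05):
    """Query the digital record for one transition after the acquisition."""
    index = window_for(precursor_mz, windows)
    return [(rt, intensity)
            for rt, mz, intensity in record.get(index, [])
            if abs(mz - fragment_mz) <= tolerance]


windows = swath_windows(400.0, 1200.0, 25.0)

# Invented record: fragment ions measured in window 0 at two retention times.
record = {0: [(10.0, 376.17, 500.0), (10.0, 600.30, 50.0), (10.1, 376.18, 900.0)]}
trace = extract_trace(record, windows, precursor_mz=400.687, fragment_mz=376.1714)
```

Because the record holds every fragment ion of every window, the same `record` could be queried again later with a different transition without re-measuring the sample.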

1.3.4. Analysis of Protein interactions by Affinity purification mass spectrometry

Affinity Purification Mass Spectrometry (AP-MS) is the method of choice to study protein-protein interactions (PPI) and complexes under near physiological conditions (Figure 1.3). A protein of interest - the bait - is purified either by a bait specific antibody or by an affinity reagent specific for an epitope tag fused to the bait protein. As in co-immunoprecipitation (co-IP) western blot experiments, AP-MS relies on the principle that under native lysis conditions protein complexes remain largely intact and associated proteins (prey) purify together with the bait protein. The digested peptides of purified complexes are analyzed by shotgun LC-MS/MS, and the compiled data of all purification experiments can be assembled in a bait-prey matrix.

Several purification protocols have been described and can be roughly summarized in three groups: antibody based methods, tandem affinity tags and single affinity tags. All have their advantages and disadvantages. Specific antibodies allow protein purification at endogenous expression levels from tissue samples and primary cells without the need for any genetic engineering. Antibody based AP-MS methods may be a valid choice for focused studies that do not justify an extensive tagging protocol. They are, however, not ideal for generating systematic large scale data, as antibodies can differ in specificity and binding efficiency or, for some proteins, are simply unavailable. In a recent study, a large interaction network of human co-regulator proteins was analyzed by affinity purifications with 1796 antibodies [28]. The authors address the inherent problem of cross-reactivity and non-specific binding of antibodies with a computational approach. According to their supplementary materials, only 70% of the antibodies were able to recover their intended antigens, out of which 60% showed cross-reactivity with other proteins. Besides these disadvantages for high throughput AP-MS, antibodies can also disrupt protein complexes.

Affinity tags, on the other hand, offer a more versatile biochemical tool to purify a broad range of different proteins using coherent, large-scale compatible purification procedures. Tags can be used as single tags or in tandem with other tags for sequential purification of protein complexes. Tandem purification results in very clean purifications with a low amount of unspecific background proteins. However, transient interaction partners may dissociate during the extended purification steps [29]. With advances in data analysis (see below) the need for highly purified protein complexes diminished, and single tag approaches became more popular as they allow quantitative, specific and fast purification of complexes at high throughput. An extensive comparison between affinity reagents was provided by Oeffinger et al. [30].
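The bait-prey matrix mentioned above can be assembled from per-purification prey lists in a few lines. This is a minimal sketch with invented bait and prey names and spectral counts as the quantitative value; the input layout and the function name are assumptions.

```python
def bait_prey_matrix(purifications):
    """Assemble a bait-prey matrix from {bait: {prey: spectral_count}}.

    Returns the sorted bait names (rows), the sorted prey names (columns)
    and the matrix as a list of rows; preys not seen with a bait get 0.
    """
    baits = sorted(purifications)
    preys = sorted({prey for hits in purifications.values() for prey in hits})
    matrix = [[purifications[bait].get(prey, 0) for prey in preys]
              for bait in baits]
    return baits, preys, matrix


# Invented example data from two purification experiments.
experiments = {
    "BAIT1": {"PREY_A": 12, "PREY_B": 3},
    "BAIT2": {"PREY_B": 7, "PREY_C": 1},
}
baits, preys, matrix = bait_prey_matrix(experiments)
```

A prey that appears in several rows, such as PREY_B here, is either shared between complexes or a candidate contaminant, which is why the filtering step discussed in the next section matters.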


Figure 1.3: Quantitative mass spectrometry in interaction proteomics. (A) A protein of interest (bait; orange) is purified in condition A by antibody or affinity tag. Prey proteins (red, green) co-elute with the bait and are identified by LC-MS/MS. (B) The same bait is purified in a different condition (e.g. drug treatment or mutant protein) and quantitatively compared to condition A. The green prey no longer associates with the bait and is lost during the purification, observable by loss of signal in the mass spectrum. (C) A control purification with no bait or untagged bait protein can show that the blue protein is an unspecifically binding contaminant protein.

1.3.5. Data filtering strategies for unspecific interacting proteins Already from early AP-MS studies [7], it became clear that the majority of prey proteins identified by mass spectrometry are non-specific contaminants that co-purify with bait. The enormous complexity and abundance range in AP-MS experiments initially led to hundreds of reported interactions per bait protein. Compared to earlier works [31], recent AP-MS studies claim on average fewer interactions, even though instrument sensitivity increased over the years. This indicates that insufficient focus was put on

11 Chapter 1 Introduction to Interaction Proteomics

contaminant filtering which led to false positive assignments. This is critical, as false positives enter the public interaction database and are typically never removed. A detailed perspective on the wide array of filtering methods has been provided by Pardo and Choudhary, [32]. In general, most filtering approaches rely on quantitative comparison of prey abundances to control purification experiments and/or incorporation of distinct dataset properties, such as reproducibility, reciprocity, and frequency of observation [33-44]. Well established examples for filtering computational methods include: the SAINT (Significance Analysis of INTeractome) algorithm, which assigns confidence scores based on statistical models [35, 36]; the software platform ComPASS (Comparative Proteomics Analysis Software Suite), that considers quantitative distribution and reproducibility between replicates [33, 42], and the QUBIC (Quantitative BAC InteraCtoimcs) method, where MS1 precursor intensities are compared between bait and untagged control purifications [38, 39]. An effort to support the AP-MS community with a repository of common contaminant proteins was recently published, called the CRAPome (Contaminat Repository for Affnity Pruification Data; http://www.crapome.org/) [45]. The CRAPome provides annotated negative control data from different cell lines, tag systems and protocols collected from multiple laboratories around the world. A web interface allows query of protein lists, download of control datasets and network analysis of AP-MS data. The reproducibly of modern standardized AP-MS workflows can be very high, as demonstrated by a recent inter-laboratory comparison [44]. High inter-laboratory robustness in AP-MS data generation is important since full proteome coverage for human interaction data will depend on the integration of AP-MS data obtained from many labs. 
AP-MS has become the method of choice to describe protein complexes under physiological conditions at large scale. A proper filtering strategy for AP-MS datasets is a critical step towards more coherent, reproducible and therefore biologically meaningful interaction landscapes and their dynamics.
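The control-based filtering idea described above can be illustrated with a minimal sketch. It assumes normalized spectral abundance factors (NSAF) as the abundance estimate and a simple fold-change cut-off against a control purification. The function names, the fold threshold, the pseudo-value and all counts are hypothetical and chosen for illustration only; this is not the scoring scheme of SAINT, CompPASS or QUBIC.

```python
from typing import Dict


def nsaf(spectral_counts: Dict[str, int], lengths: Dict[str, int]) -> Dict[str, float]:
    """Normalized spectral abundance factor for one purification.

    NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j), with SpC the spectral
    count and L the protein length in residues.
    """
    saf = {p: count / lengths[p] for p, count in spectral_counts.items()}
    total = sum(saf.values())
    if total == 0:
        return {p: 0.0 for p in saf}
    return {p: value / total for p, value in saf.items()}


def filter_against_control(bait_run, control_run, lengths, min_fold=10.0, pseudo=1e-4):
    """Keep preys whose NSAF in the bait run exceeds the control by min_fold."""
    bait = nsaf(bait_run, lengths)
    control = nsaf(control_run, lengths)
    kept = {}
    for prey, abundance in bait.items():
        # The pseudo-value (an assumption) avoids division by zero for preys
        # that were never observed in the control purification.
        fold = (abundance + pseudo) / (control.get(prey, 0.0) + pseudo)
        if fold >= min_fold:
            kept[prey] = fold
    return kept


# Hypothetical toy data: spectral counts and protein lengths are invented.
lengths = {"BAIT": 500, "PREY_A": 400, "HSP70": 640, "TUBULIN": 450}
bait_run = {"BAIT": 120, "PREY_A": 40, "HSP70": 60, "TUBULIN": 30}
control_run = {"HSP70": 55, "TUBULIN": 35}

kept = filter_against_control(bait_run, control_run, lengths)
print(sorted(kept))  # prints ['BAIT', 'PREY_A']
```

In this toy example the two abundant background proteins are removed because they are at least as abundant in the control as in the bait purification. Real filtering schemes additionally use replicate reproducibility and frequency of observation across many baits.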


1.3.6. Generating large scale interaction maps to guide upcoming and targeted studies

There are several databases already available that contain collections of published protein interactions. Unfortunately, due to the heterogeneity of the data, the false discovery rates (FDR) are largely unknown. As mentioned above, access to reliable and comparable large scale interaction data should be a key priority for researchers interested in focused studies on protein complexes. Large scale approaches that map entire sections of the proteome are important contributors to such databases. Not only do they provide information on binary protein interactions, but depending on the coverage they can also allow conclusions about protein assemblies, higher-order modularity beyond complex formation, and mutually exclusive binding partners [41, 46-48]. Results from such large scale studies have given new insights into many biological processes including autophagy, de-ubiquitination, and protein kinase signaling [33, 40-43].

1.3.7. Inference of protein complex stoichiometry by absolute quantification

High-density interaction networks that contain a high degree of reciprocity have been used to define sub-classes of protein complexes by analyzing mutually exclusive binding partners of core proteins of complexes using various clustering techniques [41, 46-48]. This provides qualitative information about differential protein complexes and their composition, but little about protein stoichiometry. In AP-MS experiments, stoichiometry can be inferred from abundance measurements of co-purifying proteins. The quantification methods mentioned above provide relative abundance information of prey proteins that co-purify with the bait. Recent studies have used various techniques to determine the absolute abundance of proteins in complexes [49-51]. Such information is also useful to quantitatively infer the partitioning of proteins across several concurrent complexes in a cell.
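As a minimal sketch of how stoichiometry can be read from abundance measurements, the following example divides the absolute amount of each co-purified protein by the amount of the bait. The function name and all amounts are hypothetical; how the absolute amounts are obtained (for example with reference peptides) is outside the scope of the sketch.

```python
from typing import Dict


def relative_stoichiometry(amounts: Dict[str, float], bait: str) -> Dict[str, float]:
    """Express the amount of every co-purified protein relative to the bait."""
    bait_amount = amounts[bait]
    if bait_amount <= 0:
        raise ValueError("bait amount must be positive")
    return {protein: amount / bait_amount for protein, amount in amounts.items()}


# Hypothetical absolute amounts (e.g. fmol) measured in one purification.
amounts = {"BAIT": 2.0, "SUBUNIT_A": 4.0, "SUBUNIT_B": 1.0}

ratios = relative_stoichiometry(amounts, "BAIT")
print(ratios)  # prints {'BAIT': 1.0, 'SUBUNIT_A': 2.0, 'SUBUNIT_B': 0.5}
```

A ratio below one, as for SUBUNIT_B here, is compatible with only a fraction of the bait population residing in a complex with that protein, which is the partitioning across concurrent complexes mentioned above.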

1.3.8. Profiling of dynamic changes in interaction proteomes

Cells have to sense external signals and adapt their cell growth and proliferation in response to changes in the cellular environment. Such adaptations are likely to involve changes in the protein interaction landscape underlying cellular signaling. An early AP-MS study characterized the dynamic changes of the human TNFα/NF-κB pathway upon stimulation with TNFα or RNA interference of network components in a qualitative manner [52]. Recent studies aimed at resolving such changes in a more quantitative manner at an entire network level. Glatter et al. (2011) showed that about 10% of the interactions constituting the interaction proteome of the insulin receptor and target of rapamycin (InR/TOR) pathway changed significantly upon treatment with insulin [37]. In this example, the authors used aligned MS1 precursor ion intensities to quantify relative protein changes between the different conditions. Another study analyzed the kinetics of GRB2 complex formation following the activation of receptor tyrosine kinases in human cells. Using SRM, the authors could simultaneously profile the kinetics of 90 different proteins binding to GRB2 following growth factor treatment [53]. In summary, analysis of protein complexes by quantitative AP-MS provides a reliable, high resolution method to measure protein complex dynamics at large scale. AP-MS data, however, do not provide information to distinguish direct from indirect interactions, and models of protein complex formation cannot be built from a single bait AP-MS experiment but rather require an array of reciprocal AP-MS experiments.

1.4. Cross-linking and AP-MS

Cross-linking coupled with shotgun mass spectrometry has been developed to infer direct protein-protein interactions and to obtain distance constraints for modeling the topology of protein complexes [41, 46-48, 54-57]. In principle, proteins are treated simultaneously with an isotopically labeled and an unlabeled cross-linking reagent before tryptic digest. The peptides that harbor cross-linked residues remain covalently linked during the LC-MS/MS analysis and produce a complex fragmentation spectrum in the MS2 scan. Precursor ions for which both the unlabeled (light) and labeled (heavy) species are observed (indicated by a mass shift of the labeled cross-linker) are analyzed by dedicated search algorithms. The obtained data can provide the exact location of the cross-linked lysine residues from inter- as well as intra-protein cross-linking events. Given the known length of the applied cross-linking reagents and the availability of high resolution structures or homology models, such data have been successfully used to develop topological models for large protein assemblies that typically escape structural analysis by classical methods.
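The selection of light/heavy precursor pairs can be sketched as a search for pairs of neutral precursor masses that differ by the mass shift of the labeled cross-linker. The sketch below is an illustration under stated assumptions: the function name, the ppm tolerance and the example masses are hypothetical, and the shift of 12.0753 Da corresponds to a reagent carrying twelve deuterium atoms, which is assumed here and not specified in the text. It is not the algorithm of any particular search engine.

```python
from typing import List, Tuple


def find_label_pairs(masses: List[float], shift: float, tol_ppm: float = 10.0) -> List[Tuple[float, float]]:
    """Return (light, heavy) pairs of neutral masses separated by `shift`."""
    ordered = sorted(masses)
    pairs = []
    for i, light in enumerate(ordered):
        target = light + shift
        tolerance = target * tol_ppm * 1e-6
        for heavy in ordered[i + 1:]:
            if heavy > target + tolerance:
                break  # masses are sorted, so no later candidate can match
            if abs(heavy - target) <= tolerance:
                pairs.append((light, heavy))
    return pairs


# Hypothetical neutral precursor masses in Da.
precursors = [2500.3000, 2512.3753, 1800.9000, 3100.5000]

pairs = find_label_pairs(precursors, shift=12.0753)
print(pairs)  # prints [(2500.3, 2512.3753)]
```

Only precursors that form such a pair would be passed on to the dedicated cross-link search, which reduces the search space considerably.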

1.5. Sub-fractionation of full cell lysates and purified complexes


A major limitation of AP-MS based protein complex inference is the difficulty of measuring the distribution of proteins across multiple concurrent protein complexes purified in an AP-MS experiment. It is very likely that the purified bait protein interacts with several mutually exclusive proteins in distinct complexes. This problem can be partially addressed by computational analysis of an array of reciprocal AP-MS experiments and mapping of mutually exclusive binding partners within protein complexes [41, 46-48]. Recently, a promising approach was proposed based on the separation of protein complexes from full cell lysates by size exclusion chromatography (SEC) and protein correlation profiling. The obtained size fractions were measured by shotgun proteomics, and reconstituted elution profiles of all identified proteins could be used to infer protein complex composition [58, 59]. The idea of such an approach is not entirely new, but advances in size exclusion chromatography columns and mass spectrometry facilitated the separation of the entire proteome and its subsequent quantitative analysis by mass spectrometry. So far, large and abundant assemblies like the proteasome, the TCP chaperone complex and the prefoldin complex could be successfully extracted from the obtained size exclusion profiles. However, the enormous complexity of the human proteome and the high dynamic range in protein expression certainly challenge the recently proposed methods. An approach that combines SEC fractionation with SWATH-MS analysis using spectral libraries of known interaction partners obtained by high resolution AP-MS studies might provide a significant increase in sensitivity for assigning co-eluting proteins.
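Protein correlation profiling can be illustrated with a minimal sketch that compares elution profiles across SEC fractions by Pearson correlation and reports protein pairs above a threshold. The function names, the threshold and the profile values are hypothetical; the protein labels are used only as examples, and published methods use more elaborate scoring.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-elution information
    return cov / sqrt(var_x * var_y)


def coeluting_pairs(profiles: Dict[str, List[float]], min_r: float = 0.9) -> List[Tuple[str, str, float]]:
    """List protein pairs whose elution profiles correlate with r >= min_r."""
    names = sorted(profiles)
    hits = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(profiles[first], profiles[second])
            if r >= min_r:
                hits.append((first, second, r))
    return hits


# Hypothetical intensities across six consecutive SEC fractions.
profiles = {
    "PSMA1": [0, 1, 8, 10, 3, 0],
    "PSMB1": [0, 2, 9, 11, 2, 0],
    "ACTB": [9, 7, 2, 1, 0, 0],
}

hits = coeluting_pairs(profiles)
print([(a, b) for a, b, _ in hits])  # prints [('PSMA1', 'PSMB1')]
```

Co-elution alone does not prove physical association, which is why the text suggests combining SEC profiles with prior knowledge from AP-MS studies.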

1.6. Conclusions and Outlook

Functional genomics provides the molecular basis of disease phenotypes. Quantitative high resolution interaction proteomics of protein complexes allows integrated analysis of the molecular mechanisms associated with the disease. In this review, we provided an overview of current and emerging mass spectrometry based methods that allow hypothesis driven analysis of interaction networks. New analytical techniques such as SRM or SWATH-MS are currently being integrated into established AP-MS workflows to overcome existing limitations in assay sensitivity and quantitative accuracy. It can be expected that the high reproducibility and accuracy of mass spectrometry driven interaction proteomics will provide data on protein complex dynamics at unprecedented resolution and at nearly full proteome coverage. This will significantly improve the quality and coverage of annotated interactions in public databases, which can then be used as a reliable resource to infer genotype-phenotype relationships from a modular proteome.

The goal of this thesis is the characterization of two human protein interactomes using state-of-the-art AP-MS protocols and data analysis methods. Using the Strep-HA tandem affinity tag system (Glatter et al.), we set out to investigate the interaction proteomes of the human Hippo signaling pathway (Chapter 2) and the human Polycomb group proteins of epigenetic regulators (Chapter 3). Furthermore, in a collaborative effort, the collective control purification experiments generated during this thesis should contribute to the online repository CRAPome (Chapter 4). Finally, the inter-laboratory reproducibility of the Strep-HA AP-MS protocol should be investigated in collaboration with a research group in Vienna (Chapter 4). By incorporating a highly reproducible AP-MS protocol (Chapter 5) and high quality contaminant filtering (Chapter 4), we want to characterize the protein interactomes of the Hpo pathway (Chapter 2) and the Polycomb group proteins (Chapter 3), with the goal of gaining new insights into how these biological systems are organized and providing a solid framework for future research in the respective fields.


1.7. References 1. Beadle, G.W. and E.L. Tatum, Genetic Control of Biochemical Reactions in Neurospora. Proc Natl Acad Sci U S A, 27, 499-506 (1941). 2. Alberts, B., The cell as a collection of protein machines: preparing the next generation of molecular biologists. Cell, 92, 291-4 (1998). 3. Gnad, F., et al., Systems-wide analysis of K-Ras, Cdc42 and PAK4 signaling by quantitative phosphoproteomics. Mol Cell Proteomics, (2013). 4. Lanucara, F., P. Brownridge, I.S. Young, P.D. Whitfield, and M.K. Doherty, Degradative proteomics and disease mechanisms. Proteomics Clin Appl, 4, 133-42 (2010). 5. Dreze, M., et al., 'Edgetic' perturbation of a C. elegans BCL2 ortholog. Nat Methods, 6, 843- 9 (2009). 6. Zhong, Q., et al., Edgetic perturbation models of human inherited disorders. Mol Syst Biol, 5, 321 (2009). 7. Gavin, A.C., et al., Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141-7 (2002). 8. Mann, M. and M. Wilm, Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem, 66, 4390-9 (1994). 9. Claassen, M., Inference and validation of protein identifications. Mol Cell Proteomics, 11, 1097-104 (2012). 10. Ghaemmaghami, S., et al., Global analysis of protein expression in yeast. Nature, 425, 737- 41 (2003). 11. Geiger, T., A. Wehner, C. Schaab, J. Cox, and M. Mann, Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol Cell Proteomics, 11, M111 014050 (2012). 12. de Godoy, L.M., et al., Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature, 455, 1251-4 (2008). 13. Schrimpf, S.P., et al., Comparative functional analysis of the Caenorhabditis elegans and Drosophila melanogaster proteomes. PLoS Biol, 7, e48 (2009). 14. Brunner, E., et al., A high-quality catalog of the Drosophila melanogaster proteome. Nat Biotechnol, 25, 576-83 (2007). 15. 
Baerenfaller, K., et al., Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science, 320, 938-41 (2008). 16. Schubert, O.T., et al., The Mtb Proteome Library: A Resource of Assays to Quantify the Complete Proteome of Mycobacterium tuberculosis. Cell Host Microbe, 13, 602-12 (2013). 17. Beck, M., et al., The quantitative proteome of a human cell line. Mol Syst Biol, 7, 549 (2011). 18. MacCoss, M.J., C.C. Wu, H. Liu, R. Sadygov, and J.R. Yates, 3rd, A correlation algorithm for the automated quantitative analysis of shotgun proteomics data. Anal Chem, 75, 6912-21 (2003). 19. Liu, H., R.G. Sadygov, and J.R. Yates, 3rd, A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal Chem, 76, 4193-201 (2004). 20. Paoletti, A.C., et al., Quantitative proteomic analysis of distinct mammalian Mediator complexes using normalized spectral abundance factors. Proc Natl Acad Sci U S A, 103, 18928-33 (2006). 21. Zybailov, B., et al., Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae. J Proteome Res, 5, 2339-47 (2006).


22. Strittmatter, E.F., P.L. Ferguson, K. Tang, and R.D. Smith, Proteome analyses using accurate mass and elution time peptide tags with capillary LC time-of-flight mass spectrometry. Journal of the American Society for Mass Spectrometry, 14, 980-991 (2003). 23. Callister, S.J., et al., Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J Proteome Res, 5, 277-86 (2006). 24. Rinner, O., et al., An integrated mass spectrometric and computational framework for the analysis of protein interaction networks. Nat Biotechnol, 25, 345-52 (2007). 25. Mueller, L.N., et al., SuperHirn - a novel tool for high resolution LC-MS-based peptide/protein profiling. Proteomics, 7, 3470-80 (2007). 26. Picotti, P. and R. Aebersold, Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nat Methods, 9, 555-66 (2012). 27. Gillet, L.C., et al., Targeted data extraction of the MS/MS spectra generated by data- independent acquisition: a new concept for consistent and accurate proteome analysis. Mol Cell Proteomics, 11, O111 016717 (2012). 28. Malovannaya, A., et al., Analysis of the human endogenous coregulator complexome. Cell, 145, 787-99 (2011). 29. Dunham, W.H., et al., A cost-benefit analysis of multidimensional fractionation of affinity purification-mass spectrometry samples. Proteomics, 11, 2603-12 (2011). 30. Oeffinger, M., Two steps forward--one step back: advances in affinity purification mass spectrometry of macromolecular complexes. Proteomics, 12, 1591-608 (2012). 31. Ewing, R.M., et al., Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol Syst Biol, 3, 89 (2007). 32. Pardo, M. and J.S. Choudhary, Assignment of protein interactions from affinity purification/mass spectrometry data. J Proteome Res, 11, 1462-74 (2012). 33. Behrends, C., M.E. Sowa, S.P. Gygi, and J.W. Harper, Network organization of the human autophagy system. 
Nature, 466, 68-76 (2010). 34. Breitkreutz, A., et al., A global protein kinase and phosphatase interaction network in yeast. Science, 328, 1043-6 (2010). 35. Choi, H., T. Glatter, M. Gstaiger, and A.I. Nesvizhskii, SAINT-MS1: protein-protein interaction scoring using label-free intensity data in affinity purification-mass spectrometry experiments. J Proteome Res, 11, 2619-24 (2012). 36. Choi, H., et al., SAINT: probabilistic scoring of affinity purification-mass spectrometry data. Nat Methods, 8, 70-3 (2011). 37. Glatter, T., et al., Modularity and hormone sensitivity of the Drosophila melanogaster insulin receptor/target of rapamycin interaction proteome. Mol Syst Biol, 7, 547 (2011). 38. Hubner, N.C., et al., Quantitative proteomics combined with BAC TransgeneOmics reveals in vivo protein interactions. J Cell Biol, 189, 739-54 (2010). 39. Hubner, N.C. and M. Mann, Extracting gene function from protein-protein interactions using Quantitative BAC InteraCtomics (QUBIC). Methods, 53, 453-9 (2011). 40. Krogan, N.J., et al., Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440, 637-43 (2006). 41. Sardiu, M.E., et al., Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proc Natl Acad Sci U S A, 105, 1454-9 (2008). 42. Sowa, M.E., E.J. Bennett, S.P. Gygi, and J.W. Harper, Defining the human deubiquitinating enzyme interaction landscape. Cell, 138, 389-403 (2009).


43. Varjosalo, M., et al., The Protein Interaction Landscape of the Human CMGC Kinase Group. Cell Rep, 3, 1306-20 (2013). 44. Varjosalo, M., et al., Interlaboratory reproducibility of large-scale human protein-complex analysis by standardized AP-MS. Nat Methods, 10, 307-14 (2013). 45. Dattatreya Mellacheruvu, Z.W., Amber L. Couzens, Jean-Philippe Lambert, Nicole St-Denis, Tuo Li, Yana V. Miteva, Simon Hauri, Mihaela E. Sardiu,Teck Yew Low, Vincentius A. Halim, Richard D. Bagshaw, Nina C. Hubner, Abdallah al-Hakim, Annie Bouchard, Denis Faubert, Damian Fermin, Wade H. Dunham, Marilyn Goudreault, Zhen-Yuan Lin, Beatriz Gonzalez Badillo, Tony Pawson, Daniel Durocher, Benoit Coulombe, Ruedi Aebersold, Giulio Superti- Furga, Jacques Colinge, Albert J. R. Heck, Hyungwon Choi, Matthias Gstaiger, Shabaz Mohammed, Ileana M. Cristea, Keiryn L. Bennett, Mike P. Washburn, Brian Raught, Rob M. Ewing, Anne-Claude Gingras, Alexey I. Nesvizhskii., The CRAPome: a Contaminant Repository for Affinity Purification Mass Spectrometry Data. Nat Methods, in press, (2013). 46. Powell, D.W., et al., Cluster analysis of mass spectrometry data reveals a novel component of SAGA. Mol Cell Biol, 24, 7249-59 (2004). 47. Gavin, A.C., et al., Proteome survey reveals modularity of the yeast cell machinery. Nature, 440, 631-6 (2006). 48. Choi, H., S. Kim, A.C. Gingras, and A.I. Nesvizhskii, Analysis of protein complexes through model-based biclustering of label-free quantitative AP-MS data. Mol Syst Biol, 6, 385 (2010). 49. Wepf, A., T. Glatter, A. Schmidt, R. Aebersold, and M. Gstaiger, Quantitative interaction proteomics using mass spectrometry. Nat Methods, 6, 203-5 (2009). 50. Smits, A.H., P.W. Jansen, I. Poser, A.A. Hyman, and M. Vermeulen, Stoichiometry of chromatin-associated protein complexes revealed by label-free quantitative mass spectrometry-based proteomics. Nucleic Acids Res, 41, e28 (2013). 51. 
van Nuland, R., et al., Quantitative dissection and stoichiometry determination of the human SET1/MLL histone methyltransferase complexes. Mol Cell Biol, 33, 2067-77 (2013). 52. Bouwmeester, T., et al., A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol, 6, 97-105 (2004). 53. Bisson, N., et al., Selected reaction monitoring mass spectrometry reveals the dynamics of signaling through the GRB2 adaptor. Nat Biotechnol, 29, 653-8 (2011). 54. Rinner, O., et al., Identification of cross-linked peptides from large sequence databases. Nat Methods, 5, 315-8 (2008). 55. Leitner, A., et al., Probing native protein structures by chemical cross-linking, mass spectrometry, and bioinformatics. Mol Cell Proteomics, 9, 1634-49 (2010). 56. Herzog, F., et al., Structural probing of a protein phosphatase 2A network by chemical cross-linking and mass spectrometry. Science, 337, 1348-52 (2012). 57. Walzthoeni, T., A. Leitner, F. Stengel, and R. Aebersold, Mass spectrometry supported determination of protein complex structure. Curr Opin Struct Biol, (2013). 58. Havugimana, P.C., et al., A census of human soluble protein complexes. Cell, 150, 1068-81 (2012). 59. Kristensen, A.R., J. Gsponer, and L.J. Foster, A high-throughput approach for measuring temporal changes in the interactome. Nat Methods, 9, 907-9 (2012).


Chapter 2 Interaction Proteome of Human Hippo Signaling: Modular Control of the Transcriptional Co-activator YAP1

Simon Hauri, Alexander Wepf, Audrey van Drogen, Markku Varjosalo, Anton Vichalkovski, Nic Tapon, Ruedi Aebersold, and Matthias Gstaiger.

Published in Hauri, S., et al. (2013). "Interaction proteome of human Hippo signaling: modular control of the co-activator YAP1." Mol Syst Biol 9: 713.

Contribution by SH: protein purification experiments, functional experiments, generation of expression constructs, tissue culture, design and performance of the LC-MS/MS measurements, data analysis, and manuscript writing.


2.1. Abstract

Tissue homeostasis is controlled by signaling systems that coordinate cell proliferation, cell growth and cell shape in response to changes in the cellular environment. Deregulation of these processes is often associated with human cancer and can occur at multiple levels of the underlying signaling systems. To gain an integrated view on concurrent signaling modules controlling tissue cell growth, we analyzed the interaction proteome of the human Hippo pathway, an established signaling system for controlling metazoan tissue growth. The resulting high-resolution network model of 480 interactions among 270 network components suggests participation of Hpo pathway components in three separate modules that all converge on the transcriptional co-activator YAP1. One of the modules corresponds to the canonical Hippo kinase cassette, whereas the other two both contain Hpo components in complexes with cell polarity proteins. Quantitative proteomics data suggest that complex formation with cell polarity proteins is dynamic and depends on the integrity of cell-cell contacts. Collectively, our systematic analysis greatly enhances our insights into the biochemical landscape underlying human Hpo signaling and emphasizes multifaceted roles of cell polarity complexes in Hpo mediated tissue cell growth.


2.2. Introduction

Development of metazoan tissues and organs depends on tight control of proliferation, cell growth and programmed cell death in response to extracellular and intracellular signals. Genetic and biochemical experiments originally performed in Drosophila melanogaster led to the discovery of the Hippo (Hpo) pathway, a conserved signaling cascade for tissue and organ homeostasis in metazoans. Its core components are conserved in humans and have been implicated in a variety of human malignancies [1]. They include the STE20 kinases MST1 and MST2 (orthologs of the Drosophila Hippo kinase), which bind to SAV1 (WW45), the AGC kinase LATS1 (Large tumor suppressor homolog 1) and its associated scaffolding proteins MOB1A/B [2-4]. Downstream of this kinase cascade are the WW-domain containing transcriptional co-activators YAP1 and TAZ [5, 6], which represent the two major effectors known for the Hpo pathway. It has been established that active MST1/2 in complex with SAV1 phosphorylates LATS, which in turn stimulates LATS-MOB complex formation and activation of LATS kinase activity. Active LATS kinase phosphorylates YAP and TAZ and inactivates them by 14-3-3 protein mediated cytoplasmic sequestration [7]. When the Hpo pathway is inactive, hypo-phosphorylated YAP and TAZ are bound to the TEA domain transcription factors (TEAD1/2/3/4), translocate to the nucleus and activate proliferative and anti-apoptotic genes [8]. While the principal signaling mechanisms for the described Hpo core modules are well established, our understanding of the physiological cues, signaling components and mechanisms that control the activation and repression of the human Hpo pathway is still quite limited. Recent genetic data from Drosophila and biochemical analysis in human cells suggest that Hpo signaling is linked to cell polarity, the cytoskeleton and cell junctions.
However, the molecular systems that transmit signals to the Hpo core modules are just beginning to emerge, and there is debate as to whether these promising links are dependent on Hpo kinase or on mechanisms downstream of Hpo kinase. Since most proteins exert their function in the context of specific protein complexes, the characterization of complexes involving genetically defined Hpo components turned out to be a particularly successful approach to uncover novel regulators and mechanisms underlying the control of tissue growth by the Hpo signaling system. Affinity purification coupled to mass spectrometry (AP-MS) has proven to be a sensitive tool for the identification of novel protein interactions under physiological conditions [9-11]. Using a combined AP-MS and genomics approach, we recently identified dSTRIPAK, a phosphatase complex that associates with Hippo kinase and negatively regulates Hpo signaling in Drosophila [12]. AP-MS also revealed the regulatory interaction between the FERM domain protein Expanded (Ex) and the co-activator Yorkie (Yki) in Drosophila [13] and the presence of cell-cell junction proteins (AMOT, AMOTL1, AMOTL2) in human YAP and TAZ complexes [14, 15]. The few isolated AP-MS studies performed so far used different technologies applied to different cellular systems, which makes it hard to integrate the available data into coherent models of the protein interaction landscape underlying Hpo signaling. Such integrative models are however important, as the control of tissue and organ size can hardly be attributed to just single signaling components, but more likely emerges from concerted molecular events of multiple concurrent signaling systems. Significant advances in protein complex purification and mass spectrometry instrumentation permit robust characterization of larger groups of complexes and entire pathways, even from human cells [16-20]. Here we describe a systematic approach to characterize the protein landscape for the human Hpo pathway. Stringent scoring and cluster analysis of the obtained AP-MS data revealed 480 high confidence interactions that confirm many previous protein interactions found in humans and other species and provide novel biochemical context for Hpo pathway components. Hierarchical clustering of the obtained interaction data revealed a system of three major signaling modules linked to the transcriptional co-activator YAP1. Aside from the Hpo core kinase complex, the remaining two modules provide multiple links to apico-basal cell polarity (ABCP) and planar cell polarity (PCP). We identified the PP1-ASPP2 module as a regulatory element in controlling transcriptional outputs of the Hpo pathway and show that polarity proteins differentially bind YAP1 depending on biological conditions.
The presented data represent a rich biochemical framework providing 343 novel high confidence protein interactions for established pathway components, which will be important for directing future functional experiments to better model tissue growth control by the Hpo pathway. The results furthermore support the idea that transcriptional outputs of the major Hpo effector YAP1 may involve a number of biochemical processes linked to cell polarity that may act in parallel to or independent of the canonical Hpo core kinase cassette.


2.3. Results and Discussion

2.3.1. Characterization of the human Hippo interaction proteome by a systematic AP-MS approach

To resolve the interaction proteome underlying human Hippo signaling, we selected nine conserved core pathway components as initial baits for AP-MS analysis in HEK293 cells. This initial bait set included the human Hippo kinases MST1 and MST2, the transcriptional co-activator YAP1, the transcription factor TEAD3, as well as homologs of described Drosophila Hippo pathway regulators: FRMD6 (Willin; homolog of Expanded, Ex), the tumor suppressor protein MERL (homolog of Merlin) [21], and the WW-domain protein KIBRA [22]. From the interacting proteins identified in the first round of AP-MS experiments we subsequently selected 25 additional secondary bait proteins. Altogether we analyzed 96 AP-MS experiments covering 34 bait proteins by at least two biological replicates (Figure 2.1A, Figure 2.1B, Supplementary Table S2.1), which resulted in the identification of a total of 835 proteins at a protein false discovery rate (FDR) of <1%. To differentiate between high confidence interacting proteins (HCIPs) and nonspecific contaminants, we filtered our dataset based on WDN-score calculations [18] and relative protein abundance compared to control purification experiments, estimated by normalized spectral counting [23, 24] (Figure 2.1C). The data filtering yielded a final network of 270 HCIPs and 480 corresponding interactions (Supplementary Table S2.2). On average, we identified 14.7 HCIPs for each bait, which corresponds to the number of interactors typically found in similar AP-MS studies [17, 18, 20]. We also compared the obtained high confidence AP-MS data set with protein-protein interaction (PPI) data annotated in public databases. 71.5% of our interactions had not been reported at the time of submission and, as expected, the fraction of newly identified PPIs varied considerably across different bait proteins (Figure 2.1E). Well studied proteins (e.g. MST1, MST2, STRN, STRN3, AMOT, PP1G, RASF1) have many more annotated interaction partners than less intensively studied ones (e.g. E41L3, RASF10, RASF9, FRMD5). Inspection of public PPI data (including yeast two hybrid and in vitro binding assays) for the 34 baits analyzed in our study resulted in a network of 516 proteins and 719 protein interactions (Supplementary Table S2.3). 16% of these interactions were found in our AP-MS dataset, which corresponds to 137 known protein interactions. Since the FDR of public PPI data is largely unknown, we used the number of independent literature reports that support a given interaction as a proxy for data confidence (Supplementary Figure 2.1A). When we compared our data with a high confidence subset of public PPI data (>1 publication per interaction), our recall rate increased to >35% (Figure 2.1D). The fraction of high confidence public interactions matching our AP-MS data set was three times higher than that for the public PPI data we missed in our study, demonstrating the overall robustness of the presented PPI data (Supplementary Figure 2.1B). Further inspection of the experimental sources of matching public PPI data revealed that two thirds were obtained by other AP-MS studies (Supplementary Figure 2.1C). At least 28 independent publications were needed to cover the 137 annotated interactions in our study. Remarkably, given the collective efforts in the biochemical analysis of this pathway in the past, we identified 343 interactions for Hippo pathway components that so far were not annotated in public databases, which provide important new clues for understanding the molecular mechanisms underlying Hpo signaling in human cells.
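The comparison with public interaction data described above amounts to a set overlap between unordered protein pairs. The following minimal sketch illustrates how the number of known and novel interactions and the recall rate can be derived. The function name and the toy interaction lists with placeholder protein labels are hypothetical and do not reproduce the actual data set.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def compare_to_reference(observed: Iterable[Pair], reference: Iterable[Pair]) -> Dict[str, float]:
    """Compare observed interactions with a reference set of known interactions.

    Interactions are treated as unordered pairs, so (A, B) equals (B, A).
    """
    obs = {frozenset(pair) for pair in observed}
    ref = {frozenset(pair) for pair in reference}
    known = obs & ref
    novel = obs - ref
    return {
        "known": len(known),
        "novel": len(novel),
        "recall": len(known) / len(ref) if ref else 0.0,
        "novel_fraction": len(novel) / len(obs) if obs else 0.0,
    }


# Hypothetical toy data with placeholder protein labels.
observed = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P5")]
reference = [("P2", "P1"), ("P3", "P2"), ("P6", "P7")]

summary = compare_to_reference(observed, reference)
print(summary["known"], summary["novel"])  # prints 2 2

# Arithmetic check with the counts given in the text: 343 novel interactions
# out of 480 correspond to the reported 71.5%.
print(round(343 / 480 * 100, 1))  # prints 71.5
```

Restricting the reference set to interactions supported by more than one publication, as done in the text, only changes the `reference` input of such a comparison.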

Figure 2.1: A Systematic Affinity Purification Mass Spectrometry (AP-MS) Approach to Define the Human Hippo Pathway Interaction Proteome.


34 human proteins were affinity purified (Strep-tag) under native conditions from HEK293 cells to identify protein-protein interactions by tandem mass spectrometry. (A) Baits were selected sequentially, starting with the core components of the Hippo kinase signaling pathway and extended based on obtained results or homology. (B) Biochemical workflow for native protein complex purification from HEK293-Flp T-rex cells. Bait proteins were cloned into an expression vector containing a tetracycline inducible CMV promoter, a Strep-HA fusion tag, and single FRT sites. Isogenic cell lines were created by flippase mediated recombination and induced with doxycycline for 24h. Pellets from 10^8 cells were lysed under native conditions and affinity purified by the Strep-tag. The purified proteins were digested with trypsin, and C18 desalted peptides were analyzed by tandem mass spectrometry on an LTQ Orbitrap XL. (C) Data analysis pipeline. Acquired mass spectra from 96 experiments (at least 2 biological replicates per bait) were searched with X!Tandem and statistically validated by the Trans-Proteomic Pipeline (TPP) to match a protein identification false discovery rate of <1%. Unspecific binding proteins were filtered by a WDN-score threshold and quantitative comparison to control purification experiments. The resulting high confidence interactions were hierarchically clustered and visualized. (D) Recall of known interactions from public protein interaction databases. The recall rate was higher when compared to a more robust subset of literature interactions that required more than one independent mention per interaction. (E) Overview of all bait proteins and identified high confidence interacting proteins (HCIPs). Overall, 71.5% of identified interactions were novel and each bait associated on average with 14.7 proteins.

2.3.2. Hierarchical clustering assigns Hpo pathway components to interaction modules

Clustering of bait and prey proteins has been used successfully in the past to infer modular proteome organization from systematic AP-MS data [16]. Hierarchical cluster analysis based on protein abundance of prey proteins relative to the corresponding bait protein revealed three major clusters (hereafter referred to as “modules”; Figure 2.2). The first module (“Core Kinase Complex”) consists of three interlinked sub-clusters: STRIPAK, a group of complexes around protein phosphatase 2A linked to Hpo kinase; SARAH, a cluster containing all human SARAH domain proteins, including the human Hpo kinases MST1 and MST2; and finally the MOB1-LATS cluster. The second module (“PP1-ASPP”) corresponds to a single large cluster that is enriched for regulatory and catalytic subunits of protein phosphatase 1 (PP1) and contains proteins involved in apico-basal and planar cell polarity (ABCP and PCP). The third module represents a highly interlinked network that consists of a set of Hpo pathway members (LGL1, DLG1, MERLIN, KIBRA, YAP1, TEAD) and proteins linked to ABCP and is hence referred to as “Polarity network” (Figure 2.2B). We next analyzed the occurrence of structural domains (InterPro) across the 270 proteins found in the three modules and performed a hierarchical clustering of these domains


across all purifications. The obtained clusters correspond to the modular organization described above and thus revealed a characteristic enrichment profile of structural domains for each of the modules. Overall we noted a characteristic overrepresentation of specific domains, such as WW, PDZ, FERM, L27 and SARAH (Supplementary Figure S2.2A and S2.2B). Despite the apparent biochemical differences, all three modules converge on the transcriptional co-activator and Yki homolog YAP1, the only protein connecting all three pathway modules.


Figure 2.2: Hierarchical Clustering of Bait and HCIPs reveals three distinct modules. (A) Interactions between high confidence interacting proteins (HCIPs) were clustered by an uncentered Pearson correlation algorithm using an average distance metric. Three major clusters revealed a distinct modularity of the Hpo interaction proteome. The “core kinase complex” contains the STRIPAK kinase phosphatase complex, the SARAH domain proteins, including the Hpo kinases MST1 and MST2, and the downstream kinase LATS1 and its adaptor proteins MOB1A/B. The “PP1-ASPP module” contains all three PP1 catalytic and several regulatory subunits. The “Angiomotin cell polarity network” is defined by multiple polarity complexes and cell-cell contact proteins. It also contains the transcription factor TEAD3 and provides the most direct links to the transcription co-activator YAP1. (B) Graphical network representation of the obtained modules that demonstrates the degree of inter-module connectivity. YAP1 is the only protein that is common to all three modules.
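The clustering described in the caption can be illustrated with a small sketch. It assumes that "uncentered Pearson correlation" means the correlation computed without subtracting the means (equivalent to cosine similarity), that the distance is one minus this correlation, and that "average distance metric" refers to average linkage. The profile names and values are hypothetical and serve only to show the mechanics; this is not the software used in the study.

```python
import math

def uncentered_pearson(x, y):
    """Pearson correlation without mean subtraction (cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def distance(x, y):
    """Assumed distance: one minus the uncentered correlation."""
    return 1.0 - uncentered_pearson(x, y)

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering; profiles maps a protein name to its abundance vector."""
    clusters = [[name] for name in sorted(profiles)]

    def cluster_distance(c1, c2):
        # Average of all pairwise distances between members of the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(distance(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]

# Hypothetical prey abundance profiles across four bait purifications.
toy_profiles = {
    "preyA": [1.0, 0.9, 0.0, 0.0],
    "preyB": [0.8, 1.0, 0.0, 0.1],
    "preyC": [0.0, 0.0, 1.0, 0.7],
    "preyD": [0.1, 0.0, 0.9, 1.0],
}
modules = average_linkage(toy_profiles, n_clusters=2)
```

In this toy example, preys that co-purify with the same baits end up in the same cluster, which is the sense in which the text speaks of "modules".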

2.3.3. Topology of the Hippo Core Kinase Complex

Orthologs of the Drosophila proteins Hpo, Sav and Wts encode the highly conserved core kinase cassette of the Hippo pathway. Our analysis of the human orthologs of Hpo, Sav and Wts uncovered three major sub-clusters linked to the core kinase module: MOB1-LATS, SARAH and STRIPAK (Figure 2.3A). MOB1-LATS represents the smallest cluster within this module. It contains the Mats homologs MOB1A and MOB1B, which we found in complexes with the protein kinases ST38L and LATS1. LATS1 has been shown to act as a substrate for MST1 but also serves as upstream kinase for the phosphorylation and inactivation of YAP1. Under the experimental conditions applied we identified stable kinase substrate complexes between LATS1 and YAP1, but not between MST1 and LATS1. In MST1/2 complexes we found all human SARAH domain containing proteins, which besides MST1/2 include the Ras-association domain proteins RASF1-6 and the WW domain protein SAV1 (Figure 2.3A). It has been proposed that SARAH domain proteins undergo complex formation via homotypic dimerization with other SARAH domain proteins and some of these interactions have been linked to the regulation of MST1/2 kinase [25]. It is not clear, however, whether the full range of combinatorial possibilities indeed occurs or whether only a subset of dimeric SARAH domain pairs can be formed in human cells. To address this question and to resolve potential differences in concurrent SARAH domain complexes we included all human SARAH proteins as baits in our AP-MS analysis. In contrast to public protein interaction data [26], we observed a high degree of mutually exclusive binding between the two Hpo kinases and the remaining SARAH domain proteins. Only MST1/2 could undergo combinatorial complex formation with the other SARAH domain proteins, whereas RASF1-6 and SAV1 complexes exclusively contained


either MST1 or MST2, but none of the other SARAH proteins (Figure 2.3B). Based on the average of the three most abundant precursor ion intensities per protein (TOP3) [19, 27, 28] we quantified the relative abundance of MST1/2 interactors. The MST1-MST2 heterodimer represents the most abundant SARAH domain assembly, followed by complexes between MST1/MST2 and RASF2 or SAV1 (Figure 2.3B and 2.3C). These results suggest a system of 15 concurrent SARAH protein sub-complexes, which raises the question of whether these sub-complexes may have specific biochemical functions. The identification of cellular proteins that bind both RASF family members and MST1/2 may provide a first hint at potential functional diversification of the RASF-MST assemblies found. RASF1/3/5, for example, were identified in complexes with MAP1S and MAP1B, two microtubule associated proteins also found in MST1/2 complexes, suggesting a potential role for this subset of RASF-MST assemblies in controlling microtubules. In this regard it has been reported that RASF1 stabilizes microtubules through interaction with MAP1 proteins [29, 30]. The exportin protein XPO6, on the other hand, co-purified exclusively with RASF1 and MST1, and the vesicle-associated membrane protein-associated proteins VAPA and VAPB were found only in RASF3 and MST1 complexes, providing further evidence for potential functional diversification of RASF-Hippo kinase complexes. In this context the analysis of RASF3 complexes additionally revealed eight members of the human STRIPAK complex. Two of these components (STRN3, SLMAP) were also detected in MST1 complexes, suggesting that human STRIPAK can bind to both MST1 and RASF3 or complexes thereof. No STRIPAK components were detected with other RASF family members (Figure 2.3A).
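The TOP3 measure used above can be written down compactly. The following is a minimal sketch, assuming that a protein with fewer than three quantified precursors is averaged over the precursors available and that prey abundance is expressed relative to the bait in the same purification; both function names and all intensity values are illustrative and not taken from the analysis software used in this work.

```python
def top3_abundance(precursor_intensities):
    """Average of the three most intense precursor ions of one protein."""
    top = sorted(precursor_intensities, reverse=True)[:3]
    if not top:
        return 0.0
    # Assumption: with fewer than three precursors, average what is available.
    return sum(top) / len(top)

def relative_to_bait(prey_intensities, bait_intensities):
    """Prey TOP3 abundance divided by the bait TOP3 abundance."""
    bait = top3_abundance(bait_intensities)
    if bait == 0:
        raise ValueError("bait was not quantified")
    return top3_abundance(prey_intensities) / bait

# Illustrative intensities in arbitrary units.
prey_top3 = top3_abundance([5.0, 1.0, 9.0, 7.0, 3.0])            # (9 + 7 + 5) / 3
ratio = relative_to_bait([4.0, 2.0, 6.0, 1.0], [10.0, 20.0, 30.0])
```

Expressing each prey relative to its bait makes purifications with different bait expression levels more comparable, which is presumably why relative abundances are reported in Figure 2.3.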
It appears that the STRIPAK-Hpo interaction is conserved, since we could previously show that subunits of the Drosophila STRIPAK complex also bind to Hpo and RASF, where they act as negative regulators of Hpo signaling by recruitment of the protein phosphatase PP2A [12]. The phosphatase inhibitor okadaic acid (OA) has been shown to activate the human Hippo pathway [31, 32] and thus may change the composition of MST1/2 complexes. Based on the average TOP3 precursor intensities, we measured the relative abundance of MST1/2 associated proteins in the presence or absence of OA (Figure 2.3C, Supplementary Figure S2.3). Whereas interactions with SARAH domain proteins were largely unaffected or mildly decreased upon OA treatment, we found a strong increase of all STRIPAK subunits associated with MST1/2 in OA treated cells. This indicates that under exponential growth conditions, the


amount of STRIPAK proteins bound to human Hpo is relatively low compared to the amount of associated SARAH proteins, but may significantly increase upon changes in protein phosphorylation as illustrated by OA treatment (Figure 2.3C). By including STRN, STRN3 and SLMAP as baits we could detect all known STRIPAK subunits including the GCK-III subfamily of Ste20 protein kinases, STK24 and MST4, and subunits of the protein phosphatase PP2A. Interestingly, we found that only a specific subtype of the STRIPAK complex containing SLMAP, FGOP1 and SIKE1, but not CTTNBP2 and CTTNBP2NL, binds to Hpo kinases and RASF3. In comparison, the related kinases STK23, STK24 and MST4 are able to bind both subtypes of STRIPAK [33]. Whether STRIPAK mediated recruitment of PP2A leads to dephosphorylation and inhibition of the human Hpo kinases like in Drosophila remains to be tested, but our dataset provides an interesting perspective on the topology of the Hpo kinase core complexes. Contrary to the general view [1, 34], our data suggest that the Hpo kinase, rather than being a single kinase unit, represents a highly dynamic system of multiple concurrent kinase complexes, some of which may be regulated by a specific human STRIPAK-PP2A complex.


Figure 2.3: The Human STRIPAK Complex Associates with RASF3 and MST1/2. (A) High resolution interaction map of the canonical human Hippo pathway and STRIPAK complex. Bait proteins are indicated as hexagons, prey proteins as circles. Solid lines are interactions found by AP-MS in this study; dotted, red lines are obtained from public protein interaction databases. The Hippo kinase homologs MST1 and MST2 interact with SAV1 and all RASF1-6 proteins and together form the SARAH module. The assembly of MOB1A/B and LATS1 forms the downstream kinase cascade of Hpo and associates with the LATS1 substrate YAP1. MST1 interacts with the STRIPAK complex via STRN and SLMAP, but most connections between STRIPAK and the SARAH module occur via RASF3. (B) Interactions between SARAH domain proteins MST1/2 and RASF1-6. All RASF proteins interact with MSTs but not with other RASF proteins or SAV1. MST1/2 seem able to form heterodimers with each other as well as with the remaining SARAH domain proteins. Most identified interactions (green lines) could be quantified (red lines) by the average intensity of the three most abundant precursors per protein. The line width represents relative bait abundance. The strongest interaction occurs within the MST1/2 heterodimer, whereas the predominant RASF-MST interaction was RASF2 and MST1, or RASF2 and MST2, respectively. (C) Abundance changes of interacting proteins of MST1 upon okadaic acid stimulation. HEK293 cells expressing Strep-HA tagged MST1 were treated with 100nM okadaic acid (OA) for 2h. The purified MST1 complexes contained an increased amount of STRIPAK associated proteins, whereas SARAH module components only show marginal changes.
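One way to arrive at the number of 15 concurrent SARAH sub-complexes is to enumerate all heterotypic pairs among the nine human SARAH domain proteins and keep only those that contain at least one Hpo kinase, as panel (B) describes. The sketch below makes this counting explicit; it assumes that homodimers are not counted, which is an interpretation rather than a statement from the data.

```python
from itertools import combinations

HPO_KINASES = {"MST1", "MST2"}
OTHER_SARAH = ["RASF1", "RASF2", "RASF3", "RASF4", "RASF5", "RASF6", "SAV1"]

def observed_sarah_pairs():
    """Heterotypic SARAH pairs that include at least one Hpo kinase."""
    proteins = sorted(HPO_KINASES) + OTHER_SARAH
    return [
        (a, b)
        for a, b in combinations(proteins, 2)
        if a in HPO_KINASES or b in HPO_KINASES
    ]

pairs = observed_sarah_pairs()
all_pairs = list(combinations(sorted(HPO_KINASES) + OTHER_SARAH, 2))
```

Of the 36 heterotypic pairs that are combinatorially possible, this rule leaves the MST1-MST2 heterodimer plus seven partners each for MST1 and MST2, that is 1 + 7 + 7 = 15 assemblies.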

2.3.4. The PP1-ASPP module provides links to apico-basal and planar cell polarity

Apart from the canonical Hpo kinase cassette, we identified a module linked to YAP1 which is centered on the serine protein phosphatase 1 (PP1), and contains several proteins associated with the control of apico-basal cell polarity (ABCP) as well as planar cell polarity (PCP). We initially found PP1 together with ASPP2 (Apoptosis-stimulating of p53 protein 2) in YAP1 complexes. Analysis of ASPP2, its paralog ASPP1, and the PP1G catalytic subunit as AP-MS baits resulted in an extended network which includes all three PP1 catalytic subunits (PP1A, PP1B, PP1C) and multiple regulatory subunits (Figure 2.4A; Supplementary table S2.2). Both ASPP1 and ASPP2 can interact with all PP1 catalytic subunits, the coiled coil proteins CC85B and CC85C and the Ras-association domain family proteins RASF7, RASF8 and RASF9. We subsequently included RASF7/8/9/10 as baits in our AP-MS analysis. Interaction data from these poorly studied proteins further refined the ASPP/RASF/PP1 sub-network. As ASPP1 and ASPP2 do not interact with each other, we suggest the formation of mutually exclusive ASPP/RASF/PP1 complex isoforms. Whereas all four RASF proteins share the association with PP1 and ASPP, the individual RASF members also form paralog specific complexes. In both RASF9 and RASF10 complexes we found the human homologs of the Drosophila segment polarity protein Dishevelled


DVL1/2/3, together with VANG1 and PRIC1/3, which have been implicated in the PCP pathway [35, 36]. RASF9/10 also interact with alpha and beta subunits of casein kinase II (CSK21 and CSK2B), the upstream kinases which phosphorylate and activate DVL proteins [37, 38]. PARD3, an additional important cell polarity regulator, was found in complexes containing ASPP2, RASF9 and RASF10. PARD3 forms the Par polarity complex together with PARD6 and the atypical protein kinase C (aPKC), two proteins we identified in L2GL1 complexes described below [39]. Such an atypical PP1-PARD3 complex has been reported before, where PP1A directly dephosphorylates PARD3 to stabilize a functional Par/aPKC complex [40]. We could not detect any of the RASF9/10 associated cell polarity proteins by using the related RASF7/8 as baits. Besides the mentioned cell polarity regulators in RASF9/10 complexes, we also found the PDZ protein SCRIB in PP1G phosphatase complexes. SCRIB is the human ortholog of Drosophila Scribble, a protein that controls apico-basal polarity together with Lgl and Dlg and has been linked to Hpo signaling [41, 42]. In this regard, Scribble has been genetically linked in Drosophila to PP1 phosphatase, via PP1R7 (Sds22) [43], a regulatory PP1 subunit we also identified with PP1G. Whether PP1R7 and SCRIB interact together with PP1G in the same sub-complex remains to be tested.
Our AP-MS data also revealed previously known PP1 sub-complexes including the PTW/PP1 phosphatase complex formed by the PP1G interacting proteins PP1RA, WDR82 and TOX4, which has been shown to regulate chromatin structure during the cell cycle [44], and the PP1G associated URI prefoldin complex linked to S6 kinase signaling [45]. Collectively, the presence of the different groups of cell polarity proteins in the PP1-ASPP module suggests an interesting role for this module in controlling cell polarity processes by a variety of molecular mechanisms, some of which are likely to involve established Hpo signaling components.

2.3.5. PP1/ASPP2 complexes promote YAP1 activity

Since we found PP1A and ASPP2 in complexes with YAP1 we wanted to test whether these proteins or other components from the PP1 module might regulate YAP1 dependent transcription. We applied a Dual Luciferase reporter (DLR) system containing a promoter with TEAD transcription factor binding sites to measure YAP1/TEAD transcriptional


activity in response to increased levels of PP1 network components. As a negative control we used a luciferase reporter lacking TEAD binding sites and included MST1 and YAP1 as a reference to ensure assay specificity. Similar to earlier reports, MST1 decreased [46] whereas YAP1 overexpression increased transcription from the TEAD reporter [47]. We generated a set of expression constructs for the expression of PP1 network components and overall validated transgenic expression of 24 components from the PP1-ASPP module by western blotting (Supplementary Figure S2.4). Upon transient expression of the corresponding transgenes, we found that out of the 24 components tested only overexpression of PP1G, PP1A and ASPP2 strongly enhanced the expression from the TEAD luciferase reporter construct (Figure 2.4B). Remarkably, the related proteins ASPP1 and PP1B had no effect, which agrees with our interaction data, where we only found ASPP2 and PP1A to bind YAP1 but not ASPP1 and PP1B. In conclusion, we were able to show that PP1A, PP1G, and ASPP2 overexpression increases YAP1 mediated transcriptional activity and suggest that ASPP2 is the YAP1 interacting determinant of the PP1-ASPP module and could provide PP1A and PP1G substrate specificity.


Figure 2.4: The ASPP-PP1 Module Provides Links to Cell Polarity and Modulates YAP1 Mediated Transcriptional Activity. Detailed view of the PP1-ASPP module: ASPP and PP1A were found with YAP1 and determined the selection of a subsequent bait protein set. (A) The PP1-ASPP “core network” was defined by interaction data from ASPP1/2, PP1G, and RASF7/8/9/10 purification experiments. Besides the bait proteins the core includes PP1A/B and CC85B/C. Three different cell polarity complexes could be linked to the PP1-ASPP core network (highlighted in blue). The polarity determinant SCRIB (Scribble) was found with PP1G, the Par polarity complex component PARD3 was identified with RASF9/10 and ASPP2, and proteins linked to planar cell polarity (VANG1, PRIC1/3 and DVL1/2/3) were co-purified with RASF9/10. (B) Dual Luciferase Reporter assay (DLR) of overexpressed PP1-ASPP components. HEK293-Flp YAP1 cells were transfected with a firefly luciferase construct containing four TEAD transcription factor binding sites to test whether overexpression of PP1-ASPP network components could influence YAP1/TEAD transcriptional activity. The indicated proteins were co-transfected for transient overexpression together with low levels of doxycycline-induced YAP1 expression. The measured firefly activities were normalized to the activity of a constitutively co-expressed Renilla luciferase. YAP1 (activator) and MST1 (inhibitor) overexpression were used as positive controls. Proteins that did not modulate TEAD transcriptional activity by more than threefold were not considered as bona fide regulators. ASPP2, PP1A and PP1G clearly increased TEAD activity.
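The normalization and the threefold criterion described for panel (B) can be sketched as follows. The sketch assumes that the fold change is taken relative to a reference transfection and that "more than threefold" applies in both directions (activation above 3x or repression below 1/3x); the readings in the example are invented and only illustrate the arithmetic.

```python
def normalized_activity(firefly, renilla):
    """Firefly signal normalized to the co-expressed Renilla signal."""
    if renilla <= 0:
        raise ValueError("Renilla reading must be positive")
    return firefly / renilla

def fold_change(sample, reference):
    """Ratio of normalized activities; each argument is a (firefly, renilla) pair."""
    return normalized_activity(*sample) / normalized_activity(*reference)

def is_candidate_regulator(fc, threshold=3.0):
    """Assumed reading of the threefold rule: activation or repression beyond 3x."""
    return fc > threshold or fc < 1.0 / threshold

# Hypothetical luminometer readings (arbitrary units).
reference = (200.0, 100.0)
activator_fc = fold_change((900.0, 100.0), reference)
neutral_fc = fold_change((300.0, 100.0), reference)
```

Dividing by the Renilla signal corrects for differences in transfection efficiency and cell number between wells, so that only changes specific to the TEAD reporter remain.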


2.3.6. A Cell Polarity Network linked to L2GL1, Kibra, Merlin and YAP1

The Hpo core kinase cassette has long been recognized as the major upstream regulatory module of YAP1 [48]. Recent studies, however, indicated multiple modes of Hpo independent YAP1 regulation, which involve signals from cell-cell contact [14, 15, 49], mechanical stress [50, 51] and cell polarity [42, 52], which suggests a role for YAP1 as a major hub for signal integration. Indeed, we found 30 HCIPs in YAP1 complexes, 21 of which have not been observed previously. Besides the interactions with canonical Hpo components, including the established upstream inhibitory kinase LATS1 or the transcription factor TEAD3, we identified a large group of proteins linked to cell polarity and cell junction complexes. To gain a more detailed view on the organization of this cell polarity network linked to YAP1 we included several proteins of the cell junction complex (AMOT, MPP5, LIN7A), the FERM domain proteins (FRMD3, FRMD5, FRMD6, E41L3), in addition to Hpo components previously linked to cell polarity (L2GL1) or the cell cortex (MERL, KIBRA) as baits in our AP-MS experiments (Figure 2.5A). Analysis of complexes containing L2GL1, the human ortholog of Drosophila tumor suppressor lethal giant larvae (Lgl), revealed the presence of all major components of the Par polarity complex. They include the aPKC subunits KPCI and KPCZ, SQSTM, a protein previously found associated with aPKC [17], and the cell polarity proteins PARD6B and PAR6G. We did not find any interactions between L2GL1 and other canonical Hpo pathway members, however. FRMD6 is the closest human homolog of the Drosophila protein Expanded (Ex), which has been shown to associate with Mer and Kibra to form a complex involved in the regulation of Hpo kinase activity [21, 22].
We could not confirm any homologous interactions with FRMD6 (or FRMD3 and FRMD5) but instead found interactions between YAP1, MERL, KIBRA and the junctional protein AMOT (Angiomotin). YAP1 binds directly by its WW domain to the PPxY motif of Angiomotin and its paralogues (AMOL1 and AMOL2) [14, 15, 53]. Drosophila Ex also has been found to bind the Yki WW domain by its PPxY motif and inhibit Yki activity [13]. The level of evolutionary Hpo pathway conservation between Drosophila and higher eukaryotes is currently of great interest. Genevet and Tapon proposed that AMOT might be the functional homolog of Ex in mammals [54] and a recent evolutionary study concluded that FRMD6 is unlikely to be


the functional homolog of Ex, as it lacks a PPxY motif containing C-terminal domain [55]. Both of these claims are consistent with our experimental data. Analogous to Ex in Drosophila, AMOT forms a highly interlinked network with proteins found in apico-basal polarity complexes (LIN7C, MPP5, MPDZ, INADL), which share a characteristic L27 protein binding domain (Figure 2.5A). L27 domains form heterotetrameric complexes with each other [56] and show significant enrichment in our network (Supplementary Figure S2B). When we used LIN7A as bait, we placed six additional L27 domain proteins next to the already mentioned ones: LIN7A itself and the MAGUK proteins CSKP, MPP2/6/7 and DLG1, the homolog of Drosophila Disks large 1 (Dlg). Two other protein binding domains, FERM and PDZ, were also highly overrepresented in our network (Supplementary Figure S2.2B). The FERM domains are often found in proteins that interface between membrane associated proteins and the cytoskeleton [57]. Eight of the nine FERM containing proteins from our AP-MS data set are exclusively found in the cell polarity module. Among the novel PDZ proteins associated with YAP1 we found RAPGEF2/6, which function as GEFs for the Ras GTPase RAP1. RAP1 plays a role in the formation of adherens junctions (AJ) and RAPGEF1 was shown to be required for AJ maturation [58], which may suggest a link between these processes and the control of YAP1. There have also been studies that show interactions between RAPGEF6 and another WW protein, BAG3, which is involved in mechanotransduction [59]. YAP1 nuclear localization is enhanced upon low density growth or disruption of cell-cell contacts [46, 60], which raises the question of how YAP1 complex formation may be affected under these conditions. We therefore monitored the abundance of YAP1 interacting proteins in cells subjected to non-adhesive growth over a time course of one hour.
Based on the average TOP3 intensity (see above) we were able to reliably quantify 18 YAP1 associated proteins across the entire time course. Consistent with the observation that YAP1 nuclear localization is enhanced upon disruption of cell-cell contacts, we found its association with the TEAD3 transcription factor increased when cells are grown in suspension. We noticed a drastic and concerted drop in cell polarity proteins (AMOT, AMOL1, AMOL2, LIN7C, and MPDZ) associated with YAP1 upon disruption of cell-cell contacts, whereas other YAP1 associated proteins such as 14-3-3 proteins were largely unaffected. These results show that the interaction of YAP1 with the cell polarity network is highly dynamic and thus suggest a potential role for the polarity network as an integrated


signaling system that controls the transcriptional co-activator YAP1 in response to changes at intercellular junctions.

Figure 2.5: The Angiomotin Polarity Network is Sensitive to Disruption of cell-cell contacts. (A) Besides TEAD3, YAP1 binds to an extended network of FERM, PDZ and L27 domain proteins, which contains AMOT as a central node. AMOT and AMOL1 connect FERM domain proteins with the tight junction associated L27 and PDZ proteins and also bind directly to YAP1. * refers to supplementary Figure S2.5. (B) YAP1 interactome after disruption of cell-cell contact by cell trypsinization and growth in suspension over a time course of 1h. The tight junction associated proteins LIN7C, MPDZ, AMOT, AMOL1 and AMOL2 are concertedly decreased, whereas the transcription factor TEAD3 is increased. 14-3-3 proteins only show a slight change.


2.4. Conclusions

Metazoan tissue homeostasis at the cellular level emerges from the interplay of concurrent signaling systems that integrate and translate information from neighboring cells, growth factors and mechanical forces to coordinate cell growth, proliferation, apoptosis as well as cell shape changes in a tissue context. Initial models on the regulation of metazoan tissue cell growth by the Hpo pathway were centered on the Hpo core kinase cassette involving a linear array of regulatory relationships among canonical Hpo core components that control the transcriptional effector Yki/YAP. But what are the molecules and mechanisms that link processes relevant for tissue integrity such as mechanotransduction or cell polarity to growth control by the Hpo pathway? Recent genetic and biochemical data on individual pathway components revealed first links between the Hpo pathway and proteins at the cellular membrane and intercellular junction, which suggest promising new ideas on how processes linked to cell polarity and epithelial plasticity might translate into growth and proliferation programs by the Hpo pathway [61]. Most of these recent insights were, however, obtained on isolated Hpo pathway components studied in different cellular contexts using different biochemical approaches, which complicates the integration of such information into coherent models on tissue cell growth. In this study we present a first integrated model on the biochemical landscape underlying human Hpo signaling using an unbiased proteomics approach in a defined cellular context. The integrated view presented here suggests several new insights into the global biochemical organization of the Hpo pathway and its relationship to coexisting cell polarity modules. First, human hippo kinase can be viewed as a system of concurrent kinase sub-complexes which are largely formed by homotypic SARAH domain interactions involving all nine human SARAH domain proteins.
These sub-complexes showed overlapping but also highly specific interactions with other cellular proteins, indicating potential functional specification of hippo kinase sub-complexes. Furthermore, human Hpo kinase can bind to STRIPAK, a protein phosphatase 2 complex that binds and negatively regulates Hpo in Drosophila. Given this evolutionary conservation it remains to be seen whether STRIPAK may also act as a negative regulator of MST1/2 kinase in human cells. Second, clustering of AP-MS data revealed that cell polarity proteins interacting with Hpo core components can be assigned to two separate modules (“ASPP/PP1” and “Cell Polarity”) that are biochemically separated from the Hpo


core kinase module. The modular structure of the Hpo interaction proteome, together with our finding that all three modules converge on the major Hpo effector YAP1, would be consistent with an emerging view that cell polarity signaling may control YAP1 in parallel to or even independently of the canonical Hpo core kinase module. This contrasts with models based on genetic studies in Drosophila which place cell polarity regulators upstream of Hpo [21, 62]. Third, our results also showed that the interaction proteome of major Hpo components is highly dynamic and changes in response to an altered cellular environment. In YAP1 complexes we found a strong decrease in abundance of cell polarity proteins upon disruption of cell-cell contacts. These results suggest that cell polarity and cell junction complexes may constitute dynamic signaling modules that relay information on tissue integrity towards the major Hpo effector YAP1.

Besides these general conclusions on the overall organization of the Hpo pathway interaction proteome, this work identified 347 new protein interactions for human Hpo components. These interactions represent an important resource for directing future functional studies to better understand the detailed mechanisms underlying the molecular coupling of cell polarity signaling to cell growth control by the human Hpo pathway.


2.5. Experimental Procedures

Expression constructs

To generate expression vectors for tetracycline-induced expression of N-terminally SH-tagged bait proteins, human ORFs provided as pDONR™223 vectors were picked from a Gateway® compatible human orfeome collection (horfeome v5.1, Open Biosystems) for LR recombination with the customized destination vector pcDNA5/FRT/TO/SH/GW [63]. Genes not in the human orfeome collection were amplified from the MegaMan Human Transcriptome Library (Agilent) by PCR and cloned into entry vectors by TOPO (pENTR™-TOPO®) or BP clonase reaction (pDONR™223; Invitrogen).

Stable cell line generation

Flp-In HEK293 cells (Invitrogen) containing a single genomic FRT site and stably expressing the tet repressor were cultured in DMEM (4.5 g/l glucose, 10% FCS, 2 mM L-glutamine, 50 mg/ml penicillin, 50 mg/ml streptomycin) containing 100 µg/ml zeocin and 15 µg/ml blasticidin. The medium was exchanged with DMEM medium containing 15 µg/ml blasticidin before transfection. For cell line generation, Flp-In HEK293 cells were co-transfected with the corresponding expression plasmids and the pOG44 vector (Invitrogen) for co-expression of the Flp-recombinase using the Fugene transfection reagent (Roche). Two days after transfection, cells were selected in hygromycin-containing medium (100 µg/ml) for 2–3 weeks.

Protein purification

Stably expressing cell lines for protein production were grown in four 14 cm Nunclon dishes to 80% confluency, induced with 1.3 µg/ml doxycycline for 24h, and harvested with PBS containing 10 mM EDTA. The suspended cells were pelleted and drained from the supernatant for subsequent shock-freezing in liquid nitrogen and long term storage at -80°C. The frozen cell pellets were resuspended in 4 ml HNN lysis buffer (50 mM HEPES pH 7.5, 150 mM NaCl, 50 mM NaF, 0.5% Igepal CA-630 (Nonidet P-40 Substitute), 200 µM Na3VO4, 1 mM PMSF, 20 µg/ml Avidin and 1x Protease Inhibitor mix (Sigma)) and incubated on ice for 10 min. Insoluble material was removed by centrifugation. Cleared


lysates were loaded on a pre-equilibrated spin column (Bio-Rad) containing 50 µl Strep-Tactin sepharose beads (IBA Biotagnology). The beads were washed two times with 1 ml HNN lysis buffer and three times with HNN buffer (50 mM HEPES pH 7.5, 150 mM NaCl, 50 mM NaF). Bound proteins were eluted with 600 µl 0.5 mM Biotin in HNN buffer. To remove the biotin, the samples were TCA precipitated, washed with acetone, air-dried and re-solubilized in 50 µl 8 M Urea in 50 mM NH4HCO3 pH 8.8. Cysteine bonds were reduced with 5 mM TCEP for 30 min at 37°C and alkylated in 10 mM iodoacetamide for 30 min at room temperature in the dark. Samples were diluted with NH4HCO3 to 1.5 M Urea and digested with 1 mg trypsin (Promega) overnight at 37°C. The peptides were purified using C18 microspin columns (The Nest Group Inc.) according to the protocol of the manufacturer and redissolved in 0.1% formic acid, 1% acetonitrile for mass spectrometry analysis.
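The dilution step before digestion follows from conservation of the amount of urea (c1 x V1 = c2 x V2). As a worked example, assuming the sample starts in 50 µl of 8 M urea and is brought to 1.5 M, the volume of NH4HCO3 buffer to add can be estimated as below. The function name is illustrative, and the calculation ignores the small volumes of TCEP and iodoacetamide added in between.

```python
def buffer_volume_to_add(start_volume_ul, start_conc_molar, final_conc_molar):
    """Volume of diluent (µl) needed to reach the final concentration."""
    if not 0 < final_conc_molar <= start_conc_molar:
        raise ValueError("final concentration must be positive and not exceed the start")
    final_volume_ul = start_volume_ul * start_conc_molar / final_conc_molar
    return final_volume_ul - start_volume_ul

# 50 µl at an assumed 8 M urea, diluted to 1.5 M.
added_ul = buffer_volume_to_add(50.0, 8.0, 1.5)
final_ul = 50.0 + added_ul
```

Under these assumptions roughly 217 µl of buffer are added, giving a digestion volume of about 267 µl.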

Mass spectrometry

LC-MS/MS analysis was performed on an LTQ Orbitrap XL mass spectrometer (Thermo Fisher Scientific). Peptide separation was carried out by a Proxeon EASY-nLC II liquid chromatography system (Thermo Fisher Scientific) connected to an RP-HPLC column (75 µm x 10 cm) packed with Magic C18 AQ (3 µm) resin (WICOM International), running a linear gradient from 95% solvent A (0.1% formic acid, 2% acetonitrile) and 5% solvent B (98% acetonitrile, 0.1% formic acid) to 35% solvent B over 60 min at a flow rate of 300 nl/min. The data acquisition mode was set to obtain one high resolution MS scan in the Orbitrap (resolution 60,000 at 400 m/z). The 6 most abundant ions from the first MS scan were fragmented by collision-induced dissociation (CID) and MS/MS fragment ion spectra were acquired in the linear trap quadrupole (LTQ). Charge state screening was enabled and unassigned or singly charged ions were rejected. The dynamic exclusion window was set to 15 s and limited to 300 entries. Only MS precursors that exceeded a threshold of 150 ion counts were allowed to trigger MS/MS scans. The ion accumulation time was set to 500 ms (MS) and 250 ms (MS/MS) using a target setting of 10^6 (MS) and 10^4 (MS/MS) ions. After every replicate set, a peptide reference sample containing 200 fmol of human [Glu1]-Fibrinopeptide B (Sigma-Aldrich) was analyzed to monitor the performance of the LC-MS/MS system.
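The precursor selection rules described above (top 6 ions, charge state screening, intensity threshold, dynamic exclusion) can be illustrated with a minimal sketch. This is not the instrument firmware logic: the `Peak` record, the `select_precursors` function, the m/z matching tolerance and the rule of evicting the oldest entry from a full exclusion list are all assumptions made for illustration only.

```python
from collections import namedtuple

TOP_N = 6                   # most abundant ions fragmented per MS scan
MIN_ION_COUNTS = 150.0      # precursor intensity threshold
EXCLUSION_SECONDS = 15.0    # dynamic exclusion window
EXCLUSION_LIST_SIZE = 300   # maximum number of excluded entries
MZ_TOLERANCE = 0.01         # assumed m/z matching width, not from the text

# charge == 0 means the charge state could not be assigned
Peak = namedtuple("Peak", "mz intensity charge")


def select_precursors(peaks, scan_time, exclusion):
    """Return the peaks chosen for MS/MS; `exclusion` maps m/z -> time added."""
    # Forget entries whose exclusion window has elapsed.
    expired = [m for m, t in exclusion.items() if scan_time - t >= EXCLUSION_SECONDS]
    for mz in expired:
        del exclusion[mz]

    def is_excluded(mz):
        return any(abs(mz - m) <= MZ_TOLERANCE for m in exclusion)

    # Reject unassigned or singly charged ions, weak ions and excluded ions.
    candidates = [
        p for p in peaks
        if p.charge >= 2 and p.intensity > MIN_ION_COUNTS and not is_excluded(p.mz)
    ]
    candidates.sort(key=lambda p: p.intensity, reverse=True)
    selected = candidates[:TOP_N]

    for p in selected:
        if len(exclusion) >= EXCLUSION_LIST_SIZE:
            # Assumed policy: drop the oldest entry when the list is full.
            del exclusion[min(exclusion, key=exclusion.get)]
        exclusion[p.mz] = scan_time
    return selected


scan = [Peak(500.0, 1000.0, 2), Peak(600.0, 5000.0, 1), Peak(700.0, 100.0, 2),
        Peak(800.0, 2000.0, 3), Peak(900.0, 3000.0, 0)]
excluded = {}
first = select_precursors(scan, 0.0, excluded)   # selects m/z 800 and 500
repeat = select_precursors(scan, 5.0, excluded)  # both still excluded
```

In this toy scan the most intense ion (m/z 600) is skipped because it is singly charged, and a repeat scan 5 s later selects nothing because both eligible precursors are still inside the 15 s exclusion window.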


Protein identification

Acquired spectra were searched with X!Tandem [64] against the canonical human proteome reference dataset (http://www.uniprot.org/), extended with reverse decoy sequences for all entries. The search parameters were set to include only fully tryptic peptides (KR/P) containing up to two missed cleavages. Carbamidomethyl (+57.021465 amu) on Cys was set as static peptide modification. Oxidation (+15.99492 amu) on Met and phosphorylation (+79.966331 amu) on Ser, Thr, Tyr were set as dynamic peptide modifications. The precursor mass tolerance was set to 25 ppm, the fragment mass error tolerance to 0.5 Da. The obtained peptide spectrum matches were statistically evaluated using PeptideProphet, and protein inference was performed with ProteinProphet, both part of the Trans-Proteomic Pipeline (TPP, v.4.5.1) [65]. A minimum protein probability of 0.9 was set to match a false discovery rate (FDR) of <1%. The resulting pep.xml and prot.xml files were used as input for the spectral counting software tool Abacus to calculate spectral counts and NSAF values [66].
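For orientation, the normalized spectral abundance factor (NSAF) divides each protein's spectral count by its length and then normalizes by the sum of these ratios over all proteins in the experiment. The sketch below shows only this basic definition; it does not reproduce the adjusted NSAF computed by Abacus, which additionally accounts for shared peptides. The function name, protein labels, counts and lengths are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Basic NSAF: (SpC / L) for each protein divided by the sum over all proteins."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        return {p: 0.0 for p in saf}
    return {p: value / total for p, value in saf.items()}


# Hypothetical single purification: spectral counts and protein lengths (residues).
counts = {"BAIT": 120, "PREY_A": 30, "PREY_B": 6}
lengths = {"BAIT": 400, "PREY_A": 300, "PREY_B": 600}
values = nsaf(counts, lengths)

# NSAF relative to the bait, as reported in the supplementary bait-prey table.
relative_to_bait = {p: v / values["BAIT"] for p, v in values.items()}
```

Because the values sum to one within an experiment, NSAF allows abundances of co-purified proteins to be compared across runs with different total spectral counts.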

Evaluation of high confidence interacting proteins (HCIP)

Adjusted NSAF values of identified co-purified proteins were compared to a mock AP control dataset consisting of 62 StrepHA-GFP and 12 StrepHA-RFP-NLS purification experiments. The protein abundance in the control dataset was estimated by averaging the 10 highest NSAF values per protein among all 74 measurements. Candidate interactions passed this filter if their enrichment over the control dataset was >10. Adjusted NSAF values were also used to calculate WDN-scores of all the potential interactions [18]. A simulated data matrix was used to calculate the WD-score threshold below which 98% of the simulated data falls. All raw WD-scores were normalized to this value. To gain sensitivity and incorporate network topology as a filtering criterion, the distance between all interactions and the closest high confidence interactions (which passed both filtering steps) was determined with the software tool MultiExperiment Viewer [67] (http://www.tm4.org/mev/). Interactions that were below the thresholds for control ratio and WDN-score could be rescued if the distance to a high confidence interaction was greater than zero. The final filtered dataset contained high confidence interacting proteins (HCIPs) and corresponding protein-protein interactions.
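The two quantitative filters can be sketched as follows, assuming that WD-scores have already been computed elsewhere and that the control list holds one NSAF value per control run (zero where the protein was not detected). The function names, the nearest-rank percentile and the example numbers are illustrative assumptions; the topology-based rescue step is not modeled.

```python
import math


def control_abundance(control_nsafs, top_n=10):
    """Average of the top_n highest control NSAF values for one protein."""
    top = sorted(control_nsafs, reverse=True)[:top_n]
    return sum(top) / len(top) if top else 0.0


def control_ratio(sample_nsaf, control_nsafs, top_n=10):
    """Enrichment of a prey over its estimated background abundance."""
    background = control_abundance(control_nsafs, top_n)
    return math.inf if background == 0 else sample_nsaf / background


def wd_threshold(simulated_scores, percent=98):
    """Score below or at which `percent` of the simulated scores fall (nearest rank)."""
    ordered = sorted(simulated_scores)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


def passes_filters(sample_nsaf, control_nsafs, wd_score, threshold):
    """True if both the control ratio (>10) and the normalized WD-score (>1) pass."""
    wdn_score = wd_score / threshold
    return control_ratio(sample_nsaf, control_nsafs) > 10 and wdn_score > 1.0


# Hypothetical prey detected in only 4 of the 74 control purifications.
controls = [0.0] * 70 + [0.001, 0.002, 0.003, 0.004]
threshold = wd_threshold(list(range(1, 101)))
strong = passes_filters(0.05, controls, wd_score=120.0, threshold=threshold)
weak = passes_filters(0.005, controls, wd_score=120.0, threshold=threshold)
```

Averaging only the ten highest control values gives a conservative background estimate for proteins that contaminate just a few control runs, so such sporadic contaminants are not underestimated by the many zero measurements.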


Label-free quantification

In cases where label-free quantification was performed, the average of the three most intense MS1 features was used to estimate protein abundances, using the commercial software tool Progenesis LC-MS (Nonlinear USA Inc.). Abundance values were normalized to total ion current (TIC) and bait protein intensity.
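The abundance estimate described above can be written out as a small sketch. It mirrors only the stated idea (mean of the three most intense features per protein, then scaling to the bait) and is not the Progenesis LC-MS implementation; TIC normalization is omitted, and the function names and intensities are hypothetical.

```python
def top3_abundance(feature_intensities):
    """Mean of the three most intense MS1 features assigned to one protein."""
    top = sorted(feature_intensities, reverse=True)[:3]
    return sum(top) / len(top) if top else 0.0


def normalize_to_bait(abundances, bait):
    """Express every protein abundance relative to the bait in the same run."""
    bait_abundance = abundances[bait]
    return {protein: value / bait_abundance for protein, value in abundances.items()}


# Hypothetical feature intensities from one purification.
features = {
    "BAIT": [9.0e6, 6.0e6, 3.0e6, 1.0e6],
    "PREY_A": [3.0e6, 2.0e6, 1.0e6],
}
abundances = {protein: top3_abundance(values) for protein, values in features.items()}
relative = normalize_to_bait(abundances, "BAIT")
```

Scaling to the bait intensity compensates for differences in purification yield, so changes in prey abundance between conditions reflect changes in complex composition rather than in the amount of bait recovered.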

Network visualization and accessed public protein interaction databases

Protein interaction data were visualized with Cytoscape 2.8.6 (http://www.cytoscape.org) [68]. Known interactions were obtained from the protein interaction network analysis platform PINA (http://cbg.garvan.unsw.edu.au/pina/) [69], using UniProt bait protein accessions as starting nodes. PINA is based on the source databases IntAct (http://www.ebi.ac.uk/intact/; release Oct. 2012), BioGRID (http://thebiogrid.org/), MINT (http://mint.bio.uniroma2.it/mint/), DIP (http://dip.doe-mbi.ucla.edu/dip/), HPRD (http://www.hprd.org/), and MIPS/MPact (http://mips.helmholtz-muenchen.de/genre/proj/mpact).

Dual luciferase reporter assay

Stable Flp-In HEK-293 YAP1 cells were grown to 50% confluency in a six-well plate format. For optimal firefly signal modulation, YAP1 expression was induced with 500 ng/ml doxycycline for six hours [70] prior to transfection. The cells were transfected with 80 ng pGL3-4xGTIIC-49 (TEAD binding sites fused to firefly luciferase), 0.3 ng pRL-CMV (Renilla luciferase; Promega) and 100 ng of pDEST40 (Invitrogen) containing Hpo network components, using the transfection reagent Fugene6 (Promega). The transfected cells were kept under doxycycline induction for the next 24 h. DMEM and FBS were removed and the cells were washed with PBS. Cell lysis and the Dual-Luciferase Reporter Assay (DLR; Promega) were performed according to the manufacturer’s instructions. The pDEST40 destination vectors were generated by LR reaction with Gateway compatible entry constructs. As negative controls, either a firefly luciferase vector lacking TEAD binding sites (pGL-49) was used or cells were not induced with doxycycline. The luciferase signals were measured with a Synergy HT Multi-Mode microplate reader (BioTek). All obtained values were normalized to overexpressed GFP.
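The readout of such an assay is commonly expressed as the firefly signal divided by the Renilla signal, then scaled to a reference condition. The following sketch assumes that "normalized to overexpressed GFP" means division by the firefly/Renilla ratio of cells overexpressing GFP; the function name and the luminescence values are hypothetical.

```python
def relative_reporter_activity(firefly, renilla, gfp_firefly, gfp_renilla):
    """Firefly/Renilla ratio of a sample divided by the ratio of the GFP control."""
    sample_ratio = firefly / renilla
    control_ratio = gfp_firefly / gfp_renilla
    return sample_ratio / control_ratio


# Hypothetical luminescence readings (arbitrary units).
activity = relative_reporter_activity(
    firefly=8000.0, renilla=400.0, gfp_firefly=4000.0, gfp_renilla=400.0
)
```

Dividing by the Renilla signal corrects for differences in transfection efficiency and cell number, so that a value above one indicates higher TEAD reporter activity than in the GFP control.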


2.6. References

1. Harvey, K.F., X. Zhang, and D.M. Thomas, The Hippo pathway and human cancer. Nat Rev Cancer, 2013. 13(4): p. 246-57.
2. Zhou, D., et al., Mst1 and Mst2 maintain hepatocyte quiescence and suppress hepatocellular carcinoma development through inactivation of the Yap1 oncogene. Cancer Cell, 2009. 16(5): p. 425-38.
3. Lee, K.P., et al., The Hippo-Salvador pathway restrains hepatic oval cell proliferation, liver size, and liver tumorigenesis. Proc Natl Acad Sci U S A, 2010. 107(18): p. 8248-53.
4. Lu, L., et al., Hippo signaling is a potent in vivo growth and tumor suppressor pathway in the mammalian liver. Proc Natl Acad Sci U S A, 2010. 107(4): p. 1437-42.
5. Wu, S., et al., The TEAD/TEF family protein Scalloped mediates transcriptional output of the Hippo growth-regulatory pathway. Dev Cell, 2008. 14(3): p. 388-98.
6. Lei, Q.Y., et al., TAZ promotes cell proliferation and epithelial-mesenchymal transition and is inhibited by the hippo pathway. Mol Cell Biol, 2008. 28(7): p. 2426-36.
7. Basu, S., et al., Akt phosphorylates the Yes-associated protein, YAP, to induce interaction with 14-3-3 and attenuation of p73-mediated apoptosis. Mol Cell, 2003. 11(1): p. 11-23.
8. Zhao, B., et al., TEAD mediates YAP-dependent gene induction and growth control. Genes Dev, 2008. 22(14): p. 1962-71.
9. Gingras, A.C., et al., Analysis of protein complexes using mass spectrometry. Nat Rev Mol Cell Biol, 2007. 8(8): p. 645-54.
10. Gstaiger, M. and R. Aebersold, Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nat Rev Genet, 2009. 10(9): p. 617-27.
11. Pardo, M. and J.S. Choudhary, Assignment of protein interactions from affinity purification/mass spectrometry data. J Proteome Res, 2012. 11(3): p. 1462-74.
12. Ribeiro, P.S., et al., Combined functional genomic and proteomic approaches identify a PP2A complex as a negative regulator of Hippo signaling. Mol Cell, 2010. 39(4): p. 521-34.
13. Badouel, C., et al., The FERM-domain protein Expanded regulates Hippo pathway activity via direct interactions with the transcriptional activator Yorkie. Dev Cell, 2009. 16(3): p. 411-20.
14. Wang, W., J. Huang, and J. Chen, Angiomotin-like proteins associate with and negatively regulate YAP1. J Biol Chem, 2011. 286(6): p. 4364-70.
15. Zhao, B., et al., Angiomotin is a novel Hippo pathway component that inhibits YAP oncoprotein. Genes Dev, 2011. 25(1): p. 51-63.
16. Sardiu, M.E., et al., Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proc Natl Acad Sci U S A, 2008. 105(5): p. 1454-9.
17. Sowa, M.E., et al., Defining the human deubiquitinating enzyme interaction landscape. Cell, 2009. 138(2): p. 389-403.
18. Behrends, C., et al., Network organization of the human autophagy system. Nature, 2010. 466(7302): p. 68-76.
19. Glatter, T., et al., Modularity and hormone sensitivity of the Drosophila melanogaster insulin receptor/target of rapamycin interaction proteome. Mol Syst Biol, 2011. 7: p. 547.
20. Varjosalo, M., et al., The Protein Interaction Landscape of the Human CMGC Kinase Group. Cell Rep, 2013. 3(4): p. 1306-20.
21. Hamaratoglu, F., et al., The tumour-suppressor genes NF2/Merlin and Expanded act through Hippo signalling to regulate cell proliferation and apoptosis. Nat Cell Biol, 2006. 8(1): p. 27-36.
22. Yu, J., et al., Kibra functions as a tumor suppressor protein that regulates Hippo signaling in conjunction with Merlin and Expanded. Dev Cell, 2010. 18(2): p. 288-99.
23. Paoletti, A.C., et al., Quantitative proteomic analysis of distinct mammalian Mediator complexes using normalized spectral abundance factors. Proc Natl Acad Sci U S A, 2006. 103(50): p. 18928-33.
24. Zybailov, B., et al., Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae. J Proteome Res, 2006. 5(9): p. 2339-47.
25. Polesello, C., et al., The Drosophila RASSF homolog antagonizes the hippo pathway. Curr Biol, 2006. 16(24): p. 2459-65.
26. Schagdarsurengin, U., et al., Frequent epigenetic inactivation of RASSF2 in thyroid cancer and functional consequences. Mol Cancer, 2010. 9: p. 264.
27. Silva, J.C., et al., Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition. Mol Cell Proteomics, 2006. 5(1): p. 144-56.
28. Rinner, O., et al., An integrated mass spectrometric and computational framework for the analysis of protein interaction networks. Nat Biotechnol, 2007. 25(3): p. 345-52.
29. Dallol, A., et al., RASSF1A interacts with microtubule-associated proteins and modulates microtubule dynamics. Cancer Res, 2004. 64(12): p. 4112-6.
30. Song, M.S., et al., The centrosomal protein RAS association domain family protein 1A (RASSF1A)-binding protein 1 regulates mitotic progression by recruiting RASSF1A to spindle poles. J Biol Chem, 2005. 280(5): p. 3920-7.
31. Taylor, L.K., H.C. Wang, and R.L. Erikson, Newly identified stress-responsive protein kinases, Krs-1 and Krs-2. Proc Natl Acad Sci U S A, 1996. 93(19): p. 10099-104.
32. Guo, C., X. Zhang, and G.P. Pfeifer, The tumor suppressor RASSF1A prevents dephosphorylation of the mammalian STE20-like kinases MST1 and MST2. J Biol Chem, 2011. 286(8): p. 6253-61.
33. Goudreault, M., et al., A PP2A Phosphatase High Density Interaction Network Identifies a Novel Striatin-interacting Phosphatase and Kinase Complex Linked to the Cerebral Cavernous Malformation 3 (CCM3) Protein. 2009.
34. Yu, F.X. and K.L. Guan, The Hippo pathway: regulators and regulations. Genes Dev, 2013. 27(4): p. 355-71.
35. Gubb, D., et al., The balance between isoforms of the prickle LIM domain protein is critical for planar polarity in Drosophila imaginal discs. Genes Dev, 1999. 13(17): p. 2315-27.
36. Song, H., et al., Planar cell polarity breaks bilateral symmetry by controlling ciliary positioning. Nature, 2010. 466(7304): p. 378-82.
37. Song, D.H., D.J. Sussman, and D.C. Seldin, Endogenous protein kinase CK2 participates in Wnt signaling in mammary epithelial cells. J Biol Chem, 2000. 275(31): p. 23790-7.
38. Bernatik, O., et al., Sequential activation and inactivation of Dishevelled in the Wnt/beta-catenin pathway by casein kinases. J Biol Chem, 2011. 286(12): p. 10396-410.
39. Petronczki, M. and J.A. Knoblich, DmPAR-6 directs epithelial polarity and asymmetric cell division of neuroblasts in Drosophila. Nat Cell Biol, 2001. 3(1): p. 43-9.
40. Traweger, A., et al., Protein phosphatase 1 regulates the phosphorylation state of the polarity scaffold Par-3. Proc Natl Acad Sci U S A, 2008. 105(30): p. 10402-7.
41. Bilder, D., M. Li, and N. Perrimon, Cooperative regulation of cell polarity and growth by Drosophila tumor suppressors. Science, 2000. 289(5476): p. 113-6.
42. Doggett, K., et al., Loss of the Drosophila cell polarity regulator Scribbled promotes epithelial tissue overgrowth and cooperation with oncogenic Ras-Raf through impaired Hippo pathway signaling. BMC Dev Biol, 2011. 11: p. 57.
43. Jiang, Y., et al., Sds22/PP1 links epithelial integrity and tumor suppression via regulation of myosin II and JNK signaling. Oncogene, 2011. 30(29): p. 3248-60.
44. Lee, J.H., et al., Identification and characterization of a novel human PP1 phosphatase complex. J Biol Chem, 2010. 285(32): p. 24466-76.
45. Djouder, N., et al., S6K1-mediated disassembly of mitochondrial URI/PP1gamma complexes activates a negative feedback program that counters S6K1 survival signaling. Mol Cell, 2007. 28(1): p. 28-40.
46. Ota, M. and H. Sasaki, Mammalian Tead proteins regulate cell proliferation and contact inhibition as transcriptional mediators of Hippo signaling. Development, 2008. 135(24): p. 4059-69.
47. Lamar, J.M., et al., The Hippo pathway target, YAP, promotes metastasis through its TEAD-interaction domain. Proc Natl Acad Sci U S A, 2012. 109(37): p. E2441-50.
48. Huang, J., et al., The Hippo signaling pathway coordinately regulates cell proliferation and apoptosis by inactivating Yorkie, the Drosophila Homolog of YAP. Cell, 2005. 122(3): p. 421-34.
49. Varelas, X., et al., The Crumbs complex couples cell density sensing to Hippo-dependent control of the TGF-beta-SMAD pathway. Dev Cell, 2010. 19(6): p. 831-44.
50. Dupont, S., et al., Role of YAP/TAZ in mechanotransduction. Nature, 2011. 474(7350): p. 179-83.
51. Wada, K., et al., Hippo pathway regulation by cell morphology and stress fibers. Development, 2011. 138(18): p. 3907-14.
52. Chen, C.L., et al., The apical-basal cell polarity determinant Crumbs regulates Hippo signaling in Drosophila. Proc Natl Acad Sci U S A, 2010. 107(36): p. 15810-5.
53. Webb, C., et al., Structural features and ligand binding properties of tandem WW domains from YAP and TAZ, nuclear effectors of the Hippo pathway. Biochemistry, 2011. 50(16): p. 3300-9.
54. Genevet, A. and N. Tapon, The Hippo pathway and apico-basal cell polarity. Biochem J, 2011. 436(2): p. 213-24.
55. Bossuyt, W., et al., An evolutionary shift in the regulation of the Hippo pathway between mice and flies. Oncogene, 2013.
56. Feng, W., et al., The tetrameric L27 domain complex as an organization platform for supramolecular assemblies. Nat Struct Mol Biol, 2004. 11(5): p. 475-80.
57. Chishti, A.H., et al., The FERM domain: a unique module involved in the linkage of cytoplasmic proteins to the membrane. Trends Biochem Sci, 1998. 23(8): p. 281-2.
58. Dube, N., et al., The RapGEF PDZ-GEF2 is required for maturation of cell-cell junctions. Cell Signal, 2008. 20(9): p. 1608-15.
59. Ulbricht, A., et al., Cellular mechanotransduction relies on tension-induced and chaperone-assisted autophagy. Curr Biol, 2013. 23(5): p. 430-5.
60. Schlegelmilch, K., et al., Yap1 acts downstream of alpha-catenin to control epidermal proliferation. Cell, 2011. 144(5): p. 782-95.
61. Piccolo, S. and M. Cordenonsi, Regulation of YAP and TAZ by Epithelial Plasticity. 2013: p. 89-113.
62. Ling, C., et al., The apical transmembrane protein Crumbs functions as a tumor suppressor that regulates Hippo signaling by binding to Expanded. Proc Natl Acad Sci U S A, 2010. 107(23): p. 10532-7.
63. Glatter, T., et al., An integrated workflow for charting the human interaction proteome: insights into the PP2A system. Mol Syst Biol, 2009. 5: p. 237.
64. Craig, R. and R.C. Beavis, TANDEM: matching proteins with tandem mass spectra. Bioinformatics, 2004. 20(9): p. 1466-7.
65. Deutsch, E.W., et al., A guided tour of the Trans-Proteomic Pipeline. Proteomics, 2010. 10(6): p. 1150-9.
66. Fermin, D., et al., Abacus: a computational tool for extracting and pre-processing spectral count data for label-free quantitative proteomic analysis. Proteomics, 2011. 11(7): p. 1340-5.
67. Saeed, A.I., et al., TM4: a free, open-source system for microarray data management and analysis. Biotechniques, 2003. 34(2): p. 374-8.
68. Shannon, P., et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 2003. 13(11): p. 2498-504.
69. Wu, J., et al., Integrated network analysis platform for protein-protein interactions. Nat Methods, 2009. 6(1): p. 75-7.
70. Liu, X., et al., PTPN14 interacts with and negatively regulates the oncogenic function of YAP. Oncogene, 2013. 32(10): p. 1266-73.


2.7. Supplementary Materials

2.7.1. Supplementary tables

All supplementary tables are available on the attached CD.

Supplementary Table S2.1: List of all bait proteins used in this study.

Supplementary Table S2.2: Bait-prey list of all protein-protein interactions identified in this study. HCIP: qualifier for high confidence protein interactions, which differentiates between yes, no and bait. Avg Unique Peptides: the average number of proteotypic peptides from replicate experiments. Avg Spectral counts: the average number of spectral counts from replicate experiments. Avg relative bait NSAF: normalized spectral count values (NSAF) expressed relative to the corresponding bait protein. Control NSAF ratio: obtained NSAF values were compared to NSAF values from control experiments and expressed as enrichment over control. The threshold was empirically set to >10 for true interactions. WDN-Score: score to discriminate true from unspecific interactions, based on prey frequency, abundance and reproducibility. Scores above 1.00 are considered bona fide interactors. Distance (*): represents the distance to the closest neighboring bait with a high confidence interaction to the same prey protein. If an interaction does not fulfill the required filtering conditions, it can be rescued by an assigned distance of >0. This loosens the stringent filtering conditions and increases sensitivity.

Supplementary Table S2.3: List of extracted protein-protein interactions from public databases. All extracted interactions are represented as a bait-prey list and contain relevant annotations to access the source database entry or corresponding publications (by PubMed identifier).


2.7.2. Supplementary figures

Supplementary Figure S2.1: Distributions of Publicly Available Protein Interaction Data. Publicly annotated protein interactions were extracted from the protein interaction network analysis platform PINA, using the Hpo bait proteins as seeds. (A) Distribution of the number of independent mentions (by PubMed identifier) across all extracted interactions. A large majority (84.6%) was reported in a single publication. (B) Enrichment of publication number in observed and not observed public interactions. All interactions in this study that match the PINA dataset were compared to PINA-exclusive interactions with respect to the corresponding number of independent publications. There was a threefold enrichment towards the unobserved interactions when probed in the subset with more than one (2+) independent mentions per interaction. (C) Distribution of detection methods for public interactions. A total of 20 different publications are needed to cover the overlap of 137 interactions with this study.


Supplementary Figure S2.2: Domain Distributions and Enrichment of Identified Hpo Pathway Proteins. Annotated domains for all network components were extracted from InterPro. (A) Clustered domains were represented in an experiment-domain matrix and bait dendrograms revealed the same modularity as established by AP-MS interactions. (B) Enrichment of a subset of prominent domains found in the network.


Supplementary Figure S2.3: MST2 Binds to STRIPAK Components When Cells Are Treated with Okadaic Acid. Abundance changes of interacting proteins of MST2 upon okadaic acid stimulation. HEK293 cells expressing Strep-HA tagged MST2 were treated with 100 nM okadaic acid (OA) for 2 h. Similar to MST1, the purified MST2 complexes contained an increased amount of STRIPAK associated proteins, whereas SARAH module components only showed marginal changes.


Supplementary Figure S2.4: Overexpression of Network Components for Dual Luciferase Assay. Western blots of overexpressed Hpo interaction network components. Anti-V5 antibody was used at 1:5000 dilution and detected with anti-mouse-HRP. (*) longer exposure time.


Supplementary Figure S2.5: The E41L3 Interactome. E41L3 had the most extensive interactome of this study (62 HCIPs). Aside from the interactions with KIBRA and proteins shared with RASF9 and RASF10, public protein interaction data revealed associations with several other protein complexes, such as the exon junction complex, the methylosome and the AP-3 complex.


Chapter 3 Characterization of the Human Polycomb Group Interaction Proteome

Simon Hauri, Federico Comoglio, Makiko Seimiya, Timo Glatter, Ruedi Aebersold, Renato Paro, Matthias Gstaiger, and Christian Beisel

Manuscript under revision

Contribution by SH: design and performance of the LC-MS/MS measurements, protein purification experiments, data analysis and manuscript writing.


3.1. Abstract

The establishment and maintenance of epigenetic gene silencing is fundamental to cell determination and function. An essential epigenetic regulatory system involved in heritable repression of gene activity is represented by the Polycomb group (PcG) proteins. PcG proteins are required for maintaining both the pluripotency of stem cells and the identity of differentiated cells. Mutations in PcG genes lead to major developmental defects and are implicated in various forms of cancer. The PcG gene family encodes a highly conserved and functionally heterogeneous group of proteins that act in multimeric protein complexes. Understanding epigenetic control by PcG proteins depends on comprehensive and robust information on the topology of the underlying protein interaction network. Here we present a comprehensive interaction network containing all conserved human PcG proteins. By a systematic affinity purification mass spectrometry approach, we identified a network at unprecedented resolution covering 1400 interactions and 490 network components. Modular decomposition of the obtained interaction data together with functional analysis suggests a system of established and novel concurrent PcG complexes shaping the human epigenetic landscape.


3.2. Introduction

Polycomb group (PcG) proteins constitute a heterogeneous class of evolutionarily conserved repressive chromatin modifiers that regulate a variety of biological processes including cell differentiation, tissue regeneration and cancer cell growth [1-3]. They are engaged in the control of hundreds of genes involved in most major signaling pathways [4]. PcG proteins act as multimeric protein complexes. The most intensely investigated PcG complexes are the Polycomb Repressive Complexes PRC1 and PRC2, which contain different enzymatic activities [5, 6]. In Drosophila melanogaster, PRC1 is built from the four core subunits Pc, Psc, Ph and Sce/dRing [7, 8]. Mammalian cells express several homologs of each of these proteins (Figure 3.1A), which seem to assemble combinatorially [9, 10]. The six homologs of Psc, PCGF1-6, purify together with RING2 in different complexes with optional sets of additional components [10, 11]. RING2 mediates the mono-ubiquitination of histone H2A on lysine 119 (H2AK119Ub1), a repressive mark that blocks RNA polymerase II activity [12-14]. The incorporation of RING2 in optional PCGF complexes can lead to differential regulation of its enzymatic activity and recruitment to distinct target genes [10, 15-18]. Mammalian CBX proteins, homologs of Drosophila Pc, are present in the canonical PRC1 complex. These proteins contain a chromo-domain, which can bind lysine-methylated histone H3 tails, thus providing a means for recruitment to specific chromatin sites [19-22]. PRC2 trimethylates lysine 27 of histone H3 (H3K27me3) [19-21]. Establishment of PcG-mediated silencing can therefore be explained by a hierarchical recruitment model whereby PRC2 is targeted to gene promoters by transcription factors and sets the H3K27me3 mark [23]. Subsequently, the CBX protein in PRC1 binds H3K27me3 and mediates the recruitment of RING2, which in turn sets the H2AK119Ub1 mark. The PRC2 core complex contains EED, SUZ12 and the H3K27-specific histone methyltransferase EZH2 [6]. The enzymatic activity of PRC2 seems to be modulated by accessory components such as the three Polycomb-like homologs PHF1, PHF19 and MTF2 [24, 25]. The recruitment of PRC2 might be mediated by the additional interaction partners JARD2 and AEBP2 via their DNA binding motifs [26-30]. Similarly, the transcription factor REST can interact with PRC2 and target the complex to distinct genes [31, 32]. Whether the PRC2 core interacts with all of these components simultaneously or optionally has remained elusive to date.


In non-canonical PRC1 complexes the CBX component seems to be replaced by RYBP, which does not contain a chromo-domain [10]. Hence, recruitment of RYBP-containing PRC1 complexes seems independent of H3K27 methylation. Indeed, recent work showed that the histone demethylase Kdm2b, a subunit of the PCGF1/RYBP-containing PRC1, mediates the targeting of the complex via direct binding to unmethylated CpG island promoters [15, 17, 33]. Overall, the assembly of PcG complexes is highly dynamic, which reflects the functional diversity required to regulate a large number of diverse target genes [34]. To date, our knowledge of the assembly of PcG complexes in mammalian cells is based on many studies using different cell systems as protein sources and various biochemical chromatographic procedures for complex analysis. Here we report the systematic and comprehensive profiling of the PcG protein interactome in a single human cell line. We applied a robust double-affinity purification strategy coupled with mass spectrometry (AP-MS) and subsequent quantitative data analysis. This led to a considerable refinement of the human PRC1 and PRC2 network topology, including their relation with the heterochromatin protein system. Furthermore, we investigated the composition and assembly of PR-DUB (Polycomb repressive deubiquitinase complex), the third major Drosophila PcG complex, which has not been investigated in the human system before [35]. We identified human PR-DUB as a highly diverse complex containing MBD proteins, FOXK transcription factors and, most interestingly, OGT1, an O-linked N-acetylglucosamine (O-GlcNAc) transferase which has been linked to PcG silencing in Drosophila [36]. Chromatin profiling of PR-DUB components confirmed their co-localization at distinct target regions, indicating that PRC1 and PR-DUB regulate different sets of genes in the human system.

3.3. Results

3.3.1. Systematic profiling of the human PcG interaction proteome

To investigate the properties of the human Polycomb group (PcG) protein interaction network, we devised a systematic proteomics approach, based on our previously reported AP-MS protocol in HEK293 cells [37]. We selected 28 PcG proteins according to their homology to Drosophila core complexes (“primary baits”). Based on the interaction data obtained from the primary bait dataset, we chose 36 additional “secondary bait” proteins (Figure 3.1A, Supplementary Table S3.1). We used stable cell lines expressing the baits as Strep-HA tag fusion proteins to identify bait-associated proteins by liquid chromatography tandem mass spectrometry (LC-MS/MS) (Figure 3.1B). For each bait we measured at least two biological replicates, corresponding to a total of 174 AP-MS experiments. Our protein identification workflow relied on X!Tandem to match mass spectra to peptides, and the Trans-Proteomic Pipeline (TPP) to assign peptides to proteins at a false discovery rate of <1% [38, 39]. The obtained raw dataset contained 930 proteins (“preys”) with 9856 candidate interactions (Supplementary Table S2). To ensure efficient discrimination of interaction partners from contaminant proteins, we calculated the WDN-score [40] and estimated the relative abundance of prey proteins compared to control purification experiments. The resulting filtered dataset of 490 high confidence interacting proteins (HCIPs) corresponding to 1400 interactions was subjected to a hierarchical cluster analysis and the obtained topology was visualized as a protein-protein interaction network (Figure 3.1C). On average we found 21.9 HCIPs per bait, and overall 64.5% of our interactions had not been annotated in public databases at the time of submission (Figure 3.1D). The two baits with the most interactions (SKP1 and WDR5) demonstrate the high quality and sensitivity of our approach (Supplementary Figures S3.1 and S3.2). The overall recall rate of all known interactions was 17%. Since public protein interaction data are compiled from heterogeneous experimental sources, their false discovery rate is largely unknown. We used the number of independent reports (PubMed) per interaction as an estimate for data robustness (Supplementary Table S3.3). The recall rate of our dataset increased to 43% when we set the minimal number of reports to two or more (Supplementary Figure S3.3). The fraction of robust public interactions (at least two references) among those that match the PcG AP-MS dataset is four times higher than among the public interactions that do not match (Figure 3.1E).
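The recall calculation described above can be sketched as a set operation: the share of public interactions, optionally restricted to those with a minimum number of independent reports, that also appear in the AP-MS dataset. The data structures, the function name and the toy interactions below are hypothetical and are not taken from the supplementary tables.

```python
def recall(observed, public_reports, min_reports=1):
    """Fraction of public interactions with >= min_reports that were observed.

    observed: set of frozenset({bait, prey}) pairs from the AP-MS dataset.
    public_reports: dict mapping frozenset pairs to the number of
    independent publications (PubMed identifiers) reporting them.
    """
    robust = {pair for pair, n in public_reports.items() if n >= min_reports}
    if not robust:
        return 0.0
    return len(robust & observed) / len(robust)


def pair(a, b):
    """Undirected interaction, so bait-prey orientation does not matter."""
    return frozenset((a, b))


# Toy data with made-up protein names.
public_reports = {pair("A", "B"): 3, pair("A", "C"): 1,
                  pair("A", "D"): 1, pair("A", "E"): 2}
observed = {pair("B", "A"), pair("A", "F")}

recall_all = recall(observed, public_reports)                    # 1 of 4
recall_robust = recall(observed, public_reports, min_reports=2)  # 1 of 2
```

As in the text, restricting the reference set to interactions with at least two independent reports raises the recall in this toy example, because singly reported interactions are the ones most often missing from the observed set.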


Figure 3.1: A systematic approach to characterize human Polycomb group (PcG) protein complex assemblies. (A) Based on homology to Drosophila PcG complexes, 28 bait proteins were selected for affinity purification and subsequent mass spectrometry analysis. Secondary bait proteins were selected based on interactions with the first set. (B) Altogether, 64 bait proteins were cloned into an expression vector containing a tetracycline-inducible CMV promoter, a Strep-HA fusion tag, and FRT sites. Isogenic HEK293-Flp T-rex cell lines were generated by homologous Flp recombination. After positive selection, the cells were expanded and induced with doxycycline for 24 h. Protein complexes from 10^8 cells were Strep-Tag purified under native conditions and digested with trypsin. Peptides were identified by tandem mass spectrometry. (C) At least two biological replicates per bait were analyzed (174 AP-MS experiments). Spectra were searched with X!Tandem and validated by the Trans-Proteomic Pipeline (TPP) to a protein identification false discovery rate of <1%. Nonspecifically binding proteins were filtered by the calculated WDN-score and by quantitative comparison to control purification experiments. The resulting high-confidence interactions were hierarchically clustered to infer protein complex compositions. The obtained protein interaction data of 490 high-confidence interacting proteins (HCIPs) and 1400 interactions were visualized as network graphs. (D) Overview of all bait proteins and identified HCIPs. Overall, 64.5% of the defined interactions have not been described in public literature databases. (E) Robustness of AP-MS HCIPs compared to public interaction data. The public literature interactions were split into two groups: those that were observed in the AP-MS dataset and those that were not. Of the AP-MS interactions in this study that are annotated in public databases, 58.3% have been reported in more than one (>1) independent publication. Compared to interactions exclusively found in public data (15.4%), this corresponds to a 3.8-fold enrichment in data robustness.
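The fold enrichment quoted in panel (E) follows directly from the two percentages given in the caption. As a quick arithmetic check (the variable names are illustrative and not taken from the original analysis):

```python
# Percentage of literature-annotated interactions reported in more than one
# independent publication, as quoted in the Figure 3.1E caption.
pct_also_in_apms = 58.3   # interactions also observed in this AP-MS dataset
pct_public_only = 15.4    # interactions found exclusively in public data

fold_enrichment = pct_also_in_apms / pct_public_only
print(f"fold enrichment: {fold_enrichment:.1f}")
```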

3.3.2. Hierarchical clustering assigns HCIPs to PcG complexes

To determine the topology of our protein interaction network and to assign potential protein complexes, we performed hierarchical clustering of the HCIPs using a rank-based correlation dissimilarity measure. This analysis revealed a modular organization of our interaction network built upon the three major PcG assemblies PRC1, PRC2 and PR-DUB, but also HP1-associated complexes (Figure 3.2A and 3.2B). PRC1 represents the most elaborate and heterogeneous assembly. It contains four complexes, which are defined by six PCGF proteins: PRC1.1 (PCGF1), PRC1.2/4 (PCGF2/4), PRC1.3/5 (PCGF3/5) and PRC1.6 (PCGF6). All four PRC1 assemblies share a common core of the E3 ubiquitin ligases RING1 and RING2 and, with the exception of PRC1.2/4, also RYBP and YAF2. PRC1.6 furthermore provides links to the heterochromatin system via the HP1 chromobox proteins CBX1 and CBX3. Separate from PRC1 and HP1, we found PRC2. The two characteristic histone-binding proteins RBBP4 and RBBP7 form the PRC2 core together with SUZ12, EED and EZH1/2, but also take part in other protein complexes such as LINC, NURF, NURD and SIN3 (Supplementary Figure S3.4). Finally, we found the PcG complex PR-DUB, defined by ASXL1/2 and BAP1. Apart from PcG and HP1 assemblies, the clustering also revealed complexes composed mostly of prey proteins, such as the TCP chaperonin and the proteasomal lid. Of note are several proteins of MLL complexes, which share interactions with PRC1.3/5 (CSK21/22), PRC1.6 (WDR5) and PR-DUB (OGT1). Taken together, the high resolution and reciprocity of our data allow a representation of clearly defined modules that correlate with established and novel human PcG complexes.
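To illustrate what clustering with a rank-based correlation dissimilarity involves, the following minimal sketch may help. It rests on assumptions that are not stated in the text: Spearman correlation is taken as the rank-based measure, average linkage as the merging rule, and the prey profiles are invented toy values (hypothetical abundances of each prey across four bait purifications). It is not the pipeline used for Figure 3.2.

```python
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_dissimilarity(x, y):
    """1 - Spearman correlation (assumed measure): 0 = same ranking, 2 = reversed."""
    return 1.0 - pearson(ranks(x), ranks(y))

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def dist(a, b):
        pairs = [(p, q) for p in a for q in b]
        total = sum(rank_dissimilarity(profiles[p], profiles[q]) for p, q in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda ab: dist(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return sorted(sorted(c) for c in clusters)

# Invented toy profiles (hypothetical, for illustration only).
profiles = {
    "prey_A": [10, 8, 0, 1],
    "prey_B": [9, 7, 1, 0],
    "prey_C": [0, 1, 12, 9],
    "prey_D": [1, 0, 10, 11],
}
clusters = average_linkage(profiles, 2)
print(clusters)
```

Because only the ranks enter the measure, two preys are grouped when they are enriched in the same purifications, regardless of the absolute signal each bait yields.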

Figure 3.2: Hierarchical cluster analysis and network graph assign HCIPs to PcG complexes. (A) Hierarchical clustering based on a rank-based correlation dissimilarity measure was applied to the HCIP dataset and revealed a modular organization of the PcG interaction proteome. The clusters correlate with the general perspective of the PcG protein complex assemblies PRC1, PRC2 and PR-DUB. PRC1 was found as four individual clusters that represent PRC1 sub-complexes, which are defined by different paralogs of PCGF proteins and share a common core of RING ubiquitin ligases and RYBP or YAF2 proteins. Other clusters were annotated as being involved in chromatin regulation: the heterochromatin system around the HP1 chromobox proteins CBX1/3/5; the MLL/SET methyltransferase complexes, TCP complex and proteasome; LMBL1, INO80, LCOR and LINC/NURF/NURD/SIN3. The two apparently largest clusters, SKP1 and WDR5, represent the single AP-MS experiments with the largest numbers of HCIPs that do not connect to the rest of the interactome. (B) Network representation of clustered interaction data. The blue lines represent interactions between proteins found in the same cluster and emphasize the annotated protein complex assemblies. Enlarged hexagon-shaped nodes are the bait proteins used in this study.

3.3.3. The PRC1 module

The PRC1 module is centered on RING1 and RING2. Using these two proteins as baits, we identified the same set of 34 preys. Purifications of the two closely related proteins YAF2 and RYBP yielded 26 of the 34 RING1/2 preys, indicating that they are exclusively located in the RING/PRC1 module of the PcG interaction network (Figure 3.2A, Supplementary Figure S3.6, and Supplementary Table S3.2). The only interaction partners not shared between RING1/2 and RYBP/YAF2 are the five Polycomb CBX and the three PHC proteins, which along with PCGF2/4 and RING1/2 define the canonical PRC1 complex (Supplementary Figure S3.6A). Of note, we did not detect interactions among members of the same PcG protein family. Each complex contains only one RING (out of 2), one PCGF (out of 6), one Polycomb CBX (out of 5) and, in non-canonical PRC1 complexes, either RYBP or YAF2. Our data revealed the combinatorial assembly of these different protein groups, leading to the formation of 80 different protein complexes within the PRC1 module, which fall into 4 different classes defined by their PCGF subunit together with additional stably associated proteins (Supplementary Figure S3.6B). The classification into four classes is also mirrored in the PCGF1-6 phylogenetic tree generated by ClustalW alignment of their amino acid sequences. Indeed, the most distantly related PCGF1 and PCGF6 as well as the more closely related pairs PCGF2/4 and PCGF3/5 assemble in separate complexes (Supplementary Figure S3.6C).

Using PCGF1, RING1/2 and RYBP/YAF2 as baits, we confirmed the previously characterized BCOR complex [41], which also contains KDM2B, the SKP1 ubiquitin E3 ligase and BCORL as an alternative to BCOR (Supplementary Figure S3.6B). In addition to reciprocal interactions with the other PCGF1 complex subunits, our SKP1 purification experiment identified KDM2A and a further 39 F-Box proteins as preys (Supplementary Figure S3.1). Notably, none of these proteins co-purified with RING1/2, RYBP/YAF2 or PCGF1, indicating that SKP1 specifically interacts with the F-Box domain of KDM2B within the BCOR complex, while simultaneously forming alternative complexes.

Analysis of the canonical PRC1 complex components identified the anticipated topology assembled around CBX2/4/6/7/8, PHC1-3, PCGF2/4 and RING1/2 proteins. Despite a highly sensitive experimental workflow, we failed to identify additional proteins that stably associate with PRC1 core members. Confirming previous studies, we detected SCML1 as a transient interaction partner, as it was found to interact only with a subset of the core bait proteins (Supplementary Figure S3.6A).

The WD domain-containing protein DCAF7 co-purified with CBX4/6/8, RING1/2, RYBP/YAF2 and PCGF3/5/6, indicating that it is intricately embedded in the PRC1 module. Similarly, recent studies reported interactions of DCAF7 with members of the canonical PRC1 complex as well as PCGF3/5 and 6 [9, 11, 31, 42]. To clarify whether DCAF7 is indeed a universal subunit of several different RING1/2-containing complexes, we performed a DCAF7 purification experiment. We detected reciprocal interactions with all bait proteins that the hierarchical cluster analysis aggregates into a cluster around PCGF3 and 5 (Figure 3.3C and 3.3D), which clearly separates from the other PCGF complexes. Like RING1/2, RYBP/YAF2 and PCGF3/5, DCAF7 co-purified with the tetrameric casein kinase 2, as well as the three paralogous proteins AUTS2, FBRS and FBSL. To further characterize the PCGF3/5 sub-network, we used the catalytic casein kinase subunits CSK21 and CSK22 as baits.
Our results confirmed the topology of the PCGF3/5-DCAF7 assemblies including CSK and the affiliation of three uncharacterized proteins belonging to the AUTS2 family (Figures 3.3C and 3.3D).

The protein PCGF6 was initially purified together with the transcription factors E2F6, MAX, TFDP1, MGAP as well as RING1/2, YAF2, LMBL2, CBX3 and the histone methyltransferases EHMT1 and EHMT2, an assembly denoted as E2F6.com [43]. However, subsequent studies have not been able to confirm the existence of the entire (holo) E2F6.com; only partial subcomplexes could be recovered [10, 11, 44, 45]. Furthermore, recent data suggested that PCGF6 and RING2 may interact with the WD40 domain protein WDR5 [10]. Thus, we decided to revisit the topology of the PCGF6-E2F6 network and investigated WDR5 connectivity by adding MAX, TFDP1, E2F6, LMBL2, CBX3, EHMT2 and WDR5 to our bait collection. The corresponding AP-MS experiments revealed a high-density network including reciprocal interactions between all but one (EHMT2, see below) of the bait proteins within this set (Figure 3.3C). In addition to MGAP, we detected with MAX, TFDP1 and E2F6 a rich collection of transcription factors that can form heterodimeric complexes with these proteins (Figure 3.3E). The absence of further interactions with other bait proteins, however, demonstrated that they are not part of the PCGF6-E2F6 complex.

Recent studies have shown that WDR5 is part of the NSL complex and that it forms a trimeric complex with RBBP5 and ASH2L, which stimulates the H3K4-specific methyltransferase activity of the SET1 HMT family: SET1A, SET1B and MLL1-4 [46-49]. While our AP-MS data confirmed the interactions with the NSL as well as with the SET1 core complexes (Supplementary Figure S3.2), we additionally detected reciprocal interactions of WDR5 with all PCGF6-E2F6 complex subunits. This demonstrates that WDR5 is a universal component of activating and repressing chromatin-modifying complexes.

In contrast to previous studies, which only reported the heterochromatin protein CBX3 as part of E2F6.com, we also detected CBX1 in all PCGF6-E2F6-related pull-down experiments. To confirm this unexpected finding, we used CBX1 and, as a control, the constitutive heterochromatin protein CBX5 as baits. Indeed, we could validate the interactions of CBX1, whereas CBX5 is absent from the PCGF6-E2F6 network. Furthermore, our AP-MS data disconnect EHMT2 and EHMT1 from E2F6.com. However, we validated interactions of EHMT2 with CBX1 and CBX3 and highlighted a potential complex containing CBX1/3, EHMT1/2, zinc finger transcription factors (ZNFs) as well as the KRAB-ZNF-interacting co-repressor protein TIF1B (Supplementary Figure S3.7).

Because of the molecular relation of the PcG and HP1 chromobox proteins (CBX2/4/6/7/8 and CBX1/3/5, respectively) as well as the functional relation of the PcG and the heterochromatin proteins as silencing systems (reviewed in Beisel and Paro, 2011), we further explored the CBX1/3/5 domain of our network, seeking potential connections between these two systems. This survey, in which we included further bait proteins (Figure 3.1A, Supplementary Table S3.2), led to improved fine mapping of CBX1/3/5-containing complexes and the identification of new interacting partners (Supplementary Figure S3.7). Additional links to the PcG systems were not detected, however.
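The notion of a reciprocal interaction used throughout this section (each of two bait proteins recovers the other as prey) can be made concrete with a short sketch. The function name, the placeholder protein identifiers and the example list are hypothetical and serve illustration only; they do not reproduce the filtering applied to the actual dataset.

```python
def reciprocal_pairs(interactions, baits):
    """Return sorted bait pairs in which each protein recovered the other as prey.

    interactions: iterable of (bait, prey) tuples from filtered AP-MS results.
    baits: set of all proteins that were used as baits.
    """
    observed = set(interactions)
    pairs = set()
    for bait, prey in observed:
        # A prey that was never used as bait cannot be tested reciprocally.
        if prey in baits and prey != bait and (prey, bait) in observed:
            pairs.add(tuple(sorted((bait, prey))))
    return sorted(pairs)

# Hypothetical example: BAIT_3 recovers BAIT_1, but not the other way round.
example = [
    ("BAIT_1", "BAIT_2"),
    ("BAIT_2", "BAIT_1"),
    ("BAIT_1", "PREY_X"),
    ("BAIT_2", "PREY_X"),
    ("BAIT_3", "BAIT_1"),
]
confirmed = reciprocal_pairs(example, {"BAIT_1", "BAIT_2", "BAIT_3"})
print(confirmed)
```

In this toy case only the BAIT_1/BAIT_2 pair counts as reciprocal; the one-directional BAIT_3 observation and the shared prey PREY_X do not, which mirrors why adding secondary baits increases the number of interactions that can be confirmed in both directions.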

Figure 3.3: High resolution interaction map of the PRC1 complex assemblies. (A) PRC1 complexes can be split into four sub-complexes that share a core of RING1/2, YAF2 and RYBP. The sub-complexes are defined by PCGF proteins. (B) Mutually exclusive binding of PRC1 core and PCGF proteins. RING1 and RING2 do not interact with each other but bind to YAF2 and RYBP. Likewise, YAF2 and RYBP only interact with RING1/2 but not with each other. PCGF proteins never bind to other PCGF proteins. (C) Excerpt of cluster analysis showing the two PRC1 assemblies PRC1.1 and PRC1.3/5. (D) and (E) Detailed interaction maps of the human PRC1.3/5 and PRC1.6 complexes. Hexagon-shaped nodes represent bait proteins used in this study. Yellow nodes are DNA-binding proteins.

3.3.4. The PRC2 module

The functional, enzymatically active core complex of PRC2 is composed of SUZ12, EED, RBBP4/7 and either EZH1 or EZH2 [6]. Additional accessory proteins, which regulate the H3K27 methyltransferase activity of the complex and its recruitment to chromatin, have also been identified [26-30]. However, the topology of the different PRC2 complexes, and particularly the interrelation of the accessory proteins with the PRC2 core proteins, has remained largely unresolved. Thus, to investigate the topological organization of PRC2 complexes, we performed AP-MS experiments using all reported PRC2-associated proteins. Hierarchical clustering analysis of the data assigned all these bait proteins to a single cluster of highly correlated profiles. This result is also reflected in the high-density interaction network formed by reciprocal interactions between PRC2 components (Figure 3.4A and 3.4B). Based on these reciprocal interactions, our data revealed two fundamental alternative assemblies linked to the PRC2 core: the first defined by AEBP2 and JARD2, and the second by the mutually exclusive binding of one of the three Polycomb-like homologs PHF1, PHF19 and MTF2 (Figure 3.4A). In addition, purification experiments with the PRC2 core members as well as with the Polycomb-like proteins identified two uncharacterized proteins, C10ORF12/LCOR and C17ORF96, as novel PRC2 interactors (Figure 3.4A). A purification experiment with C17ORF96 confirmed all interactions with the Polycomb-like ‘wing’ of PRC2 (Figure 3.4A).

Computational sequence analysis revealed that C17ORF96 is present in all vertebrate genomes. Interestingly, a BLAST search identified a single protein related to C17ORF96 (NCBI blastp; alignment score 77.8; E-value 2e-15; 53% identity), namely the SKI/DAC domain-containing protein 1 (SKDA1). SKDA1 belongs to the DACH (Dachshund) family of proteins, which are defined by the SKI/SNO/DAC domain of about 100 amino acids and which are involved in various aspects of cell proliferation and differentiation [50]. C17ORF96 lacks the SKI/SNO/DAC domain as well as any other annotated protein domain. The homology of C17ORF96 to SKDA1 is restricted to the C-terminal part of both proteins, where they share >50% sequence identity within the last 60 amino acids (Supplementary Figure S3.8A). According to a BLAST search with C17ORF96 amino acids 313-375 as query sequence, SKDA1 and C17ORF96 are the only proteins encoded in the genome that carry this domain.
Most interestingly, we also detected SKDA1 as prey with EZH1 and SUZ12 (Figure 3.4B), suggesting that the C-terminal part mediates the interaction of both proteins with the PRC2 core complex.

Initially, the second new protein we could link to PRC2 gave us ambiguous results, as we measured peptides of two proteins separately annotated in UniProt: LCOR and C10ORF12 (Supplementary Figure S3.8B). Indeed, both proteins are encoded in the same genomic locus, and GenBank contains an entry (ligand-dependent co-repressor, isoform CRA_b, EAW49962.1) in which the N-terminal 111 amino acids of LCOR are fused to C10ORF12, separated by a 200 amino acid spacer (Figure 3.4C). LCOR has previously been identified as a ligand-dependent co-repressor interacting via its N-terminal domain with nuclear hormone receptors in a complex including CTBP and various histone deacetylases [51, 52]. Our mass spectrometric analysis yielded peptides of the LCOR N-terminal part, of C10ORF12 and of the LCOR-CRA_b-specific spacer (Supplementary Figure S3.8A). Peptides of the LCOR C-terminal part were missing, indicating that PRC2 interacts with LCOR-CRA_b, although our AP-MS data could not rule out an additional interaction with C10ORF12 alone. To resolve this, we used LCOR, C10ORF12 and the large isoform LCOR-CRA_b as bait proteins. LCOR purified with the known interaction partners CTBP1 and CTBP2, while peptides of PRC2 components were not detected. In contrast, both LCOR-CRA_b and C10ORF12 confirmed a strong link by reciprocally interacting with all subunits of the Polycomb-like wing of PRC2 (Figure 3.4B and Supplementary Figure S3.8B).

To further support these findings, we made use of a heterologous reporter system based on a stably integrated, constitutively active luciferase reporter gene, which is responsive to upstream GAL4 DNA binding sites (Figure 3.4E) [53]. We engineered cell lines containing tetracycline-inducible GAL4-LCOR and GAL4-C10ORF12 expression constructs, respectively. Upon induction, both proteins localized to the GAL4 motifs close to the promoter of the reporter gene, which resulted in strong repression of luciferase activity (Figures 3.4E and 3.4F). If the repressive activity of C10ORF12 is mediated by recruitment of PRC2, the luciferase promoter should become methylated at H3K27. We tested this hypothesis by chromatin immunoprecipitation with an H3K27me3-specific antibody and checked for the presence of luciferase promoter fragments by subsequent quantitative PCR analysis. Indeed, we detected a significant signal for H3K27me3 upon GAL4-C10ORF12 induction, whereas no H3K27me3 modification was detected in the case of LCOR (Figure 3.4G). This result is substantiated by the fact that we expressed GAL4-LCOR at a higher level than GAL4-C10ORF12, which led to a 10- to 20-fold increased binding of LCOR to the reporter (Figures 3.4E and 3.4F). Together, these data demonstrate that LCOR-CRA_b and C10ORF12 are new accessory proteins specific to the Polycomb-like wing of PRC2, and that the interaction domain is located in C10ORF12 and not in the ligand-responsive N-terminal domain of LCOR.

Figure 3.4: High resolution interaction analysis of PRC2 components determines specific PRC2 classes and reveals LCOR-CRA_b and C10ORF12 as new PRC2 interactors. (A) Detailed interaction map of PRC2 revealed formation of a core of RBBP4/7, SUZ12 and EED together with either EZH1 or EZH2. Reciprocal interactions of all identified PRC2 proteins revealed two alternative complexes binding to the core: AEBP2 together with JARD2, or PHF1, PHF19 and MTF2. Two uncharacterized proteins, C17ORF96 and C10ORF12, were reciprocally linked to all PRC2 proteins except AEBP2 and JARD2 (red lines). (B) Excerpt of cluster analysis for PRC2 components. (C) Schematic representation of alternative protein isoforms of LCOR and C10ORF12, which result from alternative promoter usage and splicing. (D) LCOR as bait does not physically interact with PRC2 components. (E) Transcriptional repression by tetracycline-inducible GAL4-LCOR and GAL4-C10ORF12 on GAL4-binding sites fused to a luciferase reporter. Upon tetracycline induction, relative luciferase activity dropped significantly for both LCOR and C10ORF12 (error bars are standard deviation; values above bars are p-values). (F) Chromatin immunoprecipitation with subsequent quantitative PCR (ChIP-qPCR) analysis revealed that tetracycline-induced GAL4 constructs localize close to the promoter of the reporter gene (transcription start site; TSS). (G) ChIP-qPCR analysis with an H3K27me3 antibody showed increased trimethylation of lysine 27 on histone H3 for tetracycline-induced C10ORF12, but not LCOR.

3.3.5. PR-DUB

The PcG complex PR-DUB was identified in Drosophila as a heterodimer consisting of the de-ubiquitinase Calypso and Asx [35]. To determine the composition of the human PR-DUB complex, we used the corresponding homologs BAP1, ASXL1 and ASXL2 as baits. We found BAP1 reciprocally interacting with both ASXL1 and ASXL2, while there was no interaction between the two ASXL proteins (Figure 3.5A and 3.5B). This argues for the presence of two alternative PR-DUB complexes, which we call PR-DUB1 and PR-DUB2 depending on whether the BAP1 partner is ASXL1 or ASXL2, respectively. Both cores associate with a similar set of accessory proteins: the transcription factors FOXK1 and FOXK2, the chromatin-associated proteins MBD5 and MBD6, the transcriptional co-regulator HCFC1 and, most remarkably, OGT1. The Drosophila OGT1 homolog, Supersexcombs (Sxc), was annotated as a PcG protein, yet its incorporation in a PcG complex remained unresolved. OGT1 is the only O-linked N-acetylglucosamine (O-GlcNAc) transferase encoded in the mammalian genome and adds a single GlcNAc to serine and threonine residues of many cellular proteins [54]. Previously it was shown that OGT1 interacts with BAP1 and that it binds to chromatin with the 5-methylcytosine oxidase TET1 [55, 56]. To validate its interaction with the PR-DUB network, we added OGT1 as a bait protein to our AP-MS workflow. We observed the reciprocal interaction with BAP1 and also purified HCFC1. In addition, OGT1 seems to be part of various other protein complexes involved in transcriptional regulation, which did not co-purify with PR-DUB1/2 core subunits. We confirmed known interactions with TET1 and NCOAT, the O-GlcNAcase that opposes OGT1 [56, 57]. Additionally, our data revealed new interactions with the transcription factors ZEP1/ZEP2 and the arginine-specific HMT CARM1. OGT1 was implicated as a subunit of the WDR5-containing complex NSL and the SET1 HMT family activating complex WDR5/RBBP5/ASH2L, which most likely mediates the detected interaction of OGT1 with MLL1 and SET1A (Figure 3.5B). Although we could not detect interactions of OGT1 with FOXK1/2 and MBD5/6, these proteins co-cluster with PR-DUB core components and OGT1 is highly connected in the corresponding network (Figures 3.5A and 3.5B).

Figure 3.5: Human PR-DUB complexes contain transcription factors and interact with OGT1. (A) Excerpt of hierarchical clustering analysis of PR-DUB associated proteins. (B) Topology of PR-DUB complexes. The three proteins ASXL1, ASXL2 and BAP1 define the classical PR-DUB complex and were found to reciprocally interact with the O-GlcNAc transferase OGT1. Other components that cluster with ASXL1/2 and BAP1 are the protein phosphatase PGAM5, the histone demethylase KDM1B, the methyl-CpG binding domain proteins MBD5 and MBD6, and the forkhead transcription factors FOXK1 and FOXK2. OGT1 interactions with no connections to PR-DUB included predominantly MLL/SET complex-associated proteins shared with WDR5.

A functional interaction of OGT1 with FOXK transcription factors within the same PR-DUB complex would require their co-localization at genomic target sites. To test this, we used chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-seq) to acquire high-resolution genome-wide maps of the corresponding components ASXL1, FOXK1 and OGT1 in HEK293 cells. With an antibody recognizing OGT1, the ChIP enrichments we obtained were not sufficient for subsequent comparative analysis steps. Therefore, we used an O-GlcNAc-specific antibody as a surrogate, assuming that an O-GlcNAc signal indicates the presence of an active OGT1 protein. After mapping the sequence reads to the reference genome, we computed the significantly enriched regions for each of the three ChIPed features using the MACS analysis software [58]. We then calculated the pairwise overlaps and found 41% and 55% of FOXK1 peaks co-localizing with GlcNAc and ASXL1, respectively, while 69% of GlcNAc peaks were co-occupied by ASXL1 (Figure 3.6A). In total, these experiments led to the identification of 2703 sites where all three features co-localize (Figure 3.6A). The annotation of these sites indicated a predominant binding to gene promoters, and the distribution of the read density at RefSeq-annotated genes shows a specific co-localization within +/-250 base pairs of the transcription start sites (TSSs) (Figures 3.6A and 3.6B). Motivated by this result, we computed the enrichments within +/-1 kb windows of 30,038 RefSeq TSSs and performed a pairwise correlation analysis at these promoter regions. Each combination shows a high pairwise correlation coefficient of >0.8, demonstrating that ASXL1, FOXK1 and OGT1 co-occupy many gene promoters, potentially as a functional PR-DUB1 complex (Figure 3.6C). This is also reflected in the similarity of enriched pathway ontologies revealed by gene enrichment analysis of the individual features (Supplementary Figure S3.9A). FOXK1-PR-DUB1 targets are predominantly enriched for genes involved in major cellular processes like cell cycle, mitosis and protein metabolism (Figure 3.6D). In the Drosophila genome, PR-DUB has been reported to overlap with a large proportion of PRC1-bound regions [35].
We sought to investigate this relation in the human genome by comparing our PR-DUB data with previously published ChIP-seq data of RING2 and RYBP [10]. We also included genome-wide mapping data of TIF1B [59]. Thus, we compared six proteins of three major domains represented in our PcG interaction network at the level of chromatin: RING2 and RYBP, which form the central core of the ‘PRC1’ module (Figures 3.2B and 3.3A); TIF1B, which is the common component of zinc finger transcription factor-containing CBX1/3/5 complexes (Supplementary Figure S3.7A); and PR-DUB. First, we performed a pairwise correlation analysis with the individual ChIPed features using their enrichment at RefSeq promoter regions as input. As expected, we observed a high correlation coefficient of 0.78 for RING2 and RYBP (Supplementary Figure S3.9B). In contrast, there is an evident segregation between RING2, RYBP and TIF1B on the one hand and ASXL1, O-GlcNAc and FOXK1 on the other (Figure 3.6E and Supplementary Figure S3.9B). In the same way, when we compared the genome-wide distribution of PR-DUB (2703 ASXL1+GlcNAc+FOXK1 co-occupied regions) with ‘PRC1’ (6816 RING2+RYBP peaks) and TIF1B (10297 peaks), we observed only a partial overlap of target sites: 24% and 31% of PR-DUB regions co-localize with PRC1 and TIF1B, respectively, with 336 regions occupied by all three (Figure 3.6F). Overall, these data uncover the basic topology of the human PR-DUB network and show that in the human genome ‘PRC1’ and PR-DUB complexes probably regulate different target gene sets.
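The pairwise overlap percentages reported above are directional: the fraction of one feature's peaks that overlap a second feature generally differs from the reverse. The following minimal sketch shows one way such fractions could be computed. The half-open interval convention, the one-base overlap criterion and the toy peak lists are assumptions made for illustration; the text does not specify how overlaps between MACS peaks were defined.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def fraction_overlapping(query, reference):
    """Fraction of query peaks that overlap at least one reference peak."""
    hits = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    return hits / len(query)

# Invented toy peak sets (hypothetical coordinates, not real ChIP-seq data).
peaks_a = [("chr1", 100, 200), ("chr1", 500, 600),
           ("chr2", 50, 150), ("chr2", 900, 1000)]
peaks_b = [("chr1", 150, 250), ("chr2", 100, 120), ("chr3", 10, 20)]

a_in_b = fraction_overlapping(peaks_a, peaks_b)
b_in_a = fraction_overlapping(peaks_b, peaks_a)
print(f"{a_in_b:.0%} of A peaks overlap B; {b_in_a:.0%} of B peaks overlap A")
```

The quadratic scan is adequate for a toy example; for peak sets of the sizes reported here (thousands of regions), a sorted sweep or an interval tree would be the usual choice.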

Figure 3.6: The PR-DUB complex preferentially localizes at TSSs different from PRC1 target genes. (A) Venn diagrams showing the genome-wide colocalization of high-confidence MACS peaks for the PR-DUB components FOXK1 (blue), ASXL1 (orange) and the O-GlcNAc modification (green). Empirical p-values from peak shuffling are indicated. The pie chart illustrates the distribution of PR-DUB peaks (2703, triple intersection) with respect to TSSs. (B) Distribution of ChIP-Seq signal of FOXK1, ASXL1, O-GlcNAc and input with respect to TSSs (metaTSS profile) in a region of 2 kb centered on the TSS position. Dashed lines correspond to all annotated TSSs, whereas solid lines correspond to PR-DUB target TSSs. The top-left inset shows the distribution of signals around TTS (metaTSS). (C) Pairwise correlation of PR-DUB feature enrichments at TSSs illustrated as smoothed color density. Spearman’s rank correlation coefficients are indicated. (D) Functional annotation of high-confidence MACS peaks localizing within 5 kb of annotated promoters. Only the top 10 significantly enriched MSigDB pathways obtained with GREAT are indicated. UCSC tracks of PR-DUB ChIP-Seq signals at selected promoters belonging to the top four hits are shown in order of significance (encoded by blue tones). (E) Hierarchical clustering of PR-DUB, PRC1 components and TIF1B enrichments at TSSs genome-wide. (F) Venn diagrams showing the genome-wide colocalization of high-confidence MACS peaks for the PR-DUB complex (red), PRC1 (brown) and TIF1B (light blue). UCSC tracks of ChIP-Seq signals at promoters bound by all three features are shown.

3.4. Discussion

In this study we used a systematic proteomic approach to comprehensively profile the PcG protein interactome in a single human cell line. The result is a high-density interaction network, which allowed the dissection of individual PcG complexes in unprecedented detail. In addition to the fine mapping of the cardinal PcG complexes PRC1 and PRC2, our data illuminate human PR-DUB as a multifaceted assembly comprising the O-linked N-acetylglucosamine (O-GlcNAc) transferase OGT1 plus several transcription and chromatin-binding factors. Hence, we could allocate newly identified interaction partners to all PcG complex families, leading to novel insights with regard to molecular function and recruitment. In the following, we discuss our significant findings and their functional implications.

Our data confirm the main conclusions of a recent AP-MS study that concentrated on key subunits of PRC1 complexes [10]: (1) There are six different groups of PRC1 complexes, which are defined by the interaction of RING1/2 with one out of six PCGF proteins. (2) PRC1 complexes can be further classified in the canonical PRC1.2/4 complex containing PCGF2 or PCGF4 together with one out of five CBX and one out of three PHC proteins. Alternatively, RYBP or YAF2 may replace the CBX subunits in those complexes. Our data, together with the data published by Gao et al. [10], make a convincing case for keeping the PRC1 nomenclature they introduced, naming these complexes PRC1.1-PRC1.6 according to their PCGF subunit. In contrast to previously published studies, we included RING1 as a bait protein. Notably, we demonstrated that RING1 and RING2 are mutually exclusive subunits, whereby RING1 can replace RING2 in all defined PRC1 complexes. Our results therefore provide a basis for the finding that the mouse homologs of RING1 and RING2 can act redundantly during early developmental processes [60, 61].
Recently it was shown that the diversity of the PRC1 complexes might depend on the binding preferences of PCGF proteins mediated by their C-terminally located RAWUL domain [62, 63]. The phylogenetic tree based on the sequence alignment of the PCGF RAWUL domains reflects the classification of PRC1 complexes in PRC1.1, PRC1.2/4, PRC1.3/5 and PRC1.6 [63]. The PCGF1 and PCGF2/4 RAWUL domains selectively interact with BCOR/BCORL and PHC proteins, respectively [63]. The interaction partners of PCGF3/5 and PCGF6 have not been experimentally specified, however. Interestingly, our data reveal the WD40 domain proteins DCAF7 and WDR5 as subunits of the PRC1.3/5 and PRC1.6 complexes, respectively. They may serve as potential binding partners of the corresponding PCGF RAWUL domains. WD40 domain proteins provide several surfaces for the interaction with multiple binding partners and are often found to scaffold the assembly of multisubunit complexes [63].

WDR5 has been identified as part of several complexes implicated in transcriptional activation [64]. In particular, its role in regulating global histone H3K4 methylation via the SET1 family of histone methyltransferases has been well established [65]. It was therefore surprising to see that, according to our data, WDR5 is unambiguously also part of PRC1.6, a repressive complex that is supposed to counteract the activity of the SET1 family. This might suggest a functional mechanism for the scaffolding activity of WDR5: upon differential modification, repressive PRC1.6 or activating complexes could be assembled around WDR5.

DCAF7 has been implicated in skin development and cell proliferation through interaction with DIAP1 and the dual-specificity tyrosine phosphorylation-regulated kinase DYR1A [66, 67]. Our study validates these interactions, as DCAF7 pull-downs detect, in addition to all subunits of the PRC1.3/5 complexes, the zinc finger transcription factors ZN503 and ZN703, the ankyrin-repeat proteins SWAHA and SWAHC, as well as DYR1A/B and DIAP1. We see no links of these latter DCAF7 interactors to PRC1.3/5, demonstrating that DCAF7, like WDR5, scaffolds several different protein complexes. The WD40 proteins EED and RBBP4/7, core subunits of PRC2, bear affinity specifically to H3K27me3 and histones, respectively [68, 69]. In this respect, in addition to their scaffolding function, DCAF7 and WDR5 could also contribute to the targeting or stabilization of their PRC1 complexes on chromatin.
Several recent studies have indicated that PRC1 complexes have different sets of target genes and follow different routes of recruitment, which seems to be reflected in their diverse subunit composition [10, 15-18, 33]. The canonical CBX-containing PRC1.2/4 complexes respond to and depend on the H3K27me3 mark, while RYBP-containing complexes bind independently of H3K27me3 [10, 16, 18]. In PRC1.1, KDM2B was identified as a DNA-binding protein that mediates PRC1.1 recruitment to unmethylated CpG islands. PRC1.6 contains the transcription factors E2F6, TFDP1, MAX and MGAP, which target the complex to E2F- and Myc-responsive genes [43, 44]. Interestingly, PRC1.6 also contains the histone-binding protein LMBL2 and the H3K9me3-binding proteins CBX3 and CBX1, which may also contribute to the recruitment of the complex or, alternatively, help to build up repressive chromatin structures [44, 70]. For PRC1.3/5, no recruitment function has been assigned so far. As mentioned above, DCAF7 could play a role in recognizing specific histone modifications.
In contrast to PRC1, the composition of PRC2 complexes is less variable. Our data agree with recent studies showing that the core of PRC2 is assembled from SUZ12, EED, RBBP4/7 and the catalytically active histone methyltransferase subunit EZH1 or EZH2 (reviewed in Margueron and Reinberg, 2011 [6]). Furthermore, our purification experiments with the three Polycomb-like orthologs PHF1, MTF2 and PHF19 confirm recent results showing that these proteins reside in different PRC2 complexes [25]. However, the allocation of the other two known subunits, AEBP2 and JARD2, has not yet been clearly defined, and it remained an open question whether they are present in a holo-complex together with the Polycomb-like proteins [6]. Our approach reveals an unexpected separation into distinct PRC2 assemblies: one class of PRC2 complexes is characterized by the simultaneous interaction of AEBP2 and JARD2 with the core, the other by the interaction of PHF1, MTF2 or PHF19. Furthermore, we identified two novel interactors, C17ORF96 and LCOR-CRA_b, plus the latter's isoform C10ORF12. Notably, both interact exclusively with the Polycomb-like class of PRC2. This further substantiates the hypothesis that we identified two different PRC2 classes, which we refer to in the following as PRC2-AJ (for AEBP2-JARD2) and PRC2-PCL (for PHF1, MTF2 and PHF19). AEBP2 and JARD2 can bind directly to DNA and have been implicated in the recruitment of PRC2 [26, 27, 30]. Interestingly, the effect of JARD2 depletion on global H3K27 methylation is mild, which argues that the separate PRC2-PCL complexes are responsible for maintaining the histone mark. PCL proteins affect PRC2 activity and recruitment via their ability to bind to H3K36me [24, 71, 72].
How C17ORF96 and LCOR-CRA_b/C10ORF12 influence PRC2-PCL needs to be addressed experimentally in the future. LCOR-CRA_b contains specific sequence motifs in its N-terminus that can mediate interactions with nuclear hormone receptors and CTBP/HDAC complexes [73]. The largest part of LCOR-CRA_b, which corresponds to C10ORF12, is not characterized and does not contain any known domain. An interesting possibility is that PRC2-PCL complexes are recruited via LCOR-CRA_b to nuclear hormone receptor binding sites upon ligand binding.
The third Drosophila PcG complex that we could recapitulate in human cells was PR-DUB, consisting of the core members BAP1 and ASXL1 or ASXL2. A previous attempt to detect BAP1 interaction partners led to the identification of Asxl1, Asxl2, Ogt, Foxk1, Kdm1b and Hcf1 in mouse spleen tissue [55]. We also identified these proteins in our study, indicating a general, cell type-independent assembly of mammalian PR-DUB complexes. Evidently, OGT1 is part of mammalian PR-DUB complexes, an interaction that was not identified when the Drosophila PR-DUB complex was purified [35]. However, the Drosophila OGT1 homolog, Supersexcombs (Sxc), was annotated as a bona fide PcG protein [36]. Mutations in the Drosophila Sxc, Calypso (BAP1 homolog) and ASX (ASXL1/2 homolog) genes lead to de-repression of HOX genes. It was shown that Calypso, ASX and O-GlcNAc, the modification set by Sxc/OGT1, co-localize with major PRC1-bound sites at inactive genes in the Drosophila genome [35, 36]. Using global ChIP-seq analysis, we found ASXL1, O-GlcNAc and FOXK1 co-localizing at transcription start sites, indicating that they are subunits of the same protein complex, PR-DUB1. In contrast to the situation in Drosophila, however, PR-DUB1 and PRC1 complexes are bound to distinct sets of target genes, arguing that these complexes are involved in different cellular processes. Of note, we identified the transcription factors FOXK1 and FOXK2 as PR-DUB components, which provides insight into the potential recruitment mechanism of PR-DUB complexes. Future experiments based on our AP-MS and genome-wide mapping data will shed light on the function of PR-DUB complexes in gene regulation and their relation to PRC1 and PRC2.

In summary, we present an in-depth and coherent analysis of the composition of human PcG complexes. Understanding the role of PcG proteins in epigenetic gene regulation requires understanding the dynamic assembly of PcG complexes, which in turn defines their function. Our study attests to the significant diversity that exists among individual PRC1, PRC2 and PR-DUB complexes. Hence, our data should serve as a solid basis for future systematic experiments to disentangle their roles in gene silencing.

3.5. Experimental Procedures

Expression constructs
To generate expression vectors for tetracycline-induced expression of N-terminally SH-tagged bait proteins, human ORFs provided as pDONR™223 vectors were picked from a Gateway®-compatible human orfeome collection (horfeome v5.1, Open Biosystems) for LR recombination with the customized destination vector pcDNA5/FRT/TO/SH/GW [37]. Genes not in the human orfeome collection were amplified from the MegaMan Human Transcriptome Library (Agilent) by PCR and cloned into entry vectors by TOPO (pENTR™-TOPO®) or BP clonase reaction (pDONR™223; Invitrogen).

Cell line generation
Flp-In HEK293 cells (Invitrogen) containing a single genomic FRT site and stably expressing the tet repressor were cultured in DMEM (4.5 g/l glucose, 10% FCS, 2 mM L-glutamine, 50 mg/ml penicillin, 50 mg/ml streptomycin) containing 100 µg/ml zeocin and 15 µg/ml blasticidin. The medium was exchanged with DMEM medium containing 15 µg/ml blasticidin before transfection. For cell line generation, Flp-In HEK293 cells were co-transfected with the corresponding expression plasmids and the pOG44 vector (Invitrogen) for co-expression of the Flp-recombinase using the Fugene transfection reagent (Roche). Two days after transfection, cells were selected in hygromycin-containing medium (100 µg/ml) for 2–3 weeks.

Protein purification
Stably expressing cell lines were each grown in four 14 cm Nunclon dishes to 80% confluency and harvested with PBS containing 10 mM EDTA. The suspended cells were pelleted, the supernatant was removed, and the pellets were shock-frozen in liquid nitrogen for long-term storage at -80°C. The frozen cell pellets were resuspended in 4 ml HNN lysis buffer (50 mM HEPES pH 7.5, 150 mM NaCl, 50 mM NaF, 1% Igepal CA-630 (Nonidet P-40 Substitute), 200 mM Na3VO4, 1 mM PMSF, 20 µg/ml Avidin and 1x Protease Inhibitor mix (Sigma)) and rested on ice for 10 min. Insoluble material was removed by centrifugation. Cleared lysates were loaded on a pre-equilibrated spin column (Biorad) containing 50 µl Strep-Tactin sepharose (IBA BioTAGnology). The sepharose was washed two times with 1 ml HNN lysis buffer. Bound proteins were eluted with 600 µl 0.5 mM biotin in HNN lysis buffer, incubated for 2 h with 50 µl αHA-agarose (Sigma) and washed three times with HNN buffer (50 mM HEPES pH 7.5, 150 mM NaCl, 50 mM NaF). The bound proteins were released by acidic elution with 450 µl 0.2 M glycine pH 2.5, and the eluate was neutralized with NH4HCO3. Cysteine bonds were reduced with 5 mM TCEP for 30 min at 37°C and alkylated in 10 mM iodoacetamide for 30 min at room temperature in the dark. Samples were diluted with NH4HCO3 to 1.5 M urea and digested with 1 mg trypsin (Promega) overnight at 37°C.
Bait proteins that had a low protein yield were processed by single-step purification, omitting the HA step. To remove detergent, the samples were washed three times with HNN buffer (50 mM HEPES pH 7.5, 150 mM NaCl, 50 mM NaF) before biotin elution. The eluates were TCA-precipitated to remove biotin and resolubilized in 50 µl 8 M urea in 50 mM NH4HCO3 pH 8.8.7. After dilution with NH4HCO3 to 1.5 M urea, the samples were reduced, alkylated and digested as in the double-step protocol. The digested peptides were purified with C18 microspin columns (The Nest Group Inc.) according to the manufacturer's protocol and dissolved in 0.1% formic acid, 1% acetonitrile for mass spectrometry analysis.
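
The urea dilution in the single-step protocol follows from C1 x V1 = C2 x V2. The short sketch below works through that arithmetic for the stated 50 µl of 8 M urea; it assumes additive volumes and a urea-free diluent, and the function name is illustrative rather than part of the original protocol.

```python
def diluent_volume_ul(start_volume_ul: float, start_molar: float, target_molar: float) -> float:
    """Volume of urea-free buffer needed to dilute a sample to target_molar.

    Uses C1 * V1 = C2 * V2 and assumes that volumes are additive.
    """
    if not 0 < target_molar <= start_molar:
        raise ValueError("target concentration must be positive and not exceed the start")
    final_volume_ul = start_volume_ul * start_molar / target_molar
    return final_volume_ul - start_volume_ul


# 50 µl of 8 M urea diluted to 1.5 M with NH4HCO3 buffer
added = diluent_volume_ul(50.0, 8.0, 1.5)
print(f"add {added:.1f} µl buffer, final volume {50.0 + added:.1f} µl")
```

Under these assumptions, roughly 217 µl of buffer would be added to reach a final volume of about 267 µl.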

Mass spectrometry
LC-MS/MS analysis was performed on an LTQ Orbitrap XL mass spectrometer (Thermo Fisher Scientific). Peptide separation was carried out on a Proxeon EASY-nLC II liquid chromatography system (Thermo Fisher Scientific) connected to an RP-HPLC column (75 µm x 10 cm) packed with Magic C18 AQ (3 µm) resin (WICOM International), running a linear gradient from 95% solvent A (0.1% formic acid, 2% acetonitrile) and 5% solvent B (98% acetonitrile, 0.1% formic acid) to 35% solvent B over 60 min at a flow rate of 300 nl/min. The data acquisition mode was set to obtain one high-resolution MS scan in the Orbitrap (60,000 @ 400 m/z). The 6 most abundant ions from the first MS scan were fragmented by collision induced fragmentation (CID) and MS/MS fragment ion spectra were acquired in the linear trap quadrupole (LTQ). Charge state screening was enabled and unassigned or singly charged ions were rejected. The dynamic exclusion window was set to 15 s and limited to 300 entries. Only MS precursors that exceeded a threshold of 150 ion counts were allowed to trigger MS/MS scans. The ion accumulation time was set to 500 ms (MS) and 250 ms (MS/MS) using a target setting of $10^6$ (MS) and $10^4$ (MS/MS) ions. After every replicate set, a peptide reference sample containing 200 fmol of human [Glu1]-Fibrinopeptide B (Sigma-Aldrich) was analyzed to monitor the performance of the LC-MS/MS system.
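
The precursor selection rules described above (top 6, charge state screening, 150 ion count threshold, 15 s dynamic exclusion with at most 300 entries) can be illustrated with a small simulation. This is only a sketch of the stated rules, not the instrument's actual control logic: matching excluded precursors by m/z rounded to two decimals and evicting the oldest entry when the list is full are assumptions made for illustration, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

TOP_N = 6                 # most abundant precursors fragmented per MS scan
MIN_ION_COUNTS = 150      # precursor intensity threshold
EXCLUSION_SECONDS = 15.0  # dynamic exclusion window
EXCLUSION_MAX = 300       # maximum size of the exclusion list


@dataclass(frozen=True)
class Precursor:
    mz: float
    intensity: float
    charge: Optional[int]  # None when the charge state could not be assigned


def select_precursors(scan, now, exclusion):
    """Pick precursors for MS/MS from one survey scan.

    `exclusion` maps a rounded m/z to the time it was last selected and is
    updated in place (rounding is an assumption of this sketch).
    """
    # Drop exclusion entries that are older than the exclusion window.
    for key in [k for k, t in exclusion.items() if now - t >= EXCLUSION_SECONDS]:
        del exclusion[key]

    candidates = [
        p for p in scan
        if p.charge is not None and p.charge >= 2   # charge state screening
        and p.intensity > MIN_ION_COUNTS            # intensity threshold
        and round(p.mz, 2) not in exclusion         # dynamic exclusion
    ]
    candidates.sort(key=lambda p: p.intensity, reverse=True)

    selected = []
    for p in candidates[:TOP_N]:
        if len(exclusion) >= EXCLUSION_MAX:
            # Assumed policy: evict the oldest entry when the list is full.
            del exclusion[min(exclusion, key=exclusion.get)]
        exclusion[round(p.mz, 2)] = now
        selected.append(p)
    return selected


scan = [
    Precursor(500.25, 1000.0, 2),
    Precursor(600.30, 5000.0, 1),     # singly charged: rejected
    Precursor(700.40, 100.0, 3),      # below the intensity threshold
    Precursor(800.50, 300.0, None),   # unassigned charge: rejected
    Precursor(450.10, 2000.0, 3),
]
exclusion = {}
first = select_precursors(scan, now=0.0, exclusion=exclusion)
second = select_precursors(scan, now=5.0, exclusion=exclusion)
third = select_precursors(scan, now=20.0, exclusion=exclusion)
```

In this toy scan only two ions qualify; they are skipped 5 s later because they are still excluded and become eligible again once the 15 s window has passed.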

Protein identification
Acquired spectra were searched with X!Tandem [38] against the canonical human proteome reference dataset [74] (http://www.uniprot.org/), extended with reverse decoy sequences for all entries. The search parameters were set to include only fully tryptic peptides (KR/P) containing up to two missed cleavages. Carbamidomethyl (+57.021465 amu) on Cys was set as static peptide modification. Oxidation (+15.99492 amu) on Met and phosphorylation (+79.966331 amu) on Ser, Thr, Tyr were set as dynamic peptide modifications. The precursor mass tolerance was set to 25 ppm, the fragment mass error tolerance to 0.5 Da. Peptide-spectrum matches were statistically evaluated using PeptideProphet, and protein inference was performed with ProteinProphet, both part of the Trans-Proteomic Pipeline [39] (TPP, v.4.5.1). A minimum protein probability of 0.9 was set to match a false discovery rate (FDR) of <1%. The resulting pep.xml and prot.xml files were used as input for the spectral counting software tool Abacus to calculate spectral counts and NSAF values [75].
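
To make the NSAF quantity concrete, the sketch below computes the basic normalized spectral abundance factor, NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j), where SpC is the spectral count and L the protein length. This is the plain textbook definition only; the adjusted NSAF values reported by Abacus involve additional handling of shared peptides that is not reproduced here. Protein names, counts and lengths are made-up illustration values.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor per protein in one sample.

    NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j)
    """
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no spectral counts in sample")
    return {p: v / total for p, v in saf.items()}


# Hypothetical purification: spectral counts and protein lengths (residues)
counts = {"BAIT": 120, "PREY_A": 30, "PREY_B": 6}
lengths = {"BAIT": 400, "PREY_A": 300, "PREY_B": 600}

values = nsaf(counts, lengths)
# Expressing each value relative to the bait, as in the "relative bait NSAF"
# column of Supplementary Table S3.2.
relative_to_bait = {p: v / values["BAIT"] for p, v in values.items()}
```

NSAF values sum to 1 within a sample, so they are comparable across runs with different total spectral counts.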

Evaluation of high confidence interacting proteins (HCIP)
Adjusted NSAF values of identified co-purified proteins were compared to a mock AP control dataset consisting of 62 StrepHA-GFP and 12 StrepHA-RFP-NLS purification experiments. The protein abundance in the control dataset was estimated by averaging the 10 highest NSAF values per protein among all 74 measurements. For a candidate interaction to pass, its enrichment over the control dataset had to be >10. Adjusted NSAF values were also used to calculate WDN-scores of all potential interactions [40]. A simulated data matrix was used to calculate the WD-score threshold below which 98% of the simulated data fall. All raw WD-scores were normalized to this value. To gain sensitivity and incorporate network topology as a filtering criterion, the distance between all interactions and the closest high confidence interactions (which passed both filtering steps) was determined with the software tool Multiple Experiment Viewer [76] (http://www.tm4.org/mev/). Interactions that were below the thresholds for control ratio and WDN-score could be rescued if the distance to a high confidence interaction was greater than zero. The final filtered dataset contained high confidence interacting proteins (HCIPs) and corresponding protein-protein interactions.
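
The filtering logic described above can be sketched as follows. The code takes the WDN-score and the network distance as given inputs, since their computation is not detailed here; treating a protein that is absent from all controls as infinitely enriched is an assumption of this sketch, and the function names and example values are hypothetical.

```python
import math


def control_abundance(control_nsaf, top_n=10):
    """Mean of the top_n highest NSAF values of one protein across all control
    runs (runs without detection are expected to be listed as 0.0)."""
    top = sorted(control_nsaf, reverse=True)[:top_n]
    return sum(top) / len(top) if top else 0.0


def enrichment(sample_nsaf, control_nsaf):
    """Ratio of the NSAF in the bait purification to the control estimate."""
    ctrl = control_abundance(control_nsaf)
    if ctrl == 0:
        # Assumption: proteins never seen in controls count as fully enriched.
        return math.inf if sample_nsaf > 0 else 0.0
    return sample_nsaf / ctrl


def is_hcip(sample_nsaf, control_nsaf, wdn_score, distance=0,
            min_ratio=10.0, min_wdn=1.0):
    """Pass if both thresholds are exceeded, or rescue if distance > 0."""
    passes = enrichment(sample_nsaf, control_nsaf) > min_ratio and wdn_score > min_wdn
    return passes or distance > 0


# Hypothetical prey detected in 5 of 74 control purifications
controls = [0.002] * 5 + [0.0] * 69
print(control_abundance(controls))                  # mean of the 10 highest values
print(is_hcip(0.02, controls, wdn_score=1.5))       # 20-fold enriched
print(is_hcip(0.005, controls, wdn_score=1.5))      # only 5-fold enriched
print(is_hcip(0.005, controls, wdn_score=1.5, distance=1))  # rescued
```

Averaging only the 10 highest control values is a conservative choice: a contaminant that appears in a few control runs still receives a substantial background estimate.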

Clustering analysis
Agglomerative hierarchical clustering of HCIPs was performed using adjusted NSAF values. Different correlation-based dissimilarity measures were considered in combination with commonly adopted intergroup dissimilarity measures (single, average and complete linkage functions). For each pair of measures, clustering performance was evaluated using the cophenetic correlation coefficient [77]. Briefly, this coefficient judges the ability of a dendrogram to represent the input data structure. As a result of this procedure, we chose a dissimilarity based on Spearman's rank correlation coefficient (SCC) along with average linkage. Hence, the dissimilarity between prey i and j was computed as $d_{ij} = (1 - \rho(x_i, x_j))/2$, where $\rho$ is the SCC.
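
The Spearman-based dissimilarity can be sketched in plain Python as below. It maps a correlation of +1 to a dissimilarity of 0 and a correlation of -1 to a dissimilarity of 1. The helper names are illustrative, and the prey profiles are made-up values; the resulting pairwise matrix would then be passed to an average-linkage clustering routine, which is not shown.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant profile")
    return cov / math.sqrt(vx * vy)


def spearman_dissimilarity(x, y):
    """d = (1 - rho) / 2, with rho the Spearman rank correlation."""
    rho = pearson(average_ranks(x), average_ranks(y))
    return (1 - rho) / 2


# Hypothetical NSAF profiles of two preys across four bait purifications
prey_i = [0.1, 0.4, 0.2, 0.9]
prey_j = [1.0, 4.0, 2.0, 9.0]
d_same_order = spearman_dissimilarity(prey_i, prey_j)
d_reversed = spearman_dissimilarity([1, 2, 3, 4], [4, 3, 2, 1])
```

Because only ranks enter the calculation, two preys with the same ordering of abundances across baits have zero dissimilarity even if their absolute NSAF values differ by orders of magnitude.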

Network visualization and literature data mining
Protein interaction data were visualized with Cytoscape 2.8.6 [78] (http://www.cytoscape.org). Known interactions were obtained from the protein interaction network analysis platform PINA [79] (http://cbg.garvan.unsw.edu.au/pina/home.do) using bait protein identifiers as starting nodes.

Genomic data analysis
ChIP-Seq profiles of RING1B, RYBP, TIF1B (GEO accession number GSM855007, GSM855008 and GSE27929, respectively) were downloaded as raw data in sra format, along with corresponding input datasets, and converted to fastq using the NCBI Short Read Archive Toolkit. All ChIP-Seq datasets were aligned to the human genome (hg19 assembly) using Bowtie 2.0.0 [80], allowing for 1 mismatch in a seed of 30 nt, looking for up to 100 alignments and reporting the best. Overall alignment rates ranged between 83 and 94% for in-house generated datasets and between 75 and 98% for external datasets. Alignments were converted from SAM format to BAM using SAMtools 0.1.18 [81]. Peak calling was performed using MACS 1.4.0 [58] with default parameters. When replicates were available, only peaks shared between replicates were considered further. Peaks were then filtered according to their p-values, yielding a subset denoted as high-confidence peaks (p < 1e-10), which were used for subsequent analysis in R using Bioconductor [82] and custom scripts. Overlap between peaks was determined using GenomicRanges; peaks sharing at least 1 bp were considered as overlapping. RefSeq transcript annotations were fetched from UCSC using GenomicFeatures. Unique TSSs were defined as TSSs having no other annotated TSS within their 1 kb flanking region, irrespective of the strand. A total of 21612 unique TSSs was then considered further. MetaTSS and metaTTS analysis was performed by computing feature signals at TSSs (+/- 1 kb) and transcription termination sites (TTS, +/- 1 kb), respectively, using non-overlapping windows of width 20 nt. Signals were normalized across all corresponding genomic regions to produce density plots. Feature enrichments at unique promoters (TSSs +/- 1 kb) were computed as described in [83]. Pairwise Spearman's rank correlations were computed between all analyzed features. Agglomerative hierarchical clustering of feature enrichments at promoters was performed using Spearman's rank correlation coefficient based dissimilarity and average linkage. MSigDB pathway analyses were performed with GREAT [84] (v 2.0.2) by associating genomic regions to single nearest genes within 5 kb.
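
Two of the rules above, the definition of unique TSSs and the 1 bp peak overlap criterion, are easy to state in code. The original analysis was done in R with GenomicRanges and custom scripts; the Python sketch below is an independent illustration, not that code. Collapsing transcripts that share the exact same TSS coordinate into one TSS and treating intervals as closed are assumptions of this sketch.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def unique_tss(tss_list, flank=1000):
    """Return TSSs with no other annotated TSS within `flank` bp, ignoring strand.

    `tss_list` holds (chromosome, position) tuples. Identical coordinates are
    collapsed first (assumption).
    """
    by_chrom = defaultdict(set)
    for chrom, pos in tss_list:
        by_chrom[chrom].add(pos)

    result = []
    for chrom, positions in by_chrom.items():
        ordered = sorted(positions)
        for pos in ordered:
            lo = bisect_left(ordered, pos - flank)
            hi = bisect_right(ordered, pos + flank)
            if hi - lo == 1:  # only the TSS itself lies in the window
                result.append((chrom, pos))
    return sorted(result)


def overlaps(a, b):
    """True if two closed intervals (start, end) share at least 1 bp."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical annotation: two close TSSs on chr1 and two isolated ones
annotation = [("chr1", 1000), ("chr1", 1500), ("chr1", 5000),
              ("chr2", 1200), ("chr1", 5000)]
print(unique_tss(annotation))
print(overlaps((100, 200), (200, 300)), overlaps((100, 200), (201, 300)))
```

Restricting the promoter analysis to unique TSSs avoids assigning the same ChIP-Seq signal to several neighbouring promoters.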

3.6. References

1. Maurange, C., N. Lee, and R. Paro, Signaling meets chromatin during tissue regeneration in Drosophila. Curr Opin Genet Dev, 16, 485-9 (2006).
2. Sparmann, A. and M. van Lohuizen, Polycomb silencers control cell fate, development and cancer. Nat Rev Cancer, 6, 846-56 (2006).
3. Jaenisch, R. and R. Young, Stem cells, the molecular circuitry of pluripotency and nuclear reprogramming. Cell, 132, 567-82 (2008).
4. Ringrose, L., Polycomb comes of age: genome-wide profiling of target sites. Curr Opin Cell Biol, 19, 290-7 (2007).
5. Muller, J. and P. Verrijzer, Biochemical mechanisms of gene regulation by polycomb group protein complexes. Curr Opin Genet Dev, 19, 150-8 (2009).
6. Margueron, R. and D. Reinberg, The Polycomb complex PRC2 and its mark in life. Nature, 469, 343-9 (2011).
7. Shao, Z., et al., Stabilization of chromatin structure by PRC1, a Polycomb complex. Cell, 98, 37-46 (1999).
8. Saurin, A.J., Z. Shao, H. Erdjument-Bromage, P. Tempst, and R.E. Kingston, A Drosophila Polycomb group complex includes Zeste and dTAFII proteins. Nature, 412, 655-60 (2001).
9. Vandamme, J., P. Volkel, C. Rosnoblet, P. Le Faou, and P.O. Angrand, Interaction proteomics analysis of polycomb proteins defines distinct PRC1 complexes in mammalian cells. Mol Cell Proteomics, 10, M110.002642 (2011).
10. Gao, Z., et al., PCGF homologs, CBX proteins, and RYBP define functionally distinct PRC1 family complexes. Mol Cell, 45, 344-56 (2012).
11. Sanchez, C., et al., Proteomics analysis of Ring1B/Rnf2 interactors identifies a novel complex with the Fbxl10/Jhdm1B histone demethylase and the Bcl6 interacting corepressor. Mol Cell Proteomics, 6, 820-34 (2007).
12. de Napoles, M., et al., Polycomb group proteins Ring1A/B link ubiquitylation of histone H2A to heritable gene silencing and X inactivation. Dev Cell, 7, 663-76 (2004).
13. Wang, H., et al., Role of histone H2A ubiquitination in Polycomb silencing. Nature, 431, 873-8 (2004).
14. Stock, J.K., et al., Ring1-mediated ubiquitination of H2A restrains poised RNA polymerase II at bivalent genes in mouse ES cells. Nat Cell Biol, 9, 1428-35 (2007).
15. Farcas, A.M., et al., KDM2B links the Polycomb Repressive Complex 1 (PRC1) to recognition of CpG islands. Elife, 1, e00205 (2012).
16. Tavares, L., et al., RYBP-PRC1 complexes mediate H2A ubiquitylation at polycomb target sites independently of PRC2 and H3K27me3. Cell, 148, 664-78 (2012).
17. He, J., et al., Kdm2b maintains murine embryonic stem cell status by recruiting PRC1 complex to CpG islands of developmental genes. Nat Cell Biol, 15, 373-84 (2013).
18. Morey, L., L. Aloia, L. Cozzuto, S.A. Benitah, and L. Di Croce, RYBP and Cbx7 define specific biological functions of polycomb complexes in mouse embryonic stem cells. Cell Rep, 3, 60-9 (2013).
19. Cao, R., et al., Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science, 298, 1039-43 (2002).
20. Czermin, B., et al., Drosophila enhancer of Zeste/ESC complexes have a histone H3 methyltransferase activity that marks chromosomal Polycomb sites. Cell, 111, 185-96 (2002).
21. Muller, J., et al., Histone methyltransferase activity of a Drosophila Polycomb group repressor complex. Cell, 111, 197-208 (2002).
22. Bernstein, E., et al., Mouse polycomb proteins bind differentially to methylated histone H3 and RNA and are enriched in facultative heterochromatin. Mol Cell Biol, 26, 2560-9 (2006).
23. Wang, L., et al., Hierarchical recruitment of polycomb group silencing complexes. Mol Cell, 14, 637-46 (2004).
24. Sarma, K., R. Margueron, A. Ivanov, V. Pirrotta, and D. Reinberg, Ezh2 requires PHF1 to efficiently catalyze H3 lysine 27 trimethylation in vivo. Mol Cell Biol, 28, 2718-31 (2008).
25. Hunkapiller, J., et al., Polycomb-like 3 promotes polycomb repressive complex 2 binding to CpG islands and embryonic stem cell self-renewal. PLoS Genet, 8, e1002576 (2012).
26. Kim, T.G., J.C. Kraus, J. Chen, and Y. Lee, JUMONJI, a critical factor for cardiac development, functions as a transcriptional repressor. J Biol Chem, 278, 42247-55 (2003).
27. Kim, H., K. Kang, and J. Kim, AEBP2 as a potential targeting protein for Polycomb Repression Complex PRC2. Nucleic Acids Res, 37, 2940-50 (2009).
28. Peng, J.C., et al., Jarid2/Jumonji coordinates control of PRC2 enzymatic activity and target gene occupancy in pluripotent cells. Cell, 139, 1290-302 (2009).
29. Shen, X., et al., Jumonji modulates polycomb activity and self-renewal versus differentiation of stem cells. Cell, 139, 1303-14 (2009).
30. Pasini, D., et al., JARID2 regulates binding of the Polycomb repressive complex 2 to target genes in ES cells. Nature, 464, 306-10 (2010).
31. Dietrich, N., et al., REST-mediated recruitment of polycomb repressor complexes in mammalian cells. PLoS Genet, 8, e1002494 (2012).
32. Arnold, P., et al., Modeling of epigenome dynamics identifies transcription factors that mediate Polycomb targeting. Genome Res, 23, 60-73 (2013).
33. Wu, X., J.V. Johansen, and K. Helin, Fbxl10/Kdm2b recruits polycomb repressive complex 1 to CpG islands and regulates H2A ubiquitylation. Mol Cell, 49, 1134-46 (2013).
34. Beisel, C. and R. Paro, Silencing chromatin: comparing modes and mechanisms. Nat Rev Genet, 12, 123-35 (2011).
35. Scheuermann, J.C., et al., Histone H2A deubiquitinase activity of the Polycomb repressive complex PR-DUB. Nature, 465, 243-7 (2010).
36. Gambetta, M.C., K. Oktaba, and J. Muller, Essential role of the glycosyltransferase sxc/Ogt in polycomb repression. Science, 325, 93-6 (2009).
37. Glatter, T., A. Wepf, R. Aebersold, and M. Gstaiger, An integrated workflow for charting the human interaction proteome: insights into the PP2A system. Mol Syst Biol, 5, 237 (2009).
38. Craig, R. and R.C. Beavis, TANDEM: matching proteins with tandem mass spectra. Bioinformatics, 20, 1466-7 (2004).
39. Deutsch, E.W., et al., A guided tour of the Trans-Proteomic Pipeline. Proteomics, 10, 1150-9 (2010).
40. Behrends, C., M.E. Sowa, S.P. Gygi, and J.W. Harper, Network organization of the human autophagy system. Nature, 466, 68-76 (2010).
41. Gearhart, M.D., C.M. Corcoran, J.A. Wamstad, and V.J. Bardwell, Polycomb group and SCF ubiquitin ligases are found in a novel BCOR complex that is recruited to BCL6 targets. Mol Cell Biol, 26, 6880-9 (2006).
42. El Messaoudi-Aubert, S., et al., Role for the MOV10 RNA helicase in polycomb-mediated repression of the INK4a tumor suppressor. Nat Struct Mol Biol, 17, 862-8 (2010).
43. Ogawa, H., K. Ishiguro, S. Gaubatz, D.M. Livingston, and Y. Nakatani, A complex with chromatin modifiers that occupies E2F- and Myc-responsive genes in G0 cells. Science, 296, 1132-6 (2002).
44. Trojer, P., et al., L3MBTL2 protein acts in concert with PcG protein-mediated monoubiquitination of H2A to establish a repressive chromatin structure. Mol Cell, 42, 438-50 (2011).
45. Qin, J., et al., The polycomb group protein L3mbtl2 assembles an atypical PRC1-family complex that is essential in pluripotent stem cells and early development. Cell Stem Cell, 11, 319-32 (2012).
46. Dou, Y., et al., Regulation of MLL1 H3K4 methyltransferase activity by its core components. Nat Struct Mol Biol, 13, 713-9 (2006).
47. Wysocka, J., et al., A PHD finger of NURF couples histone H3 lysine 4 trimethylation with chromatin remodelling. Nature, 442, 86-90 (2006).
48. Cai, Y., et al., Subunit composition and substrate specificity of a MOF-containing histone acetyltransferase distinct from the male-specific lethal (MSL) complex. J Biol Chem, 285, 4268-72 (2010).
49. Zhang, P., H. Lee, J.S. Brunzelle, and J.F. Couture, The plasticity of WDR5 peptide-binding cleft enables the binding of the SET1 family of histone methyltransferases. Nucleic Acids Res, 40, 4237-46 (2012).
50. Hammond, K.L., I.M. Hanson, A.G. Brown, L.A. Lettice, and R.E. Hill, Mammalian and Drosophila dachshund genes are related to the Ski proto-oncogene and are expressed in eye and limb. Mech Dev, 74, 121-31 (1998).
51. Fernandes, I. and J.H. White, Agonist-bound nuclear receptors: not just targets of coactivators. J Mol Endocrinol, 31, 1-7 (2003).
52. Shi, Y., et al., Coordinated histone modifications mediated by a CtBP co-repressor complex. Nature, 422, 735-8 (2003).
53. Hansen, K.H., et al., A model for transmission of the H3K27me3 epigenetic mark. Nat Cell Biol, 10, 1291-300 (2008).
54. Hart, G.W., C. Slawson, G. Ramirez-Correa, and O. Lagerlof, Cross talk between O-GlcNAcylation and phosphorylation: roles in signaling, transcription, and chronic disease. Annu Rev Biochem, 80, 825-58 (2011).
55. Dey, A., et al., Loss of the tumor suppressor BAP1 causes myeloid transformation. Science, 337, 1541-6 (2012).
56. Vella, P., et al., Tet proteins connect the O-linked N-acetylglucosamine transferase Ogt to chromatin in embryonic stem cells. Mol Cell, 49, 645-56 (2013).
57. Whisenhunt, T.R., et al., Disrupting the enzyme complex regulating O-GlcNAcylation blocks signaling and development. Glycobiology, 16, 551-63 (2006).
58. Zhang, Y., et al., Model-based analysis of ChIP-Seq (MACS). Genome Biol, 9, R137 (2008).
59. Iyengar, S., A.V. Ivanov, V.X. Jin, F.J. Rauscher, 3rd, and P.J. Farnham, Functional analysis of KAP1 genomic recruitment. Mol Cell Biol, 31, 1833-47 (2011).
60. Endoh, M., et al., Histone H2A mono-ubiquitination is a crucial step to mediate PRC1-dependent repression of developmental genes to maintain ES cell identity. PLoS Genet, 8, e1002774 (2012).
61. Posfai, E., et al., Polycomb function during oogenesis is required for mouse embryonic development. Genes Dev, 26, 920-32 (2012).
62. Sanchez-Pulido, L., D. Devos, Z.R. Sung, and M. Calonje, RAWUL: a new ubiquitin-like domain in PRC1 ring finger proteins that unveils putative plant and worm PRC1 orthologs. BMC Genomics, 9, 308 (2008).
63. Junco, S.E., et al., Structure of the polycomb group protein PCGF1 in complex with BCOR reveals basis for binding selectivity of PCGF homologs. Structure, 21, 665-71 (2013).
64. Wysocka, J., et al., WDR5 associates with histone H3 methylated at K4 and is essential for H3 K4 methylation and vertebrate development. Cell, 121, 859-72 (2005).
65. Ruthenburg, A.J., et al., Histone H3 recognition and presentation by the WDR5 module of the MLL1 complex. Nat Struct Mol Biol, 13, 704-12 (2006).
66. Morita, K., C. Lo Celso, B. Spencer-Dene, C.C. Zouboulis, and F.M. Watt, HAN11 binds mDia1 and controls GLI1 transcriptional activity. J Dermatol Sci, 44, 11-20 (2006).
67. Miyata, Y. and E. Nishida, DYRK1A binds to an evolutionarily conserved WD40-repeat protein WDR68 and induces its nuclear translocation. Biochim Biophys Acta, 1813, 1728-39 (2011).
68. Song, J.J., J.D. Garlick, and R.E. Kingston, Structural basis of histone H4 recognition by p55. Genes Dev, 22, 1313-8 (2008).
69. Margueron, R., et al., Role of the polycomb protein EED in the propagation of repressive histone marks. Nature, 461, 762-7 (2009).
70. Lachner, M., D. O'Carroll, S. Rea, K. Mechtler, and T. Jenuwein, Methylation of histone H3 lysine 9 creates a binding site for HP1 proteins. Nature, 410, 116-20 (2001).
71. Musselman, C.A., et al., Molecular basis for H3K36me3 recognition by the Tudor domain of PHF1. Nat Struct Mol Biol, 19, 1266-72 (2012).
72. Cai, L., et al., An H3K36 methylation-engaging Tudor motif of polycomb-like proteins mediates PRC2 complex targeting. Mol Cell, 49, 571-82 (2013).
73. Fernandes, I., et al., Ligand-dependent nuclear receptor corepressor LCoR functions by histone deacetylase-dependent and -independent mechanisms. Mol Cell, 11, 139-50 (2003).
74. UniProt Consortium, Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res, 40, D71-5 (2012).
75. Fermin, D., V. Basrur, A.K. Yocum, and A.I. Nesvizhskii, Abacus: a computational tool for extracting and pre-processing spectral count data for label-free quantitative proteomic analysis. Proteomics, 11, 1340-5 (2011).
76. Saeed, A.I., et al., TM4: a free, open-source system for microarray data management and analysis. Biotechniques, 34, 374-8 (2003).
77. Hastie, T., R. Tibshirani, and J. Friedman, The elements of statistical learning: data mining, inference, and prediction. 2nd ed. Springer series in statistics. 2009, New York, N.Y.: Springer. 745 p.
78. Shannon, P., et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13, 2498-504 (2003).
79. Wu, J., et al., Integrated network analysis platform for protein-protein interactions. Nat Methods, 6, 75-7 (2009).
80. Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods, 9, 357-9 (2012).
81. Li, H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-9 (2009).
82. Gentleman, R.C., et al., Bioconductor: open software development for computational biology and bioinformatics. Genome Biol, 5, R80 (2004).
83. Enderle, D., et al., Polycomb preferentially targets stalled promoters of coding and noncoding transcripts. Genome Res, 21, 216-26 (2011).
84. McLean, C.Y., et al., GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol, 28, 495-501 (2010).

3.7. Supplementary materials

3.7.1. Supplementary tables

All supplementary tables are available on the attached CD.

Supplementary Table S3.1: List of all bait proteins used in this study.

Supplementary Table S3.2:
HCIP: Qualifier for high confidence protein interaction, which differentiates between yes, no and bait.
Avg Unique Peptides: The average number of proteotypic peptides from replicate experiments.
Avg Spectral counts: The average number of spectral counts from replicate experiments.
Avg relative bait NSAF: Normalized spectral count values (NSAF) expressed relative to the corresponding bait protein.
Control NSAF ratio: Obtained NSAF values were compared to NSAF values from control experiments and expressed as enrichment over control. The threshold was empirically set to >10 for true interactions.
WDN-Score: Score to discriminate true from unspecific interactions, based on prey frequency, abundance and reproducibility. Scores above 1.00 are considered bona fide interactors.
Distance (*): Represents the distance to the closest neighboring bait with a high confidence interaction to the same prey protein. If an interaction does not fulfill the required filtering conditions, it can be rescued by an assigned distance of >0. This loosens the stringent filtering conditions and increases sensitivity.

Supplementary Table S3.3: List of extracted protein-protein interactions from public databases. All extracted interactions are represented as a bait-prey list and contain relevant annotations to access the source database entry or corresponding publications (by PubMed identifier).

3.7.2. Supplementary figures

Supplementary Figure S3.1: High resolution interaction map of SKP1. Overview of the full SKP1 interactome defined in this study (blue and orange lines) and annotated in public literature databases (red dotted lines). The thick blue lines represent the main interactions to PcG components that cluster together with SKP1. About half of the identified interactions (n = 42) were F-box proteins, including KDM2B – the only F-box protein associated with the PRC1.1 complex.

Supplementary Figure S3.2: The scaffold protein WDR5 is involved in many chromatin regulating complexes. (A) Detailed interaction map of WDR5 experiments. Blue lines represent interactions detected in this study to PcG associated proteins that cluster with WDR5 (PRC1.6 and CSK2). Interactions to other HCIPs could, together with public literature interactions (dotted red lines), be assigned to other protein complexes (MLL, TCP, NSL, TORC2 and the ADA2/GCN5/ADA3 transcription activator complex). (B) Overlap with the study of Cai et al. 2010.

Supplementary Figure S3.3: Public protein interaction data. Publicly annotated protein interactions were extracted from the protein interaction network analysis platform PINA, using the PcG bait proteins as seeds. (A) Recall of known interactions from public protein interaction databases. The recall rate was higher when compared to a more robust subset of literature interactions that required more than one independent mention per interaction. N = subset size. (B) Distribution of the number of independent mentions (by PubMed identifier) across all extracted interactions. A large majority (79%) was reported in a single publication.

Supplementary Figure S3.4: RBBP4/7 interaction proteome. Aside from the interactions to PRC2 and HP1, the RBBP4/7 proteins were found to bind to proteins of the LINC, NURF, NuRD and SIN3 complexes. Blue and orange lines are interactions detected in this study; red dotted lines represent public literature interactions.

Supplementary Figure S3.5: PHORC and LMBL1 do not provide any links to PcG protein complexes. (A) Full HCIP interactome of the Drosophila homologous proteins associated with the PHORC complex. TYY1 co-purified with the INO80 complex. (B) Interaction proteome of the four human LMBL paralogues. Only LMBL2 provided connections with PcG protein assembly (PRC1.6). Blue and orange interactions represent interactions found in this study; red dotted lines are annotated public literature interactions.

Supplementary Figure S3.6: PRC1 assemblies. (A) The PRC1.2/4 complexes, defined by PCGF2/4, interact with the PRC1 core proteins RING1 and RING2. PCGF2/4 can undergo mutually exclusive interactions with either CBX and PHC proteins or with YAF2 and RYBP. PHC1/2/3 never interact with each other and neither do CBX2/4/6/7/8. (B) The PRC1.1 complex contains PCGF1 and SKP1 and links the co-repressors BCOR and BCORL as well as the demethylase KDM2B to the PRC1 core.

Supplementary Figure S3.7: HP1 proteins CBX1, CBX3 and CBX5 assemble a diverse set of protein complexes implicated in different molecular functions. (A) CBX1 and CBX3 but not CBX5 interact with the histone methyltransferases EHMT1 and EHMT2. The CBX1/3-EHMT1/2 complex also interacts with the zinc finger transcription factors WIZ, ZN644, ZN462 and the co-repressor protein TIF1B. In contrast to a previous report, we did not detect any potential interactions of EHMT1/2 with PRC1.6 [45]. (B) Interaction of CBX3 and CBX5 with SENP7. This complex may also include TIF1B, the zinc finger transcription factor AHDC1 and the histone chaperone CHAP1. (C) CBX1/3/5 complex with the histone chaperone CAF1, RBBP4, which is potentially involved in DNA replication. (D) Centromeric DSN1/MIS12 complex with HP1 proteins, involved in mitosis. Hexagons indicate bait proteins, squares prey proteins.

Supplementary Figure S3.8: Sequence coverage of LCOR isoform CRA_b (C10ORF12_long). (A) All identified peptides that match to C10ORF12_long. Blue: the shared sequence between LCOR and C10ORF12_long; Green: C10ORF12_long specific linker; Red: canonical C10ORF12 sequence (as annotated in UniProt). (B) Overview of identified LCOR and C10ORF12 isoform peptides in PRC2 purification experiments.

Supplementary Figure S3.9: Functional annotation of high-confidence MACS peaks for individual PR-DUB components and genome-wide analysis of PR-DUB, PRC1 and TIF1B enrichments at TSSs. (A) Functional annotation of high-confidence MACS peaks localizing within 5 kb of annotated promoters. Enriched MSigDB pathways are reported as returned by GREAT analysis. (B) Pairwise correlation of PR-DUB, PRC1 components and TIF1B enrichments at TSSs illustrated as smoothed color density. Spearman's rank correlation coefficients are indicated.

Chapter 4 Interlaboratory Reproducibility of Large-scale Human Protein-complex Analysis by Standardized AP-MS

Markku Varjosalo, Roberto Sacco, Alexey Stukalov, Audrey van Drogen, Melanie Planyavsky, Simon Hauri, Ruedi Aebersold, Keiryn L. Bennett, Jacques Colinge, Matthias Gstaiger and Giulio Superti-Furga

Published in Varjosalo, M., et al. (2013). "Interlaboratory reproducibility of large-scale human protein-complex analysis by standardized AP-MS." Nat Methods 10(4): 307-314.

Contribution by SH: Design and performance of the LC-MS/MS measurements, data analysis and discussion.

4.1. Abstract

The characterization of all protein complexes of human cells under defined physiological conditions using affinity purification-mass spectrometry (AP-MS) is a highly desirable step in our quest to understand the phenotypic effects of genomic information. However, such a challenging goal has not yet been achieved, as it requires robustness of the experimental workflow and high data consistency across different studies and laboratories. We embarked on a systematic investigation of the sources of variation in a common AP-MS workflow by performing a rigorous inter-laboratory comparative analysis on the interactomes of 32 human kinases. We show that it is possible to achieve high inter-laboratory reproducibility despite differences in MS configurations and biochemistry-related variations, and that the combination of independent datasets improved the sensitivity of the approach, resulting in even more detailed networks. Our analysis demonstrates the feasibility of obtaining a high-quality map of the human proteome by multi-laboratory efforts and creates the precondition for a community-wide project.

4.2. Introduction

Proteins engage in a plethora of stable and dynamic interactions to form macromolecular assemblies, essential components of any cellular process. The recent progress in sensitivity, throughput and specificity of AP-MS techniques [1-8] makes them ideal for the large-scale analysis of human protein complexes [9-20]. Roughly a tenth of the estimated 650,000 protein-protein interactions in human cells have been experimentally observed so far [21,22], suggesting that the complete characterization of the human interactome, similarly to early genome sequencing projects, will require a collaborative effort of a larger scientific community. The success of such collaborative AP-MS efforts towards completing a reference dataset for the human interaction proteome strictly depends on the specificity and robustness of the respective sample preparation and data acquisition methods. Some of these issues have been addressed for collecting binary protein interaction information by the popular yeast two-hybrid system [23-25]; however, no sufficient evidence has been available so far that the protocols used to characterize complexes by AP-MS would be robust enough to warrant the large-scale multi-laboratory analysis necessary for the entire proteome. Currently only 25% of annotated interactions are reproduced in an independent experiment using the same human protein as a bait. This rather low value could be explained by the heterogeneity of the experimental systems and processing pipelines. We conducted a large-scale comparative inter-laboratory study, in which 32 human protein kinases were analyzed by AP-MS in parallel in two different laboratories. Samples and data were exchanged at all steps of the procedure to measure inter-laboratory reproducibility in biochemistry, AP-MS data acquisition and computation.
While the question of mass spectrometry data reproducibility has been addressed before [26], this is the first study in which the inter-laboratory reproducibility of the entire AP-MS workflow, from sample preparation to bioinformatic filtering, is inspected in a systematic fashion. Our analysis showed that careful execution of a robust suite of experimental protocols using the same biological system leads to very high reproducibility within an individual laboratory (98%) and surprisingly high overall inter-laboratory reproducibility (81%). We found that the biochemical component affected the inter-laboratory reproducibility slightly more than mass spectrometry; however, these discrepancies were effectively minimized by applying stringent bioinformatic filtering. The study also showed that merging the AP-MS datasets from the two laboratories improved the resolution of the protein interaction network without sacrificing confidence. This work demonstrates that large-scale analyses of the human proteome interactome by multi-laboratory efforts are technically feasible and likely to produce more accurate and complete results than single-laboratory projects.

4.3. Results

4.3.1. Cell line generation, sample preparation and MS analysis

For this study, we selected 32 kinases with average basal expression levels from kinases identified in a previous mass spectrometry survey as components of the human central proteome [27]. These kinases represent eight different kinase families (Fig. 4.1a) expressed in a wide range of human tissues [28,29], and differ substantially in domain composition as well as biological processes (Supplementary Table 1). They constitute a versatile set of well-studied to poorly studied bait proteins for benchmarking AP-MS protocol robustness. We tagged the kinase bait proteins with a Twin-Strep-tag® II and hemagglutinin (HA) epitope tag (SH), expressed them in isogenic HEK293 cell lines that allow controllable expression following tetracycline induction [13] and applied a double-affinity purification strategy. Cell lines expressing an SH-GFP fusion protein and the SH-tag alone were used as negative controls. To limit the number of factors that could affect the reproducibility of the experiments, the two laboratories followed identical protocols and committed to the same lot numbers of consumables and reagents (see Methods). The experimental workflow is described in Figure 4.1b. Both laboratories produced two biological replicates of each purification. Each sample was then analyzed twice by LC-MS/MS to identify the co-purified proteins: half of the sample was analyzed on-site and the other half was sent to the partner laboratory for cross-analysis (see Methods). In total, the dataset consisted of 256 and 16 LC-MS/MS analyses for the kinases and negative controls, respectively. All MS files are available at the PeptideAtlas raw data repository (PASS00117).
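The run counts quoted above follow directly from the crossed design (bait, purifying laboratory, biological replicate, analysing laboratory). A minimal Python sketch that enumerates this design is given below; the kinase identifiers are placeholders, and the tuple layout is an illustrative choice rather than the bookkeeping actually used in the study.

```python
from itertools import product
from typing import List, Tuple

LABS = ["CeMM", "ETHZ"]              # the two participating laboratories
BIO_REPLICATES = [1, 2]              # two biological replicates per laboratory
CONTROLS = ["SH-GFP", "SH-tag"]      # negative control cell lines
KINASES = [f"kinase_{i:02d}" for i in range(1, 33)]  # placeholder names for the 32 baits

Run = Tuple[str, str, int, str]      # (bait, purifying lab, replicate, analysing lab)


def lcms_runs(baits: List[str]) -> List[Run]:
    """Each purification is split in two: one half is measured on-site,
    the other half at the partner laboratory."""
    return list(product(baits, LABS, BIO_REPLICATES, LABS))


kinase_runs = lcms_runs(KINASES)
control_runs = lcms_runs(CONTROLS)
print(len(kinase_runs), len(control_runs))  # 256 16
```

Each bait is therefore covered by eight LC-MS/MS analyses, four of which are cross-analyses of exchanged samples.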

Figure 4.1: Experimental design. (a) Phylogenetic tree of human kinases showing the kinases selected for this study. The number of selected kinases per family is indicated in brackets. Human Kinome illustration reproduced courtesy of Cell Signaling Technology, Inc. (www.cellsignal.com). (b) AP-MS analysis workflow. Single cell lines expressing each tagged kinase were created and used by the two laboratories. Two double-step affinity purifications (AP, biological replicates indicated by numbers) were performed by each laboratory (upper bar). After tryptic digestion and preparation of the obtained peptide mixture for LC-MS, half of each sample was shipped to the partner laboratory (dotted arrows). LC-MS/MS analysis (middle bar) was conducted by each laboratory on its own samples and on those received from the partner laboratory. After protein database search on the obtained data, CeMM and ETHZ datasets were merged and a three-step bioinformatic filtering was applied (lower bar) to produce the final interactome.

4.3.2. Bioinformatic processing of the datasets

We submitted the generated MS data to an internal protein identification pipeline with a stringent false discovery rate (FDR, 1%) to obtain a reliable list of protein identifications for each LC-MS/MS experiment (Methods, Supplementary Table 2). We performed the filtering of contaminants separately for three datasets (Fig. 4.1b): CeMM-only (CeMM samples analyzed at CeMM), ETHZ-only (ETHZ samples analyzed at ETHZ), and MERGED (all MS analyses including the exchanged samples). Independent processing allowed (1) comparison of the results between the two laboratories, and (2) use of the MERGED dataset to obtain a combined interactome with the best possible coverage and accuracy while evaluating the benefit of additional replicates. The filtering strategy comprised three stages to eliminate the three categories of protein contamination in the AP-MS datasets (Fig. 4.1b and Supplementary Note 1): (1) potential carry-over proteins that remained even after the LC-MS wash and blank runs applied between sample runs (< 1% of identified proteins); (2) high-abundance proteins usually identified as non-specific binders in proteomic studies, which were removed unless detected with 8 times higher spectral counts compared to their average abundance in the negative controls (77 ± 13% of the proteins in each sample, see Supplementary Note 1 and Supplementary Table 3); and (3) low-abundance non-systematic contaminants, falling below the threshold of a weighted spectral count-based score (wSC) that favors more abundant and reproducible interactions. To provide an unbiased filtering for each dataset (CeMM-only, ETHZ-only, or MERGED), the threshold was set to retain 85% of the interactions present in public databases (Fig. 4.2). The last stage eliminated 50 ± 10% of the remaining interactions from the CeMM-only and ETHZ-only datasets. All filtered and unfiltered networks are provided in the Supplementary Data and Supplementary Table 4.
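The second and third filtering stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the wSC values are taken as given (their exact definition is in Supplementary Note 1 and is not reproduced here), the function names are hypothetical, and the threshold is chosen as the lowest score that still retains the requested fraction of the publicly known interactions.

```python
import math
from typing import Dict, Iterable, Tuple

Interaction = Tuple[str, str]  # (bait, prey)


def passes_control_filter(sample_count: float, mean_control_count: float,
                          fold: float = 8.0) -> bool:
    """Stage 2 (sketch): keep a frequent non-specific binder only if its spectral
    count exceeds `fold` times its average count in the negative controls."""
    return sample_count > fold * mean_control_count


def wsc_threshold(known_scores: Iterable[float], retain: float = 0.85) -> float:
    """Stage 3 (sketch): score cut-off that keeps at least `retain` of the
    interactions already present in public databases."""
    ranked = sorted(known_scores, reverse=True)
    if not ranked:
        raise ValueError("at least one known interaction is required")
    n_keep = max(1, math.ceil(retain * len(ranked)))
    return ranked[n_keep - 1]


def filter_by_wsc(scores: Dict[Interaction, float],
                  known: Iterable[Interaction]) -> Dict[Interaction, float]:
    """Apply the cut-off derived from the known interactions to all interactions."""
    cutoff = wsc_threshold(scores[i] for i in known if i in scores)
    return {i: s for i, s in scores.items() if s >= cutoff}


# Hypothetical toy data: 20 scored interactions, every second one is "known"
scores = {("K1", f"P{n}"): float(n) for n in range(1, 21)}
known = [("K1", f"P{n}") for n in range(2, 21, 2)]
filtered = filter_by_wsc(scores, known)
print(len(filtered))
```

Because the cut-off is derived separately for each dataset, the same interaction can pass in one dataset and fail in another, which is the situation visualized in Fig. 4.2.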

Figure 4.2: Cross-lab comparison of weighted spectral counts filtering. Each dot in the plot represents one protein interaction. X and Y axes show the weighted spectral count score (wSC) calculated using CeMM-only and ETHZ-only datasets respectively. The color of the dot specifies if the interaction was retained in CeMM-only, ETHZ-only or MERGED filtered datasets (see legend): in the upper-right corner there are interactions (green dots) that are retained in all 3 datasets, the lower-left corner contains interactions that are removed from both single-lab datasets. 30% of these interactions are preserved in the MERGED network (red dots), while the others were removed from all three datasets (black dots).

4.3.3. Reproducibility analysis

As a measure of experimental reproducibility we used the fraction of proteins identified in experiment A that were also present in experiment B, where A and B are either replicate experiments done at the same site or independent experiments done by different laboratories. Using this definition, we found the AP-MS experiments in the public databases to be 25% reproducible (Supplementary Note 2). We determined the overall intra-laboratory reproducibility in our study by comparing the biological replicates produced and analyzed at the same laboratory. Excluding the final filtering stage, the reproducibility was 81.3 ± 2.1%. For the fully filtered datasets, it increased to 97.5 ± 0.9%. For inter-laboratory reproducibility the corresponding numbers were 50% and 70%.
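The reproducibility measure defined above is a simple set overlap and can be written down directly. The following Python sketch uses hypothetical protein identifiers; note that the measure is asymmetric, so the value for A versus B generally differs from the value for B versus A.

```python
from typing import Iterable


def reproducibility(experiment_a: Iterable[str], experiment_b: Iterable[str]) -> float:
    """Fraction of proteins identified in experiment A that are also present in B."""
    a, b = set(experiment_a), set(experiment_b)
    if not a:
        raise ValueError("experiment A contains no identifications")
    return len(a & b) / len(a)


# Hypothetical prey lists from two replicate purifications of the same bait
replicate_1 = {"P1", "P2", "P3", "P4"}
replicate_2 = {"P1", "P2", "P3", "P5", "P6"}

print(reproducibility(replicate_1, replicate_2))  # 0.75
print(reproducibility(replicate_2, replicate_1))  # 0.6
```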

Besides estimating the overall reproducibility, our additional goal was to quantitate the two major sources of variation in AP-MS data: sample preparation and LC-MS/MS. Sample preparation (cell biology, biochemistry) induces fluctuations in the concentration of the proteins present in the samples. Subsequent LC-MS/MS analysis increases variability due to the stochasticity in the acquisition of fragment ion spectra of low-intensity peptide ions. The quantitation of both factors is challenging, because it requires exact knowledge of the sample contents after all the preparation procedures and before the submission to LC-MS/MS, which is not achievable by any current experimental method. Therefore, we used Bayesian statistical modeling (Supplementary Fig. 4.2) to reconstruct the contents of the samples by looking at the data derived from the eight replicate experiments per bait produced by the two laboratories (Fig. 4.1b and Supplementary Table 5). The analysis showed that within a laboratory the biochemical and MS components had approximately the same reproducibility, while biochemistry was 12% less reproducible than MS between the two laboratories (Fig. 4.3). To estimate the impact of the final filtering stage on reproducibility, we applied the Bayesian analysis to the fully filtered CeMM-only and ETHZ-only datasets. We found that the final stage minimized the laboratory-specific biases, which mostly comprised low-abundance interactions undetected by the second laboratory. It must be noted that independent filtering causes an apparent increase of inter-laboratory variability. When considering the MERGED filtered network, the reproducibilities of sample preparation and LC-MS/MS analyses were both estimated to be above 85% (87.7 ± 2.4% and 93.2 ± 1.5%, respectively) and the overall inter-laboratory reproducibility was estimated to be 81.1 ± 3.1%, whereas the CeMM-only and ETHZ-only filtered networks were 70% reproducible.
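As a rough plausibility check of these estimates, and not as a substitute for the Bayesian model, one can ask what overall reproducibility would be expected if sample preparation and LC-MS/MS acted as two independent steps in series. Independence is an assumption made here for illustration only; under it, the overall value is approximately the product of the two component values.

```python
# Component estimates for the MERGED filtered network, as quoted in the text
sample_preparation = 0.877
lc_msms = 0.932

# Overall inter-laboratory estimate and its standard deviation, as quoted in the text
reported_overall = 0.811
reported_sd = 0.031

# Assumption for this sketch: the two steps fail independently of each other
naive_overall = sample_preparation * lc_msms
within_one_sd = abs(naive_overall - reported_overall) <= reported_sd

print(f"naive product: {naive_overall:.3f}, within one SD of reported value: {within_one_sd}")
```

The naive product lies within one standard deviation of the reported overall estimate, which suggests that the component and overall figures are mutually consistent.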

Figure 4.3: Intra- and inter-laboratory AP-MS reproducibility. Average reproducibility after the first two filtering stages (grey bars) and for the fully filtered network (green bars) as estimated by Bayesian analysis. Reproducibility is shown for the sample preparation alone (blue border), the MS analysis alone (red border) and the whole pipeline (grey border). The intra-laboratory reproducibility (upper half) is defined as the fraction of interactions identified in one experiment (see Supplementary Table 4 for the number of biological and technical replicates) that were reproduced in the replicate experiment at the same laboratory. The inter-laboratory reproducibility (lower half) is the fraction of interactions identified by one laboratory that were reproduced in the experiment of the other laboratory. Error bars represent the standard deviation of the corresponding reproducibility estimates.

4.3.4. Final network

The final kinase interaction network (MERGED dataset) contained 377 unique proteins (including baits) and 488 interactions (Fig. 4.4a). The availability of both CeMM and ETHZ data and especially the analysis of the exchanged samples resulted in an improvement in the sensitivity of the bioinformatic filtering, as more data on interaction reproducibility were available. In particular, an additional dimension of inter-laboratory reproducibility allowed a more confident classification of the observed interactions. The effect of this enhanced resolution can be appreciated when considering the low-abundance interactions removed by the final filtering step from both single-laboratory datasets (Fig. 4.2). In the MERGED dataset, the reproducibility of approximately 30% of these interactions was supported by the data from the partner laboratory. These interactions were thus preserved in the final MERGED network, whereas the low-abundance interactions detected only in one dataset were excluded. In this high-confidence protein-protein interaction (PPI) network, 25% of the interactions were found in public databases and 75% are postulated as novel. The public databases also contained 1,075 interactions for the kinases analyzed that were not identified in our study. Interestingly, among the known interactions contained in our network, 42.7% were reported in at least two publications, whereas only 12.3% of the known interactions not present in our dataset showed such coverage. This 3.5-fold enrichment strongly indicates that rigorous and systematic AP-MS approaches may yield higher-confidence data than other sources of PPIs in the public domain. To further evaluate the specificity of the workflow, we assembled a set of potential non-specific interactions based on conflicting cellular localization of bait and prey proteins as reported by the UniProt database. Though certainly imperfect due to the incomplete knowledge of protein localization, the analysis indicated that the wSC cut-off threshold used for the final interaction network corresponds to a rough estimate of 11% false discovery rate and that merging of the datasets improves the performance of the filter (Supplementary Note 1, Supplementary Fig. 4.1). The bait kinases were not chosen to constitute a coherent biological process. Yet, due to the rather ubiquitous expression and abundance and the presence of paralogous baits, the interactomes of the majority of these kinases form a network through some common interactors (Fig. 4.4a). Among these are the known kinase chaperones HSP90/CDC37, interacting with nine of the bait kinases. Seven of these kinases are also present in a recently published list of proteins down-regulated upon HSP90 inhibition.
Three of these (CDK4, CDK9 and PRKAA1) are described as HSP90 core interactors, thus supporting our observations [30]. Two examples from the final interactome illustrate the level of data detail achievable by the workflow described in this study, and show in detail public data coverage and inferred correlation with known protein complex stoichiometry (Fig. 4.4b,c). The full high-confidence dataset on the 32 kinase interactomes is available at the IntAct database (IM-17578). Providing 364 novel interactions within the kinase complexes, this dataset is expected to represent a highly valuable resource for cell and structural biologists.

Figure 4.4: The kinome interactome. (a) Network of confident interactions (solid green lines) for the MERGED dataset. The 32 bait kinases and prey interactors are represented by orange squares and blue circles, respectively. Thick lines correspond to interactions that are already present in the public databases. Black dotted lines indicate known interactions between the prey proteins. (b) Subnetwork showing the interactions of STK24, STK25 and MST4 kinases. The core members of the STRIPAK complex are in the middle and circled. Green edges correspond to interactions retained by all three datasets after filtering. Blue and yellow edges correspond to interactions retained by the MERGED and CeMM-only datasets, or MERGED and ETHZ-only datasets, respectively. Interactions retained only in the MERGED dataset are shown in red. (c) Subnetwork showing the ILK interactome. The color annotation is the same as in (b).
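The edge color scheme of the subnetwork panels can be expressed as a small lookup on the three retention flags. The Python sketch below is illustrative only; the function name is hypothetical, and returning `None` for interactions that are absent from the MERGED network is an assumption, since the caption only describes edges that are drawn.

```python
from typing import Optional


def edge_color(in_cemm_only: bool, in_ethz_only: bool, in_merged: bool) -> Optional[str]:
    """Map retention in the three filtered datasets to the edge color of the caption."""
    if not in_merged:
        return None          # assumption: such edges are not part of the final network
    if in_cemm_only and in_ethz_only:
        return "green"       # retained by all three datasets
    if in_cemm_only:
        return "blue"        # MERGED and CeMM-only
    if in_ethz_only:
        return "yellow"      # MERGED and ETHZ-only
    return "red"             # retained only after merging


print(edge_color(True, True, True))     # green
print(edge_color(False, False, True))   # red
```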

4.3.5. A STE20 kinase family network

STK24, STK25 and MST4 kinases are members of the STE20 family. These kinases were described as part of the STRIPAK (striatin-interacting phosphatase and kinase) protein complex, originally identified in another AP-MS study and found to be involved in Golgi polarization [31-33]. In our analysis, we found all three kinases stably interacting with each other and with seven core and accessory proteins of the STRIPAK complex (Fig. 4.4b). These observations demonstrate the high reproducibility of our workflow with respect to abundant interactors (> 10 spectral counts). At the same time, this example showed that merging of the two datasets proved essential in preserving two classes of low-abundance interactions: those retained in either the CeMM-only or the ETHZ-only dataset, and those absent from both single-laboratory datasets despite being known or very plausible (Fig. 4.4b). The first class was represented by the interactions with TAOK1, a MAP3K family member regulating DNA damage response and microtubule dynamics [34,35]. Supportive of the complementarity of single-laboratory datasets, the TAOK1 interaction with STK25 was retained in the CeMM-only dataset, while the interaction with MST4 was retained in the ETHZ-only network. The second class contained the interactions between MST4 and PPP2CB, DYNLL1 (already found associated with the STRIPAK complex [31]) and GOLGA2, a Golgi-localized protein also involved in the maintenance of Golgi polarized organization and already found in complexes with STK25 [32,36].

4.3.6. ILK network

Integrin-linked kinase (ILK) is an evolutionarily conserved protein kinase that transduces signals from β-integrin subunits while acting as a scaffold at focal adhesions [37]. Most of the abundant interactors of ILK identified in both laboratories were strong known binders (Fig. 4.4c). These included the ILK-PINCH-Parvin (IPP) complex components LIMS1 and LIMS2 (PINCH proteins) and the alpha- and beta-parvins (PARVA and PARVB) [38]. Another abundant interactor was RSU1 (RAS suppressor protein 1), a protein known to bind to LIMS1 in a RAS-dependent manner [39]. In addition, the ILK network contained multiple as yet uncharacterized interactions. Many of the low-abundance interactions (spectral counts ≤ 5) illustrate once again the value of combining datasets across laboratories, since they were observed by each laboratory independently but only passed the stringent validation criteria in the MERGED analysis. RAF1, for example, has been shown to interact with RSU1 in vitro [40], but had not been reported in vivo and might represent an important link between the IPP complex and the RAS-mediated regulation of cell proliferation. The presence of CDIPT (phosphatidylinositol synthase) is also of particular interest, considering the reported role of the phosphatidylinositol 3-phosphate kinase in the integrin-mediated activation of ILK [41].

4.4. Discussion

We evaluated the robustness of an integrated protocol to chart human protein complexes by AP-MS and to establish one possible road-map to the future proteome-wide characterization of human protein complexes. We designed and executed a parallel two-laboratory effort over a period of three years, during which experimental parameters were first studied and defined, including exchange protocols for samples, appropriate reagents were secured and shared, and scientists were trained at the alternate sites. The project was performed on identical mass spectrometry instrumentation, albeit with different liquid chromatography systems and maintaining most of the instrument settings typical of the individual laboratories, to mimic the realistic situation where individual mass spectrometry instruments cannot be dedicated to a single project only. We found that by following the same standardized protocol laboratory-specific biases were minimized and the data generated was highly reproducible between laboratories (81%). Considering the diversity in the biophysical and biological properties of the selected kinases (Supplementary Table 1), this protocol can be recommended to study the majority of proteins, with slight adaptation for particular protein categories (multi-pass membrane, mitochondrial or nuclear proteins etc.) which may profit from specific adjustments (e.g. different detergents, fractionation). Bait overexpression levels should also be controlled: low-expressing baits may require a higher amount of protein extract, which increases the contaminant fraction; strong bait overexpression might lead to artifacts (misfolding, spurious interactions etc.). However, while benchmarking should be done in these cases, high reproducibility is expected as most of the workflow would remain unchanged. This cross-analysis study generated the largest AP-MS dataset on the human kinome published to date.
Although a complete overview of the biological implications underlying all the new findings goes beyond the focus of this report, three examples of new biological insights suggested by the dataset may help appreciate the importance of the information obtained: 1) The RPS6KA1 (Ribosomal protein S6 kinase alpha-1) complex, known to regulate growth and translation, contains the two proteins TLK2 (Tousled-like kinase 2) and MASTL (Microtubule-associated serine/threonine-protein kinase-like) that explain the suspected link to mitotic entry, anaphase and cytokinesis in human cells [42,43]. 2) The EEF2K (Eukaryotic elongation factor 2 kinase) complex includes WDFY3 (WD repeat and FYVE domain containing 3), suggesting an exciting link between protein synthesis and autophagy [44]. 3) The MAPK14 complex includes the new interactors CCDC8, CUL7 and OBSL1, all three found mutated in the 3-M syndrome, an autosomal recessive disorder characterized by severe pre- and post-natal growth retardation, providing a formidable link between MAPK14 and tissue homeostasis [45]. Furthermore, we confirmed the advantages of sharing AP-MS data for generating high-quality protein interaction networks. As already emphasized by others, discrimination between true interactors and contaminants becomes more difficult as the detection signal becomes weaker [46]; therefore, multiple purifications, ideally across laboratories, are required to identify true interactors. The integration of heterogeneous datasets utilizing different AP-MS protocols is challenging [47]. A very useful initiative with an immediate benefit would be a common repository for profiling background levels of proteins in various AP-MS systems. Taking into account the 1,075 previously identified interactions not identified in either laboratory involved here, the 124 confirmed and 364 novel ones, and using the estimated false discovery rate of 11%, our AP-MS workflow provides for 10% sensitivity in reproducing the current knowledge of protein interactions in human cells. This should be regarded as a lower-bound estimate that includes interactions not observable in our experimental set-up (e.g.
other cell types or growth conditions). Small- scale experiments using poorly-characterized antibodies or error-prone techniques might also contribute to this estimate, as well as larger proteomic datasets utilizing less stringent thresholds for the removal of non-specific binders. Considering only PPIs reported by multiple publications results in a workflow sensitivity of 79%. The known PPIs supported by multiple experiments represent 42.7% of all public interactions covered by our analysis, whereas only 12.7% of the 1,075 public interactions not observed here are reported in more than one study. This indicates that the data obtained in this


study is of high quality and that a weak correlation exists between the number of trustworthy interactions and the number of publications (Fig. 4.5). In summary, our study showed the advantages of collaborative and systematic AP-MS and bioinformatics efforts in overcoming several of the shortcomings associated with the one-off use of the approach. We are now able to propose protocols and procedures that enable a highly reliable characterization of protein complexes from human cells. The comparative study across laboratories constituted a logistical challenge that exceeded our initial estimates but yielded results that surpassed our expectations in terms of overall reproducibility. The results bear the important implication that, given the choice, the same number of experiments distributed over two laboratories results in a more complete and relevant dataset than that performed at a single site only. The recommendations resulting from this study (Table 4.1) should empower the community with an experimental roadmap for the complete characterization of protein complexes in human cells under controlled conditions. The high reproducibility provided by this approach is an important step forward for successful integration of AP-MS data with other types of biological information.
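As a rough check of the 29% sensitivity figure quoted earlier in this discussion, the sketch below reproduces it under one assumed reading of the text: the novel interactions are discounted by the 11% false discovery rate and then counted as true interactions, both among those detected and among all true interactions. The formula is not stated in the text, so this reading and the function name are assumptions made for illustration only.

```python
def workflow_sensitivity(missed_known, confirmed_known, novel, fdr):
    """Estimate sensitivity as detected true interactions / all true interactions.

    Assumption (not stated in the text): novel interactions count as true
    after discounting the expected fraction of false discoveries.
    """
    true_novel = novel * (1.0 - fdr)          # expected true novel interactions
    detected = confirmed_known + true_novel   # true interactions that were observed
    total = detected + missed_known           # plus known interactions not observed
    return detected / total


sensitivity = workflow_sensitivity(missed_known=1075, confirmed_known=124,
                                   novel=364, fdr=0.11)
print(f"{sensitivity:.0%}")  # prints 29%
```

The same reading does not necessarily apply to the 79% figure for interactions supported by multiple publications, which is therefore not recomputed here.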

Figure 4.5: Statistics for the number of kinase interactors. Comparison between the number of kinase interactions available in public databases (blue area) and those contained in the MERGED dataset (red lines). The kinases are ranked by the number of interactions annotated in public databases.


Table 4.1: Recommendations for multi-laboratory large-scale proteomic studies

Workflow component: Recommendations

BIOLOGY
• Cellular system allowing control over bait expression levels
• Single standard for protocols/guidelines
• Centralized technical training of all the participants of the collaboration
• Same reagents/consumables (lot #)
• Replicates: ≥2. Distribute the generation of biological replicates across multiple centers

MASS SPECTROMETRY
• Instruments with the same sensitivity range
• Synchronization of the instrument settings (as much as the set-up allows)
• Maintain the same parameters throughout the whole study
• Sample exchange (peptides)
• Replicates: ≥2

BIOINFORMATICS
• Raw MS data common repository
• Sensitive collection of negative control experiments
• Bioinformatics filtering addressing the removal of carried-over proteins, systematic and non-systematic non-specific binders


4.5. Online Methods

Materials

Ammonium bicarbonate (NH4HCO3), glycine (NH2CH2COOH), iodoacetamide, dithiothreitol, propan-2-ol (LC-MS grade), protease inhibitor cocktail for mammalian cell lysate P8340, anti-HA agarose, doxycycline, HA7 antibody, foetal bovine serum (Sigma-Aldrich); Strep-Tactin sepharose (IBA TAGnologies); D-biotin (Alfa Aesar); Bio-Spin chromatography columns (Bio-Rad); Gateway BP and LR Clonase™ II Enzyme Mix Kit, lipofectamine 2000, pDONR201, pcDNA6/TR, pOG44, DMEM, trypsin (Invitrogen); Fugene6 (Roche Applied Science); UltraMicroSpin columns 3-30 µg capacity (The Nest Group, Inc.); HPLC amber vials and screw caps (Agilent Technologies); trifluoroacetic acid (TFA), formic acid (HCOOH) (Merck); acetonitrile (CH3CN, LC-MS grade), methanol (CH3OH, LC-MS grade) (Fisher Scientific).

Generation of inducible stable cell lines and cell culture

Human full-length kinase cDNAs were cloned using BP recombination into the Gateway-compatible pDONR221 vector (Invitrogen)48. To generate the corresponding expression vectors for tetracycline-controlled expression of the SH-tagged version of the kinases, an LR recombination was performed between the entry vector and the in-house-designed destination vectors. The N-terminal tagging vector pcDNA5/FRT/TO/SH/GW13 was used for all the baits except PRKCD, for which the C-terminal tagging vector pcDNA5/FRT/TO/cSH/GW (Varjosalo et al., manuscript in preparation) was used. To generate the stable cell lines inducibly expressing the SH-tagged versions of the kinases, Flp-In™ T-REx™ HEK293 cells (cultured in DMEM (4.5 g/L glucose, 2 mM L-glutamine) supplemented with 10% FCS, 50 mg/mL penicillin, 50 mg/mL streptomycin, 100 µg/mL Zeocin™ and 15 µg/mL Blasticidin) were co-transfected with the expression vector and the pOG44 vector (Invitrogen) using the Fugene6 transfection reagent (Roche). Two days after transfection, cells were selected in DMEM supplemented with 10% FCS, 50 mg/mL penicillin, 50 mg/mL streptomycin and hygromycin (100 µg/mL) for 2-3 weeks, then the positive clones were pooled and amplified. After induction with 1 µg/mL doxycycline (Sigma) for 24 h, the inducible expression of the constructs was verified by western blot. For selected baits, the expression levels were verified to be within the range of endogenous expression levels.


To minimize experimental inter-laboratory variation, the same lots of the cell culture-related materials (DMEM, FCS, antibiotics, trypsin, dishes) and protein purification materials (Strep-Tactin and anti-HA beads, D-biotin, Bio-Spin columns) were used. For each pulldown, a cell pellet deriving from 5 × 15 cm fully confluent dishes (approximately 5 × 10⁷ cells) was lysed for 10 min on ice in 5 mL HNN lysis buffer (50 mM HEPES pH 8.0, 150 mM NaCl, 5 mM EDTA, 0.5% NP-40, 50 mM NaF, 1.5 mM Na3VO4, 1.0 mM PMSF and 10 µL/mL protease inhibitor cocktail) supplemented with 1 µg/mL avidin (added only for the lysis step). Insoluble material was removed by centrifugation at 13,000 r.p.m. for 20 min at 4°C. 200 µL Strep-Tactin® sepharose beads were transferred to a Bio-Spin chromatography column and washed with 2 × 1 mL HNN-HS buffer, then the lysate was added and gravity drained. The beads were washed with 3 × 1 mL HNN-HS buffer, and bound proteins eluted with 3 × 300 µL freshly prepared 2.5 mM D-biotin in HNN-HS buffer into a fresh 1.5 mL Eppendorf tube. 100 µL anti-HA agarose beads were washed with 1 × 1 mL HNN-HS buffer and centrifuged at 1,000 r.p.m. for 1 min at 4°C. The supernatant was removed and the beads resuspended in 100 µL HNN-HS buffer. The anti-HA agarose beads were added to the biotin eluate and rotated for 1 h at 4°C, then the suspension was loaded into a fresh Bio-Spin chromatography column and gravity drained. The anti-HA agarose was washed with 3 × 1 mL HNN-HS buffer and then with 2 × 1 mL HNN-HS buffer without detergent and inhibitors. Retained proteins were eluted from the column with 500 µL 200 mM NH2CH2COOH, pH 2.5 directly into a glass HPLC vial and immediately neutralized with 160 µL 1 M NH4HCO3, pH 8.8. The remaining sample was frozen at -20°C until further processing. All affinity purifications were performed as biological replicates per site (n = 4) to give a total of n = 8 LC-MS analyses per SH-tagged protein.

Solution tryptic digestion and sample preparation for LC-MS

DTT was added to the eluates to a final concentration of 10 mM and the sample incubated for 1 h at 56°C. To block cysteine residues, iodoacetamide was added to a final concentration of 55 mM and the samples incubated at room temperature in the dark for 30 min. 1 µg trypsin was added and the samples were incubated overnight at 37°C. Tryptic digests were quenched with 10% TFA, concentrated and purified by solid phase extraction (SPE), and eluted with 90% CH3CN, 0.1% TFA. The volume of the eluted sample


was reduced to approximately 2 µL in a vacuum centrifuge and reconstituted to a final volume of 30 µL with 0.1% TFA, 1% CH3CN and vortexed thoroughly. One half of the sample was aliquoted into a separate vial and transported to either the Zurich or Vienna site on dry ice. At CeMM, 2.5% of the final volume was further diluted to a volume of 8 µL before injection.

Liquid chromatography mass spectrometry

At CeMM, mass spectrometry was performed on a hybrid LTQ Orbitrap XL mass spectrometer (ThermoFisher Scientific) using the Xcalibur version 2.0.7 coupled to an Agilent 1200 HPLC nanoflow system (dual pump system with one precolumn and one analytical column) (Agilent Biotechnologies) via a nanoelectrospray ion source using liquid junction (Proxeon). Solvents for LC-MS separation of the digested samples were as follows: solvent A consisted of 0.4% formic acid in water and solvent B consisted of 0.4% formic acid in 70% methanol and 20% isopropanol. From a thermostatted microautosampler, 8 µL of the tryptic peptide mixture (corresponding to 2.5% of the final SH-TAP eluate) were automatically loaded onto a trap column (Zorbax 300SB-C18 5 µm, 5 × 0.3 mm, Agilent Biotechnologies) with a binary pump at a flow rate of 45 µL/min. 0.1% TFA was used for loading and washing the precolumn. After washing, the peptides were eluted by back-flushing onto a 16 cm fused silica analytical column with an inner diameter of 50 µm packed with C18 reversed phase material (ReproSil-Pur 120 C18-AQ, 3 µm, Dr. Maisch GmbH). The peptides were eluted from the analytical column with a 27 minute gradient ranging from 3 to 30% solvent B, followed by a 25 minute gradient from 30 to 70% solvent B and, finally, a 7 minute gradient from 70 to 100% solvent B at a constant flow rate of 100 nL/min. The analyses were performed in a data-dependent acquisition mode using a top 6 collision-induced dissociation (CID) method. Dynamic exclusion for selected ions was 60 seconds. No lock masses were employed. Maximal ion accumulation time allowed on the LTQ Orbitrap in CID mode was 150 ms for MSn in the LTQ and 1,000 ms in the C-trap. Automatic gain control was used to prevent overfilling of the ion traps and was set to 5,000 (CID) in MSn mode for the LTQ, and 10⁶ ions for a full FTMS scan. Intact peptides were detected in the Orbitrap at 100,000 resolution. LC-MS sequence blocks were designed as follows: 10 fmol of a tryptic digest of bovine serum albumin (BSA) was first injected to ensure that both the HPLC and MS were performing to the highest


standards. Metrics such as peak intensity, full-width-at-half-maximum (FWHM) peak height, and elution retention time (Tr) of specific tryptic peptides were monitored and recorded, as was the overall protein sequence coverage (≥60% must be achieved). For each bait protein, the BSA analysis was followed by: (i) biological replicate 1 from CeMM; (ii) biological replicate 2 from CeMM; (iii) biological replicate 1 from ETHZ; and (iv) biological replicate 2 from ETHZ. Following analysis of the affinity-purified samples, a minimum of three blank and wash LC-MS analyses were performed to remove and minimize any protein contaminants from the previous block of samples prior to analyzing the subsequent block. From the Mascot output, the final blank analysis was manually assessed for the number of contaminating peptides and the information recorded. At ETHZ, mass spectrometry was performed on a hybrid LTQ Orbitrap XL mass spectrometer (ThermoFisher Scientific) using the Xcalibur version 2.0.7 coupled to an Eksigent NanoLC-2D HPLC nanoflow system (dual pump system with one analytical column) (Eksigent) via a nanoelectrospray ion source using a liquid junction (ThermoFisher Scientific). Solvents for LC-MS separation of the digested samples were as follows: solvent A consisted of 0.1% formic acid in water (98%) and acetonitrile (2%) and solvent B consisted of 0.1% formic acid in acetonitrile (98%) and water (2%). From a thermostatted microautosampler, 2 µL of the tryptic peptide mixture (corresponding to 7% of the final SH-TAP eluate) were automatically loaded onto a 15 cm fused silica analytical column (PicoFrit, New Objective) with an inner diameter of 75 µm packed with C18 reversed phase material (ReproSil-Pur 120 C18-AQ, 3 µm, Dr. Maisch GmbH) and the peptides were eluted from the analytical column with a 40 minute gradient ranging from 5 to 35% solvent B, followed by a 10 minute gradient from 35 to 80% solvent B at a constant flow rate of 300 nL/min. 
The analyses were performed in a data-dependent acquisition mode using a top 3 collision-induced dissociation (CID) method. Dynamic exclusion for selected ions was 30 seconds. No lock masses were employed. Maximal ion accumulation time allowed on the LTQ Orbitrap in CID mode was 200 ms for MSn in the LTQ and 300 ms in the C-trap. Automatic gain control was used to prevent overfilling of the ion traps and was set to 30,000 (CID) in MSn mode for the LTQ, and 500,000 ions for a full FTMS scan. Intact peptides were detected in the Orbitrap at 60,000 resolution. LC-MS sequence blocks were designed as follows: 500 ng of tryptic digest from Saccharomyces cerevisiae and 50 fmol of Glu-Fib peptide standard were first injected to


ensure that both the HPLC and MS were performing to the highest standards. Metrics such as peak intensity, full-width-at-half-maximum (FWHM) peak height, and elution retention time (Tr) of specific tryptic peptides (for Glu-Fib and yeast digest) were monitored and recorded, as was the overall protein sequence coverage and the number of identified proteins (for yeast digest). For each bait protein, the analysis sequence was as follows: (i) biological replicate 1; (ii) high ACN wash run; (iii) Glu-Fib standard; and (iv) biological replicate 2. The high ACN wash and the Glu-Fib standard peptide LC-MS analyses were performed to remove and minimize any protein contaminants from the previous sample prior to analyzing the next sample. From the Mascot output, the Glu-Fib standard peptide analysis was manually assessed for the number of possible contaminating peptides and the information recorded.
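The two elution gradients described above differ in both length and composition. As an illustration only, they can be written as piecewise-linear segments and compared directly. The segment values are taken from the text; the data layout, the function names, and the choice to hold the final composition after the last segment are assumptions of this sketch.

```python
# Each segment is (duration in minutes, start %B, end %B), as given in the text.
CEMM_GRADIENT = [(27, 3, 30), (25, 30, 70), (7, 70, 100)]
ETHZ_GRADIENT = [(40, 5, 35), (10, 35, 80)]


def percent_b(gradient, t_min):
    """Return the solvent B percentage at time t_min by linear interpolation."""
    elapsed = 0.0
    for duration, start, end in gradient:
        if t_min <= elapsed + duration:
            fraction = (t_min - elapsed) / duration
            return start + fraction * (end - start)
        elapsed += duration
    # Assumption: after the last segment the final composition is held.
    return gradient[-1][2]


def total_minutes(gradient):
    """Total programmed gradient time in minutes."""
    return sum(duration for duration, _, _ in gradient)


print(total_minutes(CEMM_GRADIENT), total_minutes(ETHZ_GRADIENT))  # prints 59 50
```

Note that solvent B itself differs between the sites (methanol/isopropanol at CeMM, acetonitrile at ETHZ), so equal %B values do not imply equal elution strength.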

Protein identification

Peak extraction and conversion of RAW files into the MGF format for subsequent protein identification were achieved by a combination of extract_msn (part of XCalibur suite) and ReAdW (part of TPP) scripts. An initial database search was performed with broader mass tolerance to re-calibrate the mass lists for optimal final protein identification. For the initial protein database search, Mascot (www.matrixscience.com, version 2.3.02) was used. Error tolerances on the precursor and fragment ions were ±10 ppm and ±0.6 Da, respectively, and the database search was limited to fully tryptic peptides with a maximum of 1 missed cleavage, with carbamidomethyl cysteine and methionine oxidation set as fixed and variable modifications, respectively. The Mascot peptide ion score threshold was set to 30, and at least 3 peptide identifications per protein were required. Searches were performed against the human part of the UniProtKB/SwissProt database (www.uniprot.org version 2010-09) including all protein isoforms. The initial peptide identifications were used to deduce independent linear transformations for precursor and fragment masses that would minimize the mean square deviation of measured masses from theoretical. Re-calibrated mass list files were searched against the same human protein database by a combination of Mascot and Phenyx (GeneBio, SA, version 2.5.14) search engines. The results of the two algorithms were merged, requiring at least 2 unique peptides with a score above threshold of either search engine. Single peptide identifications were accepted if the score was above a more stringent threshold (see Supplementary Table


2). Spectra with conflicting peptide identifications were excluded from the combined result. For each laboratory, the peptide score thresholds were individually set to achieve a 1% FDR at the level of the protein groups when searched against the reversed sequence database (see Supplementary Table 2). Mass tolerance was also adjusted for each laboratory (see Supplementary Table 2) and all the other search parameters were shared by the two laboratories. For protein grouping based on shared peptides, the identified proteins from all replicate LC-MS analyses were combined and the proteins without protein-specific peptides were discarded. From the LC-MS analyses of the affinity-purified baits, protein groups with multiple reporter proteins were also removed. For the negative control experiments, these protein groups were preserved to enhance the sensitivity of the subsequent negative control filtering. For dataset comparison and bait-prey interaction network construction, information on isoforms for the identified proteins was discarded.
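The mass re-calibration step described above (a linear transformation that minimizes the mean square deviation of measured from theoretical masses) corresponds to an ordinary least-squares line fit. The sketch below shows one such fit in plain Python. The function names and the toy masses are hypothetical, and the procedure actually used in the study may differ in detail; in the workflow described, a fit of this kind would be made separately for precursor and fragment masses.

```python
def fit_linear_recalibration(measured, theoretical):
    """Least-squares fit of theoretical ~ a * measured + b."""
    n = len(measured)
    mean_x = sum(measured) / n
    mean_y = sum(theoretical) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(measured, theoretical))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def recalibrate(masses, a, b):
    """Apply the fitted linear transformation to a list of masses."""
    return [a * m + b for m in masses]


# Toy example: measured masses carry a systematic +5 ppm error.
theoretical = [500.0, 1000.0, 1500.0, 2000.0]
measured = [t * (1 + 5e-6) for t in theoretical]
a, b = fit_linear_recalibration(measured, theoretical)
corrected = recalibrate(measured, a, b)
```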

Mapping interactions from public databases

The internal protein-protein interaction database from CeMM, which integrates human protein interactions from the IntAct, BioGrid, MINT, HPRD, and InnateDB databases and synchronizes protein accession codes with recent UniProt database releases, was used to map observed bait-prey interactions onto known protein-protein interactions from public databases. For the comparison of our dataset with the public interactions of bait proteins, only interactions annotated as physical associations (MI:0915) were used. Interactions identified by genetic interference, by interaction prediction, or by methods of unspecified origin (MI:0686, MI:0492, MI:0493, MI:0254, MI:0362, MI:0063) were excluded for lack of evidence, and the two-hybrid method (MI:0018) was excluded to avoid bias towards direct interactions during wSC filtering.
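The selection of public interactions described above amounts to a filter on PSI-MI annotation codes. A minimal sketch follows, assuming each public interaction is available as a record with an interaction type and a detection method. The record layout, field names and example entries are hypothetical, and exact string matching ignores the hierarchy of the PSI-MI vocabulary, which a production implementation would need to resolve.

```python
PHYSICAL_ASSOCIATION = "MI:0915"

# Detection methods excluded in the text: genetic interference, prediction,
# unspecified origin, and two hybrid (MI:0018).
EXCLUDED_METHODS = {
    "MI:0686", "MI:0492", "MI:0493", "MI:0254", "MI:0362", "MI:0063",
    "MI:0018",
}


def keep_interaction(record):
    """Return True if a public interaction record passes the filter."""
    return (record["interaction_type"] == PHYSICAL_ASSOCIATION
            and record["detection_method"] not in EXCLUDED_METHODS)


# Hypothetical records for illustration only.
records = [
    {"bait": "BAIT1", "prey": "PREY1",
     "interaction_type": "MI:0915", "detection_method": "MI:0676"},
    {"bait": "BAIT1", "prey": "PREY2",
     "interaction_type": "MI:0915", "detection_method": "MI:0018"},
    {"bait": "BAIT1", "prey": "PREY3",
     "interaction_type": "MI:0914", "detection_method": "MI:0007"},
]
kept = [r["prey"] for r in records if keep_interaction(r)]
```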


4.6. References

1 Krogan, N. J. et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440, 637-643 (2006).
2 Zhao, R. et al. Navigating the chaperone network: an integrative map of physical and genetic interactions mediated by the hsp90 chaperone. Cell 120, 715-727 (2005).
3 Glatter, T. et al. Modularity and hormone sensitivity of the Drosophila melanogaster insulin receptor/target of rapamycin interaction proteome. Mol Syst Biol 7, 547 (2011).
4 Butland, G. et al. Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 433, 531-537 (2005).
5 Ho, Y. et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180-183 (2002).
6 Gavin, A. C. et al. Proteome survey reveals modularity of the yeast cell machinery. Nature 440, 631-636 (2006).
7 Guruharsha, K. G. et al. A protein complex network of Drosophila melanogaster. Cell 147, 690-703 (2011).
8 Rigaut, G. et al. A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol 17, 1030-1032 (1999).
9 Sowa, M. E., Bennett, E. J., Gygi, S. P. & Harper, J. W. Defining the human deubiquitinating enzyme interaction landscape. Cell 138, 389-403 (2009).
10 Sardiu, M. E. et al. Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proc Natl Acad Sci U S A 105, 1454-1459 (2008).
11 Stukalov, A., Superti-Furga, G. & Colinge, J. Deconvolution of Targeted Protein-Protein Interaction Maps. J Proteome Res (2012).
12 Burckstummer, T. et al. An efficient tandem affinity purification procedure for interaction proteomics in mammalian cells. Nat Methods 3, 1013-1019 (2006).
13 Glatter, T., Wepf, A., Aebersold, R. & Gstaiger, M. An integrated workflow for charting the human interaction proteome: insights into the PP2A system. Mol Syst Biol 5, 237 (2009).
14 Scigelova, M., Hornshaw, M., Giannakopulos, A. & Makarov, A. Fourier transform mass spectrometry. Mol Cell Proteomics 10, M111 009431 (2011).
15 Choi, H. et al. SAINT: probabilistic scoring of affinity purification-mass spectrometry data. Nat Methods 8, 70-73 (2011).
16 Lavallee-Adam, M., Cloutier, P., Coulombe, B. & Blanchette, M. Modeling contaminants in AP-MS/MS experiments. J Proteome Res 10, 886-895 (2011).
17 Brehme, M. et al. Charting the molecular network of the drug target Bcr-Abl. Proc Natl Acad Sci U S A 106, 7414-7419 (2009).
18 Bouwmeester, T. et al. A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 6, 97-105 (2004).
19 Fang, L. et al. Characterization of the human COP9 signalosome complex using affinity purification and mass spectrometry. J Proteome Res 7, 4914-4925 (2008).
20 Behrends, C., Sowa, M. E., Gygi, S. P. & Harper, J. W. Network organization of the human autophagy system. Nature 466, 68-76 (2010).
21 Stumpf, M. P. et al. Estimating the size of the human interactome. Proc Natl Acad Sci U S A 105, 6959-6964 (2008).


22 Fortney, K. & Jurisica, I. Integrative computational biology for cancer research. Hum Genet 130, 465-481 (2011).
23 Venkatesan, K. et al. An empirical framework for binary interactome mapping. Nat Methods 6, 83-90 (2009).
24 Braun, P. et al. An experimentally derived confidence score for binary protein-protein interactions. Nat Methods 6, 91-97 (2009).
25 Cusick, M. E. et al. Literature-curated protein interaction datasets. Nat Methods 6, 39-46 (2009).
26 Bell, A. W. et al. A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat Methods 6, 423-430 (2009).
27 Burkard, T. R. et al. Initial characterization of the human central proteome. BMC Syst Biol 5, 17 (2011).
28 Geiger, T., Wehner, A., Schaab, C., Cox, J. & Mann, M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol Cell Proteomics 11, M111 014050 (2012).
29 Beck, M. et al. The quantitative proteome of a human cell line. Mol Syst Biol 7, 549 (2011).
30 Wu, Z., Moghaddas Gholami, A. & Kuster, B. Systematic Identification of the HSP90 Candidate Regulated Proteome. Mol Cell Proteomics 11, M111 016675 (2012).
31 Goudreault, M. et al. A PP2A phosphatase high density interaction network identifies a novel striatin-interacting phosphatase and kinase complex linked to the cerebral cavernous malformation 3 (CCM3) protein. Mol Cell Proteomics 8, 157-171 (2009).
32 Kean, M. J. et al. Structure-function analysis of core STRIPAK Proteins: a signaling complex implicated in Golgi polarization. J Biol Chem 286, 25065-25075 (2011).
33 Herzog, F. et al. Structural probing of a protein phosphatase 2A network by chemical cross-linking and mass spectrometry. Science 337, 1348-1352 (2012).
34 Draviam, V. M. et al. A functional genomic screen identifies a role for TAO1 kinase in spindle-checkpoint signalling. Nat Cell Biol 9, 556-564 (2007).
35 Fidalgo, M. et al. The adaptor protein cerebral cavernous malformation 3 (CCM3) mediates phosphorylation of the cytoskeletal proteins Ezrin/Radixin/Moesin by Mammalian Ste20-4 to protect cells from oxidative stress. J Biol Chem (2012).
36 Preisinger, C. et al. YSK1 is activated by the Golgi matrix protein GM130 and plays a role in cell migration through its substrate 14-3-3zeta. J Cell Biol 164, 1009-1020 (2004).
37 Hannigan, G. E. et al. Regulation of cell adhesion and anchorage-dependent growth by a new beta 1-integrin-linked protein kinase. Nature 379, 91-96 (1996).
38 Tu, Y., Huang, Y., Zhang, Y., Hua, Y. & Wu, C. A new focal adhesion protein that interacts with integrin-linked kinase and regulates cell adhesion and spreading. J Cell Biol 153, 585-598 (2001).
39 Dougherty, G. W., Jose, C., Gimona, M. & Cutler, M. L. The Rsu-1-PINCH1-ILK complex is regulated by Ras activation in tumor cells. Eur J Cell Biol 87, 721-734 (2008).
40 Masuelli, L. & Cutler, M. L. Increased expression of the Ras suppressor Rsu-1 enhances Erk-2 activation and inhibits Jun kinase activation. Mol Cell Biol 16, 5466-5476 (1996).
41 Delcommenne, M. et al. Phosphoinositide-3-OH kinase-dependent regulation of glycogen synthase kinase 3 and protein kinase B/AKT by the integrin-linked kinase. Proc Natl Acad Sci U S A 95, 11211-11216 (1998).


42 Li, Z., Gourguechon, S. & Wang, C. C. Tousled-like kinase in a microbial eukaryote regulates spindle assembly and S-phase progression by interacting with Aurora kinase and chromatin assembly factors. J Cell Sci 120, 3883-3894 (2007).
43 Mochida, S., Maslen, S. L., Skehel, M. & Hunt, T. Greatwall phosphorylates an inhibitor of protein phosphatase 2A that is essential for mitosis. Science 330, 1670-1673 (2010).
44 Filimonenko, M. et al. The selective macroautophagic degradation of aggregated proteins requires the PI3P-binding protein Alfy. Mol Cell 38, 265-279 (2010).
45 Hanson, D. et al. Exome sequencing identifies CCDC8 mutations in 3-M syndrome, suggesting that CCDC8 contributes in a pathway with CUL7 and OBSL1 to control human growth. Am J Hum Genet 89, 148-153 (2011).
46 Hart, G. T., Ramani, A. K. & Marcotte, E. M. How complete are current yeast and human protein-interaction networks? Genome Biol 7, 120 (2006).
47 Hart, G. T., Lee, I. & Marcotte, E. M. A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinformatics 8, 236 (2007).
48 Varjosalo, M. et al. Application of active and kinase-deficient kinome collection for identification of kinases regulating hedgehog signaling. Cell 133, 537-548 (2008).


4.7. Supplementary Materials

4.7.1. Supplementary Figures and Tables

Supplementary Figure 1: Receiver Operating Characteristics of 3 different AP-MS data filtering methods. Performance of 3 scoring methods (wD-score – blue, SAINT – green, weighted Spectral Counts – red) on three datasets (each individual lab and the MERGED one).

Supplementary Figure 2: Scheme of Bayesian model of AP-MS experiments reproducibility.

For the description of the model refer to the 'Bayesian analysis' section in Supplementary Note 1.


Supplementary Table 2. Protein search engine score thresholds

Parameter                                  CeMM      ETHZ
Precursor mass tolerance, ppm              5         7.5
Fragment mass tolerance, Da                0.3       0.3
Mascot score, multiple peptide i.d.'s      ≥14.0     ≥19.0
Mascot score, single peptide i.d.'s        ≥40.0     ≥35.0
Phenyx z-score, multiple peptide i.d.'s    ≥4.2      ≥4.5
Phenyx z-score, single peptide i.d.'s      ≥4.75     ≥5.5
Phenyx p-value                             ≤0.001    ≤0.001
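To make the use of these thresholds concrete, the sketch below applies the merging rule from the 'Protein identification' section: a protein is accepted with at least two unique peptides scoring above the multiple-peptide threshold of either search engine, or with a single peptide above the more stringent single-peptide threshold. The threshold values come from Supplementary Table 2; the data layout and function name are hypothetical, and the Phenyx p-value cut-off and the exclusion of conflicting spectra are deliberately left out of this simplified illustration.

```python
THRESHOLDS = {
    "CeMM": {"mascot_multi": 14.0, "mascot_single": 40.0,
             "phenyx_multi": 4.2, "phenyx_single": 4.75},
    "ETHZ": {"mascot_multi": 19.0, "mascot_single": 35.0,
             "phenyx_multi": 4.5, "phenyx_single": 5.5},
}


def accept_protein(peptides, lab):
    """Decide whether a protein identification is accepted.

    peptides: one (mascot_score, phenyx_zscore) tuple per unique peptide;
    a score is None if that engine did not identify the peptide.
    """
    t = THRESHOLDS[lab]

    def passes(mascot, phenyx, kind):
        return ((mascot is not None and mascot >= t[f"mascot_{kind}"])
                or (phenyx is not None and phenyx >= t[f"phenyx_{kind}"]))

    # Rule 1: at least two unique peptides above the multiple-peptide threshold.
    if sum(passes(m, p, "multi") for m, p in peptides) >= 2:
        return True
    # Rule 2: any single peptide above the stricter single-peptide threshold.
    return any(passes(m, p, "single") for m, p in peptides)
```

For example, two peptides with a Mascot score of 15.0 and a Phenyx z-score of 4.3 would be accepted with the CeMM thresholds but rejected with the ETHZ ones.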

Supplementary Table 3. Families of contaminant proteins

Protein family                                               Gene name regular expression
14-3-3 proteins                                              ^YWHA[A-Z]
Acetyl Carboxylases                                          ^ACAC[AB]
Actinins                                                     ^ACTN\d+
Actins                                                       ^ACT[ABCG]\d*
Albumin                                                      ^ALB
Alcohol dehydrogenases                                       ^ADH\d+
Ankyrin domain family members                                ^(POTE[A-M])
ATP synthase subunits                                        ^ATP\d[A-Z]
BAG family chaperones                                        ^BAG\d+
Calmodulins                                                  ^CALM\d+
Clusters of differentiation molecules                        ^(CD\d+|BST2)$
Cofilins                                                     ^CFL\d+
Collagens                                                    ^COL\d+A\d+
Corneodesmosin                                               ^CSDN
Cytochrome C                                                 ^CYCS
Dermcidin                                                    ^DCD$
Drebrin                                                      ^DBN1
EF-hand domain family, member D2                             ^EFHD2
Envoplakin                                                   ^EVPL
Filamins                                                     ^FLN[ABC]
Glutathione S-transferases                                   ^GST[A-Z]\d+
Kinase chaperones                                            ^(HSP90|CDC37)
Heat shock proteins                                          ^(HSP\w?\d+|DNAJ\w\d+)
Hemoglobins                                                  ^HB[ABDEGMQZ]
Heterogeneous nuclear ribonucleoproteins                     ^(HNRN|SYNCRIP|RBMX)
Histones                                                     ^(H[1-3][ABF]\w+|HIST[1-4]H\w+)
Immunoglobulins                                              ^(IGL[LVJC]|IGH[VAGDJM]|IGK[LV]|IGJ)
Involucrin                                                   ^IVL
Keratin-associated proteins                                  ^KRTAP\d+
Keratins                                                     ^KRT\d+
Lacrimal proline-rich protein                                ^PRR4
Lactate dehydrogenases                                       ^LDH[ABCD]
LIM domain and actin binding 1                               ^LIMA1
Myosins                                                      ^(MYH\d+|MYL\d+|MYO1)
Non-neural enolase                                           ^ENO1
Peroxiredoxins                                               ^PRDX\d+
Plectin 1                                                    ^PLEC1
Proteasome                                                   ^PSM[A-D]\d+
Ribosomal small subunit                                      ^(RPSP?\d+|RPSAP|SNORD36C|ZRSR2)
Ribosomal large subunit                                      ^RPLP?\d+
RNA binding proteins                                         ^(PABP|SERBP|RBM\d+)
RNA helicases                                                ^DDX\d+
RuvB-like                                                    ^RUVBL\d
S100 family                                                  ^(S100[AP]\d|HRNR|RPTN)
Semenogelin                                                  ^SEMG\d+
Serine/threonine-protein phosphatase PGAM5, mitochondrial    ^PGAM
Solute carrier family                                        ^SLC\d+\w+
TCP-1 chaperonin family                                      ^(TCP1|CCT\d+[A-Z]*)
Thioredoxins                                                 ^TXN
Translation elongation factors                               ^EEF[1-2][A-G]*
Tubulins                                                     ^TUB[ABG]\d*[ABC]*
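The gene-name patterns in Supplementary Table 3 can be applied directly with a regular-expression engine. The sketch below uses Python's `re` module on a small subset of the families. The dictionary layout and function name are hypothetical; matching is case-sensitive, assumes upper-case gene symbols, and returns the first matching family in dictionary order.

```python
import re

# A subset of the contaminant families from Supplementary Table 3.
CONTAMINANT_PATTERNS = {
    "Keratins": r"^KRT\d+",
    "Tubulins": r"^TUB[ABG]\d*[ABC]*",
    "Ribosomal large subunit": r"^RPLP?\d+",
    "Dermcidin": r"^DCD$",
    "Heat shock proteins": r"^(HSP\w?\d+|DNAJ\w\d+)",
}
COMPILED = {family: re.compile(pattern)
            for family, pattern in CONTAMINANT_PATTERNS.items()}


def contaminant_family(gene_name):
    """Return the first contaminant family matching gene_name, or None."""
    for family, pattern in COMPILED.items():
        if pattern.search(gene_name):
            return family
    return None


preys = ["KRT18", "TUBB", "RPL13", "HSPA8", "MAPK14"]
flagged = {gene: contaminant_family(gene) for gene in preys}
```

Most patterns are anchored only at the start of the gene name, so they also match longer symbols that share the prefix; only patterns ending in `$`, such as the one for dermcidin, require an exact match.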

Supplementary Table 5. Design of in silico experiments for the reproducibility analysis

Reproducibility analysis    Number of biological replicates    Number of LC-MS/MS analyses per biological replicate
Overall                     2                                  1
Sample preparation          2                                  –
LC-MS/MS                    –                                  2


4.7.2. Supplementary Notes

Supplementary Note 1 – Reproducibility analysis

Bayesian analysis

The Bayesian model was used to estimate the reproducibility of the biochemical and mass spectrometry components of the protocol within the same laboratory and across the two laboratories. Here, the reproducibility of an experiment B with respect to experiment A is defined as the conditional probability that protein x would be observed in experiment B, given that the protein was already observed in experiment A:

$$R_{B,A} = P(x \in B \mid x \in A) = \frac{P(x \in A \wedge x \in B)}{P(x \in A)},$$

where $P(x \in A)$, the probability of observing protein x in experiment A, is defined as a product of two independent workflow components:

$$P(x \in A) = P(x \in A_{bio}) \times P(x \in A_{MS}).$$

The probability of the biochemical component, $P(x \in A_{bio})$, is the probability that protein x would be present in the purified sample at a concentration sufficient for subsequent identification by mass spectrometry. Its LC-MS/MS counterpart, $P(x \in A_{MS})$, refers to the probability of correct identification of the protein, given its presence in the analysed sample at sufficient concentration. To account for the correlation between protein abundance and reproducibility, the interactions were distributed into three equal bins based on the maximum sequence coverage of the protein in all eight replicate LC-MS/MS analyses (low, medium and high), as illustrated in Supplementary Figure 2.


The Bayesian model assigned each interaction to one of the biochemical reproducibility groups at each laboratory (low, medium, high), and each prey protein to an MS reproducibility group (low, medium, high). Thus, according to the model, $P(x \in A_{bio})$ depends on the abundance of interaction x and its reproducibility group; the same holds for $P(x \in A_{MS})$. The model assumes that the inter-laboratory reproducibility is affected not only by the reproducibility of each component at an individual laboratory, but also by laboratory-specific biases that are conveyed by assigning interactions to different reproducibility groups at different laboratories. The input data for the model is the pattern of presence/absence of each interaction in 8 replicate experiments ((2 CeMM + 2 ETHZ biological replicates) × 2 LC-MS replicates). Internally, the model uses these patterns and the current interaction reproducibility parameters to find the most probable reconstruction of the sample contents. This reconstruction is, in turn, used to adjust the reproducibility parameters, and the procedure is iterated. The inference of the Bayesian model parameters was performed using the JAGS Markov chain Monte Carlo sampler (mcmc-jags.sourceforge.net, version 3.2.0) and further analyzed in R (r-project.org, version 2.14). In this study, the average reproducibility within each laboratory is reported. The result of the experiment is defined as the union of the lists of interactors identified in all replicate MS runs of all biological replicates. The number of biological and technical replicates used to report the reproducibility was chosen to conform to the workflow of the study (see Supplementary Table 5).

Reproducibility of public AP-MS data sets

The public protein interaction databases (BioGRID, IntAct, InnateDB, MINT) were searched for experiments that use human proteins as baits and employ one of the following interaction detection methods: MI:0096 (pull down), MI:0676 (tandem affinity purification), MI:0007 (anti tag coimmunoprecipitation), MI:0006 (anti bait coimmunoprecipitation). To facilitate comparison, only baits used in at least two publications were considered. Additionally, experiments with fewer than 5 identified proteins were excluded so that the comparison would be statistically meaningful. These criteria identified 8880 publications that use 1884 distinct human proteins as baits.


All pairwise combinations of experiments that utilize the same protein as bait were considered. For each of the resulting 3460 pairs of experiments, reproducibility was calculated as the fraction of proteins identified in the first experiment (including the bait protein) that are also detected in the second experiment. The average reproducibility calculated by this method was 24.75%.
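The pairwise measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used in the study: the data layout (a dictionary mapping each bait to a list of protein sets) and the toy protein names are hypothetical, and because the text does not say which experiment of a pair counts as "first", the sketch simply uses list order.

```python
from itertools import combinations
from statistics import mean


def pair_reproducibility(first, second):
    """Fraction of proteins in `first` (bait included) also found in `second`."""
    first, second = set(first), set(second)
    return len(first & second) / len(first)


def mean_reproducibility(experiments_by_bait):
    """Average the pairwise measure over all pairs of experiments sharing a bait."""
    scores = []
    for experiments in experiments_by_bait.values():
        # Assumption: each unordered pair is scored once, in list order.
        for first, second in combinations(experiments, 2):
            scores.append(pair_reproducibility(first, second))
    return mean(scores)


# Toy data (hypothetical): two baits, two experiments each.
example = {
    "BAIT1": [{"BAIT1", "A", "B", "C"}, {"BAIT1", "A", "D"}],
    "BAIT2": [{"BAIT2", "X"}, {"BAIT2", "X", "Y"}],
}
average = mean_reproducibility(example)
```

Note that the measure is directional: swapping the two experiments of a pair generally changes the fraction, so the ordering convention matters when comparing numbers between studies.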

Supplementary Note 2 – Filtering strategy

Removal of protein carry-over

Proteins that were present at residual levels in an LCMS analysis of a particular affinity-purified sample, and were present at higher quantities in the preceding sample E, were removed as a possible effect of carry-over. A bait-prey interaction B-P was considered a ‘carry-over’ from the preceding experiment E if all of the following conditions were met:
• there was at least a 10-fold difference in the spectral counts of P in experiment E versus B
• B was analysed no more than 5 days or 5 intermediate LCMS analyses after E
• protein P was observed in all LCMS analyses performed after experiment E and before experiment B, with the possible exception of a single run. In all intermediate analyses, the spectral counts of P should not exceed its spectral counts in experiment E.

A manual inspection of the final network identified a few cases of carried-over prey that were not detected by the automatic procedure. PAK2 (a strong PAK1 interactor) was found to be present only in the replicates of one of the two single-laboratory LC-MS/MS analyses of the RIOK2 purification. These LC-MS/MS analyses were performed immediately after the PAK1 ones, but the ratio of PAK2 spectral counts in the two AP analyses was above 10. This interaction was removed manually from the final network. Similarly, PRKAA1, PRKAB1, PRKAB2, PRKAG1 and PRKAG2 were carried over from the analysis of the PRKAA1 purification into the analysis of STK4. These interactions were not observed in the other seven replicate LC-MS/MS analyses and were subject to manual removal.
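The three automatic carry-over conditions listed above can be expressed as a small predicate. The sketch below is an illustration under stated assumptions rather than the procedure actually used: the `CarryOverQuery` record and its field names are hypothetical, and the "5 days or 5 intermediate LCMS analyses" limit is read here as satisfied when either bound holds, which is one possible interpretation of the text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CarryOverQuery:
    sc_in_e: int                    # spectral counts of prey P in the earlier run E
    sc_in_b: int                    # spectral counts of P in the later bait run B
    days_between: float             # time elapsed between E and B, in days
    intermediate_sc: Sequence[int]  # counts of P in the runs between E and B


def is_carry_over(q, fold=10, max_days=5, max_runs=5):
    """Return True if prey P in run B looks like carry-over from run E."""
    # Condition 1: at least a 10-fold excess of P in E relative to B.
    enough_fold = q.sc_in_b > 0 and q.sc_in_e >= fold * q.sc_in_b
    # Condition 2 (assumed reading): either the day limit or the run limit holds.
    close_in_time = (q.days_between <= max_days
                     or len(q.intermediate_sc) <= max_runs)
    # Condition 3: P seen in all intermediate runs except possibly one,
    # and never above its level in E.
    missing_runs = sum(1 for sc in q.intermediate_sc if sc == 0)
    persistent = missing_runs <= 1
    never_higher = all(sc <= q.sc_in_e for sc in q.intermediate_sc)
    return enough_fold and close_in_time and persistent and never_higher


# Toy queries (made-up counts).
likely = is_carry_over(CarryOverQuery(200, 4, 1.0, [30, 12, 0]))
too_abundant = is_carry_over(CarryOverQuery(200, 40, 1.0, [30]))
not_persistent = is_carry_over(CarryOverQuery(200, 4, 1.0, [0, 0, 5]))
```

Cases such as the manually removed interactions mentioned above show why a rule of this kind is best combined with inspection of the final network.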


Removal of negative control data

Affinity-purified SH-GFP, SH-RFP or SH tag only were used as negative controls. To improve the sensitivity for the removal of non-specifically interacting proteins, in addition to proteins identified in the negative control experiments, all members of their protein families were also considered for removal (see Supplementary Table 3 for the list of frequent contaminant protein families defined by the gene names). The following filtering rule was applied: a prey is removed from the sample if the ratio of the SAF (spectral abundance factor)49 in the sample to the average SAF of the protein family in the negative controls (or the SAF of the protein itself, if the protein is not a member of a family) is below a given threshold (a threshold of 8 was used, as this removes most contaminant protein families). In the combined data set, the interaction was removed if the SAF ratio was below the threshold at either laboratory.

Score-based filtering

A weighted sum of the prey spectral counts in all replicates of the i-th bait, R(i), was used as the scoring function for a bait-prey interaction:

$$s_{i,j} = W_{i,j} \sum_{r \in R(i)} SC_{r,j}$$

The weight of the interaction, $W_{i,j}$, scores the reproducibility in different biological replicates and across the two laboratories:

$$W_{i,j} = \frac{n_S(i,j)\, n_{SL}(i,j)\, n_{ML}(i,j)}{N_S(i)\, N_{SL}(i)\, N_{ML}(i)},$$

where $N_S(i)$, $N_{SL}(i)$ and $N_{ML}(i)$ are, respectively, the number of samples with the i-th bait, the number of laboratories that produced the samples, and the number of laboratories that analysed the samples by liquid chromatography mass spectrometry; and $n_S(i,j)$, $n_{SL}(i,j)$ and $n_{ML}(i,j)$ are the number of samples with the i-th bait in which protein $p_j$ was detected, the number of laboratories that produced these samples, and the number of laboratories that analysed them by liquid chromatography mass spectrometry. Note that in the event of a single-laboratory analysis without biological replicates, $W_{i,j}$ is always 1. Only interactions with a score above a given threshold were considered for the final network. The threshold was defined for each data set individually, such that 85% of public interactions are retained after applying the filter.
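As a worked illustration of the score defined above, the following Python sketch computes the weight and the weighted sum for one bait-prey pair. It is a sketch under assumptions, not the code used for the thesis: the function names are hypothetical, a prey is treated as detected in a sample when its spectral count is nonzero, and the laboratory counts are passed in directly rather than derived from sample annotations.

```python
def interaction_weight(n_s, n_sl, n_ml, N_s, N_sl, N_ml):
    """Reproducibility weight: product of detected fractions over samples,
    producing laboratories and analysing laboratories."""
    return (n_s * n_sl * n_ml) / (N_s * N_sl * N_ml)


def interaction_score(spectral_counts, n_sl, n_ml, N_sl, N_ml):
    """Weighted sum of prey spectral counts over all replicates of one bait.

    spectral_counts: counts of the prey in every sample of the bait (0 if absent).
    """
    N_s = len(spectral_counts)
    # Assumption: "detected" means a nonzero spectral count.
    n_s = sum(1 for sc in spectral_counts if sc > 0)
    weight = interaction_weight(n_s, n_sl, n_ml, N_s, N_sl, N_ml)
    return weight * sum(spectral_counts)


# Toy case: four samples, prey seen in three of them, at both laboratories.
two_lab_score = interaction_score([12, 0, 8, 10], n_sl=2, n_ml=2, N_sl=2, N_ml=2)

# Single laboratory, single sample: the weight is 1 and the score is the raw count.
single_score = interaction_score([7], n_sl=1, n_ml=1, N_sl=1, N_ml=1)
```

In the toy case the weight is 3/4, so the summed count of 30 is scaled to 22.5, whereas the single-sample case keeps its raw count, matching the remark that the weight is 1 without replicates.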


Existing filtering methods, such as wD-score 11 and SAINT 19, produced networks of comparable size, but with a significantly smaller overlap between the CeMM-only and ETHZ-only networks (40% and 38%, respectively, compared with 48%).

Analysis of specificity

A set of non-specific interactions was defined based on Cellular Component GO (GO CC) annotations provided by UniProt. All GO CC annotations were divided into four big groups:
• nucleus (GO:0005634 and its descendant terms)
• cytoplasmic organelles (GO:0044444 and descendants, excluding nucleus and cytosol (GO:0005829))
• cytosol (GO:0005737 and its descendants excluding the GO terms from the groups above)
• plasma membrane and outer structures (GO:0030312 and GO:0005886 and their descendants)

An interaction was considered non-specific if bait and prey proteins were not reported to co-localize in any of the four cellular compartments defined above. As this type of analysis heavily relies on the completeness of annotation, both experimental evidence and electronically inferred annotations were used. To further reduce the number of false-positive non-specific interactions, we have excluded the ones that were independently observed at both laboratories (in MS analysis of their own samples) from the “negatome”, as the chances that a non-specific interaction would be independently observed in two experiments are lower than the chances that the annotation of cellular localization of the proteins is not complete. In total, from the set of all 1470 interactions that passed negative controls filtering in the MERGED data set, we have identified 103 interactions to be non-specific. This set of interactions was used to estimate the FDR of the scoring schemes (see Supplementary Fig. 4.1).

References

49 Florens, L. et al. Analyzing chromatin remodeling complexes using shotgun proteomics and normalized spectral abundance factors. Methods 40, 303-311 (2006).


Chapter 5 The CRAPome: a Contaminant Repository for Affinity Purification Mass Spectrometry Data

Dattatreya Mellacheruvu, Zachary Wright, Amber L. Couzens, Jean-Philippe Lambert, Nicole St-Denis, Tuo Li, Yana V. Miteva, Simon Hauri, Mihaela E. Sardiu, Teck Yew Low, Vincentius A. Halim, Richard D. Bagshaw, Nina C. Hubner, Abdallah al-Hakim, Annie Bouchard, Denis Faubert, Damian Fermin, Wade H. Dunham, Marilyn Goudreault, Zhen-Yuan Lin, Beatriz Gonzalez Badillo, Tony Pawson, Daniel Durocher, Benoit Coulombe, Ruedi Aebersold, Giulio Superti-Furga, Jacques Colinge, Albert J. R. Heck, Hyungwon Choi, Matthias Gstaiger, Shabaz Mohammed, Ileana M. Cristea, Keiryn L. Bennett, Mike P. Washburn, Brian Raught, Rob M. Ewing, Anne-Claude Gingras, Alexey I. Nesvizhskii.

Published in Mellacheruvu, D., et al. (2013). "The CRAPome: a contaminant repository for affinity purification-mass spectrometry data." Nat Methods 10(8): 730-736.

Contribution by SH: Design and performance of the LC-MS/MS measurements, protein purification experiments, data analysis and manuscript discussion.


5.1. Abstract

Affinity purification coupled with mass spectrometry (AP-MS) is now a widely used approach for the identification of protein-protein interactions. However, for any given protein of interest, determining which of the identified polypeptides represent bona fide interactors versus those that are background contaminants (e.g. proteins that interact with the solid-phase support, affinity reagent or epitope tag) is a challenging task. While the standard approach is to identify nonspecific interactions using one or more negative controls, most small-scale AP-MS studies do not capture a complete, accurate background protein set. Fortunately, negative controls are bait-independent. Hence, aggregating negative controls from multiple AP-MS studies can increase coverage and improve the characterization of background associated with a given experimental protocol. Here we present the Contaminant Repository for Affinity Purification (the CRAPome) and describe the use of this resource to score protein-protein interactions. The repository and computational tools are freely available online at www.crapome.org.


5.2. Introduction

Affinity purification (AP) coupled with mass spectrometry (MS) has become a ubiquitous approach for the identification of protein-protein interactions[1]. In an ideal world, affinity purification would yield only true interacting partners, and no non-specific interactors (here referred to as “background contaminants”, or “contaminants”). In most cases, however, a large number of contaminants are co-purified with bait proteins and identified by mass spectrometry. A method to discern bona fide interacting partners from background contaminants is thus essential. In the case of affinity purification using epitope-tagged proteins, this is often aided by the inclusion of ‘negative control’ purifications, typically consisting of one or more “mock” purifications using the same support resin and cell line, but without expression of the polypeptide(s) of interest (referred to here as the “bait” protein(s)). These negative controls may consist of: (i) untransfected cells; (ii) cells expressing only the epitope tag; (iii) cells expressing the same epitope tag, but fused to a heterologous protein (most often a protein derived from another species that is not expected to interact with the host proteome, e.g. green fluorescent protein, mCherry, luciferase, etc.); or (iv) cells expressing an unrelated epitope-tagged protein from the same species. With the exception of the latter case (or when isotope labeling is used for direct comparisons[2-5]), these controls can be considered as “universal”, meaning that they are useful for filtering the background from any bait protein subjected to the same purification scheme[6].
We and others have shown that the use of quantitative mass spectrometric information represents a robust approach to identify true interaction partners based on their differential abundance in AP-MS data from bait protein purifications versus that of negative controls, and have devised computational approaches to perform this type of analysis systematically[3, 7-10].

A question that arises when designing and performing AP-MS experiments is how to use previous knowledge regarding background contaminants, in addition to the user’s own negative controls, to best score interaction data. Here, it is important to keep in mind that variation in the sample or sample preparation (e.g. cell confluence, lysis conditions, affinity material batch, etc.) may influence the recovery of proteins, including contaminants. It is therefore not uncommon for a negative control experiment to fail to capture a complete set of contaminants, due to undetected variations at one or more experimental steps. This issue is compounded by the fact that most mass spectrometers operate in a data-dependent manner, in which a predefined number of the most abundant co-eluting peptides are fragmented and identified: low abundance peptides (and hence proteins) therefore may or may not be detected in a given MS analysis. Analyzing one or a few negative control samples will thus generally not allow for a complete characterization of background contaminants for a given purification regime. In principle, if each laboratory performing AP-MS had access to unlimited resources, a sufficient number of negative control runs could be generated to properly model contaminants. However, in practice, many research groups simply do not have the resources for such an effort. As a result, there is great concern regarding the quality of protein-protein interaction datasets, and the accumulation of false positive interactions in AP-MS data repositories.

We reasoned that the negative controls generated by the proteomics research community could be developed as a resource for scoring AP-MS data. Such a resource would need to include linked protocols, easy navigation, and protein/gene mapping. Ideally, scoring schemes would also assist users in assessing the likelihood of each protein detected in an AP-MS analysis being a true interactor versus a contaminant.

With these goals in mind, here we present the Contaminant Repository for Affinity Purification, currently containing AP-MS data from 360 control purifications conducted by 12 different research groups (www.crapome.org) for two organisms, human and S. cerevisiae. Users employ an intuitive graphical user interface to explore the database, by either querying one protein at a time, downloading background contaminant lists for selected experimental conditions, or uploading their own data (alongside their own negative controls when available) and performing data analysis. The CRAPome database scores contaminants vs. true interactors based on semi-quantitative mass spectrometry data (normalized spectral counts) embedded in most mass spectrometry experiments. The Significance Analysis of INTeractome (SAINT[11-13]) scoring scheme and a simpler Fold Change calculation (FC score) are used to score user-supplied data and return a ranked list of putative interactors. We also describe database structure and composition, provide examples of the use of this resource to filter contaminants with properly chosen controls, and demonstrate the utility of the scoring scheme for identifying bona fide interaction partners. The CRAPome accommodates a variety of purification schemes and, while it currently contains human and S. cerevisiae data only, will be expanded to other species.

5.3. Results

5.3.1. Creation of the CRAPome repository

The CRAPome database is a web-accessible (www.crapome.org) repository of negative control AP-MS experiments associated with detailed protocols and controlled vocabularies used to organize the data. Several of the accumulated AP-MS data have been published[7, 9, 10, 14-27], while others are unpublished. Contributors to the CRAPome submit raw mass spectrometry files to the CRAPome administrator alongside a file description (Fig. 5.1a; database architecture in Supplementary Fig. 5.1). Raw data is converted to the mzXML format, which is then analyzed using a common pipeline (database search with X!Tandem[28] against the RefSeq[29] database (human) or SGD ORF protein sequence database (S. cerevisiae) and statistical analysis with TPP[30-32]; see Methods), and inspected for overall data quality. Experiments accepted by the CRAPome administrator (see Methods for the description of quality control steps) are released and the data contributor associates meta-data (controlled vocabularies and text-based protocols) with each experiment (see Annotator Tutorial in the Supplementary Material and online at www.crapome.org). These annotated negative control runs form the core of the repository. Currently (version 1.0, March 2013), 360 experiments contributed by 12 laboratories are available in the repository, of which the bulk of the data (343 experiments) were generated using human cell lines (Fig. 5.1b). This large dataset covers many of the most commonly used AP-MS protocols, and includes data derived from several different epitope tags, cell lines and mass spectrometers (see Supplementary Table 1 for current lists). For each of the experiments in the CRAPome, mapping of the protein identifiers to NCBI Gene IDs is performed, and spectral count information is parsed to the relational database (see Methods). The database is expandable and new data is added to the CRAPome using the same deposition and annotation process.
New protocols and controlled vocabularies can be added at any time to adapt the database to new experimental workflows.


Figure 5.1. The CRAPome at a glance. (a) Creation of the CRAPome. (1) Contributors to the CRAPome submit raw mass spectrometry files for negative control runs, detailed experimental protocols and mapping information. (2) These raw MS files are first converted to mzXML and analyzed by X!Tandem and the Trans-Proteomic Pipeline; counts are extracted for protein quantification and the CRAPome administrator performs a quality control step (see Methods). (3) Released high quality runs (data) are associated with experimental description and protocols (metadata by the CRAPome administrator in consultation with the data provider). (4) Query of the CRAPome database by external users via the web interface. (b) Composition of the CRAPome (human data) as of December 2012. (c) Overview of the first CRAPome workflow. (1) Proteins are queried against the CRAPome by inputting one of several identifiers (see Tutorial) which enable mapping to Gene ID. Different views enable exploration of the contaminant profile of each queried protein, either as a summary table (2) or in graphical formats (3). (d) Overview of the third CRAPome workflow (note that the second workflow is similar, except that no user data is uploaded; the second workflow generates lists of contaminant proteins). (1) Desired controls are selected, with the help of controlled vocabularies. (2) Users upload their own data (test experiments and controls if available) to the CRAPome and (3) select parameters for data analysis.


Data is displayed in a table format (see User Manual) and in different graphical formats, which include the detection of a given interaction in the public repository iRefIndex (4).

5.3.2. Graphical user interface

End users access the database via a web interface (Fig. 5.1c, 5.1d, User Manual; www.crapome.org). After selecting the organism of interest (currently human or S. cerevisiae), the database can be queried in three ways (called “user workflows”).

1) Query selected proteins
In the first workflow, users submit queries consisting of protein or gene identifiers and retrieve summaries of the occurrence of queried genes across the CRAPome database. An expanded view summarizes the conditions and protocols in which the protein has been identified in negative control experiments, associated with quantitative information, and summary graphs display selected information in a visual manner (Fig. 5.1c).

2) Create contaminant lists
The second user workflow generates background lists from a subset of the CRAPome controls. The user first selects the list of desired controls (filtering using controlled vocabularies and protocol details; Fig. 5.1d; left), and downloads the resulting tables of contaminants, associated with quantitative parameters, including the occurrence of identification across selected controls, and the average spectral counts across selected controls in which the protein was detected (a maximum of 30 experiments can be viewed online and included for analysis in workflow 3 below; the entire dataset can be downloaded as a tab delimited file from www.crapome.org/Download). Registered users can also save the selected list of controls for future use.

3) Analyze user data
The third workflow (available to registered users) allows the users to analyze their own data. A user can use our interface to score their AP-MS data using either their own controls, selected CRAPome controls, or a combination of user and CRAPome controls for improved discrimination between true interactions and background contaminants. Note that it is always recommended that users generate and use at least several of their own negative controls. The input data consists of one or multiple AP-MS experiments for at least one bait, ideally in biological replicates, and user controls (optional). Preparation of the data for upload to the CRAPome is described in Methods. In a first step, the user selects relevant controls from the CRAPome database (using the same interface as for workflow 2; Fig. 5.1d), or chooses previously saved selected lists of controls. The user then uploads their data in the specified format (or uses previously uploaded data). Upon selection of baits and controls (from the CRAPome database and/or generated by the user), the user can analyze the data. Two scoring functions are available: the Significance Analysis of INTeractome (SAINT) score[11-13], and a simpler Fold Change calculation (detailed below). The Fold Change calculation is run by default, and the SAINT analysis is optional, but can also be run in parallel. These scoring tools create lists of interacting partners, ranked by confidence. Known interactions documented in the interaction database aggregator iRefIndex (version 9.0[33]) are also mapped onto user data. The results are presented in a tabular format and can be downloaded as a tab delimited file. Additionally, summary graphical views of the data are provided for each of the baits (Fig. 5.1d; right), or for all baits combined, which enable the user to view their data at a glance.

5.3.3. Characterization of the CRAPome

With the large amount of data available, we mined the database to determine: (i) which proteins have a higher propensity to be contaminants, and (ii) how background proteins differ based on experimental conditions. First, to understand whether the abundance level of a protein in a sample increases the propensity of the protein to be a contaminant, we plotted the proteins reported in the CRAPome repository (restricting the analysis to HEK293 cells, by far the most common human cell line in the CRAPome) against a list of proteins ranked by their abundance estimates based on whole proteome analysis of HEK293 cell lysate[34]. As shown in Fig. 5.2a, there is a clear relationship between the abundance of a protein in HEK293 and its detection in at least one of the HEK293 experiments in version 1.0 of the CRAPome. We next analyzed the frequency of detection of individual proteins in the CRAPome (mapped to gene names, as throughout this manuscript). Using a stringent filtering (protein FDR < 1%), 4449 proteins (or more precisely, non-redundant protein groups, see Methods for details) were identified from the data used to build the CRAPome repository (see Supplementary Table 2 for the top part of the list sorted based on the frequency of occurrence; the full list, including spectral counts in each of the 343 individual human experiments, can be downloaded from the CRAPome website). Of these, 14 proteins are detected in > 90% of all experiments, and 89 in > 50% of the experiments, qualifying them as ubiquitous contaminants (Fig. 5.2b). Not surprisingly, these include keratins, cytoskeletal proteins such as tubulins and actins, and high abundance proteins including translation elongation factors and histones (Fig. 5.2c). Other proteins were not detected consistently across all purifications, but are abundant (in terms of total spectral counts) across the database: these are significantly enriched for several functional categories, most predominantly associated with RNA functions (see Supplementary Tables 4-6 for most enriched GO BP, MF, and CC categories). However, a large fraction of the proteins present in the CRAPome are detected in only a small fraction of experiments (3571, or 80% of the proteins in the CRAPome, are found in ≤ 10% of the experiments), which includes some proteins found with low spectral counts or detected in restricted subsets of the experiments (e.g. in association with a specific epitope tag or other experimental condition). Fig. 5.2b also lists the full (redundant) gene counts obtained using a more liberal filtering threshold, which represents the data as it is stored in the CRAPome database (see Methods for detail; Supplementary Table 3; full table on the CRAPome website).
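The detection frequencies quoted above (for example, proteins found in > 90% or in ≤ 10% of the experiments) amount to a simple tally over the control runs. A minimal Python sketch follows, assuming each experiment is represented as a hypothetical dictionary mapping gene names to spectral counts; the toy values are illustrative only and are not CRAPome data.

```python
def detection_frequency(experiments):
    """Map each protein to the fraction of experiments with a nonzero count."""
    n_experiments = len(experiments)
    hits = {}
    for counts in experiments:
        for protein, spectral_count in counts.items():
            if spectral_count > 0:
                hits[protein] = hits.get(protein, 0) + 1
    return {protein: n / n_experiments for protein, n in hits.items()}


# Toy negative-control runs (made-up spectral counts).
controls = [
    {"TUBB": 40, "ACTB": 22, "PPP4C": 1},
    {"TUBB": 35, "ACTB": 18},
    {"TUBB": 51, "STK38": 9},
    {"TUBB": 28, "ACTB": 30},
]
frequency = detection_frequency(controls)

# Proteins seen in more than half of the runs, and those seen in a quarter or fewer.
frequent = sorted(p for p, f in frequency.items() if f > 0.5)
rare = sorted(p for p, f in frequency.items() if f <= 0.25)
```

Frequency alone says nothing about how many spectra a protein contributes when it is detected, which is why the spectral count distributions are considered separately in the text.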


Figure 5.2. Composition of the CRAPome (human data). (a) Relationship between the detection of a given protein in the CRAPome and its protein abundance (all entries are mapped to genes). The abundance distribution in HEK293 cells is calculated from shotgun mass spectrometry data (see Methods). The left axis indicates the number of proteins identified at each of the spectral count abundances (green circles); the right axis indicates the fraction of the proteins at a given binned abundance in the CRAPome database (blue triangles). (b) General overview of the frequency of detection across the CRAPome. Two numbers are computed at different frequencies. “Redundant” gene counts are based on a generous estimation of shared peptides: in this case, each protein/gene to which a given peptide is matched is counted as a contaminant. “Reduced” gene counts are based on a more stringent definition of protein/gene parsimony, as described in Methods. (c) List of the most frequently globally detected protein families across the CRAPome, alongside some of the most frequently detected representative genes. (d) Similarity clusters of all experiments: all experiments in the CRAPome are scored for similarity in their contaminant profiles based on a cosine function: the size of the clusters represents the number of the experiments with strong similarity. Selected similarity clusters are indicated, alongside their composition. (e) Cluster 9, described in d as FLAG agarose in HeLa cells, can be further defined as two sub-clusters based on subcellular fractionation performed prior to the affinity purification (cytosolic and nuclear fractions); other clusters can also be further refined. (f) Example of epitope-tag specificity for selected proteins. TUBB is detected frequently (and equally well) in FLAG and GFP affinity purifications, while STK38 is almost exclusively detected in FLAG APs and TP53 is predominantly detected in GFP APs. PPP4C is infrequently detected with either epitope. (g) Spectral count distribution of the proteins/genes shown in f across the entire dataset. Spectral count bins are shown for all non-zero experiments. The highest spectral count boundary for each bin is shown.

To further explore the contaminant propensity of the proteins in the CRAPome, we computed the similarity of all experiments (restricting the analysis to human data only), generating the heat map displayed in Fig. 5.2d (see Methods). The data cluster primarily according to experimental condition (although there may also be a bias in the type of background detected across different laboratories). Several of the clusters can be further separated into subclusters, as exemplified by the “FLAG HeLa agarose” cluster, showing a clear separation based on subcellular fractionation (cytoplasmic/nuclear) performed prior to AP-MS (Fig. 5.2e). Based on our analysis of the most important determinants of background behavior, we annotated all experiments in the CRAPome using 14 categories of controlled vocabulary (Supplementary Table 1), which can be used to select experiments that are most similar to those in a query set. We recognize that classification based on these 14 categories alone is insufficient to fully describe an experiment (for example, salt or detergent concentrations have a significant impact on the purification and are not easily captured in a controlled vocabulary framework), and have therefore provided more complete protocol descriptions of the experiments via a free text form.
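The figure legend states that experiments are compared by a cosine function of their contaminant profiles. The sketch below shows what such a comparison could look like in Python. It is an illustration only: the Methods are not reproduced in this section, so the profile representation (a hypothetical dictionary of gene name to spectral count) and any normalization applied before the comparison are assumptions.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two contaminant profiles.

    Each profile maps a gene name to an abundance value; genes missing from
    a profile are treated as zero.
    """
    proteins = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(p, 0) * profile_b.get(p, 0) for p in proteins)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy profiles (made-up counts): identical, partly overlapping, and disjoint.
same = cosine_similarity({"A": 3, "B": 4}, {"A": 3, "B": 4})
partial = cosine_similarity({"A": 1}, {"A": 1, "B": 1})
disjoint = cosine_similarity({"A": 1}, {"B": 1})
```

A matrix of such pairwise values, one per pair of experiments, is the kind of input from which a clustered heat map can be drawn.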

To illustrate the different contaminant propensities of individual proteins, and the need to take into account not only the overall frequency of detection in the dataset, but also the experimental conditions, we analyzed the frequency distribution of four proteins with two types of epitope tags, FLAG and GFP (Fig. 5.2f). TUBB (tubulin beta) is detected across nearly all of the experiments, irrespective of the epitope tag. By contrast, STK38 (a serine/threonine kinase) co-purifies significantly in FLAG experiments, but not in GFP experiments, while TP53 (the tumor suppressor protein p53) is detected predominantly in GFP-based AP protocols. The serine/threonine phosphatase PPP4C is not detected at a high frequency in experiments performed with either of these epitope tags (it is identified in 3/343 experiments across the entire database). Frequency and experimental conditions are also clearly not sufficient to describe contaminant propensity: abundance measures are also critical. For instance, if a protein is detected at a high frequency but low abundance (i.e. a low number of spectral counts in a high number of MS runs) in the CRAPome, but is detected with a high spectral count in bait purifications performed by a user, it is more likely to be a true interactor than if it is always detected with high abundance in the CRAPome. To illustrate this concept, we compared the non-zero values for the four proteins in Fig. 5.2f, but specifically examined spectral count distributions (binned values). This analysis revealed that while TUBB and STK38 are often present in very high counts in the CRAPome, TP53, while frequently detected in the GFP experiments, is usually detected with much lower spectral counts (Fig. 5.2g). PPP4C, which was detected at low frequency, is also present at low (1-2) spectral counts in the data the few times it was detected. These comparisons (frequency distribution across different experimental conditions and spectral count distributions) are easily accessed via the CRAPome user interface (see User Manual). They also provide the basis for statistical or empirical scoring of interactions, as described below.

5.3.4. Using CRAPome to score interactions

The CRAPome can be used for the analysis of diverse AP-MS datasets, and most importantly for relatively small datasets where eliminating background contaminants computationally is a difficult task. The CRAPome thus implements two complementary scoring strategies, statistical scoring using SAINT and an empirical fold change-based scoring. Both approaches are based on quantitative comparisons of prey abundance levels (estimated using spectral counts) in purifications with bait proteins against the distribution of prey abundances across a set of negative controls.

SAINT, described previously[10-13], allows advanced statistical modeling of the input bait-prey spectral count data and reports a posterior probability of true interaction. The CRAPome provides a facile web interface for performing SAINT analysis, including user-specified key model parameters and control selection[13]. SAINT assists with discrimination between true and false positives, and has been shown to perform well when using a sufficient number of matching negative controls (ideally at least 3-5 controls) showing a high degree of reproducibility. At the same time, SAINT can be sensitive to changes in the spectral count distributions of a given protein in either the controls or the bait samples, and thus its performance may be affected if the bait sample quality is poor or the negative controls are heterogeneous. SAINT is also computationally intensive. We therefore sought to provide the user with an alternative scoring tool enabling a more rapid analysis, and implemented a simpler Fold Change calculation (FC scores) based on computing the ratio of average normalized spectral counts in bait purifications versus negative controls. FC scoring is customizable and, in addition to the calculation of the standard FC score (referred to as primary score, or FC-A), computes a secondary, more stringent score (FC-B, see next section). The primary FC score can be considered an alternative to SAINT scoring, whereas the secondary FC score can be used in addition to SAINT or the primary FC score for improved detection of several classes of challenging contaminants (discussed below). Both SAINT and Fold Change calculations are run in parallel from the CRAPome interface, and comparison of their relative performances for each of the tested baits can be assessed by a Receiver Operating Characteristic analysis provided via the CRAPome interface.

The use of the analytical pipeline within the CRAPome is illustrated here by a small dataset consisting of two biological replicates of each of the following four baits: RAF1, EIF4A2, WASL and MEPCE. In addition, six matching controls (user controls) were generated and processed together with the four baits to generate the input data (see Methods for detail). RAF1 is a serine/threonine kinase that binds to Ras and to several chaperones and 14-3-3 proteins[35, 36]. EIF4A2 is a translation initiation factor that is part of the EIF4F complex, which bridges the mRNA cap structure to the ribosome via the EIF3 complex[37]. WASL (also known as N-WASP) belongs to the Wiskott-Aldrich syndrome (WAS) family of proteins, involved in transduction of signals from receptors on the cell surface to the actin cytoskeleton[38]. Finally, MEPCE, the 7SK snRNA methylphosphate capping enzyme, interacts with numerous transcriptional and RNA processing proteins[39]. MEPCE and EIF4A2 have many documented interactors[33] while WASL and RAF1 have fewer known interactors; all proteins provide challenges for background definition based on their association with polypeptides with contaminant-like behavior (chaperones, cytoskeletal proteins, RNA binding proteins, etc.; Fig. 5.2c).

The performance of the processing pipeline was first evaluated by plotting ROC curves based on the information extracted from iRefIndex[33]. The protein interaction list (all four baits combined) was sorted based either on the SAINT probability or the primary FC score computed using the six user controls (Fig. 5.3a). While SAINT did outperform the FC score on this dataset, both scoring schemes were able to efficiently recapitulate known interactions from the literature. Both scores also tracked very similarly for most of the proteins analyzed (Fig. 5.3b), with SAINT essentially providing a statistical conversion of the fold change onto the probability scale via the mixture model analysis of the underlying spectral count distributions. The performance of the interaction scores was further visualized by plotting the distribution of scores (histograms) separately based on iRefIndex annotation, showing that high scoring interactions (SAINT probability above 0.9, FC score above 4) are clearly enriched for interactions previously reported in the literature (Fig. 5.3c - d). The CRAPome interface provides, separately for each analyzed bait and for all baits combined, a ROC and a histogram view (with mouse-over function) which enables the user to explore the reported interactions at different scores for SAINT or FC, assisting in establishing appropriate thresholds.

We next tested whether the controls deposited in the CRAPome could be used for scoring interactions in the absence of the user controls. While we recommend always using at least some user controls for scoring interactions, there are certainly cases where such controls do not appropriately model the background. Controls from the repository were thus selected on the basis of the controlled vocabularies and protocols. We identified two relevant control sets from two different laboratories that fulfilled our criteria (HEK293 cells, FLAG tag, single step purification on M2 agarose) which contained 10 (Set 1; CRAPome protocol #56) and 11 (Set 2; CRAPome protocol #26) experiments, respectively. Using ROC analysis, we show that each of these sets of controls performs very similarly to the user controls both in SAINT (Fig. 5.3e) and FC (Fig. 5.3f) calculations.

In summary, our processing pipeline introduces a new scoring tool based on Fold Change enrichment, which is used in the CRAPome in parallel to SAINT scoring. A graphical view of the data (similar to the ROC analysis shown in Fig. 5.3a) enables the user to identify which scoring scheme is more appropriate for their analysis. Lastly, when too few (or no) interactions for a bait of interest are reported in iRefIndex, it may be prudent to use SAINT as the main scoring tool, to include several user controls, and to use a stringent cut-off (probability 0.9 or greater) for reporting the interactions.

Figure 5.3. Scoring functions in the CRAPome illustrated on a four bait dataset (MEPCE, EIF4A2, WASL, RAF1, see text for details; 8 experiments). (a) A simple Fold Change score (FC-A) performs nearly as well as SAINT for scoring known interactions using negative control runs (n = 6) provided by the user; ROC based on the interactions in iRefIndex. Note that when SAINT scores are identical, ties are broken by the FC score. Selected SAINT probability or FC score thresholds are represented by triangles and circles, respectively. (b) FC and SAINT follow the same trend in most cases for scoring individual protein interactions; the relation between SAINT probability and FC score is well represented by a sigmoid function (dashed curve). (c - d) Histogram visualization of the data presented in (b) can help with data exploration and threshold selection. (e - f) Controls from the CRAPome are able to identify true interactors through SAINT (e) and FC (f): User controls (n = 6) are compared to two sets of controls from CRAPome, both selected based on the controlled vocabularies (Set 1 = 10 controls; Set 2 = 11 controls).

5.3.5. A more stringent FC score helps capture “rare” contaminants

If user controls are available in sufficient numbers, the need for additional controls may not be apparent. However, there are a number of issues that may arise in affinity purification that affect scoring. For example, some contaminants appear infrequently across negative control runs, or across batches of purifications. These are normally “diluted out” when multiple experiments are used for fold change calculation, or even SAINT analysis (Fig. 5.4a). To assist in the identification of these possible contaminants, we suggest two options: 1) run SAINT with > 10 selected controls (possibly supplementing user controls with selected CRAPome controls). SAINT will automatically compress the data to 10 “virtual” controls in which the maximum spectral counts for each protein will be reported; 2) supplement normal scoring (using SAINT or the primary FC score) with a secondary more conservative fold change score, which has been designed specifically for this type of data. This conservative fold change score samples across all selected controls (both user-provided and CRAPome controls) to select the maximum three spectral count values for each protein detected across the controls; these values are averaged and used for the fold change calculation (Fig. 5.4a). To provide additional stringency in the scoring of a bait-prey interaction, we have also implemented a combination of the scores for the individual replicates of a bait purification by performing a geometric mean calculation instead of the standard averaging normally applied.

Figure 5.4. Use of a more stringent Fold Change score to recover true interacting partners for ORC2L. (a) Schematic illustration of the consequences of averaging all spectral counts as opposed to selecting the top three maximal values for scoring protein-protein interactions. Here, protein A represents a contaminant in the purification scheme that is detected with variable counts across the 15 selected controls (the intensity of shading is proportional to the spectral counts). By contrast, protein B is a contaminant detected with similar counts across all selected controls. The standard primary Fold Change calculation FC-A averages the counts across all controls, while the more stringent secondary Fold Change FC-B takes the average of the 3 highest spectral counts as the abundance estimate. The calculated Fold Change results in the FC-A and FC-B conditions are represented schematically, where a larger circle indicates a higher fold change; FC-A and FC-B score protein B similarly, but not protein A. (b) Comparison of SAINT scoring and stringent fold change calculation with good bait samples. Note here that only the top of the map is shown (only the interactions with SAINT ≥ 0.9 are displayed). (c) Comparison of SAINT scoring and stringent fold change calculation with bait samples (ORC2L) contaminated with myosin: the more stringent fold change score FC-B helps in discriminating between true interaction partners and contaminants.

We applied this more stringent scoring scheme to two biological replicates of the bait ORC2L, which, through visual inspection of the results, were found to contain large quantities of myosin. Myosin distinguishes itself from most other cytoskeletal proteins by its contaminant behavior in the CRAPome: it is usually present in small amounts across most controls, but spikes to extremely high abundance in some controls (as in the ORC2L samples). If myosin were isolated alone, this might not be a particularly difficult issue to resolve, but it co-precipitates with a large number of paralogous gene products and interacting proteins that make detection of the true interactors much more difficult. Thus, while SAINT is capable of identifying true interactors in successful experiments as exemplified by EIF4A2 (Fig. 5.4b; see the relatively high agreement between SAINT score and high stringency fold change score), it assigns a high probability to myosins and associated proteins in the ORC2L samples (Fig. 5.4c). By contrast, the conservative fold change score readily distinguishes between these contaminants and true interaction partners (ORC3, ORC4 and ORC5 are in iRefIndex[33], and LRWD1 is reported in PubMed[40]).

Based on these observations – and on the fact that experiments are rarely “perfect” – we automatically calculate two Fold Change scores in the CRAPome workflow 3. By default, the first score (FC-A) performs the default background estimation (averaging normalized spectral count values across selected controls), and combines the bait replicates by averaging the individual fold changes. The second score (FC-B) instead uses by default the more stringent scoring scheme described above, and combines the bait replicates by calculating the geometric mean. The automated graphs shown (with mouse over functionality) in the report page of the CRAPome facilitate exploration of the data.

5.4. Concluding Remarks

We have created the first repository specifically designed to store and annotate AP-MS negative control data, along with new freely accessible web-based analytical tools for the identification of bona fide protein interactions. Here we demonstrate the utility of this resource for discriminating between true and false interactions. While lists of contaminating proteins have been reported in the past[41, 42], including a thorough characterization of selected bead proteomes using SILAC[3], there was no central repository for this type of data and no freely available software tool for using it. In general, obtaining and using AP-MS background lists in supplementary tables (often in PDF format) is inconvenient and time consuming, leading to a general underutilization of this knowledge. This is compounded by the fact that mapping protein accession numbers is difficult, as different groups utilize a variety of protein databases. While laboratories specializing in large scale AP-MS analysis usually have access to bioinformaticians and in-house databases of contaminants, this is certainly not the case in laboratories specializing in other aspects of life science. We therefore designed the CRAPome to facilitate access to a standardized (in terms of ID mapping and search parameters) set of negative control experiments, organized via controlled vocabularies based on experimental considerations. Our goal was to make a freely accessible user interface that is intuitive and informative, even for those who may be new to mass spectrometry. Since we expect the database to keep growing, we have developed a system that can accommodate a large number of control purifications.

While we are currently using spectral counts as the sole quantification tool within the repository, extension of the system to other types of quantification (especially MS1, which is becoming possible as high mass resolution instruments are increasingly being used for AP-MS experiments) may help to further discriminate between background contaminants and true interactors. We expect a constant stream of data to be deposited in the CRAPome, which would partly fulfill the mandate from journals to make data publicly available. While we have restricted the release of the first version of the CRAPome to human and S. cerevisiae data, the system is ready to accommodate data from other species, which will further increase the usefulness of the system. As contributors (both current and new) continue depositing their data in the repository, robustness in scoring will increase, and in-depth characterization of contaminant behavior will be possible. Widespread adoption of the CRAPome (by experimentalists, computational biologists and reviewers alike) will improve the overall quality of AP-MS interaction data, as background contaminants are more easily recognized. This is important given a steady increase in the number of AP-MS results deposited to public repositories, and should also assist with establishing guidelines regarding the scoring and annotation of such data. The CRAPome can be used as a prospective and retrospective tool to analyze AP-MS data, and will be instrumental to curators of protein-protein interaction databases.

5.5. Online Methods

Design and architecture of the CRAPome

The CRAPome interface was developed using Drupal, an open source PHP based web framework, and MySQL and SQLite relational databases. The processing pipeline for adding data to the database, processing user input data, extracting data from the database, computing Fold Change scores, and preparing summary reports was developed using Python and a SQLite database. SAINT analysis[12, 13] is computationally intensive and is executed on a set of dedicated compute nodes. SAINT jobs are managed using Torque, an open source computing resource management system. All SAINT analysis requests are queued and executed on a first come, first served basis. The entire infrastructure is currently hosted on FLUX, the university-wide shared high-performance computing service at the University of Michigan. In addition to professional data backup and system management, its allocation based system allows adding computing nodes to the system should the number of users increase or if additional computing nodes are needed for running SAINT or other computation-heavy steps that may be added in the future. The actual data for each experiment (‘data’; Supplementary Fig. 5.1), such as the protein/gene accession numbers, the sequences of the identified peptides, peptide probabilities, and the spectral counts are stored in a SQLite database. The attributes used to annotate the experimental conditions (‘meta-data’) are stored in a separate MySQL database. The separation of data and meta-data is performed for the convenience of developing the web interface, which allows annotation of experiments (management of meta-data) directly by data contributors, while the processing and management of the data itself is performed by the database administrator.

In order to keep annotation of data consistent, the attributes and values that describe the experimental conditions are predefined. The corpus of these attributes (and their values) is referred to as the “controlled vocabulary” (Supplementary Table 1). In addition to the controlled vocabulary, each experiment deposited in the CRAPome repository is also annotated with a detailed description of the experimental protocol that enables users to obtain additional details about the experiments.

Processing of mass spectrometry data and population of the CRAPome database

Datasets were obtained from the contributing laboratories in the .raw or .mgf file format. The files were converted to the open mzXML file format, and further processed using the X! Tandem/Trans-Proteomic Pipeline (TPP) suite of tools[30, 31, 43]. MS/MS spectra were searched against RefSeq protein sequence database version 47[29] (human) or SGD ORF protein sequence database orf_trans_all.fasta v. 03-Feb-2011 (S. cerevisiae), appended with an equal number of decoy sequences, using X! Tandem[28] with the k-score plug-in. For the purposes of simplicity and uniformity, we developed two standard parameter templates for processing using X! Tandem and TPP, which were applied to data generated on low or high mass accuracy instruments, respectively. MS/MS spectra were searched using a precursor ion mass tolerance of 100 ppm (monoisotopic mass) for high mass accuracy instruments, or a -1 to +4 Da (average mass) window for low mass accuracy instruments. All other database search parameters were identical: cysteine carbamidomethylation (C + 57.0215) and methionine oxidation (M + 15.9949) were specified as variable modifications. The search results were processed using PeptideProphet (high mass accuracy data was analyzed using the high mass accuracy binning option), and then further processed using ProteinProphet to create protein summary files. For each experiment, all contributing data (multiple gel band fractions, technical replicates, etc.) were combined to generate a single set of PeptideProphet and ProteinProphet output files (pepXML and protXML files, respectively). One of the submitted datasets[19] consisted of a very large number (300) of negative controls in which proteins were separated using 1D SDS-PAGE. In a fraction of these experiments, only selected bands were analyzed using MS. To avoid the problem of data inconsistency due to missing MS data for a subset of gel fractions, and to reduce the total number of entries in the CRAPome representing this dataset, the individual experiments from the dataset were combined to generate 10 composite experiments (protocol #66; experiments CC185 – CC194).

To build the CRAPome database, spectral counts were extracted from protXML files using an in-house software tool. For each protein in the protXML file, peptide to spectrum matches with a probability greater than or equal to 0.9 were extracted. The cumulative sum of the spectral assignments for these peptides constituted the spectral count for the corresponding protein. The spectral count was computed for each protein in the output file regardless of whether peptides mapping to a given protein could also map to other proteins. We note that this represents a deviation from the conventional approach of performing stringent false discovery rate (FDR) filtering and removing redundant or inconclusive (i.e. not supported by unique peptides) protein identifications[44] (the results of such stringent filtering are described below, see Global analysis and reduced gene counts section). The liberal approach for creating protein summaries for each experiment taken here in fact enables a conservative approach for scoring protein interactions. As discussed in[8], it ensures that the spectral counts of proteins from homologous families such as keratins, tubulins, and actins are not underestimated due to the ambiguities related to the identification of shared peptides. Finally, RefSeq protein accession numbers are mapped to official gene identification numbers using Ensembl Biomart tools and displayed as corresponding gene symbols (entries with NP accession numbers only; proteins with XP numbers and those with NP accession numbers that cannot be mapped to gene symbols are presently not visible in the database). When multiple proteins map to the same gene entry, the maximum spectral count among these proteins is selected as the spectral count for that gene. These data provide the basis of the CRAPome as accessible online, and were used to calculate ‘redundant gene counts’ shown in Fig. 5.1b.
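
The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tuple layout of the peptide to spectrum matches and the accession strings are hypothetical, and the in-house tool itself is not reproduced here.

```python
from collections import defaultdict

MIN_PSM_PROBABILITY = 0.9  # threshold stated in the text


def protein_spectral_counts(psms):
    """Count, per protein, the PSMs with probability >= 0.9.

    `psms` is an iterable of (peptide, probability, proteins) tuples
    (hypothetical layout). A shared peptide is counted for every protein
    it maps to, which is the "liberal" approach described in the text.
    """
    counts = defaultdict(int)
    for _peptide, probability, proteins in psms:
        if probability >= MIN_PSM_PROBABILITY:
            for protein in proteins:
                counts[protein] += 1
    return dict(counts)


def gene_spectral_counts(protein_counts, protein_to_gene):
    """Collapse proteins to genes, keeping the maximum count per gene."""
    gene_counts = {}
    for protein, count in protein_counts.items():
        gene = protein_to_gene.get(protein)
        if gene is None:
            # Accessions without a gene symbol are skipped in this sketch.
            continue
        gene_counts[gene] = max(gene_counts.get(gene, 0), count)
    return gene_counts
```

Taking the maximum rather than the sum across proteins of one gene avoids counting the same spectra several times when isoforms share most of their peptides.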

Quality control

As part of the process of creating the database, the CRAPome administrator performs a quality control check of the database search results. Experiments containing only a few identifications (fewer than 10 gene symbols with non-zero counts) are removed automatically, and experiments with fewer than 50 gene symbols are inspected in more detail. Furthermore, all negative control experiments generated using the same protocol (biological replicates) are inspected for consistency, and inconsistent samples are removed. Lastly, possible carry-over issues are identified and referred to the data depositors for further inspection. Of the 402 experiments submitted to the CRAPome, 42 experiments were excluded based on these quality control steps.
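
The two identification-count thresholds can be written as a small triage function. The sketch below is illustrative only: the function name and the returned labels are assumptions, and the consistency and carry-over checks, which rely on manual inspection, are not modeled.

```python
def triage_experiment(gene_counts, remove_below=10, inspect_below=50):
    """Classify an experiment by its number of identified gene symbols.

    `gene_counts` maps gene symbol -> spectral count. Only genes with a
    non-zero count are considered identified. The labels "remove",
    "inspect" and "keep" are hypothetical names for the three outcomes.
    """
    identified = sum(1 for count in gene_counts.values() if count > 0)
    if identified < remove_below:
        return "remove"   # removed automatically
    if identified < inspect_below:
        return "inspect"  # flagged for closer manual inspection
    return "keep"
```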

Global analysis and reduced gene counts

To allow a more informative analysis of the contaminant profiles and comparison with other data, all pepXML and protXML files generated as described above were processed using a more conventional set of filtering thresholds. All pepXML files used to generate the human data subset of the CRAPome repository (343 files) were processed together using ProteinProphet to generate a single protein summary file (protXML file). The combined protXML file, as well as the pepXML and protXML files for each individual experiment, were then processed using ABACUS[45] to generate a joint spectral count matrix using default parameters (accepting proteins with at least one peptide having PeptideProphet probability of 0.99 or greater, and protein probability as computed by ProteinProphet of 0.9 or greater). Each row in the filtered ABACUS file represented a protein group based on the combined protXML file, with a single accession number selected among indistinguishable protein entries forming that group. Spectral counts for the representative proteins were extracted from pepXML files for each individual experiment. The false discovery rate (FDR) for the combined protein list was less than 1% as estimated using decoy counts. The resulting spectral count matrix was used to compute similarity scores to generate the clustergram (see below), and to analyze the global properties of the data such as frequency of identification across the entire dataset (Fig. 5.1b, ‘reduced gene count’).

Gene Ontology (GO) enrichment analysis was performed on the reduced list, considering only the top 25% most abundant proteins in each experiment (1427 genes in total). The analysis was done using the online DAVID tool[46], restricting the analysis to level 3 biological process (BP), molecular function (MF) or cellular component (CC) terms.

To generate the clustergram (Fig. 5.2d), we first computed experiment-experiment similarity scores using cosine function from square root transformed spectral counts (data from protocol #66[19], was excluded from this analysis; see above). For computing the final clustergram, we required that each experiment had at least 2 additional experiments with a similarity score of 0.7 or higher. The final clustergram was generated using Cluster 3.0 software[47], with single linkage clustering using Pearson correlation (uncentered) as the similarity measure. The clustergram was visualized using TreeView software[48].
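
The similarity score and the pre-filter applied before clustering can be sketched as follows. The function names and the dictionary input are assumptions made for the example, and the filter is applied in a single pass, which is one possible reading of the criterion; the clustering itself (Cluster 3.0) is not reproduced.

```python
import math


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two experiments on sqrt-transformed counts.

    Each argument maps gene -> spectral count; missing genes count as 0.
    """
    genes = set(counts_a) | set(counts_b)
    a = [math.sqrt(counts_a.get(g, 0)) for g in genes]
    b = [math.sqrt(counts_b.get(g, 0)) for g in genes]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def experiments_to_cluster(experiments, min_similarity=0.7, min_neighbors=2):
    """Keep experiments with >= 2 other experiments at similarity >= 0.7.

    `experiments` maps experiment name -> {gene: spectral count}.
    """
    names = list(experiments)
    kept = []
    for name in names:
        neighbors = sum(
            1
            for other in names
            if other != name
            and cosine_similarity(experiments[name], experiments[other])
            >= min_similarity
        )
        if neighbors >= min_neighbors:
            kept.append(name)
    return kept
```

The square root transformation dampens the influence of a few very abundant proteins, so that the similarity reflects the overall contaminant profile rather than one or two dominant entries.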

Calculation of protein abundance in HEK293 cells and of background contaminant propensity based on abundance

To generate the list of proteins and protein abundances in the HEK293 whole cell lysate, we used publicly available data taken from[34]. Raw mass spectrometry data for this cell line were downloaded from the Tranche data exchange system (https://proteomecommons.org) using the hash specified in the original manuscript. Data were processed as described above (Global analysis and reduced gene counts). For each identified protein (representative protein per group, see above) in the filtered ABACUS file, the spectral count summed across the 4 biological replicates was taken as a measure of the protein abundance in the cell line. A global histogram of protein abundances was then generated by binning (Fig. 5.2a). The background contaminant propensity was then calculated as the fraction of proteins identified in the HEK293 cell line in each spectral count bin that were also detected in at least one HEK293 experiment in the CRAPome. For this comparison, we selected CRAPome experiments having the ‘Cell Line’ attribute value ‘HEK293’ only and queried protein accession numbers identified in the HEK293 whole cell lysate against the CRAPome HEK293 identified proteins. We then plotted the “fraction in CRAPome” as a function of protein abundance (binned spectral counts).
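
The “fraction in CRAPome” calculation can be sketched as below. The function name, the half-open bins and the bin edges passed by the caller are assumptions; the bin boundaries actually used for Fig. 5.2a are not stated here.

```python
def fraction_in_crapome(lysate_counts, crapome_proteins, bin_edges):
    """Per abundance bin, the fraction of lysate proteins seen in the CRAPome.

    lysate_counts    : {protein: summed spectral count in the lysate}
    crapome_proteins : set of proteins detected in >= 1 HEK293 control
    bin_edges        : increasing list of edges; bins are [low, high)
    Returns a list of (low, high, fraction); fraction is None for empty bins.
    """
    results = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [p for p, c in lysate_counts.items() if low <= c < high]
        if not in_bin:
            results.append((low, high, None))
            continue
        hits = sum(1 for p in in_bin if p in crapome_proteins)
        results.append((low, high, hits / len(in_bin)))
    return results
```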

Data formats

When querying the database to view contaminant profiles for selected proteins of interest (workflow 1), proteins can be referenced using a variety of identifiers: RefSeq protein ID, Ensembl protein ID, NCBI Gene ID, Uniprot entry name, Uniprot entry ID, HGNC gene symbol (human) or SGD ID (S. cerevisiae). All input identifiers are internally mapped to HGNC Gene Symbols (human) or SGD Standard Name (S. cerevisiae). The mapping is based on the Ensembl mappings downloaded from Biomart[49] on a regular basis. Input data for CRAPome analysis in workflow 3 can be uploaded using any of the mapped IDs. Prior to uploading data, the input file needs to be formatted to contain four columns: Bait Name, AP Name, Prey Name, and Spectral count. Each row in this file lists the spectral count (Spectral count column) for one protein (referenced in the Prey Name column) in a purification with a particular bait protein (the bait protein identifier is referenced in the Bait Name column). When multiple biological replicates for the same bait are available, they are distinguished using different text strings in the AP Name column (e.g. ‘R1’, ‘R2’, etc.). The negative control runs are specified by the text string ‘CONTROL’ in the Bait Name column (and named differently in the AP Name column, e.g. ‘UC1’, ‘UC2’, etc.).
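
A reader for the four-column input file might look like the following sketch. The tab delimiter, the absence of a header row, integer spectral counts and the function name are all assumptions made for the example; only the column order and the ‘CONTROL’ convention come from the description above. The prey name and counts in the usage lines are invented.

```python
import csv
import io
from collections import defaultdict


def parse_input(text, delimiter="\t"):
    """Parse rows of: Bait Name, AP Name, Prey Name, Spectral count.

    Returns (baits, controls) where
      baits[bait][ap_name][prey] = spectral count
      controls[ap_name][prey]    = spectral count
    Rows whose Bait Name is 'CONTROL' are treated as negative controls.
    """
    baits = defaultdict(lambda: defaultdict(dict))
    controls = defaultdict(dict)
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        bait, ap_name, prey, count = row
        target = controls if bait == "CONTROL" else baits[bait]
        target[ap_name][prey] = int(count)
    return baits, controls


# Usage with invented values:
example = "RAF1\tR1\tPREY_A\t12\nRAF1\tR2\tPREY_A\t9\nCONTROL\tUC1\tPREY_A\t1\n"
example_baits, example_controls = parse_input(example)
```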

AP-MS test data

Cloning and expression of eIF4A2, RAF1 and MEPCE have been previously described[18]. WASL and ORC2L were amplified by PCR from Mammalian Gene Collection constructs BC052955 and BC014834 respectively, and cloned into pcDNA5-FRT-FLAG (using EcoRI/NotI for WASL, and AscI/NotI for ORC2L), and the junctions sequenced. The resulting vectors were stably co-transfected with the Flp-recombinase expressing vector pOG44 into Flp-In T-REx 293 cells (Invitrogen). Selection of stable transformants (single clones), clonal expansion, induction of protein expression and affinity purification coupled to mass spectrometry were performed essentially as described in[18], using FLAG M2 agarose beads (Sigma). Two biological replicate analyses of each bait were performed, alongside six negative controls (cells expressing the tag alone). All samples were analyzed on an LTQ mass spectrometer coupled to an online C18 reversed phase column. The detailed protocol is #48 on www.crapome.org. The mass spectrometry data was searched using the X! Tandem/TPP/ABACUS pipeline and settings as described in Global analysis and reduced gene counts. The filtered ABACUS file was formatted for CRAPome as described above using an in-house tool. Data were uploaded to the CRAPome (workflow 3). Two sets of additional controls (Set 1 and Set 2, see main text for detail) were selected and used alongside the user controls. SAINT and Fold Change scores were generated using different settings (see main text and below). The ORC2L bait was processed in a similar way and uploaded for analysis to the CRAPome separately (it was not used for comparison between SAINT and FC scores shown in Fig. 5.3). The resulting input data matrices for eIF4A2, RAF1, MEPCE, and WASL baits and the six user controls, as well as for ORC2L and the same user controls, can be downloaded from the CRAPome website.

Interaction scoring: SAINT

SAINT was implemented as described in[12]. Here the data was analyzed using default CRAPome SAINT options (lowMode=0, minFold=0, norm=1). In general, SAINT performance varies depending on the choice of options, especially minFold (requiring a certain minimum fold change as a part of probability calculation) and norm (normalization to the total spectral count in each experiment). SAINT run with the options specified above slightly outperforms SAINT run with other options on these data (Supplementary Fig. 5.2). When the bait protein is analyzed in multiple biological replicates, SAINT probabilities computed independently for each bait replicate are averaged, and the average probability (AveP) is reported as the final SAINT score. For in-depth discussion of these options see[13]. The CRAPome also allows alternative specifications for combining biological replicates (e.g., geometric mean as a more conservative approach).

Interaction scoring: Fold Change

The primary Fold Change score (FC-A, or just FC in the main text) for each bait-prey interaction pair is defined as the ratio of the normalized spectral count of protein i in purification with bait j, $T_{i,j}$, and the average normalized spectral count of that protein across the negative controls (user controls or selected CRAPome controls), $C_i$:

$$FC_{i,j} = \frac{T_{i,j} + \alpha}{C_i + \alpha}$$

The normalized spectral counts are computed as $T_{i,j} = SC_{i,j}/N_j$, where the normalization factor is the sum over all proteins identified in the experiment with bait j, $N_j = \sum_i SC_{i,j}$. Similarly, the counts are normalized in each negative control experiment $x = 1 \dots n$, $C_{i,x} = SC_{i,x}/N_x$, prior to computing the averaged normalized count across all n controls, $C_i = \frac{1}{n} \sum_{x=1}^{n} C_{i,x}$. A small background factor $\alpha$ is added to prevent division by zero, calculated as $1/\min(N_x)$, where $\min(N_x)$ is the smallest normalization factor across all n negative controls. When the bait protein is analyzed in multiple biological replicates, the Fold Change scores computed independently for each bait replicate are averaged to arrive at the final FC score.
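As a worked illustration of the FC-A definition, the following Python sketch computes the score from raw spectral counts. It is a minimal sketch and not the CRAPome implementation: the function names and the toy counts are hypothetical, and a protein that is absent from a run is assumed to have a spectral count of zero.

```python
from statistics import mean


def normalize(counts):
    """Return SC / N for one run, where N is the sum of all spectral counts."""
    total = sum(counts.values())
    return {protein: sc / total for protein, sc in counts.items()}


def fold_change_a(bait_replicates, controls):
    """Compute FC-A for every prey seen in any bait replicate."""
    # alpha = 1 / min(N_x): N_x is the total spectral count of control x.
    alpha = 1.0 / min(sum(c.values()) for c in controls)
    norm_baits = [normalize(r) for r in bait_replicates]
    norm_controls = [normalize(c) for c in controls]
    preys = set().union(*norm_baits)
    scores = {}
    for prey in preys:
        # C_i: mean normalized count over all n controls (absent = 0).
        c_i = mean(nc.get(prey, 0.0) for nc in norm_controls)
        # One fold change per bait replicate, then the arithmetic mean.
        per_replicate = [
            (nb.get(prey, 0.0) + alpha) / (c_i + alpha) for nb in norm_baits
        ]
        scores[prey] = mean(per_replicate)
    return scores


# Toy spectral counts (made up for illustration only).
bait_replicates = [{"PREY1": 30, "HSPA8": 70}, {"PREY1": 50, "HSPA8": 50}]
controls = [{"HSPA8": 100}, {"HSPA8": 200}]
scores = fold_change_a(bait_replicates, controls)
```

With these toy numbers $\alpha = 1/100$, so the prey that never appears in the controls receives a large fold change (about 41), whereas the protein that dominates the controls scores below 1.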

The secondary, more conservative score (FC-B) is computed as described above, except that Ci is computed by averaging the 3 highest normalized spectral counts across all controls (by default, using the combined set of selected CRAPome controls and the user controls, when available). Furthermore, in the case of biological replicates for the bait protein, the final FC-B score is computed as the geometric mean of the FC scores for each replicate.
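The difference between the two scores can be sketched in Python for a single prey. This is an illustrative sketch under stated assumptions, not the CRAPome code: the function name `fold_change_b`, its parameters, and the example values are hypothetical, and the inputs are assumed to be spectral counts that have already been normalized as described for FC-A.

```python
from statistics import geometric_mean, mean


def fold_change_b(bait_norm_counts, control_norm_counts, alpha, top_k=3):
    """FC-B for one prey.

    bait_norm_counts: normalized count of the prey in each bait replicate.
    control_norm_counts: normalized count of the prey in each control.
    alpha: background factor, assumed to be computed as for FC-A.
    """
    # C_i is the mean of the top_k highest control values, not of all controls.
    top_controls = sorted(control_norm_counts, reverse=True)[:top_k]
    c_i = mean(top_controls)
    per_replicate = [(t + alpha) / (c_i + alpha) for t in bait_norm_counts]
    # Replicates are combined with the geometric mean (all values are > 0
    # because alpha is positive).
    return geometric_mean(per_replicate)


# Hypothetical prey detected in three of six controls.
fc_b = fold_change_b(
    bait_norm_counts=[0.41, 0.83],
    control_norm_counts=[0.0, 0.1, 0.2, 0.3, 0.0, 0.0],
    alpha=0.01,
)
```

Here the three highest control values average 0.2, whereas the mean over all six controls would be 0.1, so the same prey receives a lower score under FC-B than it would under FC-A.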

Comparison to literature data

In order to rapidly benchmark scoring performance and to provide users with a view of the new data within the context of previously published results, a mapping of the interactions to those deposited in the iRefIndex repository[33] is provided within the interface. iRefIndex was selected based on its comprehensiveness in the number of interactions annotated (it aggregates data from primary curation databases), and the relative ease of download and data mapping. For our purposes, we are currently using version 9.0. Each entry from the database is mapped to a pair of genes (interacting proteins) using an in-house mapping tool. Entries identified as “complex” are excluded from this mapping. Due to the uncertain quality of previously reported interactions involving ribosomal proteins, which are among the most common contaminating proteins in AP-MS experiments, we excluded all RPL and RPS proteins from the computation of the ROC curves shown in Fig. 3.
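The two filtering steps described above can be sketched as follows. This is not the in-house mapping tool: the record layout (`gene_a`, `gene_b`, `type`), the function names, and the toy records are hypothetical and do not reflect the actual iRefIndex file format. Matching ribosomal proteins by the gene symbol prefixes RPL and RPS, while keeping the RPS6K kinases, is likewise an assumption made for this example.

```python
def is_ribosomal(symbol):
    """Heuristic: symbols starting with RPL or RPS, except RPS6K kinases."""
    s = symbol.upper()
    return s.startswith(("RPL", "RPS")) and not s.startswith("RPS6K")


def literature_pairs(records):
    """Map database entries to unordered gene pairs, skipping 'complex' entries."""
    pairs = set()
    for rec in records:
        if rec["type"] == "complex":
            continue
        # frozenset makes A-B and B-A the same pair.
        pairs.add(frozenset((rec["gene_a"], rec["gene_b"])))
    return pairs


def benchmark_pairs(pairs):
    """Drop pairs involving ribosomal proteins before computing ROC curves."""
    return {p for p in pairs if not any(is_ribosomal(g) for g in p)}


# Toy records, made up for illustration and not taken from iRefIndex.
records = [
    {"gene_a": "RAF1", "gene_b": "YWHAZ", "type": "binary"},
    {"gene_a": "YWHAZ", "gene_b": "RAF1", "type": "binary"},
    {"gene_a": "RAF1", "gene_b": "RPS3", "type": "binary"},
    {"gene_a": "ORC2", "gene_b": "ORC3", "type": "complex"},
    {"gene_a": "EIF4A2", "gene_b": "RPS6KB1", "type": "binary"},
]
mapped = literature_pairs(records)
benchmark = benchmark_pairs(mapped)
```

Keeping the two steps separate mirrors the text: "complex" entries never enter the mapping, while ribosomal proteins are only removed for the benchmark.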

Access to the database

The CRAPome can be accessed at www.crapome.org. Once the manuscript is published, no registration will be required to access workflows 1 and 2. Registration will be required for users to analyze their own data in workflow 3 and will enhance the functionality of workflow 2. Registration allows the users to save selected lists of controls (Fig. 5.1c) and their previously uploaded data. Users can also access their previously analyzed data.

5.6. References

1. Gingras, A.C., M. Gstaiger, B. Raught, and R. Aebersold, Analysis of protein complexes using mass spectrometry. Nature reviews. Molecular cell biology, 8, 645-54 (2007).
2. Selbach, M. and M. Mann, Protein interaction screening by quantitative immunoprecipitation combined with knockdown (QUICK). Nature methods, 3, 981-3 (2006).
3. Trinkle-Mulcahy, L., et al., Identifying specific protein interaction partners using quantitative mass spectrometry and bead proteomes. The Journal of cell biology, 183, 223-39 (2008).
4. Trinkle-Mulcahy, L., Resolving protein interactions and complexes by affinity purification followed by label-based quantitative mass spectrometry. Proteomics, 12, 1623-38 (2012).
5. Tackett, A.J., et al., I-DIRT, a general method for distinguishing between specific and nonspecific protein interactions. Journal of proteome research, 4, 1752-6 (2005).
6. Dunham, W.H., M. Mullin, and A.C. Gingras, Affinity-purification coupled to mass spectrometry: basic principles and strategies. Proteomics, 12, 1576-90 (2012).
7. Hubner, N.C., et al., Quantitative proteomics combined with BAC TransgeneOmics reveals in vivo protein interactions. The Journal of cell biology, 189, 739-54 (2010).
8. Nesvizhskii, A.I., Computational and informatics strategies for identification of specific protein interaction partners in affinity purification mass spectrometry experiments. Proteomics, 12, 1639-55 (2012).
9. Sardiu, M.E., et al., Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proceedings of the National Academy of Sciences of the United States of America, 105, 1454-9 (2008).
10. Skarra, D.V., et al., Label-free quantitative proteomics and SAINT analysis enable interactome mapping for the human Ser/Thr protein phosphatase 5. Proteomics, 11, 1508-16 (2011).
11. Breitkreutz, A., et al., A global protein kinase and phosphatase interaction network in yeast. Science, 328, 1043-6 (2010).
12. Choi, H., et al., SAINT: probabilistic scoring of affinity purification-mass spectrometry data. Nature methods, 8, 70-3 (2011).
13. Choi, H., et al., Analyzing protein-protein interactions from affinity purification-mass spectrometry data with SAINT. Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.], Chapter 8, Unit8 15 (2012).
14. Al-Hakim, A.K., M. Bashkurov, A.C. Gingras, D. Durocher, and L. Pelletier, Interaction proteomics identify NEURL4 and the HECT E3 ligase HERC2 as novel modulators of centrosome architecture. Molecular & cellular proteomics : MCP, 11, M111 014233 (2012).
15. Chen, G.I., et al., PP4R4/KIAA1622 forms a novel stable cytosolic complex with phosphoprotein phosphatase 4. The Journal of biological chemistry, 283, 29273-84 (2008).
16. Cristea, I.M., R. Williams, B.T. Chait, and M.P. Rout, Fluorescent proteins as proteomic probes. Molecular & cellular proteomics : MCP, 4, 1933-41 (2005).
17. Daniels, D.L., et al., Examining the complexity of human RNA polymerase complexes using HaloTag technology coupled to label free quantitative proteomics. Journal of proteome research, 11, 564-75 (2012).
18. Dunham, W.H., et al., A cost-benefit analysis of multidimensional fractionation of affinity purification-mass spectrometry samples. Proteomics, 11, 2603-12 (2011).
19. Ewing, R.M., et al., Large-scale mapping of human protein-protein interactions by mass spectrometry. Molecular systems biology, 3, 89 (2007).
20. Forget, D., et al., The protein interaction network of the human transcription machinery reveals a role for the conserved GTPase RPAP4/GPN1 and microtubule assembly in nuclear import and biogenesis of RNA polymerase II. Molecular & cellular proteomics : MCP, 9, 2827-39 (2010).
21. Goudreault, M., et al., A PP2A phosphatase high density interaction network identifies a novel striatin-interacting phosphatase and kinase complex linked to the cerebral cavernous malformation 3 (CCM3) protein. Molecular & cellular proteomics : MCP, 8, 157-71 (2009).
22. Kean, M.J., et al., Structure-function analysis of core STRIPAK Proteins: a signaling complex implicated in Golgi polarization. The Journal of biological chemistry, 286, 25065-75 (2011).
23. Kruiswijk, F., et al., Coupled activation and degradation of eEF2K regulates protein synthesis in response to genotoxic stress. Science signaling, 5, ra40 (2012).
24. Sato, S., et al., A set of consensus mammalian mediator subunits identified by multidimensional protein identification technology. Molecular cell, 14, 685-91 (2004).
25. de Lau, W., et al., Lgr5 homologues associate with Wnt receptors and mediate R-spondin signalling. Nature, 476, 293-7 (2011).
26. Greco, T.M., F. Yu, A.J. Guise, and I.M. Cristea, Nuclear import of histone deacetylase 5 by requisite nuclear localization signal phosphorylation. Molecular & cellular proteomics : MCP, 10, M110 004317 (2011).
27. Tsai, Y.C., T.M. Greco, A. Boonmee, Y. Miteva, and I.M. Cristea, Functional proteomics establishes the interaction of SIRT7 with chromatin remodeling complexes and expands its role in regulation of RNA polymerase I transcription. Molecular & cellular proteomics : MCP, 11, M111 015156 (2012).
28. Craig, R. and R.C. Beavis, TANDEM: matching proteins with tandem mass spectra. Bioinformatics, 20, 1466-7 (2004).
29. Pruitt, K.D., T. Tatusova, G.R. Brown, and D.R. Maglott, NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic acids research, 40, D130-5 (2012).
30. Nesvizhskii, A.I., A. Keller, E. Kolker, and R. Aebersold, A statistical model for identifying proteins by tandem mass spectrometry. Analytical chemistry, 75, 4646-58 (2003).
31. Keller, A., A.I. Nesvizhskii, E. Kolker, and R. Aebersold, Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical chemistry, 74, 5383-92 (2002).
32. Keller, A., J. Eng, N. Zhang, X.J. Li, and R. Aebersold, A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Molecular systems biology, 1, 2005 0017 (2005).
33. Razick, S., G. Magklaras, and I.M. Donaldson, iRefIndex: a consolidated protein interaction database with provenance. BMC bioinformatics, 9, 405 (2008).
34. Thakur, S.S., et al., Deep and highly sensitive proteome coverage by LC-MS/MS without prefractionation. Molecular & cellular proteomics : MCP, 10, M110 003699 (2011).
35. Tzivion, G., Z. Luo, and J. Avruch, A dimeric 14-3-3 protein is an essential cofactor for Raf kinase activity. Nature, 394, 88-92 (1998).
36. Wartmann, M. and R.J. Davis, The native structure of the activated Raf protein kinase is a membrane-bound multi-subunit complex. The Journal of biological chemistry, 269, 6695-701 (1994).
37. Gingras, A.C., B. Raught, and N. Sonenberg, eIF4 initiation factors: effectors of mRNA recruitment to ribosomes and regulators of translation. Annual review of biochemistry, 68, 913-63 (1999).
38. Miki, H., K. Miura, and T. Takenawa, N-WASP, a novel actin-depolymerizing protein, regulates the cortical cytoskeletal rearrangement in a PIP2-dependent manner downstream of tyrosine kinases. The EMBO journal, 15, 5326-35 (1996).
39. Jeronimo, C., et al., Systematic analysis of the protein interaction network for the human transcription machinery reveals the identity of the 7SK capping enzyme. Molecular cell, 27, 262-74 (2007).
40. Shen, Z., et al., A WD-repeat protein stabilizes ORC binding to chromatin. Molecular cell, 40, 99-111 (2010).
41. Chen, G.I. and A.C. Gingras, Affinity-purification mass spectrometry (AP-MS) of serine/threonine phosphatases. Methods, 42, 298-305 (2007).
42. Gingras, A.C., et al., A novel, evolutionarily conserved protein phosphatase complex involved in cisplatin sensitivity. Molecular & cellular proteomics : MCP, 4, 1725-40 (2005).
43. Deutsch, E.W., et al., A guided tour of the Trans-Proteomic Pipeline. Proteomics, 10, 1150-9 (2010).
44. Nesvizhskii, A.I. and R. Aebersold, Interpretation of shotgun proteomic data: the protein inference problem. Molecular & cellular proteomics : MCP, 4, 1419-40 (2005).
45. Fermin, D., V. Basrur, A.K. Yocum, and A.I. Nesvizhskii, Abacus: a computational tool for extracting and pre-processing spectral count data for label-free quantitative proteomic analysis. Proteomics, 11, 1340-5 (2011).
46. Huang da, W., B.T. Sherman, and R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols, 4, 44-57 (2009).
47. de Hoon, M.J., S. Imoto, J. Nolan, and S. Miyano, Open source clustering software. Bioinformatics, 20, 1453-4 (2004).
48. Page, R.D., TreeView: an application to display phylogenetic trees on personal computers. Computer applications in the biosciences : CABIOS, 12, 357-8 (1996).
49. Kasprzyk, A., BioMart: driving a paradigm change in biological data management. Database : the journal of biological databases and curation, 2011, bar049 (2011).

5.7. Supplementary materials

All supplementary materials are available on the attached CD.

Chapter 6 Conclusions

Within the scope of this thesis we showed that our AP-MS approach is highly reproducible, sensitive and applicable to large-scale experiments. Filtering of contaminant proteins is important to ensure high-quality AP-MS data, and it typically requires statistical evaluation on the basis of numerous control purification experiments. To support the broader community in addressing this problem, we undertook an effort to compile a repository of control purifications, called the CRAPome, to further improve the overall quality of protein interaction data in public databases. We applied these advances to two biologically motivated large-scale AP-MS studies (Hpo and PcG) and obtained interaction data of unprecedented resolution.

Acknowledgments

This page is dedicated to all the great people who accompanied me during my PhD thesis. I very much enjoyed the time spent with the professional, creative, but also fun-loving and friendly people I had the luck to cross paths with.

First and foremost I would like to thank Matthias Gstaiger and Ruedi Aebersold, who gave me the opportunity to dive into the field of proteomics and mass spectrometry. This thesis would not have been possible without their guidance and support. I consider myself lucky to have worked in such a highly acclaimed laboratory and to have been given the great learning experience of maintaining an Orbitrap mass spectrometer.

I would like to thank the members of my PhD committee, Nic Tapon and Christian Beisel. Nic flew in from London to share his great knowledge of the Hpo pathway, and I could visit Christian in Basel whenever I had questions or collaborative discussions.

Of course I thank all my colleagues here at the IMSB, starting with my lab mates. To me this was the best part of my PhD thesis: meeting all these great people from many different cultures and backgrounds, yet united in the nerddom of scientific research. I know I have met many kindred spirits during my time here. To my good friend Audrey, who supported me so much throughout all my fun, sad and sometimes crazy days: thank you very much. Thanks, Ben Collins, for being so funny yet highly professional. Thank you, Anton Vychalkovskiy, for allowing me to speak with a Russian accent so much. Thank you, Petri Kouvonen, for teaching me about Salmiakki and Kalsarikänni. And thank you, H. Alex Ebhardt, for being you.

I would also like to thank (in random order) Lucia Espona, Chris Barnes, Bernd Wollscheid, Florian Stengel, Alexander Leitner, Alessio Maiolica, Stefania Vaga, Martin Jünger, Karel Novy, Yansheng Liu, Olga Schubert, Thomas Walzthöni, Alexander Wepf, Ruth Hüttenhain, Ariel Bensimon, Lars Malström, Hansjörg Möst, Etienne Caron, Marco Faini, Nadine Sobotzki, Tiannan Guo, Yibo Wu, Ema Milani, Jeppe Mouritsen, Pouya Faridi, Ludo Gillet, Andy Frei, and all I might have missed. Thank you all, it was great working with you.

To Markku Varjosalo and Salla Keskitalo, whom I met here at the IMSB and who became our friends in the far North: thank you for making us dog people.

I would like to thank my parents Beatrix and Karl Hauri and my brothers Sacha and Serge, who always supported me in all my pursuits.

And finally I would like to thank the love of my life, Julie, who took my mind off science sometimes and helped me relax with our two dachshunds, Freddie and Baxter.
