Large-Scale Models of Signal Propagation in Human Cells Derived from Discovery Phosphoproteomic Data

ARTICLE

Received 17 Jun 2015 | Accepted 9 Jul 2015 | Published 10 Sep 2015 DOI: 10.1038/ncomms9033 OPEN Large-scale models of signal propagation in human cells derived from discovery phosphoproteomic data

Camille D.A. Terfve1, Edmund H. Wilkes2, Pedro Casado2, Pedro R. Cutillas2 & Julio Saez-Rodriguez1,w

Mass spectrometry is widely used to probe the proteome and its modiﬁcations in an untargeted manner, with unrivalled coverage. Applied to phosphoproteomics, it has tremendous potential to interrogate phospho-signalling and its therapeutic implications. However, this task is complicated by issues of undersampling of the phosphoproteome and challenges stemming from its high-content but low-sample-throughput nature. Hence, methods using such data to reconstruct signalling networks have been limited to restricted data sets and insights (for example, groups of kinases likely to be active in a sample). We propose a new method to handle high-content discovery phosphoproteomics data on perturbation by putting it in the context of kinase/phosphatase-substrate knowledge, from which we derive and train logic models. We show, on a data set obtained through perturbations of cancer cells with small-molecule inhibitors, that this method can study the targets and effects of kinase inhibitors, and reconcile insights obtained from multiple data sets, a common issue with these data.

1 European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. 2 Integrative Cell Signalling and Proteomics, Centre for Haemato-Oncology, Barts Cancer Institute, Queen Mary University of London, John Vane Science Centre, Charterhouse Square, London EC1M 6BQ, UK. w Present address: Joint Research Center for Computational Biomedicine (JRC-COMBINE), RWTH- Aachen University Hospital MTI2 Wendlingweg, Aachen 2 D-52074, Germany. Correspondence and requests for materials should be addressed to J.S.-R. (email: [email protected]).

NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications 1 & 2015 Macmillan Publishers Limited. All rights reserved. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033

ignificant technical and data-processing advances have relationships, complexity of combinatorial and context-specific allowed shotgun (discovery) mass spectrometry (MS), the regulation, and limitations in our knowledge of both direct and Smost frequently used MS proteomics strategy, to routinely indirect effects of the molecular tools used12–15. Together, achieve a high degree of coverage of the proteome and modified these form a complex set-up with uncertainties at many levels, (for example, phosphorylated) proteome, with ever-improving the like of which is increasingly successfully handled with quantitative accuracy1–3. However, owing to the high redundancy statistical and network-modelling approaches (see for example, and extreme complexity of proteome samples, the full Ideker and Krogan16, and Terfve and Saez-Rodriguez17 for spectrum of peptides present is largely undersampled in any reviews). Indeed, the challenges mentioned above (uncertainty in single experiment. Hence, repeated analyses of the same or the data, sparsity of prior knowledge), combined with a scope similar biological samples can show problematically low unmatched by other proteomics technologies, make traditional overlap of identified proteins4–6. This leads to problems of high modelling approaches such as reverse-engineering and missing-data fraction and low reproducibility, especially when knowledge-driven model building largely unsuitable17. using data-dependent acquisition, where simple heuristics are Therefore, analyses of phospho-MS to understand signalling used to select precursors for tandem MS analysis7–11. This an be typically result in a list of modulated abundances, of which some alleviated using strategies by which extracted ion chromatograms can be followed up on, but which fail to interrogate the are constructed for all peptides identified in a set of samples9,12. connections between the elements of a signalling network, In addition, depth of analysis comes at a high cost in terms of despite a clear interest from the community2,8,15,18,19. experimental time, which limits the ability to perform replications In this work, we present a method (PHOsphorylation and analyse many conditions5. Networks for Mass Spectrometry (PHONEMeS)) to analyse Using such phosphoproteomics data (hereafter phospho-MS) changes in phospho-MS data on perturbation in the context of data to investigate signalling by phosphorylation, we are further a network of possible kinase/phosphatase-substrate (K/P-S) faced with problems linked to the specificity of kinase–substrate interactions (Fig. 1). This method combines (i) stringent

Normalize, summarize Gaussian mixture model: statistical significance perturbation status (0/1) Protein extraction Mass digestion spectrometry phospho-enrichment Drug2 Drug3 Drug1 Drug2 Ctrl Density Intensity Ctrl +20 kinase inhibitors Drug1 Drug3

m/z Intensity (pep.i)

1h Condition Intensity estimate (pep.i) A

...

Background network: default sampling weights Bipartite network biochem. relationships Drug Combine, target map, Kinase/phosphatase- filter substrate interactions Data: databases P , P

Sites P C Kinases/ Phosphatases Phosphorylation

Find logical perturbation flow A (not C) upstream of B Stabilised A C E site 1 downstream of A, via B Hypothesis edges Population E site 2 not reached by any drug generation frequencies B score/edge Sample D directly downstream of C, not B frequencies n models Site 1 of D does not influence activity stable ... D Correct Simulate weights & score E Best models

Figure 1 | Overview of the PHONEMeS method. (a) Data. Cells are treated with a panel of kinase inhibitors (Supplementary Table 1), and discovery phospho-MS data are obtained. The data are normalized and a linear model used to estimate the effects (and significance) of each treatment on each peptide. A Gaussian mixture model is fitted for each peptide. Those that show a naturally Boolean behaviour with two populations (a control and a perturbed state) are selected. Each measurement (peptide, condition) is associated with the log ratio of the probability of belonging to either the control or perturbed distribution. See also Fig. 2 and Supplementary Fig. 1. (b) Background network. The data are mapped to a K/P-substrate network combining information from multiple databases (Supplementary Table 2) from which we extract a network of possible paths connecting drug targets and data (Supplementary Fig. 2). (c) Training of the networks is done by iterating through (i) sample Boolean logic models of perturbation flow; (ii) simulating the logic model to steady state; (iii) scoring based on comparison of prediction with the log ratios above; and (iv) correcting sampling weights accordingto frequencies of edges in best models (Supplementary Fig. 2). The resulting most likely paths (within a given tolerance) are used to generate testable hypotheses.

2 NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications & 2015 Macmillan Publishers Limited. All rights reserved. NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033 ARTICLE statistical modelling of perturbation data with (ii) logic model site to determine which sites in our data sets are better captured building and training based on a space of paths from perturbed with two distinct populations of states across the conditions nodes to affected phosphorylation sites compatible with K/P-S observed and the location of the group of conditions where knowledge. Based on a phospho-MS data set acquired on the the site is at its control or perturbed state (see Figs 1a and 2). inhibition of kinases with small molecules, we show that We found that 62.3% of peptides were best captured with a single PHONEMeS is capable of recapitulating known relationships component, indicating that the level of phosphorylation of sites between different perturbed kinases and their substrates. on these peptides is not significantly affected by our kinase Furthermore, it organizes the data in a way that is readily inhibitors. This is expected when using a discovery technology, interpretable as a network of regulatory relationships as since most of the proteome can be assumed to be unaffected opposed to a list of sites affected by the inhibition of a by punctual perturbations such as those used here. Of those particular kinase. We demonstrate the power of this approach peptides that we better captured with more than one population, by modelling the effect of the inhibition of multiple kinases in a the majority (72.5%) showed a behaviour whereby the breast cancer cell line and verify the unexpected prediction that phosphorylation levels lie in either of two populations, one mTOR inhibition affects the function of the cyclin-dependent encompassing the control level and the other assumed to capture kinase CDK2. Finally, using an independent data set (obtained the set of conditions in which the perturbation has reached the with the same cell line but a different set of inhibitors and site (Supplementary Fig. 1). We focused on these 2,376 peptides instruments), we show that placing the data in context with that comply with the Boolean assumption, and computed for each PHONEMeS allows us to reconcile the insights obtained from peptide i in each condition j a single number Sij (the log ratio of two data sets that seem disparate at first sight, as is often the case the probability of belonging to the control versus the perturbed with discovery MS. distribution), which is negative when the data point is more likely to belong to the perturbed distribution, and positive otherwise Results (see Fig. 2). This single number can be used to derive a score that Data processing for perturbation flow modelling. The data used compares candidate networks based on their associated predicted here consist of liquid chromatography-tandem MS (LC-MS/MS) states of measured nodes on perturbation (see below). analysis of phospho-enriched proteomic extracts from MCF7 breast cancer cell line samples exposed to a panel of 20 small-molecule kinase inhibitors targeting multiple growth Context-specific network building and training. The core of pathways (Supplementary Table 1) for 1 h (ref. 20). To obtain PHONEMeS lies in building and training a background network estimates of the effect of each inhibitor on each of the 12,266 that represents a set of possible paths connecting inhibited unique peptides across biological duplicates and technical kinases/phosphatases to the measured phosphosites responding triplicates, as well as a rigorous measure of the reliability of to the perturbation. This aspect is conceptually similar to a these estimates, we applied a linear model with empirical Bayes previously presented methodology for low-content phosphopro- shrinkage of the standard errors, followed by multiple hypothesis teomics23. Manually assembling a network connecting the targets testing correction (see Fig. 1a and Supplementary Fig. 1a). of 20 kinase inhibitors and a couple of thousand perturbed sites is Boolean logic modelling is well suited for computationally obviously unfeasible. Therefore, we compiled a data set of efficient modelling of large-scale networks and has the potential known and predicted K/P-S interactions from multiple databases to capture relationships between species even when the data are of (see Supplementary Table 2), and looked for paths from semiquantitative nature21,22. However, it does make the strong perturbed K/P to data sites in this network of influence. The assumption that the state of species observed can be satisfactorily resulting network (hereafter referred to as ‘background network’) captured with a two-level scheme. As defining a functionally represents the set of paths compatible with knowledge on K/P-S relevant high/low state for each of thousands of phosphosites is relationships, through which the kinase inhibitions can reach the unrealistic, we applied a Gaussian mixture model (GMM) on each perturbed sites (Fig. 1b and Supplementary Fig. 2a–c).

Peptide i

Sij < –0.5 Sij > 0.5 0.6 0.5 -

ij S -

Cond Cond 0.4 j=AKTi1 –0.5 j=MTORi1 Condj =ctrl

Density i FC AKTi1-ctrl FCi MTORi1-ctrl 0.2

0.0 02468a.u. Linear model for peptide i: data point l (6 replicates), experiment k, under condition j: yjkl=expk + condj + r jkl

Figure 2 | Data processing in PHONEMeS. For each peptide i, a linear model is fitted to estimate the effects of each condition j, and the significance of the fold change (FC) versus control is computed for each treatment with a moderated t-statistic. A Gaussian mixture model is fitted for each peptide and those that are best fitted with two components (in our pilot data, 72% of the peptides with multiple components) are kept. Each measurement (peptide i, condition j) is associated with a single number Si,j, the log ratio of the probability of belonging to either the control or perturbed distributions. Peptide i is called ‘perturbed’ under condition j if this number is below À 0.5, and it is considered to be in the control state if this number is above 0.5. Values between À 0.5 and 0.5 are considered undetermined. See also Supplementary Table 3.

NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications 3 & 2015 Macmillan Publishers Limited. All rights reserved. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033

Training is done by iteratively sampling and scoring candidate principle analysis (see Supplementary Note 1 and Supplementary networks based on evidence from the data (see Fig. 1c and Figs 3 and 4), we wanted to determine (i) whether PHONEMeS Supplementary Fig. 2d–g). Candidate logic models are built from had the ability to discriminate incorrect targets and (ii) whether the space of solutions compatible with knowledge, by sampling combining data with knowledge in the form of a network structure incoming hyperedges (that is, sources and their logic combination) brings valuable information to the training process. Indeed, when for each node in the network (Fig. 3d). Each candidate network is looking at paths in the K/P-S network to the sites found in our data then simulated until a steady state is reached under all conditions to be perturbed under inhibition of the kinase mTOR (mammalian considered, and the predicted and observed states of nodes are target of rapamycin), we found that there are two groups of K/P: compared (Supplementary Fig. 2e) by computing a score those that can reach all perturbed sites for a condition, and those X X that cannot reach any (Supplementary Fig. 5). Therefore, any K/P S ¼ S ; À S ; ð1Þ model nodes predicted P i j nodes predicted C signif: P i j in the first group would be a viable drug target candidate in the sense that they have potential paths to all sites perturbed under that rewards correctly predicted perturbations and penalizes that drug treatment. wrongly predicted and missed perturbations. Based on these To answer these, we compared three analyses aiming to capture model scores, we select a family of best-performing models, and the effects of mTOR inhibitors (Fig. 3a). In the first setting, we built update the sampling procedure accordingly (Supplementary and trained models attempting to connect the correct drug target Fig. 2f–g). Training finishes when the average score and sampling (mTOR) to the sites perturbed under mTOR inhibition. In the frequencies of edges for the population of models sampled in each second setting, we instead used the targets of another drug generation stabilize. Multiple independent training results (that is, (MP2K1 and MP2K2, targets of the MEK inhibitors), and in the frequencies of edges in the final population of models) are typically third setting we used the right target but randomized K/P-S combined into a single solution to account for the stochastic nature networks (obtaining by randomly shuffling the K/P across K/P-S of the optimization. The results are visualized as a single network interactions). As we can see from the scores plots (Fig. 3b), where each node is represented with its highest frequency inputs networks based on the wrong information (wrong targets) provide with a user-defined level of tolerance. The tolerance (edges within clearly worse fits (pink–purple curves) than those based on x% of the highest frequency input) is used to assess the extent to plausible targets (blue–turquoise curves). The random network which different areas of the network are constrained by the data. scores (green–red curves) cover a range between these two extremes. These results suggest that an optimization with a Target prediction power and network information content. biologically realistic network but with the incorrect targets Having established the validity of the method using a proof-of- performs worse than a random network, which itself performs

MTOR 7 MP2K1/2 0 Target MTOR MP2K1 MTOR MP2K2

Network 4 7 4 4 Sites P under MTOR inhibition 17 5 Correct target, Incorrect target, Correct target, 13 10 real network real network randomized network

6 MP2K1, MP2K2 8 13 Correct target, real network 10 9 MTORrand MTOR

4 15 5 MTOR 1 Incorrect target, real network

2 6 3 4 1

2 2 0 17 21

Score (population average) n Sites perturbed –2 nn 0 1020304050 3 5

Correct target, randomised network K/P at distance 1 from target Generation Figure 3 | Target prediction and network information content. (a) We performed three sets of optimizations to model the effect of mTOR inhibition: (i) with real K/P interactions and the expected drug target (mTOR); (ii) with the real interactions but another drug’s targets (MP2K1, MP2K2); and (iii) with three separate randomized sets of K/P interactions and the expected target. (b) Population average scores for each of the two mTOR inhibitors. (c) Simplified representation of the resulting networks (20% tolerance; full networks shown in Supplementary Fig. 6). The size of the nodes reflects the number of K/P or sites perturbed, at a distance d from the drug target. Real network/correct target: 33 out of 34 perturbed nodes connected to mTOR, average shortest path length of 4.3; 92 nodes in total. One of the three randomized networks/real target: 30 out of 34 perturbed nodes connected to mTOR, average shortest path length of 4.9; 83 nodes in total. Real network/incorrect target: 30 out of 34 perturbed nodes connected to MP2K1/MP2K2, average shortest path length of 8.1; 141 nodes in total. Most sites are connected to the designated targets in all cases but the paths are shorter with the correct target than with the incorrect ones. The random networks show a major difference in terms of number of sites explained directly by the drug target or its first neighbours (14 sites for the real network versus 4 sites for this random network).

4 NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications & 2015 Macmillan Publishers Limited. All rights reserved. NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033 ARTICLE worse than a realistic network with the right target. These results mTOR effects CDK2-mediated G1/S-phase transition. The would presumably be influenced by the specificity of the drug and above analysis predicted an interesting path between mTOR and the similarity of the neighbourhoods of candidate targets. However, CDK2 (via Akt), leading to perturbation of the phosphorylation at least in this case study, our analysis indicates that prior level of predicted substrates of CDK2 such as cyclin-L1 (CCNL1) knowledge, at both the drug target and network level, is (see Fig. 4c). Therefore, we wanted to verify both the plausibility informative as it leads to better fits to data. Looking at the of this predicted path to CCNL1, and the hypothesis resulting networks (Fig. 3c and Supplementary Fig. 7), we can see that mTORC1/2 inhibition leads to a perturbation of the that even though all settings can connect chosen targets to most activity of CDK2. First, we constructed extracted ion sites, the incorrect targets reach the perturbed sites through longer chromatograms for the ion shown in Fig. 5b, which covers the paths, with an average shortest path length of 4.3 with the real Ser335/Ser338 phosphorylation sites (m/z 891.0787; tRE64 min) target and 8.1 with the wrong target. Comparing the real and across the control and drug-treated samples (see also randomized networks, the biggest difference lies in the number of Supplementary Fig. 9). As we can see in Fig. 5b,c, this sites that are found directly under the drug target or one of its first phosphopeptide was decreased 2–5 fold in abundance following neighbours (mTOR, AKT1, KS6B2/KS6B2 and SGK1 cover 14 sites Akt and mTORC1/2 inhibition, but not following P70S6K in the real network, whereas mTOR and its first neighbours only inhibition, confirming that the path mTOR-AKT-CDK2 is explain 4 sites in the random network). This indicates that, when likely to be responsible for the fact that perturbation of mTOR working with kinase inhibitors, one should question the assay and reaches CCNL1. This site was also inhibited by PI3K inhibitors, assumed drug targets when none of their known direct substrates but not by the other inhibitors tested (see Supplementary Fig. 9). are affected. In summary, this analysis indicates that considering Given that the sequence flanking the CCNL1-Ser335 phosphor- the scores and resulting networks (proximity of sites to drug target, ylation site shows a consensus CDK2 phosphorylation motif simplicity of the network, etc) should allow us to prioritize drug (xSPxxK25), it is likely that this site is the CDK2 substrate of targets as being more or less likely. Accordingly, we have used which phosphorylation level is affected by mTOR inhibition in a PHONEMeS to analyse the effect of each of the 20 drug treatments mechanism that is Akt and CDK2-dependent. applied in this data set (see Supplementary Note 2 and Next, we wanted to verify that the cell biological effect of CDK2 Supplementary Table 7). This allowed us to get a novel perspective is indeed altered when mTOR and Akt activities are modulated, on the specificity and efficacy of kinase inhibitors. as the PHONEMeS prediction suggests. CDK2 is known to control the transition of cells from G1 to S phase25, such that inhibition of CDK2 activity leads to a stalling of the cells in a Modelling the effect of mTOR inhibition. We used PHONEMeS G1–G0-like state due to an inability to make the transition into to model the propagation, through the phospho-signalling net- S phase. Therefore, if our predictions are correct, inhibition work, of mTOR inhibition with two distinct mTOR inhibitors of mTOR/Akt should result in a decrease of CDK2’s activity (Fig. 4 and Supplementary Table 1). The mTOR kinase interacts (as evidenced by the decrease in phosphorylation of its with multiple proteins to form two complexes: mTORC1 (which substrates as shown in Fig. 5a), and therefore an increase in the responds to nutrients and regulates processes such as protein proportion of cells unable to transition into S phase and beyond. synthesis and autophagy) and mTORC2 (which responds to To test this, we treated MCF7 cells with specific Akt and growth factors). The regulation and downstream effects of mTOR mTORC1/2 inhibitors for 24 h (see Fig. 5a) and subsequently in either of these are complex and only partially characterized24, measured the proportion of cells in each cell-cycle phase by flow but potent specific inhibitors of mTOR as the catalytic subunit of cytometry (see Fig. 5d). As we can see in Fig. 5d, on both complexes are available. Indeed, each of the inhibitors used pharmacological inhibition of either Akt or mTORC1/2, the here result in 37 significantly perturbed phosphosites, of which 32 proportion of cells present in G1 phase increased approximately are common, indicating both efficacy and common specificity. 1.7- and 2-fold, respectively, relative to DMSO control, and the Connecting these sites to mTOR in the network of possible K/P proportion of cells present in S and G2/M phase was solutions results in a network of 311 nodes and 4,859 edges concomitantly reduced. By contrast, on treatment with a (Supplementary Fig. 6). After applying PHONEMeS excluding CDK1/2/9 inhibitor (Fig. 5e), the proportion of cells in G2/M predicted interactions, we obtained a trained network that is phase was increased, consistent with an effect on CDK1 and easily visually interpretable, comprising 30 nodes and 32 edges confirming that the effect observed on Akt and mTOR inhibition with 20% tolerance (Fig. 4a). This network shows that mTOR is likely to be CDK2-dependent and CDK1-independent. As a inhibition results in an expected decrease of phosphorylation of whole, these data suggest that inhibition of either mTORC1/2 or many canonical mTOR substrates (for example, MYC, 4EBP1, Akt reduced the biochemical (Fig. 5b,c) and functional (Fig. 5d) AKTS1, KS6B1). It also shows a propagation of the signal further activity of CDK2, thus giving support to the predictions from the downstream through an effect on KS6B1 and AKT1. Therefore, network model (Fig. 5a). PHONEMeS successfully placed the perturbation data in context, resulting in a noticeable increase in interpretability. Because many of the sites affected by mTOR inhibition in our data do not Reconciling insights from independent data sets.A have experimentally demonstrated kinases, we wanted to see if we fundamental problem with discovery MS lies in the poor could use our trained network to confirm which of their predicted agreement between multiple, biologically similar, data sets7,11.We kinases is most likely to be responsible for their phosphorylation reasoned that if this was mostly a result of the undersampling in this context (Fig. 4b,c). Based on their predicted kinases, we problem4,10, then placing the data in its molecular context should were able to connect an additional 14 perturbed sites to the be able to reconcile the insights obtained, even if the individual perturbation of mTOR, and to provide strong evidence in favour sites that are sampled in each data set are different. To test of 1 (or 2) possible kinases among the 5 to 16 kinases that were this hypothesis, we applied PHONEMeS on an unpublished, predicted for these sites (see Supplementary Table 5). Finally, novel but related data set that was obtained in a different physical functional annotation of the resulting network (Fig. 4c) allowed location, on a different LC-MS/MS instrumentation and with a us to explicitly show effects of mTOR inhibition on downstream partially overlapping set of drugs (Fig. 6 and Supplementary processes such as translation, regulation of the cell-cycle/survival, Table 4). The two data sets differ substantially, both quantitatively and interactions within the two mTOR pathways. (depth, that is, number of peptides measured) and qualitatively

NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications 5 & 2015 Macmillan Publishers Limited. All rights reserved. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033

LARP1.S.774 MTOR

MTOR MYC.S.62 KS6B1.S.434 4EBP1.S.65 AKT1.S.473 KS6B1.S.434 AKTS1.S.183 MYC.S.62 KS6B1.S.447 AKT1.S.473 KS6B1.S.447 LARP1.S.774 4EBP1.T.41 AKTS1.S.183 4EBP1.T.41 EF2K.S.74 EF2K.S.74 KS6B1 KS6B1 4EBP1.S.65 AKT1 MYO5A.S.1652 BAD.S.99 AKT1 CD2AP.S.233 SNUT1.S.448 STK11.S.428 GSK3B.S.9 RICTR.T.1135 STK11.S.428 RPGF6.S.1070 CDK2.T.39 RICTR.T.1135 GSK3B.S.9 MACF1.S.1376 CDK2.T.39 TP53B.S.1430 STK11 GSK3B STK11 CDK2 PKCB1.S.490 UBP32.S.1361 CDK2 LAP2A.S.310 FOXK2.S.428 AAPK1.T.174 PTN2.S.304 4EBP2.T.46 AAPK1.T.174 UNK.S.385 PTN2.S.304 FOXK2.S.428 ACINU.S.1004

AAPK1 AAPK1 CCNL1.S.335

SF3B1.S.129

ACACA.S.80 ACACA.S.80

Subunit of mTORC1 TF, activates 12 inhibits mTOR MTOR GF genes AKTS1 phos.relieves inh. phos. leads KS6B1 to degradation

4EBP1.T41 KS6B1.S.434 AKT1.S.473 4EBP1.S65 AKTS1.S.183 MYC.S.6 MTOR E2FK Apoptosis TranslationKS6B1.S.447 LARP1.S.774 phos. 12 LARP1 4EBP1.T.41 promotes BAD

EF2K.S.74 AKT1 MYC survival KS6B1 4EBP1.S.65 AKT1 RPGF6.S.1070 BAD.S.99 12 12 GEF for RPGF6 GSK3B CD2AP.S.233 MRas GTPases SNUT1.S.448 MACF1 (proliferation) RICTR MYO5A.S.1652 STK11.S.428 Splicing CDK2.T.39 MYO5A KS6B1 RICTR.T.1135 SNUT1 (proliferation) MACF1.S.1376

Deubi.enz. KS6B1 or AKT1 CD2AP RegulationGSK3B.S.9 Cytoskeleton of mTORC2 UBP32.S.1361 STK11 phos. inhibits inh. Transl. CDK2 12 4EBP2.T.46 mTORC2 PTN2 TF GSK3B UNK.S.385 FOXK2 FOXK2.S.428 UBP32 12 PTN2.S.304 AAPK1 .T.174 LAP2A.S.310 ACINU.S.1004 4EBP2.T46 PKCB1 PKCB1.S.490 RTK phosphatase UNK LAP2A CCNL1.S.335 CDK2 GSK3B Regulation SF3B1.S.129 ACINU of RB1 Splicing CCNL1 AAPK1 SF3B1

12 Lipid synthesis Growth/anabolism GF/proliferation mTOR regulation ACACA ACACA .S.80 phos. form inactive

AAPK1 Cytoskeleton Splicing (with links to prolif./survival)

Figure 4 | Modelling the effect of mTOR inhibition. (a) Fifteen sites perturbed under one of the two mTOR inhibitors in our data were connected to mTOR in three independent optimizations, using exclusively experimentally demonstrated interactions. Results of these were combined into average frequencies in ﬁnal populations of models, which were used to extract a best-consensus model with 20% tolerance around the highest frequency edges. This model recapitulates effects on known substrates of mTOR, and highlights a good agreement between the effects of the two inhibitors. (b) Sites that were found to be perturbed but did not have an experimentally demonstrated kinase were connected to the optimized network based on predictions from networKIN. (c) Predictions were prioritized based on the sign of the perturbations observed, and functional annotations of proteins were extracted from Uniprot. This representation allows to analyse such data in a way that directly traces the effect of a kinase inhibition on the phosphoproteome, and prioritizes predicted kinases.

(sites perturbed identiﬁed, drugs used), as summarized in Fig. 7. though the sites sampled in each core network differ (only 2 sites However, when we put the two networks obtained in parallel are common, out of 24 and 16 sites, respectively), we can see in (Fig. 7), we can see that the insights are remarkably similar. Fig. 6b that eight kinases that are common between the two Indeed, both resulting networks consist of a ‘core network’ networks explain 87.5% of sites in the core network. (connecting 24 out of 34 and 16 out of 22 sites, respectively), with Discrepancies in the ‘peripheral networks’ presumably result paths of length up to two kinases from mTOR supported by from a decreased quality of knowledge in the relationship perturbed sites at every step, and a ‘peripheral network’ between the target and distant sites as we get further away containing exclusively paths of three kinases, comprising from mTOR, as well as a decreased intensity and reliability of the kinases not supported by data and many predicted edges. Even biological signal. Importantly, the kinases selected in the common

6 NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications & 2015 Macmillan Publishers Limited. All rights reserved. NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033 ARTICLE

G1 S Torin-1 G1 S MTOR *** *** 60

KS6B1.S.434 = 5000) *** ***

AKT1.S.473 = 5000) AKTS1.S.183 MYC.S.62 n 40 n 40 KS6B1.S.447 LARP1.S.774 *** MK-2206 4EBP1.T.41 *** EF2K.S.74 20 Percentage KS6B1 4EBP1.S.65 AKT1 Percentage 20 ** *** *** *** RPGF6.S.1070 BAD.S.99 *** ***

CD2AP.S.233 SNUT1.S.448 positive cells ( 0 positive cells ( 0 STK11.S.428 MYO5A.S.1652 CDK2.T.39 RICTR.T.1135 G2/M Dead G2/M Dead MACF1.S.1376 ** 60 0.5 GSK3B.S.9 UBP32.S.1361 0.6 N.S.

STK11 CDK2 = 5000) N.S. N.S. ***

4EBP2.T.46 = 5000) 0.4

n 40 n GSK3B FOXK2.S.428 0.4 40 UNK.S.385 N.S. 0.3 PTN2.S.304 ** N.S. AAPK1.T.174 LAP2A.S.310 ACINU.S.1004 20 *** *** *** *** 0.2

PKCB1.S.490 Percentage 0.2 Percentage 20 CCNL1.S.335 SF3B1.S.129 0.1

AAPK1 positive cells ( 0 0.0 positive cells ( 0 0.0 → G1 S transition 0 0.1 1.0 0 0.1 1.0 μM) μM) μM) μM) μM) μM) μM) μM) (AZD-5438) (μM) (AZD-5438) (μM) ACACA.S.80 DMSO DMSO

MK (0.1MK (1.0 MK (0.1MK (1.0 TorinTorin (0.1 (1.0 TorinTorin (0.1 (1.0

GLNPDGTPALSTLGGFpSPApSKPSSPR; CCNL1 p-S335 p-S338 (z=3) Experiment 1 Experiment 2 743266 Inhibitor 1 Inhibitor 2 Inhibitor 1 Inhibitor 2 650358 557449 464541 CCNL1 p-S335 p-S338 (z=3) 371633 Control 278724 Intensity (A.U.) 185816 5e+05 92908 62 63 64 65 62 63 64 65 62 63 64 65 62 63 64 65 62 63 64 65 62 63 64 65 61 62 63 64 61 62 63 64 61 62 63 64 62 63 64 65 62 63 64 62 63 64 65 0 4e+05 Akt 3e+05 63 64 65 66 63 64 65 66 63 64 65 66 62 63 64 65 62 63 64 65 62 63 64 65 63 64 65 66 63 64 65 66 63 64 65 66 62 63 64 65 62 63 64 65 62 63 64 65 2e+05 mTOR

Peak intensity (A.U.) 1e+05 63 64 65 66 62636465 62636465 62 63 64 65 62 63 64 65 62 63 64 65 62 63 64 65 62 63 64 65 62636465 62 63 64 65 62 63 64 65 62 63 64 65

1 2 1 2 p70S6K 1 2 Akt Akt 62 63 64 65 62 63 64 65 62 63 64 65 61 62 63 64 61626364 61 62 63 64 62 63 64 65 62 63 64 65 62 63 64 65 61 62 63 64 61 62 63 64 61 62 63 64 Control mTORmTOR p70S6Kp70S6K

Figure 5 | Inhibition of mTORC1/2 or Akt prevents CDK2-mediated G1- to S-phase transition. (a) The prediction that Cyclin-L1 phosphorylation is affected by mTOR inhibition via Akt and CDK2 was tested by inhibiting Akt and mTOR in a validation experiment. (b) Extracted ion chromatograms for the ion representing two phosphorylation sites on Cyclin-L1 (CCNL1) under treatment with mTOR, Akt and P70S6K inhibitor treatments conﬁrm the topology predicted in a (blue lines, 1st isotope; red lines, 2nd isotope; green lines, 3rd isotope). (c) Box plots summarizing the raw intensity data shown in b. Dotted line represents the median intensity under treatment with DMSO vehicle control. (d). Cell cycle assays for MCF7 cells treated with DMSO, MK-2206 (Akt inhibitor) or Torin-1 (mTORC1/2 inhibitor) show an increase in cells in G1 phase and a decrease in cells present in S and G2/M phase on treatment, consistent with an alteration of the activity of CDK2. (e). Cell cycle assays for MCF7 cells treated with DMSO or AZD-5438 (CDK1/2/9 inhibitor) show an increase in cells in G2/M phase and a decrease in cells in G1 and S phase on treatment, consistent with an effect on CDK1. P-values derived from one-way ANOVA and adjusted for multiple testing (***Po0.001; **Po0.01; NS, not signiﬁcant).

Protein extraction LTQ-orbitrap Sites digestion Velos perturbed 12,266 ˈ ɒ ɒ phospho-enrichment f s.f μ Peptides MTORi1:KU-0063794 1 M 37 t.wɜik 32 ne μ m æs MTORi2: Torin-1 1 M 37

+20 kinase inhibitors Intensity spek 1h m/z 3 sites common Protein extraction Orbitrap XL ˈ digestion 7,374 fɒ s.fɒ phospho-enrichment Peptides MTORi1:KU-0063794 2μM 14 ne t.wɜik 13 MTORi2: AZD8055 2μM 22 +18 kinase inhibitors m æs

1h Intensity spek (7 identical, 7 different m/z 4 identical at different conc.)

Figure 6 | Putting phospho-MS data in context: set-up. This analysis aims to investigate the potential of PHONEMeS to reconcile insights obtained from two similar data sets despite the actual sites sampled being widely different. We separately analysed two similar data sets (a, b) obtained in two different laboratories, with a partially overlapping set of small-molecule kinase inhibitors. We focused on the pairs of drugs with mTOR as nominal target because out of the six nominal targets in common between the two data sets, the mTOR inhibitors appeared to be the most speciﬁc drugs with a clearly detectable effect. The data sets have widely different coverage of the phosphoproteome, and only show three sites in common among the sites that are perturbed on mTOR inhibition.

NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications 7 & 2015 Macmillan Publishers Limited. All rights reserved. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033

a Resulting networks

MYC.S.62 d =0 4EBP1.S.65 MAF1.S.75 EF2K.S.74 PATL1.S.179 LARP1.S.774 AKTS1.S.183 4EBP1.T.41 4EBP1.S.65 AKTS1.S.183 MTOR MTOR KS6B1.S.447

MYC.S.62 d =1 4EBP1.S.65 MAF1.S.75 EF2K.S.74 PATL1.S.179 LARP1.S.774 AKTS1.S.183 4EBP1.T.41 MTOR 4EBP1.S.65 AKTS1.S.183 KS6B1.S.447 MTOR KS6B1.S.447 AKT1.T.450 KS6B2 KS6B2 KS6B2 KS6B1 T.388 S.370 T.228 T.390 KS6B2 KS6B2 KS6B1 AKT1.S.473 SGK1.S.422 KS6B2 T.388 T.228 S.370 T.390 AKT1 KS6B2 KS6B1

AKT1 SGK1 KS6B1 KS6B2 AKTS1NIPA KIF4A EMSY TP53B.S.1317 ACLY IF4B RS6 T.246 S.407 S.801 S.209 S.455S.422 S.236 MYO5A.S.1652 GSK3B.S.9 MACF1.S.1376 CD2AP.S.233 RPGF6.S.1070 SNUT1.S.448 RICTR.T.1135

MYC.S.62 d =2 4EBP1.S.65 MAF1.S.75 EF2K.S.74 PATL1.S.179 LARP1.S.774 AKTS1.S.183 4EBP1.T.41 MTOR 4EBP1.S.65 AKTS1.S.183 KS6B1.S.447 MTOR KS6B1.S.447 AKT1.T.450 KS6B2 KS6B2 KS6B2 KS6B1 S.370 T.228 T.390 KS6B1 T.388 AKT1.S.473 SGK1.S.422 KS6B2 KS6B2 KS6B2 AKT1 KS6B1 T.388 T.228 S.370 T.390 KS6B2 KS6B1 AKT1 SGK1 KS6B2 AKTS1 NIPA KIF4A EMSY ACLY IF4B RS6 TP53B.S.1317 T.246 S.407 S.801 S.209 S.455 S.422 S.236 KPCT KS6A3 MYO5A GSK3B MACF1 CD2AP RPGF6 SNUT1 GSK3A CDK2 NUAK1 CDK5 CDK1 RICTR.T.1135 S.695 S.386 S.1652 S.9 S.1376 S.233 S.1070 S.448 S.21 T.39 S.600 S.159 S.39 GSK3A.S.21 CDK2.T.39 CDK5.S.159 CDK1.S.39 AAPK1.S.487 GSK3A CDK2 NUAK1 CDK5 CDK1 KPCT KS6A3 GSK3B GSK3A CDK2 CDK5 CDK1 AAPK1 FLNA.S.1459 RPRD2.S.665 LATS1.S.464 CCNL1.S.338

LAP2APTN2 FOXK2 SF3B1 CCNL1 ACINU 4EBP2 PKCB1 ACACA UNK.S.385 S.310 S.304 S.428 S.129 S.335 S.1004 T.46 S.490 S.80 MAF1.S.75 d =3 MYC.S.62 PATL1.S.179 4EBP1.S.65 AKTS1.S.183 EF2K.S.74 MTOR 4EBP1.S.65 LARP1.S.774 KS6B1.S.447 4EBP1.T.41 AKT1.T.450 SGK1.S.422 KS6B2 KS6B2 KS6B2 KS6B1 AKTS1.S.183 T.388 S.370 T.228 T.390 MTOR KS6B1 KS6B1.S.447 AKT1 SGK1 KS6B2 KS6B2 KS6B2 KS6B2 KS6B1 AKT1.S.473 SGK1.S.422 AKTS1 NIPAKIF4A EMSY ACLY IF4B RS6 T.388 T.228 S.370 T.390 TP53B.S.1317 T.246 S.407S.801 S.209 S455 S422 S236 STK11 NUAK1 MK01 CDK5 PTN12 PTN12 CDK1 PDPK1SRC AKT1 SGK1 KS6B2 KS6B1 M3K5 MP2K4 RET GSK3AMP2K1CDK2 KPCT.S.695 KS6A3.S.386 S.83 S.80 S.696 S.21 S.298 T.39 S.428 S.600 S.29 S.159 S.39 S.435 S39 S410 S17 MYO5A GSK3B MACF1 CD2AP RPGF6 SNUT1 STK11 NUAK1 MK01 CDK5 PTN12 CDK1PDPK1 SRC KPCT KS6A3 RICTR.T.1135 M3K5 MP2K4 RET GSK3A MP2K1CDK2 S.1652 S.9 S.1376 S.233 S.1070 S.448 GSK3A KPCT MP2K4.S.80 RET.S.696 BIEA CDK2 PGFRAPGFRA CDK5LCKSRCABL2 CDK1AAPK1.S.487 CDK9 FLNA.S.1459 RPRD2.S.665 LATS1.S.464 CCNL1.S.338 S.21 S.230 T.39 S.1042 S.767 S.159S.42S.17S.631 S.39 S.695 S.175 MAPK2 PAK2 NUAK2 T338 S473T308 Y326 Y315 S473 T308 Y315 Y326 MP2K4 GSK3B RET GSK3A T183Y185 T404Y185 S407 T334 T25 T222 S272 S9 Y580 Y402Y579 BIEA CDK2 PGFRA CDK5 LCK SRCABL2 CDK1 AAPK1 KPCT CDK9 T.208 Y.130 MAPK2 FAK2 PAK2 LAP2A MK08 MK09 NUAK2 B3KVH4 B0LPE5 UNK.S.385 PTN2FOXK2 SF3B1CCNL1 ACINU 4EBP2 PKCB1 ACACA KPCI S.310 S.304 S.428 S.129 S.335 S.1004 T.46 S.490 S.80 AKT1 TISB.S.92 CDK3 KC1E CSK21 KS6B2 KS6B1 T.403 MK10.T.76 MK08 MK14 KPCB KPCB PIM2 PKN1 KC1D PRKDC FOXO3.S.253 KPCI RS6.S.235 Y.185 Y.15 S.389 S.331 T.2646 S.370 ERBB2 Y.323 T.500 Y.285 S.204 S.562 ABL2 S.819 ABL2 PP1A.T.320 LCK.S.42 MK10 CDK3 KC1E KC1D PRKDC CSK21 S.817 S.631 Y334Y325Y271 Y280 MK08 MK14 KPCB PIM2 PKN1 ERBB2 ABL2 KPCT PP1A PDPK1 LCK SRC TP53B MTA1 NC2B BAD CCNL1 PKCB1 UBP32 TR10A SSFA2 MDC1 CDK6 CDK14 S299 S664 Y514 T507 Y334 S.1430 T.564 S.106 S.99 S.338 S.490 S.1361 S.466 S.737 S.376 Y.24 Y.146 CDK6 CDK14 KPCD TAU.S.721 RPRD2.S.665 DNM1L.S.616

b Networks skeleton & background All Shortest All Shortest MTOR paths paths MTOR MTOR paths paths 7 5 2 4 sites 2 sites sites MTOR MTOR MTOR sites sites MTOR MTOR MTOR 4 K/P 3 K/P 3 K/P 1 K/P 7 6 8 8 sites sites sites sites 32 282 39 8 K/P 6 K/P 4 K/P 7 K/P 8 K/P 1 K/P 36 286 12 K/P K/P K/P 10 8 2 4 K/P K/P K/P sites sites sites sites 189 edges

128 edges 8 K/P 1 K/P 11 K/P Common core: Core: 24 sites Core: 16 sites 34 34 34 21/14 sites (87.5% of core sites) 22 22 22 10 10 K/P 10 K/P 6 sites sites sites sites sites sites sites explained by 8 K (73% of core K) sites Figure 7 | Putting phospho-MS data in context: results. (a) Networks resulting from independent analysis of two separate data sets on mTOR inhibition, laid out by distance from the drug target (mTOR) (kinases highlighted in green and perturbed sites in yellow are identical between the two networks). (b) Structure of the resulting and background networks. (a,c) Structure of the background networks for the main (a) and the validation data set (c). As expected (see Supplementary Fig. 4), the number of K/P that can reach perturbed sites from mTOR are roughly equivalent in the two data sets (a versus c, left), but the minimum number of K/P necessary to connect these (b versus d, right) is much smaller in the validation data set (c), which contains less sites to connect. The results from the validation data set are however less constrained (higher number of edges within tolerance), presumably as a result of shallower sampling of the proteome. (b) Structure of the 20% tolerance resulting networks for the two data sets. The core networks (paths with evidence at every step) explain, respectively, 4 and 16 perturbed sites with 11 kinases. Eight of these kinases are common between the two analyses, and explain 87.5% of the sites in the core network of each analysis, despite the overlap in sites identiﬁed consisting of only two sites. Comparing the resulting and background networks, we can see that the training to data allowed to strongly constrain the background network, but solutions do not simply represent the shortest possible paths. core network are not simply those that form the shortest path explains these two a priori disparate data sets independently between sites perturbed and mTOR (Fig. 6b). Indeed, respectively (Fig. 6b). 24 and 9 of these 39/12 shortest paths K/P are actually chosen for To conﬁrm the general applicability of PHONEMeS, we used each data set, combined with an additional 8 and 27 K/P. In the approach to analyse a published mTOR inhibition data set26, conclusion, out of a possible 282/286 kinases lying on paths generated using a different cell line and a very different MS linking mTOR to its perturbed substrates in each data set, approach, that is, iTRAQ labelled data with a very limited PHONEMeS shows that a remarkably similar set of kinases best number of replicates and conditions, and a lower proteome

8 NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications & 2015 Macmillan Publishers Limited. All rights reserved. NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033 ARTICLE coverage. We found that our approach both confirmed our own fragments of a phosphorylation-based network, in the form of insights about mTOR inhibition and allowed to generate kinase–substrate relationships and individual kinases likely to new insights that could not be obtained in the original study be affected by a perturbation, rather than retrieving global (see Supplementary Note 3 and Supplementary Fig. 10). paths of perturbation as we do here. Our study therefore demonstrates a novel approach to derive biological information from phosphoproteomics data that can complement existing tools Discussion and that, crucially, can provide insights not attainable with In this paper, we introduce a method, termed PHONEMeS, which current approaches. places discovery phosphoproteomic MS data on perturbation in A number of assumptions underlie our approach: first, since it the context of a network of possible K/P-S interactions. This constrains the space of solutions to what is compatible with method reconstructs paths from kinases inhibited on drug known and predicted K/P-substrate specificity, anything outside treatment to sites perturbed by these inhibitions, using a Boolean this scope cannot be found (although candidate interactions can logic modelling. The assessment of whether a site is perturbed, as easily be added prior to training). This allows constraining well as the scoring of models based on comparison between data solutions to an a priori realistic set, which seems like a reasonable and model prediction, is based on a stringent multi-step compromise when working with very high-content (but probabilistic scoring scheme. We found that this method is comparatively low-sample-throughput) data. Second, given the capable of discriminating incorrect targets from kinases that are nature of regulation by phosphorylation, the background truly affected by a drug treatment, which is extremely valuable networks tend to be very complex, sometimes resulting in poorly given that the in vivo specificity of kinase inhibitors is often constrained (or non-identifiable) areas. This limits the insights poorly defined15. In addition, we observed that the data-driven that can be generated from the analysis for certain sections of the training results in a considerable decrease in the complexity of the network, a feature that we make visible in the solutions. Ongoing networks that have to be interpreted, compared to the full developments in high-content MS technologies should make network of paths compatible with prior knowledge. Finally, we rich experimental designs more attainable, resulting in more demonstrated the use of this method as a means to place constrained models. Third, our approach describes rapid change discovery MS data on perturbation in a clearly and easily on perturbation (mathematically assumed to correspond to a interpretable context, leading to an easily testable hypothesis. We pseudo steady state) based on data at a characteristic time point went on to validate some of these predictions, such as a predicted on perturbation. This is mostly a decision associated with the phosphorylation of CCNL1 by CDK2, leading to perturbation of limited availability of perturbation time-series data and the the phosphorylation of CCNL1 on inhibition of mTOR and Akt complexity of constraining both structure and dynamics on such but none of the other kinases in the network. We additionally large scales. Modelling change on perturbation implies that the validated the prediction that CDK2’s activity is perturbed by edges do not capture activity per se, but a change in a K/P’s mTORC1/2 and Akt inhibition, resulting in an alteration of the activity on perturbation, at the time of measurement. Finally, our proportion of cells in the G1, S and G2/M phases of the cell cycle. work uses phosphorylation as a proxy of alteration of activity. We also showed the use of PHONEMeS to compare insights from Hence, the presence of a path does not necessarily mean that multiple similar data sets despite these having little overlap at first phosphorylations caused the activation/inhibition cascade, sight, a common problem with discovery MS. but that they could occur concurrently with those events. Other studies have looked for paths in large networks, mostly Conversely, an activation/inhibition path could simply be using protein–protein interaction (PPI) networks and different absent if no phosphorylation path could be found that co-occurs types of experimental data27–32. Because of the nature of the with the cascade. Other available data (for example, about other networks that frequently support these procedures (that is, PPI post-translational modifications) could easily be included in our networks or compendium functional networks such as framework to address this. STRING33), the resulting networks connect experimentally or This approach represents a step forward in the functional, knowledge-derived ‘hit nodes’ rather than physically interpretable context-specific interpretation of discovery MS phosphoproteo- and testable paths, and do not include site-specific information. mic data on perturbation, such as but not limited to kinase In addition, there rarely is a clear and consistent physical inhibition: any perturbation that can be connected to the K/P-S interpretation of paths, because the search procedure is often network (genetic, extracellular ligand and so on) can in principle designed to find minimal paths of influence or minimally be used. We are planning to investigate several additional features connecting subnetworks, and is agnostic to the biological for this method, such as different ways to (i) include (and study) interpretation of edges, which is often widely heterogeneous. the logic combinatorial complexity in the network (which we Furthermore, many of these approaches ignore quantitative have largely ignored here because, without combinations of information associated with ‘hit’ nodes that they aim to perturbations, complex logic gates rarely reach high frequencies); connect, and impose some constraint on inclusion of nodes (ii) include multiple states of regulation, such as for example, without data, thereby penalizing missing data/data with high control/up/down; (iii) extend the background knowledge to false-negative rates and incomplete coverage. This is a reasonable interactions based on lipid phosphorylation and phospho-binding choice when using genomic and transcriptomic technologies domains. where coverage is less of an issue, but less so with respect to proteomics experiments. Finally, many of these methods cannot simultaneously capture multiple conditions (and disjoint sets of Methods associated hits) in a single network. Therefore, multiple Data generation. The main data set used in this paper is described in Wilkes et al.20. Briefly, cells are treated for an hour with one of 20 small-molecule kinase conditions require either searching for multiple separate inhibitors (see table ST1) or the DMSO vehicle. Cells are collected, lysed and networks or lumping all nodes to be connected as if they enriched for phosphopeptides using TiO2, and samples are run through an LC-MS/ originated from the same perturbation. Some studies34–36 have MS (LTQ-Orbitrap-Velos) in (technical) triplicates. Each treatment is performed in applied well-known reverse-engineering approaches (for example, biological duplicates (successive early passages), leading to a total of six replicate measurements for each peptide and drug treatment. The peptide identification is correlation-based), but using extensively manually curated data of 38 12,37 done using Mascot and the quantification is done using the Pescal software , much smaller scope. A few studies have used very similar providing peak heights from extracted ion chromatograms for each data to ours but their insights were focused on inferring phosphopeptide ion. For the second data set, the analysis pipeline is identical

NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications 9 & 2015 Macmillan Publishers Limited. All rights reserved. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033 except for the inhibitors used (see table ST4) and the mass spectrometer Iterative training to data (LTQ-Orbitrap-XL). The construction and visualization of extracted ion Sampling. Logic models are sampled from the background network at each gen- chromatograms (XICs) was automated with a computer programme written in eration. Sampling is done for each node independently by dividing the (0,1) Python that interfaces with MSFileReader (ThermoFisher Scientific) as reported interval into as many bins as allowed, with input combinations of edges and previously9. The time and mass windows for the generation of XICs were 1.5 min drawing from the uniform distribution. For integrators, the allowed input com- and 7 p.p.m., respectively. binations are as follows: each edge independently, all edges together as a single AND or no edge. For example, an integrator with two incoming edges has four input combinations: (edge1, edge2, edge1 þ edge2, no edge), each getting one of Cell-cycle analysis. MCF7 cells were plated in six-well plates at a density of four bins, that is, a probability of 0.25. For a sink node, we sample a single K/P approximately 0.5 Â 106 cells per well in Dulbecco’s modified Eagle’s medium (for example, for a sink with two incoming edges: (edge1, edge2) each gets one of (supplemented with 10% foetal bovine serum; 1% penicillin/streptomycin). two bins, that is, a probability of 0.5). For intermediate nodes, we allow all possible Following a 24-h incubation period, these cells were treated with 0.1% DMSO, logic combinations of input edges by drawing for the edges and then separately for 0.1 or 1.0 mM AZD-5438, 0.1 or 1.0 mM MK-2206, or 0.1 or 1.0 mM Torin-1 for the gate that will combine them. For example, considering an intermediate with 24 h. Following this incubation, the cells were washed with PBS, trypsinized and two incoming edges (edge1, edge2) are sampled independently with 0.5 probability, collected by centrifugation (450g, 5 min). Cells were fixed with the drop-wise and the AND/OR is sampled with 0.5 probability. As a result, both edges will be addition of 10 ml ice-cold 70% EtOH—with constant agitation—and incubated at sampled simultaneously in an expected 25% of models (half of which will are 4 °C for 2 h. Fixed cells were then washed with PBS, collected by centrifugation expected to be sampled with an ‘AND’ gate), no edge will be sampled in 25% of (as above) and stained in the dark with 0.7 ml Guava Cell Cycle reagent (Merck models (expected), and either edge only will be sampled in half of the remaining Millipore, USA). Stained cells were then measured through the use of a Guava 50% of cases each. These add up to the following possibilities: (edge1, edge2, no easyCyteTM Flow Cytometer (Merck Millipore, USA). Each sample was run in edge, edge1‘AND’edge2, edge1‘OR’edge2), with respective probabilities (0.25, 0.25, triplicate on the instrument and 5,000 cells were analysed in each replicate 0.25, 0.125, 0.125) (more details in SF2). injection. Scoring. Models in the sampled population are simulated under each condition (drug or identical set of drug targets) by deterministically propagating perturbations from drug target nodes in the Boolean logic model, until reaching steady state. The Data processing. The raw peak heights were log transformed and quantile score of a model is obtained by summing for each simulated condition over the j data normalized, and a linear model was applied to estimate the effect of each treatment conditions that map to the set of drug targets used in the simulation (that is, and experimental artefact, as implemented in the Bioconductor package limma. evidence from multiple drugs with the same targets are added up): Multiple designs were tested and a design with a factor for each drug, a single X X S ¼ S ÀÀ S ; control and a factor for the biological replicates (experiments were done in two model nodes predicted P ij DP nodes predicted C i j sets, each containing a control and a set of drugs, and two biological replicates) was selected based on Akaike’s Information Criterion. The linear model was fitted (that is, true positive predictions (negative Si,j) þ false-positive predictions (positive taking into account the inter-technical replicate correlation. We then estimated the Si,j) À false-negative predictions (negative Si,j)). The best models are associated with log fold change between each kinase inhibition and the control condition for each the lowest (most negative) scores. True negative predictions are not taken into peptide, and computed the significance of this fold change using a moderated account because with discovery (untargeted) MS data we expect negative t-statistic by empirical Bayes shrinkage of the standard errors. The resulting measurements (nodes not perturbed) to outweigh positive ones under any particular P-values were corrected for multiple hypothesis testing using a Benjamini– condition. Note that the Si,j are obtained by peptide, and the simulation gives Hochberg procedure applied on each condition, as implemented in the R function perturbed/control information at the site level, so when multiple peptides match to p.adjust. We fitted a GMM on the estimates from the linear model, separately for the same site, their Si,j are added up as independent pieces of evidence. A family of each peptide, across 21 conditions (20 drugs, 1 control), using the R package best models (with a user-defined level of tolerance around the best model of the mclust. The package implements an expectation-maximization algorithm to fit a iteration) is selected and used for weights correction. GMM with one to nine components and return the optimal model according to a Weights correction. This is done by virtually copying edges a number of times Bayesian information criterion. Of the 11,654 peptides for which a model could be that is proportional to their frequency in the best models, with a fixed ceiling (‘cap’ fitted, 7,263 were best fitted with one component, 3,183 with two, and 670 with parameter). For the integrator example above, if the ceiling is cap ¼ 2 and edge1 is three. We filtered the data to exclude cases where the areas under the density curves present in 50% of best models, and the AND in the other 50%, then the new of the two distributions overlap by more than 10% and excluded those peptides for possibilities are (edge1 Â 2, edge2, edge1 þ edge2) Â 2, no edge), with respective which the control intensity could not be estimated due to insufficient data. A total probabilities (2/6, 1/6, 2/6, 1/6). For the sink example, if edge1 is present in 100% of of 2,376 peptides fulfilled all of these conditions. Based on the Gaussian parameter best models, then the new possible inputs are (edge1 Â 3, edge2), with respective estimates obtained for the 2,376 peptides mentioned above, we computed a single probabilities (0.75, 0.25). For the intermediate example, if edge1 is present in 100% number for each site i/condition j: Si,j ¼ log10(Pi,j(Ci)/Pi,j(Pi)) where Pi,j(Ci) is the of best models, and edge2 is present in 50% of best models, with the AND gate in probability that peptide i in condition j belongs to the control distribution for 50% of best models, then edge1 and edge2 will be sampled with respective peptide i and Pi,j(Pi) is the probability that peptide i in condition j belongs to the probabilities 3/5 and 2/5, and the AND will be sampled with probability perturbed distribution for peptide i. This number is negative when the measure- (0.5 þ 0.5)/2 (average of the previous sampling weight and the frequency of AND ment is more likely to belong to the perturbed distribution, and positive otherwise. in the best models of the generation). Results analysis. The iterative training is finished when the average score of the population of sampled models and the frequency of edges in this population Background network. We assembled the K/P-S data from multiple databases stabilize. These frequencies are the final result of a single optimization, and they are (Phospho.ELM, PhosphoSitePlus, HPRD, NetworKin and DEPOD, all downloaded normally combined and averaged across multiple independent optimizations to in January 2013), mapped it to Uniprot identifiers (UPIDs) and filtered it to account for the stochastic nature of the optimization. The results of combined contain only reliable, human protein–protein information. The NetworKIN data optimizations are expressed as a ‘majority voting model’, which is the model where were filtered to exclude interactions with either a motif or context score below 0.5, each node has as input the edge with maximum (averaged) frequency in the trained and predicted interactions for CSK21 and CSK22 (2288 each) were excluded populations. Starting from this single ‘consensus’ model, we look at a spectrum of because of low specificity. The final bipartite K/P-S network contains 128,251 models with increasing levels of noise, where each node receives all edges with a interactions between 604 kinases/phosphatases and 17,319 sites on 4,567 proteins frequency within a certain percentage of the best input frequency. All of the (see ST2). As discussed, this exclusively contains PPIs. Additional types of computation is done in R, and the resulting networks and attributes are output as knowledge can be added to the background network and be subjected to training in flat text files that can be directly imported and visualized in Cytoscape39. an identical manner. This is the case, for example, for the link between PI3K All the code is freely available as an R package under licence GPLv3, as well as (PK3CA, D) and PDPK1.S393, which is added to the knowledge to reflect the tutorials and Cytoscape files on http://www.cellnopt.org/PHONEMeS activation of PDPK1 as a consequence of lipid phosphorylation by PI3K. A Proof-of-principle and influence of parameters. For these analyses (Fig. 3), three background network is built from this graph reflecting two main goals: (i) finding independent rounds of optimization were performed for each setting, with relationships between perturbation targets, and (ii) finding kinases/phosphatases parameters set to the following: direct interactions ¼ G1, cap ¼ 5, n ¼ 5,000, that are affected by the perturbations and are responsible for the altered tolerance ¼ 30%, sizeP ¼ 1, unless stated otherwise. Optimizations were run for 50 phosphorylation levels in the data. Because the altered sites change from data to generations, at which point all had stabilized (see Supplementary Fig. 3). data, background networks are specific to each data set. First, we extract a directed Target discrimination and random networks. For these analyses (Fig. 3), the data network at the protein level that connects drug targets, including paths of up to relating to the two mTOR inhibitors in Supplementary Table 1 (sites perturbed seven nodes. Second, we collect all K/P that potentially target our data sites. Finally, under either drug) were used as a basis to build and train background networks we collect all K/P in the second network to all K/P in the second network, with using (i) the real K/P-S bipartite network and MTOR_HUMAN as a target; (ii) the paths of up to five nodes at the protein level. We then translate PPIs back into their real network and MP2K1/MP2K2_HUMAN as targets; and (iii) three different protein-site equivalents, and add ‘integrator edges’ to bridge the site-protein steps randomized versions of the bipartite K/P-S network where K/P were randomly in protein–protein paths (see SF2). All nodes in the background network are shuffled, and MTOR_HUMAN as a target. For each setting, three independent recorded with their UPIDs, and UPIDs (often without the ‘_HUMAN’ suffix) are optimizations were run for 50 generations, with parameters sizeP ¼ 0, direct used to refer to nodes in starting and resulting networks. interactions ¼ all, n ¼ 5,000, tolerance ¼ 15%, and cap ¼ 20.

10 NATURE COMMUNICATIONS | 6:8033 | DOI: 10.1038/ncomms9033 | www.nature.com/naturecommunications & 2015 Macmillan Publishers Limited. All rights reserved. NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9033 ARTICLE

mTOR inhibitors analysis. The analysis for the first data set (Figs 4 and 7, left) is 26. Hsu, P. P. et al. The mTOR-regulated phosphoproteome reveals a mechanism sidentical as the one in option (i) of the ‘Target discrimination and random of mTORC1-mediated inhibition of growth factor signalling. Science 332, networks’ section above. The functional annotation of sites was extracted from 1317–1322 (2011). Uniprot. For sites where both GSK3B and CDK2 were possible kinases, the 27. Tuncbag, N. et al. Simultaneous reconstruction of multiple signalling pathways kinases were prioritized based on the sign of the perturbation (the perturbation on via the prize-collecting steiner forest problem. J. Comput. Biol. 20, 124–136 GSK3B being an activatory one, sites showing an increase were more likely to be placed downstream of GSK3B than CDK2, whose known substrate signs showed (2013). decreases in phosphorylation, consistent with an inactivatory perturbation). For the 28. Gitter, A., Carmi, M., Barkai, N. & Bar-Joseph, Z. Linking the signalling second data set (Fig. 6, right), the settings are identical but the data used to build and cascades and dynamic regulatory networks controlling stress responses. train the background network are those associated with the two mTOR inhibitors in Genome Res. 23, 365–376 (2013). Supplementary Table 4. All network visualization is done in Cytoscape. 29. Gosline, S. J. C., Spencer, S. J., Ursu, O. & Fraenkel, E. SAMNet: a network- based approach to integrate multi-dimensional high throughput datasets. Integr. Biol. 4, 1415–1427 (2012). References 30. Huang, S.-S. et al. Linking proteomic and transcriptional data through the 1. Altelaar, A. F. M., Munoz, J. & Heck, A. J. R. Next-generation proteomics: interactome and epigenome reveals a map of oncogene-induced signalling. towards an integrative view of proteome dynamics. Nat. Rev. Genet. 14, 35–48 PLoS Comput. Biol. 9, e1002887 (2013). (2013). 31. Lan, A. et al. ResponseNet: revealing signalling and regulatory networks linking 2. Choudhary, C. & Mann, M. Decoding signalling networks by mass genetic and transcriptomic screening data. Nucleic Acids Res. 39, W424–W429 spectrometry-based proteomics. Nat. Rev. Mol. Cell. Biol. 11, 427–439 (2010). (2011). 3. Gillet, L. C. et al. Targeted data extraction of the MS/MS spectra generated by 32. Yosef, N. et al. Toward accurate reconstruction of functional protein networks. data-independent acquisition: a new concept for consistent and accurate Mol. Syst. Biol. 5, 248 (2009). proteome analysis. Mol. Cell. Proteomics 11, O111.016717 (2012). 33. Franceschini, A. et al. STRING v9.1: protein-protein interaction networks, with 4. Gstaiger, M. & Aebersold, R. Applying mass spectrometry-based proteomics to increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2013). genetics, genomics and network biology. Nat. Rev. Genet. 10, 617–627 (2009). 34. Kumar, N., Wolf-Yadlin, A., White, F. M. & Lauffenburger, D. A. Modeling 5. Malmstroem, J., Lee, H. & Aebersold, R. Advances in proteomic workflows for HER2 effects on cell behavior from mass spectrometry phosphotyrosine data. systems biology. Curr. Opin. Biotechnol. 18, 378–384 (2007). PLoS Comput. Biol. 3, e4 (2007). 6. Picotti, P., Bodenmiller, B., Mueller, L. N., Domon, B. & Aebersold, R. Full 35. Locasale, J. W. & Wolf-Yadlin, A. Maximum entropy reconstructions of dynamic range proteome analysis of s. cerevisiae by targeted proteomics. Cell dynamic signalling networks from quantitative proteomics data. PLoS ONE 4, 138, 795–806 (2009). e6522 (2009). 7. Bell, A. W. et al. A HUPO test sample study reveals common problems in mass 36. Zhang, Y., Kweon, H. K., Shively, C., Kumar, A. & Andrews, P. C. Towards spectrometry–based proteomics. Nat. Methods 6, 423–430 (2009). systematic discovery of signalling networks in budding yeast filamentous 8. Bensimon, A., Heck, A. J. R. & Aebersold, R. Mass spectrometry-based growth stress response using interventional phosphorylation data. PLoS proteomics and network biology. Annu. Rev. Biochem. 81, 379–405 (2012). Comput. Biol. 9, e1003077 (2013). 9. Casado, P. & Cutillas, P. R. A self-validating quantitative mass spectrometry 37. Santra, T., Kholodenko, B. & Kolch, W. An integrated bayesian framework for method for assessing the accuracy of high-content phosphoproteomic identifying phosphorylation networks in stimulated cells. Adv. Exp. Med. Biol. experiments. Mol. Cell. Proteomics 10, M110.003079 (2011). 736, 59–80 (2012). 10. Nilsson, T. et al. Mass spectrometry in high-throughput proteomics: ready for 38. Cutillas, P. R. & Vanhaesebroeck, B. Quantitative profile of five murine core the big time. Nat. Methods 7, 681–685 (2010). proteomes using label-free functional proteomics. Mol. Cell. Proteomics 6, 11. Tabb, D. L. et al. Repeatability and reproducibility in proteomic identifications 1560–1573 (2007). by liquid chromatography-tandem mass spectrometry. J. Proteome Res. 9, 39. Shannon, P. et al. Cytoscape: a software environment for integrated models of 761–776 (2010). biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003). 12. Casado, P. et al. Kinase-substrate enrichment analysis provides insights into the heterogeneity of signalling pathway activation in leukemia cells. Sci. Signal. 6, rs6 (2013). Acknowledgements 13. Bodenmiller, B. et al. Phosphoproteomic analysis reveals interconnected We would like to thank the EMBL PhD programme who funded C.D.A.T. for this work. system-wide responses to perturbations of kinases and phosphatases in yeast. E.H.W. was supported by Barts and the London Charity and Cancer Research UK. We Sci. Signal. 3, rs4 (2010). also thank Florian Markowetz, Kathryn Lilley, Pedro Beltrao, David Ochoa and John 14. Jorgensen, C. & Linding, R. Simplistic pathways or complex networks? Curr. Marioni for helpful discussions. Opin. Genet. Dev. 20, 15–22 (2010). 15. Pawson, T. & Scott, J. D. Protein phosphorylation in signalling–50 years and Author contributions counting. Trends. Biochem. Sci. 30, 286–290 (2005). C.D.A.T. designed and implemented the method, and performed the data analysis. 16. Ideker, T. & Krogan, N. J. Differential network biology. Mol. Syst. Biol. 8, 565 E.H.W. and P.R.C. designed the mass spectrometry experiments, which were all per- (2012). formed by E.H.W. The cell cycle and validation experiments were designed by E.H.W., 17. Terfve, C. & Saez-Rodriguez, J. Modeling signalling networks using P.C. and P.R.C., and performed by E.H.W. and P.C. P.R.C. supervised the wet-lab and high-throughput phospho-proteomics. Adv. Exp. Med. Biol. 736, 19–57 mass spectrometry experiments. J.S.-R. supervised the method development and the (2012). computational experiments. 18. Del Rosario, A. M. & White, F. M. Quantifying oncogenic phosphotyrosine signalling networks through systems biology. Curr. Opin. Genet. Dev. 20, 23–30 (2010). Additional information 19. Naegle, K. M. et al. PTMScout, a web resource for analysis of high throughput Supplementary Information accompanies this paper at http://www.nature.com/ post-translational proteomics studies. Mol. Cell. Proteomics 9, 2558–2570 naturecommunications (2011). 20. Wilkes, E. H., Terfve, C., Gribben, J. G., Saez-Rodriguez, J. & Cutillas, P. R. Competing financial interests: The authors declare no competing financial interests. Empirical inference of circuitry and plasticity in a kinase signalling network. Proc. Natl Acad. Sci. USA 112, 7719–7724 (2015). Reprints and permission information is available online at http://npg.nature.com/ 21. Morris, M. K., Saez-Rodriguez, J., Sorger, P. K. & Laffenburger, D. A. Logic- reprintsandpermissions/ based models for the analysis of cell signalling networks. Biochemistry 49, 3216–3224 (2010). How to cite this article: Terfve, C. D. A. et al. Large-scale models of signal propagation 22. Hyduke, D. R. & Palsson, B. Towards genome-scale signalling-network in human cells derived from discovery phosphoproteomic data. Nat. Commun. 6:8033 reconstructions. Nat. Rev. Genet. 11, 297–307 (2010). doi: 10.1038/ncomms9033 (2015). 23. Saez-Rodriguez, J. et al. Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol. Syst. Biol. 5, 331 (2009). This work is licensed under a Creative Commons Attribution 4.0 24. Laplante, M. & Sabatini, D. M. mTOR signalling in growth control and disease. International License. The images or other third party material in this Cell 149, 274–293 (2012). article are included in the article’s Creative Commons license, unless indicated otherwise 25. Kitawaga, M. et al. The consensus motif for phosphorylation by cyclin in the credit line; if the material is not included under the Creative Commons license, D1-Cdk4 is different from that for phosphorylation by cyclin A/E-cdk2. EMBO users will need to obtain permission from the license holder to reproduce the material. J. 15, 7060–7069 (1996). To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/