Controllability Analysis of the Directed Human Protein Interaction Network Identifies Disease Genes and Drug Targets

Controllability analysis of the directed human protein interaction network identifies disease genes and drug targets

Arunachalam Vinayagama,1, Travis E. Gibsonb, Ho-Joon Leec,2, Bahar Yilmazeld,e, Charles Roeseld,e,3, Yanhui Hua,d, Young Kwona, Amitabh Sharmab,f,g, Yang-Yu Liub,f,g,1, Norbert Perrimona,h,1, and Albert-László Barabásif,g,1 aDepartment of Genetics, Harvard Medical School, Boston, MA 02115; bChanning Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115; cDepartment of Systems Biology, Harvard Medical School, Boston, MA 02115; dDrosophila RNAi Screening Center, Department of Genetics, Harvard Medical School, Boston, MA 02115; eBioinformatics Program, Northeastern University, Boston, MA 02115; fCenter for Complex Network Research, Department of Physics, Northeastern University, Boston, MA 02115; gCenter for Cancer Systems Biology, Dana-Farber Cancer Institute, Boston, MA 02115; and hHoward Hughes Medical Institute, Harvard Medical School, MA 02115

Contributed by Norbert Perrimon, March 24, 2016 (sent for review November 25, 2015; reviewed by Reka Albert and Tatsuya Akutsu)

The protein–protein interaction (PPI) network is crucial for cellular properties of control systems such as proportional action, feed- information processing and decision-making. With suitable inputs, back control, and feed-forward control (8–12). However, the PPI networks drive the cells to diverse functional outcomes such as main challenges that hinder systematic controllability analysis of cell proliferation or cell death. Here, we characterize the structural biological networks are the availability of large-scale biologically controllability of a large directed human PPI network comprising relevant networks and efficient tools to analyze their controlla- 6,339 proteins and 34,813 interactions. This network allows us to bility. To address these issues, two resources were integrated in i ii classify proteins as “indispensable,”“neutral,” or “dispensable,” this work: ( ) a directed human PPI network (13); and ( )an which correlates to increasing, no effect, or decreasing the number analytical framework to characterize the structural controllability of driver nodes in the network upon removal of that protein. We of directed weighted networks (14). The directed human PPI find that 21% of the proteins in the PPI network are indispensable. network represents a global snapshot of the information flow in Interestingly, these indispensable proteins are the primary targets cell signaling. For a given weighted and directed network associated with linear time-invariant dynamics, the analytical framework of disease-causing mutations, human viruses, and drugs, suggest- identifies a minimum set of driver nodes, whose control is suffi- ing that altering a network’s control property is critical for the cient to fully control the dynamics of the whole network (14, 15). transition between healthy and disease states. Furthermore, ana- In this work, we classified the proteins (nodes) as “in- lyzing copy number alterations data from 1,547 cancer patients dispensable,”“neutral,” or “dispensable,” based on the change reveals that 56 genes that are frequently amplified or deleted in of the minimum number of driver nodes needed to control the nine different cancers are indispensable. Among the 56 genes, 46 PPI network when a specific protein (node) is absent. In addi- of them have not been previously associated with cancer. This tion, we analyzed the role of different node types in the context suggests that controllability analysis is very useful in identifying of human diseases. Using known examples of disease-causing novel disease genes and potential drug targets. Significance network biology | controllability | protein–protein interaction network | disease genes | drug targets Large-scale biological network analyses often use concepts used in social networks analysis (e.g. finding “communities,”“hubs,” he need to control engineered systems has resulted in a etc.). However, mathematically advanced engineering concepts Tmathematically rich set of tools that are widely applied in the have only been applied to analyze small and well-characterized design of electric circuits, manufacturing processes, communi- networks so far in biology. Here, we applied a sophisticated – cation systems, aircraft, spacecraft, and robots (1 3). Control engineering tool, from control theory, to analyze a large-scale theory deals with the design and stability analysis of dynamic directed human protein–protein interaction network. Our anal- systems that receive information via inputs and have outputs ysis revealed that the proteins that are indispensable, from a available for measurement. Issues of control and regulation are network controllability perspective, are also commonly targeted central to the study of biological systems (4, 5), which sense and by disease-causing mutations and human viruses or have been process both external and internal cues using a network of identified as drug targets. Furthermore, we used the controlla- interacting molecules (6). The dynamic regulation of this mo- bility analysis to prioritize novel cancer genes from cancer ge- lecular network in turn drives the system to various functional nomic datasets. Altogether, we demonstrated an application of states, such as triggering cell proliferation or inducing apoptosis. network controllability analysis to identify new disease genes This feature of specific input signals driving networks from an and drug targets. initial state to a specific functional state suggests that the need to control a biological system plays a potentially important role in Author contributions: A.V., Y.-Y.L., N.P., and A.-L.B. designed research; A.V., T.E.G., and Y.-Y.L. performed research; H.-J.L., B.Y., C.R., Y.H., Y.K., and A.S. contributed new re- the evolution of molecular interaction networks. Note that the agents/analytic tools; A.V., T.E.G., and Y.-Y.L. analyzed data; and A.V., T.E.G., Y.-Y.L., term “state” is also used in a control context where the “state and A.-L.B. wrote the paper. space” of a control system is the space of values the “state var- Reviewers: R.A., Pennsylvania State University; T.A., Kyoto University. ” – iables can attain. For a protein protein interaction (PPI) net- The authors declare no conflict of interest. work, the state variables are the specific protein concentrations 1To whom correspondence may be addressed. Email: [email protected], and the state space is all positive real numbers of dimension [email protected], [email protected], or [email protected]. equal to the total number of proteins in the PPI network. 2Present address: Department of Laboratory Medicine, Gangnam Severance Hospital, According to control theory, a dynamic system is controllable Yonsei University College of Medicine, Seoul, 062-73, Korea. if, with a suitable choice of inputs, the system can be driven from 3Present address: Marine Science Center, Northeastern University, Nahant, MA 01908. any initial state to any desired final state in finite time (2, 7). This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. Previous studies have shown that network components exhibit 1073/pnas.1603992113/-/DCSupplemental.

4976–4981 | PNAS | May 3, 2016 | vol. 113 | no. 18 www.pnas.org/cgi/doi/10.1073/pnas.1603992113 mutations, virus targets, and drug targets, we identified in- 21% of nodes are indispensable, 42% are neutral, and the dispensable nodes that are key players in mediating the remaining 37% are dispensable (Fig. 1B). Interestingly, we found transition between healthy and disease states. Our study illus- that all of the three node types have a heterogeneous degree trates the potential application of network controllability analysis distribution, and indispensable nodes tend to have higher in- and as a powerful tool to identify new disease genes. out-degrees compared with neutral and dispensable nodes (Fig. 1 B and C). Similarly, indispensable nodes are associated with Results more PubMed records (www.ncbi.nlm.nih.gov/pubmed) and Characterizing the Controllability of the Directed PPI Network. We Gene Ontology (16) term annotation than neutral and dispens- applied linear control tools to access local controllability of PPI able nodes (Fig. S1 A and B). However, the correlation between networks whose dynamics are inherently nonlinear. The experi- the node-degree and the literature bias is weak (correlation co- mentally obtained network, however, can be assumed to capture efficient of 0.37 and 0.41 for in- and out-degree, respectively), linear affects around homeostasis. Furthermore, given that the suggesting that the higher degree of indispensable nodes is not tools developed in ref. 14 are for linear dynamics, we are careful explained by the literature bias alone (Fig. S1 C and D). to only assume that we can ascertain local controllability around We characterized indispensable, neutral, and dispensable homeostasis. Controllability henceforth referred to local con- nodes in the context of essentiality, evolutionary conservation, trollability (see SI Text for details). and regulation at the level of translational and posttranslational The directed human PPI network consists of 6,339 proteins modifications (PTMs). Our gene essentiality analysis indicated (nodes) and 34,813 directed edges, where the edge direction that indispensable nodes are enriched in essential genes, whereas corresponds to the hierarchy of signal flow between the inter- essential genes are underrepresented among dispensable nodes acting proteins and the edge weight corresponds to the con- (Fig. 1E, Fig. S1E, and Dataset S1). Furthermore, indispensable fidence of the predicted direction. We applied structural nodes are evolutionarily conserved from human to yeast com- controllability theory to identify a minimum set of driver nodes pared with the other two node types (Fig. 1E and Fig. S1F). Next, (i.e., nodes through which we can achieve control of the whole we analyzed the different node types in the context of cell sig- network). Note that the identified minimum driver node set naling, which is at the core of cellular information processing. In (MDS) is not unique, but its size, denoted as ND, is uniquely general, known signaling proteins are enriched as indispensable determined by the network topology. We found that the MDS of nodes. However, dissecting different functional classes within the directed human PPI network contains 36% of nodes. We also signaling proteins reveals that kinases are enriched as in- classified the nodes as indispensable, neutral, or dispensable, dispensable nodes whereas membrane receptors and transcrip- E A basedonthechangeofND upon their removal. A node is tion factors are enriched as neutral nodes (Fig. 1 and Fig. S2 ). (i) indispensable if removing it increases ND (e.g., node 2 in Fig. Analysis of the protein steady-state abundance in cell lines, as a 1A), (ii) neutral if its removal has no effect on ND (e.g., node 1 in measure of translational regulation, reveals that indispensable Fig. 1A), and (iii) dispensable if its removal reduces ND (e.g., nodes are enriched as high copy number proteins, whereas low- nodes 3 and 4 in Fig. 1A). In the directed human PPI network, copy number proteins show moderate enrichments for both SCIENCES Input network Characterizing minimum driver node set (MDS) Driver node: Node classification APPLIED BIOLOGICAL A 1 1 1 1 1 1 1 1 1

2 2 2 2 2 2 2 2 2

3 4 3 4 3 4 3434343434 3 4 |MDS| = 2 |MDS| = 2 |MDS| = 3 |MDS| = 1 |MDS| = 1 Indispensable node Real network Node 1 Node 2 Node 3 Node 4 Neutral node Dispensable node Effect on MDS upon node removal B E Z score Human directed Node type Number of nodes Effect on node removal DispensableNeutral Indispensable -5 50 PPI network Indispensable 1330 (20.9%) Increases number of driver nodes Essential genes 6339 nodes, Neutral 2662 (41.9%) No change in number of driver nodes Mus musculus 34,813 edges Dispensable 2347 (37.1%) Decreases number of driver nodes Danio rerio Drosophila melanogaster Caenorhabditis elegans Indispensable Indispensable Evolutionary C D Conservation Saccharomyces cerevisiae Neutral Neutral Signaling proteins 5e−01 Dispensable 0.500 Dispensable Membrane receptors

0 4 8 12 048 12 Cell Protein kinases 1e−01 0.100 Average in-degree Average out-degree Signaling Transcription factors High copy number 0.020 p(k) p(k) 1e−02 Indispensable Indispensable Medium copy number Neutral Neutral Low copy number 0.005 Protein

2e−03 Dispensable Dispensable Abundance Very low copy number Acetylation 5e−04 0.001 Phosphorylation (PS/PT) 1 5 10 20 50 100 200 1 2 5 10 20 50 100 200 2 PTM Phosphorylation (PY) In-degree Out-degree Ubiquitination Modification

Fig. 1. Characterizing the controllability of human directed PPI network. (A) Schematic representation of the node classification using controllability framework. (B) Identification of indispensable, neutral, and dispensable nodes in human directed PPI network. (C) In-degree distribution and average in- degree for three different node types. (D) Out-degree distribution and average out-degree for three different node types. (E) Distinct enrichment profiles of indispensable, neutral, and dispensable nodes in the context of essential genes, evolutionary conservation, cell signaling, protein abundance, and PTMs.

Vinayagam et al. PNAS | May 3, 2016 | vol. 113 | no. 18 | 4977 indispensable and dispensable nodes (Fig. 1E and Fig. S2B). Upon infection, viruses control the host cellular network to use the Similarly, indispensable nodes are highly regulated through host resources to replicate and to evade the host immune response. PTM, including acetylation, ubiquitination, and phosphorylation Here, we analyzed the node types targeted by human viruses to (pS/pT and pY) (Fig. 1E and Fig. S2C). Altogether, our en- drive the network from a healthy state to an infectious state. First, richment analyses revealed distinct functional and regulatory we analyzed the targets of HIV, a member of the lentivirus family roles for indispensable, neutral, and dispensable nodes. that causes AIDS. Putative human genes, identified to have an effect on HIV-1 replication from large-scale functional genomic Understanding Healthy to Disease State Transition Using Network screens (data compiled from four RNAi datasets) (21–24) tend to Controllability. We analyzed the node classification in the con- be indispensable nodes (Fig. 2C, RNAi; and Fig. S3C). However, text of driving the system from healthy to disease condition and we did not detect a significant enrichment—most likely reflecting vice versa. Specifically, we analyzed the impact of three different the quality of the HIV RNAi screens (25). To analyze direct targets transitions: (i) healthy to disease transition induced by mutations of HIV, we compiled the HIV–human interactome (from recent or other genetic alterations; (ii) healthy to infectious transition literature and PPI databases) (26, 27), finding that indispensable induced by human viruses; and (iii) disease to healthy transition nodes are enriched for physical interactions with HIV proteins induced by drugs or small molecules. Note that our goal is to (Fig. 2C, PPIs; and Fig. S3C). Analysis of 208 different human– determine whether specific node types (indispensable, neutral, virus networks (26–29) reveals that human viruses commonly target or dispensable) are enriched for (i) disease-causing mutations, indispensable nodes to control the host network (Fig. 2C,Virus (ii) targets of human viruses, and (iii) drug targets. targets; and Fig. S3C). We noticed that after adjusting for literature First, we analyzed 445 genes annotated by the Sanger Center as bias indispensable nodes remain as viral targets, whereas adjusting causally implicated in oncogenesis (Cancer Gene Census; cancer. for degree bias shows only weak enrichment (Fig. S3D). This sanger.ac.uk/census) (17). Interestingly, we found that indispensable finding is in agreement with the previous observations that viruses nodes are highly enriched in cancer genes, whereas neutral nodes tend to target hubs (30). showed no enrichment and dispensable nodes are underrepresented Finally, we characterized the network controllability in the (Fig. 2A,CancerI;Fig. S3A;andDataset S2). To ensure that the context of driving the system from disease to healthy state. Spe- observed enrichment of indispensable nodes is not attributable cifically, we analyzed the node types that are targeted by the drugs/ to the literature and degree bias, we repeated our analysis using small molecules (Fig. 2D). By analyzing the targets of drugs ap- literature- and degree-controlled random sets (SI Text). After proved by the Food and Drug Administration (FDA) (31), we adjusting for literature and degree bias (Fig. 2A, PubMed and found that indispensable nodes are enriched for drug targets (Fig. Degree; and Dataset S2), indispensable nodes remain signifi- 2D, FDA targets; and Fig. S3 E and F). Extending the analysis to cantly enriched as cancer genes. Note that for enrichment analysis the list of proteins that are annotated as druggable (32), i.e., a below, the degree- and literature-controlled enrichments results presence of protein folds that favor interactions with drug-like were shown in Fig. S3B. To further substantiate that indispensable chemical compounds, showed that the druggable genome list is nodes are enriched as cancer genes, we analyzed 3,164 genes not significantly enriched for indispensable nodes (Fig. 2D,DI; predicted as cancer related genes (18) and observed a similar en- and Fig. S3E). Interestingly, analyzing the druggable genome list by richment for indispensable nodes (Fig. 2B,CancerII;andFig. S3A). excluding FDA-approved drug targets showed underrepresentation Next, we analyzed 1,403 genes annotated by Online Mendelian of indispensable nodes (Fig. 2D,DI;andFig. S3E). This finding Inheritance in Man (OMIM) (omim.org) as causal genes for suggests a potential application of our analysis to redefine the various genetic diseases, aiming to test whether the perturbation druggable genome based on the network controllability. of indispensable nodes is a specific feature of cancer or a general All of the above analyses of disease mutations, viruses, and feature of human diseases. Our analysis showed that the per- drugs consistently showed that indispensable nodes are preferred turbation of indispensable nodes is a common feature of human targets. We also analyzed how often indispensable nodes act as diseases (Fig. 2B, OMIM; and Fig. S3A). Interestingly, however, driver nodes by using a recently developed approach to identify our analysis of disease genes identified from genome-wide as- the role of each node as drivers in the MDSs (33). We found that sociation studies (GWAS) (www.genome.gov/gwastudies) (19) 378 nodes appear in all MDSs (i.e., they play roles in all of the revealed poor enrichment for indispensable nodes (Fig. 2B, control configurations), 3,330 nodes are in some but not all MDSs GWAS; and Fig. S3A), most likely reflecting the fact that GWAS (i.e., they play roles in some control configurations but the network identify genomic regions but not specific coding genes that cause can still be controlled without directly controlling them), and 2,631 the disease (20). Because indispensable nodes are enriched for nodes do not belong to any MDS (i.e., they play no roles in control) causal mutations (Fig. 2 A and B), our resource could help (Dataset S1) (33). Interestingly, we found that indispensable nodes identify causal genes from GWAS. are never driver nodes in any MDS (Fig. S3G and Dataset S1). We also characterized the network controllability in the context This fact can actually be rigorously proven (SI Text). Moreover, of host–parasite interactions, specifically human–virus interactions. perturbing indispensable nodes increases the number of driver

15 15 15 15 ABIndispensable CD Neutral 10 10 Dispensable 10 10

5 5 5 5

0 0 0 0 Z score Z score Z score Z score

−5 −5 −5 −5

−10 −10 −10 −10

Cancer I PubMed Degree Cancer II OMIM GWAS RNAi PPIs Virus FDA D I D II controlled random sets HIV targets targets targets Druggable genome

Fig. 2. Characterizing network controllability in transition from healthy to disease state. (A) Bar graph showing the enrichment results (z scores) of cancer genes compared with the random sets (Cancer I, cancer gene census) and the random sets controlled for literature (PubMed) or degree (Degree) bias. In the case of degree- or literature-controlled random sets, the random sets are sampled such that the average degree or average PubMed records of random sets matches the average of node type N. (B) Results from enrichment analysis of dataset corresponding to extended list of cancer genes (Cancer II), other human diseases (OMIM), and GWAS. (C) Results from enrichment analysis of the targets of HIV identified using RNAi screens (RNAi) and PPI networks (PPIs) and targets of other human virus (208 viruses). (D) Enrichment results from targets of FDA-approved drugs and druggable genome. DI, druggable genome; DII, druggable genome excluding FDA-approved targets.

4978 | www.pnas.org/cgi/doi/10.1073/pnas.1603992113 Vinayagam et al. nodes to control, suggesting that, from a controllability perspective, Degree distribution of the subtypes shows that type-I nodes these nodes are fragile points in the network. tend to be hubs, whereas the average degree of type-II nodes is We further analyzed indispensable nodes in specific signaling similar to the average degree of the network (Fig. 3D). Indeed, pathways such as receptor tyrosine kinase (RTK) signaling type-II nodes cannot be distinguished from the rest of the nodes pathways, which are commonly perturbed in cancer (34). Strik- based on any other network properties analyzed (Fig. S5G). ingly, 67 out of 170 RTK pathway members are indispensable Furthermore, type-I nodes show literature and annotation bias nodes (P < 0.0001), including 51 indispensable nodes targeted by compared with type-II nodes (Fig. 3 E and F). With respect to disease mutations, viruses, or drugs (Fig. S4A and Dataset S2). diseases, both node types show similar enrichment for cancer Furthermore, we identified 21 indispensable nodes from differ- genes and other human diseases (Fig. 3G). In contrast to type-I ent signaling pathways that are shared targets of cancer muta- nodes that tend to be hubs and well-studied genes, type-II nodes tions, viruses, as well as drugs (Fig. S4B and Dataset S2). are poorly studied and show no special network feature except indispensability, suggesting that control theory brings orthogonal Robustness of Indispensable Node Classification. The false-positive information to traditional network analysis. and false-negative interactions are major concerns in PPI networks, especially the false negatives because the current net- Applying Network Controllability Analysis to Mine Cancer Genomic works are vastly incomplete (35). Hence, we systematically Data. Our finding that indispensable nodes (both type-I and type-II) analyzed the robustness of node classification with respect to are more likely to correspond to cancer genes prompted us to adding or removing interactions. Specifically, we analyzed the systematically survey the perturbation of those genes in cancer. indispensable node classification as a function of removing edges We analyzed data from 1,547 patients obtained from The Cancer (or network filtering). The network filtering is achieved by using Genome Atlas (TCGA) (cancergenome.nih.gov) and cBioPortal a confidence score assigned to edge directions, where the most for Cancer Genomics (36), representing nine different cancer stringent filtering resulted in smaller high-confidence–directed types (Dataset S4). Specifically, we analyzed the amplification or networks (20,151 edges and 5,317 nodes). We analyzed the deletion of type-II indispensable nodes in nine cancer types. controllability of filtered networks and compared them to the Note that the copy number alteration (CNA) data are normal- ized to the expression levels to identify the amplification or de- original network. The results show that 90% of the indispensable letion that results in expression level changes (SI Text). We nodes in the stringent filtered network are indispensable in the A A B ranked all genes based on the number of patients where the gene original network (Fig. 3 , Fig. S5 and , and Dataset S3), is amplified or deleted and selected the top 1% as frequently suggesting that the indispensable node classification is robust amplified/deleted genes; 56 type-II genes were identified as part with respect to adding or removing edges in the network. of the top 1% of deleted/amplified genes in nine cancer types Next, we analyzed the controllability of networks with per- (Fig. 4A and Dataset S4). Strikingly, 10 of 56 type-II genes are turbations (e.g., edge rewiring or edge-direction flipping). In the known cancer genes, an overlap that is highly significant (P = case of random rewiring, up to 100% of the edges are rewired 0.00002) (Fig. 4B and Fig. S6A). Interestingly, the frequency of (node degrees are preserved), and in the case of direction-flipped deletion and amplification of type-II indispensable nodes is not networks, up to 100% of the edge directions are reversed. We significantly enriched compared with random sets, an observation observed that up to 50% of indispensable nodes in the rewired or thatwassimilartocancergenecensusgenelist(Dataset S4). direction-flipped network do not agree with the original annota- Furthermore, we compared the type-II genes with results from a SCIENCES tion, showing that indispensability is highly sensitive to the con- cell proliferation screen (37) that identified a subset of genes that nectivity pattern and edge direction (Fig. 3B, Fig. S5 C–F,and regulate cell proliferation (“GO” genes induce the proliferation APPLIED BIOLOGICAL Dataset S3). Comparing indispensable nodes of the real network and “STOP” genes suppress the proliferation); 17 of 56 genes to that of the rewired (100% rewiring) and flipped (40% flipping) represent regulators of cell proliferation (11 GO genes, 8 STOP networks revealed two subtypes (type-I and type-II) of in- genes, and 2 genes part of both GO and STOP genes) (Fig. 4C and dispensable nodes (Fig. 3C and Dataset S3). If a node’sin- Fig. S6 B and C). The overlap between type-II genes and GO genes dispensability is robust to rewiring or flipping, then we call it a are statistically significant (P = 0.0003). Of 56 genes, 10 genes are type-I node; if the node’s indispensability is sensitive to rewiring or frequently perturbed in multiple cancer types [e.g., proteasome 26S flipping, then we refer to them as type-II nodes. We found that subunit, non-ATPase, 4 (PSMD4) in four different cancers], and 57% of indispensable nodes are type-I nodes and 43% are type-II. all of them show similar deletion or amplification profile (e.g.,

AB1.0 1.0 C Sensitive to both edge rewiring and flipping direction 0.8 0.8 12.2% Sensitive to flipping 22.6% 0.6 0.6 direction 57.5% Filtered indispensable nodes Flipped Rewired indispensable nodes indispensable nodes Sensitive to 7.7% Fraction of unchanged Fraction of unchanged 0.4 0.4 edge rewiring 0.0 0.2 0.4 0.6 0.8 1.0 Type-II

0.5 0.6 0.7 0.8 0.9 topology) network to (sensitive Type-I Edge score cutoff Fraction of edges randomized indispensable nodes (robust to edge rewiring and flipping direction)

DEF G 8 20 Type-I 250 Type-I Type-I Type-I Type-II Type-II 30 Type-II Type-II 15 200 6 150 20 10 4 100 10 Z score 5 50 2

0 0 GO terms Average 0 0 Average node degree

In-degree Out-degree records PubMed Average PubMed GO terms Cancer OMIM

Fig. 3. Perturbation of network connectivity reveals two subtypes of indispensable nodes (type-I and type-II). (A) Plot showing the fraction of indispensable nodes in filtered networks that overlaps with the real network. The network filtering achieved using edge confidence score. (B) Fraction of indispensable nodes in rewired or direction-flipped overlap with the real network. (C) Identification of type-I and type-II indispensable nodes. The average node degree (D), PubMed record association (E), and Gene Ontology (GO) term annotations (F) for type-I and type-II indispensable nodes. (G) Enrichment of type-I and type-II indispensable nodes as cancer genes and OMIM disease genes.

Vinayagam et al. PNAS | May 3, 2016 | vol. 113 | no. 18 | 4979 PSMD4 amplified in all four cancers) (Fig. 4D). Almost half of identification of the indispensable nodes as primary targets of the genes (23 genes) are poorly studied, with less than 50 asso- diseases causing mutations, viruses, and drugs revealed a potential ciated PubMed records; for instance, small G protein signaling application of this framework to identify novel disease genes and modulator 2 (SGSM2) is associated with only eight PubMed potential drug targets. records (Fig. 4D). These contextual evidences, along with the Interestingly, disease-causing mutations, viruses, and drugs target indispensability, suggest that these 46 type-II nodes could be fragile points (indispensable nodes) that determine the number of potential cancer genes. driver nodes rather than the driver nodes themselves, suggesting that network controllability is crucial in transitioning between Database of Directed PPI Network with Predicted Controllability. We healthy and disease states. Although network topology-based prop- created the DirectedPPI database (www.flyrnai.org/DirectedPPI) erties such as hubs and modules are commonly used to identify to navigate the directed human PPI network with predicted disease genes (41–44), the controllability perspective provides a controllability. Users can enter a gene or upload a list of genes complementary network analysis framework for network medicine. and our tool generates a network with directed edges connecting In particular, type-II nodes that are not distinguishable from existing the input list. Our tool also accepts gene list with values (e.g., network properties and without publications bias were still identi- mutation frequency, P values from GWAS, or expression changes). fied by our controllability framework as nodes of special interest. Three different node types (indispensable, neutral, and dispensable) We envision that in the future, improving the quality and the com- are distinguished with node shape and color and for these nodes all pleteness of interactome maps and integrating dynamics of of the properties analyzed in this article are displayed. This tool will network components would hugely impact our understanding be useful to analyze disease datasets and other high-throughput of biological networks both in the context of biological function datasets to identify indispensable nodes and their interconnections. and human disease. Discussion Methods Studying the controllability of a complex biological network is Datasets and Enrichment Analysis. All of the datasets used for the enrichment rather difficult, because of the fact that we typically do not know analysis in this study are listed in Dataset S2. Details on the directed human the true functional form of the underlying dynamics. However, PPI network, random networks, and datasets used for enrichment analysis most biological systems operate near homeostasis, so local can be found in SI Text. properties are indeed what we want to ascertain. Here, we showed that application of linear control tools to study the local Controllability Analysis and Node Classification. To identify MDS, with size structural controllability of inherent nonlinear biological net- denoted as ND, whose control is sufficient to ensure the structural controlla- works provides meaningful predictions. Furthermore, we dem- bility of linear dynamics (14) and local structural controllability for nonlinear onstrated that local controllability tools help identifies known dynamics (SI Text) on any directed weighted network, we can map the struc- human diseases genes and this can be used to identify novel tural controllability problem in control theory to the maximum matching problem in graph theory, which can be solved in polynomial time (15). disease genes and drug targets. Our analysis of directed human PPI network identifies 36% of After a node is removed, denote the minimum number of driver nodes of the damaged network as N ′. We can classify nodes into three categories: the nodes as driver nodes, which is similar to what has been ob- D ∼ (i) a node is indispensable if in its absence we have to control more driver served in metabolic networks ( 30%) (14). The node classification nodes (i.e., N ′ > N ); (ii) a node is dispensable if in its absence we have N ′ < based on network controllability shows distinct biological prop- D D D ND); and (iii) a node is neutral if in its absence ND′ = ND. Note that in- erties in the context of essentiality, conservation and regulation. dispensable nodes are never driver nodes in any control configurations or Specifically, we found that indispensable nodes are well conserved, MDSs,whichcanbeprovenbycontradiction(SI Text). More information highly regulated at the level of translational and PTMs, and im- on node classification and local structural controllability can be found in portant for the transition between healthy and disease states. In- SI Text. terestingly, this enrichment pattern is partially shared by the nodes in the minimum dominating sets that are located in strategically im- Random Networks. To compare the real network with its randomized portant positions in controlling the network (38–40). Furthermore, counterparts, we performed two types of randomization: (i) edge rewiring:

GNA11 A 8 D AP2B1 PPP2CB Amplification STK11 6 Deletion PAFAH1B2 TUBG1 TAF1C HERPUD1 POLR2C 4 AURKA H3F3A UCEC ENAH 2 PPP1R16A

Number of SMARCA2

Type-II genes Type-II PIP5K1A 0 BRCA F11R C G AD OV SGSM2 LUAD RAD23B GBM KIR LG PSMD4 L3MBTL2 BRCA CO LUAD LUSC UCEC UBAC2 CCT3 AKAP10 PRPF6 CSE1L ITCH BTF3 ERBB2IP B Significant overlap OV (p-value 0.00002) SOCS6 BLMH Type-II nodes ZNF24 COAD PIK3CA frequently Cancer Gene FLII ATR 46 10 435 RALBP1 RNF19A CCT5 deleted or Census CDKN2A amplified PSMC1 LUSC NPM1 CTDP1 MARK3 GBM AP3B1 CHMP2A RPL18A Significant overlap ETF1 PTEN C RNF4 (p-value 0.0003) GLI3 MED10 Type-II nodes LGG TRIP13 frequently 11 1089 GO genes Regulators DOCK1 NUP98 deleted or 39 of cell STOP genes Type-II nodes part of Cancer genes 8 3753 preliferation ING4 NR2F6 amplified Type-II nodes not part of Cancer genes Overlap is not significant KIRC UBE2D2 or Identified as regulator of cell proliferation

Fig. 4. Applying network controllability to mine cancer genomic data. (A) Type-II genes frequently amplified or deleted in cancer patients (part of top 1% genes). The bar plot shows number of type-II genes deleted/amplified in brain lower grade glioma (LGG), kidney renal clear cell carcinoma (KIRC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), ovarian serous cystadenocarcinoma (OV), uterine corpus endometrial carcinoma (UCEC), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), and glioblastoma multiforme (GBM) cancers. (B) Overlap between frequently deleted/amplified type-II genes and known cancer genes. (C) Overlap between frequently deleted/amplified type-II genes and regulators of cell proliferation (STOP genes reduces cell proliferation and GO genes increases cell proliferation). The P values show the significance of overlap calculated based on 1000 random sets. (D) Network representation of 56 type-II genes frequently deleted (red edge) or amplified (blue edge) in nine different cancer types. The node size corresponds to the number of PubMed records associated with the gene.

4980 | www.pnas.org/cgi/doi/10.1073/pnas.1603992113 Vinayagam et al. we randomly choose a p fraction of edges to rewire, using the degree- whether the amplification or deletion results in expression change for the preserving random rewiring algorithm (45); and (ii) edge flipping: we ran- corresponding gene. A gene is defined as amplified if the Genomic Identifi- domly choose a p fraction of edges to flip their directions. We tune p from cation of Significant Targets in Cancer (GISTIC) score is ≥1andthez score is ≥1.5 0 up to 1, resulting in a series of randomized networks. and deleted if the GISTIC score (46) is ≤−1andthez score is ≤−1.5 (SI Text).

Analysis of Cancer Genomic Datasets. Copy number alteration data for nine ACKNOWLEDGMENTS. We thank M. W. Kirschner, S. E. Mohr, I. T. Flockhart, cancer types were downloaded from the cBioPortal for Cancer Genomics S. Rajagopal, and Bingbo Wang for helpful suggestions. This work was (www.cbioportal.org). Gene expression data for each cancer type were supported by National Institutes of Health (NIH) Grants P01-CA120964, downloaded from TCGA (https://tcga-data.nci.nih.gov/tcga). Next, we filtered Centers of Excellence of Genomic Science (CEGS) 1P50HG004233, and for patients with both CNA and expression data available (details are available 1R01HL118455-01A1 and John Templeton Foundation Awards PFI-777 and in Dataset S4). We computed a z score for each gene in a patient to identify 51977. N.P. is supported by the Howard Hughes Medical Institute.

1. Isidori A (1995) Nonlinear Control Systems (Springer, Berlin, New York), 3rd Ed. 34. Blume-Jensen P, Hunter T (2001) Oncogenic kinase signalling. Nature 411(6835):355–365. 2. Kalman RE (1963) Mathematical description of linear dynamical systems. J Soc Indust 35. Venkatesan K, et al. (2009) An empirical framework for binary interactome mapping. Appl Math Ser A Control 1(2):152–192. Nat Methods 6(1):83–90. 3. Slotine JJE, Li W (1991) Applied Nonlinear Control (Prentice Hall, Englewood Cliffs, 36. Gao J, et al. (2013) Integrative analysis of complex cancer genomics and clinical NJ). profiles using the cBioPortal. Sci Signal 6(269):pl1. 4. Iglesias PA, Ingalls BP (2010) Control Theory and Systems Biology (MIT Press, Cam- 37. Solimini NL, et al. (2012) Recurrent hemizygous deletions in cancers may optimize bridge, MA). proliferative potential. Science 337(6090):104–109. 5. Del Vecchio D, Murray RM (2015) Biomolecular Feedback Systems (Princeton Univ 38. Wuchty S (2014) Controllability in protein interaction networks. Proc Natl Acad Sci Press, Princeton). USA 111(19):7156–7160. 6. Balázsi G, van Oudenaarden A, Collins JJ (2011) Cellular decision making and bi- 39. Nacher JC, Akutsu T (2012) Dominating scale-free networks with variable scaling ex- ological noise: From microbes to mammals. Cell 144(6):910–925. ponent: Heterogeneous networks are not difficult to control. New J Phys 14(7):073005. 7. Luenberger DG (1979) Introduction to Dynamic Systems: Theory, Models, and Appli- 40. Nacher JC, Akutsu T (2014) Analysis of critical and redundant nodes in controlling cations (Wiley, New York). directed and undirected complex networks using dominating sets. J Complex Netw 8. Csete ME, Doyle JC (2002) Reverse engineering of biological complexity. Science 2(4):394–412. 295(5560):1664–1669. 41. Rambaldi D, Giorgi FM, Capuani F, Ciliberto A, Ciccarelli FD (2008) Low duplicability 9. Mangan S, Alon U (2003) Structure and function of the feed-forward loop network and network fragility of cancer genes. Trends Genet 24(9):427–430. – motif. Proc Natl Acad Sci USA 100(21):11980 11985. 42. Barabási AL, Gulbahce N, Loscalzo J (2011) Network medicine: A network-based ap- 10. Callura JM, Cantor CR, Collins JJ (2012) Genetic switchboard for synthetic biology proach to human disease. Nat Rev Genet 12(1):56–68. – applications. Proc Natl Acad Sci USA 109(15):5850 5855. 43. Taylor IW, et al. (2009) Dynamic modularity in protein interaction networks predicts 11. Nepusz T, Vicsek T (2012) Controlling edge dynamics in complex networks. Nat Phys breast cancer outcome. Nat Biotechnol 27(2):199–204. – 8(7):568 573. 44. Jin Y, et al. (2012) A systems approach identifies HIPK2 as a key regulator of kidney – 12. Kiel C, Yus E, Serrano L (2010) Engineering signal transduction pathways. Cell 140(1):33 47. fibrosis. Nat Med 18(4):580–588. 13. Vinayagam A, et al. (2011) A directed protein interaction network for investigating 45. Maslov S, Sneppen K (2002) Specificity and stability in topology of protein networks. intracellular signal transduction. Sci Signal 4(189):rs8. Science 296(5569):910–913. 14. Liu YY, Slotine JJ, Barabási AL (2011) Controllability of complex networks. Nature 46. Beroukhim R, et al. (2007) Assessing the significance of chromosomal aberrations in 473(7346):167–173. cancer: Methodology and application to glioma. Proc Natl Acad Sci USA 104(50): 15. Hopcroft JE, Karp RM (1973) An n^{5/2} algorithm for maximum matchings in bi- 20007–20012. partite graphs. SIAM J Comput 2:225–231. 47. Sontag ED (1998) Mathematical Control Theory: Deterministic Finite Dimensional

16. Ashburner M, et al.; The Gene Ontology Consortium (2000) Gene ontology: Tool for SCIENCES Systems (Springer, New York), 2nd Ed. the unification of biology. Nat Genet 25(1):25–29. 48. Sastry S (1999) Nonlinear System: Analysis, Stability, and Control (Springer, New 17. Futreal PA, et al. (2004) A census of human cancer genes. Nat Rev Cancer 4(3):177–183. York). APPLIED BIOLOGICAL 18. Higgins ME, Claremont M, Major JE, Sander C, Lash AE (2007) CancerGenes: A gene 49. Lin C-T (1974) Structural controllability. IEEE Trans Automat Contr 19(3):201–208. selection resource for cancer genome projects. Nucleic Acids Res 35(Database issue): 50. Luo H, Lin Y, Gao F, Zhang CT, Zhang R (2014) DEG 10, an update of the database of D721–D726. essential genes that includes both protein-coding genes and noncoding genomic el- 19. Hindorff LA, et al. (2009) Potential etiologic and functional implications of genome-wide ements. Nucleic Acids Res 42(Database issue):D574–D580. association loci for human diseases and traits. Proc Natl Acad Sci USA 106(23):9362–9367. 51. Chen WH, Minguez P, Lercher MJ, Bork P (2012) OGEE: An online gene essentiality 20. McCarthy MI, Hirschhorn JN (2008) Genome-wide association studies: Potential next database. Nucleic Acids Res 40(Database issue):D901–D906. steps on a genetic journey. Hum Mol Genet 17(R2):R156–R165. 52. Hu Y, et al. (2011) An integrative approach to ortholog prediction for disease-focused 21. Brass AL, et al. (2008) Identification of host proteins required for HIV infection and other functional studies. BMC Bioinformatics 12:357. through a functional genomic screen. Science 319(5865):921–926. 53. Hornbeck PV, et al. (2012) PhosphoSitePlus: A comprehensive resource for investigating the 22. König R, et al. (2008) Global analysis of host-pathogen interactions that regulate structure and function of experimentally determined post-translational modifications in early-stage HIV-1 replication. Cell 135(1):49–60. – 23. Yeung ML, Houzet L, Yedavalli VS, Jeang KT (2009) A genome-wide short hairpin RNA man and mouse. Nucleic Acids Res 40(Database issue):D261 D270. screening of jurkat T-cells for human proteins contributing to productive HIV-1 rep- 54. Ben-Shlomo I, Yu Hsu S, Rauch R, Kowalski HW, Hsueh AJ (2003) Signaling receptome: lication. J Biol Chem 284(29):19463–19473. A genomic and evolutionary perspective of plasma membrane receptors involved in 24. Zhou H, et al. (2008) Genome-scale RNAi screen for host factors required for HIV signal transduction. Sci STKE 2003(187):RE9. replication. Cell Host Microbe 4(5):495–504. 55. Park J, et al. (2005) Building a human kinase gene repository: Bioinformatics, mo- – 25. Goff SP (2008) Knockdown screens to knockout HIV-1. Cell 135(3):417–420. lecular cloning, and functional validation. Proc Natl Acad Sci USA 102(23):8114 8119. 26. Jäger S, et al. (2011) Global landscape of HIV-human protein complexes. Nature 56. Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S (2002) The protein kinase – 481(7381):365–370. complement of the human genome. Science 298(5600):1912 1934. 27. Zanzoni A, et al. (2002) MINT: A Molecular INTeraction database. FEBS Lett 513(1): 57. Messina DN, Glasscock J, Gish W, Lovett M (2004) An ORFeome-based analysis of 135–140. human transcription factor genes and the construction of a microarray to interrogate 28. Navratil V, et al. (2009) VirHostNet: A knowledge base for the management and the their expression. Genome Res 14(10B):2041–2047. analysis of proteome-wide virus-host interaction networks. Nucleic Acids Res 37(Database 58. Beck M, et al. (2011) The quantitative proteome of a human cell line. Mol Syst Biol issue):D661–D668. 7:549. 29. Rozenblatt-Rosen O, et al. (2012) Interpreting cancer genomes using systematic host 59. Woodsmith J, Kamburov A, Stelzl U (2013) Dual coordination of post translational network perturbations by tumour virus proteins. Nature 487(7408):491–495. modifications in human protein networks. PLOS Comput Biol 9(3):e1002933. 30. Calderwood MA, et al. (2007) Epstein-Barr virus and virus human protein interaction 60. Fazekas D, et al. (2013) SignaLink 2 - a signaling pathway resource with multi-layered maps. Proc Natl Acad Sci USA 104(18):7606–7611. regulatory networks. BMC Syst Biol 7:7. 31. Knox C, et al. (2011) DrugBank 3.0: A comprehensive resource for ‘omics’ research on 61. Shannon P, et al. (2003) Cytoscape: A software environment for integrated models of drugs. Nucleic Acids Res 39(Database issue):D1035–D1041. biomolecular interaction networks. Genome Res 13(11):2498–2504. 32. Hopkins AL, Groom CR (2002) The druggable genome. Nat Rev Drug Discov 1(9):727–730. 62. Doncheva NT, Assenov Y, Domingues FS, Albrecht M (2012) Topological analysis and 33. Jia T, et al. (2013) Emergence of bimodality in controlling complex networks. Nat interactive visualization of biological networks and protein structures. Nat Protoc Commun 4:2002. 7(4):670–685.

Vinayagam et al. PNAS | May 3, 2016 | vol. 113 | no. 18 | 4981 Supporting Information

Vinayagam et al. 10.1073/pnas.1603992113

SI Text become unmatched (i.e., a new driver node), rendering ND′ = ND; Directed Human PPI Network. The directed human PPI network was case 2.2: if none of node i’s downstream neighbors are matched by i i N ′ = N − compiled from our previous study (13). Briefly, a Naïve Bayesian node , then in the absence of node , D D 1. In all of the N ′ > N classifier was applied to predict potential direction of signal flow cases, we do not have D D, which is in contrast to the defi- between the ith and jth interacting proteins pi and pj as pi → pj, pj → nition of indispensable nodes. Hence, driver nodes cannot be in- pi or both. The classifier uses features derived from the shortest dispensable. PPI paths between membrane receptors and transcription factors and assigns confidence for each predicted edge directions ranging Enrichment Analysis. To estimate the significance of overlap be- S D from 0.5 to 1. The weighted and directed edges are then encoded tween a given node type and given dataset , we compute an z in an N × N matrix, and A is denoted as the weighted adjacency enrichment score as matrix of the directed graph for the PPI network. The element of A i j a ðSD − mean of RDÞ in the th row and th column is denoted as ij and is defined as z score = , follows: aij is in the range [0.5,1] if there is signal flow from SD of RD protein pj to pi otherwise aij = 0. where SD is number of proteins from dataset D overlapping with Controllability Analysis and Node Classification. Recently, we de- node type S and RD is the number of proteins from dataset D veloped a mathematical framework and analytical tools to identify overlapping with random set of proteins of same size as N. Mean MDS, with size denoted as ND, whose control is sufficient to and SD of RD is computed from 1,000 simulations of random sets. ensure the structural controllability of linear dynamics (14) and Note that the entire network with 6,339 proteins is used as the local structural controllability for nonlinear dynamics (SI Text, background for random sampling. In addition to the z score, we Local Structural Controllability) on any directed weighted net- also computed the P value (two-tailed) by comparing the SD with work. This is achieved by mapping the structural controllability RD distribution (modeled as Gaussian distribution). In the case of problem in control theory to the maximum matching problem in degree- or literature-controlled random sets, the random sets are graph theory, which can be solved in polynomial time (15). Here, sampled such that the average degree or average PubMed records an edge subset M in a directed network or digraph is called a of random sets matches the average of node type S. “matching” if no two edges in M share a common starting node or a common ending node. A node is “matched” if it is an ending Datasets Used for Enrichment Analysis. All of the datasets used for node of an edge in the matching. Otherwise, it is “unmatched.” the enrichment analysis in this study are listed in Dataset S2. A matching of maximum cardinality is called a “maximum This includes the source of the data, reference, number of matching.” (In general, there could be many different maximum proteins compiled, and overlap with human directed PPI net- matchings for a given digraph.) We proved that the unmatched work. The datasets were downloaded from respective databases nodes that correspond to any maximum matching can be chosen or publications as mentioned in Dataset S2. The gene or pro- as driver nodes to control the whole network. Identifying a tein IDs from various resources were mapped to Entrez gene minimum set of driver nodes is equivalent to choosing an input IDs. All compiled datasets are available as an integrated table matrix (often denoted as B) with the minimum number of col- that shows the nodes and the overlap with respective datasets umns (see SI Text, Local Structural Controllability and ref. 14 for (Dataset S1). more details). The detailed construction of the input matrix B is not necessary for the identification of driver nodes. This is only Analysis of Cancer Genomic Datasets. Copy number alteration data mentioned to connect the notion of a driver node to the theo- for nine cancer types were downloaded from the cBioPortal for retical discussions in SI Text, Local Structural Controllability. Cancer Genomics (version corresponds to April 2013; www. After a node is removed, denote the minimum number of driver cbioportal.org). Using the GISTIC algorithm (46), the cBio- nodes of the damaged network as ND′. In this work, we classified Portal provides putative values of copy number alterations for nodes into three categories. (i) A node is indispensable if in its each cancer patient. The GISTIC score −2, −1, 0, 1, 2 corre- absence we have to control more driver nodes (i.e., ND′ > ND). sponds to deep loss (possibly a homozygous deletion), single- For example, removal of any node in the middle of a directed copy loss (heterozygous deletion), diploid, low-level gain, and path will break the path and cause the ND increase. Hence, all high-amplification, respectively. The gene expression data for but the start and end nodes of a directed path are indispensable. each cancer type were downloaded from the TCGA (version (ii) A node is dispensable if in its absence we have ND′ < ND. For corresponds to April 2013; https://tcga-data.nci.nih.gov/tcga). example, removal of one leaf node in a star will decrease ND by The tumor-matched datasets (for each participant have been 1. (iii) A node is neutral if in its absence ND′ = ND. For example, analyzed and compared with normal tissue on the CNA and gene removal of the central hub in a star will not change ND at all. expression level) were used in the analysis. Level 3 TGCA data Note that a driver node in any MDS can never be an in- (expression calls for genes, per sample) was used in our study. dispensable node. This can be proven by contradiction. Assume The TCGA data were downloaded by using TCGA web interface adrivernodei is indispensable. According to the minimum with filters set as “Data Type: Expression-Genes”; “Data Level: input theorem (9), driver nodes are just unmatched nodes with Level 3”; “Tumor/Normal: Tumor-matched.” respect to a particular maximum matching. There are two ca- Next, we filtered for patients with both CNA and expression ses; case 1: the driver node i has no downstream neighbors [i.e., data available (details are available in Dataset S4). We computed z kout(i) = 0],theninitsabsence,ND′ = ND − 1; and case 2: the a score for each gene in a patient to identify whether the driver node i has at least one downstream neighbors [i.e., amplification or deletion results in expression change for the kout(i) > 0]. There are two subcases; case 2.1: if in the maxi- corresponding gene. Briefly, for each gene the diploid mean and mum matching, one of node i’s downstream neighbors (node j) SD of expression values were calculated using the data from is matched by node i, then in the absence of node i,nodej will patients without any copy number alteration (GISTIC score, 0;

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 1of8 T diploid). Using the diploid mean and SD, we computed z score where x = ½x1 x2⋯ xn is the state vector and t is time. We are for each gene in a given patient. A gene is defined as amplified if interested in determining an n × m matrix B such that the con- the GISTIC score is ≥1 and the z score is ≥1.5 and deleted if the trolled system GISTIC score is ≤−1 and the z score is ≤−1.5. All of the data dx preprocessing and normalization were performed using Perl and = fðxðtÞÞ + Bu [1] Java scripts developed in house. dt

T Local Structural Controllability. A dynamic system is controllable is locally controllable through the input u = ½u1 u2 ... um . p p p ∂f p p if, with a suitable choice of inputs, it can be driven from any Let z be defined as fðz Þ = 0, Aðz Þ = ∂x ðz Þ,andGðz Þ = initial state to any final state in finite time (2). Most complex ½B AB⋯ An−1B.The matrix G(z*) is referred to as the Kalman biological systems are characterized by nonlinear interactions controllability matrix. The dynamics in Eq. 1 are locally con- between the components, and often only local properties can trollable around zp if GðzpÞ is rank n (Theorem 7 in ref. 47; be verified. Similarly, it is often easier to obtain local analytical Proposition 11.2 in ref. 48). The local controllability analysis results for controllability of nonlinear systems. Here, we review of Eq. 1 about a trim point therefore reduces to the classic a sufficient condition for “local controllability” of a nonlinear Kalman controllability analysis (2) of the linear dynamics system about a trim point. A system is “locally controllable” if there exists a neighborhood in the state space such that all dðzðtÞ − zpÞ = AðzðtÞ − zpÞ + Bu [2] initial conditions in that neighborhood are controllable to all dt . other elements in the neighborhood with locally bounded trajectories (47). This definition of controllability can be Recall that the dynamics in Eq. 2 are deemed “structurally con- “ ” verified by checking the well-known Kalman rank condition trollable” if there exists another pair (A0,B0) with the same struc- used in the controllability analysis of linear systems. The rest ture as the pair (A,B) (49). That is, we are not concerned about of the section is tutorial in fashion so as to illustrate how the the particular values in (A,B), just the pattern of the nonzero adjacency matrix can be used to analyze the local controlla- entries in (A,B). The dynamics in Eq. 1 are deemed “locally bility of a PPI network. structurally controllable” if the linearized dynamics in Eq. 2 Consider a dynamic system governed by a set of ordinary are structurally controllable. differential equations For the purposes of this work, the adjacency matrix of the experimentally determined PPI network is used to find the dx structure of A in ref. 2, then the nodes are classified based upon = fðxðtÞÞ, dt their impact on the structural controllability (Methods).

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 2of8 a b c 5000 200 30

25 1000 150 500 20

100 15 100 50 10 50 PubMed records 5 10 Average number of PubMed records 5 0 Average number of Gene Ontology terms 0 Dispensable Neutral Indispensable Dispensable Neutral Indispensable Correlation coefficient = 0.37 1 d 5000 050100 150 200 In-degree 1000 e 500 Dispensable Neutral Indispensable

100 50 Density Density Density PubMed records 0 0.015 0 0.015 0 0.015 10 400 500 600 700 600 650 700 750 800 850 300 350 400 450 500 550 600 5 Essential genes Essential genes Essential genes Correlation coefficient = 0.41 1

0 2040 60 80 100 120 Out-degree f Dispensable Neutral Indispensable Density Density Density 0.00 0.04 0.08 0.00 0.06 0.00 0.06 2280 2300 2320 2340 2550 2600 2650 2700 1260 1280 1300 1320 1340 Conserved in Mouse Conserved in Mouse Conserved in Mouse 0.04 Density Density Density 0.00 0.02 0.04 0.00 0.02 0.00 0.03 2100 2150 2200 2250 2300 2400 2450 2500 2550 2600 1200 1220 1240 1260 1280 1300 Conserved in Fish Conserved in Fish Conserved in Fish Density Density Density 0.000 0.015 0.000 0.015 0.000 0.015 1600 1700 1800 1900 2000 2100 1900 2000 2100 2200 2300 900 950 1000 1050 1100 1150 1200 Conserved in Fly Conserved in Fly Conserved in Fly Density Density Density 0.000 0.015 0.000 0.015 0.0000.015 1600 1700 1800 1900 2000 1800 1900 2000 2100 2200 850 900 950 1000 1050 1100 1150 Conserved in Worm Conserved in Worm Conserved in Worm Density Density Density 0.000 0.015 0.000 0.015 0.000 0.015 750 800 850 900 950 1000 1050 1100 900 950 1000 1050 1100 1150 1200 450 500 550 600 650 Conserved in Yeast Conserved in Yeast Conserved in Yeast

Fig. S1. (A and B) Literature and annotation bias for the three node types. Bar plots show average PubMed records associated (A) and Gene Ontology terms annotated (B) for each node type. (C and D) Correlation of node degree vs. literature bias. The plots show the correlation of in-degree (C) and out-degree (D)to the number of PubMed records associated with each node in the entire network. (E) Enrichment analysis of essential genes. Numbers of essential genes overlapping with dispensable, neutral, and indispensable nodes are shown in red arrows. The essential genes are compiled from the Database of Essential Genes (DEG) (tubic.tju.edu.cn/deg) (50) and Online GEne Essentiality database (OGEE) (ogeedb.embl.de) (51). Numbers of essential genes overlapping with size- controlled random sets are shown in gray bars. (F) Enrichment analysis of conserved genes. Numbers of genes conserved in Mus musculus (mouse), Danio rerio (fish), Drosophila melanogaster (fly), Caenorhabditis elegans (worm), and Saccharomyces cerevisiae (yeast) are shown in red arrows, and their respective size- controlled random set distributions are shown in gray bars. The ortholog mapping was performed using the Drosophila RNAi Screening Center (DRSC) In- tegrative Ortholog Prediction Tool (DIOPT) (www.flyrnai.org/DIOPT) (52).

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 3of8 a Signaling b Copy Dispensable Neutral Indispensable Dispensable Neutral Indispensable Density Density Density Density Density Density 0 0.015 0 0.015 0 0.020 0 0.04 0 0.04

200 250 300 350 400 450 350 400 450 500 150 200 250 300 350 400 0 0.020 Signaling Proteins Signaling Proteins Signaling Proteins 140 160 180 200 220 240 160 200 240 280 80 100 120 140 160 High copy number High copy number High copy number Density Density Density 0 0.02 0 0.04 Density Density Density 0 0.020

100 150 200 250 150 200 250 300 350 400 60 80 100 120 140 160 0 0.015 0 0.015 0 0.020 Membrane Receptors Membrane Receptors Membrane Receptors 350 400 450 500 400 450 500 550 180 220 260 300 Moderate copy number Moderate copy number Moderate copy number Density Density Density 0 0.04 0 0.04 0 0.03 Density Density Density

80 100 120 140 160 180 100 120 140 160 180 40 60 80 100 120 140 0 0.020 0 0.020 0 0.020 Proteins Kinases Proteins Kinases Proteins Kinases 250 300 350 250 300 350 400 120 140 160 180 200 220 Low copy number Low copy number Low copy number Density Density Density Density Density Density 0 0.020 0 0.015 0 0.015 0 0.02 300 350 400 450 500 400 450 500 550 600 180 220 260 300 0 0.020 0 0.015 Transcription Factors Transcription Factors Transcription Factors 220 260 300 340 250 300 350 120 140 160 180 200 Very low copy number Very low copy number Very low copy number c PTM Dispensable Neutral Indispensable Density Density Density 16000 0.015 1800 2000 01900 0.015 2100 2300 0 0.015 950 1050 1150 1250 All PTMs All PTMs All PTMs Density Density Density 00.015 00.015 00.015 450 500 550 600 650 700 550 600 650 700 750 250 300 350 400 450 Acetylation Acetylation Acetylation Density Density Density 00.015 00.015 00.015 1300 1500 1700 1500 1700 1900 800 900 1000 1100 Phosphorylation S/T Phosphorylation S/T Phosphorylation S/T Density Density Density 00.015 00.015 00.015 900 1000 1100 1200 1000 1100 1200 1300 500 550 600 650 700 750 Phosphorylation Y Phosphorylation Y Phosphorylation Y Density Density Density 00.015 00.015 00.015 900 1000 1100 1200 1100 1200 1300 1400 500 600 700 800 Ubiquitination Ubiquitination Ubiquitination

Fig. S2. (A) Enrichment analysis of signaling proteins. Numbers of nodes overlapping with signaling proteins (annotated with signaling pathways in Cell Signaling Technology database www.cellsignal.com/common/content/content.jsp?id=science-pathways) (53), receptors (54), protein kinases (55, 56) (kinase. com/kinbase/index.html), and transcription factors (57) are shown in red arrows and, their respective size-controlled random set distributions in gray bars. (B) Enrichment analysis of protein abundance. Numbers of nodes overlapping with high copy numbers (>100,000 copies) (A), moderate copy numbers (5000– 100,000 copies), low copy numbers (500–5,000 copies) (C), and very low copy numbers (<500 copies) are shown in red arrows, and their respective size-controlled random set distributions in gray bars. The copy number dataset was obtained from Beck et el. (58). (C) Enrichment analysis of protein PTMs. Numbers of nodes overlapping with any PTM [Acetylation, Tyrosine Phosphorylation (Phosphorylation Y), Serine/Threonine Phosphorylation (S/T), or Ubiquitination], Acetylation, Tyrosine Phosphorylation, Serine/Threonine Phosphorylation, and Ubiquitination datasets are shown in red arrows and their respective size- controlled random set distributions in gray bars. The PTM dataset was obtained from Woodsmith et al. (59).

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 4of8 a d Dispensable Neutral Indispensable 15 Indispensable Neutral Dispensable 10 Density 0 0.04 0 0.04 0 0.03 Density Density 60 80 100 120 140 160 100 120 140 160 180 40 60 80 100 120 140 5 Cancer genes I Cancer genes I Cancer genes I

0 Z score Density Density Density 0 0.015 0 0.015 0 0.015 −5 500 600 700 800 900 800 900 1000 400 500 600 700 Cancer genes II Cancer genes II Cancer genes II −10

NonePubMed Degree None PubMed Degree None PubMed Degree Density Density Density HIV targets (RNAi) HIV targets (PPIs) Virus targets (208 viruses) 0 0.015 0 0.015 0350 0.015 400 450 500 550 600 500 550 600 650 700 250 300 350 400 OMIM disease genes OMIM disease genes OMIM disease genes e Dispensable Neutral Indispensable Density Density Density 0 0.015 0 0.020 0 0.015 400 450 500 550 450 500 550 600 650 200 250 300 350 Density GWAS GWAS GWAS Density Density 0 0.015 0 0.015 0 0.020 b 250 300 350 400 350 400 450 500 150 200 250 300 FDA targets FDA targets FDA targets

15 Indispensable Density 10 Neutral Density Density 0 0.015 Dispensable 0 0.015 0 0.020 350 400 450 500 550 600 450 500 550 600 650 200 250 300 350 5 Druggable genome I Druggable genome I Druggable genome I

0 Z score Density Density −5 Density 00.030 0.000 0.030 00.020 240 280 320 360 250 300 350 400 120 140 160 180 200 220 −10 Druggable genome II Druggable genome II Druggable genome II

−15 NonePubMed Degree None PubMed Degree None PubMed Degree f Cancer II OMIM GWAS 10 Indispensable Neutral c Dispensable 5 Dispensable Neutral Indispensable 0 Z score Density Density Density 0 0.04 0.00 0.04 0 0.020 −5 140 180 220 160 200 240 80 100 120 140 HIV targets (RNAi screens) HIV targets (RNAi screens) HIV targets (RNAi screens)

−10 NonePubMed Degree None PubMed Degree FDA approved Druggable Density Density Density 0 0.04 0 0.02 0 0.020 100 150 200 250 180 220 260 300 100 150 200 HIV targets (PPIs) HIV targets (PPIs) HIV targets (PPIs) g

100 Density Density Density Critical node 0 0.02 0200 0.020 240 280 320 0220 0.015 260 300 340 100 140 180 80 Intermittent node Common virus targets Common virus targets Common virus targets Redundant node 60

Percentage of nodes 20

0 Dispensable Neutral Indispensable

Fig. S3. (A) Enrichment analysis of disease genes. Numbers of nodes overlapping with genes causally associated with cancer (Cancer genes I, cancer gene census) (Cancer Gene Census; cancer.sanger.ac.uk/census) (17) and a list of predicted cancer genes (Cancer genes II, extended list of cancer genes) (18), annotated as disease genes in the OMIM database (omim.org) and associated with disease in GWAS (www.genome.gov/gwastudies), are shown in red arrows, and their respective size-controlled random set distributions in gray bars. (B) Enrichment analysis of disease genes using literature- and degree-controlled random sets. In the case of degree- or literature-controlled random sets, the random sets are sampled such that the average degree or average PubMed records of random sets matches the average of node type N. (C) Enrichment analysis of virus targets. Numbers of nodes overlapping with genes identified to have an adverse effect on HIV-1 replication when knocked down (RNAi screens) (21–24), human proteins that directly interact with HIV proteins (HIV targets PPI) (26, 27), and human proteins that are known to physically interact with proteins from 208 viruses (common virus targets) (26–29) are shown in red arrows, and their respective size-controlled random set distributions in gray bars. (D) Enrichment analysis of virus targets using literature- and degree-controlled random sets. Random sets are generated as explained in B.(E) Enrichment analysis of drug targets. Numbers of nodes overlapping with proteins that are targeted by FDA-approved drugs (31), proteins with domains or folds that could bind to drug-like molecules (druggable genome I) (32), and a subset of druggable genome I excluding the FDA-approved drug targets (druggable genome II) are shown in red arrows, and their respective size-controlled random set distributions in gray bars. (F) Enrichment analysis of drug targets using literature- and degree-controlled random sets. Random sets are generated as explained in B.(G) Charac- terizing indispensable, neutral, and dispensable nodes based on their roles as driver nodes. The recently developed approach is used to classify a node as critical, intermittent, or redundant if it acts as a driver node in all, some, or none of the control configurations, respectively (33). The bar graph compares the indispensable, neutral, and dispensable nodes against the critical, intermittent, and redundant node classification.

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 5of8 a Cancer OMIM Virus FDA drug Cancer OMIM Virus FDA drug Cancer OMIM Virus FDA drug Cancer OMIM Virus FDA drug ABL1 IGF1 MAPK3 RAF1 AKT1 IGF1R MAPK7 RB1 CDC42 IGF2 MAPK8 RET CSK INS MAPK9 SHC1 DVL3 INSR MET SMAD2 EEF1A1 IRS1 MYC SMAD3 EGFR JAK1 NCOA2 SMAD4 EPHB2 JUN NTRK1 SOS1 ERBB2 KRAS PAK1 TEK ERBB3 MAP2K1 PDGFRB TGFBR1 FGFR1 MAP2K4 PDPK1 TGFBR2 GSK3B MAPK1 RAC1 VAV1 HRAS MAPK14 RAC2

b chronic myelomonocytic leukemia acute leukemia Hepatitis C virus (isolate 1) DB00102 DB08574 DB01259 Influenza A virus Deltapapillomavirus 4 gastric DB07317 West Nile virus DB04716 B−cell Non−Hodgkin Lymphoma DB04003 MLLT4 PDGFRB DB01296 ERBB2 Saimiriine herpesvirus 1 HSP90AA1 JAK2 Herpesvirus saimiri DB00459 LCK Myxoma virus NFKB2 Polyomavirus HPyV7 acute promyelocytic leukemia Myeloproliferative disorder AKT1 DB07585 RARA non−Hodgkina lymphom Saimiriine herpesvirus 2 Chondrosarcoma Influenza A virus Human papillomavirus type 6b acute myelogenous leukemia Hepatitis C virus DB08175 NCOA2 DB08059 Human herpesvirus 4 non small cell lung cancer Influenza A virus Murid herpesvirus 4 Human immunodeficiency virus 1 T−cell acute lymphoblastic chronic myeloid leukemia leukemia gliobastoma DB01248 Human papillomavirus type 16 PIK3R1 Bovine papillomavirus type 1 ovarian SARS coronavirus BCL2 breast Simian virus 40 Hepatitis E virus Human herpesvirus 8 Human adenovirus 5 acute lymphocytic leukemia Human T−lymphotropic virus 1 colorectal DB08043 diffuse large B−cell lymphoma ABL1 CREBBP Human herpesvirus 4 type 1 sarcoma Murine polyomavirus Human adenovirus 12 Moloney murine leukemia virus DB08655 Human T−lymphotropic virus 3 Adeno−associated virus − 2 Human herpesvirus 4 type 2 Human adenovirus A TOP1 glioma DB02872 Mouse polyomavirus chronic lymphatic leukemia MDM2 Human adenovirus 7 DB07354 Human adenovirus 3 CDK6 Human papillomavirus type 8 small cell lung BK virus strain AS DB07379 RB1 CCND1 Murine adenovirus 1 Human herpesvirus 8 Mupapillomavirus 1 TP53 Human adenovirus E Herpesvirus saimiri JUN DB00030Human papillomavirus type 1a Alphapapillomavirus 7 Human adenovirus C B−cell acute lymphocytic leukaemia Human papillomavirus type 18 retinoblastoma DB01169 JC polyomavirus Human papillomavirus type 31 Human papillomavirus type 11 Visna/maedi virus adrenocortical Human adenovirus 2 Vaccinia virus Copenhagen DB08363 lung Human adenovirus 1

Proteins Cancer mutation Virus FDA approved drug

Fig. S4. (A) Members of receptor tyrosine signaling pathways that are predicted as indispensable nodes and targeted by cancer mutations, OMIM disease, viruses, or FDA-approved drugs. RTK pathway members are as defined by the SignaLink database (60). (B) Indispensable nodes that are targeted by all three inputs (cancer mutation, viruses, and drugs). The labels of FDA drug nodes correspond to DrugBank IDs. The network was generated using Cytoscape (61).

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 6of8 a c e

0.5 0.5 0.5

0.4 0.4 0.4

0.3 0.3 0.3

0.2 0.2 0.2

Fraction of nodes Indispensable Fraction of nodes Indispensable 0.1 Fraction of nodes Indispensable 0.1 Neutral 0.1 Neutral Neutral Dispensable Dispensable Dispensable 0.0 0.0 0.0

0.5 0.6 0.7 0.8 0.9 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 bfEdge score cutoff d Fraction of edges rewired Fraction of edges flipped 1.0 1.0 1.0

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 annotation annotation 0.4 annotation Indispensable Indispensable 0.2 0.2 Indispensable Neutral Neutral 0.2 Neutral Dispensable Dispensable Fraction of nodes with unchanged Fraction of nodes with unchanged Dispensable 0.0 0.0 Fraction of nodes with unchanged 0.0 0.5 0.6 0.7 0.8 0.9 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Fraction of edges rewired Edge score cutoff Fraction of edges flipped g 0.05 0.0030 0.25 Type-I Type-I Type-I 40 Type-I Type-II Type-II Type-II Type-II 0.0025 0.04 0.20 0.0020 30 0.15 0.03 0.0015 20 0.10 0.02 0.0010 0.01 10 0.0005 0.05 Average Closeness Centrality Average Clustering Coefficient Average Betweenness Centrality

0 0 0 Average Neighborhood Connectivity 0

Fig. S5. (A and B) Robustness of node classification. (A) The fraction of indispensable, neutral, and dispensable nodes is plotted as a function of edge filtering (filtering using edge score). (B) The fraction of nodes in the filtered network sharing the same node classification as the real network (unchanged annotation) is plotted as a function of edge filtering. (C–F) Analysis of node classification in perturbed networks. (C) The fraction of indispensable, neutral, and dispensable nodes is plotted as a function of fraction of edges rewired. (D) The fraction of nodes in the rewired network sharing the same node classification as the real network (unchanged annotation) is plotted as a function of edge rewired. (E) Same as C, but the x axis corresponds to a fraction of flipped-edge directions. (F) Same as D, but the x axis corresponds to a fraction of flipped-edge directions. (G) Comparison of network properties of type-I and type-II indispensable nodes. The network betweenness centrality, closeness centrality, clustering coefficient, and neighborhood connectivity values are calculated using the NetworkAnalyzer Cytoscape plugin (62). The gray dotted line shows the average value of the network.

abc P value = 2.5e−05 P value = 0.079 P value = 0.00037 Z score = 4.2 0.14 Z score = -1 0.20 Z score = 3.5 0.20 0.12 0.10 0.15 0.15 0.08 0.10

0.10 Density 0.06 Density

Density 0.04 0.05 0.05 0.02 0 0 0 0246810 0 5101520 0 24681012 Cancer Gene Census genes STOP genes GO genes

Fig. S6. Enrichment analysis of type-II indispensable nodes frequently amplified/deleted in cancer. Numbers of type-II indispensable nodes (frequently amplified/deleted in cancer) overlapping with genes causally associated with cancer (Cancer Gene Census) (17) (A), negative regulators of cell proliferation (STOP genes) (37) (B), and positive regulators of cell proliferation (GO genes) (37) (C) are shown in red arrows, and their respective size-controlled random set distributions in gray bars.

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 7of8 Dataset S1. Node classification and datasets used in this study

Dataset S1

Node classification: Node classification based on network controllability. The table includes node classifications (indispensable, neutral, or dispensable), their role as a driver node (critical, intermittent, or redundant), in-degree, out-degree, and all other node properties analyzed in this study. Datasets: datasets used in this study for enrichment analysis. Name, source, reference, number of genes/proteins compiled from the dataset, and overlap with the human-directed PPI network are listed.

Dataset S2. Enrichment analysis results and dataset overlaps

Dataset S2

Enrichment: results of enrichment analysis corresponding to Figs. 1E and 2 A–D. The table includes the number of overlapping genes, random mean, random SD, z score, and P value for indispensable, neutral, and dispensable nodes. Enrichment-controlled: results of degree- and literature-controlled enrichment analysis, corresponding to Figs. 1E and 2 A–D. The table is similar to Enrichment but uses degree- and literature-controlled random sets. Dataset Overlap: indispensable nodes in RTK signaling pathways and the nodes’ overlap with cancer mutations, OMIM disease, virus targets, and drug targets (corresponding to Fig. S4A). Indispensable Targets: indispensable nodes targeted by all three inputs of cancer mutations, virus targets, and drug targets (corresponds to Fig.S4B).

Dataset S3. Results from the analysis of edge-filtered network and subtype of indispensable nodes

Dataset S3

Edge filtering: results from the analysis of edge-filtered network. Edge rewiring: results from the analysis of rewired and direction-flipped networks. Subtypes: list of indispensable nodes with subtype classification (type-I and type-II). The table incudes Entrez gene ID; gene symbol; subtype annotation; results from edge-filtered, rewired, and direction-flipped networks; and overlap with the regulators of cell proliferation.

Dataset S4. Perturbed indispensable nodes in cancer genomic data

Dataset S4

TCGA dataset: the gene expression and CNAs from cancer genomic studies. Gene expression data (level 3 expression calls for genes, per sample) from TCGA and copy number alteration from cBioPortal. The overlapping samples refer to patient samples with both level 3 gene expression and CNA data available. Type- II nodes: most frequently amplified or deleted (top 1%) type-II indispensable nodes in nine different cancer types (corresponds to Fig. 4E). Enrichment: enrichment analysis of Cancer Gene Census, type-I and type-II indispensable nodes in human cancer.

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 8of8