PANDA Vignette

User manual for PANDA: Preferential Attachment based common Neighbor Distribution derived functional Associations Hua Li and Pan Tong 2014-10-30 Contents 1 Introduction 1 2 A Quick Example 1 3 File Location and Session Info 4 1 Introduction This file gives a brief introduction on the functions used in the R package of PANDA (preferential-attachment based common neighbor distribution derived associations). PANDA is designed to perform the following tasks in protein-protein interaction (PPI) networks: (1) identify significantly functionally associated protein pairs, (2) predict GO terms and KEGG pathways for proteins, (3) make a cluster of proteins based on the significant protein pairs, (4) identify subclusters whose members are enriched in KEGG pathways. For other types of biological networks, (1) and (3) can still be performed. For more details on PANDA, please refer to \PAND: a distribution to identify functional linkage from networks with preferential attachment property", or consult Hua Li ([email protected]). 2 A Quick Example The first step is to load the package from the library. > library(PANDA) Then we load the example data shipped with this package. > data(dfPPI) > data(GENE2GOtopLite) 1 \SignificantPairs" (protein pairs will be ranked by p-values if "pvalue=TRUE" is specified, otherwise by > dendMap=ProteinCluster(Pfile=OrderAll, Plot=TRUE, TextScaler=50) plot the dendrogram. pairs.This function returns anagglomerative object hierarchical in clustering the (using class \dendrogram". the If unweighted \Plot=TRUE" group is specified, average) it for21882 will proteins also of JARID2 all4255 significant 1 MTF2 PER126211 -38.851498343 PER2 LDB1 SMAD2 SMARCA410234 SMARCA2 -39.00400 SMAD3 LDB2 CPNE1 -54.93328 -39.21016 -47.30440 CPNE4> 10 head(OrderAll) -69.52455> OrderAll=SignificantPairs(PPIdb=dfPPI) 10 probabilities): 26 20 16 18 names. GO and KEGG annotations ofto proteins. demonstrate \KEGGID2NAME" the maps capability KEGG of pathway ID PANDA. The to \GENE2GOtopLite" KEGG and pathway \GENE2KEGG" are> examples of data(KEGGID2NAME) > data(GENE2KEGG) Based on the p-values (or probabilities) of the significant protein pairs obtained above, we can perform We first apply PAND to the PPI network to derive functional links between protein by using the function The \dfPPI" is a PPI network consisting of 2360 proteins and 5355 interactions that is used as an example Sym_A Sym_B Probability CommonNeighbor 10 8 6 4 2 0 JUN EED FOS ISL2 RIF1 GLI3 GLI1 BMI1 TLE6 CIITA ALX4 LDB1 LDB2 LHX9 DLX1 MTA2 MTA1 MTA3 DPF2 EZH2 EZH1 PBX1 PBX3 PER1 PER2 PER3 MTF2 CBX2 RNF2 CRY2 CRY1 PHC2 LMO4 LMO3 LMO1 CHD4 SUFU RHOJ CDH2 KIF5B RAI14 MBD3 KIF3B KIF5C PIAS1 ILKAP RHOD SALL1 SALL4 GATA1 GMNN STK38 TRP63 TRP73 SUZ12 LASS2 SAP18 EPS15 NAA20 RAB38 SFRP1 SFRP2 TCEB1 TCEB2 CDK14 THBS2 THBS1 SSBP2 SSBP3 SSBP4 AEBP2 LMX1B ARNTL RBBP7 ZFPM1 FBXO3 NUAK2 ACVR1 HDAC2 HDAC1 HOXA6 HOXB8 CPNE1 CPNE4 CPNE2 RASD2 RRAS2 PCGF2 KIFAP3 ERCC8 JARID2 PBRM1 PYGO1 SMAD7 SMAD6 KDM4B KDM4A SMAD2 SMAD3 CLOCK KDM2B UHMK1 SCMH1 ARID1B NANOG ACTL6A RPS27A RHEBL1 TGFBR1 MAP2K3 SPATA24 ZDHHC6 CTNND1 BMPR1B EPS15L1 NKIRAS1 GATAD2B GATAD2A RHOBTB1 ARHGEF6 SMARCA4 SMARCA2 SMARCB1 SMARCC1 SMARCD1 TIMELESS TCFCP2L1 2 3 The significant protein pairs generated by the function \SignificantPairs" constituted a new network, with which we can make further functional predictions. We use the functions \GOpredict" and \KEGGpredict" to perform functional enrichment analysis (p-values were calculated with Fisher's exact test) among a protein's significant partners to predict GO terms and KEGG pathways for the protein: > GP=GOpredict(Pfile=OrderAll, PPIdb=dfPPI, Gene2Annotation=GENE2GOtopLite, p_value=0.001) > head(GP) Symbol GOID 1 TGFBR1 GO:0045669 2 CBX2 GO:0071535 3 MTA1 GO:0016581 4 PBX1 GO:0007387 5 THBS2 GO:0048603 6 GATAD2A GO:0072092 GOterm Ratio 1 positive regulation of osteoblast differentiation 2/2 2 RING-like zinc finger domain binding 2/2 3 NuRD complex 4/5 4 anterior compartment pattern formation 1/1 5 fibroblast growth factor binding 1/1 6 ureteric bud invasion 1/2 Pvalue 1 5.496440e-05 2 3.592444e-07 3 2.764488e-09 4 4.237288e-04 5 4.237288e-04 6 8.474576e-04 > KP=KEGGpredict(Pfile=OrderAll, PPIdb=dfPPI, Gene2Annotation=GENE2KEGG, p_value=0.001, IDtoNAME=KEGGID2NAME) > head(KP) Symbol KEGGID PathName Ratio Pvalue 1 SAP18 hsa05217 Basal cell carcinoma 2/2 5.334780e-04 2 TIMELESS hsa04710 Circadian rhythm - mammal 4/4 5.545925e-10 We use the following function to identify subclusters (from the cluster generated by \ProteinCluster") whose members are significantly enriched in any KEGG pathway (if KGremove=TURE, \hsa05200" and \hsa01100" will be excluded from this analysis as they are too broad): 4 > SignificantSubcluster(Dendrogram=dendMap, Gene2Annotation=GENE2KEGG, PPIdb=dfPPI, KGremove=TRUE, SPoint=1, EPoint=9.7) hsa04710 0.30 0.20 0.10 0.00 PER1 PER2 PER3 CRY2 CRY1 ARNTL CLOCK 3 File Location and Session Info > sessionInfo() R Under development (unstable) (2016-12-05 r71733) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux stretch/sid locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods [7] base other attached packages: 5 [1] PANDA_0.9.9 loaded via a namespace (and not attached): [1] compiler_3.4.0 IRanges_2.9.13 parallel_3.4.0 [4] DBI_0.5-1 tools_3.4.0 memoise_1.0.0 [7] Rcpp_0.12.8 Biobase_2.35.0 AnnotationDbi_1.37.0 [10] RSQLite_1.1 S4Vectors_0.13.5 BiocGenerics_0.21.1 [13] digest_0.6.10 stats4_3.4.0 cluster_2.0.5 [16] GO.db_3.4.0.

Load more