Logic Programming for Big Data in Computational Biology
Total Page:16
File Type:pdf, Size:1020Kb
Logic Programming for Big Data in Computational Biology Nicos Angelopoulos Wellcome Sanger Institute Hinxton, Cambridge [email protected] 18.9.18 overview I knowledge for Bayesian machine learning over model structure I applied knowledge representation for biological data analytics Bayesian inference of model structure (Bims) A Bayesian machine learning system that can model prior knowledge by means of a probabilistic logic programming. Nonmeclature I DLPs = Distributional logic programs I Bims = Bayesian inference of model structure Timeline I Theory (York, 2000-5) I Applications (Edinburgh, 2006-8, IAH 2009, NKI 2013) I Bims library and theory paper 2015-2017 Bims Overview I syntax of DLPs I a succinct classification tree prior program I Bayesian learning of model structure I learning classification and regression trees I Bayesian learning of Bayesian networks I thebims library DLPs- description We extend LP's clausal syntax with probabilistic guards that associate a resolution step using a particular clause with a probability whose value is computed on-the-fly. The intuition is that this value can be used as the probability with which the clause is selected for resolution. Thus in addition to the logical relation, a clause defines over the objects that appear as arguments in its head, it also defines a probability distribution over aspects of this relation. DLPs example member(H; [Hj T ]): member(El; [ HjT ]) :− (C1) member(El; T ): L :: length(List; L) ∼ El :: umember(El; List)(G1) 1 :: L :: umember(El; [EljTail]): (C ) L 2 1 1 − :: L :: umember(El; [HjTail]) :− (C ) L 3 umember(El; Tail): DLPs probabilistic goals 1 :: L :: umember(El; [EljTail]): (C ) L 4 1 1 − :: L :: umember(El; [HjTail]) :− (C ) L 5 K is L − 1; K :: umember(El; Tail): DLPs query ? − umember(X ; [a; b; c]): X = a (1=3 of the times = 1=3); X = b (1=3 of the times = 2=3 ∗ 1=2); X = c (1=3 of the times = 2=3 ∗ 1=2 ∗ 1): simple tree prior ?- cart( ζ, ξ, A,M). M=nd(x2,1,nd(x1,0,lf,lf),lf) (C0) cart(ζ; ξ; M; Cart): − 0 is ζ, 0: split(0; ζ; ξ; M; Cart). (C1) D : split(D; ζ; ξ; MB ; nd(F ; Val; L; R)) : − −ξ D+1 is ζ ∗ (1 + D) , D1 is D + 1, r select(F ; Val; MB ; LB ; RB ), D+1: split(D1; ζ; ξ; LB ; L), D+1: split(D1; ζ; ξ; RB ; R). (C2) 1 − D : split(D; ζ; ξ; MB ; lf ): Bims theory Bayes' Theorem p(DjM)p(M) p(MjD) = P M p(DjM)p(M) Metropolis-Hastings q(M∗; Mi )P(DjM∗)P(M∗) α(Mi ; M∗) = min ; 1 q(Mi ; M∗)P(DjMi )P(Mi ) DLP defined model space From Mi identify Gi then sample forward to M?. q(Mi ; M?) is the probability of proposing M? when Mi is the current model. Pyruvate kinase interactors objective improve chances of discovering binding molecules based on examples from screened chemical libraries. pyruvate kinase affinity data 582 Active and 582 Inactive. Dragon software produces 1500 property descriptors for each molecule, about 1100 were used. ten-fold cross-validation Compared to Feed Forward Neural Networks and Support Vector Machines by splitting the data into ten train/test segments. best likelihood model ten-fold validation T + T − Sensitivity = Specificity = T + + F − T − + F + molecules of Eduliss according to BCarts Bims: Bayesian inference of model structure Released in 2016 as an easily installable SWI-Prolog library Includes (IJAR paper in 2017) I priors and likelihoods for: CARTs and Bayesian networks I hooks for user defined models Probabilistic logic programming I thesis: probabilistic finite domains I PLP workshop and IJAR associated issues (5th edition) knowledge-based computation biology I graphical models (focal adhesion dynamics, NKI, 2011-3) I proteomics functional analysis (TKSilac,KSR1,ATG9A, Imperial, 2014-5) I mutational profiling (14MG, Sanger, 2016-8) Graphical models of FAD Graphical models (aka Bayesian networks) can provide a network view of dependencies among variables, capturing much richer information than pairwise correlations. In this project, microscopy based variables characterising focal adhesion in time are connected for a number of conditions in the HGF pathway. tkSilac: tyrosine kinase screen I MCF7 cell line I 33 SILAC runs I 65/66 expressed tyrosine kinases I 4739 quantified in some experiment I 1000 quantified in 60 or more TK KO Figure 2 Color Key 0.5 2 Value PTK2B NTRK2 RET PTK7 TEC SYK TNK1 ROR1 JAK2 KDR YES1 TYK2 LCK JAK1 PTK2 TNK2 STYK1 LYN FES EPHB6 FYN EPHA3 FLT3 DDR1 CSK EPHA7 ERBB3 EPHB2 FGFR1 BTK EPHA6 ABL2 CSF1R ERBB2 EPHB3 EPHA4 AXL ABL1 EPHA2 RYK MST1R LMTK2 NTRK1 NTRK3 EGFR EPHA1 MET HCK SRC PDGFRB PTK6 MERTK IGF1R INSR EPHB1 EPHB4 FRK FGFR2 FLT1 TYRO3 ROR2 ZAP70 MATK ERBB4 LMTK3 Fig. 2. Heatmap of quantified proteins after TK silencing. The overall pattern of regulation is shown in the heat- map of quantified values. After normalized to siControl, values of fold changes are all above 0, with value 1 show- ing that the expression levels of the specific protein are not altered after silencing TKs. For each knockdown (rows) the quantified value for an identified protein is plotted in red for down regulated proteins (below 1), white for non-differential and non-identified and blue for up-regulated proteins (above 1). The row labels indicate the knock out experiment and the colors correspond to the clusters described below. Figure 4 EPHA1 HCK MET EGFR A NTRK3 C HEATR1 NTRK1 XPOT LDHB 1 LMTK2 SRC PDGFR PTK6 MUC5B MST1R MERTK ASNS IGF1R NUP188 INSR 0 EPHB1 ASS1 PRKDC RYK EPHB4 MROH1 EPHA2 GYG1 FRK -1 FGFR2 ASMTL ABL1 HUWE1 FLT1 TUBA1C VDAC1 -2 AXL TYRO3 RNF213 EPHA4 LTN1 MON2 ROR2 USP32 EPHB3 ZAP70 TXNRD1 PHGDH MATK GEMIN4 ERBB2 GALNT7 ERBB4 MC1R CSF1R KRT18 LMTK3 CAD HEATR6 ABL2 MYBBP1A PTK2B SLC7A1 ME1 EPHA6 NTRK2 PREX1 STAT1 BTK RET PYGB PTK7 BCAS1 FGFR1 LASP1 TEC LCP1 FTH1 EPHB2 SYK AGR2 TNK1 MVP ERBB3 SLC12A2 ROR1 BASP1 EPHA7CSK JAK2 AKR1C2 KDR CDH1 YES1 BLOC1S5 DDR1 TYK2 ADAM10 LCK PODXL JAK1 PTK2 FLT3 TNK2 PBXIP1 ERMP1 FYN EPHA3 FLNB FES LYN SLC38A1 EPHB6 MUC1 CA2 STYK1 GGH NCAM2 GSTM3 B D CYB5R1 FBP1 Clusters No. TKs 100 GREB1 Up FARP1 ABL1, AXL, EPHA2, EPHA4, LMTK2, GUSB 1 8 Down MST1R, NTRK1, RYK GLA GFRA1 ABL2, BTK, CSF1R, CSK, DDR1, EPHA3, 75 MYOF 2 15 EPHA6, EPHA7, EPHB2, EPHB6, ERBB3, LXN FES, FGFR1, FLT3, FYN CELSR2 3 3 EGFR, EPHA1, NTRK3 SDC1 50 FREM2 4 6 EPHB1, EPHB4, FGFR2, FLT1, FRK, INSR NRCAM PGR 5 2 EPHB3, ERBB2 Counts CA12 6 6 ERBB4, LMTK3, MATK, ROR2, TYRO3, ZAP70 EPB41L1 25 7 7 HCK, IGF1R, MERTK, MET, PDGFRB, PTK6, SRC TMEM164 CD44 8 3 JAK1, LCK, PTK2 JAK2, KDR, NTRK2, PTK2B, PTK7, RET, ABL1 AXL EPHA2 EPHA4 LMTK2 MST1R NTRK1 R 9 12 YK ROR1, SYK, TEC, TNK1, TYK2, YES1 0 10 3 LYN, STYK1, TNK2 1 2 3 4 5 6 7 8 9 10 Clusters Fig. 4. Hierarchical clustering of the 65 TKs expressed in MCF7 cells. A, Hierarchical clustering of the 65 TKs was performed using R's hclust function. The complete linkage method which aims to identify similar clusters based on overall cluster measure was used. 10 distinctive clusters were obtained and the complete dendrogram is shown with the labels colored for these clusters. B, Full list of the TKs included in each cluster . The color coding of the clusters is used throughout to identify the analysis relevant to the corresponding clusters. C, Heatmap of the proteomic quantifications (log2 values of normalized fold changes against control) for the downstream effects (significantly up- or down-regulated proteins, Significant B test p< 0.05) after silencing TKs in cluster 1. D, Number of proteins significantly up or down-regulated in each identified cluster. x-axis shows 10 different clusters and y-axis indicates the counts. Figure 5 A B 10 10 Cell communication 9 Transporter activity 9 8 8 7 7 6 6 5 5 Cell cycle 4 Structural molecule activity 4 3 3 2 2 Reproduction 1 Receptor activity 1 Metabolic process Protein binding TF activity Transport Molecular transducer activity Immune system Enzyme regulator activity Growth Chemoattractant activity Development Catalytic activity Apoptosis Binding Cell adhesion Antioxidant activity 0 10 20 30 40 50 0 20 40 60 80 % % Fig. 5. Characterization of a functional portrait for each cluster. A, A functional profile of top GO biologic processes that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO biologic process term. The color coding and the number for each cluster are indi- cated as above. B, A functional profile of top GO molecular functions that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO molecular function term. Figure 6 cluster 1 development cluster 2 response to cytokine LCP1 GSTM3 NCAM2 PRKCD RPS27A PRKDC ● CDH1 VDAC1 NRCAM NUP210 AKR1C2 ● CD44 FLNB STAT1 AGR2 PXDN ASS1 KRT18 MUC1 ● ● KRT18 PGR PREX1 ● ● ● KPNA2 FREM2 ● SLC12A2 ACSL1 FARP1 ● ASS1 HUWE1 GFRA1 CAD ● KRT18 MAP4 ASNS SDC1 CELSR2 ● LMNA ● ACTN1 PODXL ● FLNB BASP1 ● ● ACTN4 EPB41L1 cytoskeleton organization CD44 ● PBXIP1 PRKCD MUC5B PHGDH MC1R SPTBN2 FLNB ● NUMA1 ADAM10 CA2 PREX1 ● MYBBP1A ● TXNRD1 BLOC1S5 ● cluster 3 SLC38A2 SLC3A2 cluster 4 immune system process ACTN4 ASS1 ANXA6 ion transmembrane transport SLC3A2 SLC7A5 EIF2AK2 DDX60 SLC7A2 APOB ● ISG15 SLC7A1 ● IFIT1 IFIT2 LASP1 SLC6A14 IFIT3 PRKDC FTH1 STAT2 ● DDX58 TAP2 response to hormone IRS1 PSME1 IFI35 PARP9 ITGA2 ● SDC1 ● MIR7703 MX1 CA2 GBA TAPBP OAS1 ● OAS3 ● ● ● OAS2 HERC6 RBM14 ● TAP1 ● AKR1C2 PCK2 SAMHD1 ● NFKB2 ADAR ● ● STAT1 STAT1 IFIH1 MUC5B CAD ANXA3 ● TRIM25 ● PML CD44 cluster 8 DNA replication RPL22 ● ● CRIP2 MCM7 FANCD2 MCM2 MCM4 DST GPR56 L1CAM ● ● ● ITGA5 ● SUPT16H NRP1 CTNNB1 SSRP1 SLC7A5 CAV1 BLVRA FBP1 SMC4