Logic Programming for Big Data in Computational Biology

Nicos Angelopoulos

Wellcome Sanger Institute Hinxton, Cambridge

[email protected]

18.9.18 overview

I knowledge for Bayesian machine learning over model structure I applied knowledge representation for biological data analytics Bayesian inference of model structure (Bims)

A Bayesian machine learning system that can model prior knowledge by means of a probabilistic logic programming. Nonmeclature I DLPs = Distributional logic programs I Bims = Bayesian inference of model structure Timeline I Theory (York, 2000-5) I Applications (Edinburgh, 2006-8, IAH 2009, NKI 2013) I Bims library and theory paper 2015-2017 Bims Overview

I syntax of DLPs

I a succinct classification tree prior program

I Bayesian learning of model structure

I learning classification and regression trees

I Bayesian learning of Bayesian networks

I thebims library DLPs- description

We extend LP’s clausal syntax with probabilistic guards that associate a resolution step using a particular clause with a probability whose value is computed on-the-fly. The intuition is that this value can be used as the probability with which the clause is selected for resolution. Thus in addition to the logical relation, a clause defines over the objects that appear as arguments in its head, it also defines a probability distribution over aspects of this relation. DLPs example member(H, [H| T ]). member(El, [ H|T ]) :− (C1) member(El, T ).

L :: length(List, L) ∼ El :: umember(El, List)(G1)

1 :: L :: umember(El, [El|Tail]). (C ) L 2 1 1 − :: L :: umember(El, [H|Tail]) :− (C ) L 3 umember(El, Tail). DLPs probabilistic goals

1 :: L :: umember(El, [El|Tail]). (C ) L 4 1 1 − :: L :: umember(El, [H|Tail]) :− (C ) L 5 K is L − 1, K :: umember(El, Tail). DLPs query

? − umember(X , [a, b, c]).

X = a (1/3 of the times = 1/3); X = b (1/3 of the times = 2/3 ∗ 1/2); X = c (1/3 of the times = 2/3 ∗ 1/2 ∗ 1). simple tree prior ?- cart( ζ, ξ, A,M).

M=nd(x2,1,nd(x1,0,lf,lf),lf) (C0) cart(ζ, ξ, M, Cart): − ψ0 is ζ, ψ0: split(0, ζ, ξ, M, Cart).

(C1) ψD : split(D, ζ, ξ, MB , nd(F , Val, L, R)) : − −ξ ψD+1 is ζ ∗ (1 + D) , D1 is D + 1, r select(F , Val, MB , LB , RB ), ψD+1: split(D1, ζ, ξ, LB , L), ψD+1: split(D1, ζ, ξ, RB , R).

(C2) 1 − ψD : split(D, ζ, ξ, MB , lf ). Bims theory

Bayes’ Theorem

p(D|M)p(M) p(M|D) = P M p(D|M)p(M)

Metropolis-Hastings   q(M∗, Mi )P(D|M∗)P(M∗) α(Mi , M∗) = min , 1 q(Mi , M∗)P(D|Mi )P(Mi ) DLP defined model space

From Mi identify Gi then sample forward to M?. q(Mi , M?) is the probability of proposing M? when Mi is the current model. Pyruvate interactors

objective improve chances of discovering binding molecules based on examples from screened chemical libraries. pyruvate kinase affinity data 582 Active and 582 Inactive. Dragon software produces 1500 property descriptors for each molecule, about 1100 were used. ten-fold cross-validation Compared to Feed Forward Neural Networks and Support Vector Machines by splitting the data into ten train/test segments. best likelihood model ten-fold validation

T + T − Sensitivity = Specificity = T + + F − T − + F + molecules of Eduliss according to BCarts Bims: Bayesian inference of model structure

Released in 2016 as an easily installable SWI-Prolog library Includes (IJAR paper in 2017) I priors and likelihoods for: CARTs and Bayesian networks I hooks for user defined models

Probabilistic logic programming I thesis: probabilistic finite domains I PLP workshop and IJAR associated issues (5th edition) knowledge-based computation biology

I graphical models (focal adhesion dynamics, NKI, 2011-3) I proteomics functional analysis (TKSilac,KSR1,ATG9A, Imperial, 2014-5) I mutational profiling (14MG, Sanger, 2016-8) Graphical models of FAD

Graphical models (aka Bayesian networks) can provide a network view of dependencies among variables, capturing much richer information than pairwise correlations. In this project, microscopy based variables characterising focal adhesion in time are connected for a number of conditions in the HGF pathway. tkSilac: kinase screen

I MCF7 line I 33 SILAC runs I 65/66 expressed tyrosine

I 4739 quantified in some experiment I 1000 quantified in 60 or more TK KO Figure 2

Color Key

0.5 2 Value

PTK2B NTRK2 RET PTK7 TEC SYK TNK1 ROR1 JAK2 KDR YES1 TYK2 JAK1 PTK2 TNK2 STYK1 LYN FES EPHB6 FYN EPHA3 FLT3 DDR1 CSK EPHA7 ERBB3 EPHB2 FGFR1 BTK EPHA6 ABL2 CSF1R ERBB2 EPHB3 EPHA4 AXL ABL1 EPHA2 RYK MST1R LMTK2 NTRK1 NTRK3 EGFR EPHA1 MET HCK SRC PDGFRB PTK6 MERTK IGF1R INSR EPHB1 EPHB4 FRK FGFR2 FLT1 TYRO3 ROR2 ZAP70 MATK ERBB4 LMTK3

Fig. 2. Heatmap of quantified after TK silencing. The overall pattern of regulation is shown in the heat- map of quantified values. After normalized to siControl, values of fold changes are all above 0, with value 1 show- ing that the expression levels of the specific are not altered after silencing TKs. For each knockdown (rows) the quantified value for an identified protein is plotted in red for down regulated proteins (below 1), white for non-differential and non-identified and blue for up-regulated proteins (above 1). The row labels indicate the knock out experiment and the colors correspond to the clusters described below. Figure 4

EPHA1

HCK MET EGFR A NTRK3 C HEATR1 NTRK1 XPOT LDHB 1

LMTK2 SRC PDGFR PTK6 MUC5B MST1R MERTK ASNS IGF1R NUP188 INSR 0 EPHB1 ASS1 PRKDC RYK EPHB4 MROH1 EPHA2 GYG1 FRK -1 FGFR2 ASMTL ABL1 HUWE1 FLT1 TUBA1C VDAC1 -2 AXL TYRO3 RNF213 EPHA4 LTN1 MON2 ROR2 USP32 EPHB3 ZAP70 TXNRD1 PHGDH MATK GEMIN4 ERBB2 GALNT7 ERBB4 MC1R CSF1R KRT18 LMTK3 CAD HEATR6 ABL2 MYBBP1A PTK2B SLC7A1 ME1 EPHA6 NTRK2 PREX1 STAT1 BTK RET PYGB PTK7 BCAS1 FGFR1 LASP1 TEC LCP1 FTH1 EPHB2 SYK AGR2 TNK1 MVP ERBB3 SLC12A2 ROR1 BASP1 EPHA7CSK JAK2 AKR1C2 KDR CDH1 YES1 BLOC1S5 DDR1 TYK2 ADAM10 LCK PODXL JAK1 PTK2 FLT3 TNK2 PBXIP1 ERMP1 FYN EPHA3 FLNB FES

LYN SLC38A1 EPHB6 MUC1 CA2 STYK1 GGH NCAM2 GSTM3 B D CYB5R1 FBP1 Clusters No. TKs 100 GREB1 Up FARP1 ABL1, AXL, EPHA2, EPHA4, LMTK2, GUSB 1 8 Down MST1R, NTRK1, RYK GLA GFRA1 ABL2, BTK, CSF1R, CSK, DDR1, EPHA3, 75 MYOF 2 15 EPHA6, EPHA7, EPHB2, EPHB6, ERBB3, LXN FES, FGFR1, FLT3, FYN CELSR2 3 3 EGFR, EPHA1, NTRK3 SDC1 50 FREM2 4 6 EPHB1, EPHB4, FGFR2, FLT1, FRK, INSR NRCAM PGR

5 2 EPHB3, ERBB2 Counts CA12 6 6 ERBB4, LMTK3, MATK, ROR2, TYRO3, ZAP70 EPB41L1 25 7 7 HCK, IGF1R, MERTK, MET, PDGFRB, PTK6, SRC TMEM164 CD44 8 3 JAK1, LCK, PTK2

JAK2, KDR, NTRK2, PTK2B, PTK7, RET, ABL1 AXL EPHA2 EPHA4 LMTK2 MST1R NTRK1 R

9 12 YK ROR1, SYK, TEC, TNK1, TYK2, YES1 0 10 3 LYN, STYK1, TNK2 1 2 3 4 5 6 7 8 9 10 Clusters

Fig. 4. Hierarchical clustering of the 65 TKs expressed in MCF7 cells. A, Hierarchical clustering of the 65 TKs was performed using R's hclust function. The complete linkage method which aims to identify similar clusters based on overall cluster measure was used. 10 distinctive clusters were obtained and the complete dendrogram is shown with the labels colored for these clusters. B, Full list of the TKs included in each cluster . The color coding of the clusters is used throughout to identify the analysis relevant to the corresponding clusters. C, Heatmap of the proteomic quantifications (log2 values of normalized fold changes against control) for the downstream effects (significantly up- or down-regulated proteins, Significant B test p< 0.05) after silencing TKs in cluster 1. D, Number of proteins significantly up or down-regulated in each identified cluster. x-axis shows 10 different clusters and y-axis indicates the counts. Figure 5 A B

10 10 Cell communication 9 Transporter activity 9 8 8 7 7 6 6 5 5 Cell cycle 4 Structural molecule activity 4 3 3 2 2 Reproduction 1 Receptor activity 1

Metabolic process Protein binding TF activity

Transport Molecular transducer activity

Immune system regulator activity

Growth Chemoattractant activity

Development Catalytic activity

Apoptosis Binding

Cell adhesion Antioxidant activity

0 10 20 30 40 50 0 20 40 60 80 % %

Fig. 5. Characterization of a functional portrait for each cluster. A, A functional profile of top GO biologic processes that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO biologic process term. The color coding and the number for each cluster are indi- cated as above. B, A functional profile of top GO molecular functions that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO molecular function term. Figure 6

cluster 1 development cluster 2 response to

LCP1 GSTM3 NCAM2 PRKCD RPS27A PRKDC ● CDH1 VDAC1 NRCAM NUP210 AKR1C2 ● CD44 FLNB STAT1 AGR2 PXDN ASS1 KRT18 MUC1 ● ● KRT18 PGR PREX1 ● ● ● KPNA2 FREM2 ● SLC12A2 ACSL1 FARP1 ● ASS1 HUWE1 GFRA1 CAD ● KRT18 MAP4 ASNS SDC1 CELSR2 ● LMNA ● ACTN1 PODXL ● FLNB BASP1 ● ● ACTN4 EPB41L1 cytoskeleton organization CD44 ● PBXIP1 PRKCD MUC5B PHGDH MC1R SPTBN2 FLNB ● NUMA1 ADAM10 CA2 PREX1 ● MYBBP1A ● TXNRD1 BLOC1S5 ● cluster 3 SLC38A2 SLC3A2 cluster 4 process ACTN4 ASS1 ANXA6 ion transmembrane transport SLC3A2 SLC7A5 EIF2AK2 DDX60 SLC7A2 APOB ● ISG15 SLC7A1 ● IFIT1 IFIT2 LASP1 SLC6A14 IFIT3 PRKDC FTH1 STAT2 ● DDX58 TAP2 response to hormone IRS1 PSME1 IFI35 PARP9 ITGA2 ● SDC1 ● MIR7703 MX1 CA2 GBA TAPBP OAS1 ● OAS3 ● ● ● OAS2 HERC6 RBM14 ● TAP1 ● AKR1C2 PCK2 SAMHD1 ● NFKB2 ADAR ● ● STAT1 STAT1 IFIH1 MUC5B CAD ANXA3 ● TRIM25 ● PML CD44 cluster 8 DNA replication RPL22 ● ● CRIP2 MCM7 FANCD2 MCM2 MCM4 DST GPR56 L1CAM ● ● ● ITGA5 ● SUPT16H NRP1 CTNNB1 SSRP1 SLC7A5 CAV1 BLVRA FBP1 SMC4 SDC1 ● ● G6PD ● HMGB1 LGALS3 CELSR1 ● ● NCAPD2 DNMT1 MYH14 KIAA1324 LRPPRC ● HMGB2 EPHX1 MCM6 MCM5 SCARB1 CD63 SLC3A2 PNPT1 ● CD44 PXDN ● ● DUT SMC2 KPNA2 GSTM3 ● ● AGRN HSP90B1 ● POLD1 ● ACSL1 ● MVP NUP210 PCNA cluster 6 cell motility ● RAN H1FX LRPPRC ● ● FEN1 ● MCM3 cluster 5 catabolic process PNPT1 RNA transport ● ● cluster 10 cell death cluster 9 protein localiztion AKR1C2 CA12 GBA SLC7A5 CERS2 BAG6 ASNS VDAC2 ● MON2 ● SORT1 GPD2 PHGDH AGRN TUBB4A XPO1 ● ● HK1 ● CELSR2 SLC25A24 ● TUBA1C CAD FLNB ● ● CD44 NQO1 SLC9A1 ● CYB5R3 LCP1 SLC9A3R1 MYH14 ACTN4 TFRC ● ● ● CDH1 GLA PYGB ● ASS1 INPP4B ARMCX3 ● ● SORD ERO1L SYNE2 MVP MUC1 ACTN4 ● DDX20 ● ● ● APRT TUBB8 SLC5A6 JUP ● ● XPOT TUBA1A ● PRDX2 AHSA1 ● PDCD6IP VDAC1 PRDX3 ● PRKDC ACSL1 PFKL MLPH SLC9A3R1 ● ENO1 SLC7A2 SLC44A2 ● BAG6 ● RPS27A ● MGEA5 ● MYH9 cluster 7 metabolic process ●

Fig. 6. Representatives of defined functional networks in each classified TK cluster. The functional networks were generated using GO analysis combined with the STRING platform. Proteins in lighter color are up-regulated, whereas brighter color indicates down-regulation. Arrows show the interactions between connected proteins. Rep- resentative defined functional networks associated with their clusters are shown here. The color coding and the number for each cluster are indicated as above. volcano plot (BT474HR H/M)

● < 0.001 ● ● < 0.01 ● < 0.05 ● >0.05

10 ● VCP ●

●●

● ● ● ● ● ● ● ● ●● SFN ● ● ● ● ●● ● ● ● ● ● ● 9 ● ● ● ● ● MCM6 ● ● ● ● ●● S100A6 ● ● ● ● STOML2 ● ●● MCM3 ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ●● ● ● ● PEG10 ● ● ● CLEC1B ● ●● ● ● KATNAL2 ●● ●● ● ● ● ● ● LGALS1 ● ● ●●● ●●●● ● ● ●● ● ● ● GSTM3 ● ● ● ● ● ● ● LDHB ● ● ● ● ● ● ● ● ● ●●● ● ● ●●●● ● ●● ● BLMH ● ● ●●●● ● ● ●● ● ● ● ● ●● ● ● ●● CRIP1 ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● STK3 ● ● ●

8 ● ● EADS2● ● ●●●●● ● ● ●●● ● ● ● ●●● ● ●●● ● ● SCAP SEMA4C ●● ● ●● ● ● ● ● ●● ● ●● ● ● ● DCTN5 ● ● ● ●●● ● ●● ● KYNU ● ● ● ● ● ● ● ● ●●●●●● ● ●● RP1 ● ● ● ● ● ●● ● VIM ● ● ●● ●● ●● ● ●●● ●● ● ● ● ● ● ● ●● ● YBX2 BSG ●● ●● ●●●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● SH3BGRL SYF2 ●● ●●●● ● ● ● ● ● ●● ● ● ● ● ● ●●●● ●● ●●● ●●● SSNA1 ● ●●●●● ● ● ● ● ●● ● ●● ● PEX3 ● ● ●● ● ● ●●● ● DHX37 ● ● ● ● ● ● ● CNOT8 ●●●● ●●● ●● ● ● ● ● ● ●●●● ● ● ● ● ● ● PGM5 ● ● ●●●● ●● ● ●●● ●● ● TRIP6 ● ● ●● ● ● ● ● ●● ●● ● ● ● GBP2 7 ● ●● ● ● ● ● ●●● ● ATG9A ● ● ● ● ● ●●●● ● ●● CA5B ●●● ●● ● ● ●● ● ● ● ● ● ● ● ●● ●

Log10 Intensity (R10K8/R0604) ● ● ● ● ● ● ● ● ● ● ● ● ● ANO1 ●●●● ● ● ● IGFBP3 ● ● ● ●● ● SLC6A14 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● RGN ● ● ● ● ● ● ● ● 6 ●● ● ●

−4 −2 0 2 4

Log2 Ratio Total Protein (R10K8/R0604) Autophagy

RAB7A ●

WIPI2 DNAJB1 ● ● CD46 ●

●ATG9A ST13 ● TBK1 ●

BAG3 ● HSPB8 ● ● ATG16L1 ● PRKCD ● CALCOCO2 NFKB1 ● ● ● MAPK3 ITGA6 HDAC6 ● NBR1● ITGB4 ● RELA ATG4B ● SQSTM1 ● ●

GABARAPL2● ● ERBB2 ULK1 ● ● ● ● MAP1LC3B ATG7 CHMP2B ● CASP3 CASP8 ● ● ● TSC2 ATG3 ● ● WDFY3 BNIP1

CDKN1B ● ● ● RB1 GAA ● ● BAG1 ● MTOR PEX3 ● ● SERPINA1 TSC1 FOXO3 ● ● TP53 CANX ● Myelodysplastic syndrom, NGS somatic mutations profiling

AUC (1y) for Clinical vs Lasso_min model

0.0037 Auc (1y) 1.0 0.679351 0.78364

0.9 Harrel's C 0.634512

● ● 0.712633

0.8 ● ● ● ● ● ● ● ● ● R square ● 0.179789 AUC (1y) AUC

● 0.415647 ● 0.7

● ● ● 0.6

● 0.5

Clinical Lasso_min 5 year: Clinical vs Lasso (Optimal)

Clinical Lasso_min

1.0 1.0

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0.0 0.0

0 1 2 3 4 5 0 1 2 3 4 5 myeloma structural variations

t_4_14

t_14_16 t_11_14

HDR

DIS3 delTRAF3

del13q14

del17p13 NRAS gain1q21 delCYLD Fisher test odds Mut.excl (shown odds=0.25) TP53 KRAS delFAM46C Co-occur (shown odds=4) # events (shown med=225,max=643) logic programming for (biological) data analytics Positives I interpreted I memory management I clean and high level I probabilistic ML & reasoning (Prism,Bims,Pepl) I intuitive database integration (db facts,bio db) I multi-threaded and web-capable I talking to other systems (R:Real,ODBC,proSQLite) I (largely) OS independence Negatives I graphics I SWI-Prolog, at core a one-person project I code sharing in toddler stage (but showing promise) I in-browser interaction with other technologies symbolic AI education, can be a central player in contributing tangibly to the current AI resurgence, while managing expectations of modern AI see media coverage of Facebook/Cambridge-Analytica & Uber/Tesla driveless accidents biology presents a unique application area, where unprecedented volumes of data are generated knowledge is a crucial concept, currently being shaped transferable to other big data areas

KR bottom line

(probabilistic) logic programming and Bayesian networks are powerful tools for

explainable, accountable, open and shareable AI & ML biology presents a unique application area, where unprecedented volumes of data are generated knowledge is a crucial concept, currently being shaped transferable to other big data areas

KR bottom line

(probabilistic) logic programming and Bayesian networks are powerful tools for

explainable, accountable, open and shareable AI & ML

symbolic AI education, can be a central player in contributing tangibly to the current AI resurgence, while managing expectations of modern AI see media coverage of Facebook/Cambridge-Analytica & Uber/Tesla driveless accidents KR bottom line

(probabilistic) logic programming and Bayesian networks are powerful tools for

explainable, accountable, open and shareable AI & ML

symbolic AI education, can be a central player in contributing tangibly to the current AI resurgence, while managing expectations of modern AI see media coverage of Facebook/Cambridge-Analytica & Uber/Tesla driveless accidents biology presents a unique application area, where unprecedented volumes of data are generated knowledge is a crucial concept, currently being shaped transferable to other big data areas