Graph Representation Learning for Drug Discovery
Graph Representation Learning for Drug Discovery
Jian Tang
Mila-Quebec AI Institute, CIFAR AI Research Chair, HEC Montreal
www.jian-tang.com

The Process of Drug Discovery
• A very long and costly process: on average it takes more than 10 years and $2.5B to get a drug approved
• Pipeline: Target → Lead Discovery (2 years) → Lead Optimization (3 years) → Preclinical Study (2 years) → Clinical Trial
• Lead Discovery: screen millions of molecules; some were found by serendipity, e.g. Penicillin
• Lead Optimization: modify the molecule to improve specific properties, e.g. toxicity, SA (synthetic accessibility)
• Preclinical Study: in-vitro and in-vivo experiments
• Clinical Trial: multiple phases

Research Problems
• Molecule Property Prediction
• Molecule Design and Optimization
• Retrosynthesis Prediction

Molecule Property Prediction
• Predicting the properties of molecules or compounds is a fundamental problem in drug discovery, e.g. in the stage of virtual screening
• Each molecule is represented as a graph
• The fundamental problem: how to represent a whole molecule (graph)

Graph Neural Networks
• Techniques for learning node/graph representations
• Graph Convolutional Networks (Kipf et al. 2016)
• Graph Attention Networks (Veličković et al. 2017)
• Neural Message Passing (Gilmer et al. 2017)

For a node v with neighbors w ∈ N(v) at layer k:
MESSAGE PASSING: M_k(h_v^k, h_w^k, e_vw)
AGGREGATE: m_v^{k+1} = AGGREGATE_k({ M_k(h_v^k, h_w^k, e_vw) : w ∈ N(v) })
COMBINE: h_v^{k+1} = COMBINE(h_v^k, m_v^{k+1})
READOUT: g = READOUT({ h_v^K : v ∈ G })

InfoGraph: Unsupervised and Semi-supervised Whole-Graph Representation Learning (Sun et al. ICLR'20)
• Supervised methods based on graph neural networks require a large amount of labeled data for training
• Labeled data are very limited in drug discovery, while a large amount of unlabeled data (molecules) is available
• This work: how to effectively learn whole-graph representations in an unsupervised or semi-supervised fashion

Fanyun Sun, Jordan Hoffman, Vikas Verma and Jian Tang.
InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. ICLR'20.

Figure 1: Illustration of InfoGraph. An input graph is encoded into a feature map by graph convolutions and jumping concatenation. The discriminator takes a (global representation, patch representation) pair as input and decides whether they are from the same graph. InfoGraph uses a batch-wise fashion to generate all possible positive and negative samples. For example, consider the toy example with 2 input graphs in the batch and 7 nodes (or patch representations) in total. For the global representation of the blue graph, there will be 7 input pairs to the discriminator, and the same for the red graph. Thus, the discriminator will take 14 (global representation, patch representation) pairs as input in this case.

3.1 Problem Definition

Unsupervised Graph Representation Learning.
Given a set of graphs G = {G_1, G_2, ...} and a positive integer δ (the expected embedding size), our goal is to learn a δ-dimensional distributed representation of every graph G_i ∈ G. We denote the number of nodes in G_i as |G_i|, and the matrix of representations of all graphs as Φ ∈ R^(|G|×δ).

Semi-supervised Graph Prediction Tasks. Given a set of labeled graphs G^L = {G_1, ..., G_|G^L|} with corresponding output {o_1, ..., o_|G^L|}, and a set of unlabeled samples G^U = {G_|G^L|+1, ..., G_|G^L|+|G^U|}, our goal is to learn a model
Note that in most cases |G | |G |.L L L U red graph. Thus, the discriminator will take 14 (global representation,output { o1,patch··· ,o|representation)G | } , and aset of pairsunlabeledas inputsamplesin thisG case.= { G|G |+1, ··· ,G|G |+ |G | } , our goal isto learn amodel 3.1 Pr oblem Definition that can make predictions for unseen graphs. Note that in most cases |GU | |GL |. 3.2 InfoGraph Unsuper vised Graph Representation Learning. Given a set of graphs G = { G ,G ,...} and a positive integer δ 3.1 Pr oblem Definition 3.2 InfoGraph 1 2 (theWeexpectedfocus onembeddinggraph neuralsize),netwourorksgoal(GNNs)—aisto learnfleaxibleδ-dimensionalclass of embeddingdistributedarchitecturesrepresentationwhichofgenerateevery graphnode Gi 2 G. representations by repeated aggregation over local node neighborhoods. Therepresentations of nodes are learned by |G|⇥δ We denote theFigurenumber1: Illustrationof nodesofinInfoGraph.Gi as |Gi |.AnWeinputdenotegraphtheisWmatrixeencodedfocusofonintorepresentationsgraphafeatureneuralmapnetwofbyallorksgraphgraphs(GNNs)—aconasvolutionsΦ 2 fleRxibleand jumping.class of embedding architectures which generate node UnsuperaggregatingvisedtheGraphfeaturesReprof theesentationir neighborhoodLearningnodes,. Gisovween refera settoofthesegraphsas patchG = representations.{ G1,G2,...} andGNNsa positiutilizeveainteger δ (the expectedconcatenation.embedding Thesize),discriminatorour goal istaktoeslearna(globalaδ-dimensionalrepresentation,representationsdistribpatchutedbyL representation)repeatedrepresentationaggrepairgationofas oeinputvererylocalandgraphdecidesnodeGneighborhoods.2whetherG. Therepresentations of nodes are learned by Semi-superREADOUTviedfunctiGraphon to summariPr edictionze allTtheasksobtained. Givenpatcaseth representationsof labeled graphsinto aGfixed=length{ G1,graph-le··· ,G|vGelL |representation.} with correspondingi they are from the same graph. 
aggregating the features of their neighborhood nodes, so we refer to these as patch representations. GNNs utilize a READOUT function to summarize all the obtained patch representations into a fixed-length graph-level representation. Formally, the k-th layer of a GNN is

h_v^(k) = COMBINE^(k)( h_v^(k-1), AGGREGATE^(k)( { (h_v^(k-1), h_u^(k-1), e_uv) : u ∈ N(v) } ) ),   (1)

where h_v^(k) is the feature vector of node v at the k-th iteration/layer (or the patch representation centered at node v), e_uv is the feature vector of the edge between u and v, and N(v) is the neighborhood of node v. h_v^(0) is often initialized as node
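The layer of Eq. (1) and the READOUT step can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the sum aggregator, the ReLU combine with weight matrices `W_self` and `W_neigh`, and the mean readout are all assumed choices, and the edge features e_uv are omitted for brevity.

```python
import numpy as np

def gnn_layer(h, edges, W_self, W_neigh):
    """One message-passing layer (sketch of Eq. (1)):
    AGGREGATE = sum of neighbor features, COMBINE = ReLU(h W_self + agg W_neigh).
    h: (num_nodes, d) patch representations; edges: undirected (u, v) pairs."""
    agg = np.zeros_like(h)
    for u, v in edges:          # AGGREGATE over the neighborhood N(v)
        agg[v] += h[u]          # message from u to v
        agg[u] += h[v]          # message from v to u (undirected graph)
    return np.maximum(0.0, h @ W_self + agg @ W_neigh)  # COMBINE

def readout(h):
    """READOUT: mean-pool all patch representations into one graph-level vector."""
    return h.mean(axis=0)

# Toy molecule-like graph: a path of 4 nodes with 2-dimensional features.
rng = np.random.default_rng(0)
h0 = rng.normal(size=(4, 2))                 # h_v^(0): initial node features
edges = [(0, 1), (1, 2), (2, 3)]
W_self, W_neigh = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))

h1 = gnn_layer(h0, edges, W_self, W_neigh)   # patch representations after one layer
g = readout(h1)                              # whole-graph representation
print(g.shape)  # prints (2,)
```

Stacking K such layers and applying the readout to the final h^(K) yields the fixed-length graph-level representation that InfoGraph contrasts against the individual patch representations.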