Biological Network Analysis: Graph Mining in Bioinformatics Karsten Borgwardt

Interdepartmental Bioinformatics Group, MPIs Tübingen
With permission from Xifeng Yan and Xianghong Jasmine Zhou

Mining coherent dense subgraphs across massive biological networks for functional discovery
H. Hu (1), X. Yan (2), Y. Huang (1), J. Han (2), and X. J. Zhou (1)
(1) University of Southern California
(2) University of Illinois at Urbana-Champaign

Biological Networks
• Protein-protein interaction network
• Metabolic network
• Transcriptional regulatory network
• Co-expression network
• Genetic interaction network
• …

Data Mining Across Multiple Networks
[Figure: a collection of networks over the nodes a to k.]

Identify frequent co-expression clusters across multiple microarray data sets
[Figure: microarray expression matrices (genes g1, g2, … by conditions c1, c2, …, cm) shown next to networks over the nodes a to k.]

Frequent Subgraph Mining
The problem is hard!
Problem formulation: Given n graphs, identify subgraphs that occur in at least m graphs (m ≤ n).
Efficient modeling of biological networks: each gene occurs once and only once in a graph. That means the edge labels are unique.

The common pattern-growth approach
Find a frequent subgraph of k edges, and expand it to k+1 edges to check its occurrence frequency.
– Koyuturk M., Grama A. & Szpankowski W. An efficient algorithm for detecting frequent subgraphs in biological networks. ISMB 2004
– Yan, Zhou, and Han. Mining Closed Relational Graphs with Connectivity Constraints. ICDE 2005

Problem of the pattern-growth approach
The time and memory requirements increase exponentially with the size of the patterns and the number of networks.
The number of frequent dense subgraphs explodes when there are very large frequent dense subgraphs, e.g., subgraphs with hundreds of edges.
[Figure: pattern expansion from k to k+1 edges across the networks.]

Our solution
We develop a novel algorithm, called CODENSE, to mine frequent coherent dense subgraphs. The target subgraphs have three characteristics:
(1) All edges occur in ≥ k graphs. (frequency)
(2) All edges exhibit correlated occurrences in the given graph set. (coherency)
(3) The subgraph is dense: its density d = 2m / (n(n-1)) is higher than a threshold γ, where m is the number of edges and n the number of nodes of the subgraph. (density)

CODENSE: Mine coherent dense subgraph
[Figure: the six input graphs G1 to G6 are combined into the summary graph Ĝ.]

CODENSE: Mine coherent dense subgraph (Step 2)
[Figure: MODES is applied to the summary graph Ĝ, yielding the dense subgraph Sub(Ĝ).]
Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the converse is not true.
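The summary graph, the density formula d = 2m / (n(n-1)), and the observation that a dense subgraph of the summary graph need not be frequent can be illustrated with a small sketch. It assumes each network is stored as a set of undirected edges over unique gene labels and that the summary graph simply keeps every edge with support ≥ k; the helper names (edge, summary_graph, density) and the toy graphs are hypothetical and not taken from the CODENSE implementation.

```python
from collections import Counter


def edge(u, v):
    """Canonical undirected edge; gene labels are unique, so a sorted pair identifies it."""
    return tuple(sorted((u, v)))


def summary_graph(graphs, k):
    """Assumed construction: keep every edge that occurs in at least k input graphs."""
    support = Counter(e for g in graphs for e in g)
    return {e for e, count in support.items() if count >= k}


def density(edges):
    """d = 2m / (n(n-1)) for a subgraph with m edges and n nodes."""
    nodes = {v for e in edges for v in e}
    n, m = len(nodes), len(edges)
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


# Toy input (hypothetical): each edge of the triangle a-b-c is frequent,
# but no single graph contains the whole triangle.
graphs = [
    {edge("a", "b")}, {edge("a", "b")},
    {edge("b", "c")}, {edge("b", "c")},
    {edge("a", "c")}, {edge("a", "c")},
]

summary = summary_graph(graphs, k=2)

# Number of input graphs that contain every edge of the summary-graph triangle.
triangle_support = sum(1 for g in graphs if summary <= g)

print(density(summary))    # 1.0: the triangle is fully dense in the summary graph
print(triangle_support)    # 0: yet it occurs in none of the input graphs
```

In this toy case the triangle is maximally dense in the summary graph although it is not a frequent subgraph, which is why a dense region of the summary graph is only a candidate and the later coherency check is still needed.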
CODENSE: Mine coherent dense subgraph (Step 3)
[Figure: for every edge of Sub(Ĝ), its occurrences in the graphs G1 to G6 are recorded.]
Edge occurrence profiles:
E     G1  G2  G3  G4  G5  G6
c-e   0   0   1   1   0   1
c-f   0   1   0   1   1   1
c-h   0   0   0   1   1   1
c-i   0   0   1   1   1   0
e-f   0   0   0   1   1   1
…     …   …   …   …   …   …

CODENSE: Mine coherent dense subgraph (Step 4)
Edge occurrence profiles:
E     G1  G2  G3  G4  G5  G6
c-e   0   0   1   1   1   1
c-f   0   1   0   1   1   1
c-h   0   0   0   1   1   1
c-i   0   0   1   1   1   0
e-f   0   0   0   1   1   1
…     …   …   …   …   …   …
[Figure: second-order graph S, whose vertices are the edges g-h, f-i, e-i, h-i, e-g, g-i, e-h, c-h, f-h, c-f, e-f, c-i, c-e.]

CODENSE: Mine coherent dense subgraph (Step 4)
[Figure: a dense subgraph Sub(S) of the second-order graph S, with vertices g-h, e-i, h-i, e-g, g-i, e-h, c-h, f-h, c-f, e-f, c-e.]
Observation: If a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its second-order graph must be dense.

CODENSE: Mine coherent dense subgraph (Step 5)
[Figure: Sub(S) is mapped back to subgraphs Sub(G) over the nodes {e, g, h, i} and {c, e, f, h}.]

CODENSE: Mine coherent dense subgraph
[Figure: overview of all steps.]
Step 1: Add/Cut: G1 to G6 → summary graph Ĝ
Step 2: MODES: Ĝ → Sub(Ĝ)
Step 3: Sub(Ĝ) → edge occurrence profiles
Step 4: edge occurrence profiles → second-order graph S
Step 5: MODES: S → Sub(S)
Step 6: Restore G and MODES: Sub(S) → Sub(G)

CODENSE
The design of CODENSE addresses the scalability issue. Instead of mining each biological network individually, CODENSE compresses the networks into two meta-graphs and performs clustering in these two graphs only. Thus, CODENSE can handle a very large number of networks.

MODES: Mine overlapped dense subgraph
[Figure: the four steps of MODES.]
Step 1: HCS’: G → Sub(G)
Step 2: condense: Sub(G) → G’
Step 3: HCS’: G’ → Sub(G’)
Step 4: restore

Comparison with other Methods
• By transforming all necessary information of the n graphs into two graphs, CODENSE achieves significant time and memory efficiency.
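The second of the two graphs, the second-order graph S, can be sketched from the edge occurrence profiles as follows. The slides do not say which similarity measure or threshold is used to connect two edges, so the Jaccard similarity and the threshold 0.7 below are assumptions chosen for illustration, and the function names (jaccard, second_order_graph) are hypothetical. The profiles are the five rows shown in the Step 4 table.

```python
from itertools import combinations


def jaccard(p, q):
    """Assumed similarity of two 0/1 occurrence profiles: |both| / |either|."""
    both = sum(a and b for a, b in zip(p, q))
    either = sum(a or b for a, b in zip(p, q))
    return both / either if either else 0.0


def second_order_graph(profiles, threshold):
    """Vertices are edges of the original networks; two vertices are connected
    when their occurrence profiles have similarity >= threshold."""
    return {
        (e1, e2)
        for e1, e2 in combinations(sorted(profiles), 2)
        if jaccard(profiles[e1], profiles[e2]) >= threshold
    }


# Occurrence of each edge in G1..G6, copied from the Step 4 table.
profiles = {
    "c-e": (0, 0, 1, 1, 1, 1),
    "c-f": (0, 1, 0, 1, 1, 1),
    "c-h": (0, 0, 0, 1, 1, 1),
    "c-i": (0, 0, 1, 1, 1, 0),
    "e-f": (0, 0, 0, 1, 1, 1),
}

S = second_order_graph(profiles, threshold=0.7)
for pair in sorted(S):
    print(pair)
```

Under these assumptions, c-h and e-f have identical profiles and are connected, while c-f and c-i share only two graphs and are not. A dense region of S then corresponds to a set of edges that tend to appear and disappear together across the networks, which is the coherency property the observation in Step 4 refers to.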
