Computational Genomics Molecular Networks

Computational Genomics

10-810/02-710, Spring 2009

Physical Molecular Networks and Network algorithms

Eric Xing

Lecture 28, April 29, 2009

Reading: handouts

© Eric Xing @ CMU, 2005-2009

Molecular Networks

z Inferred molecular networks:

z Gene correlation networks (lecture 26)

z Module networks (lecture 27)

z …

z Physical molecular networks:

z Protein-protein interaction (PPI) networks

z Protein-DNA interaction (PDI) networks, i.e., transcription regulation networks

Protein-Protein Interactions (PPI)

z Protein-protein interactions involve the association of protein molecules

z E.g., signals from the exterior of a cell are relayed to the inside of that cell by protein-protein interactions

z E.g., proteins form a complex, such as the nuclear pore, that carries another protein from the cytoplasm to the nucleus

PPI Network

Yeast Two-Hybrid System (Y2H)

z A molecular biology technique used to discover protein-protein interactions.

z It tests physical interactions (such as binding) between two proteins

z Key: the activation of downstream reporter gene(s) by the binding of a transcription factor onto an upstream activating sequence (UAS)

Y2H: I

z The Gal4 transcription factor gene produces a two-domain protein (BD and AD), which is essential for transcription of the reporter gene (LacZ).

z BD is responsible for binding to the UAS

z AD is responsible for activation of transcription

Y2H: II

z Two fusion proteins are prepared: Gal4BD+Bait and Gal4AD+Prey. Neither alone is usually sufficient to initiate transcription of the reporter gene.

Y2H: III

z When both fusion proteins are produced and the Bait part of the first interacts with the Prey part of the second, transcription of the reporter gene occurs.

Protein-DNA Interactions

z Find the DNA-binding target sequences for each transcription factor

z Understand the regulatory relations between genes

z Systems biology: build gene regulatory networks

ChIP-Sequencing (ChIP-Seq)

z A molecular biology technique used to analyze protein interactions with DNA.

z It combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify binding sites of DNA-associated proteins

z It can be used to precisely map global binding sites for any protein of interest (more accurate than ChIP-chip).

ChIP-Seq: I

z Covalent cross-links between proteins and DNA are formed, typically by treating cells with formaldehyde or another chemical reagent.

ChIP-Seq: II

z Isolate genomic DNA

z Sonicate DNA to produce sheared, soluble chromatin

ChIP-Seq: III

z An antibody specific to the protein of interest is used to selectively coimmunoprecipitate the protein-bound DNA fragments that were covalently cross-linked.

ChIP-Seq: IV

z Reverse cross-links, purify DNA and prepare for sequencing

ChIP-Seq: V

z Map the resulting sequences back to the reference genome; the most frequently sequenced fragments form peaks at specific genomic regions.
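The mapping step above can be illustrated with a minimal sketch. It assumes reads are already aligned and summarized as 0-based start positions of fixed length; the function names `coverage` and `call_peaks` and the fixed depth threshold are hypothetical, and real peak callers use statistical background models rather than a simple cutoff.

```python
def coverage(read_starts, read_len, genome_len):
    """Per-base read depth from 0-based read start positions."""
    depth = [0] * genome_len
    for start in read_starts:
        for pos in range(start, min(start + read_len, genome_len)):
            depth[pos] += 1
    return depth


def call_peaks(depth, min_depth):
    """Return half-open (start, end) intervals where depth >= min_depth."""
    peaks, start = [], None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos                      # a peak region opens here
        elif d < min_depth and start is not None:
            peaks.append((start, pos))       # the peak region closes here
            start = None
    if start is not None:                    # peak runs to the end of the genome
        peaks.append((start, len(depth)))
    return peaks


# Toy data (assumed): reads pile up around position 10 of a 30-bp "genome".
reads = [8, 9, 10, 10, 11, 22]
depth = coverage(reads, read_len=5, genome_len=30)
peaks = call_peaks(depth, min_depth=3)
print(peaks)
```

The isolated read at position 22 never reaches the threshold, so only the pile-up is reported as a candidate binding site.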

Other Related Techniques

Mining and analyzing networks

z Identifying Signaling Pathways

z color-coding technique (Alon, Yuster and Zwick, 1995) and generalizations (Scott et al., RECOMB 2005)

z Identifying Interaction Complexes (clique-like structures)

z Statistical subgraph scoring (Sharan et al., RECOMB 2004)

z Network alignment

z PathBLAST: identify conserved pathways (Kelley et al., 2003)

z MaWISh: identify conserved multi-protein complexes (Koyuturk et al., 2004)

z Nuke: Scalable and General Pairwise and Multiple Network Alignment (Flannick, Novak, Srinivasan, McAdams, Batzoglou, 2005)

z Network Dynamics

z Sandy: backtracking to find active sub-network (Luscombe et al., Nature 2005)

z Node function inference

z Stochastic block models (Airoldi et al., 2006)

z Latent space models (Hoff, 2004)

z Link prediction

z Naïve Bayes classifier, Bayesian network

z MRF

Network evolution

Three problems:

1. Test all possible relationships.

2. Examine unknown internal states.

3. Explore unknown paths between states at nodes.

[Figure: evolution from the MRCA (Most Recent Common Ancestor) along the time direction. Internal states and the evolutionary path are unobservable; the present-day networks are observable. Parameters: time rates, selection.]

→ Network alignment

Motivation

z Sequence alignment seeks to identify conserved DNA or protein sequence

z Intuition: conservation implies functionality

EFTPPVQAAYQKVVAGV (human)
DFNPNVQAAFQKVVAGV (pig)
EFTPPVQAAYQKVVAGV (rabbit)

z By similar intuition, subnetworks conserved across species are likely functional modules
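As a small illustration of the conservation intuition, the sketch below computes percent identity between the three example sequences, which are already aligned and of equal length. The helper name `percent_identity` is hypothetical; real alignments also have to handle gaps and substitution scores.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


human = "EFTPPVQAAYQKVVAGV"
pig = "DFNPNVQAAFQKVVAGV"
rabbit = "EFTPPVQAAYQKVVAGV"

print(percent_identity(human, rabbit))         # identical sequences
print(round(percent_identity(human, pig), 1))  # 13 of 17 positions agree
```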

Network Alignment

z “Conserved” means two subgraphs contain proteins having homologous sequences, serving similar functions, having similar interaction profiles

z Key word is similar, not identical

[Figure: conserved interactions and protein groups, with mismatch/substitution]

z Product graph:

z Nodes: groups of sequence-similar proteins, one per species.

z Edges: conserved interactions.
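The product graph described above can be sketched for two species as follows. This is a simplified illustration, not the construction used by any specific tool: the function name `alignment_graph` and the toy data are assumptions, and only directly conserved interactions are kept (no gaps or mismatches).

```python
from itertools import combinations


def alignment_graph(edges1, edges2, similar):
    """Build a two-species product graph.

    edges1, edges2: iterables of (protein, protein) interactions per species.
    similar: set of (p1, p2) pairs of sequence-similar proteins.
    Nodes are the similar pairs; two nodes are linked when the interaction
    is present in both species.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes = sorted(similar)
    edges = set()
    for (a1, a2), (b1, b2) in combinations(nodes, 2):
        if frozenset((a1, b1)) in e1 and frozenset((a2, b2)) in e2:
            edges.add(((a1, a2), (b1, b2)))
    return nodes, edges


# Toy data (assumed): A-B is conserved as a-b; B-C has no counterpart b-c.
species1 = [("A", "B"), ("B", "C")]
species2 = [("a", "b"), ("b", "d")]
similar_pairs = {("A", "a"), ("B", "b"), ("C", "c")}

nodes, edges = alignment_graph(species1, species2, similar_pairs)
print(edges)
```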

Scoring Scheme

z Given two protein subsets, one in each species, with a many-to-many correspondence between them, we wish:

z Each subset induces a dense subgraph.

z Matched protein pairs are sequence-similar.

z Two hypotheses:

z Conserved complex model: matched pairs are similar.

z Random model: matched pairs are randomly chosen.

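$$ L(C, C') = L(C)\, L(C') \times \prod_{(u,v)\ \text{matched}} \frac{\Pr(S_{u,v} \mid \text{similar})}{\Pr(S_{u,v} \mid \text{random})} $$

where $S_{u,v}$ is the sequence similarity of the matched pair $(u, v)$ (BLAST E-value).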
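The matched-pair factor of this score is a product of likelihood ratios, which is usually accumulated in log space. The sketch below covers only that factor; the function name `similarity_log_ratio` and the probability values are made up for illustration, since the slides do not specify how Pr(S | similar) and Pr(S | random) are estimated.

```python
import math


def similarity_log_ratio(matched, p_similar, p_random):
    """Sum over matched pairs of log Pr(S_uv | similar) / Pr(S_uv | random)."""
    return sum(math.log(p_similar[pair] / p_random[pair]) for pair in matched)


# Hypothetical probabilities for two matched pairs.
matched = [("u1", "v1"), ("u2", "v2")]
p_similar = {("u1", "v1"): 0.8, ("u2", "v2"): 0.5}
p_random = {("u1", "v1"): 0.1, ("u2", "v2"): 0.5}

score = similarity_log_ratio(matched, p_similar, p_random)
print(score)
```

A pair that is equally likely under both hypotheses contributes zero, so only pairs whose similarity is better explained by the conserved model raise the score.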

Scoring Scheme cont.

z For multiple networks: we run into the same problem as scoring a multiple sequence alignment.

z Need to balance edge and vertex terms.

z Practical solution:

z Sensible threshold for sequence similarity.

z Nodes in alignment graph are filtered accordingly.

z Node terms are removed from score.

Multiple Network Alignment

[Figure: alignment pipeline with four stages]

z Preprocessing: interaction scores from logistic regression on #observations, expression correlation, clustering coeff.

z Network alignment: conserved interactions, protein groups

z Subnetwork search: conserved paths, conserved clusters

z Filtering & visualizing: p-value<0.01, ≤80% overlap

z Two recent algorithms:

z ???, Sharan et al. PNAS 2005

z Nuke: Flannick, Novak, Srinivasan, McAdams, Batzoglou 2005

Nuke: the model

z Example: hypothetical ancestral module

descendants

equivalence classes

Nuke: Scoring

z Probabilistic scoring of alignments:

$$ \log \frac{P(\text{nodes} \mid M)}{P(\text{nodes} \mid R)} + \log \frac{P(\text{edges} \mid M)}{P(\text{edges} \mid R)} $$

z M : Alignment model (network evolved from a common ancestor)

z R : Random model (nodes and edges picked at random)

z Nodes and edges scored independently: How? This is a hot research issue! (not covered here)

$$ S = S_N + S_E = 11.0 + 4.0 $$

[Figure: example alignment annotated with node and edge scores 2.5, 1.2, 0.4, 0.8, 3.0, 0.8, -0.4, -0.3, -0.2, 0.5, 0.6, 0.6, 4.0, 1.5]
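Because node and edge terms are scored independently, the total is just the sum of the two partial sums. The sketch below reuses the numbers from the example; which values belong to nodes and which to edges is an assumption, chosen because that split reproduces the stated totals of 11.0 and 4.0.

```python
def alignment_score(node_scores, edge_scores):
    """Return (S_N, S_E, S) where S = S_N + S_E."""
    s_n = sum(node_scores)
    s_e = sum(edge_scores)
    return s_n, s_e, s_n + s_e


# Assumed split of the example's log-ratio scores into node and edge terms.
node_scores = [2.5, 3.0, 4.0, 1.5]
edge_scores = [1.2, 0.4, 0.8, 0.8, -0.4, -0.3, -0.2, 0.5, 0.6, 0.6]

s_n, s_e, s = alignment_score(node_scores, edge_scores)
print(round(s_n, 1), round(s_e, 1), round(s, 1))
```

Negative edge terms correspond to interactions that are better explained by the random model, so they lower the score of an alignment that includes them.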

A General Network Aligner: Algorithm

z Given this model of network alignment and scoring framework, how to efficiently find alignments between a pair of networks (N1, N2)?

z Constructing every possible set of equivalence classes is clearly prohibitive

z Idea: seeded alignment

z Inspired by seeded sequence alignment (BLAST)

z Identify regions of the network in which “good” alignments are likely to be found

z MaWISh does this, using high-degree nodes for seeds

z Can we avoid such strong topological constraints?

[Figure: Seed, then Extend]
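The seed-and-extend idea can be sketched as a greedy procedure. This is a deliberate simplification and not the Nuke or MaWISh algorithm: it builds a one-to-one mapping rather than equivalence classes, picks the single most similar pair as the seed, and extends only along interactions present in both networks. The function name, the `min_sim` threshold, and the toy data are all assumptions.

```python
def seed_and_extend(adj1, adj2, sim, min_sim=0.5):
    """Greedy one-to-one alignment grown from the most similar protein pair.

    adj1, adj2: dict mapping each node to the set of its neighbors.
    sim: dict mapping (node1, node2) to a sequence-similarity score.
    """
    seed = max(sim, key=sim.get)              # seed: best-scoring pair
    aligned = {seed}
    used1, used2 = {seed[0]}, {seed[1]}
    frontier = [seed]
    while frontier:
        u, v = frontier.pop()
        # Candidates pair an unused neighbor of u with an unused neighbor of v,
        # so every accepted pair extends the alignment along a conserved edge.
        candidates = sorted(
            ((sim.get((a, b), 0.0), a, b)
             for a in adj1.get(u, ()) if a not in used1
             for b in adj2.get(v, ()) if b not in used2),
            reverse=True,
        )
        for score, a, b in candidates:
            if score >= min_sim and a not in used1 and b not in used2:
                aligned.add((a, b))
                used1.add(a)
                used2.add(b)
                frontier.append((a, b))
    return aligned


# Toy networks (assumed): two three-node paths with matching proteins.
adj1 = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
adj2 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
sim = {("A", "a"): 0.9, ("B", "b"): 0.8, ("C", "c"): 0.7, ("C", "a"): 0.6}

aligned = seed_and_extend(adj1, adj2, sim)
print(sorted(aligned))
```

Seeding on sequence similarity rather than node degree is one way to avoid the topological constraint mentioned above, at the cost of depending on the quality of the similarity scores.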

Multiple Alignment

z Progressive alignment technique

z Used by most multiple sequence aligners

[Figure: progressive alignment of M. tuberculosis, E. coli, and C. crescentus]

z Simple modification of the implementation to align alignments rather than networks

z Node scoring already uses weighted sum-of-pairs (SOP)

z Edge scoring remains unchanged

Pairwise alignments

[Figure: pairwise alignments showing matched gene pairs (e.g., ftsZ-ftsZ, murC-murC, exbB2-exbB, yccZ-wza, Cj0693c-HP0707) in three conserved modules: cell division, DNA uptake, and polysaccharide transport]

Multiple alignments

[Figure: multiple alignments showing matched gene groups across four species (e.g., dnaA, dnaN, gyrB, gidA, yidC, ruvA, ruvC, pal, exbB, exbD) in two conserved modules: DNA replication and DNA uptake]

Dynamic Yeast TF network

[Figure: yeast regulatory network linking transcription factors to target genes]

z Analyzed network as a static entity

z But the network is dynamic

z Different sections of the network are active under different cellular conditions

z Integrate gene expression data

Gene expression data

z Genes that are differentially expressed under five cellular conditions

Cellular condition    No. genes
Cell cycle            437
Sporulation           876
Diauxic shift         1,876
DNA damage            1,715
Stress response       1,385

z Assume these genes undergo transcription regulation

Backtracking to find active sub-network

Define differentially expressed genes

Identify TFs that regulate these genes

Identify further TFs that regulate these TFs

Active regulatory sub-network
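The backtracking steps above amount to walking the regulatory network upstream from the differentially expressed genes. The sketch below is one way to write that walk; the function name `active_subnetwork`, the dictionary representation of the network, and the gene and TF names are hypothetical.

```python
def active_subnetwork(regulators, de_genes):
    """Collect regulatory edges reachable upstream of differentially expressed genes.

    regulators: dict mapping a gene to the set of TFs that regulate it.
    de_genes: iterable of differentially expressed genes.
    Returns a set of (tf, target) edges forming the active sub-network.
    """
    active_edges = set()
    visited = set()
    frontier = list(de_genes)
    while frontier:
        gene = frontier.pop()
        if gene in visited:          # guards against regulatory cycles
            continue
        visited.add(gene)
        for tf in regulators.get(gene, ()):
            active_edges.add((tf, gene))
            frontier.append(tf)      # next: find TFs that regulate this TF
    return active_edges


# Toy network (assumed): TF3 regulates TF1; TF4 only regulates g3.
regulators = {
    "g1": {"TF1"},
    "g2": {"TF1", "TF2"},
    "g3": {"TF4"},
    "TF1": {"TF3"},
}
active = active_subnetwork(regulators, {"g1", "g2"})
print(sorted(active))
```

Since g3 is not differentially expressed in this toy condition, TF4 and its edge stay out of the active sub-network.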

Network usage under different conditions

[Figures: the regulatory network as a static whole, followed by the active sub-network under each condition: cell cycle, sporulation, diauxic shift, DNA damage, and stress response, and a side-by-side comparison of the five conditions]

How should we model the way the network changes? This is an open problem.

Network Tomography: functional analysis

A Latent Mixture Membership Model

Motivation

z In many networks (e.g., biological network, citation networks), each node may be “multiple-class”, i.e., has multiple functional/topical aspects.

z The interactions of a node (e.g., a protein) with different nodes (partners) may occur under different functional contexts.

z Prior knowledge of group interaction may be available.

A Mixed Membership Stochastic Blockmodel (MMSB)

Airoldi, Blei, Fienberg, and Xing, 2008

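[Figure: graphical model of the MMSB, with the following variables]

$\theta_i$, $\theta_j$: topic vectors of node i and node j

$z_{i,j,1}$: topic indicator of node i as donor

$z_{i,j,2}$: topic indicator of node j as acceptor

$R_{i,j}$: the link indicator of (i,j)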

Hierarchical Bayesian MMSB

For each object $i = 1, \dots, N$:

$$ \theta_i \sim \mathrm{Dirichlet}(\alpha) $$

For each topic-pair $(s, t)$:

$$ \gamma_{s,t} \sim \mathrm{Beta}(\beta) $$

For each pair of objects $(i, j)$:

$$ z_{i,j,1} \sim \mathrm{Multi}(\theta_i), \qquad z_{i,j,2} \sim \mathrm{Multi}(\theta_j) $$

$$ R_{i,j} \sim \mathrm{Bernoulli}\left(\rho\, \gamma_{z_{i,j,1},\, z_{i,j,2}} + (1 - \rho)\, \delta_0\right) $$
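The generative process above can be simulated directly. The sketch below uses only the Python standard library and makes several assumptions that are not stated on the slide: a symmetric scalar Dirichlet parameter `alpha`, a Beta prior written as `(beta1, beta2)`, no self-links, and the reading that the point mass at zero contributes nothing to the link probability, so a link appears with probability rho times gamma.

```python
import random


def sample_mmsb(n, k, alpha, beta1, beta2, rho, seed=0):
    """Draw one network from a mixed membership stochastic blockmodel."""
    rng = random.Random(seed)

    # theta_i ~ Dirichlet(alpha): normalize independent Gamma(alpha, 1) draws.
    theta = []
    for _ in range(n):
        g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(g)
        theta.append([x / total for x in g])

    # gamma_{s,t} ~ Beta(beta1, beta2): link propensity for each topic pair.
    gamma = [[rng.betavariate(beta1, beta2) for _ in range(k)] for _ in range(k)]

    links = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: self-links are not modeled
            z1 = rng.choices(range(k), weights=theta[i])[0]  # role of i as donor
            z2 = rng.choices(range(k), weights=theta[j])[0]  # role of j as acceptor
            p = rho * gamma[z1][z2]  # assumed reading of the delta_0 mixture term
            links[i][j] = 1 if rng.random() < p else 0
    return theta, gamma, links


theta, gamma, links = sample_mmsb(n=5, k=3, alpha=1.0, beta1=1.0, beta2=1.0, rho=0.9)
for row in links:
    print(row)
```

Because the topic indicators are redrawn for every pair, the same node can act under different roles with different partners, which is the mixed membership property.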

Variational Inference

z The Joint likelihood:

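$$ p(r, z, \theta, \gamma) = \prod_i \theta_i^{\sum_j (z_{i,j,1} + z_{i,j,2}) + \alpha - 1} \times \prod_{m,n} \gamma_{m,n}^{\sum_{i,j} r_{i,j}\, z_{i,j,1}^{m} z_{i,j,2}^{n} + \beta_1 - 1} (1 - \gamma_{m,n})^{\sum_{i,j} (1 - r_{i,j})\, z_{i,j,1}^{m} z_{i,j,2}^{n} + \beta_2 - 1} $$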

z GMF approximation:

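$$ q(r, z, \theta, \gamma \mid \alpha, \beta) = \left( \prod_{i=1}^{N} q(\theta_i \mid \mu_i) \right) \times \left( \prod_{s=1,t=1}^{K} q(\gamma_{s,t} \mid \nu_{s,t}) \right) \times \left( \prod_{i=1,j=1}^{N} q(z_{i,j,1}, z_{i,j,2}, r_{i,j} \mid \varphi_{i,j}) \right) $$

$$ \mu_i = \alpha + \sum_j z_{i,j,1} + \sum_j z_{i,j,2} $$

$$ \nu_{s,t} = \beta + \sum_{i,j} r_{i,j}\, z_{i,j,1}^{s} z_{i,j,2}^{t} $$

...

z MF approximation:

$$ q(r, z, \theta, \gamma \mid \alpha, \beta) = \left( \prod_{i=1}^{N} q(\theta_i \mid \mu_i) \right) \times \left( \prod_{s=1,t=1}^{K} q(\gamma_{s,t} \mid \nu_{s,t}) \right) \times \left( \prod_{i=1,j=1}^{N} q(z_{i,j,1} \mid \phi_{i,j,1})\, q(z_{i,j,2} \mid \phi_{i,j,2})\, q(r_{i,j} \mid \varphi_{i,j}) \right) $$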

Inferred Mixed membership Network Tomography

Dynamic MMSB

Fu, Song, and Xing, 2009

Trajectory of MM of genes during Drosophila life cycle

Summary of MMSB

z A stochastic block model

z Each node can play "multiple roles", and its ties with other nodes can be explained by different roles

z Hierarchical Bayesian formalism

z Dynamic tomography

z Efficient variational inference

Computational Molecular Biology

Using mathematical models and computational reasoning to pursue predictive understanding of life

Research in Computer Science

z Computer science is a “science of the artificial.”

z Problems are precisely stated and are often generic rather than application-specific.

z The quality of an algorithm is measured by its worst-case time bound.

z Mathematical elegance is just as important as relevance to applications.

Research in Computational Biology

z The goal is to understand ground truth.

z Problem statements are often fuzzy.

z Problems are often application-specific, and problem formulations must be faithful to those applications.

z The quality of an algorithm is measured by its performance on real data.

z Biological findings are more important than computational methods.

Adapting to Computational Biology

z Choose problems that are fundamental, timely and relevant.

z Mathematical depth and elegance are highly desirable, but often simple mathematics, artfully applied, is the key to success.

z Avoid problems that will change when technology changes.

z Learn the biological background of your problem, the available sources of data and their noise characteristics.

z Work with an application-oriented team and don’t get typecast as an algorithms specialist or just "play with numbers."

z Benchmark your algorithms on real data, establish a user community and make your software available and easy to use.

Computational Biology can Benefit from Research in Machine Learning

z Biological processes are stochastic and partially observed

z probabilistic models and statistical inference/learning algorithms

z Biological data are usually non-linear and high dimensional

z kernel methods and convex optimization

z Biological systems are complex and usually intractable

z efficient representation and approximation techniques

z Biological prior knowledge provides crucial model constraints, and biological subjects can be studied from different angles

z Bayesian approach and data fusion methods

Conclusion

1) Extendable Models

2) Effective Algorithms and Simulators

3) Interactive Analysis

References

z Deng et al. Assessment of the reliability of protein-protein interactions and protein function prediction. Proc. PSB, 140-151 (2003).

z Bader et al. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 78-85 (2004).

z Kelley et al. PathBLAST: a tool for alignment of protein interaction networks. Nucl. Acids Res. 32, W83-8 (2004).

z Kelley et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS 100, 11394-9 (2003).

z Sharan et al. Conserved patterns of protein interaction in multiple species. PNAS 102, 1974-9 (2005).

z Sharan et al. Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. J. Comp. Biol. In press (2005).

z Scott et al. Efficient algorithms for detecting signaling pathways in protein interaction networks. Proc. RECOMB, 1-13 (2005).

z E. Airoldi, D. Blei, S. Fienberg, and E. P. Xing, Mixed Membership Stochastic Blockmodel, Journal of Machine Learning Research, 9(Sep):1981--2014, 2008.

z W. Fu, L. Song, and E. P. Xing, A State-Space Mixed Membership Blockmodel for Dynamic Network Tomography, Manuscript, arXiv:0901.0135.