
for Biological Data Mining

Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University
E-mail: [email protected]
http://scai.snu.ac.kr./~btzhang/

This material is available at http://scai.snu.ac.kr/~btzhang/

Outline

• Basics in Molecular Biology
• Current Issues and Applications
• Machine Learning for Bioinformatics
• DNA Chip Data Mining
• Graphical Models for Expression Analysis
• Summary

What is Bioinformatics?

• Bio: molecular biology
• Informatics: computer science
• Bioinformatics: solving problems arising from biology using methodology from computer science
• Bioinformatics vs. Computational Biology

Basics in Molecular Biology

DNA Double Helix

DNA Base-pairs

DNA

AACCTGCGGAAGGATCATTACCGAGTGCGGG TCCTTTGGGCCCAACCTCCCATCCGTGTCTAT TGTACCCGTTGCTTCGGCGGGCCCGCCGCTT GTCGGCCGCCGGGGGGGCGCCTCTGCCCC CCGGGCCCGTGCCCGCCGGAGACCCCAACA CGAACACTGTCTGAAAGCGTGCAGTCTGAGTT GATTGAATGCAATCAGTTAAAACTTTCAACAAT GGATCTCTTGGTTCCGGCATGCAATCAGTCC CGTTGCTTCGGCACTGTCTGAAAGCGCCTTT GGGCCCAACCTCCCATCCGTGTCTATTGTAC CCGTTGCTTCGGCGGGCCCGCCGCTTGTCG GCCGCCGGGGGGGCGCCGTTGCTTCGGCG GGCCCGCCGCTTGTCGGCCGCCGGGGCTAT TGTACCCGTTGCTTCGGATCTCTTGGGGATCT CTTGGTTCCGGCATGCAATCAGTCCCGTTGCT TCGGCACTGTCTGAAAGCGCCTTTGGGCCCA ACCTCCCACCGTTGCTTCGGCGGGCCCGCC GCTTGTCGGCCGCCGGGGGGGCGGCCGCC GGGGGCACTGTCTGAAAGCTCGGCCGCC

Human Genome Sequenced!

• “The most wondrous map ever produced by human kind”
• Scientists jointly announced that they had obtained a near-complete set of the biochemical instructions for human life.
• “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”

Some Facts

• DNA differs between humans by 0.2% (1 in 500 bases).
• Human DNA is 98% identical to that of chimpanzees.
• 97% of DNA in the human genome has no known function.
• 3 × 10^9 letters in the DNA code in every cell in your body.
• 10^14 cells in the body.
• 12,000 letters of DNA decoded by the Human Genome Project every second.

Molecular Biology: Flow of Information

[Figure: flow of information from DNA through RNA to protein and function; a DNA base sequence is shown next to the amino acid chain (Leu, Ala, Ser, Phe, Cys, Lys, Arg, Asp, ...) it encodes]

Using the Genome

[Figure labels: Genetic Information, Molecular Dynamics, Molecular Structure, Biochemical Function, Biological Behavior]

• Redundancy in genetic information
• Single genes have multiple functions
• Genes are 1-D, gene products are 3-D

Gene Structure

DNA → (search) → “gene” → (compute) → RNA → (compute) → Protein sequence → (how?) → Folded Protein

DNA (gene) → RNA → Protein

[Figure: a gene on DNA, flanked by control statements, with a TATA start and a termination stop; RNA polymerase transcribes it into mRNA (5' UTR, ribosome binding site, 3' UTR), which the ribosome turns into protein]

Numbers of Genes

• Humans: 25,000 - 40,000
• C. elegans (worm): 19,000
• S. cerevisiae (yeast): 6,000
• Tuberculosis microbe: 4,000

Genetic Code: 3 bases = 1 amino acid

First position |        Second position        | Third position
(5' end)       |  T      C      A      G       | (3' end)
---------------+-------------------------------+---------------
T              |  Phe    Ser    Tyr    Cys     | T
T              |  Phe    Ser    Tyr    Cys     | C
T              |  Leu    Ser    STOP   STOP    | A
T              |  Leu    Ser    STOP   Trp     | G
C              |  Leu    Pro    His    Arg     | T
C              |  Leu    Pro    His    Arg     | C
C              |  Leu    Pro    Gln    Arg     | A
C              |  Leu    Pro    Gln    Arg     | G
A              |  Ile    Thr    Asn    Ser     | T
A              |  Ile    Thr    Asn    Ser     | C
A              |  Ile    Thr    Lys    Arg     | A
A              |  Met    Thr    Lys    Arg     | G
G              |  Val    Ala    Asp    Gly     | T
G              |  Val    Ala    Asp    Gly     | C
G              |  Val    Ala    Glu    Gly     | A
G              |  Val    Ala    Glu    Gly     | G
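The codon table can be turned into a small translation routine. The following is a minimal sketch, not part of the original slides: it assumes the standard genetic code, uses one-letter amino acid symbols instead of the three-letter names in the table, and the names `CODON_TABLE` and `translate` are illustrative.

```python
# Standard genetic code, ordered by first, second and third base over "TCAG".
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    dna = dna.upper()
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":          # '*' marks a STOP codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # First nine bases of the DNA slide: AAC CTG CGG -> Asn Leu Arg
    print(translate("AACCTGCGG"))
```

Reading frame and strand are taken as given here; a real gene finder would have to try all six frames.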


Nucleotide Sequence

SQ sequence 1344 BP; 291 A; C; 401 G; 278 T; 0 other aacctgcgga aggatcatta gcgggcccgc cgcttgtcgg cgcttgtcgg ccgccggggg ccgagtgcgg gtcctttggg ccgccggggg ggcgcctctg ccccccgggc ccgtgcccgc cccaacctcc catccgtgtc ccccccgggc ccgtgcccgc cggagacccc aacacgaaca tattgtaccc tgttgcttcg aacctgcgga aggatcatta ctgtctgaaa gcgtgcagtc gcgggcccgc cgcttgtcgg ccgagtgcgg gtcctttggg tgagttgatt gaatgcaatc ccgccggggg ggcgcctctg cccaacctcc catccgtgtc agttaaaact ttcaacaatg ccccccgggc ccgtgcccgc tattgtaccc tgttgcttcg gatctcttgg aacctgcgga cggagacccc aacacgaaca gcgggcccgc cgcttgtcgg ccgagtgcgg gtcctttggg ctgtctgaaa gcgtgcagtc agttaaaact ttcaacaatg cccaacctcc catccgtgtc tgagttgatt gaatgcaatc gatctcttgg ttccggctgc tattgtaccc tgttgcttcg agttaaaact ttcaacaatg tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg gatctcttgg ttccggctgc gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg tattgtaccc tgttgcttcg ccgccggggg ggcgcctctg agttaaaact ttcaacaatg gcgggcccgc cgcttgtcgg ccccccgggc ccgtgcccgc gatctcttgg ttccggctgc ccgccggggg ggcgcctctg cggagacccc tgttgcttcg tattgtaccc tgttgcttcg ccccccgggc ccgtgcccgc gcgggcccgc cgcttgtcgg gcgggcccgc cgcttgtcgg cggagacccc tgttgcttcg ccgccggggg cggagacccc ccgccggggg ggcgcctctg gcgggcccgc cgcttgtcgg gcgggcccgc cgcttgtcgg ccccccgggc ccgtgcccgc ccgccggggg cggagacccc ccgccggggg ggcgcctctg cggagacccc tgttgcttcg 16

Protein (Amino Acid) Sequence

CG2B_MARGL Length: 388 April 2, 1997 14:55 Type: P Check: 9613 .. 1 MLNGENVDSR IMGKVATRAS SKGVKSTLGT RGALENISNV ARNNLQAGAK KELVKAKRGM TKSKATSSLQ SVMGLNVEPM EKAKPQSPEP MDMSEINSAL EAFSQNLLEG VEDIDKNDFD NPQLCSEFVN DIYQYMRKLE REFKVRTDYM TIQEITERMR SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI RSMECNILRR LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM AKYLMELTLP EYAFVPYDPS EIAAAALCLS SKILEPDMEW GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP TAKFQAVRKK YSSAKFMNVS TISALTSSTV MDLADQMC

Protein Structure

Human Genetic Variations (Single Nucleotide Polymorphisms)

• SNPs: “genetic individuality”
• ~1/1000 bases variable (between 2 humans)
• Make us more or less susceptible to diseases
• May influence the effect of drug treatments

TTTGCTCCGTTTTCA
TTTGCTCYGTTTTCA
TTTGCTCTGTTTTCA

SNP (Single Nucleotide Polymorphism)

• Finding single nucleotide changes at specific regions of genes
  - Diagnosis of hereditary diseases
  - Personal drugs
  - Finding more effective drugs and treatments

Human Individuality

Flood of Data! (SWISS-PROT)

[Figure: number of SWISS-PROT sequences (× 1000, axis from 0 to 80) by year of release, 1988 to 1996]

How Can We Analyze the Flood of Data?

• Data: don't just store it, analyze it! By comparing sequences, one can find out about things like
  - ancestors of organisms
  - phylogenetic trees
  - protein structures
  - protein function

Bioinformatics Is About:

• Elicitation of DNA sequences from genetic material
• Sequence annotation (e.g. with information from experiments)
• Understanding the control of gene expression (i.e. under what circumstances genes are transcribed from DNA)
• The relationship between the amino acid sequence of proteins and their structure

Aim of Research in Bioinformatics

• Understand the functioning of living things, to “improve the quality of life”
  - Drug design
  - Identification of genetic risk factors
  - Gene therapy
  - Genetic modification of food crops and animals, etc.

Current Issues and Applications

The Central Dogma of Information Flow in Biology

The sequence of amino acids making up a protein, and hence its structure (folded state) and thus its function, is determined by transcription from DNA via RNA.

DNA → RNA → Protein → Function

3 Main Classes of Problem Areas

• Central dogma related: sequence, structure or function
• Data related: storage, retrieval & analysis (exponential growth of knowledge in molecular biology)
• Simulation of biological processes: protein folding (molecular dynamics), metabolic pathways

Topics in Bioinformatics

• Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
• Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
• Expression analysis
  - Gene expression analysis
  - Gene clustering
• Pathway analysis
  - Metabolic pathway
  - Regulatory networks

Sequence Analysis

Finding information and patterns in DNA and protein data

• Finding evolutionary relationships
• Finding coding regions of genomic sequences
• Translating DNA to protein
• Finding regulatory regions
• Assembling genome sequences

Structure Analysis

• The amino acid sequence of a protein determines its 3D conformation

MNIHRSTPITIARYGRSRNKT QDFEELSSIRSAEPSQSFSPNL GSPSPPETPNLSHCVSCIGKY LLLEPLEGDHVFRAVHLHSG EELVCKVFDISCYQESLAPCF

Sequence → Structure → Function

Gene Expression Analysis

Nature 21, 10 (1999)

Pathway Analysis

• One of the declarative ways of representing biological knowledge

[Figure: metabolic pathway]

Applications of Bioinformatics

• Drug design
• Identification of genetic risk factors
• Gene therapy
• Genetic modification of food crops and animals
• Biological warfare, crime, etc.
• Personal medicine?
• E-Doctor?

Bioinformatics as Information Technology

[Figure: Bioinformatics at the center, connected to six areas of information technology]
• Database: GenBank, SWISS-PROT
• Hardware: supercomputing
• Information retrieval: biomedical text analysis
• Algorithm: sequence alignment
• Agent: information filtering, monitoring agent
• Machine learning: clustering, rule discovery

Bioinformatics on the Web

[Figure: microarray workflow on the web. The experimental process (sample, hybridization, array scanner) feeds image analysis; data management uses a relational database with a web interface; results and summaries link to other information resources, and data can be downloaded to other applications for data analysis and interpretation.]

Bioinformatics and Artificial Intelligence

• A new application domain of AI and machine learning
  - Data mining and knowledge discovery
  - Information filtering for scientists
  - Intelligent agents for customized data service
• A new basis for developing new AI technologies
  - “Biointelligence”
  - Biomolecular (DNA) computing
  - Molecular evolutionary algorithms

Machine Learning for Bioinformatics

Machine Learning Techniques for Bio Data Mining

• Sequence Alignment
  - Simulated Annealing
  - Genetic Algorithms
• Structure and Function Prediction
  - Hidden Markov Models
  - Multilayer Perceptrons
  - Decision Trees
• Molecular Clustering and Classification
  - Support Vector Machines
  - Nearest Neighbor Algorithms
• Expression (DNA Chip Data) Analysis
  - Self-Organizing Maps
  - Bayesian Networks

Problems in Biological Science and Machine Learning Methods

Sequence alignment
  Problems:
  - Pairwise sequence alignment
  - Database search for similar sequences
  - Multiple sequence alignment
  - Phylogenetic tree reconstruction
  - Protein 3D structure alignment
  Methods: optimization algorithms (search)
  - Dynamic programming
  - Simulated annealing
  - Genetic algorithms
  - Neural networks
  - Hidden Markov models

Structure/function prediction
  Problems:
  - RNA secondary structure prediction
  - RNA 3D structure prediction
  - Protein 3D structure prediction
  - Motif extraction
  - Functional site prediction
  - Cellular localization prediction
  - ... prediction
  - Transmembrane segment prediction
  - Protein secondary structure prediction
  - Protein 3D structure prediction
  Methods: pattern recognition and learning algorithms
  - Discriminant analysis
  - Hierarchical neural networks
  - Hidden Markov models
  - Formal grammar

Molecular clustering/classification
  Problems:
  - Superfamily classification
  - Ortholog/paralog grouping of genes
  - 3D fold classification
  Methods: clustering algorithms
  - Hierarchical cluster analysis
  - Kohonen neural networks
  Methods: classification algorithms
  - Bayesian networks
  - Neural networks
  - Support vector machines
  - Decision trees

Expression (DNA chip data) analysis
  Methods:
  - Support vector machines
  - Bayesian networks
  - Latent variable models
  - Generative topographic mapping

Sequence Alignment

Sequence Alignment (Similarity Search)

• Basic operation
  - Comparison against each of the known examples stored in a primary database to detect any similarity that can be used for further reasoning
• Example

ATTGGCCA ATTGGCCA ATTGGCCA ATTGGCCA | | | | | | | | A— GG— A AGG ——A AG ———A A———GA 4+2*10=24 6+1*10=16 6+1*10=16 6+1*10=16
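Alignments like the ones in the example can be found by dynamic programming. The sketch below is a plain Needleman-Wunsch global alignment, added for illustration; its scoring (match +1, mismatch -1, linear gap penalty -2) is an assumption and is not the scoring scheme used for the numbers on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ATTGGCCA", "AGGA")
    print(total)
    print(top)
    print(bottom)
```

With a linear gap penalty, several alignments of ATTGGCCA and AGGA tie for the best score; an affine penalty (gap opening plus gap extension) would be needed to prefer fewer, longer gaps.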


Simulated Annealing for Multiple Sequence Alignment

• The Metropolis Monte Carlo procedure is repeated at gradually decreasing temperature for energy minimization

$$\Delta E = E(x_n') - E(x_n), \qquad p = \begin{cases} 1 & \text{when } \Delta E \le 0 \\ \exp(-\Delta E / T_n) & \text{otherwise} \end{cases}$$
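The acceptance rule above can be sketched in a few lines. This is an illustrative toy, not the multiple-alignment program itself: the geometric cooling schedule, the parameter names and the one-dimensional energy function in the usage example are all assumptions.

```python
import math
import random


def acceptance_probability(delta_e, temperature):
    """Metropolis rule: always accept downhill moves, uphill with exp(-dE/T)."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)


def simulated_annealing(energy, neighbor, x0, t0=1.0, cooling=0.95,
                        steps=1000, seed=0):
    """Minimize energy(x) starting from x0; returns (best_x, best_energy)."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(x, rng)
        candidate_e = energy(candidate)
        if rng.random() < acceptance_probability(candidate_e - e, temperature):
            x, e = candidate, candidate_e
            if e < best_e:
                best_x, best_e = x, e
        # Assumed geometric cooling; the floor avoids division by zero.
        temperature = max(temperature * cooling, 1e-12)
    return best_x, best_e


if __name__ == "__main__":
    # Toy problem: integer x, energy (x - 3)^2, neighbors are x - 1 and x + 1.
    toy_energy = lambda x: (x - 3) ** 2
    toy_neighbor = lambda x, rng: x + rng.choice((-1, 1))
    print(simulated_annealing(toy_energy, toy_neighbor, x0=20))
```

For multiple sequence alignment, x would be a candidate alignment, `neighbor` a small random change to the gap placement, and `energy` the alignment cost.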

[Figure: energy E plotted over the state space x (e.g. all possible alignments)]

Genetic Algorithms: Representation

• The sorted order representation

                       1      2      3      4      5     starting position
  Individual           1110 | 0010 | 1001 | 0110 | 1011 | 0011
  Decimal number       14     2      9      6      11     3
  Sort order           5      1      3      2      4
  Intermediate layout  2      4      3      5      1
  Final layout         3      5      1      2      4

• Operators
  - A simple swap operation as the mutation operator
  - Permutation crossover
  - Transposition operator
  - Inversion operator
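The decoding of a sorted order individual can be written out as follows. This sketch reproduces the decimal row of the example (14, 2, 9, 6, 11 with starting position 3); the function names are hypothetical, and reading the starting position as a 1-based index into the intermediate layout is an assumption that happens to agree with the example.

```python
def bits_to_values(bits, width=4):
    """Split a bit string into fixed-width fields and read each as an integer."""
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]


def decode_sorted_order(values, start):
    """Return (intermediate_layout, final_layout) for a sorted order individual.

    values: one number per item; items are laid out in increasing order of value.
    start:  1-based position in the intermediate layout where the final layout begins.
    """
    # Item numbers (1-based) ordered by their value.
    intermediate = sorted(range(1, len(values) + 1), key=lambda i: values[i - 1])
    shift = start - 1
    final = intermediate[shift:] + intermediate[:shift]   # rotate to the start
    return intermediate, final


if __name__ == "__main__":
    fields = bits_to_values("111000101001011010110011")
    values, start = fields[:-1], fields[-1]
    print(values, start)
    print(decode_sorted_order(values, start))
```

Because any assignment of numbers decodes to a valid permutation, ordinary bit-level mutation never produces an illegal layout, which is the appeal of this representation.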


Structure and Function Prediction

• Hidden Markov Models for Protein Modeling
• Multilayer Perceptrons for Internal Exon Prediction: GRAIL
• Decision Trees for Gene Finding

Structure and Function Prediction

• Protein structure prediction
  - Protein modeling
  - Gene finding and gene prediction

Hidden Markov Models for Protein Modeling

• Alphabet of 20 letters (the 20 amino acids)
• m0: start state, m5: end state, mk: match states
• ik: insertion states, dk: deletion states
• T(s2|s1): transition probabilities
• P(x|mk): letter-generating probabilities (x: a letter, i.e. an amino acid)

Multilayer Perceptrons for Internal Exon Prediction: GRAIL

[Figure: GRAIL neural network. Features computed from the sequence (coding potential value, GC composition, length, donor, acceptor, vocabulary) feed a multilayer perceptron that outputs a discrete exon score between 0 and 1 along the bases of the sequence.]

Coding and Non-coding Regions

DNA → RNA → Protein

[Figure: a gene on DNA. A regulatory region precedes the protein coding region, which runs from the start codon (AUG) to the stop codon (TAA) and is flanked by non-coding regions.]

Decision Trees for Gene Finding

• MORGAN: a decision tree system for gene finding (finding coding and non-coding regions, exon finding)

[Figure: decision tree. The root tests d+a<3.4?; its yes branch tests d+a<1.3? and its no branch tests d+a<5.3?; lower nodes test hex<16.3?, hex<0.1?, hex<-5.6?, asym<4.6? and donor<0.0?. Leaf counts: (6,560), (9,49), (18,160), (737,50), (142,73), (24,13), (5,21), (23,16), (1,5).]

  - donor: donor site score (by Markov chains)
  - d+a: donor and acceptor site score
  - hex: in-frame hexamer frequency
  - asym: Fickett's position asymmetry statistic

Molecular Clustering and Classification

Molecular Clustering and Classification

• Clustering (unsupervised learning)
  - Hierarchical cluster analysis
  - Kohonen neural networks
• Classification (supervised learning)
  - Neural networks
  - Bayesian networks
  - Support vector machines
  - Nearest neighbor algorithm
  - Decision trees

Support Vector Machines for Functional Classification of Genes (1)

• Classification of microarray gene expression data [M. Brown, et al., PNAS, 97(1):262-267]
  - Classifying gene functional class using gene expression data from DNA microarray hybridization experiments
  - Dataset: 2467 genes, 79 experiments (2467 × 79 matrix)

Functional classes defined from MYGD:
1. Tricarboxylic-acid pathway
2. Respiration chain complexes
3. Cytoplasmic ribosomal proteins
4. Proteasome
5. Histones
6. Helix-turn-helix

[Figure: 121 expression profiles of the cytoplasmic ribosomal proteins. (Similarity can be found!)]

Support Vector Machines for Functional Classification of Genes (2)

Cost = FP + 2FN

• FLD: Fisher's linear discriminant
• C4.5 and MOC1: decision trees
• Parzen: Parzen windows (a similar nonparametric density estimation technique)

[Table: comparison of error rates for various classification methods on 4 classes]

Nearest Neighbor Algorithms for 3D Protein Classification

• 3D shape similarity model by shape histograms [Ankerst, 1999]

d(i,j): distance of the cells that correspond to the bins i, j. The cell distance is calculated from the difference of the shell radii and the angles between the sectors.

DNA Microarray Data Mining

Gene Expression Analysis

• DNA Microarray
  - Thousands of DNA samples, one per gene, are placed on a glass slide and hybridized with special cDNA samples made under two different conditions (background condition, experimental condition).
  - Ratio of a gene: the ratio of the two expression levels of the gene
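The expression ratio is usually handled on a log scale, so that a two-fold increase and a two-fold decrease are symmetric around zero. The sketch below is an illustration under that assumption; the function name and the optional pseudocount (to guard against zero intensities) are not from the slides.

```python
import math


def log_ratio(experimental, background, pseudocount=0.0):
    """log2 of the experimental over background expression level of one gene."""
    return math.log2((experimental + pseudocount) / (background + pseudocount))


if __name__ == "__main__":
    # A gene measured at 400 under the experimental condition and 100 in the
    # background condition is four-fold up-regulated: log2 ratio of 2.
    print(log_ratio(400.0, 100.0))
    # Equal levels give 0; a halved level gives -1.
    print(log_ratio(100.0, 100.0), log_ratio(50.0, 100.0))
```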

Spotted Microarray Chip

Nature Genetics 21, 15 (1999)

DNA Chip Technology

• Pin microarray methods
• Inkjet methods
• Photolithography methods
• Electronic array methods

Application of DNA Microarrays

• Applications
  - Gene discovery: gene/mutated gene
    • Growth, behavior, homeostasis, ...
  - Disease diagnosis
  - Drug discovery
  - Toxicological research

Computational Tools for DNA Microarrays

• Major components
  - LIMS (laboratory information management system)
  - Image processing
  - Data mining
  - Experiment design
• Major trends for the data analysis
  - Statistical methods
  - Machine learning
  - Reverse engineering

Diversity of Gene Expression

• Tissues
  - muscle, skin, liver, brain, ...
• Developmental stages
  - embryonic, stem, adult cells, ...
• Clinical symptoms
  - liver cell, hepatoma, hepatitis, regeneration, ...
• Environmental factors
  - synthetic/natural chemicals, virus, ...

Analysis of DNA Microarray Data: Previous Work

• Characteristics of data
  - Analysis of expression ratio based on each sample
  - Analysis of time-variant data
• Clustering
  - Self-organizing maps [Golub et al., 1999]
  - Singular value decomposition [Orly Alter et al., 2000]
• Classification
  - Support vector machines [Brown et al., 2000]
• Gene identification
  - Information theory [Stefanie et al., 2000]
• Gene modeling
  - Bayesian networks [Friedman et al., 2000]

CAMDA-2000 Data Sets

• CAMDA
  - Critical Assessment of Techniques for Microarray Data Mining
  - Purpose: evaluate the data-mining techniques available to the microarray community.
• Data Set 1
  - Identification of cell cycle-regulated genes
  - Yeast Saccharomyces cerevisiae by microarray hybridization
  - Gene expression data with 6,278 genes
• Data Set 2
  - Cancer class discovery and prediction by gene expression monitoring
  - Two types of cancers: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
  - Gene expression data with 7,129 genes

CAMDA-2000 Data Set 1: Identification of Cell Cycle-regulated Genes of the Yeast by Microarray Hybridization

• Data given: gene expression levels of 6,278 genes over time (number of time points in parentheses)
  - α Factor-based synchronization: every 7 minutes from 0 to 119 (18)
  - Cdc15-based synchronization: every 10 minutes from 10 to 290 (24)
  - Cdc28-based synchronization: every 10 minutes from 0 to 160 (17)
  - Elutriation (size-based synchronization): every 30 minutes from 0 to 390 (14)
• Among the 6,278 genes
  - 104 genes are known to be cell-cycle regulated
    • classified into: M/G1 boundary (19), late G1 SCB regulated (14), late G1 MCB regulated (39), S phase (8), S/G2 phase (9), G2/M phase (15)
  - 250 cell cycle-regulated genes might exist

CAMDA-2000 Data Set 1: Characteristics of Data (α Factor-based Synchronization)

[Figure: six panels of expression profiles, one per class: M/G1 boundary, late G1 SCB regulated, late G1 MCB regulated, S phase, S/G2 phase, G2/M phase]

CAMDA-2000 Data Set 2: Cancer Class Discovery and Prediction by Gene Expression Monitoring

• Gene expression data for cancer prediction
  - Training data: 38 leukemia samples (27 ALL, 11 AML)
  - Test data: 34 leukemia samples (20 ALL, 14 AML)
  - Datasets contain measurements corresponding to ALL and AML samples from bone marrow and peripheral blood.
• Graphical models used:
  - Bayesian networks
  - Non-negative matrix factorization
  - Generative topographic mapping

Graphical Models for DNA Chip Data Mining

Classes of Graphical Models

Graphical Models

• Undirected
  - Boltzmann Machines
  - Markov Random Fields
• Directed
  - Bayesian Networks
  - Latent Variable Models
  - Hidden Markov Models
  - Generative Topographic Mapping
  - Non-negative Matrix Factorization

Gene Expression Analysis

• Latent Variable Models
• Bayesian Networks
• Non-negative Matrix Factorization
• Generative Topographic Mapping

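Latent Variable Models Probabilistic Clustering - Model

g_i: ith gene
z_k: kth cluster
t_j: jth time point
p(g_i|z_k): generating probability of the ith gene given the kth cluster
v_k = p(t|z_k): prototype of the kth cluster

Normalization of the expression values:
$$x_{ij} \leftarrow \frac{x_{ij}}{\sum_{j'} x_{ij'}}$$

Similarity between a gene and a prototype:
$$\text{similarity}(x_i, v_k) = \sum_j x_{ij} v_{kj}$$

Cluster membership:
$$p(g_i \in z_k) = p(z_k \mid g_i) = \frac{p(g_i \mid z_k)\, p(z_k)}{p(g_i)}$$

Objective function (maximized by EM):
$$f(g,t,z) = \sum_i \sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}} \log \sum_k p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)$$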

Latent Variable Models Probabilistic Clustering – Learning

Initialize p(z_k), p(g_i|z_k), p(t_j|z_k) for i = 1..N, j = 1..M, k = 1..K such that
$$\sum_{i=1}^{N} p(g_i \mid z_k) = 1, \qquad \sum_{j=1}^{M} p(t_j \mid z_k) = 1, \qquad \sum_{k=1}^{K} p(z_k) = 1$$

while (max_iteration not reached) do EM adaptation

//E-step
$$p(z_k \mid g_i, t_j) = \frac{p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)}{\sum_{k'} p(z_{k'})\, p(g_i \mid z_{k'})\, p(t_j \mid z_{k'})}$$

//M-step, with the normalized expression values $\tilde g_{ij} = g_{ij} / \sum_{j'} g_{ij'}$
$$p(g_i \mid z_k) = \frac{\sum_j \tilde g_{ij}\, p(z_k \mid g_i, t_j)}{\sum_{i'} \sum_{j} \tilde g_{i'j}\, p(z_k \mid g_{i'}, t_{j})}$$
$$p(t_j \mid z_k) = \frac{\sum_i \tilde g_{ij}\, p(z_k \mid g_i, t_j)}{\sum_{i} \sum_{j'} \tilde g_{ij'}\, p(z_k \mid g_{i}, t_{j'})}$$
$$p(z_k) = \frac{1}{R} \sum_i \sum_j \tilde g_{ij}\, p(z_k \mid g_i, t_j), \qquad R = \sum_i \sum_j \tilde g_{ij}$$

end while

//prototypes
p(t|z_k), k = 1..K, are the prototypes of the clusters.

//clustering
Given a gene g_i, the cluster of g_i is the k for which p(g_i ∈ z_k) is largest.
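The EM loop above can be sketched with NumPy as follows. This is an illustrative implementation written from the update equations on the slide, not the original program; the function name, the random initialization and the small constant `eps` that guards against division by zero are assumptions.

```python
import numpy as np


def em_cluster(G, K, n_iter=100, seed=0, eps=1e-12):
    """Probabilistic clustering of genes (rows of G) over time points (columns).

    G must be non-negative with positive row sums.
    Returns p(z), p(g|z) as an N x K array, p(t|z) as an M x K array,
    and the cluster label of every gene.
    """
    rng = np.random.default_rng(seed)
    N, M = G.shape
    X = G / G.sum(axis=1, keepdims=True)           # row-normalized expression
    pz = np.full(K, 1.0 / K)
    pg = rng.random((N, K)); pg /= pg.sum(axis=0)  # columns sum to 1 over genes
    pt = rng.random((M, K)); pt /= pt.sum(axis=0)  # columns sum to 1 over times
    for _ in range(n_iter):
        # E-step: p(z_k | g_i, t_j), an N x M x K array.
        joint = pz[None, None, :] * pg[:, None, :] * pt[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: posteriors weighted by the normalized expression values.
        w = X[:, :, None] * post
        pg = w.sum(axis=1) + eps; pg /= pg.sum(axis=0)
        pt = w.sum(axis=0) + eps; pt /= pt.sum(axis=0)
        pz = w.sum(axis=(0, 1)) + eps; pz /= pz.sum()
    # p(z_k | g_i) is proportional to p(g_i | z_k) p(z_k); take the largest.
    labels = (pg * pz).argmax(axis=1)
    return pz, pg, pt, labels


if __name__ == "__main__":
    toy = np.random.default_rng(1).random((20, 6)) + 0.1   # made-up data
    pz, pg, pt, labels = em_cluster(toy, K=3)
    print(pz, labels)
```

The columns of the returned `pt` are the cluster prototypes p(t|z_k).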

Latent Variable Models Probabilistic Clustering – Learning Curve

[Figure: learning curve. Objective function value (axis from -784.5 to -781.5) versus number of iterations (1 to about 1,700).]

Latent Variable Models Probabilistic Clustering – Result

• Prototypes

[Figure: six panels, each plotting one cluster prototype (values between 0 and 0.35) over the 18 time points]

• Clustering: given a gene g_i, the cluster of g_i is k, where k = argmax_m p(g_i ∈ z_m)

Bayesian Networks

Bayesian Networks for Gene Expression Analysis (1) Feature Selection

• Part of the data
  - There are 7,129 integer-valued attributes in all.
• Attribute selection
$$P = \frac{|\mu_1 - \mu_2|}{\sigma_1 + \sigma_2}$$
  - The 10 attributes with the highest P values are selected.
• Attribute value categorization
$$\text{categorized\_value} = \left\lfloor \frac{\text{value}}{\max(\text{attribute}) - \min(\text{attribute})} \times 10 \right\rfloor$$
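The two preprocessing steps can be sketched as below. This is an illustration of the formulas on the slide with made-up numbers; the function names are hypothetical, and using the sample standard deviation for σ is an assumption (the slide does not say which estimator was used).

```python
import math
import statistics


def p_metric(class1_values, class2_values):
    """|mu1 - mu2| / (sigma1 + sigma2) for one attribute (gene)."""
    mu1, mu2 = statistics.mean(class1_values), statistics.mean(class2_values)
    s1, s2 = statistics.stdev(class1_values), statistics.stdev(class2_values)
    return abs(mu1 - mu2) / (s1 + s2)


def select_attributes(class1_rows, class2_rows, top=10):
    """Indices of the `top` attributes with the highest P values.

    Each row is one sample; column a holds the values of attribute a.
    """
    n_attributes = len(class1_rows[0])
    scores = [
        p_metric([row[a] for row in class1_rows], [row[a] for row in class2_rows])
        for a in range(n_attributes)
    ]
    return sorted(range(n_attributes), key=lambda a: scores[a], reverse=True)[:top]


def categorize(value, attribute_min, attribute_max):
    """floor(value / (max - min) * 10), as in the categorization formula."""
    return math.floor(value / (attribute_max - attribute_min) * 10)


if __name__ == "__main__":
    print(p_metric([1, 2, 3], [7, 8, 9]))
    print(categorize(50, 0, 100))
```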

Bayesian Networks for Gene Expression Analysis (2)

• Learning
  [Figure: data → preprocessing → processed data → learning algorithm → Bayesian network over Gene A, Gene B, Gene C, Gene D and the Target]
• Inference
  [Figure: three copies of the network. The values of Gene C and Gene B are given; belief propagation is run; the probability for the target is computed.]

Bayesian Networks for Gene Expression Analysis (3) Structure Learning

[Figure: learned network structure with nodes Leukotriene, FAH, Zyxin, LYN, C-myb, CD33, LEPR, DF, GB DEF, Liver and Target]

Bayesian Networks for Gene Expression Analysis (4) Learning Procedure

Input: gene expression data, D
Output: network structure G, local probability tables Θ
Objective function: BDe score(G, Θ)

$$\text{BDe score}(G,\Theta) = p(G) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}, \qquad \alpha_{ij} = \sum_{k} \alpha_{ijk}, \qquad N_{ij} = \sum_{k} N_{ijk}$$

p(G): prior probability for the structure G
n: the number of nodes (attributes)
q_i: the number of states of the parents of the ith node
r_i: the number of states of node i
α_ijk: Dirichlet prior for node i at the jth parent state and kth state
N_ijk: the frequency of node i at the jth parent state and kth state in data D

Procedure
1. The network structure without edge directions is learned by Chow and Liu's algorithm (using mutual information).
2. Greedy hill-climbing search for the maximum BDe score within the structure learned in step 1.
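The products of Gamma functions are best evaluated in log space. The sketch below computes the logarithm of the score for given counts and Dirichlet priors; it is an illustration of the formula only (no structure search), and the function names and the nested-list data layout are assumptions.

```python
from math import lgamma


def log_bde_family(counts, alphas):
    """Log of the score contribution of one node.

    counts[j][k]: N_ijk, frequency of state k of the node under parent state j.
    alphas[j][k]: alpha_ijk, the matching Dirichlet prior.
    """
    total = 0.0
    for n_j, a_j in zip(counts, alphas):
        alpha_ij, n_ij = sum(a_j), sum(n_j)
        total += lgamma(alpha_ij) - lgamma(alpha_ij + n_ij)
        for n_ijk, alpha_ijk in zip(n_j, a_j):
            total += lgamma(alpha_ijk + n_ijk) - lgamma(alpha_ijk)
    return total


def log_bde_score(families, log_prior=0.0):
    """log p(G) plus the sum of the per-node contributions.

    families: list of (counts, alphas) pairs, one pair per node.
    """
    return log_prior + sum(log_bde_family(c, a) for c, a in families)


if __name__ == "__main__":
    # One binary node without parents (a single parent state), observed once
    # in each state, with a uniform prior alpha = 1: the score is 1/6.
    print(log_bde_family([[1, 1]], [[1.0, 1.0]]))
```

A hill-climbing search would call `log_bde_family` only for the nodes whose parent set changes, since the score decomposes over nodes.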

Bayesian Networks for Gene Expression Analysis (5) Inference

• The Bayesian network constructed (partial)

[Figure: part of the network, with nodes C-myb, Leukotriene, FAH and Target]

• Given the values of C-myb (C) and Leukotriene (L), the value of the Target (T) can be inferred by summing over FAH (F):
$$P(T \mid C, L) = \sum_F P(T, F \mid C, L) = \sum_F P(T \mid F, C, L)\, P(F \mid C, L) = \sum_F P(T \mid F)\, P(F \mid C, L)$$
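The marginalization over FAH can be carried out directly on the conditional probability tables. The sketch below uses made-up binary tables purely for illustration; the numbers, state labels and function name are assumptions and are not taken from the learned network.

```python
def infer_target(c, l, p_f_given_cl, p_t_given_f):
    """P(T | C=c, L=l) = sum over F of P(T | F) * P(F | C=c, L=l).

    p_f_given_cl[(c, l)][f]: probability of FAH state f given C-myb and Leukotriene.
    p_t_given_f[f][t]:       probability of target state t given FAH state f.
    Returns a dict mapping each target state to its probability.
    """
    f_dist = p_f_given_cl[(c, l)]
    target_states = next(iter(p_t_given_f.values())).keys()
    return {
        t: sum(p_t_given_f[f][t] * f_dist[f] for f in f_dist)
        for t in target_states
    }


if __name__ == "__main__":
    # Hypothetical tables: FAH is "low"/"high", the target is "ALL"/"AML".
    p_f_given_cl = {("high", "low"): {"low": 0.7, "high": 0.3}}
    p_t_given_f = {
        "low": {"ALL": 0.8, "AML": 0.2},
        "high": {"ALL": 0.1, "AML": 0.9},
    }
    print(infer_target("high", "low", p_f_given_cl, p_t_given_f))
```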


Bayesian Networks for Gene Expression Analysis (6) Classification Results

• Prediction error of this Bayesian network (given all attribute values)

                  Bayesian network    Weighted voting
  Training data   1/38                1/38
  Test data       8/34                4/34

• The result can be improved by more appropriate data preprocessing.

Non-negative Matrix Factorization

Non-negative Matrix Factorization (1)

• Method
  - Using NMF for class clustering and prediction of gene expression data from acute leukemia patients
• NMF (non-negative matrix factorization)

$$G \approx WH, \qquad (G)_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}, \qquad G_{i\mu},\, W_{ia},\, H_{a\mu} \ge 0$$

  - G: gene expression data matrix
  - W: basis matrix (prototypes)
  - H: encoding matrix (in low dimension)
• NMF as a latent variable model

[Figure: latent variables h_1, h_2, ..., h_r connected through W to the observed genes g_1, g_2, ..., g_n, so that g ≈ Wh]

NMF (2) Clustering Gene Expression Data

[Figure: G (7,129 genes × 38 samples) ≈ W (7,129 genes × 2 factors) × H (2 factors × 38 samples); each sample's encoding is a pair of values (H1, H2) over the genes g_1, ..., g_7,129]

• Factors can capture the correlations between the genes using the values of the expression levels.
• Cluster the training samples into 2 groups by NMF
  - Assign each sample to the factor (class) which has the higher encoding value.
• Accuracy: 0 to 1 errors for the training data set

NMF (3) Learning Procedure

Input: gene expression data matrix G (n × m)
Output: basis matrix W (n × k), encoding matrix H (k × m)
n: data size, m: number of genes, k: number of latent variables

Objective function:
$$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ G_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]$$

Procedure
1. Initialize W, H with random numbers such that
$$W_{ia} \ge 0, \qquad \sum_j W_{ja} = 1, \qquad H_{a\mu} \ge 0$$
2. Update W, H iteratively until max_iteration or some criterion is met:
$$W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{G_{i\mu}}{(WH)_{i\mu}} H_{a\mu}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_j W_{ja}}, \qquad H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{G_{i\mu}}{(WH)_{i\mu}}$$
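The multiplicative updates translate almost line by line into NumPy. The following is a minimal sketch written from the update rules above, not the original program; the function names, the random initialization and the small constant `eps` (which avoids division by zero) are assumptions.

```python
import numpy as np


def nmf(G, k, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative matrix G (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    W = rng.random((n, k)) + eps
    W /= W.sum(axis=0, keepdims=True)          # each column of W sums to 1
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        ratio = G / (W @ H + eps)              # G_iu / (WH)_iu
        W *= ratio @ H.T                       # W_ia <- W_ia sum_u ratio_iu H_au
        W /= W.sum(axis=0, keepdims=True) + eps
        ratio = G / (W @ H + eps)
        H *= W.T @ ratio                       # H_au <- H_au sum_i W_ia ratio_iu
    return W, H


def objective(G, W, H, eps=1e-12):
    """F = sum over i, u of G_iu log(WH)_iu - (WH)_iu."""
    WH = W @ H + eps
    return float(np.sum(G * np.log(WH) - WH))


if __name__ == "__main__":
    toy = np.random.default_rng(1).random((30, 8)) + 0.1   # made-up data
    W, H = nmf(toy, k=2)
    print(objective(toy, W, H))
    print(H.argmax(axis=0))    # factor with the highest encoding per column
```

The last line mirrors the clustering rule used on the slides: each column is assigned to the factor with the highest encoding value.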

NMF (4) Learning Curve

[Figure: learning curve. Log likelihood versus number of iterations (1 to 101).]

NMF (5) Clustering Result

[Figure: scatter plot of the ALL and AML samples by their two encoding values (horizontal axis 0 to 6, vertical axis 0 to 7)]

NMF (6) Diagnosis

[Figure: a test sample g (7,129 genes) is approximated as W (7,129 genes × 2 factors) times its encoding h = (h_1, h_2)]

• For each test sample g, estimate the encoding vector h that best approximates the sample.
  - W is the basis matrix computed during training (fixed).
• As in training, assign each sample to the factor (class) which has the highest encoding value.
• Accuracy: 1 to 2 errors for the test data set
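Estimating the encoding of a new sample with W held fixed can reuse the multiplicative update for H. The sketch below is one plausible way to do this, stated as an assumption since the slides do not spell out the estimation procedure; the function names and the tiny basis matrix in the usage example are illustrative.

```python
import numpy as np


def encode(g, W, n_iter=200, eps=1e-12):
    """Estimate a non-negative h with g ~ W h, keeping W fixed.

    g: sample vector of length n; W: basis matrix (n x k) whose columns sum to 1.
    """
    h = np.ones(W.shape[1])
    for _ in range(n_iter):
        ratio = g / (W @ h + eps)      # g_i / (Wh)_i
        h *= W.T @ ratio               # h_a <- h_a sum_i W_ia ratio_i
    return h


def diagnose(g, W):
    """Index of the factor (class) with the highest encoding value."""
    return int(np.argmax(encode(g, W)))


if __name__ == "__main__":
    # Toy basis with two factors over two "genes"; each column sums to 1.
    W = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
    sample = np.array([3.0, 1.0])
    print(encode(sample, W), diagnose(sample, W))
```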


Generative Topographic Mapping

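Generative Topographic Mapping (1)

• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.

[Figure: a regular grid of points in the latent space (x1, x2) is mapped by y(x;W) onto a curved surface in the data space (t1, t2, t3)]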

GTM (2) Learning Algorithm (EM)

• Generate the grid of latent points.
• Generate the grid of latent function centers.
• Compute the matrix of basis function activations Φ.
• Initialize the weights W in Y = ΦW and the noise variance β.
• Compute Δ_{n,k} = ||t_n − Φ_k W||² = ||t_n − y_k(x,W)||² for each n, k.
• Repeat
  - Compute the responsibility matrix R using Δ and β. [E-step]
  - Compute G from R.
  - Update W by solving Φ^T G Φ W = Φ^T R T. [M-step]
  - Compute Δ = ||t − ΦW||².
  - Update β.
• Until convergence

GTM (3) Visualization

• Posterior distribution in latent space given a data point t: X(t) ~
• For a whole set of data: for each t, plot in the latent space
  - Posterior mode:
  - Posterior mean:

GTM (4) Clustering Experiment

• Gene Selection
  - Select about 50 genes out of 7,129 based on three test scores for cancer diagnosis.
    • Correlation metric (similar to the t-test)
    • Wilcoxon test scores (a nonparametric t-test)
    • Median test scores (a nonparametric t-test)
• Clustering & Visualization
  - After learning a model, genes are plotted in the latent space.
  - With the mapping in the latent space, clusters can be identified.

• List of Genes Selected

GTM (5) Learning Curve

GTM (6) Clustering Result

[Figure: genes plotted in the latent space, with two groups marked]
• Genes with high expression levels in the case of ALL (large P-metric value)
• Genes with high expression levels in the case of AML (large negative P-metric value)

Summary

• Challenges of Machine Learning Applied to Biosciences
  - Huge data size
  - Noise and data sparseness
  - Unlabeled and imbalanced data
• Biosciences for Machine Learning
  - New application
  - Biosystems are existence proofs for ideal AI systems
  - Provides interesting metaphors and algorithms!
• Synergy effects between biosciences and artificial intelligence