Machine Learning for Biological Data Mining
Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University
E-mail: [email protected]
This material is available at http://scai.snu.ac.kr/~btzhang/
Outline
- Basics in Molecular Biology
- Current Issues and Applications
- Machine Learning for Bioinformatics
- DNA Chip Data Mining
- Graphical Models for Gene Expression Analysis
- Summary
What is Bioinformatics?
- Bio: molecular biology
- Informatics: computer science
- Bioinformatics: solving problems arising from biology using methodology from computer science
- Bioinformatics vs. Computational Biology
Basics in Molecular Biology
DNA Double Helix
DNA Base-pairs
DNA
AACCTGCGGAAGGATCATTACCGAGTGCGGG TCCTTTGGGCCCAACCTCCCATCCGTGTCTAT TGTACCCGTTGCTTCGGCGGGCCCGCCGCTT GTCGGCCGCCGGGGGGGCGCCTCTGCCCC CCGGGCCCGTGCCCGCCGGAGACCCCAACA CGAACACTGTCTGAAAGCGTGCAGTCTGAGTT GATTGAATGCAATCAGTTAAAACTTTCAACAAT GGATCTCTTGGTTCCGGCATGCAATCAGTCC CGTTGCTTCGGCACTGTCTGAAAGCGCCTTT GGGCCCAACCTCCCATCCGTGTCTATTGTAC CCGTTGCTTCGGCGGGCCCGCCGCTTGTCG GCCGCCGGGGGGGCGCCGTTGCTTCGGCG GGCCCGCCGCTTGTCGGCCGCCGGGGCTAT TGTACCCGTTGCTTCGGATCTCTTGGGGATCT CTTGGTTCCGGCATGCAATCAGTCCCGTTGCT TCGGCACTGTCTGAAAGCGCCTTTGGGCCCA ACCTCCCACCGTTGCTTCGGCGGGCCCGCC GCTTGTCGGCCGCCGGGGGGGCGGCCGCC GGGGGCACTGTCTGAAAGCTCGGCCGCC
Human Genome Sequenced!
- "The most wondrous map ever produced by humankind"
- Scientists jointly announced that they had obtained a near-complete set of the biochemical instructions for human life.
- "One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom"
Some Facts
- DNA differs between humans by 0.2% (1 in 500 bases).
- Human DNA is 98% identical to that of chimpanzees.
- 97% of DNA in the human genome has no known function.
- 3 x 10^9 letters in the DNA code in every cell in your body.
- 10^14 cells in the body.
- 12,000 letters of DNA decoded by the Human Genome Project every second.
Molecular Biology: Flow of Information
DNA -> RNA -> Protein -> Function
[Figure: a DNA sequence (ACTGG...) is turned into a chain of amino acids (Leu, Ala, Ser, Phe, Cys, Lys, Arg, Asp, ...), which forms the protein.]
Using the Genome
- Redundancy in genetic information
- Single genes have multiple functions
- Genes are 1-D, gene products are 3-D
[Figure: Genetic Information -> Molecular Structure -> Biochemical Function -> Biological Behavior, with Molecular Dynamics, Biophysics, and Biochemistry as the connecting disciplines.]
Gene Structure
[Figure: search the DNA for a "gene"; compute the RNA; compute the protein sequence; how the folded protein follows from the sequence is marked as the open question ("how?").]
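DNA (gene) -> RNA -> Protein
[Figure: a gene on DNA, flanked by control statements, with a TATA box, start and stop signals, and a termination site. Transcription (RNA polymerase) produces the mRNA, which carries a 5' UTR, a ribosome binding site, and a 3' UTR. Translation (ribosome) produces the protein.]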
Numbers of Genes
- Humans: 25,000 - 40,000
- C. elegans (worm): 19,000
- S. cerevisiae (yeast): 6,000
- Tuberculosis microbe: 4,000
Genetic Code: 3 bases = 1 amino acid

First position   Second position             Third position
(5' end)         T      C      A      G      (3' end)
T                Phe    Ser    Tyr    Cys    T
                 Phe    Ser    Tyr    Cys    C
                 Leu    Ser    STOP   STOP   A
                 Leu    Ser    STOP   Trp    G
C                Leu    Pro    His    Arg    T
                 Leu    Pro    His    Arg    C
                 Leu    Pro    Gln    Arg    A
                 Leu    Pro    Gln    Arg    G
A                Ile    Thr    Asn    Ser    T
                 Ile    Thr    Asn    Ser    C
                 Ile    Thr    Lys    Arg    A
                 Met    Thr    Lys    Arg    G
G                Val    Ala    Asp    Gly    T
                 Val    Ala    Asp    Gly    C
                 Val    Ala    Glu    Gly    A
                 Val    Ala    Glu    Gly    G
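The table lookup can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `translate`, the one-letter amino acid codes, and the example sequence are assumptions chosen for the sketch. The 64 codons are enumerated in T, C, A, G order with the third position varying fastest.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes ('*' marks STOP) for the 64 codons, enumerated
# with the first position varying slowest and the third position fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA coding sequence codon by codon, stopping at a STOP codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGGCCTGCTAA")` reads the codons ATG (Met), GCC (Ala), TGC (Cys) and stops at TAA.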
Nucleotide Sequence
SQ sequence 1344 BP; 291 A; C; 401 G; 278 T; 0 other aacctgcgga aggatcatta gcgggcccgc cgcttgtcgg cgcttgtcgg ccgccggggg ccgagtgcgg gtcctttggg ccgccggggg ggcgcctctg ccccccgggc ccgtgcccgc cccaacctcc catccgtgtc ccccccgggc ccgtgcccgc cggagacccc aacacgaaca tattgtaccc tgttgcttcg aacctgcgga aggatcatta ctgtctgaaa gcgtgcagtc gcgggcccgc cgcttgtcgg ccgagtgcgg gtcctttggg tgagttgatt gaatgcaatc ccgccggggg ggcgcctctg cccaacctcc catccgtgtc agttaaaact ttcaacaatg ccccccgggc ccgtgcccgc tattgtaccc tgttgcttcg gatctcttgg aacctgcgga cggagacccc aacacgaaca gcgggcccgc cgcttgtcgg ccgagtgcgg gtcctttggg ctgtctgaaa gcgtgcagtc agttaaaact ttcaacaatg cccaacctcc catccgtgtc tgagttgatt gaatgcaatc gatctcttgg ttccggctgc tattgtaccc tgttgcttcg agttaaaact ttcaacaatg tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg gatctcttgg ttccggctgc gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg tattgtaccc tgttgcttcg ccgccggggg ggcgcctctg agttaaaact ttcaacaatg gcgggcccgc cgcttgtcgg ccccccgggc ccgtgcccgc gatctcttgg ttccggctgc ccgccggggg ggcgcctctg cggagacccc tgttgcttcg tattgtaccc tgttgcttcg ccccccgggc ccgtgcccgc gcgggcccgc cgcttgtcgg gcgggcccgc cgcttgtcgg cggagacccc tgttgcttcg ccgccggggg cggagacccc ccgccggggg ggcgcctctg gcgggcccgc cgcttgtcgg gcgggcccgc cgcttgtcgg ccccccgggc ccgtgcccgc ccgccggggg cggagacccc ccgccggggg ggcgcctctg cggagacccc tgttgcttcg 16
Protein (Amino Acid) Sequence
CG2B_MARGL Length: 388 April 2, 1997 14:55 Type: P Check: 9613 .. 1 MLNGENVDSR IMGKVATRAS SKGVKSTLGT RGALENISNV ARNNLQAGAK KELVKAKRGM TKSKATSSLQ SVMGLNVEPM EKAKPQSPEP MDMSEINSAL EAFSQNLLEG VEDIDKNDFD NPQLCSEFVN DIYQYMRKLE REFKVRTDYM TIQEITERMR SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI RSMECNILRR LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM AKYLMELTLP EYAFVPYDPS EIAAAALCLS SKILEPDMEW GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP TAKFQAVRKK YSSAKFMNVS TISALTSSTV MDLADQMC
Protein Structure
Human Genetic Variations (Single Nucleotide Polymorphisms)
- SNPs: "genetic individuality"
- About 1 in 1,000 bases varies between two humans
- Make us more or less susceptible to diseases
- May influence the effect of drug treatments
TTTGCTCCGTTTTCA TTTGCTCYGTTTTCA TTTGCTCTGTTTTCA
SNP (Single Nucleotide Polymorphism)
- Finding single nucleotide changes at specific regions of genes
  - Diagnosis of hereditary diseases
  - Personalized drugs
  - Finding more effective drugs and treatments
Human Individuality
Flood of Data! (SWISS-PROT)
[Figure: number of sequences in SWISS-PROT (x 1000, axis from 0 to 80) by year of release, 1988 to 1996.]
How Can We Analyze the Flood of Data?
- Data: don't just store it, analyze it! By comparing sequences, one can find out about things like
  - ancestors of organisms
  - phylogenetic trees
  - protein structures
  - protein function
Bioinformatics Is About:
- Elicitation of DNA sequences from genetic material
- Sequence annotation (e.g. with information from experiments)
- Understanding the control of gene expression (i.e. under what circumstances proteins are produced from DNA)
- The relationship between the amino acid sequence of proteins and their structure
Aim of Research in Bioinformatics
- Understand the functioning of living things, to "improve the quality of life".
  - Drug design
  - Identification of genetic risk factors
  - Gene therapy
  - Genetic modification of food crops and animals, etc.
Current Issues and Applications
The Central Dogma of Information Flow in Biology
The sequence of amino acids making up a protein, and hence its structure (folded state) and thus its function, is determined by transcription from DNA via RNA.
DNA -> RNA -> Protein -> Function
3 Main Classes of Problem Areas
- Central dogma related: sequence, structure or function
- Data related: storage, retrieval & analysis (exponential growth of knowledge in molecular biology)
- Simulation of biological processes: protein folding (molecular dynamics) or metabolic pathways
Topics in Bioinformatics
- Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
- Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
- Expression analysis
  - Gene expression analysis
  - Gene clustering
- Pathway analysis
  - Metabolic pathway
  - Regulatory networks
Sequence Analysis
Finding information and patterns in DNA and protein data
- Finding evolutionary relationships
- Finding coding regions of genomic sequences
- Translating DNA to protein
- Finding regulatory regions
- Assembling genome sequences
Structure Analysis
- The amino acid sequence of a protein determines its 3D conformation
MNIHRSTPITIARYGRSRNKT QDFEELSSIRSAEPSQSFSPNL GSPSPPETPNLSHCVSCIGKY LLLEPLEGDHVFRAVHLHSG EELVCKVFDISCYQESLAPCF
Sequence -> Structure -> Function
Gene Expression Analysis
Nature Genetics 21, 10 (1999)
Pathway Analysis
- One of the declarative ways of representing biological knowledge
[Figure: metabolic pathway]
Applications of Bioinformatics
- Drug design
- Identification of genetic risk factors
- Gene therapy
- Genetic modification of food crops and animals
- Biological warfare, crime etc.
- Personal Medicine?
- E-Doctor?
Bioinformatics as Information Technology
[Figure: bioinformatics at the center of related information technologies]
- Database: GenBank, SWISS-PROT
- Hardware: supercomputing
- Information retrieval: biomedical text analysis
- Algorithm: sequence alignment
- Agent: information filtering, monitoring agent
- Machine learning: pattern recognition, clustering, rule discovery
Bioinformatics on the Web
[Figure: microarray bioinformatics on the web. The experimental process (sample, hybridization, array, scanner) feeds image analysis. Data management uses a relational database behind a web interface that offers results and summaries, links to other information resources, and download of data to other applications. The last stage is data analysis and interpretation.]
Bioinformatics and Artificial Intelligence
- A new application domain of AI and machine learning
  - Data mining and knowledge discovery
  - Information filtering for scientists
  - Intelligent agents for customized data service
- A new basis for developing new AI technologies
  - "Biointelligence"
  - Biomolecular (DNA) computing
  - Molecular evolutionary algorithms
Machine Learning for Bioinformatics
Machine Learning Techniques for Bio Data Mining
- Sequence Alignment
  - Simulated Annealing
  - Genetic Algorithms
- Structure and Function Prediction
  - Hidden Markov Models
  - Multilayer Perceptrons
  - Decision Trees
- Molecular Clustering and Classification
  - Support Vector Machines
  - Nearest Neighbor Algorithms
- Expression (DNA Chip Data) Analysis
  - Self-Organizing Maps
  - Bayesian Networks
Problems in Biological Science and Machine Learning Methods

Sequence alignment (homology search)
  Problems: pairwise sequence alignment; database search for similar sequences; multiple sequence alignment; phylogenetic tree reconstruction; protein 3D structure alignment
  Methods: optimization algorithms (dynamic programming, simulated annealing, genetic algorithms, neural networks, hidden Markov models)

Structure/function prediction
  Problems: RNA secondary structure prediction; RNA 3D structure prediction; protein 3D structure prediction; motif extraction; functional site prediction; cellular localization prediction; coding region prediction; transmembrane segment prediction; protein secondary structure prediction
  Methods: pattern recognition and learning algorithms (discriminant analysis, hierarchical neural networks, hidden Markov models, formal grammar)

Molecular clustering/classification
  Problems: superfamily classification; ortholog/paralog grouping of genes; 3D fold classification
  Methods: clustering algorithms (hierarchical cluster analysis, Kohonen neural networks); classification algorithms (Bayesian networks, neural networks, support vector machines, decision trees)

Expression (DNA chip data) analysis
  Methods: support vector machines, Bayesian networks, latent variable models, generative topographic mapping
Sequence Alignment
Sequence Alignment (Similarity Search)
- Basic operation
  - Comparison against each of the known examples stored in a primary database to detect any similarity that can be used for further reasoning
- Example
ATTGGCCA ATTGGCCA ATTGGCCA ATTGGCCA | | | | | | | | A— GG— A AGG ——A AG ———A A———GA 4+2*10=24 6+1*10=16 6+1*10=16 6+1*10=16
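Comparing a query against a stored sequence is usually done by dynamic programming. The following is a minimal sketch of global alignment scoring (Needleman-Wunsch style). It assumes a simple linear gap penalty and illustrative scores (`match=1`, `mismatch=-1`, `gap=-2`); the slide's own example appears to charge a separate penalty per gap opening, which this sketch does not model.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

With these scores, aligning `"ATTGGCCA"` against `"AGGA"` can match all four letters of the shorter sequence and must pay for four gap positions.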
Simulated Annealing for Multiple Sequence Alignment
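- The Metropolis Monte Carlo procedure is repeated at gradually decreasing temperature for energy minimization.

$$\Delta E = E(x_n') - E(x_n)$$

$$p = \begin{cases} 1 & \text{when } \Delta E \le 0 \\ \exp(-\Delta E / T_n) & \text{otherwise} \end{cases}$$

[Figure: energy E over the state space x (e.g. all possible alignments).]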
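The acceptance rule can be sketched as follows. The geometric cooling schedule, the parameter defaults, and the function names are assumptions made for illustration; the slide only states that the temperature decreases gradually.

```python
import math
import random

def acceptance_probability(delta_e, temperature):
    """Metropolis rule: accept downhill moves always, uphill with exp(-dE/T)."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)

def simulated_annealing(energy, neighbor, x0, t0=10.0, cooling=0.95, steps=500, seed=0):
    """Minimize energy(x), starting from x0; returns the best state seen and its energy."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(x, rng)
        delta_e = energy(candidate) - e
        if rng.random() < acceptance_probability(delta_e, temperature):
            x, e = candidate, e + delta_e
            if e < best_e:
                best_x, best_e = x, e
        temperature *= cooling  # assumed geometric schedule for T_n
    return best_x, best_e
```

For alignment, `x` would be a candidate multiple alignment, `neighbor` a small random change such as shifting a gap, and `energy` the negated alignment score.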
Genetic Algorithms: Representation
- For sequence assembly
  - The sorted order representation

    Field                 1      2      3      4      5      starting position
    Individual            1110   0010   1001   0110   1011   0011
    Decimal number        14     2      9      6      11     3
    Sort order            5      1      3      2      4
    Intermediate layout   2      4      3      5      1
    Final layout          3      5      1      2      4

- Operators
  - A simple swap operation as the mutation operator
  - Permutation crossover
  - Transposition operator
  - Inversion operator
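The decoding of an individual into a fragment layout can be sketched as below. The function names are hypothetical, and the starting position is read here as a rotation of the intermediate layout, an interpretation that reproduces the numbers in the slide's example.

```python
def decode_layout(keys, start):
    """keys[i] is the sort key of fragment i+1; start is the 1-based starting position."""
    # Intermediate layout: fragment numbers ordered by increasing key.
    intermediate = sorted(range(1, len(keys) + 1), key=lambda f: keys[f - 1])
    # Final layout: rotate so the layout begins at the starting position.
    shift = (start - 1) % len(keys)
    return intermediate[shift:] + intermediate[:shift]

def decode_individual(bits, width=4):
    """Split a bit string into fixed-width fields; the last field is the starting position."""
    fields = [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]
    return decode_layout(fields[:-1], fields[-1])
```

Because any bit string decodes to a valid permutation of the fragments, ordinary bit-level operators cannot produce an illegal layout under this representation.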
Structure and Function Prediction
- Hidden Markov Models for Protein Modeling
- Multilayer Perceptrons for Internal Exon Prediction: GRAIL
- Decision Trees for Gene Finding
Structure and Function Prediction
- Protein structure prediction
  - Protein modeling
  - Gene finding and gene prediction
Hidden Markov Models for Protein Modeling
- Alphabet of 20 letters (the 20 amino acids)
- m_0: start state, m_5: end state, m_k: match states
- i_k: insertion states, d_k: deletion states
- T(s_2|s_1): transition probabilities
- P(x|m_k): letter-generating (emission) probabilities, where x is a letter (an amino acid)
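How transition and emission probabilities combine into the probability of a sequence can be sketched with the forward algorithm. This is a simplified illustration on a generic HMM, not the profile architecture of the slide: there are no silent deletion states or explicit end state, and the two-state model over a two-letter alphabet is a made-up toy.

```python
def forward_probability(observations, start, transition, emission):
    """P(observations) under an HMM, summed over all hidden state paths."""
    states = list(start)
    alpha = {s: start[s] * emission[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emission[s][symbol] * sum(alpha[r] * transition[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical toy model: one match-like and one insert-like state.
START = {"match": 0.6, "insert": 0.4}
TRANSITION = {
    "match": {"match": 0.7, "insert": 0.3},
    "insert": {"match": 0.5, "insert": 0.5},
}
EMISSION = {
    "match": {"A": 0.9, "G": 0.1},
    "insert": {"A": 0.5, "G": 0.5},
}
```

A protein model would use the 20-letter amino acid alphabet and one match, insertion, and deletion state per model position.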
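Multilayer Perceptrons for Internal Exon Prediction: GRAIL
[Figure: a multilayer perceptron whose inputs are coding potential value, GC composition, length, donor score, acceptor score, and intron vocabulary, and whose output is an exon score between 0 and 1 for discrete bases along the sequence.]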
Coding and Non-coding Regions
DNA -> RNA -> Protein
[Figure: a gene on DNA. A regulatory region precedes the protein coding region, which runs from the start signal (AUG) to a stop signal (TAA) and is flanked by non-coding regions.]
Decision Trees for Gene Finding
- MORGAN: a decision tree system for gene finding. It finds coding and non-coding regions (exon finding).
[Figure: decision tree with root test d+a<3.4? and further tests d+a<1.3?, d+a<5.3?, hex<16.3?, hex<0.1?, hex<-5.6?, asym<4.6?, and donor<0.0?; the leaves carry count pairs (6,560), (9,49), (18,160), (737,50), (142,73), (24,13), (5,21), (23,16), and (1,5).]
- donor: donor site score, by Markov chains
- d+a: donor and acceptor site score
- hex: in-frame hexamer frequency
- asym: Fickett's position asymmetry statistic
Molecular Clustering and Classification
Molecular Clustering and Classification
- Clustering (unsupervised learning)
  - Hierarchical cluster analysis
  - Kohonen neural networks
- Classification (supervised learning)
  - Hidden Markov Model
  - Neural networks
  - Bayesian networks
  - Support vector machines
  - Nearest Neighbor Algorithm
  - Decision trees
Support Vector Machines for Functional Classification of Genes (1)
- Classification of microarray gene expression data [M. Brown, et al., PNAS, 97(1):262-267]
  - Classifying gene functional class using gene expression data from DNA microarray hybridization experiments
  - Dataset: 2467 genes, 79 experiments (2467 x 79 matrix)
- Functional classes defined from MYGD
  1. Tricarboxylic-acid pathway
  2. Respiration chain complexes
  3. Cytoplasmic ribosomal proteins
  4. Proteasome
  5. Histones
  6. Helix-turn-helix
[Figure: 121 expression profiles of the cytoplasmic ribosomal proteins. (Similarity can be found!)]
Support Vector Machines for Functional Classification of Genes (2)
Cost = FP + 2FN
[Table: comparison of error rates for various classification methods on 4 classes]
- FLD: Fisher's linear discriminant
- C4.5 and MOC1: decision trees
- Parzen: Parzen windows (a similar nonparametric density estimation technique)
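The cost counts false positives (FP) once and false negatives (FN) twice. A minimal sketch of the computation, assuming binary labels where a truthy value means the gene belongs to the functional class; the function name and the `fn_weight` parameter are illustrative.

```python
def classification_cost(true_labels, predicted_labels, fn_weight=2):
    """Cost = FP + fn_weight * FN for binary class membership labels."""
    pairs = list(zip(true_labels, predicted_labels))
    false_positives = sum(1 for truth, pred in pairs if pred and not truth)
    false_negatives = sum(1 for truth, pred in pairs if truth and not pred)
    return false_positives + fn_weight * false_negatives
```

Weighting false negatives more heavily penalizes a classifier that misses members of a small class.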
Nearest Neighbor Algorithms for 3D Protein Classification
- 3D shape similarity model by shape histograms [Ankerst, 1999]
- d(i,j): distance between the cells that correspond to bins i and j. The cell distance is calculated from the difference of the shell radii and the angles between the sectors.
DNA Microarray Data Mining
Gene Expression Analysis
- DNA Microarray
  - Hybridize thousands of DNA samples, one for each gene, on a glass slide with special cDNA samples (made under two different conditions: a background condition and an experimental condition)
  - Ratio of a gene: the ratio of the gene's two expression levels
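The ratio of a gene can be sketched in one function. Taking the base-2 logarithm is an assumption of this sketch (a common convention that makes up- and down-regulation symmetric around zero), not something the slide states.

```python
import math

def log_ratio(experimental, background):
    """log2 of the expression ratio between the experimental and background condition."""
    return math.log2(experimental / background)
```

A gene expressed four times more strongly under the experimental condition gets +2, and one expressed four times less strongly gets -2.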
Spotted Microarray Chip
Nature Genetics 21, 15 (1999)
DNA Chip Technology
- Pin microarray methods
- Inkjet methods
- Photolithography methods
- Electronic array methods
Application of DNA Microarrays
- Applications
  - Gene discovery: gene/mutated gene
    • Growth, behavior, homeostasis ...
  - Disease diagnosis
  - Drug discovery: pharmacogenomics
  - Toxicological research: toxicogenomics
Computational Tools for DNA Microarrays
- Major components
  - LIMS (laboratory information management system)
  - Image processing
  - Data mining
  - Experiment design
- Major trends for the data analysis
  - Statistical methods
  - Machine learning
  - Reverse engineering
Diversity of Gene Expression
- Tissues
  - muscle, skin, liver, brain, ...
- Developmental stages
  - embryonic, stem, adult cells ...
- Clinical symptoms
  - liver cell, hepatoma, hepatitis, regeneration ...
- Environmental factors
  - synthetic/natural chemicals, virus ...
Analysis of DNA Microarray Data: Previous Work
- Characteristics of data
  - Analysis of expression ratio based on each sample
  - Analysis of time-variant data
- Clustering
  - Self-organizing maps [Golub et al., 1999]
  - Singular value decomposition [Orly Alter et al., 2000]
- Classification
  - Support vector machines [Brown et al., 2000]
- Gene identification
  - Information theory [Stefanie et al., 2000]
- Gene modeling
  - Bayesian networks [Friedman et al., 2000]
CAMDA-2000 Data Sets
- CAMDA
  - Critical Assessment of Techniques for Microarray Data Mining
  - Purpose: evaluate the data-mining techniques available to the microarray community.
- Data Set 1
  - Identification of cell cycle-regulated genes
  - Yeast Saccharomyces cerevisiae by microarray hybridization.
  - Gene expression data with 6,278 genes.
- Data Set 2
  - Cancer class discovery and prediction by gene expression monitoring.
  - Two types of cancers: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
  - Gene expression data with 7,129 genes.
CAMDA-2000 Data Set 1: Identification of Cell Cycle-regulated Genes of the Yeast by Microarray Hybridization
- Data given: expression levels of 6,278 genes over time
  - α factor-based synchronization: every 7 minutes from 0 to 119 (18)
  - Cdc15-based synchronization: every 10 minutes from 10 to 290 (24)
  - Cdc28-based synchronization: every 10 minutes from 0 to 160 (17)
  - Elutriation (size-based synchronization): every 30 minutes from 0 to 390 (14)
- Among the 6,278 genes
  - 104 genes are known to be cell-cycle regulated
    • classified into: M/G1 boundary (19), late G1 SCB regulated (14), late G1 MCB regulated (39), S phase (8), S/G2 phase (9), G2/M phase (15)
  - 250 cell cycle-regulated genes might exist
CAMDA-2000 Data Set 1: Characteristics of data (α factor-based synchronization)
[Figure: six panels of expression profiles, one per class: M/G1 boundary, late G1 SCB regulated, late G1 MCB regulated, S phase, S/G2 phase, and G2/M phase.]
CAMDA-2000 Data Set 2: Cancer Class Discovery and Prediction by Gene Expression Monitoring
- Gene expression data for cancer prediction
  - Training data: 38 leukemia samples (27 ALL, 11 AML)
  - Test data: 34 leukemia samples (20 ALL, 14 AML)
  - Datasets contain measurements corresponding to ALL and AML samples from bone marrow and peripheral blood.
- Graphical models used:
  - Bayesian networks
  - Non-negative matrix factorization
  - Generative topographic mapping
Graphical Models for DNA Chip Data Mining
Classes of Graphical Models
Graphical Models
- Undirected
  - Boltzmann Machines
  - Markov Random Fields
- Directed
  - Bayesian Networks
  - Latent Variable Models
  - Hidden Markov Models
  - Generative Topographic Mapping
  - Non-negative Matrix Factorization
Gene Expression Analysis
- Latent Variable Models
- Bayesian Networks
- Non-negative Matrix Factorization
- Generative Topographic Mapping
Latent Variable Models Probabilistic Clustering - Model
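- $g_i$: $i$th gene
- $z_k$: $k$th cluster
- $t_j$: $j$th time
- $p(g_i \mid z_k)$: generating probability of the $i$th gene given the $k$th cluster
- $v_k = p(t \mid z_k)$: prototype of the $k$th cluster

Normalization of the expression values:

$$x_{ij} \leftarrow \frac{x_{ij}}{\sum_{j'} x_{ij'}}$$

Similarity and cluster membership:

$$\text{similarity}(x_i, v_k) = \sum_j x_{ij} v_{kj}, \qquad p(g_i \in z_k) = p(z_k \mid g_i) = \frac{p(g_i \mid z_k)\, p(z_k)}{p(g_i)}$$

Objective function (maximized by EM):

$$f(g, t, z) = \sum_i \sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}} \log \sum_k p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)$$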
Latent Variable Models Probabilistic Clustering - Learning

Initialize $p(z_k)$, $p(g_i \mid z_k)$, $p(t_j \mid z_k)$ for $i = 1, \dots, N$, $j = 1, \dots, M$, $k = 1, \dots, K$ such that

$$\sum_{i=1}^{N} p(g_i \mid z_k) = 1, \qquad \sum_{j=1}^{M} p(t_j \mid z_k) = 1, \qquad \sum_{k=1}^{K} p(z_k) = 1$$

Repeat the EM adaptation until max_iteration is reached:

E-step:

$$p(z_k \mid g_i, t_j) = \frac{p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)}{\sum_{k'} p(z_{k'})\, p(g_i \mid z_{k'})\, p(t_j \mid z_{k'})}$$

M-step:

$$p(g_i \mid z_k) = \frac{\sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}}\, p(z_k \mid g_i, t_j)}{\sum_{i'} \sum_{j'} \frac{g_{i'j'}}{\sum_{j''} g_{i'j''}}\, p(z_k \mid g_{i'}, t_{j'})}$$

$$p(t_j \mid z_k) = \frac{\sum_i \frac{g_{ij}}{\sum_{j'} g_{ij'}}\, p(z_k \mid g_i, t_j)}{\sum_{i'} \sum_{j'} \frac{g_{i'j'}}{\sum_{j''} g_{i'j''}}\, p(z_k \mid g_{i'}, t_{j'})}$$

$$p(z_k) = \frac{1}{R} \sum_i \sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}}\, p(z_k \mid g_i, t_j), \qquad R = \sum_i \sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}}$$

Prototypes: $p(t \mid z_k)$, $k = 1, \dots, K$, are the prototypes of the clusters.

Clustering: given a gene $g_i$, the cluster of $g_i$ is the $k$ for which $p(g_i \in z_k)$ is the biggest.
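The EM loop can be sketched in plain Python. This is a minimal illustration under stated assumptions: the expression values are strictly positive, the random initialization and the function names are choices of the sketch, and the cluster of a gene is taken as the k maximizing p(g_i|z_k)p(z_k), which is proportional to p(z_k|g_i).

```python
import math
import random

def _normalize(values):
    total = sum(values)
    return [v / total for v in values]

def em_cluster(g, K, iterations=50, seed=0):
    """g[i][j] > 0: expression of gene i at time j. Returns p(z), p(g|z), p(t|z), objective trace."""
    rng = random.Random(seed)
    N, M = len(g), len(g[0])
    w = [_normalize(row) for row in g]            # g_ij / sum_j' g_ij'
    R = sum(sum(row) for row in w)
    pz = [1.0 / K] * K
    pg = [_normalize([rng.random() + 0.1 for _ in range(N)]) for _ in range(K)]
    pt = [_normalize([rng.random() + 0.1 for _ in range(M)]) for _ in range(K)]
    trace = []
    for _ in range(iterations):
        acc_g = [[0.0] * N for _ in range(K)]
        acc_t = [[0.0] * M for _ in range(K)]
        objective = 0.0
        for i in range(N):
            for j in range(M):
                # E-step: posterior p(z_k | g_i, t_j) is joint[k] / total.
                joint = [pz[k] * pg[k][i] * pt[k][j] for k in range(K)]
                total = sum(joint)
                objective += w[i][j] * math.log(total)
                for k in range(K):
                    r = w[i][j] * joint[k] / total
                    acc_g[k][i] += r
                    acc_t[k][j] += r
        trace.append(objective)                   # objective before this update
        # M-step: renormalize the weighted posteriors.
        pz = [sum(acc_g[k]) / R for k in range(K)]
        pg = [_normalize(acc_g[k]) for k in range(K)]
        pt = [_normalize(acc_t[k]) for k in range(K)]
    return pz, pg, pt, trace

def assign_clusters(pz, pg):
    """Cluster of gene i: the k maximizing p(g_i | z_k) p(z_k)."""
    return [max(range(len(pz)), key=lambda k: pz[k] * pg[k][i])
            for i in range(len(pg[0]))]
```

The rows of `pt` are the cluster prototypes, and the recorded objective values can be plotted as a learning curve.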
Latent Variable Models Probabilistic Clustering – Learning Curve
[Figure: learning curve; objective function value (axis from -784.5 to -781.5) against number of iterations (1 to about 1,700).]
Latent Variable Models Probabilistic Clustering - Result
- Prototypes
[Figure: six panels, each plotting one cluster prototype (values from 0 to 0.35) over time points 1 to 18.]
- Clustering: given a gene $g_i$, the cluster of $g_i$ is $k$, where $k = \arg\max_m p(g_i \in z_m)$
Bayesian Networks
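Bayesian Networks for Gene Expression Analysis (1) Feature Selection
- Part of the data
  - There are 7,129 integer-valued attributes in all.
- Attribute selection, with class means $\mu_1, \mu_2$ and standard deviations $\sigma_1, \sigma_2$ of an attribute:
  $$P = \frac{|\mu_1 - \mu_2|}{\sigma_1 + \sigma_2}$$
  - The 10 attributes with the highest P values are selected.
- Attribute value categorization
  $$\text{categorized\_value} = \left\lfloor \frac{\text{value}}{\max(\text{attribute}) - \min(\text{attribute})} \times 10 \right\rfloor$$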
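Both preprocessing steps can be sketched directly from the formulas. The function names, the data layout, and the use of the population standard deviation are assumptions of this sketch; an attribute that is constant within both classes would need special handling because the denominator becomes zero.

```python
import math
from statistics import mean, pstdev

def p_metric(class1_values, class2_values):
    """P = |mu1 - mu2| / (sigma1 + sigma2) for one attribute."""
    return abs(mean(class1_values) - mean(class2_values)) / (
        pstdev(class1_values) + pstdev(class2_values))

def select_attributes(expression, labels, top=10):
    """expression[s][a]: value of attribute a in sample s; labels[s] is 0 or 1."""
    scores = []
    for a in range(len(expression[0])):
        class1 = [row[a] for row, y in zip(expression, labels) if y == 0]
        class2 = [row[a] for row, y in zip(expression, labels) if y == 1]
        scores.append((p_metric(class1, class2), a))
    return [a for _, a in sorted(scores, reverse=True)[:top]]

def categorize(value, attribute_values):
    """floor(value / (max - min) * 10), following the categorization formula."""
    span = max(attribute_values) - min(attribute_values)
    return math.floor(value / span * 10)
```

With two classes such as ALL and AML, `select_attributes` returns the indices of the genes whose class means are furthest apart relative to their spread.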
Bayesian Networks for Gene Expression Analysis (2)
[Figure: learning and inference with a Bayesian network over Gene A, Gene B, Gene C, Gene D, and the Target.]
- Learning: data -> preprocessing -> processed data -> learning algorithm -> network
- Inference: the values of Gene C and Gene B are given; belief propagation is run; the probability for the target is computed.
Bayesian Networks for Gene Expression Analysis (3) Structure Learning
[Figure: learned network structure over the nodes Leukotriene, FAH, Zyxin, LYN, C-myb, CD33, LEPR, DF, GB DEF, Liver, and Target.]
Bayesian Networks for Gene Expression Analysis (4) Learning Procedure
Input: gene expression data, $D$
Output: network structure $G$, local probability tables $\Theta$
Objective function: BDe score$(G, \Theta)$

$$\text{BDe score}(G, \Theta) = p(G) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}, \qquad \alpha_{ij} = \sum_{k} \alpha_{ijk}, \quad N_{ij} = \sum_{k} N_{ijk}$$

- $p(G)$: prior probability for the structure $G$
- $n$: the number of nodes (attributes)
- $q_i$: the number of states of the parents of the $i$th node
- $r_i$: the number of states of node $i$
- $\alpha_{ijk}$: Dirichlet prior for node $i$ at the $j$th parents state and $k$th state
- $N_{ijk}$: the frequency of node $i$ at the $j$th parents state and $k$th state in data $D$

Procedure
1. The network structure without edge directions is learned with Chow and Liu's algorithm (by mutual information).
2. Greedy hill-climbing search for the maximum BDe score within the structure learned in step 1.
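The score is normally evaluated in log space, because the Gamma functions overflow quickly. A minimal sketch of the per-node term, assuming the counts and Dirichlet parameters are passed as nested lists indexed by parent state j and node state k; the function names are hypothetical.

```python
from math import lgamma

def log_bde_node(counts, alpha):
    """Log of one node's factor: counts[j][k] = N_ijk, alpha[j][k] = alpha_ijk."""
    score = 0.0
    for n_j, a_j in zip(counts, alpha):
        # Gamma(alpha_ij) / Gamma(alpha_ij + N_ij), with sums taken over k.
        score += lgamma(sum(a_j)) - lgamma(sum(a_j) + sum(n_j))
        for n, a in zip(n_j, a_j):
            # Gamma(alpha_ijk + N_ijk) / Gamma(alpha_ijk)
            score += lgamma(a + n) - lgamma(a)
    return score

def log_bde_score(log_prior, nodes):
    """log p(G) plus the sum of the per-node terms; nodes is a list of (counts, alpha)."""
    return log_prior + sum(log_bde_node(counts, alpha) for counts, alpha in nodes)
```

Because the score decomposes over nodes, a hill-climbing step that changes the parents of one node only needs to recompute that node's term.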
Bayesian Networks for Gene Expression Analysis (5) Inference
- The Bayesian network constructed (partial)
[Figure: part of the network, with the nodes FAH, Leukotriene, C-myb, and Target.]
- Given the values of C-myb and Leukotriene, the value of the Target can be inferred by

$$P(T \mid C, L) = \sum_F P(T, F \mid C, L) = \sum_F P(T \mid F, C, L)\, P(F \mid C, L) = \sum_F P(T \mid F)\, P(F \mid C, L)$$
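The marginalization over FAH can be sketched with plain dictionaries. The probability values used in the usage note are invented for illustration and do not come from the learned network.

```python
def infer_target(p_f_given_cl, p_t_given_f):
    """P(T | C, L) = sum over F of P(T | F) P(F | C, L).

    p_f_given_cl maps each state f of F to P(F=f | C, L) for the observed C and L.
    p_t_given_f maps each state f to a dict of P(T=t | F=f).
    """
    targets = next(iter(p_t_given_f.values())).keys()
    return {
        t: sum(p_t_given_f[f][t] * p_f for f, p_f in p_f_given_cl.items())
        for t in targets
    }
```

For instance, with P(F | C, L) = {low: 0.3, high: 0.7} and P(T | F) = {low: {ALL: 0.9, AML: 0.1}, high: {ALL: 0.2, AML: 0.8}}, the call sums 0.3 x 0.9 and 0.7 x 0.2 for ALL.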
Bayesian Networks for Gene Expression Analysis (6) Classification Results
- Prediction error of this Bayesian network (given all attribute values)

                  Bayesian network   Weighted voting
  Training data   1/38               1/38
  Test data       8/34               4/34

- The result can be improved by more appropriate data preprocessing.
Non-negative Matrix Factorization
Non-negative Matrix Factorization (1)
- Method
  - Using NMF for class clustering and prediction of gene expression data from acute leukemia patients
- NMF (non-negative matrix factorization)
  $$G \approx WH$$
- NMF as a latent variable model
  $$(G)_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}, \qquad g \approx Wh$$
  - $G$: gene expression data matrix
  - $W$: basis matrix (prototypes)
  - $H$: encoding matrix (in low dimension)
  - $G_{i\mu}, W_{ia}, H_{a\mu} \ge 0$
[Figure: latent variables $h_1, h_2, \dots, h_r$ connected through $W$ to the observed genes $g_1, g_2, \dots, g_n$.]
NMF (2) Clustering Gene Expression Data
[Figure: the 7,129 genes x 38 samples matrix $G$ is approximated by the product of $W$ (7,129 genes x 2 factors) and the encoding $H$ (2 factors x 38 samples, with rows $H_{1\cdot}$ and $H_{2\cdot}$).]
- Factors can capture the correlations between the genes using the values of expression level.
- Cluster training samples into 2 groups by NMF
  - Assign each sample to the factor (class) which has the higher encoding value.
- Accuracy: 0 to 1 errors on the training data set
NMF (3) Learning Procedure
Input: gene expression data matrix $G$ ($n \times m$)
Output: basis matrix $W$ ($n \times k$), encoding matrix $H$ ($k \times m$)
$n$: data size, $m$: number of genes, $k$: number of latent variables

Objective function:

$$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ G_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]$$

Procedure
1. Initialize $W$, $H$ with random numbers such that $W_{ij} \ge 0$, $\sum_j W_{ij} = 1$, $H_{ij} \ge 0$.
2. Update $W$, $H$ iteratively until max_iteration or some criterion is met:

$$W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{G_{i\mu}}{(WH)_{i\mu}} H_{a\mu}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_j W_{ja}}, \qquad H_{a\mu} \leftarrow H_{a\mu} \sum_i W_{ia} \frac{G_{i\mu}}{(WH)_{i\mu}}$$
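The update rules can be sketched in plain Python with nested lists. Several details are assumptions of this sketch: the updates are applied sequentially (W, then its normalization, then H, each using the latest product WH), the random initialization is left unnormalized because the first normalization step fixes the column sums, and the data matrix is strictly positive.

```python
import math
import random

def _matmul(W, H):
    return [[sum(W[i][a] * H[a][u] for a in range(len(H))) for u in range(len(H[0]))]
            for i in range(len(W))]

def _objective(G, W, H):
    WH = _matmul(W, H)
    return sum(G[i][u] * math.log(WH[i][u]) - WH[i][u]
               for i in range(len(G)) for u in range(len(G[0])))

def nmf(G, k, iterations=100, seed=0):
    """Multiplicative updates for G ~ WH; returns W, H, and the objective per iteration."""
    rng = random.Random(seed)
    n, m = len(G), len(G[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    trace = []
    for _ in range(iterations):
        WH = _matmul(W, H)
        W = [[W[i][a] * sum(G[i][u] / WH[i][u] * H[a][u] for u in range(m))
              for a in range(k)] for i in range(n)]
        column_sums = [sum(W[i][a] for i in range(n)) for a in range(k)]
        W = [[W[i][a] / column_sums[a] for a in range(k)] for i in range(n)]
        WH = _matmul(W, H)
        H = [[H[a][u] * sum(W[i][a] * G[i][u] / WH[i][u] for i in range(n))
              for u in range(m)] for a in range(k)]
        trace.append(_objective(G, W, H))
    return W, H, trace

def assign_samples(H):
    """Assign each column (sample) to the factor with the highest encoding value."""
    return [max(range(len(H)), key=lambda a: H[a][u]) for u in range(len(H[0]))]
```

With `k=2` on the leukemia data, `assign_samples(H)` would give the two-group clustering of the training samples.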
NMF (4) Learning Curve
[Figure: learning curve; log likelihood against number of iterations (1 to 101).]
NMF (5) Clustering Result
[Figure: scatter plot of the ALL and AML samples, with one axis running from 0 to 6 and the other from 0 to 7.]
NMF (6) Diagnosis
[Figure: a test sample $g$ (7,129 genes) is approximated by $W$ (7,129 genes x 2 factors) times its encoding $h = (h_1, h_2)$.]
- For each test sample $g$, estimate the encoding vector $h$ that best approximates the sample.
  - $W$ is the basis matrix computed during training (fixed).
- As in training, assign each sample to the factor (class) which has the highest encoding value.
- Accuracy: 1 to 2 errors on the test data set
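One way to estimate the encoding of a test sample is to keep W fixed and iterate only the multiplicative update for h. The slide does not say how h is estimated, so this procedure, the function names, and the iteration count are assumptions; the update is written for a W whose columns sum to 1.

```python
def encode(g, W, iterations=200):
    """Estimate h >= 0 with g ~ W h, keeping the basis matrix W fixed."""
    k = len(W[0])
    h = [1.0] * k
    for _ in range(iterations):
        wh = [sum(W[i][a] * h[a] for a in range(k)) for i in range(len(g))]
        h = [h[a] * sum(W[i][a] * g[i] / wh[i] for i in range(len(g)))
             for a in range(k)]
    return h

def diagnose(g, W):
    """Index of the factor (class) with the highest encoding value."""
    h = encode(g, W)
    return max(range(len(h)), key=lambda a: h[a])
```

With a toy basis in which factor 0 loads on the first two genes and factor 1 on the last two, a sample expressed mainly in the first two genes is assigned to factor 0.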
Generative Topographic Mapping
Generative Topographic Mapping (1)
- GTM: a nonlinear, parametric mapping $y(x; W)$ from a latent space to a data space.
[Figure: a grid in the latent space (axes $x_1$, $x_2$) is mapped by $y(x; W)$ into the data space (axes $t_1$, $t_2$, $t_3$).]
GTM (2) Learning Algorithm (EM)
- Generate the grid of latent points.
- Generate the grid of basis function centers.
- Compute the matrix of basis function activations $\Phi$.
- Initialize the weights $W$ in $Y = \Phi W$ and the noise variance $\beta$.
- Compute $\Delta_{n,k} = \|t_n - \phi_k W\|^2 = \|t_n - y_k(x, W)\|^2$ for each $n$, $k$.
- Repeat
  - Compute the responsibility matrix $R$ using $\Delta$ and $\beta$. [E-step]
  - Compute $G$ from $R$.
  - Update $W$ by solving $\Phi^T G \Phi W = \Phi^T R T$. [M-step]
  - Compute $\Delta = \|t - \Phi W\|^2$.
  - Update $\beta$.
- Until convergence
GTM (3) Visualization
- Posterior distribution in latent space given a data point t: X(t) ~
- For a whole set of data: for each t, plot in the latent space
  - Posterior mode:
  - Posterior mean:
GTM (4) Clustering Experiment
- Gene Selection
  - Select about 50 genes out of 7,129 based on three test scores for cancer diagnosis.
    • Correlation metric (similar to a t-test)
    • Wilcoxon test scores (a nonparametric t-test)
    • Median test scores (a nonparametric t-test)
- Clustering & Visualization
  - After learning a model, genes are plotted in the latent space.
  - With the mapping in the latent space, clusters can be identified.
- List of Genes Selected
GTM (5) Learning Curve
GTM (6) Clustering Result
- Genes with high expression levels in ALL (large P-metric value)
- Genes with high expression levels in AML (large negative P-metric value)
Summary
- Challenges of Machine Learning Applied to Biosciences
  - Huge data size
  - Noise and data sparseness
  - Unlabeled and imbalanced data
- Biosciences for Machine Learning
  - New application
  - Biosystems are existence proofs for ideal AI systems
  - Provides interesting metaphors and algorithms!
- Synergy effects between biosciences and artificial intelligence