Bioinformatics and Biocomputing Outline

Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
[email protected]
http://bi.snu.ac.kr/ or http://cbit.snu.ac.kr/

Outline
• Bioinformation Technology (BIT)
• DNA Chip Data Mining: IT for BT
• DNA Computing: BT for IT
• DNA Computing with DNA Chips
• Outlook

Human Genome Project
Goals
• Identify the approximately 40,000 genes in human DNA
• Determine the sequences of the 3 billion bases that make up human DNA
• Store this information in databases
• Develop tools for data analysis
• Address the ethical, legal and social issues that arise from genome research
Genome Health Implications
• A New Disease Encyclopedia
• New Genetic Fingerprints
• New Diagnostics
• New Treatments

Bioinformation Technology: Bioinformatics vs. Biocomputing
[Fig.] Bioinformatics and Biocomputing as the two links between IT and BT.

Bioinformatics

What is Bioinformatics?
Bio: molecular biology
Informatics: computer science
Bioinformatics: solving problems arising from biology using methodology from computer science.
• Bioinformatics vs. Computational Biology
• Bioinformatik (in German): biology-based computer science as well as bioinformatics (in English)

Molecular Biology: Flow of Information
DNA → RNA → Protein → Function

DNA (Gene) → RNA → Protein
[Fig.] From DNA (gene) to RNA to protein.

Nucleotide and Protein Sequence
DNA (Nucleotide) Sequence
SQ  1344 BP; 291 A; C; 401 G; 278 T; 0
Protein (Amino Acid) Sequence
CG2B_MARGL  Length: 388  April 2, 1997 14:55  Type: P  Check: 9613 ..
  1 MLNGENVDSR IMGKVATRAS SKGVKSTLGT RGALENISNV ARNNLQAGAK
 51 KELVKAKRGM TKSKATSSLQ SVMGLNVEPM EKAKPQSPEP MDMSEINSAL
101 EAFSQNLLEG VEDIDKNDFD NPQLCSEFVN DIYQYMRKLE REFKVRTDYM
151 TIQEITERMR SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN
201 KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI RSMECNILRR
251 LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM AKYLMELTLP EYAFVPYDPS
301 EIAAAALCLS SKILEPDMEW GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP
351 TAKFQAVRKK YSSAKFMNVS TISALTSSTV MDLADQMC

Some Facts
• 10^14 cells in the human body.
• 3 × 10^9 letters in the DNA code in every cell in your body.
• DNA differs between humans by 0.2% (1 in 500 bases).
• Human DNA is 98% identical to that of chimpanzees.
• 97% of DNA in the human genome has no known function.

Topics in Bioinformatics
• Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
• Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
• Expression analysis
  - Gene expression analysis
  - Gene clustering
• Pathway analysis
  - Metabolic pathway
  - Regulatory networks

Extension of Bioinformatics Concept
• Genomics
  - Functional genomics
  - Structural genomics
• Proteomics: large-scale analysis of the proteins of an organism
• Pharmacogenomics: developing new drugs that will target a particular disease
• Microarray: DNA chip, protein chip

Applications of Bioinformatics
• Drug design
• Identification of genetic risk factors
• Gene therapy
• Genetic modification of food crops and animals
• Biological warfare, crime etc.
• Personal Medicine?
• E-Doctor?

Bioinformatics as Information Technology
[Fig.] Diagram labels include GB and SWISS-PROT.

Background of Bioinformatics
• Biological information infrastructure
  - Biological information management systems
  - Analysis software tools
  - Communication networks for biological research
• Massive biological databases
  - DNA/RNA sequences
  - Protein sequences
  - Genetic map linkage data
  - Biochemical reactions and pathways
• Need to integrate these resources to model biological reality and exploit the biological knowledge that is being gathered.

Areas and Workflow of Bioinformatics
[Fig.] Labels: raw nucleotide sequences, Microarray (Biochip), Structural Genomics, Functional Genomics, Pharmacogenomics, Proteomics, Infrastructure of Bioinformatics.

DNA Chip Data Mining: IT for BT

cDNA Microarray
[Fig.] Microarray workflow: cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); hybridize mRNA target to microarray; scanning (excitation by Laser 1 and Laser 2, emission); overlay images and normalize; analysis.

The Complete Microarray Bioinformatics Solution
[Fig.] Components: Databases, Data Management, Cluster Analysis, Statistical Analysis, Data Mining, Image Processing, Automation.

DNA Chip Applications
• Gene discovery: gene/mutated gene
  - Growth, behavior, homeostasis …
• Disease diagnosis
  - Cancer classification
• Drug discovery: Pharmacogenomics
• Toxicological research: Toxicogenomics

Disease Diagnosis: Cancer Classification with DNA Microarray
- cDNA microarray data of 6567 gene expression levels [Khan ’01].
- Filter genes that are correlated to the classification of cancer using PCA and ANN learning.
- Hierarchical clustering of the DNA chip samples based on the filtered 96 genes.
- Disease diagnosis based on DNA chip.
[Fig.] Flowchart of the experimental procedure.

Disease Diagnosis: Hierarchical Clustering Based on Gene Expression Levels
- Hierarchical clustering of cancer by 96 gene expression levels.
- The relation between gene expression and cancer category.
- Four cancer diagnostic categories
[Fig.] The dendrogram of four cancer clusters and gene expression levels (row: genes, column: samples).

AI Methods for DNA Chip Data Analysis
• Classification and prediction
  - ANNs, support vector machines, etc.
  - Disease diagnosis
• Cluster analysis
  - Hierarchical clustering, probabilistic clustering, etc.
  - Functional genomics
• Genetic network analysis
  - Differential models, relevance networks, Bayesian networks, etc.
  - Functional genomics, drug design, etc.

Cluster Analysis
[Fig.] A DNA microarray dataset split into Gene Cluster 1, Gene Cluster 2, Gene Cluster 3 and Gene Cluster 4.

Methods for Cluster Analysis
• Hierarchical clustering [Eisen ’98]
• Self-organizing maps [Tamayo ’99]
• Bayesian clustering [Barash ’01]
• Probabilistic clustering using latent variables [Shin ’00]
• Non-negative matrix factorization [Shin ’00]
• Generative topographic mapping [Shin ’00]

Clustering of Cell Cycle-regulated Genes in S. cerevisiae (the Yeast)
• Identify cell cycle-regulated genes by cluster analysis.
  - 104 genes are already known to be cell cycle-regulated.
  - Known genes are clustered into 6 clusters.
• Cluster 104 known genes and other genes together.
• The same cluster → similar functional categories.
[Fig.] 104 known gene expression levels according to the cell cycle (row: time step, column: gene).

Probabilistic Clustering Using Latent Variables
g_i: ith gene
z_k: kth cluster
t_j: jth time step
p(g_i | z_k): generating probability of ith gene given kth cluster
v_k = p(t | z_k): prototype of kth cluster
$p(g_i \in z_k) = p(z_k \mid g_i) = \frac{p(g_i \mid z_k)\, p(z_k)}{p(g_i)}$
$\mathrm{similarity}(x_i, v_k) = \sum_j x_{ij} v_{kj}$
$f(g, t, z) = \sum_i \sum_j g_{ij} \sum_k \log\bigl(p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)\bigr)$  (*)
(*): objective function (maximized by EM)

Experimental Result: Identify Cell Cycle-Regulated Genes
• Clustering result
[Table] Clustering result with α-factor arrest data. Genes with a high probability of being cell cycle-regulated were found in 4 clusters.

Experimental Result: Prototype Expression Levels of Found Clusters
• The genes in the same cluster show similar expression patterns during the cell cycle.
• The genes with similar expression patterns are likely to have correlated functions.
[Fig.] Prototype expression levels of genes found to be cell cycle-regulated (4 clusters).

Clustering Using Non-negative Matrix Factorization (NMF)
• NMF (non-negative matrix factorization): $G \approx WH$
• NMF as a latent variable model: $(G)_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}$
  - G: gene expression data matrix
  - W: basis matrix (prototypes)
  - H: encoding matrix (in low dimension)
  - $\langle g \rangle = Wh$
  - $G_{i\mu}, W_{ia}, H_{a\mu} \ge 0$
[Fig.] Hidden units h_1, h_2, …, h_r connected through W to visible units g_1, g_2, …, g_n.

Experimental Result: Five Clusters Found by NMF
• 5 prototype expression levels during the cell cycle.
[Fig.] Expression level versus time step in cell cycle.

Clustering Using Generative Topographic Mapping (GTM)
• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.
[Fig.] A grid in the latent space (x1, x2) and the data space (t1, t2, t3), linked by the mapping y(x;W); arrows labeled Generation and Visualization.

Experimental Result: Clusters Found by GTM
• Three cell cycle-regulated clusters found by GTM

Cluster    Cluster center     No. of train data /   Correct no. /   Overall mean expression levels
                              no. in cluster        test data       (Cln/b) of known genes
S/G2                          5/                    1/2             (.148 .184 -.367 -.044)
S          (0.111 -0.333)     5/5                   5/5 (100%)      (1.075 1.482 -.233 -.375)
M/G1 c1    (0.111 0.333)      13/7                  1/6             (-.171 -.573 .091 .311)
M/G1 c2    (-0.111 -0.111)    /2                    0/6
M/G1 c3    (0.323 0.1)        /2                    0/6
G2/M c1    (0.111 0.333)      10/5                  0/5             (-.616 -1.01 1.832 1.596)
G2/M c2    (0.111 0.111)      /3                    3/5 (80%)
G1 c1      (-0.111 0.333)     35/18                 10/16 (62%)     (.894 .907 -.766 -.479)
G1 c2      (-0.111 0.111)     /7                    0/16

Experimental Result: Comparison with other methods
• Comparison of prototype expression levels

Cluster    No. of selected   Mean expression          No. of selected     Mean expression
           genes             levels by GTM            genes by Spellman   levels by Spellman
S/G2       92                (.13 -.06 -.1 .01)       121                 (.13 .05 -.16 .03)
S          25                (.84 .81 -.42 -.33)      71                  (.46 .47 -.43 -.18)
M/G1 c1    120               (.82 .65 -.65 -.38)      113                 (-.21 -.61 -.04 .07)
M/G1 c2    34                (-.04 -.37 -.01 -.11)
M/G1 c3    10                (.32 .29 -.3 .05)
G2/M c1    33                (-.59 -.96 1.34 1.29)    195                 (-.32 -.62 .49 .54)
G2/M c2    60                (.08 -.30 .51 .57)
G1 c1      122               (.92 .74 -.62 -.33)      300                 (.66 .49 -.55 -.33)
G1 c2      74                (.79 .82 -.48 -.34)
           (total = 570)                              (total = 800)

Genetic Network Analysis
- Discover the complex regulatory interaction among genes.
- Disease diagnosis, pharmacogenomics and toxicogenomics
- Boolean networks
- Differential equations
- Relevance networks [Butte ’97]
- Bayesian networks [Friedman ’00] [Hwang ’00]
[Fig.] Basin of attraction of 12-gene Boolean genetic network model [Somogyi ’96].

Bayesian Networks
• Represent the joint probability distribution among random variables efficiently using the concept of conditional independence.
[Fig.] Example network over the nodes A, B, C and D. An edge denotes the possibility of a causal relationship between nodes.
• A, C and D are independent given B.
• C asserts dependency between A and B.
• A, B and E are independent given C.
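
The statement on the Bayesian Networks slide that "C asserts dependency between A and B" can be illustrated with a small sketch. The graph structure is not recoverable from the slide text, so the example below assumes a hypothetical network A → C ← B over binary variables, and every probability in it is made up for illustration. It builds the joint distribution from the factorization P(A, B, C) = P(A) P(B) P(C | A, B), then checks that A and B are independent on their own but become dependent once C is observed.

```python
from itertools import product

# Hypothetical conditional probability tables for A -> C <- B (binary variables).
# None of these numbers come from the slides.
P_A = {0: 0.7, 1: 0.3}
P_B = {0: 0.6, 1: 0.4}
P_C1_GIVEN_AB = {(0, 0): 0.05, (0, 1): 0.60, (1, 0): 0.70, (1, 1): 0.95}


def joint(a, b, c):
    """P(A=a, B=b, C=c) from the factorization P(a) P(b) P(c | a, b)."""
    p_c1 = P_C1_GIVEN_AB[(a, b)]
    return P_A[a] * P_B[b] * (p_c1 if c == 1 else 1.0 - p_c1)


def prob(**fixed):
    """Probability of a partial assignment, obtained by summing the joint."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        values = {"a": a, "b": b, "c": c}
        if all(values[name] == v for name, v in fixed.items()):
            total += joint(a, b, c)
    return total


def cond(target, given):
    """P(target | given); both arguments are dicts of variable assignments."""
    return prob(**target, **given) / prob(**given)


# Without evidence on C, learning B tells us nothing about A.
p_a1 = prob(a=1)
p_a1_given_b1 = cond({"a": 1}, {"b": 1})

# Once C is observed, B changes our belief about A ("explaining away").
p_a1_given_c1 = cond({"a": 1}, {"c": 1})
p_a1_given_c1_b1 = cond({"a": 1}, {"c": 1, "b": 1})

print("P(A=1)            =", round(p_a1, 4))
print("P(A=1 | B=1)      =", round(p_a1_given_b1, 4))
print("P(A=1 | C=1)      =", round(p_a1_given_c1, 4))
print("P(A=1 | C=1, B=1) =", round(p_a1_given_c1_b1, 4))
```

Enumerating the full joint is only practical for a handful of variables. It is used here to keep the sketch short, not as a suggestion for real genetic networks.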
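
The NMF slide earlier in the deck gives the factorization G ≈ WH with non-negative entries but does not say how W and H were fitted. As a minimal sketch, the code below assumes the multiplicative update rules for the squared-error objective, which are one common choice and not necessarily what was used for the reported clusters. The 4 × 4 matrix `G` is toy data invented for the example (rows as genes, columns as time steps), and the function and variable names are hypothetical.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def squared_error(g, w, h):
    """Squared Frobenius norm of G - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rg, rwh in zip(g, wh) for x, y in zip(rg, rwh))


def nmf(g, r, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix G into W (n x r) and H (r x m)."""
    rng = random.Random(seed)
    n, m = len(g), len(g[0])
    # Strictly positive random start so that no entry is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    errors = [squared_error(g, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T G) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, g), matmul(matmul(wt, w), h)
        h = [[h[a][u] * num[a][u] / (den[a][u] + eps) for u in range(m)]
             for a in range(r)]
        # W <- W * (G H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(g, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
        errors.append(squared_error(g, w, h))
    return w, h, errors


# Toy expression matrix (invented): 4 genes x 4 time steps.
G = [
    [5.0, 4.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
]
W, H, errors = nmf(G, r=2)

# Assign each gene to the prototype with the largest weight in its row of W.
clusters = [max(range(2), key=lambda a: row[a]) for row in W]
print("reconstruction error:", errors[0], "->", errors[-1])
print("cluster per gene:", clusters)
```

Reading a cluster off the largest entry in each row of W is one simple convention for turning the factorization into a hard clustering; the slides do not state which rule was applied.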
