<<

bioinfo.mbb.yale.edu/mbb452a akGrti,Yl University Yale Gerstein, Mark Introduction

1 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Biological Bioinformatics + Calculations

2 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu iifraisi MS o oeua Molecular for “MIS” is Bioinformatics • definition? a for idea One • • ttsis oudrtn and and understand CS, to math, ) applied as such disciplines from large-scale. associated information applying then conceptualizing is Bioinformatics (Molecular) htis What Bio i h es fpyia-hmsr)and physical-chemistry) of sense the (in “ - informatics Bioinformatics ihteemolecules, these with techniques ” raiethe organize ilg ntrsof terms in biology (derived ? na on

3 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu oeua ilg:a nomto Science Information an Biology: Molecular Gntcmaterial •Genetic Processes • Molecules • Dogma Central • fMlclrBiology Molecular of DNA >RNA -> ◊ ◊ >Protein -> >Phenotype -> ehns,Seiiiy Regulation Specificity, Mechanism, Structure, Sequence, >DNA -> Sm aayi activity catalytic •Some (tRNA/mRNA) synthesis • (mRNA) transfer •Information ie rmDBulg tnod rpisfo Strobel) S from graphics Stanford, Brutlag, D from (idea Information of Amounts Large • Paradigm Central • o Bioinformatics for eoi euneInformation Sequence Genomic >mN (level) mRNA -> ◊ ◊ >PoenSequence Protein -> >PoenStructure Protein -> Statistical Standardized >PoenFunction Protein -> >Phenotype -> Cnrlo growth/differentiation of •Control protection •Immune motion/support •Mechanical transport/storage •Cofactor biocatalyst •Primary or . by performed facilitated are functions cellular •Most

4 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu a N Sequence DNA Raw • oeua ilg nomto DNA - Information Biology Molecular ◊ ◊ ◊ ◊ as nogenes? into Parse Not? or Coding 2Mi in M ~2 ~1Kinagene,4bases:AGCT aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg ...... gacgctggtatcgcattaactgattctttcgttaaattggtatc gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca

5 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu 20Kkonpoensequences protein known K bacteria), ~200 (in protein • average an in aa ~300 of Strings • alphabet letter 20 • 3f_ -P--KRP d3dfr__ -G---RP d4dfra_ -PEKNRP d8dfr__ d1dhfa_ ---PKRP d3dfr__ ---G-RP d4dfra_ VPEKNRP d8dfr__ VPEKNRP d1dhfa_ d3dfr__ d4dfra_ d8dfr__ 1ha LNCIVAVSQ d1dhfa_ d3dfr__ d4dfra_ d8dfr__ d1dhfa_ 20a nadomain a in aa ~200 ◊ ACDEFGHIKLMNPQRSTVWY -PEKNRP TAFLWAQDR ISLIAALAV LNSIVAVCQ TAFLWAQDR ISLIAALAV LNSIVAVCQ LNCIVAVSQ oeua ilg Information: Biology Molecular LPERTNVVLT LPGRKNIILS LKDRINIVLS LKGRINLVLS LPERTNVVLT LPGRKNIILS LKDRINIVLS LKGRINLVLS NG DR NM NM DG DR NM NM LIGKDGHLPW VIGMENAMPW GIGKDGNLPW GIGKNGDLPW LIGKDGHLPW VIGMENAMPW GIGKDGNLPW GIGKNGDLPW HQEDYQ SSQPGT RELKEA RELKEP HQEDYQ -SQPGT RELKEA RELKEP rti Sequence Protein -H -N PP PP H- N- PP PP AQGA DDRV PKGA PQGA AQGA DDRV PKGA PQGA LPDDLHYFRAQT LPADLAWFKRNT LRNEYKYFQRMT LRNEFRYFQRMT LPDDLHYFRAQT LPADLAWFKRNT LRNEYKYFQRMT LRNEFRYFQRMT - - H H - - H H VVVHDVAAVFAYAK TWVKSVDEAIAACG YLSKSLDDALALLD FLSRSLDDALKLTE VVVHDVAAVFAYAK TWVKSVDEAIAACG YLSKSLDDALALLD FLSRSLDDALKLTE VG------LD------STSHVEGKQ- TTSSVEGKQ- V------G L------N STSHVEGKQ- TTSSVEGKQ- u not but QHLD----Q DVPE----- SPELKSKVD QPELANKVD QHLDQ---- DVP------SPELKSKVD QPELANKVD KIMVVGRRTYESF KPVIMGRHTWESI NAVIMGKKTWFSI NLVIMGKKTWFSI KIMVVGRRTYESF KPVIMGRHTWESI NAVIMGKKTWFSI NLVIMGKKTWFSI BJOUXZ ELVIAGGAQIFTAFK . MVWIVGGTAVYKAAM MVWIVGGSSVYKEAM ELVIAGGAQIFTAFK EIMVIGGGRVYEQFL MVWIVGGTAVYKAAM MVWIVGGSSVYKEAM IMVIGGGRVYEQFL DDV PKA EKP NHP DDV PKA EKP NHP

6 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu DNA/RNA/Protein • ◊ lotalprotein all Almost oeua ilg Information: Biology Molecular ih adTpPoenfo eitwbpage) web Levitt M from Protein Top Page, Hand Web Right Soll D From Adapted (RNA armlclrStructure Macromolecular

7 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu E 40LS161GKY1516 1GKY1515 1GKY1514 1GKY1513 1GKY1512 62.94 1.00 48.41 1GKY1511 1.00 78 56.647 48.15 1GKY1510 55.811 1.00 1GKY 23.841 49.88 77 21.104 56.860 1.00 76 186 16.887 53.45 1GKY 75 10.735 20.402 58.185 1.00 1GKY 55.06 1GKY 74 LYS 11.452 21.198 58.180 1.00 25.54 1.00 1GKY 73 11.531 22.452 57.567 186 63.500 30.20 1GKY 72 186 12.422 22.263 1.00 36.48 29.501 1.00 57.79 1GKY 71 1450 LYS 186 13.836 63.532 1.00 70 OXT LYS 13.502 62.827 53.00 1GKY 186 28.316 63.930 1.00 1GKY 1449 NZ LYS 28.819 43.24 TER 186 12.548 31.409 63.974 1.00 69 1448 CE LYS 11.360 41.99 68 186 7.294 30.189 64.814 1.00 1GKY 67 1447 CD LYS 2 46.62 1GKY ATOM 8.052 29.607 63.579 1.00 49.13 1GKY 1446 CG LYS 1.00 ATOM 2 10.593 29.500 62.974 1445 CB 2 61.685 ATOM ARG 10.453 30.200 50.04 1444 1 29.755 1.00 50.35 ATOM 9.242 1.00 49.88 C ARG 1 8.753 59.226 1.00 ATOM ARG 1 60.722 12 CA 29.767 60.595 SER 1 30.832 ... 11 N 8.876 30.166 OG SER 10.432 10 SER 1 9.401 ATOM 9 CB 1 SER ATOM 8 O ATOM 7 C SER 0 0 SER ATOM 6 CA 0 ATOM 5 N ACE ACE ATOM 4 CH3 ACE ATOM 3 O ATOM 2 C ATOM 1 ATOM ATOM ATOM triplets XYZ of Number on Statistics • ◊ ◊ ◊ 0Kkondmi,~0 folds ~300 domain, known K 10 0 residues/domain 200 v.Rsdei e:4bcbn tm iehi tm,10cbcA cubic 150 , sidechain 4 + atoms backbone 4 Leu: is Residue Avg. •=> oeua ilg Information: Biology Molecular 10 y rpes(820 e rti domain protein per (=8x200) triplets xyz ~1500 rti tutr Details Structure Protein -> 0 Aaos eaae y38A 3.8 by separated atoms, CA 200

8 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Sequences Finiteness ol of World highlight fthe of the nml ~100 Animal, genes atra 1.6 Bacteria, 2000? genes 3M,~6K Mb, 13 ua,~3 Human, b ~100K Gb, Eukaryote, b ~1600 Mb, genes 1998 1997 1995 b ~20K Mb, 282 269 387 1945] : 496] : :1] [ [ Science Science [ [???]

9 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu oeua Biology Molecular hl Genomes Whole nertv Data Integrative • Everything Driving Revolution The • 03 ua:3G 0 genes... K 100 & Gb 3 human: 2003, genomes! completed >30 genes 1999: K 19 with yeast ~100Mb for worm: genes 1998, ~6000 & Mb done 13 genes yeast: 1600 1997, & Mb 1.6 (bacteria): HI 1995, Information: admsqecn n sebyof assembly and random http://www.tigr.org) website, TIGR from adapted (Picture Science Fleischmann rtha,J . uran .L,Gohgn .S . nh,C . coad .A,Small, A., L. McDonald, D., L., L. C. Fine, Gnehm, C., M., R. S. K.V.,Fraser,C.M.,Smith,H.O.& Brandon, N. M., D. Geoghagen, L., Saudek, J. T., D., D. Fuhrmann, M. Nguyen, L., Cotton, J. A., C., E., Hedblom, Glodek, M. Fritchman, K., T., I., Hanna, Spriggs, McKenney, L. A., R., M., Liu, C. T. R., J. Phillips, Shirley, Utterback, Merrick, F., J., A., J. Scott, Weidman, B. D., M., J. Dougherty, J. Gocayne, Kelley, F., C., J. Fields, W., Tomb, Fitzhugh, J., G., C. Sutton, Bult, R., A. Kerlavage, 6:496-512. 269: , R .D . ,A d a m s ,M .D . ,W h i t e ,O . ,C l a y t o n ,R .A . ,K i r k n e s s ,E .F . , Venter Haemophilus ,J.C.( 1995 nlezerd." influenzae ). "Whole-genome -GAPekso, A G -- reading. better make latter the although managedinalifetime, Shakespeare than data of bits more produce can laboratory a single week, a than less in that, quickly so accumulate now sequence Genome Nature 401 1-1 (1999) 115-116 :

10 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu on/adr Chips, Young/Lander, Brown, e.Ep over Exp. Rel. Timecourse b.Exp. Abs. µ µ µ µ array, Expression Protein Aebersold, Chips; Church, and Samson Also SAGE; : eeExpression Transcriptosome aaes the Datasets: Transposons, rti Exp. Protein Snyder,

11 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu background from telling floats 60K = 10 x 6000 points, time 10 at array these of infinite variety an do can but once genome sequence only Can genes! 6000 all for levels Academia in Data Expression Yeast ra Data Array : cuts fJHager) J of (courtesy

12 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu aallanalysis. and parallel deletion gene by genome S. cerevisiae (1999). the al. of et characterization & Functional W. R. M., Davis, Laub, H., Liao, T., Jones, H., G., J. Giaever, Hegemann, Friend, E., F., Gentalen, H., Foury, M., S. Bakkoury, El Dow, W., F., S. Dietrich, M., K., A. Davis, Chu, C., H., Connelly, Bussey, D., R., J. Boeke, Benito, R., Bangham, K., B., Anderson, Andre, H., Liang, A., D., D. Astromoff, Shoemaker, A., E. Winzeler, Knockouts Systematic Science 285 901-6 , 143-52 ~18Minteractions 2 / 6000 x 6000 yeast: For map. linkage for protein genome clones human EST the human from yeast & modular H. cDNA a Zhou, two-hybrid of E., Construction Chan, (1998). M., L. Qiu, Zhu, Y., Luo, B., S. Hua, maps linkage hybrids, 2 te Whole- Other Experiments Genome Gene 215 ,

13 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Ptwydaigfo apsEoy,Phylogeny EcoCyc, Karp’s P from drawing (Pathway Future.... The • to Information • nesadgenomes understand rmSJGud ioari Haystack) a in Dinosaur Gould, J S from ◊ ◊ ◊ ◊ ◊ (MEDLINE) Literature The , Environments, traditional Phylogeny, Whole Networks Regulatory traditional (glycolysis), Pathways Metabolic oeua ilg Information: Biology Molecular te nertv Data Integrative Other

14 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu •DrivingForcein Net & Disk vs CPU • Bioinformatics rmDBulg Stanford) Brutlag, D from adapted picture (Internet Explonential ◊ oecrucial more computersiseven on information of amounts large store to ability the been, has speed computer in increase the as important As yDvlpeto Computer of Development by Structures Domain Protein Num. Technology rwho aaMatched Data of Growth 1000 1500 2000 2500 3000 3500 4000 4500 500 9018 901995 1990 1985 1980 0 9918 9318 9718 9119 1995 1993 1991 1989 1987 1985 1983 1981 1979 Hosts Internet 0 20 40 60 80 100 120 140

CPU Instruction Time (ns)

15 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu iifraisi born! is Bioinformatics cuts fFn Drablos) Finn of (courtesy

16 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Cartoon Weber

17 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • Redundancy Sequence Genomic Pathways • into grouped are Genes • Multiple Have May Gene Single genes • similar many has • the Have Sequences Different • oeua Biology Molecular similarities?..... Howdowefindthe Code Genetic the to due Functions Structure Same eudnyand Redundancy h hrce of Character The Information: Multiplicity euaoysystems regulatory levels expression functions genes Integrative ↔ structures ↔ eois- pathways ↔ ↔ ↔ …. ↔

18 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu u iifraishsanew a has Bioinformatics But • of Because • tl fcalculation... of style possible become calculations new , in improvement and data in increase ◊ w Paradigms Two cetfcComputing Scientific e aaimfor Paradigm New Biology • •Physics ◊ ◊ ◊ ◊ ◊ ◊ ewrs fdrtd “federated” networks, repressor plastocyanin~ colicin~ ~ globin relationships unexpected discovering and information Classifying CPU Supercomputer, Trajectory Rocket of Determination Exact principles physical on based Prediction

19 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu idn Patterns Finding • Comparison String Text • • ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ Datamining Clustering Machine / AI grep Vista, Alta Statistics Significance Alignment 1D Search Text betDB Object Building, eea ye f“ of Types General Querying in Bioinformatics hsclSimulation Physical • • ◊ ◊ ◊ ◊ ◊ ◊ ◊ Informatics Simulation Numerical Electrostatics Mechanics Newtonian recognition) (Visision, 3D and Comparison Volumes) (Surfaces, Graphics Robotics ”

20 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu ulctosi h Genome the in Duplications • in Repeats Characterizing • Genomic in Genes Finding • eoi DNA Genomic DNA ◊ ◊ ◊ ◊ ◊ Patterns Statistics promotors exons introns iifraisTpc -- Topics Bioinformatics eoeSequence Genome

21 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu utpeAinetand Alignment Multiple • Alignment Sequence • osnu Patterns Consensus ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ Motifs Profiles HMMs, Comparisons Transitive representation consensus a in result the fuse then and one sequence than more align to How matrices scoring substitution acid Amino FASTA) (BLAST, speed increase to Hashing Alignment Suboptimal Alignment Global vs Local Programming Dynamic optimally via strings two align to How gaps matching, string non-exact rti Sequence Protein crn cee and schemes Scoring • acigstatistics Matching Bioinformatics ◊ ◊ ◊ ◊ o opeiySequences Complexity Low dist.) val. (extreme Distributions Score e-value)? an (or P-value A significant or statistically alignment is given match a if tell to How ois-- Topics

22 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu eodr Structure Secondary • “Prediction” ◊ ◊ ◊ ◊ ◊ Bioinformatics tutr Prediction Structure Secondary Assessing finding TM-helix Statistics Simple Alg. Genetic Networks, Neural Propensities via eune/ Sequence Structure ois-- Topics eaino euneSmlrt to Similarity Sequence of Relation • Prediction Function • Prediction Structure Tertiary • tutrlSimilarity Structural ◊ ◊ ◊ ◊ ciest identification site Active initio Ab Threading Recognition Fold

23 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu aclto fVlm and Volume of Calculation • and Geometry Protein Basic • Surface Fitting Least-Squares ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ akn Measurement Packing Matching Surface as Design Drug and Docking area an calculate to How solid a represent to How plane a represent to How Graphics Molecular structures 2 of fit LSQ Rotations Axes, Angles, Distances, acltn ei xsi 3D in axis helix a Calculating • i itn line a fitting via ois- Structures -- Topics tutrlAlignment Structural • ◊ ◊ ◊ ◊ odLibrary Fold Hashing Matrices, Distance Approaches: Other do? to what sequences, unlike converge, not does DP structure. 3D of basis the on sequences Aligning

24 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu rti Units? Protein • Database Relational • Concepts ◊ ◊ ◊ ◊ ◊ ◊ ◊ ahas functions? pathways, motions, folds, classified: How biological information? of units the are What Cross-tabulation Reports and Forms Normalization Tables, Joining indexes reports, transactions, forms, views, OODBMS, SQL, Keys Foreign Keys, ois oue,domains modules, motifs, • structure sequence, • (/dbm) Referencing Array • "where" as Join Natural • eeto ncosproduct cross on selection h isProblem Bias The • Trees and Clustering • ◊ ◊ ◊ ◊ ◊ weighting sequence implications Evolutionary Methods Other clustering Basic asmn,Maximum Parsimony, • linkage multiple • single-linkage • •UPGMA likelihood Databases ois-- Topics

25 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu h eoi s Single- vs. Genomic The • and Classification Function • referencing cross scale Large • Analysis Expression • oeuePerspective Orthologs information of ◊ ◊ ◊ dniyn euaoyRegions Regulatory Identifying differences Measuring clustering Courses Time ois- Genomics -- Topics • Trees Genome • Genomics Structural • eoeComparisons Genome • ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ukSrcuePrediction Structure Bulk folds common & shared Genomes, in Folds proteins interacting of Identification Genomes from Trees Annotation Genome Analysis Words Frequent Large-scale pathways Families, Ortholog

26 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu oeua Simulation Molecular • ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ nryMinimization MC & Dynamics Molecular time? over changes structure How Springs as Bonds Forces VDW Electrostatics functions energy potential interactions, Basic Geometry o omauetechange the measure to How • navco (gradient) vector a in -> ois- Simulation -- Topics Energy -> Forces atc oesand Models Lattice • Equation Poisson-Boltzman • Density Number • Sets Parameter • Simplification

27 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Bioinformatics Schematic

28 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu necessary…. really Not Learn You’ll What Today Know to Need eusv ecn Parser Descent Recursive a Write Function, Hashing a Design Equation, Poisson-Boltzman Distribution Value P-valueof.01andanExtreme a (3D), Matrices Rotation of Energy, (grad) Derivative the is Force a3Dvector scores), test (of Distribution Bell-shaped a Deviation, Standard of Calculation ahBiology Math Background htceoie are chemokines what negative, gram is coli E. metazoa, a is worm a does, GroEL What families protein zone, twilight sequence packed, tightly are Proteins ATP nucleus, the helix, alpha- RNA, DNA,

29 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu h N Computer DNA The • Simulation Pathway Metabolic • Determination Structure for Methods Sampling • Gibb's Using Discovery Motif • Libraries Digital • ◊ ◊ ◊ ◊ M tutr Determination Structure NMR Crystallography Computational literature biological for bases Knowledge Comparison Textual and Search Bibliographic Automated itneGeometry Distance • Refinement • r hyo rntThey Aren’t or They Are Bioinformatics (#1) ?

30 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • • • • • (NO) (YES) (NO?) (YES) (YES?) ◊ ◊ ◊ ◊ M tutr Determination Structure NMR Crystallography Computational literature biological for bases Knowledge Comparison Textual and Search Bibliographic Automated • Refinement • Bioinformatics (YES) h N Computer DNA The eaoi aha Simulation Pathway Metabolic Determination Structure for Methods Sampling Gibb's Using Discovery Motif r hyo rntThey Aren’t or They Are iia Libraries Digital itneGeometry Distance #,Answers) (#1, ?

31 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu ikg Analysis Linkage • Methods Sequencing Genomic • Organisms of of Modeling • forensics in methods DNA • inspection sequence by identification Gene • ◊ ◊ ◊ ◊ ◊ ikn pcfcgnst aiu traits various to genes specific Linking mapping genetic and Physical Contigs Assembling Modeling Ecological sites splice of Prediction r hyo rntThey Aren’t or They Are Bioinformatics (#2) ?

32 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • • • • • (YES) (NO?) (NO) (YES) (YES) ◊ ◊ ◊ ◊ ◊ ikn pcfcgnst aiu traits various to genes specific Linking mapping genetic and Physical Contigs Assembling Modeling Ecological sites splice of Prediction Bioinformatics oeigo ouain fOrganisms of Populations of Modeling ikg Analysis Linkage Methods Sequencing Genomic forensics in methods DNA inspection sequence by identification Gene r hyo rntThey Aren’t or They Are #,Answers) (#2, ?

33 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu optrzdDanssbsdo eei Analysis Genetic on based Diagnosis Computerized • Non- on Based Phylogenies of Determination • modeling Homology • Simulations Life Artificial • Processing Image Radiological • prediction structure RNA • (Pedigrees) Characteristics Organism molecular sequences in Identification ◊ ◊ ◊ eei loihsi oeua biology molecular in Algorithms Genetic Security Computer / Artificial human) (visible Human for Representations Computational r hyo rntThey Aren’t or They Are Bioinformatics (#3) ?

34 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • • • • • • nlss(Pedigrees) Analysis (NO) Characteristics Organism molecular (NO) (YES) (NO) (NO) sequences in Identification (YES) ◊ ◊ ◊ (NO?) Security Computer / Immunology Artificial human) (visible Anatomy Human for Representations Computational Bioinformatics optrzdDanssbsdo Genetic on based Diagnosis Computerized Non- on Based Phylogenies of Determination Simulations Life Artificial Processing Image Radiological oooymodeling Homology prediction structure RNA eei loihsi oeua biology molecular in Algorithms Genetic r hyo rntThey Aren’t or They Are #,Answers) (#3, ?

35 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu okn,SrcueModeling Structure Docking, • Inhibitors (Function) Designing Molecules Other • Bind Structures How Understanding • Fo ett ih,fgrsaatdfo le ru okn aea cip,DsnNRGopWbpg tSrps n from and Scripps, at page Web Group NMR Dyson Scripps, at Center). Page Theory Docking Cornell Group at Olsen Page from Chemistry adapted Computational figures right, to left (From ao plcto I: Application Major einn Drugs Designing

36 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu ua s os s Yeast vs. Mouse vs. Human Organisms Different • in Ones Similar Find • eksSnrm 040MKX90 .e1 C2L61 rbbecpe transporter protein copper resistance Probable Sulphite kinase protein Serine/threoinine L36317 X67787 channel chloride Voltage-gated U07163 Methionine inhibitor CCC2 FZF1 dissociation GDP Z23117 2.1e-17 1.1e-20 IPL1 L26505 permease S69371 Sulfate 2.0e-18 GEF1 X69208 X51630 MET30 IPP-5-phosphatase 7.9e-31 Putative kinase 1.7e-34 GDI1 X82013 protein MNK WT1 M58051 Serine/threonine protein 309400 194070 FGFR3 2.1e-42 regulator Z25884 Z47047 Inhibitory 100800 L13385 SUL1 YIL002C M21307 CLC1 Helicase LIS1 X78121 7.2e-38 kinase 160800 1.2e-47 M33779 PI3 transporter 247200 copper YPK1 Probable CHM U22341 U14528 protein 303100 M88162 5.4e-53 IRA2 U31331 resistance dismutase Superoxide Metal L36317 DTD OCRL 2.0e-46 SGS1 kinase 222600 309000 L19268 2.6e-119 TEL1 Glycerol transporter J03279 L35237 ABC CCC2 Peroxisomal M89914 2.8e-90 5.9e-161 DM U39817 X69049 Gene SOD1 YCF1 Yeast 160900 NF1 U17065 U26455 1.3e-167 Syndrome 162200 BLM U11700 2.0e-58 Menkes GUT1 GenBank 210900 Achondroplasia ATM 1.8e-129 PXA1 Tumor WND M28668 Wilms 208900 K00065 3.4e-107 Yeast Disease 277900 Thomsen CFTR protein SOD1 L13943 Lissencephaly repair 219700 BLASTX DNA Dysplasia protein 105400 Z21876 Diastrophic repair DNA Choroideremia GK GenBank ALD U07187 1 307030 Human Type 300100 M84170 Neurofibromatosis, Syndrome # Lowe MIM MLH1 Dystrophy Myotonic Sclerosis 6.3e-196 MSH2 Lateral Amyotrophic 9.2e-261 Telangiectasia Ataxia X-linked U07418 Adrenoleukodystrophy, Syndrome U03911 Bloom Deficiency MLH1 Kinase Glycerol 120436 MSH2 Disease Wilson 120436 Fibrosis Cystic Colon Non-polyposis Cancer Hereditary Colon Non-polyposis Hereditary Cloned Disease Proteins Positionally Human cerevisiae Between S. Date and to Genes Matches Human Similarity Sequence Best Scinfo CIDsaeGnsDtbs erdcdBelow.) Reproduced Database Genes Disease NCBI from (Section ◊ airt oEps nlatter! on Expts. do to Easier idn Homologues Finding ao plcto II: Application Major eeAc o -au eeAc o Description for Acc# Gene P-value for Acc# Gene ua DAYatcDNA Yeast cDNA Human

37 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • parts: two has Comparison Comparison • Structure for Problems Analogous • Scoring and Comparison Sequence thing another • to thing one Cross-Referencing, • nertdPresentation Integrated ◊ ◊ ◊ Assessing (2) Optimally (1) cr naUiomFramework Uniform a in Score Structures Align Sequences Align idn ooous( Homologues Finding ao plcto II: Application Major Aligning Significance niist e Comparison a get to entities 2 fti cr nagiven a in score this of cont Context Score .)

38 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu aaae,Statistics Databases, • and Organisms Compare • a of Occurrence Overall • dpe rmGnQi e ae adrGop EBI) Group, Sander Page, Web GeneQuiz from Synechocystis, adapted v. yeast figures, (Clock Tissues Genome the in Feature Certain ◊ ◊ vrl eoeCharacterization Genome Overall omlTissues Normal vs Cancerous in levels Expression Yeast in kinases many how e.g. ao plcto I|I: Application Major

39 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Simplfying 2345671 1011121314151617181920… 89 357811 1213 891011 1415… 234567 1 (human) ahas &c Pathways, eoe ihFolds, with Genomes ( .pallidum T. ) ~100000 ~1000 ~1000 genes genes folds

40 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Organisms Resolution Different? Structural tWhat At ( (human) .pallidum T. Are 0111 141516 1213 1011 171819 20 … 9 8 7 6 5 4 3 2 1 357811 1213 891011 1415… 234567 1 ) person plant m1Å 1m od(Ig) fold protein 100Å tutr ( structure super-secondary αβαβ,ααα ββ,ΤΜ−ΤΜ, ) 10Å Relevance Ptoe nyfolds only (Pathogen spsil targets) possible as Practical strand Drug helix (C,H,O...) individual atom

41 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu