What is bioinformatics? What is bioinformatics?
•the analysis and interpretation of nucleotide and an interdisciplinary field at the interface of the protein sequences and structures computational and life sciences
“The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. “
National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
What is bioinformatics? What is bioinformatics?
•the analysis and interpretation of nucleotide and •the analysis and interpretation of nucleotide and protein sequences and structures protein sequences and structures • development of algorithms and software to support • development of algorithms and software to support the acquisition … of biomolecular data the acquisition … of biomolecular data •the development of software that enables efficient access and management of biomolecular information
1 The Origins of Computational Biology Bioinformatics stems from parallel revolutions in biology and computing Amino acid sequencing Turing designs a stored At the beginning of World War II (1939‐1944): Sanger sequences insulin. program computer
•The shared program computer had not yet been 1950 Edsac: 1st stored Stein, Moore, Spackman: invented, and there were no programming program computer automatic amino acid languages, databases, or computer networks. analyzer •The relationship between genes and proteins, the molecular basis of genes, the structure of DNA and the genetic code were all unknown.
Trends in Biochem. Sci, 99
Automatic recording apparatus used in the chromatographic analysis of mixtures of amino acids EDSAC: The first stored‐program computer.
Stein, Moore, Spackman, 1958
2 The Origins of Computational Biology
Grace Murray Hopper Amino acid sequencing Turing designs a stored finds the first bug Sanger sequences insulin. program computer
1950 Edsac: 1st stored Stein, Moore, Spackman: program computer automatic amino acid analyzer Grace Murray Hopper co‐invents COBOL and finds the first bug Rear Admiral Grace Hopper over‐seeing her team of programmers, Philadelphia Inquirer, 1957
The Origins of Computational Biology
Amino acid sequencing Turing designs a stored Sanger sequences insulin. program computer
1950 Edsac: 1st stored Stein, Moore, Spackman: program computer automatic amino acid analyzer Grace Murray Hopper co‐invents COBOL and finds the first bug
Grace Murray Hopper Discovery of DNA structure, Watson, Crick, Franklin
3 The Origins of Computational Biology The helical structure of DNA
Determination of the Fortran, Basic, LISP James Watson and Francis Crick genetic code 1960
Rosalind Franklin
1970
The Origins of Computational Biology The genetic code The genetic code is a degenerate, non‐overlapping, triplet code. Crick, Barnett, Brenner, and Watts‐Tobin, 1961 Determination of the Fortran, Basic, LISP genetic code 1960 Determination of On going protein sequen‐ the genetic code cing, Dayhoff publishes the Protein Atlas
1970
4 Atlas of Protein Sequence & Structure The Origins of Computational Biology 1965 ‐ 1978
Determination of the Fortran, Basic, LISP genetic code 1960 On going protein sequen‐ cing, Dayhoff publishes the Protein Atlas Arpanet
Margaret Dayhoff PhD in Chemistry, 47 Watson Computing Lab Fellow 47 ‐ 48 1970
The ARPAnet is established, The Origins of Computational Biology precursor to the Internet.
Determination of the Fortran, Basic, LISP genetic code Packet switching is invented, which supports flexible, multi‐node network 1960 On going protein sequen‐ cing, Dayhoff publishes the Protein Atlas Arpanet Interface Message Processor (IMP) network is constructed, linking 4, and The Protein Data Bank is then 15 nodes established. 1970
5 The Origins of Computational Biology Protein Data Bank (PDB)
Growing collection of X‐ray diffraction protein 1970 structure data Sanger‐Coulson sequencing TCP/IP Maxam‐Gilbert sequencing Development of molecular graphics display for Internet 3D visualization
SEARCH program supports remote access First royal email Gilbert, Sanger win Nobel USENET newsgroups Prize 1980
Beginnings of molecular evolution The Origins of Computational Biology
early mammal 1970 …atgccaggtcccagtga… Sanger‐Coulson sequencing TCP/IP Maxam‐Gilbert sequencing Internet
…atgcaaggagtccgagc……atgcgaggtccagtagt… …atgcaaggagtccgagc… …atgcgaggtccagtagt… First royal email Gilbert, Sanger win Nobel human rat mouse USENET newsgroups …atgcaaggagtcggagc… …atgcgaggtccaacagt… …atgggaggtccagtagt… Prize 1980 …atgcaaggagtcgcagagc… …atgcgaggtctcacagtgt… …atgggaggtctcccagtgt…
6 Sequence similarity → structural similarity Sequence similarity → structural similarity
Structure? Structure?
…vkltpegtr_wgghpldekflske… …vkltpegtr_wgghpldekflske… …vhltpettrgwgghmldekeiske… Estimate protein structure from a related protein with known …vhltpettrgwgghmldekeiske… structure and similar sequence.
Estimate protein structure from a related protein with known structure and similar sequence.
Sequence similarity → functional similarity The Origins of Computational Biology
? Congress establishes Genbank 1990 ...atgcgaggattctcgaagacagcga… Human Genome Project ...atgcaaggagtc_cgttgacagagc… begins (DOE) Basic local alignment search tool (BLAST)
28
7 The Origins of Computational Biology BLAST Congress establishes Genbank 1990 Human Genome Project Information begins (DOE) superhighway
Basic local alignment search World Wide Web, tool (BLAST) NCSA Mosaic
ome
The internet we know and love The Origins of Computational Biology
Al Gore Creates Bill to Fund "Information Congress establishes Superhighway" Genbank 1990 Human Genome Project Information begins (DOE) superhighway Tim Berners‐Lee proposes a design for information sharing that becomes the World Wide Web Basic local alignment search World Wide Web, tool (BLAST) GenBank goes online. NCSA Mosaic
The first web browser Pizza Hut goes on line
8 Genbank BLAST
National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
BLAST The Origins of Computational Biology
Google H. Influenzae: 1st whole 1995 genome sequence Yeast –1st eukaryote 1998 C. elegans ‐ 1st animal IBM’s Blue Gene architecture for Fly genome 1999 protein modeling
A.thaliana – 1st plant 2000
9 Sequence Assembly Shotgun sequencing
DNA molecule Limits of gel electrophoresis: ~ 500bp in one “read” To sequence more than 500 bp: shotgun reads Sequence 500bp fragments separately Combine computationally using sequence comparison Sequence assembly
contigs
Sequence Assembly The Origins of Computational Biology
2001 Wikipedia, Captcha
Draft human genome 2002 Mozilla Foundation: open source, W3P ABI Capillary sequencer patent policy 2003 Chimp genome, aligned All-vs-allAlignmentDetermination sequence of sequences of consensus similarity with sequence comparisonhigh sequence with human genome identity
10 How to interpret whole genome How to interpret whole genome sequence sequence •Where are the genes? Where are the genes? •When and where are those genes expressed? •When and where are those genes expressed? •What proteins do they encode? •What proteins do they encode? •What do they do? •What do they do? – Molecular function? – Molecular function? – Biological pathway or process? – Biological pathway or process?
atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc DNA PATTERNS IN THE E.coli lexA GENE tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac Promotor sequences cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc Repressor binding site cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg 1 gaattcgataaatcctggtttattgtgcagtttatggttccaaaatcgccttttgctgt atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg TTCCAA -35 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac 61 atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca -10 TATACT mRNAstart+ +10GGGGG Ribosomal binding site cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc 121 ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc 181 cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac 241 tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg 301 gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt 361 cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct 421 cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg ATG…TAA gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt 481 atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg open reading frame cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg 541 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg 601 tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac 661 ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca 721 agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg 781 gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct 841 tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt 901 atagcccggtttatttgggcggcgtggcggttggcgcaacggcggaccagct tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg
11 DNA microarrays The Origins of Computational Biology
cgtaacgctat 5 more yeast genomes 2004
2005
2006
Align genomes to confirm gene predictions Global alignment of upstream sequences and identify regulatory regions to identify regulatory regions
RRPE PAC Salzberg, Nature, 2003
Kellis et al, Nature, 03
12 The Origins of Computational Biology The Origins of Computational Biology
5 more yeast genomes 2004 5 more yeast genomes 2004
454 pyrosequencer Rosetta@home 454 pyrosequencer Rosetta@home Hi‐thruput, short read Facebook, Twitter Hi‐thruput, short read Facebook, Twitter sequencing sequencing 2005 2005 12 Drosophila genomes US cyberworm attacks 12 Drosophila genomes US cyberworm attacks Iranian centrifuges Iranian centrifuges
2006 2006
Next‐generation, short read sequencing Next‐generation. short read sequencing
Advantages Sanger Sequencing Illumina sequending •read lengths up to 1,000 bp •read lengths up to 36 bp •High throughput • accuracy 99.999% •error rates 1‐1.5% •Does not require PCR amplification •costs $500 per megabase •cost $2 per megabase •Accurate measures of abundance •Cheaper Disadvantages 454 sequencing • Short reads are unlikely to be unique. •read lengths 200‐300 bp • accuracy problem with homopolymers •Difficult to identify the origin of a given read •costs $60 per megabase • Particular challenge for genome assembly
13 Some next generation sequencing The Origins of Computational Biology applications
• Bacterial genomes Apple iPhone 1000 Genomes project 2007 •Sample diversity in a bacterial population (e.g., Estonia: First national your throat when you have strep) Human microbiome project. elections via Internet • Transcription: more accurate and quantitative compared with microarrays First tumor/normal 2008 Foldit: Crowd‐sourced •Medical diagnostics: sequence short genomic genome published protein folding game regions to identify mutations associated with disease Draft Neanderthal genome 2009
Metagenomics
•Sample communities of microbial organisms A crowd‐sourcing directly from their natural environments, game developed at CMU. bypassing the need for isolation and lab cultivation of individual species.
•Result: a collection of DNA fragments that characterize the organismal and functional diversity of the envirment
14 Metagenomics • Production‐scale plant fermenter What makes us human? •Fungal communities from the Arctic •Singapore indoor air filters • Yellowstone Obsidian Hot Spring •Human metabolic features‐ combo of human and • Fossil microbiome microbial traits •Human microbiome •Microbiota‐ microrganisms that live inside and on humans •Microbiome‐ the genomes of the microbial symbionts
The Origins of Computational Biology What is bioinformatics? 2010 Development of algorithms and software to support Social networking the acquisition and interpretation of biomolecular Chocolate (Theobroma topples regime in data cacao) genome Egypt • Acquisition: Microarray design, Sequence assembly 2011 • Interpretation: Sequence comparison, clustering of Crystal structure …solved microarrays, gene finding, phylogenetic profiling, … 3rd Generation sequencing: by protein folding Pac Bio, Ion Torrent game players, Nature Structural Biology 2012
15 What is bioinformatics? What is bioinformatics?
Development of software that enables efficient access The analysis and interpretation of nucleotide and and management of biomolecular information protein sequences and structures • Genbank •BLAST Application of all of these methods to address specific questions about specific biological systems •Protein visualization
Pairwise sequence alignment (global and local)
Multiple sequence alignment global local Sequence evolution models
Database searching
BLAST Substitution Sequence matrices statistics
Evolutionary tree reconstruction
16