What is ? What is bioinformatics?

•the analysis and interpretation of nucleotide and an interdisciplinary field at the interface of the sequences and structures computational and life sciences

“The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. “

National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/

What is bioinformatics? What is bioinformatics?

•the analysis and interpretation of nucleotide and •the analysis and interpretation of nucleotide and protein sequences and structures protein sequences and structures • development of algorithms and software to support • development of algorithms and software to support the acquisition … of biomolecular data the acquisition … of biomolecular data •the development of software that enables efficient access and management of biomolecular information

1 The Origins of Bioinformatics stems from parallel revolutions in biology and computing Amino acid Turing designs a stored At the beginning of World War II (1939‐1944): Sanger sequences insulin. program computer

•The shared program computer had not yet been 1950 Edsac: 1st stored Stein, Moore, Spackman: invented, and there were no programming program computer automatic amino acid languages, databases, or computer networks. analyzer •The relationship between genes and , the molecular basis of genes, the structure of DNA and the genetic code were all unknown.

Trends in Biochem. Sci, 99

Automatic recording apparatus used in the chromatographic analysis of mixtures of amino acids EDSAC: The first stored‐program computer.

Stein, Moore, Spackman, 1958

2 The Origins of Computational Biology

Grace Murray Hopper Amino acid sequencing Turing designs a stored finds the first bug Sanger sequences insulin. program computer

1950 Edsac: 1st stored Stein, Moore, Spackman: program computer automatic amino acid analyzer Grace Murray Hopper co‐invents COBOL and finds the first bug Rear Admiral Grace Hopper over‐seeing her team of programmers, Philadelphia Inquirer, 1957

The Origins of Computational Biology

Amino acid sequencing Turing designs a stored Sanger sequences insulin. program computer

1950 Edsac: 1st stored Stein, Moore, Spackman: program computer automatic amino acid analyzer Grace Murray Hopper co‐invents COBOL and finds the first bug

Grace Murray Hopper Discovery of DNA structure, Watson, Crick, Franklin

3 The Origins of Computational Biology The helical structure of DNA

Determination of the Fortran, Basic, LISP James Watson and Francis Crick genetic code 1960

Rosalind Franklin

1970

The Origins of Computational Biology The genetic code The genetic code is a degenerate, non‐overlapping, triplet code. Crick, Barnett, Brenner, and Watts‐Tobin, 1961 Determination of the Fortran, Basic, LISP genetic code 1960 Determination of On going protein sequen‐ the genetic code cing, Dayhoff publishes the Protein Atlas

1970

4 Atlas of Protein Sequence & Structure The Origins of Computational Biology 1965 ‐ 1978

Determination of the Fortran, Basic, LISP genetic code 1960 On going protein sequen‐ cing, Dayhoff publishes the Protein Atlas Arpanet

Margaret Dayhoff PhD in Chemistry, 47 Watson Computing Lab Fellow 47 ‐ 48 1970

The ARPAnet is established, The Origins of Computational Biology precursor to the Internet.

Determination of the Fortran, Basic, LISP genetic code Packet switching is invented, which supports flexible, multi‐node network 1960 On going protein sequen‐ cing, Dayhoff publishes the Protein Atlas Arpanet Interface Message Processor (IMP) network is constructed, linking 4, and The is then 15 nodes established. 1970

5 The Origins of Computational Biology Protein Data Bank (PDB)

Growing collection of X‐ray diffraction protein 1970 structure data Sanger‐Coulson sequencing TCP/IP Maxam‐Gilbert sequencing Development of molecular graphics display for Internet 3D visualization

SEARCH program supports remote access First royal email Gilbert, Sanger win Nobel USENET newsgroups Prize 1980

Beginnings of molecular evolution The Origins of Computational Biology

early mammal 1970 …atgccaggtcccagtga… Sanger‐Coulson sequencing TCP/IP Maxam‐Gilbert sequencing Internet

…atgcaaggagtccgagc……atgcgaggtccagtagt… …atgcaaggagtccgagc… …atgcgaggtccagtagt… First royal email Gilbert, Sanger win Nobel human rat mouse USENET newsgroups …atgcaaggagtcggagc… …atgcgaggtccaacagt… …atgggaggtccagtagt… Prize 1980 …atgcaaggagtcgcagagc… …atgcgaggtctcacagtgt… …atgggaggtctcccagtgt…

6 Sequence similarity → structural similarity Sequence similarity → structural similarity

Structure? Structure?

…vkltpegtr_wgghpldekflske… …vkltpegtr_wgghpldekflske… …vhltpettrgwgghmldekeiske… Estimate protein structure from a related protein with known …vhltpettrgwgghmldekeiske… structure and similar sequence.

Estimate protein structure from a related protein with known structure and similar sequence.

Sequence similarity → functional similarity The Origins of Computational Biology

? Congress establishes Genbank 1990 ...atgcgaggattctcgaagacagcga… Human Project ...atgcaaggagtc_cgttgacagagc… begins (DOE) Basic local alignment search tool (BLAST)

28

7 The Origins of Computational Biology BLAST Congress establishes Genbank 1990 Information begins (DOE) superhighway

Basic local alignment search World Wide Web, tool (BLAST) NCSA Mosaic

ome

The internet we know and love The Origins of Computational Biology

Al Gore Creates Bill to Fund "Information Congress establishes Superhighway" Genbank 1990 Human Genome Project Information begins (DOE) superhighway Tim Berners‐Lee proposes a design for information sharing that becomes the World Wide Web Basic local alignment search World Wide Web, tool (BLAST) GenBank goes online. NCSA Mosaic

The first web browser Pizza Hut goes on line

8 Genbank BLAST

National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/

BLAST The Origins of Computational Biology

Google H. Influenzae: 1st whole 1995 genome sequence Yeast –1st eukaryote 1998 C. elegans ‐ 1st animal IBM’s Blue Gene architecture for Fly genome 1999 protein modeling

A.thaliana – 1st plant 2000

9 Sequence Assembly Shotgun sequencing

DNA molecule Limits of gel electrophoresis: ~ 500bp in one “read” To sequence more than 500 bp: shotgun reads Sequence 500bp fragments separately Combine computationally using sequence comparison Sequence assembly

contigs

Sequence Assembly The Origins of Computational Biology

2001 Wikipedia, Captcha

Draft human genome 2002 Mozilla Foundation: open source, W3P ABI Capillary sequencer patent policy 2003 Chimp genome, aligned All-vs-allAlignmentDetermination sequence of sequences of consensus similarity with sequence comparisonhigh sequence with human genome identity

10 How to interpret whole genome How to interpret whole genome sequence sequence •Where are the genes?  Where are the genes? •When and where are those genes expressed? •When and where are those genes expressed? •What proteins do they ? •What proteins do they encode? •What do they do? •What do they do? – Molecular function? – Molecular function? – Biological pathway or process? – Biological pathway or process?

atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc DNA PATTERNS IN THE E.coli lexA GENE tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac Promotor sequences cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc Repressor binding site cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg 1 gaattcgataaatcctggtttattgtgcagtttatggttccaaaatcgccttttgctgt atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg TTCCAA -35 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac 61 atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca -10 TATACT mRNAstart+ +10GGGGG Ribosomal binding site cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc 121 ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc 181 cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac 241 tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg 301 gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt 361 cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct 421 cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg ATG…TAA gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt 481 atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg open reading frame cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg 541 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg 601 tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac 661 ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca 721 agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg 781 gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct 841 tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt 901 atagcccggtttatttgggcggcgtggcggttggcgcaacggcggaccagct tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg

11 DNA microarrays The Origins of Computational Biology

cgtaacgctat 5 more yeast 2004

2005

2006

Align genomes to confirm gene predictions Global alignment of upstream sequences and identify regulatory regions to identify regulatory regions

RRPE PAC Salzberg, Nature, 2003

Kellis et al, Nature, 03

12 The Origins of Computational Biology The Origins of Computational Biology

5 more yeast genomes 2004 5 more yeast genomes 2004

454 pyrosequencer Rosetta@home 454 pyrosequencer Rosetta@home Hi‐thruput, short read Facebook, Twitter Hi‐thruput, short read Facebook, Twitter sequencing sequencing 2005 2005 12 Drosophila genomes US cyberworm attacks 12 Drosophila genomes US cyberworm attacks Iranian centrifuges Iranian centrifuges

2006 2006

Next‐generation, short read sequencing Next‐generation. short read sequencing

Advantages Sanger Sequencing Illumina sequending •read lengths up to 1,000 bp •read lengths up to 36 bp •High throughput • accuracy 99.999% •error rates 1‐1.5% •Does not require PCR amplification •costs $500 per megabase •cost $2 per megabase •Accurate measures of abundance •Cheaper Disadvantages 454 sequencing • Short reads are unlikely to be unique. •read lengths 200‐300 bp • accuracy problem with homopolymers •Difficult to identify the origin of a given read •costs $60 per megabase • Particular challenge for genome assembly

13 Some next generation sequencing The Origins of Computational Biology applications

• Bacterial genomes Apple iPhone 1000 Genomes project 2007 •Sample diversity in a bacterial population (e.g., Estonia: First national your throat when you have strep) Human microbiome project. elections via Internet • Transcription: more accurate and quantitative compared with microarrays First tumor/normal 2008 Foldit: Crowd‐sourced •Medical diagnostics: sequence short genomic genome published protein folding game regions to identify mutations associated with disease Draft Neanderthal genome 2009

Metagenomics

•Sample communities of microbial organisms A crowd‐sourcing directly from their natural environments, game developed at CMU. bypassing the need for isolation and lab cultivation of individual species.

•Result: a collection of DNA fragments that characterize the organismal and functional diversity of the envirment

14 Metagenomics • Production‐scale plant fermenter What makes us human? •Fungal communities from the Arctic •Singapore indoor air filters • Yellowstone Obsidian Hot Spring •Human metabolic features‐ combo of human and • Fossil microbiome microbial traits •Human microbiome •Microbiota‐ microrganisms that live inside and on humans •Microbiome‐ the genomes of the microbial symbionts

The Origins of Computational Biology What is bioinformatics? 2010 Development of algorithms and software to support Social networking the acquisition and interpretation of biomolecular Chocolate (Theobroma topples regime in data cacao) genome Egypt • Acquisition: Microarray design, Sequence assembly 2011 • Interpretation: Sequence comparison, clustering of Crystal structure …solved microarrays, gene finding, phylogenetic profiling, … 3rd Generation sequencing: by protein folding Pac Bio, Ion Torrent game players, Nature Structural Biology 2012

15 What is bioinformatics? What is bioinformatics?

Development of software that enables efficient access The analysis and interpretation of nucleotide and and management of biomolecular information protein sequences and structures • Genbank •BLAST Application of all of these methods to address specific questions about specific biological systems •Protein visualization

Pairwise (global and local)

Multiple sequence alignment global local Sequence evolution models

Database searching

BLAST Substitution Sequence matrices statistics

Evolutionary tree reconstruction

16