Course overview Computational and Molecular Biology Course web page: http://www.cs.cmu.edu/~durand/03-711 Reading lists: http://www.cs.cmu.edu/~durand/03-711/reading.html Instructor: Dannie Durand Materials available on Electronic Reserves: TAs: Philip Davidson, Han Lai http://www.library.cmu.edu/ Fall 2011

Recommended text books Course overview (No required textbook) Syllabus: http://www.cs.cmu.edu/~durand/03-711/syllabus.html – Reading assignments will be posted on this page. • Some material is password protected; • Id: compbio, passwd: genomics – You can download homework, solution sets, and class notes from this page.

1 Electronic reserves: Course overview http://www.library.cmu.edu/ Click on: Materials available on Electronic Reserves: Course Reserves http://www.library.cmu.edu/ Select: Look up items on reserve by in- structor Type: “Durand” Select: 03-711

Electronic reserves: Electronic reserves: http://www.library.cmu.edu/ http://www.library.cmu.edu/

Click on: Click on: Course Reserves Course Reserves Select: Look up items Select: Look up items on reserve by in- on reserve by in- structor structor Type: “Durand” Type: “Durand” Select: 03-711 Select: 03-711 Click on URL to download PDF

2 How to do well in this course Course overview • Come to class Coursework and policies: • Take notes http://www.cs.cmu.edu/~durand/03-711/policies.html • Come to office hours • Preparing for exams – Homework is more focused on working problems – Exams are more focused on concepts – Study from your notes as well as your homework

Outline

I speak quickly. • Origins of computational molecular biology My handwriting is terrible. Please interrupt and ask questions. • An overview of computational molecular biology

• Functional and computational Genomics.

3 The Origins of The Origins of Computational Biology

Turing, v. Neumann: specs for Turing, v. Neumann: specs for stored program computer stored program computer Sanger: peptide sequencing by First transistor Sanger: peptide sequencing by First transistor partition chromatography partition chromatography Edman: stepwise protein Edman: stepwise protein Edsac: 1st stored program Edsac: 1st stored program degradation 1950 degradation 1950 computer computer Sanger sequences insulin, Sanger sequences insulin, Grace Murray Hopper: First Grace Murray Hopper: First Discovery of DNA structure Discovery of DNA structure compiler compiler

Fortran Fortran Stein, Moore, Spackman: Stein, Moore, Spackman: automatic analyzer First integrated circuit automatic amino acid analyzer First integrated circuit Myoglobin 1960 Myoglobin 1960 Ribonuclease Ribonuclease Trends in Biochem. Sci, 99 Lysozyme Lysozyme Basic Basic

The Origins of Computational Biology

Turing, v. Neumann: specs for stored program computer Sanger: peptide sequencing by ENIAC partition chromatography First transistor Edman: stepwise protein EDSAC: 1st stored program Schematic of the automatic recording apparatus used in the degradation 1950 chromatographic analysis of mixtures of amino acids computer Sanger sequences insulin, Grace Murray Hopper: First Discovery of DNA structure compiler

Stein, Moore, Spackman: Fortran automatic amino acid analyzer Myoglobin First integrated circuit Ribonuclease 1960 Lysozyme

Basic Kresge, N. et al. J. Biol. Chem. 2005;280:e6

4 The Origins of Computational Biology Genbank doubles every 18 months

ARPANET 1970 1E11 Sanger-Coulson sequencing TCP/IP Maxam-Gilbert sequencing Internet

Gilbert, Sanger win Nobel Prize 1980 First royal email USENET newsgroups 1E6 Congress establishes Genbank Human Project begins 1990 World Wide Web, Gopher 1982 2009 GenBank goes online. NCSA Mosaic

Pizza Hut goes on line First whole genome First whole genome sequence 1995 sequence

Whole Genome Sequencing Highlights Whole Genome Sequencing

(A Eukarya-centric View) www.genomesonline.org

1995 H. influenzae – 1st whole genome sequence 1997 Yeast – 1st eukaryotic sequence 1998 Caenorhabditis elegans – 1st multicellular organism 2000 Fly, Arabidopsis thaliana –1st plant 2001 Human 2002 Mouse, Ciona intestinalis, 2003 Caenorhabditis briggsae, Neurospora Crassa 2004 Five more yeasts, silkworm, rat, C. merolae, tetraodon 2005 Dictyostelium, zebrafish, chimpanzee 2007 Twelve drosophila

5 Eukaryotes Next-generation short read sequencing

Sanger Sequencing Illumina sequending • read lengths up to 1,000 bp • read lengths up to 36 bp • accuracy 99.999% • error rates 1-1.5% • costs $0.50 per kilobase •cost $2 per megabase

454 sequencing Prokaryotes • read lengths 200-300 bp • accuracy problem with homopolymers • costs $60 per megabase

Next-generation short read sequencing Next-generation short read sequencing

Advantages Applications • High throughput • Sequencing bacterial genomes •Does not require PCR amplification •Accurate measures of abundance • Resequencing • identify mutations associated with drug resistance; • identify mutations in cancerous tissues. Disadvantages • genetic diagnostics • Short reads are unlikely to be unique. • •Difficult to identify the origin of a given read • Particular challenge for genome assembly

6 Why sequence data is so powerful: • Production-scale plant fermenter Sequences are related! • Fungal communities from the Arctic • Singapore indoor air filters early mammal …atgccaggtcccagtga… • Yellowstone Obsidian Hot Spring • Many fecal • Fossil

…atgcaaggagtccgagc……atgcgaggtccagtagt… …atgcaaggagtccgagc… …atgcgaggtccagtagt…

human rat mouse …atgcaaggagtcggagc… …atgcgaggtccaacagt… …atgggaggtccagtagt…

Sequence similarity → functional similarity Sequence similarity → structural similarity

? Structure?

...atgcgaggattctcgaagacagcga… ...atgcaaggagtc_cgttgacagagc… …vkltpegtr_wgghpldekflske…

…vhltpettrgwgghmldekeiske…

Estimate protein structure from a related protein with known structure and similar sequence.

28

7 Sequence similarity → structural similarity Sequence Comparison

…atgcaaggagtcccagagcctgagctgactacgt… …atgcaagtcgtcccagtgccagaactccctacgt… …atgcgaggtctcccagtgtctgaactgactaagt… …accagtggtctccgagtggctgaactgactaaca… Structure? global pairwise alignment local pairwise alignment

vtfisll v..frrda.h ksevahrfkd lgeenfkalv… …mkwvtfisll flfssaysrg v..frrda.hkseva vtfisll v..frrea.h kseiahrfnd vgeehfiglv… …mkwvtfisll flfssaysrg v..frrea.h kseia …vkltpegtr_wgghpldekflske… vtfisll v..frrdt.y kseiahrfkd lgeqyfkglv… …~~wvtfisll flfssaysrg v..frrdt.y kseia …vhltpettrgwgghmldekeiske… Estimate protein structure from a vtlisfi lqrfardaeh kseiahrynd lkeetfkava… …mkwvtlisfi flfssatsrn lqrfardaeh kseia related protein with known global multiple alignment local multiple alignment structure and similar sequence. Applications • Database searching • RNA structure prediction • Evolutionary tree reconstruction • Gene finding • Sequence assembly….

atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga Reconstructing Evolutionary History cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc early mammal cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg …atgccagactccagtga… atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac …atgcgaggtccagtgt… ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg human rat mouse atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg …atgcaagagtcgagagc… …atgcgagtccgtagtgt… …atgggaggtccactgt… cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca …atgcaagagtcgagagc… ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg …atgcgagtccgtagtgt… agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct …atgggagtccccactgt… gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg

8 DNA PATTERNS IN THE E.coli lexA GENE Global alignment of upstream sequences

Promotor sequences Repressor binding site 1 gaattcgataaatcctggtttattgtgcagtttatggttccaaaatcgccttttgctgt TTCCAA -35 61 atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca -10 TATACT mRNAstart+ +10GGGGG Ribosomal binding site 121 ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga 181 cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc 241 tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc 301 gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac 361 cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc 421 cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg ATG…TAA 481 atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg open reading frame 541 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac 601 tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca 661 ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg RRPE PAC 721 agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct 781 gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt 841 tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg 901 atagcccggtttatttgggcggcgtggcggttggcgcaacggcggaccagct Kellis et al, Nature, 03

Sequence conservation RNA Secondary Structure

Salzberg, Nature, 2003

9 RNA Secondary Structure RNA Secondary Structure

CCGUGUACGUGUAACGGAUCUUUAUUCCC... CCGUGUACGUGUAACGGAUCUUUAUUCCC... CCGUGAACGUAUACUGCAGUUUUAGUGCG... CCGUGAACGUAUACUGCAGUUUUAGUGCG... GGCUCACGCUGUUCCGGAUUAUGAUUCCC GGCUCACGCUGUUCCGGAUUAUGAUUCCC CGCAGAAGCUCCACGCGUUUUCUGUACGA CGCAGAAGCUCCACGCGUUUUCUGUACGA

The Fantasy From genes to cells

Whole genome sequence TGAAATAAACAACCAGGCAGCAGTTATTAACACGGGAACATGGCGGCCGCAGCCTGGGCTCCCGCGGCGGCGGCGG… Cell Simulator Compiler

Cell Function Simulator

10 Functional and computational genomics Pairwise (global and local) • mRNA expression Multiple sequence • Splice variants alignment

• Protein structure Substitution matrices Database • Protein expression searching

• Sub-cellular localization global local BLAST • Protein-protein interactions Sequence statistics • Protein-DNA interactions Gene Finding – time permitting Evolutionary tree Protein structure prediction Design and interpretation of all of these reconstruction RNA structure prediction assays requires sequence analysis. Computational genomics…

11