Computational Genomics and Molecular Biology
Fall 2011
Instructor: Dannie Durand
TAs: Philip Davidson, Han Lai

Course overview
Course web page: http://www.cs.cmu.edu/~durand/03-711
Reading lists: http://www.cs.cmu.edu/~durand/03-711/reading.html
Materials available on Electronic Reserves: http://www.library.cmu.edu/
Recommended text books
(No required textbook)

Course overview
Syllabus: http://www.cs.cmu.edu/~durand/03-711/syllabus.html
– Reading assignments will be posted on this page.
  • Some material is password protected
  • Id: compbio, passwd: genomics
– You can download homework, solution sets, and class notes from this page.
Electronic reserves: http://www.library.cmu.edu/
Click on: Course Reserves
Select: Look up items on reserve by instructor
Type: “Durand”
Select: 03-711
Click on URL to download PDF
How to do well in this course
• Come to class
• Take notes
• Come to office hours
• Preparing for exams
  – Homework is more focused on working problems
  – Exams are more focused on concepts
  – Study from your notes as well as your homework

Course overview
Coursework and policies: http://www.cs.cmu.edu/~durand/03-711/policies.html
Outline
• Origins of computational molecular biology
• An overview of computational molecular biology
• Functional and computational genomics

I speak quickly. My handwriting is terrible. Please interrupt and ask questions.
The Origins of Computational Biology

Biology:
• Sanger: peptide sequencing by partition chromatography
• Edman: stepwise protein degradation
• Sanger sequences insulin
• Discovery of DNA structure
• Stein, Moore, Spackman: automatic amino acid analyzer
• Myoglobin
• Ribonuclease
• Lysozyme

Computing:
• Turing, v. Neumann: specs for stored program computer
• ENIAC
• First transistor
• EDSAC: 1st stored program computer
• Grace Murray Hopper: First compiler
• Fortran
• First integrated circuit
• BASIC

Year markers on the timeline: 1950, 1960
Source: Trends in Biochem. Sci, 99

[Figure: Schematic of the automatic recording apparatus used in the chromatographic analysis of mixtures of amino acids. Kresge, N. et al. J. Biol. Chem. 2005;280:e6]
The Origins of Computational Biology

Biology:
• Sanger-Coulson sequencing
• Maxam-Gilbert sequencing
• Gilbert, Sanger win Nobel Prize
• Congress establishes GenBank
• Human Genome Project begins
• GenBank goes online
• First whole genome sequence

Computing:
• ARPANET
• TCP/IP
• Internet
• First royal email
• USENET newsgroups
• World Wide Web, Gopher
• NCSA Mosaic
• Pizza Hut goes on line

Year markers on the timeline: 1970, 1980, 1990, 1995

GenBank doubles every 18 months
[Figure: GenBank growth plot. Axis ticks: 1E6 to 1E11; 1982 to 2009]
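The doubling claim can be checked with a little arithmetic. The sketch below assumes that the plot's tick labels pair up as 1E6 in 1982 and 1E11 in 2009; the slide text does not state this pairing or the axis units, so the result is only a rough consistency check.

```python
import math

def fold_growth(years, doubling_months=18):
    """Fold increase after `years` if the size doubles every `doubling_months`."""
    return 2 ** (12 * years / doubling_months)

def doubling_time_months(start_year, end_year, start_size, end_size):
    """Doubling time implied by two (year, size) points under exponential growth."""
    doublings = math.log2(end_size / start_size)
    return 12 * (end_year - start_year) / doublings

# Assumed reading of the plot: 1E6 in 1982, 1E11 in 2009.
IMPLIED_MONTHS = doubling_time_months(1982, 2009, 1e6, 1e11)

print(f"18-month doubling over 27 years: {fold_growth(27):,.0f}-fold")
print(f"Doubling time implied by the assumed endpoints: {IMPLIED_MONTHS:.1f} months")
```

Under that reading, the endpoints imply a doubling time of a little under 20 months, which is close to the 18 months quoted on the slide.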
Whole Genome Sequencing Highlights
(A Eukarya-centric View)

1995 H. influenzae – 1st whole genome sequence
1997 Yeast – 1st eukaryotic sequence
1998 Caenorhabditis elegans – 1st multicellular organism
2000 Fly, Arabidopsis thaliana – 1st plant
2001 Human
2002 Mouse, Ciona intestinalis
2003 Caenorhabditis briggsae, Neurospora crassa
2004 Five more yeasts, silkworm, rat, C. merolae, tetraodon
2005 Dictyostelium, zebrafish, chimpanzee
2007 Twelve drosophila genomes
…

Whole Genome Sequencing
www.genomesonline.org
[Figure panels from the Whole Genome Sequencing slide: Eukaryotes; Prokaryotes]

Next-generation short read sequencing

Sanger sequencing
• read lengths up to 1,000 bp
• accuracy 99.999%
• costs $0.50 per kilobase

454 sequencing
• read lengths 200-300 bp
• accuracy problem with homopolymers
• costs $60 per megabase

Illumina sequencing
• read lengths up to 36 bp
• error rates 1-1.5%
• costs $2 per megabase
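To compare the three platforms on a common scale, the quoted prices can be converted to dollars per megabase and applied to a genome. The following is a minimal sketch: the prices and read lengths come from the slide (using the upper read length for 454), while the 4.6 Mb genome size and 10x coverage are illustrative assumptions.

```python
import math

# Prices from the slide, normalized to dollars per megabase.
COST_PER_MEGABASE = {
    "Sanger": 0.50 * 1000,  # $0.50 per kilobase
    "454": 60.0,
    "Illumina": 2.0,
}
# Upper read lengths from the slide, in base pairs.
READ_LENGTH_BP = {"Sanger": 1000, "454": 300, "Illumina": 36}

def sequencing_cost(platform, genome_bp, coverage):
    """Dollar cost of sequencing `genome_bp` bases at the given fold coverage."""
    return COST_PER_MEGABASE[platform] * genome_bp / 1e6 * coverage

def reads_needed(platform, genome_bp, coverage):
    """Number of full-length reads needed to reach the given fold coverage."""
    return math.ceil(genome_bp * coverage / READ_LENGTH_BP[platform])

GENOME_BP = 4_600_000  # assumption: a bacterial genome of about 4.6 Mb
COVERAGE = 10          # assumption: 10x coverage

for platform in COST_PER_MEGABASE:
    cost = sequencing_cost(platform, GENOME_BP, COVERAGE)
    reads = reads_needed(platform, GENOME_BP, COVERAGE)
    print(f"{platform}: ${cost:,.0f}, {reads:,} reads")
```

The per-base price drops by more than two orders of magnitude from Sanger to Illumina, while the number of reads that must be placed grows as reads get shorter.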
Next-generation short read sequencing

Advantages
• High throughput
• Does not require PCR amplification
• Accurate measures of abundance

Disadvantages
• Short reads are unlikely to be unique.
• Difficult to identify the origin of a given read
• Particular challenge for genome assembly

Next-generation short read sequencing

Applications
• Sequencing bacterial genomes
• Resequencing
  – identify mutations associated with drug resistance
  – identify mutations in cancerous tissues
  – genetic diagnostics
• Functional genomics
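The uniqueness problem can be illustrated on a toy example. The sketch below counts how many reads of a given length occur more than once in a sequence, so that their origin is ambiguous. The ten-base "genome" is made up for illustration and contains a repeated ACGT; real genomes and real read lengths are much larger.

```python
from collections import Counter

def ambiguous_read_fraction(genome, read_length):
    """Fraction of error-free reads whose sequence occurs at more than one position."""
    reads = [genome[i:i + read_length]
             for i in range(len(genome) - read_length + 1)]
    counts = Counter(reads)
    ambiguous = sum(1 for read in reads if counts[read] > 1)
    return ambiguous / len(reads)

TOY_GENOME = "ACGTACGTTT"  # hypothetical sequence containing a repeat

for length in (4, 6):
    fraction = ambiguous_read_fraction(TOY_GENOME, length)
    print(f"read length {length}: {fraction:.2f} of reads are ambiguous")
```

With reads of length 4, the two copies of ACGT cannot be told apart; with reads of length 6, every read spans enough flanking sequence to be unique.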
Metagenomics
• Production-scale plant fermenter
• Fungal communities from the Arctic
• Singapore indoor air filters
• Yellowstone Obsidian Hot Spring
• Many fecal microbiomes
• Fossil microbiome

Why sequence data is so powerful: Sequences are related!

[Tree figure]
early mammal: …atgccaggtcccagtga…
internal nodes: …atgcaaggagtccgagc… and …atgcgaggtccagtagt…
human: …atgcaaggagtcggagc…
rat: …atgcgaggtccaacagt…
mouse: …atgggaggtccagtagt…
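How closely related the three present-day fragments are can be quantified by counting mismatches. A minimal sketch, assuming the 17-base human, rat, and mouse fragments on the slide are already aligned position by position (no gaps):

```python
from itertools import combinations

FRAGMENTS = {
    "human": "atgcaaggagtcggagc",
    "rat":   "atgcgaggtccaacagt",
    "mouse": "atgggaggtccagtagt",
}

def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    return 100.0 * (len(a) - mismatches(a, b)) / len(a)

for first, second in combinations(FRAGMENTS, 2):
    a, b = FRAGMENTS[first], FRAGMENTS[second]
    print(f"{first} vs {second}: {mismatches(a, b)} mismatches, "
          f"{percent_identity(a, b):.1f}% identity")
```

Rat and mouse differ at 3 positions, while human differs from each of them at 8, which matches the grouping of rat with mouse in the tree.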
Sequence similarity → functional similarity

...atgcgaggattctcgaagacagcga…
...atgcaaggagtc_cgttgacagagc…
?

Sequence similarity → structural similarity

…vkltpegtr_wgghpldekflske…
…vhltpettrgwgghmldekeiske…
Structure?

Estimate protein structure from a related protein with known structure and similar sequence.
Sequence Comparison

…atgcaaggagtcccagagcctgagctgactacgt…
…atgcaagtcgtcccagtgccagaactccctacgt…
…atgcgaggtctcccagtgtctgaactgactaagt…
…accagtggtctccgagtggctgaactgactaaca…

Panels: global pairwise alignment, local pairwise alignment, global multiple alignment, local multiple alignment

Multiple alignment examples:

vtfisll v..frrda.h ksevahrfkd lgeenfkalv…
vtfisll v..frrea.h kseiahrfnd vgeehfiglv…
vtfisll v..frrdt.y kseiahrfkd lgeqyfkglv…
vtlisfi lqrfardaeh kseiahrynd lkeetfkava…

…mkwvtfisll flfssaysrg v..frrda.h kseva
…mkwvtfisll flfssaysrg v..frrea.h kseia
…~~wvtfisll flfssaysrg v..frrdt.y kseia
…mkwvtlisfi flfssatsrn lqrfardaeh kseia

Applications
• Database searching
• RNA structure prediction
• Evolutionary tree reconstruction
• Gene finding
• Sequence assembly…
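Global pairwise alignment is commonly computed by dynamic programming. The slide does not name an algorithm or a scoring scheme, so the following is a minimal Needleman-Wunsch style sketch under assumed scores (match +1, mismatch -1, gap -1); it returns the optimal score together with one optimal alignment.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# The first two DNA fragments from the slide.
SCORE, TOP, BOTTOM = global_align("atgcaaggagtcccagagcctgagctgactacgt",
                                  "atgcaagtcgtcccagtgccagaactccctacgt")
print(SCORE)
print(TOP)
print(BOTTOM)
```

A local alignment differs mainly in that cell scores are not allowed to drop below zero and the traceback starts from the highest-scoring cell rather than the corner.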
[Slide: raw, unannotated DNA sequence of the E. coli lexA gene region; the same sequence appears with annotations under "DNA PATTERNS IN THE E.coli lexA GENE".]

Reconstructing Evolutionary History

…atgcaagagtcgagagc…
…atgcgagtccgtagtgt…
…atgggagtccccactgt…

[Tree figure]
early mammal: …atgccagactccagtga…
internal node: …atgcgaggtccagtgt…
human: …atgcaagagtcgagagc…
rat: …atgcgagtccgtagtgt…
mouse: …atgggaggtccactgt…
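One simple way to reconstruct a tree from pairwise distances is average-linkage clustering (UPGMA). The slide does not say which method is intended, so this is only an illustrative sketch. Because the fragments on this slide have different lengths, the distances used here are the mismatch counts between the equal-length human, rat, and mouse fragments on the earlier "Sequences are related!" slide (human-rat 8, human-mouse 8, rat-mouse 3).

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster leaves bottom-up by average linkage (UPGMA).

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns the tree topology as nested tuples.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)                       # working copy, extended as clusters merge
    while len(sizes) > 1:
        # Merge the closest pair of current clusters.
        x, y = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        nx, ny = sizes.pop(x), sizes.pop(y)
        merged = (x, y)
        # Distance to the merged cluster is the size-weighted average.
        for z in sizes:
            d[frozenset((merged, z))] = (
                nx * d[frozenset((x, z))] + ny * d[frozenset((y, z))]
            ) / (nx + ny)
        sizes[merged] = nx + ny
    return next(iter(sizes))

# Mismatch counts between the 17-base fragments on the earlier slide.
DISTANCES = {
    frozenset(("human", "rat")): 8,
    frozenset(("human", "mouse")): 8,
    frozenset(("rat", "mouse")): 3,
}
TREE = upgma(["human", "rat", "mouse"], DISTANCES)
print(TREE)  # ('human', ('rat', 'mouse'))
```

UPGMA assumes that sequences change at a constant rate along all branches; other distance-based and character-based methods relax that assumption.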
DNA PATTERNS IN THE E.coli lexA GENE

Annotations marked on the sequence:
• Promoter sequences: TTCCAA (-35), TATACT (-10)
• Repressor binding site
• mRNA start
• Ribosomal binding site: GGGGG (+10)
• Open reading frame: ATG…TAA

  1 gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt
 61 atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca
121 ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga
181 cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc
241 tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc
301 gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac
361 cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc
421 cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg
481 atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg
541 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac
601 tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca
661 ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg
721 agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct
781 gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt
841 tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg
901 atagcccggtttatttgggcggcgtggcggttggcgcaacggcggaccagct

Global alignment of upstream sequences
[Figure: aligned upstream sequences with the motifs RRPE and PAC marked. Kellis et al, Nature, 03]
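The promoter annotations can be located with a simple exact-match search. The sketch below uses the first 120 bases of the sequence (two rows of 60 bases, following the position numbering on the slide) and looks for the -35 and -10 hexamers named in the annotations. Exact string matching is a simplification chosen for illustration; real promoter elements vary and are usually modeled with weight matrices.

```python
# First two 60-base rows of the lexA region (positions 1-120).
UPSTREAM = (
    "gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt"
    "atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca"
)

def find_motif(seq, motif):
    """Return the 1-based start of the first exact match, or None if absent."""
    index = seq.lower().find(motif.lower())
    return None if index < 0 else index + 1

MINUS_35 = find_motif(UPSTREAM, "TTCCAA")
MINUS_10 = find_motif(UPSTREAM, "TATACT")
# Number of bases between the end of the -35 hexamer and the start of the -10 hexamer.
SPACER = MINUS_10 - (MINUS_35 + len("TTCCAA"))

print(f"-35 element starts at position {MINUS_35}")
print(f"-10 element starts at position {MINUS_10}")
print(f"spacer length: {SPACER} bases")
```

The two hexamers are separated by 17 bases in this sequence.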
Sequence conservation
[Figure. Salzberg, Nature, 2003]

RNA Secondary Structure

CCGUGUACGUGUAACGGAUCUUUAUUCCC...
CCGUGAACGUAUACUGCAGUUUUAGUGCG...
GGCUCACGCUGUUCCGGAUUAUGAUUCCC
CGCAGAAGCUCCACGCGUUUUCUGUACGA
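A classic first approach to RNA secondary structure prediction is to maximize the number of nested base pairs by dynamic programming (a Nussinov-style recursion). The slide does not specify a method, so this is an illustrative sketch under stated assumptions: Watson-Crick and G-U wobble pairs count equally, and a hairpin loop must contain at least three unpaired bases.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in `rna` (Nussinov-style recursion)."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] is the maximum number of pairs within rna[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i is left unpaired
            # case 2: base i pairs with some base k, splitting the interval in two
            for k in range(i + min_loop + 1, j + 1):
                if (rna[i], rna[k]) in PAIRS:
                    inside = best[i + 1][k - 1]
                    after = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + after)
            best[i][j] = score
    return best[0][n - 1]

# The four RNA fragments from the slide.
for fragment in ("CCGUGUACGUGUAACGGAUCUUUAUUCCC",
                 "CCGUGAACGUAUACUGCAGUUUUAGUGCG",
                 "GGCUCACGCUGUUCCGGAUUAUGAUUCCC",
                 "CGCAGAAGCUCCACGCGUUUUCUGUACGA"):
    print(fragment, max_base_pairs(fragment))
```

Counting pairs ignores stacking energies and loop penalties, so energy-based methods generally give more realistic structures.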
The Fantasy

Whole genome sequence: TGAAATAAACAACCAGGCAGCAGTTATTAACACGGGAACATGGCGGCCGCAGCCTGGGCTCCCGCGGCGGCGGCGG…
[Diagram labels: Compiler; Cell Simulator; Cell Function Simulator]

From genes to cells
Functional and computational genomics
• mRNA expression
• Splice variants
• Protein structure
• Protein expression
• Sub-cellular localization
• Protein-protein interactions
• Protein-DNA interactions

Design and interpretation of all of these assays requires sequence analysis.

[Course topic map]
• Pairwise sequence alignment (global and local)
• Multiple sequence alignment (global, local)
• Substitution matrices
• Database searching (BLAST)
• Sequence statistics
• Gene Finding – time permitting
• Evolutionary tree reconstruction
• Protein structure prediction
• RNA structure prediction
• Computational genomics…