Computational and Molecular Biology
Dannie Durand, Fall 2004, Lecture 3

Outline

• A whirlwind review of molecular biology
• An overview of computational molecular biology
• New problems in genomics

• Pairwise sequence alignment (global and local)
• Multiple sequence alignment
• Substitution matrices
• Database searching
• global, local, BLAST
• Sequence statistics
• Gene Finding
• Evolutionary tree reconstruction
• Protein structure prediction
• RNA structure prediction
• Computational genomics…

Genomes: The complete instruction set

[Figure: Neisseria gonorrhoeae and Homo sapiens]

Whole Genome Sequencing

1995  H. influenzae – 1st whole genome sequence
1997  Yeast – 1st eukaryotic sequence
1998  C. elegans – 1st multicellular organism
2000  Fly, mustard weed – 1st plant
2001  Human – 1st vertebrate
2002  Mouse, Ciona intestinalis
2003  Mosquito, C. briggsae
2004  Five more yeasts, silkworm, rat, C. merolae, white rot fungus…

www.genomesonline.org
215: 32 eukarya, 19 archaea, 164 bacteria
145 whole genome sequences: 19 eukarya, 16 archaea, 110 bacteria
In progress: 522 prokaryotic genomes, 441 eukaryotic genomes

Computational Genomics

GTGCACCTGACTCCTGAG...
Gene sequences
Genomic sequences

Computational implications:
– Need algorithms that scale up
– don't look the way we thought they did → revise models
– New biological questions → new computational problems

The Fantasy

Whole genome sequence
TGAAATAAACAACCAGGCAGCAGTTATTAACACGGGAACATGGCGGCCGCAGCCTGGGCTCCCGCGGCGGCGGCGG…
Cell Simulator Compiler
Cell Function Simulator

From Genes to Organisms

Cellular pathways
[Figure: Rube Goldberg's picture snapping machine]

Example: Pheromone signaling pathway

• receptor senses pheromone outside cell
• transcription of mating specific genes

From Genes to Organisms

• Predict
  – all genes
  – all gene products (protein, RNA)
  – regulatory motifs
• Predict structure and function of individual components
• Reconstruct the cellular networks
  – Regulatory pathways
  – Metabolic pathways
  – Signaling pathways …
• Model cellular behavior

From Genes to Organisms

New computational approaches
– New, better algorithms
– Use data in new ways
  • Comparative genomics
    – Genomic sequence
    – Gene content
    – Gene order
  • Combine different types of data

New high throughput data sets
– mRNA expression
– Splice variants
– Protein expression
– Sub-cellular localization
– Protein-protein interactions
– Protein-DNA interactions

High-throughput sequencing
Computational support for
• data acquisition
• data analysis

High-throughput functional assays
Computational support for
• data acquisition
• data analysis

When are genes turned on?

genes → mRNAs
Determine the set of all genes being transcribed in a given cell type under particular conditions

Alternate splice forms:

DNA:   exon1 exon2 exon3 exon4 exon5 exon6
mRNA:  exon1 exon2 exon3 exon4 exon5 exon6
       exon1 exon2 exon3 exon4
       exon1 exon2 exon3 exon5 exon6

Determine the set of splice variants in a given cell type under particular conditions

Expressed Sequence Tags

– Single-pass sequencing of "random" cDNAs
– 5' or 3' end
– Relatively low quality sequence
– Tissue specific
– No guarantee
  • that all genes are represented
  • that all splice forms are represented

Expressed Sequence Tags (ESTs)

mRNA    CAUGACUCCUUGGCUAC...CCGAGUGCGGCAUUUUUU

        reverse transcriptase

mRNA    CAUGACUCCUUGGCUAC...CCGAGUGCGGCAUUUUUU
cDNA    GTACTGAGGAACCGATG...GGCTCACGCCGTAAAAAA

        degradation of mRNA, synthesis of second DNA strand

dsDNA   CATGACTCCTTGGCTAC...CCGAGTGCGGCATTTTTT
        GTACTGAGGAACCGATG...GGCTCACGCCGTAAAAAA

forward primer: 5' ESTs
reverse primer: 3' ESTs
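The base pairing in the EST diagram can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the strands are written position by position under each other, as on the slide, rather than reversed into 5' to 3' order.

```python
# Pairing rules: RNA template to DNA, and DNA to DNA.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """cDNA strand that pairs with the mRNA (reverse transcriptase step)."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in mrna)


def second_strand(cdna: str) -> str:
    """Second DNA strand synthesized against the first cDNA strand."""
    return "".join(DNA_COMPLEMENT[base] for base in cdna)


# 5' fragment of the mRNA shown on the slide.
mrna_fragment = "CAUGACUCCUUGGCUAC"
cdna = first_strand_cdna(mrna_fragment)
print(cdna)                 # GTACTGAGGAACCGATG
print(second_strand(cdna))  # CATGACTCCTTGGCTAC
```

The second strand is the mRNA sequence with U replaced by T, which is why ESTs read from the double-stranded cDNA can serve as tags for the transcript.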

ESTs: molecular tags for genes.

ESTs
– fast way to capture the coding portion of the genome. (In eukaryotes, most of the genome does not contain protein coding genes.)
– provide a crude measure of transcript abundance. However, rare transcripts may be missed.
– provide a crude measure of splice variants (if at the 3' or 5' end of the gene).

When are genes turned on?

DNA microarrays

DNA arrays detect mRNA transcripts

Targets: Each well contains a cDNA oligonucleotide corresponding to a unique subsequence of a gene (e.g. cgtaacgctat)

[Figure legend: Down regulated in tumor, Up regulated in tumor, Unchanged]

Expression array data

[Figure: unsorted array data and clustered array data]
O. Alter, P. O. Brown and D. Botstein, PNAS 97 (18), 2000
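The three labels in the microarray legend (up regulated, down regulated, unchanged) can be illustrated with a log-ratio rule. The slides do not say how the calls are made, so this is a sketch under assumptions: each spot has a tumor and a normal intensity, and the two-fold cutoff, function name and intensities are all invented for illustration.

```python
import math


def classify_expression(tumor: float, normal: float, threshold: float = 1.0) -> str:
    """Label one gene from two intensities.

    threshold is in log2 units; 1.0 (a two-fold change) is an assumed cutoff.
    """
    log_ratio = math.log2(tumor / normal)
    if log_ratio >= threshold:
        return "up regulated in tumor"
    if log_ratio <= -threshold:
        return "down regulated in tumor"
    return "unchanged"


# Made-up intensities for three hypothetical genes: (tumor, normal).
spots = {"geneA": (800.0, 200.0), "geneB": (150.0, 600.0), "geneC": (310.0, 300.0)}
for gene, (tumor, normal) in spots.items():
    print(gene, classify_expression(tumor, normal))
```

Working in log2 units makes a four-fold increase and a four-fold decrease symmetric (+2 and -2), which is why log ratios are a common choice for this kind of call.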

Not all mRNA transcripts are translated into proteins

DNA: RNA polymerase, promoter, gene
    ↓ transcription
mRNA:
    ↓ translation
Amino acid sequence:

Protein Expression

Find the set of all proteins being expressed in a given cell type under particular conditions
• Isolate the set of proteins present
  – 2D gel electrophoresis
• Identify the proteins in the set
  – Mass spectrometry
  – Protein chips based on antibody recognition

Protein Sub-cellular Localization

Plant and animal cells have compartments or "organelles"
• Recognize sequence-based localization signals
• Microscopy: stain protein with a "guest exon"

[Figure: Endoplasmic Reticulum (ER) and Mitochondria. R. Murphy, Biological Sciences]

Protein-protein interactions

• Yeast Two-hybrid System
  – Determine if protein a binds to protein b
  P. Uetz et al, Nature, 2000; Ito et al, PNAS, 2001
• Tandem affinity purification (TAP)
  – Determine if proteins a and b participate in the same complex
  Rigaut et al, Nature Biotechnology, 1999

Protein-DNA interactions

• Chromatin ImmunoPrecipitation (ChIP)
  – Determine if transcription factor a binds upstream of gene b
  Lee et al, Science, 2002

Computational Genomics

GTGCACCTGACTCCTGAG...
Gene sequences
Genomic sequences
Functional data

(Slick, high tech) vocabulary for the 21st century

• Genome
  – All the genetic material in the chromosomes of a particular organism
• Proteome
  – Proteins expressed by a cell or organ at a particular time and under specific conditions
• Transcriptome
  – The full complement of activated genes, mRNAs, or transcripts in a particular tissue at a particular time
• Interactome
  – Full complement of all protein-protein and protein-DNA interactions

Make up your own…

More…

• Comparative genomics
  – The study of genome function and evolution by comparing the genomic sequence and gene content and organization in two or more genomes.
• Functional genomics
  – The study of genes, their resulting proteins, and the role played by the proteins in the body's biochemical processes.
• Pharmacogenomics
  – The study of the interaction of an individual's genetic makeup and response to a drug
• Proteomics
  – The study of the full set of proteins encoded by a genome.
• Structural genomics
  – The effort to determine the 3D structures of large numbers of proteins using both experimental techniques and computer simulation


A sampler of new computational problems

• Gene prediction
• Functional characterization
• Regulation
• Biological networks

[Slide: raw, unannotated DNA sequence; the same region is listed with coordinates on the E. coli lexA slide that follows]

DNA PATTERNS IN THE E. coli lexA GENE

PATTERN
Promoter sequences        -35: TTGACA              lexA: TTCCAA
                          -10: TATAAT              lexA: TATACT
mRNA start
Ribosomal binding site    mRNA start +10: GGAGG    lexA: GGGGG
Repressor binding site    CTGNNNNNNNNNNCAG
Open reading frame        ATG…TAA

  1 gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt
 61 atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca
121 ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga
181 cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc
241 tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc
301 gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac
361 cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc
421 cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg
481 atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg
541 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac
601 tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca
661 ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg
721 agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct
781 gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt
841 tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg
901 atagcccggtttatttgggcggcgtggcggttggcgcaacggcggaccagct

Gene Recognition Problem

human
…aggaggactataacgcctctcccagcatgggctggggctcctgtcccccactgtggcctggtactgcgccaggactcgtagtga…
Exon 1, Exon 2, Exon 3

mouse
…cgacgccatataaattagtaatgtactatgggctggggcgtgatacgtacactgtggcctggtagctatgcagcacgtgtctagtga…
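The lexA pattern list lends itself to a simple search. The sketch below looks for the repressor binding site pattern CTGNNNNNNNNNNCAG in the first 120 bases of the listed sequence using a regular expression; the helper name is hypothetical, and treating N as "any base" is the usual reading of the notation rather than something the slide states.

```python
import re

# First 120 bases of the lexA region as printed on the slide.
sequence = (
    "gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt"
    "atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca"
)


def find_pattern(pattern: str, dna: str):
    """Yield (1-based start, matched text) for every occurrence.

    'N' in the pattern matches any base. A lookahead is used so that
    overlapping occurrences are also reported.
    """
    regex = re.compile("(?=(" + pattern.lower().replace("n", "[acgt]") + "))")
    for match in regex.finditer(dna.lower()):
        yield match.start() + 1, match.group(1)


for start, site in find_pattern("CTGNNNNNNNNNNCAG", sequence):
    print(start, site)
# 16 ctggtttattgtgcag
# 57 ctgtatatactcacag
# 78 ctgtatatacacccag
```

The same helper finds the other listed patterns, for example `find_pattern("TTCCAA", sequence)` or `find_pattern("TATACT", sequence)`. A pattern match only marks a candidate site; it says nothing about whether the site is functional.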

Comparative genomics for gene finding

Salzberg, Nature, 2003

Functional Characterization

How do we define protein function?
• Molecular function
• Biological process
• Pathway
• Protein-protein interaction
• Cellular location
• Structure

InsR
• Transmembrane protein
• Receptor
• Kinase (phosphorylates)
• Insulin signaling pathway
• Growth and metabolism
• Binds insulin
• Dimeric structure
[Figure labels: cell membrane, insulin, insulin receptor, kinase domain]

Computational approaches to functional characterization

• Sequence similarity
  Hypothesis: two proteins with significant sequence similarity are likely to share functional characteristics.
• Genomic approaches
  – Genomic context
  – Conserved expression pattern.

Inferring function from homology

[Figure: a protein sequence is searched with BLAST against database sequences]

…LWDEFNQLGTE
MIVTKAGRRMFP
TFQVKLFGMDPM
ADYMLLMDFVPV
DDKRYRYAFHS…

BLAST

…LWDPTFQVEFNQLG…
…TMFPTFEMIVTKAG…
…RRMFPTFQVPTFQV…
…KLFMFPTFGEMDPM…
…ADYMMCFPTFLLMD…
…FVPVDDKPTSFQVR…

Some limitations

• No matches!
• No functional data!
• Misleading matches
• Need to read a lot of articles…
• Does not infer function directly from sequence

E. coli: 4290 predicted proteins. Almost 50% could not be characterized by homology.
Yeast: 6217 predicted proteins. 2557 could not be characterized by homology.

Genomic approaches

• Phylogenetic profile
  Pellegrini et al., PNAS, 1999
• Domain fusion
  Enright et al., Nature, 1999
  Marcotte et al., Science, 1999
• Expression pattern
  Eisen et al., PNAS, 1998
• Gene position
  Overbeek et al., PNAS, 1999
  Dandekar et al., Trends in Biochem. Sci, 1998

Phylogenetic profile
Pellegrini et al., PNAS, 1999

Finding phylogenetic profiles

Underlying hypothesis: Functionally linked proteins have homologs in the same set of organisms.
Example: Flagellar proteins are found in bacteria that have flagella and are absent in other species.
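A phylogenetic profile can be pictured as a presence/absence vector over genomes, with proteins that share a profile proposed as functionally linked. The sketch below assumes homolog detection has already been done; the protein names, genome names and helper functions are invented for illustration, and exact profile identity is the simplest possible matching rule, not necessarily the one used by Pellegrini et al.

```python
from collections import defaultdict

# Hypothetical homolog calls: which genomes contain a homolog of each protein.
homolog_hits = {
    "flgA": {"genome1", "genome3"},
    "flgB": {"genome1", "genome3"},
    "metK": {"genome1", "genome2", "genome3"},
    "xyzQ": {"genome2"},
}
genomes = ["genome1", "genome2", "genome3"]


def phylogenetic_profile(hits, genome_order):
    """Encode presence (1) or absence (0) of a homolog in each genome."""
    return tuple(1 if genome in hits else 0 for genome in genome_order)


def group_by_profile(hits_by_protein, genome_order):
    """Group proteins whose profiles are identical."""
    groups = defaultdict(list)
    for protein, hits in hits_by_protein.items():
        groups[phylogenetic_profile(hits, genome_order)].append(protein)
    return dict(groups)


for profile, proteins in group_by_profile(homolog_hits, genomes).items():
    print(profile, sorted(proteins))
```

Here the two hypothetical flagellar proteins share the profile (1, 0, 1) and would be proposed as linked, mirroring the flagella example above.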

Domain fusion
Marcotte et al., Science, 1999

Proteins in genome of interest (e.g. E. coli): protein A contains domain a, protein B contains domain b.
Rosetta Stone sequence in reference genome (e.g. yeast): one protein containing both a and b.
– If a and b are found in a fused form in some other organism, infer that A and B interact.

Underlying hypothesis:
– Domains a and b bind in the fused and unfused proteins.
– Domain fusion enhances the affinity of a and b.
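The Rosetta Stone inference can be sketched as a search over domain annotations: two separate proteins in the genome of interest are linked when their domains occur together in a single reference protein. The annotations and the function name below are hypothetical, and real analyses work from sequence alignments rather than ready-made domain labels.

```python
from itertools import combinations

# Hypothetical domain annotations: protein name -> set of domain families.
query_genome = {"A": {"a"}, "B": {"b"}, "C": {"c"}}
reference_genome = {"R1": {"a", "b"}, "R2": {"c"}}


def rosetta_stone_links(query, reference):
    """Return (protein1, protein2, fused_protein) triples.

    A pair of distinct query proteins is linked when a domain from each
    (two different domains) is found in the same reference protein.
    """
    links = []
    for p1, p2 in combinations(sorted(query), 2):
        for ref_name in sorted(reference):
            ref_domains = reference[ref_name]
            fused = any(
                x != y and x in ref_domains and y in ref_domains
                for x in query[p1]
                for y in query[p2]
            )
            if fused:
                links.append((p1, p2, ref_name))
    return links


print(rosetta_stone_links(query_genome, reference_genome))  # [('A', 'B', 'R1')]
```

The sketch predicts only that A and B interact; it has no way to tell whether the homologous domains actually bind.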

Problems

False negatives: Domains a and b bind to each other, but there is no known Rosetta Stone sequence.

Problems: false positives

Domains a and b are homologous to a Rosetta Stone protein, but do not bind.

"Promiscuous" domains: Domain fusion analysis cannot distinguish between homologs that bind and those that do not.

Co-regulated genes may be functionally linked

Example: Relating Genes to the Cell Cycle

[Figure: clustered array data; co-regulated genes grouped by cell cycle phase: M/G1, G1, S, S/G2, M/G2]
O. Alter, P. O. Brown and D. Botstein, PNAS 97 (18), 2000
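Co-regulation is usually judged by comparing expression profiles across conditions or time points. The sketch below uses Pearson correlation, which is one common similarity measure and an assumption here, since the slides do not name one; the profiles, gene names and the 0.9 cutoff are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up log expression ratios over five time points.
profiles = {
    "gene1": [2.0, 1.0, -1.0, -2.0, 0.0],
    "gene2": [1.8, 1.1, -0.9, -2.2, 0.1],
    "gene3": [-2.0, -1.0, 1.0, 2.0, 0.0],
}


def coexpressed_pairs(profiles, min_r=0.9):
    """Return gene pairs whose profiles correlate at or above min_r."""
    names = sorted(profiles)
    return [
        (g, h)
        for i, g in enumerate(names)
        for h in names[i + 1:]
        if pearson(profiles[g], profiles[h]) >= min_r
    ]


print(coexpressed_pairs(profiles))  # [('gene1', 'gene2')]
```

gene3 is strongly anti-correlated with the other two, so it is not reported; whether anti-correlated genes should also count as linked is a separate design decision.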

A combined algorithm for genome-wide prediction of protein function
Marcotte et al., Nature Genetics, 1999

6217 predicted yeast proteins
• Experimentally verified interaction
• Phylogenetic profile
• Correlated expression pattern
• Domain fusion
• Homology to known pathway
Prediction of protein-protein interaction

Results (%)
Compared predictions made by phylogenetic profiles with SwissProt annotation keywords

                              False positives   Predicts known fn   Random links
Experimental                        6.5               33.2               4.0
Known metabolic pathway             2.5               20.3               4.5
Phylogenetic profile               29.5               33.1               7.4
Rosetta Stone Method               36.4               26.5               7.7
Correlated expression              35.8               11.5               6.9
"Highest Confidence" links          4.8               40.9               5.5

"The Use of Gene Clusters to Infer Functional Coupling"
Overbeek et al., PNAS 96: 2896-2901, 1999.

Regulatory regions

[Figure labels: Polymerase; A, B, C, D]

Examples of binding site motifs

Identifying binding sites
Comparative genomic approach

• Conserved genes in Organism 1 and Organism 2
• Global alignment of upstream sequences
• Conserved regulatory elements (figure labels: RRPE, PAC)
Kellis et al, Nature, 03
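Once the upstream regions of a conserved gene have been globally aligned, candidate regulatory elements show up as runs of identical columns. The sketch below scans two alignment rows for such runs; the rows are invented for illustration, the six-column minimum is an arbitrary assumption, and the alignment itself is taken as given.

```python
def conserved_blocks(aligned_1: str, aligned_2: str, min_length: int = 6):
    """Return (0-based start column, block) for gap-free runs of identical
    columns at least min_length long, given two rows of a global alignment."""
    if len(aligned_1) != len(aligned_2):
        raise ValueError("alignment rows must have equal length")
    blocks, run_start = [], None
    for i, (a, b) in enumerate(zip(aligned_1, aligned_2)):
        identical = a == b and a != "-"
        if identical and run_start is None:
            run_start = i
        elif not identical and run_start is not None:
            if i - run_start >= min_length:
                blocks.append((run_start, aligned_1[run_start:i]))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and len(aligned_1) - run_start >= min_length:
        blocks.append((run_start, aligned_1[run_start:]))
    return blocks


# Made-up aligned upstream regions from two hypothetical related species.
row_1 = "ACGTTGAAAATTTTCCA-GTCGATGAGATGAGCTTA"
row_2 = "TCGATGAAAATTTTCTAAGACGATGAGATGAGGTTA"
print(conserved_blocks(row_1, row_2))
# [(4, 'TGAAAATTTTC'), (20, 'CGATGAGATGAG')]
```

How informative such blocks are depends on the evolutionary distance between the two species: too close and everything is conserved, too distant and the upstream regions no longer align.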

Comparative genomics

• AKA, phylogenetic footprinting
• Key: correct evolutionary distance between species 1 and species 2
  – Mouse and human
  – S. cerevisiae and S. bayanus
• Requires
  – Whole genome sequence from closely related organisms
  – Identification of orthologs
  – Global (multiple) alignment

Hardison, Oeltjen, Miller, Genome Res, 97
Wasserman et al, Nat Genetics, 00
Kellis et al, Nature, 03

Finding binding sites with expression data

• Identify co-expressed genes
• Compare the upstream regions
• Identify common sequence motifs
• Experimental validation

Tavazoie et al, Nat Gen, 99

Transcriptional co-regulation

Expression cluster → Common DNA sequences
[Figure labels: Promoter, Transcription factor binding site]

Expression patterns → DNA sequence patterns
[Figure: expression profiles of co-expressed genes beside their upstream regions, each ending at ATG; Gibbs Sampling (AlignACE) over the upstream regions yields a position weight matrix]
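A position weight matrix summarizes a set of aligned binding sites column by column. The sketch below builds a log-odds matrix from sites that are already aligned and uses it to score windows of a sequence. It does not implement Gibbs sampling or AlignACE, which is the step that would find the sites in the first place; the sites, the pseudocount and the uniform background of 0.25 are assumptions for illustration.

```python
import math


def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix (one dict per column) from aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def score(matrix, window):
    """Sum the log-odds of the bases in one window."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_window(matrix, sequence):
    """Return the highest-scoring window of the matrix width."""
    width = len(matrix)
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    return max(windows, key=lambda w: score(matrix, w))


# Made-up aligned binding sites for a hypothetical motif.
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
pwm = position_weight_matrix(sites)
print(best_window(pwm, "AATTGACTCGGA"))  # TGACTC
```

The pseudocount keeps bases that never appear in a column from scoring minus infinity, so a window with one unusual base is penalized rather than excluded.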

Transcriptional co-regulation

Regulation graph from protein-DNA binding data
Lee et al, Science, 2002
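A regulation graph of this kind can be represented as a directed graph with an edge from transcription factor a to gene b whenever a is found to bind upstream of b. The sketch below uses invented binding calls and hypothetical helper names, and assumes each factor is named after the gene that encodes it, so that a self-loop means the factor binds upstream of its own gene.

```python
from collections import defaultdict

# Hypothetical binding calls: (transcription factor, gene it binds upstream of).
binding_calls = [
    ("tfA", "tfB"), ("tfA", "gene1"), ("tfB", "gene1"),
    ("tfB", "gene2"), ("tfC", "tfC"),
]


def regulation_graph(calls):
    """Map each factor to the set of genes it binds upstream of."""
    graph = defaultdict(set)
    for factor, target in calls:
        graph[factor].add(target)
    return dict(graph)


def regulators_of(graph, gene):
    """Factors with an edge into the given gene."""
    return sorted(f for f, targets in graph.items() if gene in targets)


def autoregulators(graph):
    """Factors that bind upstream of their own gene (a self-loop)."""
    return sorted(f for f, targets in graph.items() if f in targets)


graph = regulation_graph(binding_calls)
print(regulators_of(graph, "gene1"))  # ['tfA', 'tfB']
print(autoregulators(graph))          # ['tfC']
```

Small recurring subgraphs, such as the self-loop detected here, are the kind of structure that a search for network motifs looks for.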

Regulatory network motifs
Lee et al, Science, 2002

Reverse engineering regulatory networks
[Figure labels: Transcription graph, Regulatory network]

Reverse engineering metabolic networks
[Figure labels: Protein-protein interaction graph, Metabolic network]

• Pairwise sequence alignment (global and local)
• Multiple sequence alignment
• Substitution matrices
• Database searching
• global, local, BLAST
• Sequence statistics
• Gene Finding
• Protein structure prediction
• Evolutionary tree reconstruction
• RNA structure prediction
• Computational genomics…