<<

Current Topics in Analysis March 1, 2005 Computational Techniques in Comparative

Elliott H. Margulies, Ph.D. Genome Technology Branch National Institute [email protected]

Outline

ƒ Fundamental concepts of ƒ Alignment and visualization tools – Pair-wise and multi- methods – Combining with factor binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – prediction and identification Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from genome sequence comparisons ƒ Multi-species sequence analysis

1 Why Compare Genomic Sequences from Different Species? ƒ Explore evolutionary relationships

Coding

Non-Coding Other Functional

Repetitive ƒ Enhanced algorithms ƒ Identify functionally constrained sequence

2 Charles Darwin

ƒ Served as naturalist on a British science expedition around the world (1831 -- 1836)

ƒ On the Origin of Species by Means of , or the Preservation of Favoured Races in the Struggle for Life ƒ The Origin of Species (1859) – All species evolved from a single life form – “Variation” within a species occurs randomly – Natural selection – Evolutionary change is gradual

Other Intellectual Foundations

ƒ Darwin (1859) – Theories of ƒ Mendel (1866) (rediscovered in 1900) – are units of ƒ Avery, McCarty & MacLeod (1944) – DNA as the “transforming principle” ƒ Watson & Crick (1953) – Structure of DNA ƒ Sanger (1977) – Methods of sequencing DNA

3 Rationale

ƒ DNA represents a “blueprint” for structure and physiology of all living things ƒ All species use DNA ƒ Mutations in functional DNA are less likely to be tolerated

Comparative Genomics

ƒ Find sequences that have diverged less than we expect These sequences are likely to have a functional role ƒ Our expectation is related to the time since the last common ancestor Evolutionary Distance

Human Horse Rat

Platypus

4 What’s in a Name?

ƒ Highly conserved sequences ƒ Sequences under purifying selection ƒ Functionally constrained sequences ƒ ECOR – Evolutionary COnserved Region – Variant: ECR ƒ CNS – Conserved Non-coding Sequence ƒ CNGs – Conserved Non-Genic sequence ƒ MCS – Multi-species

Outline

ƒ Fundamental concepts of comparative genomics ƒ Alignment and visualization tools – Pair-wise and multi-species methods – Combining with binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – Gene prediction and identification – Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from vertebrate genome sequence comparisons ƒ Multi-species sequence analysis

5 Sequence Alignments

100% Identical Species 1 CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA |||||||||||||||||||||||||||||||||||||||| Species 2 CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA

80% Identical Species 1 CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA || |||| ||| ||| |||||| |||||| |||| |||| Species 2 CACGGGCTAATCCGCCAATTGGCTATGGGG-CCCAGCGTA

30% Identical Species 1 CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA | | | | || | | | | | Species 2 CACGAACTAATCCGCCAATAGCCTATAGCG-CACAGCGAA

Tools for Aligning Genomic Sequences

Genome Research (2000) 10:577-586

6 PipMaker vs. VISTA

ƒ Visualization Seq1 ƒ Alignment Strategy Seq2 – VISTA: avid – PipMaker: blastz ƒ East Coast – West Coast

Lawrence Berkeley Penn State National Laboratory University

PipMaker http://bio.cse.psu.edu/pipmaker/ ƒ Percent Identity Plot

ƒ X-axis is the reference sequence ƒ Horizontal lines represent gap-free alignments

7 http://bio.cse.psu.edu/pipmaker/

. : . : . : . : . : 1 GACATCTAAATTGCCTATTTT ATGCCTTTATGTATTGTAGAAATCTGCC ||||:|||::| :|||||||-|||::|||:|||||| ||||||:||||| 575 GACACCTAGGTATTCTATTTTTATGTTTTTGTGTATTCTAGAAACCTGCC

50 . : . : . : . : . : 50 TTACTGTTTTGTGTAGCCACAGAACAGAAATAGACTAACTTTTTTTT |||| ::| |:||: :: | :||||||||::||| || | ||:||--- 625 TTACAACTGTATGCCATGAAGGAACAGAAGCAGAAACACATATTCTTTTA

100 . : . : . : . : . : 97 AGTAAACTCTCTGGAAACAAAAATCTTCCCAGATATTTATTGTTA -----|:|||||:||--||:| ||||| |||||||||||||| |||| | 675 AAAAGAATAAACCCT GGGACCAAAATTCTTCCCAGATATTGATTGATC

150 . : . : . : . : . : 142 GGAAAATATAATCTAAAAATTCTTCTGCCCAACCCCTTGGCTGCATCCCA ||||||| ||||||| |||:|||||||||| ::|||||:||::||:|||| 723 GGAAAATCTAATCTACAAACTCTTCTGCCCTGTCCCTTAGCCACACCCCA

200 . : 192 GTCTTCCATC ||||| |||| 773 GTCTTGCATC

MultiPipMaker

8 http://www-gsd.lbl.gov/vista/

ƒ Global Alignment (avid) – Bray et al. (2003) Genome Res 13:97-102 ƒ Sliding Window Approach to Visualization – Plot Percent Identity within a Fixed Window-Size, a t Regular Intervals GACATCTAAATTGCCTATTTT ATGCCTTTATGTATTGTAGAAATCTGCCTTACTGTTTTGTGTAGCCACAGAACAGAAATAGACTAACTTTTTTTT ||||:|||::| :|||||||-|||::|||:|||||| ||||||:||||||||| ::| |:||: :: | :||||||||::||| || | ||:|| GACACCTAGGTATTCTATTTTTATGTTTTTGTGTATTCTAGAAACCTGCCTTACAACTGTATGCCATGAAGGAACAGAAGCAGAAACACATATTCTT

80% 88% 72% 76% 68% 52% 56% 64%

VISTA

ƒ Percent Identity is plotted from: – 100 base windows – Moved every 15 bases ƒ Colored regions meet certain alignment criteria – >100 bp >75% Identity

9 What’s Your Preference?

PipMaker

VISTA

East & West Coast Unite

http://zpicture.dcode.org/

Genome Research, 2004, 14(3):472-7

10 zPicture Output Summary Page

zPicture: Dot Plot View

11 zPicture Output Summary Page

Dynamic Visualization Options PipMaker-style

12 Dynamic Visualization VISTA-style

The Multi-Species version of zPicture

13 Outline

ƒ Fundamental concepts of comparative genomics ƒ Alignment and visualization tools – Pair-wise and multi-species methods – Combining with transcription factor binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – Gene prediction and identification – Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from vertebrate genome sequence comparisons ƒ Multi-species sequence analysis

zPicture Output Summary Page

14 Are there any transcription factor binding sites in my alignment?

Pre-Computed Sequence Alignments

TRANSFAC

http://www.gene-regulation.com/

TRANSFAC http://www.gene-regulation.com/ ƒ A database of: – Eukaryotic transcription factors – Their genomic binding sites – And DNA- binding profiles ƒ Data are collected from published studies – Non-curate d – Redundant data

15 JASPAR: An Alternative to TRANSFAC

ƒ Differences from TRANSFAC: – Manually curated for “high quality” experiments – Non- redundant collection http://jaspar.cgb.ki.se/

TRANSFAC Data are inherently “noisy”

ƒ Binding sites are very short 6-10 bases in length ƒ Low complexity Only 4 “letters” in the DNA alphabet ƒ Frequently observe binding site by chance ƒ Conservation can help reduce the noise ƒ rVISTA 2.0 for pair-wise alignments ƒ multiTF for multi-species alignments

16 Example of multiTF Output

Summary of Alignment Tools

ƒ PipMaker (blastz) ƒ VISTA (avid) ƒ zPicture and MULAN ƒ Lagan and mLagan (glocal alignments) – http://lagan.stanford.edu/ ƒ rVISTA 2.0 ƒ Box 1 from: Ureta-Vidal, Ettwiller, and Birney (2003) Comparative Genomics: Genome-Wide Analysis in Metazoan 4: 251-262 ƒ Table 1 from: Miller, Makova, Nekrutenko, and Hardison (2004) Comparative Genomics Annual Reviews in Human Genetics 5:15-56

17 Outline

ƒ Fundamental concepts of comparative genomics ƒ Alignment and visualization tools – Pair-wise and multi-species methods – Combining with transcription factor binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – Gene prediction and identification Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from vertebrate genome sequence comparisons ƒ Multi-species sequence analysis

Motif Finding

ƒ Identify Transcription Factor Binding Sites ƒ What sequences should be searched? Coordinately Regulated Genes

Human

Identify Over-Represented Patterns

18 Phylogenetic Footprinting

ƒ FootPrinter – http://bio.cs.washington.edu/software.html ƒ Takes the phylogeny into account CoordinatelyOrthologous Regulated Genes Genes

Human

Additional Species

Identify Conserved Motifs

Summary of Phylogenetic Footprinting Tools

ƒ FootPrinter – http://bio.cs.washington.edu/software.html – Blanchette and Tompa (2003) Nucleic Acids Research 31:3840–3842 ƒ phyloCon – http://oldural.wustl.edu/~twang/PhyloCon/ – Wang and Stormo (2003) 19:2369-80 ƒ phyME – Sinha, Blanchette, and Tompa (2004) BMC Bioinformatics 28:170 ƒ List of motif- finding algorithms: – Box 1 of Ureta-Vidal et al. (2003) Nature Reviews Genetics 4:251-262 ƒ Bayesian Approaches (and home of the Gibbs sampler) – http://www.wadsworth.org/resnres/bioinfo/ ƒ Example of motif- finding limited by conservation: – Wasserman et al. (2000) Nature Genetics 26:225-228

19 Outline

ƒ Fundamental concepts of comparative genomics ƒ Alignment and visualization tools – Pair-wise and multi-species methods – Combining with transcription factor binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – Gene prediction and identification Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from vertebrate genome sequence comparisons ƒ Multi-species sequence analysis

Genome-Wide Sequences

Fugu rubripes Fugu Tetraodon Zebrafish Tetraodon nigroviridis Danio rerio Amphibians Gallus gallus Xenopus Marsupials Bos taurus Sus scrofus Canis familiaris Felis catus Mus musculus Rattus norvegicus Mouse Papio cynocephalus anubis Pan troglodytes Rat Cow Homo sapiens

450 400 350 300 250 200 150 100 50 0 Millions of years ago Thomas JW & Touchman JW (2002) TIGS 18:104-108 Human Chimpanzee

20 Genome Browsers

http://genome.ucsc.edu

http://www.ensembl.org

http://www.ncbi.nlm.nih.gov/mapview/

21 Outline

ƒ Fundamental concepts of comparative genomics ƒ Alignment and visualization tools – Pair-wise and multi-species methods – Combining with transcription factor binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – Gene prediction and identification – Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from vertebrate genome sequence comparisons ƒ Multi-species sequence analysis

Approaches to Gene Prediction

ƒ Evidence-Based ƒ Ab Initio ƒ Dual-Genome – MGC – Genscan – Twinscan – Acembly – Geneid – SGP – Ensembl

22 Additional Gene Prediction Resources

ƒ Fugu BLAT Track at UCSC

ƒ SLAM – http://baboon.math.berkeley.edu/~syntenic/slam.html – Cawley et al. (2003) Nucleic Acids Research 31:3507-3509 ƒ Exoniphy Siepel and Haussler. Computational identification of evolutionarily conserved . Proc. 8th Annual Int'l Conf. on Research in Computational , pp. 177-186, 2004. http://www.soe.ucsc.edu/~acs/recomb2004.pdf – Also see genome “test” browser for data ƒ Box 1 from: – Ureta-Vidal et al. (2003) Nature Reviews Genetics 4:251-262

Outline

ƒ Fundamental concepts of comparative genomics ƒ Alignment and visualization tools – Pair-wise and multi-species methods – Combining with transcription factor binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – Gene prediction and identification – Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from vertebrate genome sequence comparisons ƒ Multi-species sequence analysis

23 Chaining Alignments

ƒ Chaining bridges the gulf between large syntenic blocks and base-by- base alignments. The Challenge: ƒ Local alignments tend to break at transposon insertions, inversions, duplications, etc. ƒ Global alignments tend to force non-homologous bases to align. The Solution: ƒ Chaining is a rigorous way of joining together local alignments into larger structures. Slide (though modified) Courtesy of Jim Kent

Chains join together related local alignments

Protease Regulatory Subunit 3

Slide Courtesy of Jim Kent

24 Net Alignments: Focus on Orthology

ƒ Frequently, there are numerous mouse alignments for any given human region, particularly for coding regions. ƒ Net finds best mouse match for each human region.

Slide (though modified) Courtesy of Jim Kent

Genome-wide Multiple Sequence Alignments

25 Conservation Score at UCSC

ƒ Displays evolutionary conservation based on a phylogenetic

Probability that the given alignment column was generated by the “conserved” state

ƒ “Most Conserved” track represents highly conserved regions – Tuned to cover ~4% of the genome

“Most Conserved” Track at UCSC

What it you want a different stringency than was used for the Most Conserved Track?

26 Using the Table Browser to get Highly Conserved Sequences

27 Using the Table Browser to get Highly Conserved Sequences

Outline

ƒ Fundamental concepts of comparative genomics ƒ Alignment and visualization tools – Pair-wise and multi-species methods – Combining with transcription factor binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – Gene prediction and identification Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from vertebrate genome sequence comparisons ƒ Multi-species sequence analysis

28 Insights from Human-Rodent Sequence Comparisons

ƒ Similar gene content and linear organization – ~340 syntenic blocks ƒ Difference in – Mouse genome is 14% smaller ƒ Sequence Conservation Nature 420:520, 2002 – ~40% in Alignments – ~5% Under Selection • ~1.5% Coding • ~3.5% Non-Coding ƒ See Jan 2003 & April 2004 issues of Genome Research Nature 428:493, 2004

Neutral Evolution

ƒ No selective pressure/advantage to keep or change the DNA sequence ƒ Rate of variation should correlate with: – Mutation rate – Amount of time since the last common ancestor ƒ The neutral rate can vary across the genome

29 Types of Neutrally Evolving DNA

ƒ 4-Fold Degenerate Sites – Third position of codons which can be any base and code for the same Second First U C A G Last U Phe Ser Tyr Cys U Phe Ser Tyr Cys C Leu Ser Stop Stop A Leu Ser Stop Trp G C Leu Pro His Arg U Leu Pro His Arg C Leu Pro Gln Arg A Leu Pro Gln Arg G A Ile Thr Asn Ser U Ile Thr Asn Ser C Ile Thr Lys Arg A Met Thr Lys Arg G G Val Ala Asp Gly U Val Ala Asp Gly C Val Ala Glu Gly A Val Ala Glu Gly G

Types of Neutrally Evolving DNA

ƒ Ancestral Repeats – Ancient Relics of Transposons Inserted Prior to the Eutherian Radiation

Adapted from Hedges & Kumar, Science 297:1283-5

30 Determining the Fraction of Sequence Under Purifying Selection Adapted From Figure 28, Nature 420:553

Genome-Wide Distribution

Neutrally Evolving Functionally Constrained

Conservation Score

31 Outline

ƒ Fundamental concepts of comparative genomics ƒ Alignment and visualization tools – Pair-wise and multi-species methods – Combining with transcription factor binding site data ƒ Motif Identification ƒ Comparative genomics resources available at UC Santa Cruz -- http://genome.ucsc.edu – Genome-wide sequence availability – Gene prediction and identification Finding orthologous sequences in other species – Identifying conserved sequences ƒ Insights from vertebrate genome sequence comparisons ƒ Multi-species sequence analysis

Phylogenetic Shadowing Boffelli et al. (2003) Science 299:1391-1394. ƒ Identifying sequence differences between multiple primate species

32 Multi-Species Comparative Sequence Analysis

Nature 424:788, 2003

Multi-Species Comparative Sequence Analysis Human Mouse Rat Pufferfish Zebrafish Chicken Chimpanzee Dog Cow Xenopus Monodelphis Macaque

33 Multi-Species Comparative Sequence Analysis Human Mouse Rat Pufferfish Zebrafish Chicken Chimpanzee Dog Cow Xenopus Monodelphis Macaque

Multiple Additional Species

7

22.3 22.2 22.1 21.3 21.1 15.3 15.2 1.8 Mb 15.1 p 14.3 14.1 13 12.3 12.1 CAV2,1 CAPZA2GASZ CORTBP2 11.2 11.1 11.1 11.21 TES1 MET ST7WNT2 CFTR 11.22

11.23 21.11 21.12 21.13 21.3

22.1 22.2 22.3 q 31.1 31.2 31.31 31.33 32.2 32.3 33 34 35 36.1 36.2 36.3

34 Hedgehog Vervet Rabbit Armadillo Macaque

Cow Pig Monodelphis Gorilla Horse Mouse lemur Wallaby Baboon Fugu Opossum Dog Cat Mouse Tetraodon Lemur Chimpanzee Orangutan Rat Platypus Chicken Zebrafish

50 100 150 200 250 300 350400 450 500 Human Estimated Evolutionary Distance from (Myr) Adapted from Kumar & Hedges, Nature 1998

UCSC View of Multiple Sequence Alignments

20 Kb

35 Multi-Species Weighted Conservation Score

ƒ Takes into Account the Different Divergence Rates of Each Species – “A Chicken Alignment Will Contribute More Than a Baboon Alignment” ƒ Based On the Substitution Rates at Bases under Neutral Selection – Calculated from 4-Fold Degenerate Positions

Human GCGGGGGCCTTCGGACCGCGCGGCG i = identity Cat iiiiiiiiiiimimiiimiiiimii m = mismatch m+miiiiiimimiiim++iiiiiim Chicken insertion Chimpanzee iiiiiiiiiiiiiiiiiiiiiiiii + = Baboon iiiiiiiiiiimiiiiiiiiiiiii - = unalignable 83% Dog84%iiiiiiiii+++++++++++++iii Cow immiiiimimmmiiiiiiiiiimii Pig64%iiimiiiiiimmimiiiiimiimii Rat im++++++++mimiiimmiiiimmm Mouse immiiimii+++miiimmiiiimmm 48% Fugu ------Tetraodon ------Zebrafish ------

Weighted Conservation Score

skip

36 Multi-Species Conservation Score Distribution

Coding 0.30 Non-Coding

0.25 Multi-Speci e s Conserved 0.20 Sequences

0.15 Percent of Total 0.10 99.2% of All Coding Exons

0.05

Fraction of Total Coding or Non-Coding Bases 0.00 -5-4-3-2-10123456789 Conservation Score

Multi-Species Conservation Score

Genome Research (2003) 13:2507-2518 MCS Multispecies Conserved Sequence

Discrete Regions “Noisy” of Sequence Alignments “MCS Highly Conserved Generator” Sequence

37 MCSs

1.8 Mb Sequence Data

95% Unknown ~72% Ancient

M C Repeats (2.6%) Non- MCSs Ss UTRs (3.8%) 5% 22% Coding

38 -Specificity of MCSs in Mammals

CarnivoresArtiodactyls Rodents

Cat Dog Cow Pig Mouse Rat

Monotreme Marsupials

Platypus Wallaby Opossum

For each MCS: Catalog All the Species in Which it is Present

skip

MCSs Artiodactyls Cow Pig Placental 4.5% Carnivores Cat Dog Placental + 17% Marsupials Rodents Mouse Rat

Monotreme All Platypus 52% X Mammals Marsupials XWallaby XOpossum

39 MCS Overlap with Mouse Alignments

320,000 Missed by Mouse Detected by Mouse Unique to Mouse 280,000

240,000

200,000

160,000 ‘False Bases Bases Positives’ 120,000 Total MCS 80,000 Detected Bases 40,000 Missed

0 70 72 74 76 78 80 82 84 86 88 90 92 94 96 Percent Identity Threshold

Detection of MCSs with Different Species

ƒ Investigating the Relative Contribution of Different Species’ Sequences to MCS Detection using More Quantitative Approaches

ƒ Re-Compute Conservation Score for All* Possible Subsets of Species ƒ Compare to a ‘Reference Set’ of MCSs – Generated with All Species – Surrogates for Conserved Functional Elements

40 Single Species Performance

Coding UTR 80,000 AR Not Annotated

60,000 Non-Coding

40,000 Ancient Repeats Reference SetBases MCS 20,000 UTRs Coding

0

L s n t s t L a i se a w ig pu ke R u C P bbit A y c Fugu Co a t gehog i elph la d h Mo Sheep R Wallaby P C He Opossum Monod

Best Performing Subsets

Platypus Rodent 80,000 Cat/Cow

Platypus 60,000 Dog, Cow, Mouse, Rat, & Chicken Rodent

40,000 Mouse, Rat, & Chicken Wallaby Reference Set MCSs Set Reference

20,000

0 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

41 More Species is Better

MCS Detection and Sequence Quality

ƒ To date, MCS detection has been with reasonably high-quality sequence

ƒ What quality of sequence is desired for MCS detection — especially provided a set of high-quality reference sequences?

42 MCS Detection and Sequence Quality

ƒ What Tradeoffs are encountered between sequence coverage vs. number of species?

Our Data Set Provides a Unique Opportunity to Explore these Questions

1) Re-create 0.5X, 1X, 2X… Read-Coverage Datasets 2) Analyze for MCSs 3) Compare to “Finished” MCSs

Sequence Coverage vs. MCS Detection

1

0.8

0.6 Cat Lemur Horse Dog 0.4 Hedgehog Cow Armadillo Pig Rat Rabbit Wallaby Mouse 0.2 Chicken Platypus Pufferfish Opossum

Fraction Species) (16 MCS of Bases All 16 species Best 8 species 0 0X 1X 2X 3X 4X 5X 6X Sequence Coverage

43 Low-Redundancy Sequencing of Multiple Vertebrate

Margulies et al., (2005) PNAS, 102:3354-3359

44