Comparative Genomics: Computational Challenges


Bernard M.E. Moret Laboratory for Computational Biology and Bioinformatics EPFL

Nantes, 6/8/09

Overview

- Comparative approaches
- The genome and its evolution
- High-throughput data and computation
- What do we want to know?
- Comparing two genomes
- Comparing multiple genomes
- Ancestral reconstruction
- Challenges

Comparative Approaches

“Nothing in biology makes sense except in the light of evolution.” Th. Dobzhansky, The American Biology Teacher, 1973.

- an evolutionary perspective requires models of evolution
- a model of evolution requires data that reveals evolution
- data that reveals evolution must come from several organisms or tissues
- hence working in the light of evolution requires comparative approaches

From the point of view of experimentalists and medical researchers

- some organisms, incl. humans, are difficult to study in a lab setting
- some experiments cannot be performed on some organisms, incl. humans, for practical or ethical reasons
- hence learning about these organisms is best done by studying others

Characteristics of Comparative Approaches

Comparative approaches
- rely on identification of conserved patterns
- develop evolutionary models for observed changes
- use conserved patterns for datamining and evolutionary models for analysis of mined data

Translated to comparative genomics:
- data: whole-genome sequences
- conserved patterns: subsequences, distribution statistics, or combinations
- models: duplication/loss/rearrangement at the genome level, mutation/indel for nucleotides
- datamining: important components, anchors, clusters, syntenic blocks

Comparative Genomics Vocabulary

- syntenic block: conserved pattern (subject to microrearrangements) used to denote a conserved block of genes (10 kbp to 1 Mbp); originally used in genetics to denote colocation on the same chromosome
- genomic alignment: sequence-level alignment of complete genomes, with block-level rearrangements, duplications, and losses
- positive (Darwinian) selection: selects for favorable traits; accelerates observed change in affected regions
- negative (filtering) selection: selects against changes of most kinds; slows down observed change in affected regions
- ancestral reconstruction: inference of the putative contents and arrangement of the genome of a common ancestor
- genomic signature: originally, distribution of dinucleotide frequencies; now, characterization of common patterns in a group of genomes
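As an illustration of the original meaning of "genomic signature", the following minimal Python sketch tabulates overlapping dinucleotide frequencies for one sequence. The function name and the choice to skip pairs containing ambiguity codes are assumptions made for this example, not part of the talk.

```python
from collections import Counter

def dinucleotide_frequencies(seq):
    """Return relative frequencies of overlapping dinucleotides in seq."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Assumption for this sketch: ignore pairs with ambiguity codes such as N.
    pairs = [p for p in pairs if set(p) <= set("ACGT")]
    counts = Counter(pairs)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}

# Toy sequence; a real signature would be computed over a whole genome.
sig = dinucleotide_frequencies("ACGTACGT")
for pair in sorted(sig):
    print(pair, round(sig[pair], 3))
```

Comparing such frequency tables across genomes (for instance by a simple distance between the vectors) is one way to turn the signature into a comparative statistic.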

What evolutionary events affect the genome?
- nucleotide-level: “classical” sequence evolution (mutations and indels)
- genomic rearrangements: inversions, transpositions, translocations, and chromosomal fusion and fission
- duplication: gene retrotransposition, tandem duplication, segmental duplication, whole-genome duplication
- loss: point mutation, segmental deletion, neofunctionalization
- recombination: meiotic recombination, hybridization, lateral gene transfer

Evolutionary Models

And how well understood are they?
- nucleotide-level: well established models with good statistics
- genomic rearrangements: enormous work in the last 10 years, but still parameter-poor
- duplication/loss: established work in lineage sorting (divergent gene evolution due to paralogs), much attention to whole-genome duplication, just starting on segmental duplications
- recombination: established work in population genetics, much work on identifying lateral gene transfer, detailed work on recombination just starting

High-throughput data

Slowly pervading laboratory-based biology:
- sequencing: the original high-throughput data source, now easily the most economical high-tech lab instrument
- gene expression: microarrays and their ilk, now inexpensive and in very widespread use
- transcription profiling: ChIP-chip, ChIP-seq, and future products, soon to replace microarrays
- mass spectrometry: for protein analysis and sequencing; now also for mixed samples (metaproteomics)
- SNP assays: for precise genotyping of humans
- other domains: cell signalling, metabolomics (e.g., fluxes), 3D imaging, time series, etc.

High-throughput sequencing

Developed for the human genome project, now also used for:
- de novo sequencing: still the most challenging
- resequencing: verify base calls, test assemblies, etc.
- deep sequencing: dense sampling or high coverage
- metagenomics: random sampling of microbial communities

Current technologies (454, Illumina) generate around 4 Gbp per half-day run.
Next-generation technologies may yield over 20 Gbp per hour-long run, at better than 50x coverage, with very short (20 bp) fragments.

Can computation keep up?

Computer power still follows Moore’s law, doubling every 12–15 months. However:
- That power is getting harder to use (parallelism is hard to exploit).
- Data accumulates faster than Moore’s law (sequence data alone doubles every year).
- This comparison presupposes a linear relationship, but most genomic analysis algorithms are much slower than linear.
- High-performance computing does not help much: the fastest machines can provide only $10^3$–$10^4$ speedup.
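To make the arithmetic behind this argument concrete, here is a small Python sketch. It assumes data doubles every 12 months and compute power every 15 months (the figures quoted above) and that an algorithm's cost grows as a power of the data size; the function name and the exponents tried are illustrative assumptions only.

```python
def relative_runtime(years, exponent, data_doubling_months=12.0,
                     compute_doubling_months=15.0):
    """Running time after `years`, relative to today, for an algorithm whose
    cost grows as (data size) ** exponent on hardware that keeps doubling."""
    data_growth = 2.0 ** (12.0 * years / data_doubling_months)
    compute_growth = 2.0 ** (12.0 * years / compute_doubling_months)
    return data_growth ** exponent / compute_growth

# Linear, quadratic, and cubic algorithms over a five-year horizon.
for exponent in (1, 2, 3):
    print(exponent, relative_runtime(5, exponent))
```

Under these assumptions even a linear-time analysis falls behind by a factor of 2 in five years, and a quadratic one by a factor of 64, which is why a one-off $10^3$–$10^4$ speedup from a larger machine buys only a few years.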

Genome-scale computing

Running time is only one facet of the problem. Comparing several genomes of a few Gbp each requires a lot of memory: at least 128GB per node.
- Available and not too expensive: a 16-core compute node with 128GB memory and 500GB disk can be had for $20K.
- Still rare: most compute clusters have “thin” nodes (2–8GB of memory), unsuited to whole-genome analysis.
- 2010 architectures will pack 64–128 cores per node, with 0.5–1TB memory; great for comparing a few genomes, but by then we will have 100s of vertebrate genomes...

Basic annotation per genome

We want to identify:
- all coding genes (or exons);
- all noncoding genes;
- gene families;
- SINEs, LINEs, and other repeat elements;
- regions under positive, neutral, or negative selection.

Beyond this basic level, we want to identify
- gene clusters, operons, alternative splicing scenarios, etc.;
- gene function.

Comparative annotation

Using pairwise comparative approaches, we can ask for
- all pairwise homologies;
- orthology and paralogy relationships within gene families (within limits due to lack of phylogenetic information);
- syntenic blocks;
- mapping of the syntenic blocks between the two genomes (simple translocations and transpositions, with inversions);
- translation of functional annotations from each genome into the other.

A simple bacterial example: Rickettsiae

R. conorii (Mediterranean spotted fever) vs. R. prowazekii (typhus), about 15%; from Ogata et al., Science 293(5537):2093–2098 (2001)

[Figure: side-by-side gene maps of corresponding regions of the two genomes (upper-track coordinates 600000 to 800000, in 50 kbp panels), showing shared genes such as uvrD, lon, trxB1, trxB2, nrdA, nrdB, and sca4 and their relative order.]

Phylogenetic annotation

Once we introduce phylogenetic information, we can study mechanisms and reconstruct events.

We can ask for the history of:
- duplications and losses of genes;
- rearrangements;
- intron gains and losses;
- recombinations of various types;
- lateral gene transfer;
- and any other events of interest.

Example: pathogenic genes in fungi

From Soanes et al., The Plant Cell 19:3318–3326 (2007)

Genomic alignment

- Combines sequence-level alignment with block-level rearrangements, duplications, and losses.
- Large syntenic blocks reduce computational work in eukaryotic genomes and may also serve as approximations for functional clusters.
- Sequence evolution is parameterized separately for each block.
- Tools include Mauve and Shuffle-Lagan, with MultiZ (align against the human genome) used for vertebrates.

Much remains to be done:
- Handling of rearrangements/duplications/losses is very limited to date.
- Handling of rearrangements in local alignments is poor.
- Determining syntenic blocks is thus hard, hence the many tools: GRIMM-Synteny, AdHore, Cinteny, CS7, FISH, OrthoCluster, TEAM, etc.

Example: comparative mapping

Human chromosomal regions colored by mouse chromosomal code

Note: large color blocks may be composed of many syntenic blocks. Mapping is a first alignment step: rearrangements among elements, mapping within each chromosome, local alignment of syntenic blocks, must all follow.

Example: synteny for cotton/Arabidopsis

From Rong et al., Genome Research 15:1198–1210 (2005).

Solid and dotted vertical colored lines denote syntenic blocks found by FISH and CS7.

Visualizing the output

Two obvious problems:
- How do we condense the data for presentation from genomic scales (billions) to human scales (tens)?
- What forms of data presentation are most useful to researchers?

For the first:
- visual aids: zooming interfaces, perspective walls
- mathematical techniques: projections
- semantic techniques: high-level representations

For the second, much depends on:
- the problem
- the researcher’s training
- the researcher’s perceptual abilities

Genomic rearrangements

Rearrangements alter the order and strandedness of genomic regions, from subsequences through genes to syntenic blocks.
To study rearrangements, genomes are represented by ordered sequences of signed indices, each index representing a gene or syntenic block.
Rearrangements can be characterized by
- their outcome: breakpoints
- their mechanism: inversions, transpositions, translocations
- a mathematical model: permutations, the Nadeau-Taylor model, double-cut-and-join (DCJ)

In any framework, they present challenging algorithmic questions.

Rearrangements: viewpoints

[Figure: genome G1 = (1 2 3 4 5 6 7 8) and the results of three operations on it (transposition, inversion, inverted transposition), with G2 = (1 2 −5 −4 −3 6 7 8); breakpoints (arrows) are missing adjacencies.]

Double-cut-and-join makes two cuts in the genome, then reglues the ends.

With one cut on each of two chromosomes, translocation and fusion can occur. With two cuts on the same chromosome, inversion and fission can occur. Two successive DCJs, one with fission, one with fusion, cause a block exchange.
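The signed-permutation view on this slide can be sketched in a few lines of Python. The example below applies an inversion to G1 to obtain G2 and counts the resulting breakpoints as missing adjacencies. The function names and the framing elements 0 and n+1 are conventions chosen for this sketch, assuming linear (not circular) genomes.

```python
def invert(genome, i, j):
    """Invert genome[i:j]: reverse the segment's order and flip every sign."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]

def breakpoints(genome, reference):
    """Count adjacencies of `reference` that are missing from `genome`.
    Both are linear signed permutations on the same genes; framing elements
    0 and n+1 are added so that the two ends also count."""
    n = len(reference)

    def framed(g):
        return [0] + list(g) + [n + 1]

    present = set()
    fg = framed(genome)
    for a, b in zip(fg, fg[1:]):
        present.add((a, b))
        present.add((-b, -a))  # the same adjacency read on the other strand
    fr = framed(reference)
    return sum((a, b) not in present for a, b in zip(fr, fr[1:]))

G1 = [1, 2, 3, 4, 5, 6, 7, 8]
G2 = invert(G1, 2, 5)        # invert the segment 3 4 5
print(G2)                    # [1, 2, -5, -4, -3, 6, 7, 8]
print(breakpoints(G2, G1))   # 2: adjacencies (2,3) and (5,6) are lost
```

A single inversion creates at most two breakpoints, which is why breakpoint counts give a simple lower bound on the number of inversions separating two genomes.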

The breakpoint graph

A graph representation of two (or more) orderings of genes. One ordering is used to represent the identity permutation. Each gene is represented by two vertices (+ and -), the current permutation is given by solid edges, the identity by dashed edges.

[Figure: breakpoint graph for the signed permutation (−2 4 3 −1), drawn on the vertex sequence L, 2−, 2+, 4+, 4−, 3+, 3−, 1−, 1+, R.]

Every vertex has degree 2, with one solid and one dashed edge. Thus the graph decomposes into alternating cycles (here two cycles).
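The cycle decomposition can be checked mechanically. The Python sketch below builds the solid and dashed edges for a linear signed permutation and counts alternating cycles. The vertex naming (a positive gene g drawn as g+ then g−, with terminal vertices L and R) follows the figure as far as it can be read from the slide, and the function name is an assumption made for this example.

```python
def breakpoint_graph_cycles(perm):
    """Count alternating cycles in the breakpoint graph of a linear signed
    permutation `perm` against the identity 1..n."""
    n = len(perm)

    def ends(g):
        # Left and right vertex of a gene as drawn on the slide.
        x = abs(g)
        return [(x, "+"), (x, "-")] if g > 0 else [(x, "-"), (x, "+")]

    def adjacency_edges(order):
        # Join the right end of each gene to the left end of the next one,
        # with terminal vertices L and R framing the genome.
        path = ["L"]
        for g in order:
            path.extend(ends(g))
        path.append("R")
        partner = {}
        for i in range(0, len(path), 2):
            u, v = path[i], path[i + 1]
            partner[u], partner[v] = v, u
        return partner

    solid = adjacency_edges(perm)               # current permutation
    dashed = adjacency_edges(range(1, n + 1))   # identity permutation
    seen, cycles = set(), 0
    for start in solid:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = solid[v]        # follow a solid edge ...
            seen.add(w)
            v = dashed[w]       # ... then a dashed edge
    return cycles

print(breakpoint_graph_cycles([-2, 4, 3, -1]))  # 2
print(breakpoint_graph_cycles([1, 2, 3, 4]))    # 5
```

The identity permutation attains the maximum of n+1 cycles under this convention, so the number of cycles measures how close a permutation is to being sorted.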

Results on inversions

- formulation and first results (1986)
- breakthrough theorem (Hannenhalli and Pevzner 1997): edit distance = # genes − # cycles in BP graph + # hurdles + # fortresses
- optimal $O(n)$ distance computation (2001)
- use in unichromosomal phylogenetic reconstruction (2001)
- evolutionary distance estimators (2002)
- use in multichromosomal phylogenetic reconstruction (2002)
- improved theoretical framework (2002)
- inversions combined with duplication/loss (2004)
- probabilistic framework (2007)
- optimal $O(n \log n)$ sorting (2009)

Results on DCJ

DCJ unifies various rearrangements: inversions, transpositions, block exchanges, translocations, fissions, and fusions are all aspects of DCJ.
- formulation of the problem (2005)
- improved theoretical framework (2006)
- evolutionary distance estimators (2008)
- probabilistic framework (2008)
- fast median computation (2008)

DCJ is particularly useful for foundational work in combinatorics, algorithmics, and statistics, but also appears to work well with biological data.

Open questions

Research on rearrangements has yielded a wealth of results, but much remains to be done to make it useful in practice. How do we:
- determine the relative importance of operations?
- parameterize operations as a function of the locations and lengths of affected segments?
- estimate breakpoint reuse (rearrangement hotspots)?
- characterize and optimize rearrangement scenarios with additional dependencies or constraints?
- specify sufficient biological constraints to make ancestral reconstruction possible?

Multiple Mauve alignment of Yersinia spp.

Two is easy, more is hard

An optimization problem that can be solved efficiently for two objects often becomes intractable for three or more objects. Examples include satisfiability of Boolean formulae in conjunctive normal form (easy for clauses of 2 literals, hard for clauses of 3 literals), matching (easy for matching pairs, hard for matching triples), problems on graphs of fixed degree (trivial on graphs of degree 2, often hard on graphs of degree 3), etc.

In comparative genomics, the two basic problems exhibit this same behavior. Sequence alignment, and hence also genomic alignment, is easy for two sequences, hard for more. Finding the rearrangement median between genomes is easy for two genomes, hard for more.

Multiple sequence alignment

It remains the “single point of failure” of comparative genomics.
- all methods attempt to reduce multiple alignment to a series of pairwise alignments
- every popular tool is based on progressive alignment, using some assumed (or heuristically built) phylogenetic tree
- even the best alignment packages (MAFFT, ProbCons, Muscle) handle only point mutations and indels
- results tend to be poor for sequences with significant divergence

In comparative genomics, the phylogeny is often known, yet even then progressive alignment may be poor.

Sankoff's problem

The best formulation of multiple sequence alignment is due to Sankoff (1975).

Given $n$ sequences to align, find a binary tree on $n$ leaves, an assignment of the $n$ sequences to the $n$ leaves, and $n-1$ sequences labelling the internal nodes of the tree (“ancestral” sequences), that together optimize the sum, taken over all edges of the tree, of the pairwise alignment scores of the sequences associated with the two endpoints of each edge.
- Sequences can be replaced by genomes, as gene maps, gene orders, full genome sequences, etc.
- Edge scores can be parsimony-based or reflect likelihood under some model.

This formulation requires reconstruction of the full history of the given sequences from their last common ancestor—a very hard task. No tool exists for this problem at present, except for small-scale work on gene-order data.
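The objective function itself is easy to evaluate once a tree and its labels are fixed; the difficulty lies in searching over trees and ancestral labels. The Python sketch below scores one candidate solution, using unit-cost edit distance as a stand-in parsimony-style edge score (to be minimized). The tree, the sequences, and the function names are hypothetical and chosen only for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (substitutions and indels) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def tree_score(edges, labels, dist=edit_distance):
    """Sum of pairwise scores over all edges of a labelled tree."""
    return sum(dist(labels[u], labels[v]) for u, v in edges)

# Hypothetical instance: n = 3 observed sequences at the leaves A, B, C and
# n - 1 = 2 candidate ancestral sequences at the internal nodes x and root.
labels = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "x": "ACGA", "root": "ACGA"}
edges = [("x", "A"), ("x", "B"), ("root", "x"), ("root", "C")]
print(tree_score(edges, labels))  # 2
```

Swapping `dist` for a genomic distance (for example a breakpoint or inversion distance on gene orders) gives the genome-level version of the same objective.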

Example: Sankoff's problem

[Figure legend: observed sequence; ancestral reconstruction; pairwise alignment.]

All pairwise alignments involve at least one ancestral sequence.

The median problem

The “other” big computational problem in comparative genomics is deceptively simple:

given k genomes (usually 3), find a new genome that minimizes the sum of the pairwise genomic distances from itself to the given k genomes

Finding a median is
- a key step in phylogenetic reconstruction and thus for Sankoff’s problem
- the most common approach to ancestral reconstruction

Median optimization is intractable under most measures of genomic distance.
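For intuition only, the median of three can be found by exhaustive search when the genomes have a handful of genes. The Python sketch below does so under the breakpoint distance on linear signed permutations; it enumerates all n! · 2^n candidates, so it illustrates the definition rather than any practical method. The function names and the example genomes are assumptions made for this sketch.

```python
from itertools import permutations, product

def breakpoint_distance(g, h):
    """Number of adjacencies of h (framed by 0 and n+1) that are absent from g."""
    n = len(h)

    def framed(x):
        return [0] + list(x) + [n + 1]

    present = set()
    fg = framed(g)
    for a, b in zip(fg, fg[1:]):
        present.add((a, b))
        present.add((-b, -a))  # same adjacency on the other strand
    fh = framed(h)
    return sum((a, b) not in present for a, b in zip(fh, fh[1:]))

def brute_force_median(genomes):
    """Try every signed permutation; feasible only for very few genes."""
    n = len(genomes[0])
    best, best_score = None, None
    for order in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            cand = [s * g for s, g in zip(signs, order)]
            score = sum(breakpoint_distance(cand, g) for g in genomes)
            if best_score is None or score < best_score:
                best, best_score = cand, score
    return best, best_score

genomes = [[1, 2, 3, 4], [1, 2, -4, -3], [1, -3, -2, 4]]
median, score = brute_force_median(genomes)
print(median, score)
```

With 4 genes the search visits 384 candidates; at 20 genes it would already exceed 10^24, which gives a feel for why heuristics and bounds are needed.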

Taming median computations

We cannot avoid the problem entirely (except with progressive alignment), but we can find ways of estimating or approximating medians quickly:

- progressive alignment: replaces median by a pairwise alignment profile
- minimum spanning tree: easily computed, then altered heuristically to produce a phylogenetic tree
- tight bounding on edge scores: based on mixed integer-linear programming, one set for each tree
- greedy methods: in a median of three, repeatedly move from one end towards the other two simultaneously; fork when no such move exists
- compression methods: identify commuting or noninterfering operations (for which a single path suffices)
- decomposition methods: successfully used for DCJ medians

The UPenn/UCSC aligner

The UC Santa Cruz Human Genome browser includes 27 additional tracks for 27 other vertebrate genomes, including reptiles, amphibians, and fishes. The alignment is a combination of star alignment (everything is referenced to the human genome) and progressive alignment, produced by MultiZ through a complex data-handling pipeline.

Independent assessments of the alignments find less than 10% of the alignments to be “suspicious.” Vertebrates are all very closely related, making alignment much easier, but their genomes are huge, so this pipeline is a major achievement.

Issue in alignment of zebrafish to vertebrates

Part of the 28-vertebrate genome alignment at UCSC. From Prakash and Tompa, Genome Biology 8:R124 (2007).

ECR browser tracks

(very similar to the UCSC browser tracks)

Assessing genomic alignments

Most genomic alignments published to date are of:
- strains of the same species of bacteria (e.g., E. coli)
- minimally divergent species in the same genus (e.g., Yersinia spp.)
- larger taxonomic groups with a short history (e.g., vertebrates)

In all cases, the small divergence facilitates the work.

Genomic alignments are tested by the community mostly by examining local sequence alignments (using the tracks). This approach:
- is not a high-throughput process!
- ignores rearrangement handling
- lacks supporting evolutionary scenarios

Ancestral genomes

Medians are used to reconstruct ancestral genomes, but the two are quite distinct.

Medians:
- store intermediate algorithmic results
- answer a very simple optimization criterion

Ancestral genomes:
- biological constraints are presumably numerous, but we know very little about them, and so good probabilistic models are lacking
- reconstruction is severely underconstrained
- “optimal” solutions (i.e., medians) abound

Negative results

For moderately diverged genomes, current biological constraints do not suffice.

[Figure: edit distance from reconstructed label to true label (y-axis, 0 to 250) for each internal node (x-axis, 0 to 10), under five mixes of operations: inversions only; inversions and insertions; inversions and deletions; all operations with low insertions/deletions; all operations with high insertions/deletions.]

12 γ-proteobacteria, from Earnest-DeYoung et al., WABI 2004. Nodes 3, 4, 5, and 7 are two levels above the leaves and show large errors under all mixes of operations.

Positive results

- Tracking gene clusters or operons across lineages.
- Ancestral genomes claimed for mammalian genomes at coarse resolution.
- Assembly approach instead of median: ancestral genome in a star tree built from many known syntenic blocks by selecting some and assembling them into a (possibly incomplete) genome.
- Signature approach by identifying rearrangements common to all shortest evolutionary paths.

Modelling challenges

- Details of rearrangements: affected areas by position and length, forbidden areas (centromeres?), relative frequencies.
- Details of duplications: affected areas by position and length, relative frequencies.
- Interactions between duplications/losses and rearrangements.
- Constraints from gene clusters or operons.
- Combining gene-level and sequence-level models.
- Effects of the type of evolutionary selection.

Algorithmic challenges

- Automatic syntenic block identification.
- Scalable rearrangement handling.
- Sankoff’s problem.
- Orthology assignment using sequence- and gene-level models.
- Alignment using sequence- and gene-level models.
- Presentation of results.

Assessment challenges

- Automated assessment tools using independent methods.
- Data collection for such assessments.
- Resampling (in the style of “bootstrapping”) methods.
- Interactive presentation of results.
- High-throughput lab methods for verification.

Thank you