
genetic ancestry (When branches get short...)

Ingo Ebersberger, Greg Ewing, Sascha Strauss, Heiko Schmidt and Arndt von Haeseler

Center for Integrative Bioinformatics in Vienna (CIBIV) http://www.illgenauktionen.de/Katalog/RIMG1770.JPG

The Coalescent Model

- Wright-Fisher population model with constant population size.
- Each individual chooses its parent at random from the previous generation.
- Pairs of lineages are traced back to a coalescent event.
- Kingman (1982) developed a continuous model that allows the times between coalescent events to be estimated.
- The coalescent rate for any pair of genetic lineages is proportional to 1/Ne in generations or to 1/θ in substitutions.

Reconstructing phylogenetic relationships of contemporary sequences

Inferring the species tree from the sequence tree

When internal branches get short...
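To make the effect of a short internal branch concrete, here is a minimal Python sketch of the standard three-lineage coalescent result (an illustration, not code from the slides). It assumes a species tree ((P1,P2),P3) whose internal branch lasts `t` coalescent units. The P1 and P2 lineages fail to coalesce on that branch with probability exp(-t), and in that case each of the three possible pairs is equally likely to coalesce first. The function names and the simulation setup are hypothetical.

```python
import math
import random


def topology_probs(t):
    """Analytic probabilities of the three rooted sequence trees.

    t is the internal branch length of the species tree ((P1,P2),P3)
    in coalescent units.
    """
    no_coal = math.exp(-t)  # P1 and P2 lineages stay separate on the branch
    return {
        "((P1,P2),P3)": 1.0 - (2.0 / 3.0) * no_coal,  # congruent
        "((P2,P3),P1)": no_coal / 3.0,                # incongruent
        "((P1,P3),P2)": no_coal / 3.0,                # incongruent
    }


def simulate(t, n_loci=100_000, seed=1):
    """Monte Carlo estimate of the same probabilities."""
    rng = random.Random(seed)
    names = list(topology_probs(t))
    counts = {name: 0 for name in names}
    for _ in range(n_loci):
        # A single pair coalesces at rate 1, so its waiting time is Exp(1).
        if rng.expovariate(1.0) < t:
            counts["((P1,P2),P3)"] += 1
        else:
            # Three lineages enter the ancestral population;
            # each pair is equally likely to coalesce first.
            counts[rng.choice(names)] += 1
    return {name: c / n_loci for name, c in counts.items()}


if __name__ == "__main__":
    for t in (0.1, 1.0, 3.0):
        print(t, topology_probs(t), simulate(t))
```

As `t` approaches zero, all three sequence trees approach probability 1/3, which is the situation the 1/3 annotation in the figure refers to.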

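[Figure: the three possible sequence trees for populations P1, P2 and P3, with coalescences MRCA(P1,P2), MRCA(P2,P3) and MRCA(P1,P3); annotation: 1/3]

A well known example

Resolved ML trees distributed across the 15 topologies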

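[Table: number of resolved ML trees for each of the 15 topologies, shown in two panels; the topology drawings are not recoverable]

All              | All
20 (0.17%)       | 4 (0.03%)
9,147 (76.47%)   | 1,369 (11.46%)
19 (0.16%)       | 13 (0.11%)
0                | 5 (0.04%)
0                | 1,361 (11.39%)
5 (0.04%)        | 0
0                | 0
0                |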

((H,C),G) is the most frequently observed sequence tree

9,147 out of 11,877 sequence trees (77%) support ((H,C),G).
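As a rough worked example (an illustration, not a calculation shown on the slides), the counts of the three dominant topologies can be converted into an estimate of the internal branch length. The sketch assumes the standard three-lineage coalescent expectation that each incongruent topology has probability exp(-T)/3, and it ignores tree reconstruction error. The variable names and the labels of the two incongruent topologies are hypothetical.

```python
import math

# Counts of the three dominant ML topologies from the table above.
counts = {"((H,C),G)": 9147, "incongruent_a": 1369, "incongruent_b": 1361}

total = sum(counts.values())  # 11,877 trees
incongruent = total - counts["((H,C),G)"]
frac_incongruent = incongruent / total

# Under the model, frac_incongruent = (2/3) * exp(-T), so exp(-T) is the
# share of loci whose human and chimp lineages did not coalesce in the
# human-chimp ancestral population.
frac_not_coalesced = 1.5 * frac_incongruent
T = -math.log(frac_not_coalesced)

print(f"incongruent fraction: {frac_incongruent:.3f}")
print(f"share not coalesced:  {frac_not_coalesced:.3f}")
print(f"internal branch T:    {T:.2f} coalescent units")
```

Under these assumptions the share of loci that did not coalesce in the human-chimp ancestor comes out at about 0.345, close to the ~35% quoted later in the talk.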

The fossil record on primate evolution

Sahelanthropus (~6.5 MYBP)
Sivapithecus (~13 MYBP)
Proconsul (~18 MYBP)
Morotopithecus (~21 MYBP)
Aegyptopithecus (~33 MYBP)

Dating of speciation events during primate evolution

Estimated dates of the splits from the human lineage (MYBP)

Calibration                                                        | Human/Chimp | Gorilla    | Orang   | Rhesus
Human-chimp split (min: Sahelanthropus; max: arbitrary)            | 6.5 - 10    | 7.3 - 19   | 20 - 45 | 36 - 79
Human-orang split (min: Sivapithecus; max: Proconsul)              | 2.9 - 6.3   | 3.4 - 11.3 | 13 - 18 | 21 - 34
Human-macaque split (min: Morotopithecus; max: Aegyptopithecus)    | 2.7 - 6.3   | 3.0 - 12.0 | 10 - 20 | 21 - 33

modified from Patterson et al. (2006) Nature 441:1103-1108

The fossil record on primate evolution

? Sahelanthropus (~6.5 MYBP)
Sivapithecus (~13 MYBP)
Proconsul (~18 MYBP)
Morotopithecus (~21 MYBP)
Aegyptopithecus (~33 MYBP)

A fraction of the human genome is evolutionarily old

[Figure: the three sequence trees for Human, Chimp and Gorilla, with coalescences MRCA(H,C), MRCA(C,G) and MRCA(H,G); annotations: 1/3, 1,369 (11.5%), 1,361 (11.4%), ~11.5%]

~35% of the human genome is evolutionarily old

Genomic location of incongruent sequence trees

The position of incongruent sequence trees with respect to genes

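[Table: one row per topology, in two panels; the topology drawings are not recoverable]

All (%)       | Gene (%)      | Exon (%)    | All (%)       | Gene (%)    | Exon (%)
20 (0.17)     | 8 (0.17)      | 2 (0.32)    | 4 (0.03)      | 1 (0.02)    | 0
9,148 (76.58) | 3,814 (78.85) | 487 (78.93) | 1,369 (11.46) | 504 (10.42) | 63 (10.21)
19 (0.16)     | 10 (0.21)     | 2 (0.32)    | 13 (0.11)     | 6 (0.12)    | 1 (0.16)
0             | 0             | 0           | 5 (0.04)      | 0           | 0
1 (0.01)      | 0             | 0           | 1,361 (11.39) | 492 (10.17) | 62 (10.05)
5 (0.04)      | 2 (0.04)      | 0           | 0             | 0           | 0
0             | 0             | 0           | 0             | 0           | 0
0             | 0             | 0           |               |             |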

The position of incongruent sequence trees with respect to genes: An example

1) KIR3DX1
2) LILRA2
3) LILRA1
4) LILRB1
5) LILRB4

The ML-tree of the LILRA Gene Family (CDS)

Function of LILRA1

- Leukocyte immunoglobulin-like receptor (subfamily A)
- expressed in monocytes and B-cells
- plays a role in negative regulation of immune response
- cell surface receptor linked signal transduction

Human genes to go (> 300 bp exonic sequence covered)

#  | TREE_ID  | POSTPROB | EXONIC | DESCRIPTION
1  | ((H,G)C) | 1        | 869    | transmembrane protease
2  | ((C,G)H) | 1        | 835    | Hermansky-Pudlak syndrome 1 protein
3  | ((C,G)H) | 1        | 767    |
4  | ((C,G)H) | 1        | 757    | meningioma (disrupted in balanced translocation) 1 (MN1)
5  | ((H,G)C) | 1        | 756    | Zinc finger protein 230 (Zinc finger protein FDZF2)
6  | ((C,G)H) | 1        | 740    | solute carrier family 22 (organic cation transporter)
7  | ((H,G)C) | 1        | 736    | Sel-1 homolog precursor (Suppressor of lin-12-like protein) (Sel-1L)
8  | ((C,G)H) | 1        | 696    | Docking protein 5 (Downstream of tyrosine kinase 5) (IRS6) (Protein dok-5)
9  | ((H,G)C) | 1        | 693    | Melatonin-related receptor (G protein-coupled receptor 50) (H9)
10 | ((C,G)H) | 1        | 688    | FLJ35348 (FLJ35348) on chromosome 9
11 | ((H,G)C) | 1        | 687    | Membrane progestin receptor beta (mPR beta)
12 | ((H,G)C) | 1        | 676    |
13 | ((C,G)H) | 1        | 647    | Disks large-associated protein 1 (DAP-1)
14 | ((H,G)C) | 1        | 647    | Olfactory receptor 6S1
15 | ((C,G)H) | 1        | 634    | Secretogranin-2 precursor (Secretogranin II)
16 | ((C,G)H) | 1        | 619    | CUB and sushi domain-containing protein 2
17 | ((C,G)H) | 1        | 619    | HMG2 like isoform 1
..
58 | ((H,G)C) | 0.95     | 365    | Peptidyl-prolyl cis-trans isomerase A

Amino acid substitution patterns depend on the sequence phylogeny

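[Table: amino acid (AA) positions per sequence tree, with counts of X Y Y substitution patterns for the orderings H,C,G and C,H,G; the assignment of the pattern counts to columns is not fully recoverable]

Tree  | AA positions | X Y Y pattern counts
H C G | 9,482 (75%)  | (17) (59%), 24 (41%)
H G C | 1,449 (11%)  | 12 (41%)
C G H | 1,787 (14%)  | 34 (59%), (17) (59%)

When did species-specific morphological characteristics evolve?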

? Sahelanthropus (~6.5 MYBP)
Sivapithecus (~13 MYBP)
Proconsul (~18 MYBP)
Morotopithecus (~21 MYBP)
Aegyptopithecus (~33 MYBP)

When subsequent branches get compressed...

Radiation: Speciation events compressed in time

Compression of subsequent branches can cause Anomalous Gene Trees (AGTs)

ANOMALOUS SEQUENCE TREES: incongruent sequence trees that are more likely to be observed than congruent sequence trees

Degnan and Rosenberg (2006) PLOS Genetics 2:762-769

The problem of anomalous gene trees

Not all unlabelled rooted tree topologies are equally probable

[Figure: the two unlabelled rooted 4-taxa topologies; annotations: 1/3 and 2/3]

Symmetric labelled trees are more probable than asymmetric labelled trees

[Figure: the 15 labelled rooted topologies for taxa a, b, c and d, grouped into symmetric (total 1/3) and asymmetric (total 2/3) trees]

If we generate rooted 4-taxa trees with a coalescent model, a symmetric labelled topology has a higher probability (1/9) than an asymmetric labelled topology (1/18).

Asymmetric 4-taxa species trees can produce AGTs

[Figure: an asymmetric species tree for P1, P2, P3 and P4, shown with two sequence-tree topologies; annotations: 1/9 and 1/12]

Topology preference in real data

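[Table: one row per topology, in two panels; the topology drawings are not recoverable]

All (%)       | Gene (%)      | Exon (%)    | All (%)       | Gene (%)    | Exon (%)
20 (0.17)     | 8 (0.17)      | 2 (0.32)    | 4 (0.03)      | 1 (0.02)    | 0
9,148 (76.58) | 3,814 (78.85) | 487 (78.93) | 1,369 (11.46) | 504 (10.42) | 63 (10.21)
19 (0.16)     | 10 (0.21)     | 2 (0.32)    | 13 (0.11)     | 6 (0.12)    | 1 (0.16)
0             | 0             | 0           | 5 (0.04)      | 0           | 0
1 (0.01)      | 0             | 0           | 1,361 (11.39) | 492 (10.17) | 62 (10.05)
5 (0.04)      | 2 (0.04)      | 0           | 0             | 0           | 0
0             | 0             | 0           | 0             | 0           | 0
0             | 0             | 0           |               |             |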

AGTs in n-taxa species trees (n ≥ 5)

- The probability for each labelled tree is 1/90, 1/60, and 1/180, respectively (5-taxa case).
- Proof that every species tree topology with five or more species can give rise to AGTs.
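The labelled-topology probabilities quoted for four taxa (1/9 and 1/18) and for five taxa (1/90, 1/60 and 1/180) can be checked by exact enumeration. The sketch below is an illustration rather than the authors' code. It assumes that lineages are joined pairwise with all pairs equally likely, which is how a coalescent in a single population builds a topology. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import combinations


def topology_probabilities(leaves):
    """Exact probability of every labelled rooted topology when lineages
    are joined pairwise at random (all pairs equally likely)."""
    probs = defaultdict(Fraction)

    def join(lineages, p):
        if len(lineages) == 1:
            probs[lineages[0]] += p
            return
        pairs = list(combinations(range(len(lineages)), 2))
        for i, j in pairs:
            # An unordered pair of children identifies an internal node.
            merged = frozenset((lineages[i], lineages[j]))
            rest = [x for k, x in enumerate(lineages) if k not in (i, j)]
            join(rest + [merged], p / len(pairs))

    join(list(leaves), Fraction(1))
    return dict(probs)


def summarise(leaves):
    """Map each distinct probability to the number of topologies having it."""
    return Counter(topology_probabilities(leaves).values())


if __name__ == "__main__":
    for taxa in ("abcd", "abcde"):
        for prob, n_topologies in sorted(summarise(taxa).items()):
            print(len(taxa), "taxa:", n_topologies,
                  "topologies with probability", prob)
```

For four taxa the enumeration gives 3 symmetric topologies at 1/9 each and 12 asymmetric topologies at 1/18 each. For five taxa it gives 15 topologies at 1/90, 30 at 1/60 and 60 at 1/180. The differences arise because more balanced topologies are compatible with more orderings of the coalescent events.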

Species tree reconstruction by taking the consensus of many sequence trees can be positively misleading.

A fast method to reconstruct species trees in the presence of AGTs

A simple solution: rooted triples do not display the AGT phenomenon.

[Figure annotation: 1/3]

The idea of the Triple Consensus Method (TCM)
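The slides do not spell out the algorithm, so the following is only a minimal sketch of the general idea of voting on rooted triples, not the authors' TCM implementation. It assumes binary sequence trees given as nested 2-tuples of taxon names. For every set of three taxa it counts which rooted triple each sequence tree displays and keeps the most frequent one. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def leaf_set(tree):
    """Leaves of a tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def rooted_triples(tree):
    """Yield (cherry, outgroup) for every rooted triple displayed by tree."""
    if isinstance(tree, str):
        return
    left, right = tree
    for inside, outside in ((left, right), (right, left)):
        # Two taxa on one side of this node and one on the other side
        # form the triple "a,b | c" whose three-taxon MRCA is this node.
        for a, b in combinations(sorted(leaf_set(inside)), 2):
            for c in leaf_set(outside):
                yield frozenset((a, b)), c
        yield from rooted_triples(inside)


def majority_triples(trees):
    """For each set of three taxa, return the most frequent rooted triple."""
    votes = {}
    for tree in trees:
        for cherry, outgroup in rooted_triples(tree):
            key = cherry | {outgroup}
            votes.setdefault(key, Counter())[(cherry, outgroup)] += 1
    return {key: c.most_common(1)[0][0] for key, c in votes.items()}


if __name__ == "__main__":
    trees = [
        ((("H", "C"), "G"), "O"),
        ((("H", "C"), "G"), "O"),
        ((("C", "G"), "H"), "O"),
    ]
    for key, (cherry, outgroup) in majority_triples(trees).items():
        print(sorted(key), "->", sorted(cherry), "|", outgroup)
```

Assembling the winning triples into a single species tree is a separate step that this sketch leaves out, since the slides do not say how TCM performs it.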

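Taxon                    | L3 | L11 | L12 | L13A | L15 | L23 | L27 | L27a | L32 | L35
Anurida maritima         | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Archispirostreptus gigas | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Astrosclera willeyana    | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Barentsia elongata       | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Flustra foliacea         | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Littorina saxatilis      | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Lubomirskia baicalensis  | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Psoroptes ovis           | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Sipunculus nudus         | X  | X   | X   | X    | X   | X   | X   | X    | X   | X
Xenoturbella bocki       | X  | X   | X   | X    | X   | X   | X   | X    | X   | X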

Generate Consensus Tree

Simulation

[Figure: "Generate sequence tree"; panels (a) and (b)]

Simulation Results (perfect sequence trees)

20 taxa tree, 1000 replications per parameter combination

Simulation Results (reconstructed sequence trees (ML))

[Figure legend: ML sequence tree; true sequence tree]

20 taxa tree, 1000 replications per parameter combination

Application to biological data

[Figure: taxa in the data set] Anopheles g., C. briggsae, Ciona s., C. elegans, Apis m., Pan t., Macaca m., Bos t., Canis f., Saccharomyces s., Drosophila m., Monodelphis d., Mus m., Rattus n., Xenopus t., Drosophila p., Tetraodon n., Takifugu r., Danio r., Gallus g.

The resulting rooted species phylogeny based on 216 sequence trees

No difference from the MRe-tree, therefore no indication of AGTs

A more relevant dataset to come??

Porifera
Ecdysozoa
Deuterostomes
Annelids/Molluscs

Some conclusions and challenges

- Genealogy of the phenotype depends on the genealogy of the underlying genotype.
- Consideration of population genetic effects is necessary when branches get short or population sizes get large.
- A map of human genetic ancestry is required to understand human phenotypic evolution and to correctly interpret the fossil record.
- Ancestral polymorphisms and incomplete lineage sorting serve as an alternative explanation to homoplasy.
- Short branches can lead to 'un-intuitive' behaviour of evolution and can interfere in several ways with accurate species tree reconstruction.

Acknowledgements

Positional conservation in the LILR-region

Fossil Record on

Dating of Hominid Fossils:
Ardipithecus kadabba: ~5.5 MYBP
Orrorin tugenensis: ~5.8 MYBP
Sahelanthropus tchadensis: 6.5-7.5 MYBP