Determining the Probability of Hemiplasy in the Presence of Incomplete Lineage Sorting and Introgression Supplementary Materials and Methods
Mark S. Hibbins*, Matthew J.S. Gibson*, and Matthew W. Hahn*†
Department of Biology* and Department of Computer Science†, Indiana University, Bloomington

1 Mutation probabilities on genealogies

Each of the twelve possible genealogies under our parent tree model has a set of five branch lengths along which mutations can occur. $l_1$, $l_2$, and $l_3$ denote the tip branches leading to species A, B, and C respectively; $l_4$ denotes the internal branch, and $l_5$ denotes the ancestral branch. As described in the supplement of Guerrero & Hahn (2018), the mutation probability on each of these branches has the general form $\int (1 - e^{-\mu x}) f(x)\,dx$, where $\mu$ is the mutation rate per 2N generations, $x$ is the random variable for the branch length, and $f(x)$ is the probability density function for $x$.

We begin with the mutation probabilities for parent tree 1, which are given in the supplement of Guerrero & Hahn (2018) and are rewritten here in our notation. In the following, parent tree 1 is denoted "pt1". Since many of the genealogies share the same branch lengths, the mutation probabilities on their branches can be written with general expressions. We first consider the genealogies $AB2_1$, $BC_1$, and $AC_1$, which are all produced via incomplete lineage sorting in parent tree 1 and share the following set of mutation probabilities:

$$ \nu_1[\text{ILS}, \text{pt1}] = \frac{1}{L} \int_0^{t_3 - t_2} \left(1 - e^{-\mu(t_1 + (t_2 - t_1) + x)}\right) \frac{3}{2}\left(e^{-x} - e^{-3x}\right) dx \tag{1} $$

$$ \nu_2[\text{ILS}, \text{pt1}] = \frac{1}{L} \int_0^{t_3 - t_2} \left(1 - e^{-\mu(t_1 + (t_2 - t_1) + x)}\right) 3e^{-3x} \left(1 - e^{-((t_3 - t_2) - x)}\right) dx \tag{2} $$

$$ \nu_4[\text{ILS}, \text{pt1}] = \frac{1}{L} \int_0^{t_3 - t_2} 3e^{-3y} \left( \int_0^{(t_3 - t_2) - y} e^{-x} \left(1 - e^{-\mu x}\right) dx \right) dy \tag{3} $$

$$ \nu_5[\text{ILS}, \text{pt1}] = \frac{1}{L} \int_0^{t_3 - t_2} \left(1 - e^{-\mu((t_3 - t_2) - x)}\right) 3e^{-3x}\, dx \tag{4} $$

In each of the above, $L = 1 + \frac{1}{2} e^{-3(t_3 - t_2)} - \frac{3}{2} e^{-(t_3 - t_2)}$ is the probability of coalescence of A, B, and C in their ancestral population.
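As a rough numerical illustration, equations 1 through 4 can be evaluated by simple quadrature. The sketch below uses only the Python standard library and a composite Simpson's rule. The function names (`simpson`, `coalescence_prob`, `ils_mutation_probs`), the variable `mu` for the mutation parameter in the exponent, and the example parameter values are assumptions made for illustration, not taken from the original analysis.

```python
import math


def simpson(f, a, b, n=200):
    """Composite Simpson's rule on [a, b] using an even number n of subintervals."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def coalescence_prob(T):
    """L: probability that three lineages coalesce within time T (units of 2N generations)."""
    return 1 + 0.5 * math.exp(-3 * T) - 1.5 * math.exp(-T)


def ils_mutation_probs(mu, t2, t3):
    """Equations 1-4 for the ILS genealogies of parent tree 1.

    The exponent t1 + (t2 - t1) + x reduces to t2 + x, so t1 is not needed here.
    """
    T = t3 - t2  # duration of the ancestral population of all three taxa
    L = coalescence_prob(T)

    def mut(length):
        # probability of at least one mutation on a branch of the given length
        return 1 - math.exp(-mu * length)

    nu1 = simpson(lambda x: mut(t2 + x) * 1.5 * (math.exp(-x) - math.exp(-3 * x)), 0, T) / L
    nu2 = simpson(lambda x: mut(t2 + x) * 3 * math.exp(-3 * x) * (1 - math.exp(-(T - x))), 0, T) / L

    def inner(y):
        # inner integral of equation 3 over the internal branch length x
        return simpson(lambda x: math.exp(-x) * mut(x), 0, T - y)

    nu4 = simpson(lambda y: 3 * math.exp(-3 * y) * inner(y), 0, T) / L
    nu5 = simpson(lambda x: mut(T - x) * 3 * math.exp(-3 * x), 0, T) / L
    return {"nu1": nu1, "nu2": nu2, "nu4": nu4, "nu5": nu5}


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the functions.
    for name, value in ils_mutation_probs(mu=0.01, t2=1.0, t3=3.0).items():
        print(name, value)
```

The same quadrature also gives a quick consistency check: integrating the density $\frac{3}{2}(e^{-x} - e^{-3x})$ from equation 1 over the ancestral interval should reproduce $L$, which is why $L$ appears as the normalizing constant.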
$t_3$ denotes the total height of the tree, i.e. the time at the base of the tree. The difference between $t_3$ and $t_2$ determines the duration of the ancestral population of all three taxa, before speciation occurs. Equations 1 through 4 each represent the mutation probabilities for multiple branches, which are as follows:

$$ \nu_1[\text{ILS}, \text{pt1}] = \nu(l_3, AB2_1) = \nu(l_1, BC_1) = \nu(l_2, AC_1) \tag{5} $$

$$ \nu_2[\text{ILS}, \text{pt1}] = \nu(l_1, AB2_1) = \nu(l_2, AB2_1) = \nu(l_2, BC_1) = \nu(l_3, BC_1) = \nu(l_1, AC_1) = \nu(l_3, AC_1) \tag{6} $$

$$ \nu_4[\text{ILS}, \text{pt1}] = \nu(l_4, AB2_1) = \nu(l_4, BC_1) = \nu(l_4, AC_1) \tag{7} $$

$$ \nu_5[\text{ILS}, \text{pt1}] = \nu(l_5, AB2_1) = \nu(l_5, BC_1) = \nu(l_5, AC_1) \tag{8} $$

The gene tree produced by lineage sorting in parent tree 1, $AB1_1$, has a different set of mutation probabilities, since the branches have different expected lengths. These are:

$$ \nu(l_1, AB1_1) = \nu(l_2, AB1_1) = \frac{1}{1 - e^{-(t_2 - t_1)}} \int_0^{t_2 - t_1} \left(1 - e^{-\mu(t_1 + x)}\right) e^{-x}\, dx \tag{9} $$

$$ \nu(l_3, AB1_1) = \frac{1}{1 - e^{-(t_3 - t_2)}} \int_0^{t_3 - t_2} \left(1 - e^{-\mu(t_1 + (t_2 - t_1) + x)}\right) e^{-x}\, dx \tag{10} $$

$$ \nu(l_4, AB1_1) = \int_0^{t_2 - t_1} \frac{e^{-y}}{1 - e^{-(t_2 - t_1)}} \left( \int_0^{t_3 - t_2} \frac{e^{-x}}{1 - e^{-(t_3 - t_2)}} \left(1 - e^{-\mu((t_2 - t_1) - y + x)}\right) dx \right) dy \tag{11} $$

$$ \nu(l_5, AB1_1) = \frac{1}{1 - e^{-(t_3 - t_2)}} \int_0^{t_3 - t_2} \left(1 - e^{-\mu((t_3 - t_2) - x)}\right) e^{-x}\, dx \tag{12} $$

Now we consider introgression, starting with parent tree 2. Many of the mutation probabilities are symmetrical with parent tree 1 and therefore remain the same, and the remainder have the same general form with different parameters. For the ILS genealogies $BC2_2$, $AB_2$, and $AC_2$, equations 1 and 2 have the time of A-B speciation ($t_1$) replaced with the timing of B-C introgression ($t_m$).
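This gives:

$$ \nu_1[\text{ILS}, \text{pt2}] = \frac{1}{L} \int_0^{t_3 - t_2} \left(1 - e^{-\mu(t_m + (t_2 - t_m) + x)}\right) \frac{3}{2}\left(e^{-x} - e^{-3x}\right) dx \tag{13} $$

$$ \nu_2[\text{ILS}, \text{pt2}] = \frac{1}{L} \int_0^{t_3 - t_2} \left(1 - e^{-\mu(t_m + (t_2 - t_m) + x)}\right) 3e^{-3x} \left(1 - e^{-((t_3 - t_2) - x)}\right) dx \tag{14} $$

$$ \nu_4[\text{ILS}, \text{pt2}] = \nu_4[\text{ILS}, \text{pt1}] \tag{15} $$

$$ \nu_5[\text{ILS}, \text{pt2}] = \nu_5[\text{ILS}, \text{pt1}] \tag{16} $$

These correspond to the following branch mutation probabilities:

$$ \nu_1[\text{ILS}, \text{pt2}] = \nu(l_1, BC2_2) = \nu(l_3, AB_2) = \nu(l_2, AC_2) \tag{17} $$

$$ \nu_2[\text{ILS}, \text{pt2}] = \nu(l_2, BC2_2) = \nu(l_3, BC2_2) = \nu(l_1, AB_2) = \nu(l_2, AB_2) = \nu(l_1, AC_2) = \nu(l_3, AC_2) \tag{18} $$

$$ \nu_4[\text{ILS}, \text{pt2}] = \nu(l_4, BC2_2) = \nu(l_4, AB_2) = \nu(l_4, AC_2) \tag{19} $$

$$ \nu_5[\text{ILS}, \text{pt2}] = \nu(l_5, BC2_2) = \nu(l_5, AB_2) = \nu(l_5, AC_2) \tag{20} $$

For the genealogy produced by lineage sorting in parent tree 2, $BC1_2$, we have:

$$ \nu(l_2, BC1_2) = \nu(l_3, BC1_2) = \frac{1}{1 - e^{-(t_2 - t_m)}} \int_0^{t_2 - t_m} \left(1 - e^{-\mu(t_m + x)}\right) e^{-x}\, dx \tag{21} $$

$$ \nu(l_1, BC1_2) = \frac{1}{1 - e^{-(t_3 - t_2)}} \int_0^{t_3 - t_2} \left(1 - e^{-\mu(t_m + (t_2 - t_m) + x)}\right) e^{-x}\, dx \tag{22} $$

$$ \nu(l_4, BC1_2) = \int_0^{t_2 - t_m} \frac{e^{-y}}{1 - e^{-(t_2 - t_m)}} \left( \int_0^{t_3 - t_2} \frac{e^{-x}}{1 - e^{-(t_3 - t_2)}} \left(1 - e^{-\mu((t_2 - t_m) - y + x)}\right) dx \right) dy \tag{23} $$

$$ \nu(l_5, BC1_2) = \nu(l_5, AB1_1) \tag{24} $$

Finally, we consider parent tree 3. The mutation probabilities have the same formulation as for parent tree 2, with two key changes: since parent tree 3 is shorter (Figure 2 of main text), $t_2$ is replaced by $t_1$, and the same replacement applies to the value of $L$, which we denote for parent tree 3 as $L_3 = 1 + \frac{1}{2} e^{-3(t_3 - t_1)} - \frac{3}{2} e^{-(t_3 - t_1)}$.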
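For the ILS genealogies $BC2_3$, $AB_3$, and $AC_3$, this gives:

$$ \nu_1[\text{ILS}, \text{pt3}] = \frac{1}{L_3} \int_0^{t_3 - t_1} \left(1 - e^{-\mu(t_m + (t_1 - t_m) + x)}\right) \frac{3}{2}\left(e^{-x} - e^{-3x}\right) dx \tag{25} $$

$$ \nu_2[\text{ILS}, \text{pt3}] = \frac{1}{L_3} \int_0^{t_3 - t_1} \left(1 - e^{-\mu(t_m + (t_1 - t_m) + x)}\right) 3e^{-3x} \left(1 - e^{-((t_3 - t_1) - x)}\right) dx \tag{26} $$

$$ \nu_4[\text{ILS}, \text{pt3}] = \frac{1}{L_3} \int_0^{t_3 - t_1} 3e^{-3y} \left( \int_0^{(t_3 - t_1) - y} e^{-x} \left(1 - e^{-\mu x}\right) dx \right) dy \tag{27} $$

$$ \nu_5[\text{ILS}, \text{pt3}] = \frac{1}{L_3} \int_0^{t_3 - t_1} \left(1 - e^{-\mu((t_3 - t_1) - x)}\right) 3e^{-3x}\, dx \tag{28} $$

where:

$$ \nu_1[\text{ILS}, \text{pt3}] = \nu(l_1, BC2_3) = \nu(l_3, AB_3) = \nu(l_2, AC_3) \tag{29} $$

$$ \nu_2[\text{ILS}, \text{pt3}] = \nu(l_2, BC2_3) = \nu(l_3, BC2_3) = \nu(l_1, AB_3) = \nu(l_2, AB_3) = \nu(l_1, AC_3) = \nu(l_3, AC_3) \tag{30} $$

$$ \nu_4[\text{ILS}, \text{pt3}] = \nu(l_4, BC2_3) = \nu(l_4, AB_3) = \nu(l_4, AC_3) \tag{31} $$

$$ \nu_5[\text{ILS}, \text{pt3}] = \nu(l_5, BC2_3) = \nu(l_5, AB_3) = \nu(l_5, AC_3) \tag{32} $$

Finally, for the genealogy $BC1_3$, the mutation probabilities are as follows:

$$ \nu(l_2, BC1_3) = \nu(l_3, BC1_3) = \frac{1}{1 - e^{-(t_1 - t_m)}} \int_0^{t_1 - t_m} \left(1 - e^{-\mu(t_m + x)}\right) e^{-x}\, dx \tag{33} $$

$$ \nu(l_1, BC1_3) = \frac{1}{1 - e^{-(t_3 - t_1)}} \int_0^{t_3 - t_1} \left(1 - e^{-\mu(t_m + (t_1 - t_m) + x)}\right) e^{-x}\, dx \tag{34} $$

$$ \nu(l_4, BC1_3) = \int_0^{t_1 - t_m} \frac{e^{-y}}{1 - e^{-(t_1 - t_m)}} \left( \int_0^{t_3 - t_1} \frac{e^{-x}}{1 - e^{-(t_3 - t_1)}} \left(1 - e^{-\mu((t_1 - t_m) - y + x)}\right) dx \right) dy \tag{35} $$

$$ \nu(l_5, BC1_3) = \frac{1}{1 - e^{-(t_3 - t_1)}} \int_0^{t_3 - t_1} \left(1 - e^{-\mu((t_3 - t_1) - x)}\right) e^{-x}\, dx \tag{36} $$

2 When does introgression make hemiplasy more likely than ILS alone?

The probability of hemiplasy with $C \to B$ introgression is

$$ P_e = (1 - \delta) P_e[BC_1] + \delta \left( P_e[BC1_2] + P_e[BC2_2] \right) \tag{37} $$

From this, it can be seen that introgression makes hemiplasy more likely than ILS alone when:

$$ P_e[BC1_2] + P_e[BC2_2] > P_e[BC_1] \tag{38} $$

When is this true? Substituting the relevant expressions from the main text gives:

$$ \left(1 - e^{-(t_2 - t_m)}\right) \nu(l_4, BC1_2) \prod_{i \neq 4} \left(1 - \nu(l_i, BC1_2)\right) + \left(\tfrac{1}{3} e^{-(t_2 - t_m)}\right) \nu(l_4, BC2_2) \prod_{i \neq 4} \left(1 - \nu(l_i, BC2_2)\right) > \left(\tfrac{1}{3} e^{-(t_2 - t_1)}\right) \nu(l_4, BC_1) \prod_{i \neq 4} \left(1 - \nu(l_i, BC_1)\right) \tag{39} $$

Rearranging gives:

$$ \left(1 - e^{-(t_2 - t_m)}\right) \nu(l_4, BC1_2) \prod_{i \neq 4} \left(1 - \nu(l_i, BC1_2)\right) > \left(\tfrac{1}{3} e^{-(t_2 - t_1)}\right) \nu(l_4, BC_1) \prod_{i \neq 4} \left(1 - \nu(l_i, BC_1)\right) - \left(\tfrac{1}{3} e^{-(t_2 - t_m)}\right) \nu(l_4, BC2_2) \prod_{i \neq 4} \left(1 - \nu(l_i, BC2_2)\right) \tag{40} $$

The mutation probabilities on the right side of the inequality are equal, since they are on the same topology with the same branch lengths.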
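One way to see this equality is to compare the exponents in equations 1 and 13 (and likewise equations 2 and 14), where the first split time cancels. Writing $\nu$ for the mutation probabilities, a brief check of this step is:

$$ t_1 + (t_2 - t_1) + x \;=\; t_2 + x \;=\; t_m + (t_2 - t_m) + x $$

so that $\nu_1[\text{ILS}, \text{pt2}] = \nu_1[\text{ILS}, \text{pt1}]$ and $\nu_2[\text{ILS}, \text{pt2}] = \nu_2[\text{ILS}, \text{pt1}]$. Combined with equations 15 and 16 and the branch assignments in equations 5 through 8 and 17 through 20, this gives $\nu(l_i, BC2_2) = \nu(l_i, BC_1)$ for each branch $i = 1, \dots, 5$.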
Therefore, equation 40 can be simplified:

$$ \left(1 - e^{-(t_2 - t_m)}\right) \nu(l_4, BC1_2) \prod_{i \neq 4} \left(1 - \nu(l_i, BC1_2)\right) > \nu(l_4, BC_1) \prod_{i \neq 4} \left(1 - \nu(l_i, BC_1)\right) \left( \tfrac{1}{3} e^{-(t_2 - t_1)} - \tfrac{1}{3} e^{-(t_2 - t_m)} \right) \tag{41} $$

Let us assume a hybrid speciation scenario where $t_1 = t_m$. This is the most conservative introgression scenario, since Figure 4 in the main text shows that more recent introgression makes hemiplasy more likely. In this case, the right side of the inequality simplifies to 0, leaving

$$ \left(1 - e^{-(t_2 - t_m)}\right) \nu(l_4, BC1_2) \prod_{i \neq 4} \left(1 - \nu(l_i, BC1_2)\right) > 0 \tag{42} $$

This holds whenever $t_2 > t_m$, which is true by definition in this model. Therefore, introgression always makes hemiplasy more likely than ILS alone.

3 Supplementary Figures and Tables

[Figure: panels A) AB1, B) AB2, C) BC, and D) AC, each with times t1 and t2 marked; panels E) BC1, F) BC2, G) AB, and H) AC, each with times tm and t2 marked; tips are labelled A, B, and C.]

Supplementary Figure 1: Each parent tree in our model generates four gene trees: one generated from lineage sorting (panels A and E), and three equally likely trees generated from incomplete lineage sorting (panels B-D and F-H).