Marker development for the study at micro- and macro-evolutionary time scales in neotropical palms
Marylaure de la Harpe, Oriane Loiseau, Jaqueline Hess, Nicolas Salamin, Christian Lexer, Margot Paris
(picture: Oriane Loiseau)
POPCORN, a multidisciplinary project
Using Population Genomics, Phylogenetics and Community Ecology to understand Radiations in Neotropical mountains
Population genomics, Phylogeny, Community ecology
Ideal markers
• Many markers spread along the genome
• Low cost, in order to genotype thousands of samples
• Evolution rate suitable for both macro- and micro-evolution studies
[Figure: mutation rate and divergence against study level: kingdom, order, family, genus, species, populations]
• Long sequences (>600 bp) for phylogeny and selection tests
• Include candidate genes for adaptation and “neutral” non-genic markers
• Include markers already used for phylogeny in palms
• Can be applied to low quantity and quality DNA from herbarium specimens
Target capture sequencing
Very flexible as we can choose the targets:
- number
- nature (genes, non-genic regions)
- candidate genes/regions
- location in genome
- length (in bp)
- …
http://www.arborbiosci.com/products/custom-target-capture/
Oil palm genome: very useful
• Oil palm genome is the closest reference genome
Too divergent for proper capture design? Especially because we are not interested in conserved regions
Building Geonoma reference sequences
• Whole genome sequencing of the species G. undata (27x coverage, Illumina PE150bp)
• Reference-assisted reconstruction of the G. undata genome
94% of the genes recovered (UTRs + exons + introns)
Low recovery of the inter-genic regions (repeats, too divergent to the oil palm, …)
Criteria for the selection of 4’051 genes
• Broad range of rates of molecular evolution
• Divergence to oil palm used as proxy for rate of molecular evolution
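As a rough illustration of a divergence measure of this kind, the sketch below computes the proportion of differing sites (p-distance) between a gene and its oil palm orthologue once the two sequences are aligned. The slides do not say how divergence was actually computed, so the p-distance itself, the function name `p_distance`, and the toy sequences are assumptions made for illustration only.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites among aligned, unambiguous positions.

    Assumption: both sequences come from the same pairwise alignment, so
    they have equal length. Gaps and ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Only count sites where both sequences carry an unambiguous base.
        if a in valid and b in valid:
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites in the alignment")
    return mismatches / compared


# Toy example (invented sequences): one mismatch over ten comparable sites.
example_divergence = p_distance("ACGTACGTAC", "ACGTACGTAT")
```

A per-gene value computed this way could then be binned into a histogram to pick a range of rates, which is the kind of plot the slide shows.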
[Figure: histogram of divergence to oil palm (x-axis, 0.00 to 0.30) against number of genes (y-axis), for all genes and selected genes. Highly conserved genes: not informative at our scale. Selected genes: suitable range of evolution rates. Highly variable genes: mostly paralogues, pseudo or partial genes, …]
• Mostly single copy genes (using coverage and He information)
• Average size of 1’300 bp
• Interesting functions: pathogenesis; flowering; response to UV, light; floral scents; …
• 8 genes previously used for phylogeny + 141 Heyduk et al. (2015) genes
• Even distribution in the genome (around 160 kb on average between 2 target genes)
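The criteria above can be read as a series of filters applied to each candidate gene. The sketch below is one hypothetical way to express them. The `Gene` record, the `select_genes` function, and every threshold except the 600 bp minimum length (taken from the "Ideal markers" slide) are assumptions made for illustration, not the values used for the real kit. It also assumes that "He" refers to heterozygosity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    name: str
    divergence: float      # divergence to oil palm (proxy for evolutionary rate)
    rel_coverage: float    # read depth relative to the genome-wide median
    heterozygosity: float  # fraction of heterozygous sites in the gene
    length_bp: int


def select_genes(genes: List[Gene],
                 min_div: float = 0.01,     # hypothetical lower cut-off
                 max_div: float = 0.15,     # hypothetical upper cut-off
                 max_rel_cov: float = 1.5,  # hypothetical: higher suggests extra copies
                 max_het: float = 0.02,     # hypothetical: higher suggests collapsed paralogues
                 min_len: int = 600) -> List[Gene]:
    """Keep genes that are informative, apparently single copy, and long enough."""
    kept = []
    for g in genes:
        informative = min_div <= g.divergence <= max_div
        single_copy = g.rel_coverage <= max_rel_cov and g.heterozygosity <= max_het
        long_enough = g.length_bp >= min_len
        if informative and single_copy and long_enough:
            kept.append(g)
    return kept


# Invented candidates, one per failure mode.
candidates = [
    Gene("geneA", 0.05, 1.0, 0.004, 1300),   # passes every filter
    Gene("geneB", 0.001, 1.1, 0.003, 1500),  # too conserved
    Gene("geneC", 0.08, 2.4, 0.030, 1200),   # looks multi-copy
    Gene("geneD", 0.06, 0.9, 0.005, 400),    # too short
]
selected = [g.name for g in select_genes(candidates)]
```

Criteria such as gene function, inclusion of previously used markers, and even spacing along the genome are not covered by this sketch.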
Additional 133 non-genic regions
• 5 to 15 per chromosome
• 800 bp length on average
• As far as possible from genes
4’770’883 bp in total
Sampling for kit evaluation
- 5 “phylogenetic” samples from 3 palm subfamilies, up to 83 Miy divergence to G. undata
- 5 “G. undata intraspecific” samples from 5 different populations
- 2 x 5 “G. undata population” samples
[Figure: tree of the samples. Geonoma undata: MK891 population samples (MK891_A to MK891_E), IO438 population samples (IO438_A to IO438_E), Quind_A, IO260_A, IO026_A. Other species: Asterogyne guianensis, Cocos nucifera, Ceroxylon alpinum, Licuala merguensis. Scale bar: 0.0070]
Sampling for kit evaluation: phylogeny samples
3 palm subfamilies represented
Up to 80 Miy divergence to G. undata
[Pictures: Licuala merguensis, Ceroxylon alpinum, Cocos nucifera, Asterogyne guianensis, Geonoma undata; source: http://www.palmweb.org]
Protocol
DNA extraction (250-500 ng of DNA used)
“Home-made + KAPA” library preparation (dual index sequencing)
Quantification and pooling
Mybait target capture + PCR 11 cycles
Illumina sequencing PE 2x150bp (2 million PE reads per sample)
Total cost per sample = 80 $
High reproducibility
The whole procedure (library preparation + target capture + sequencing) was done in duplicate for each sample to test for reproducibility
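One common way to summarise agreement between two replicates is a correlation coefficient computed over per-bait coverage. The sketch below uses the Pearson coefficient. The slides do not state which coefficient was used, so this choice, the function name `pearson`, and the toy coverage values are assumptions for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of per-bait coverage."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant coverage")
    return cov / math.sqrt(var_x * var_y)


# Invented coverage values for five baits in two replicates of one sample.
replicate_1 = [120.0, 340.0, 80.0, 610.0, 255.0]
replicate_2 = [131.0, 322.0, 95.0, 580.0, 270.0]
r = pearson(replicate_1, replicate_2)
```

In practice each list would hold one value per bait in the kit, computed separately for each sample.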
[Figure: coverage per bait, replicate 1 against replicate 2, for Ceroxylon alpinum (Ceroxyloideae) and Geonoma undata (Arecoideae)]
Reproducibility is high for all samples, for all 3 subfamilies (correlation coefficient range: 0.94 – 0.98)
High efficiency of the method
[Figure: global efficiency (y-axis, 0 to 100) against sequencing effort (x-axis, 0 to 2500000). Panels: A. All baits; B. Only non-genic baits. Species: Geonoma undata, Asterogyne guianensis, Cocos nucifera, Ceroxylon alpinum, Licuala merguensis]
Factors influencing bait efficiency
[Figure: bait efficiency against bait density, GC content (%), divergence time to G. undata (Myr), and rate of molecular evolution]
SNP detection
Efficiency of the bait set for phylogeny
RAxML tree with concatenated data
High branch support, even within species
[Figure: RAxML tree. Geonoma undata: MK891 population samples (MK891_A to MK891_E), IO438 population samples (IO438_A to IO438_E), Quind_A, IO260_A, IO026_A. Other species: Asterogyne guianensis, Cocos nucifera, Ceroxylon alpinum, Licuala merguensis. Branch support values shown: 100, except one at 96. Scale bar: 0.0070]
Efficiency of the bait set for population genomics
[Figure: A. Scatter plot of the G. undata samples, labelled by population (MK891, IO438, Quind, IO260, IO026). B. Cross-validation error against K (1 to 7). C. Ancestry proportions (K=5) for Quind_A, IO260_A, IO026_A, IO438_A, IO438_B, IO438_E, IO438_C, IO438_D, MK891_A, MK891_B, MK891_E, MK891_C, MK891_D]
Ongoing work
Different sets of bait lists:
Full size kit
Mybait kit 3
- 60’000 baits = popcorn kit
- 57’000 baits = combined popcorn kit + Heyduk bait set (2015)
Other companies
- 57’000 baits = popcorn kit
- 54’000 baits = combined popcorn kit + Heyduk bait set (2015)
Mybait kit 1
- 20’000 baits = reduced phylogeny informative kit
Phylogenetic informativeness
- RAxML trees, branch support
- PhyDesign (Heyduk et al. 2015)
Thanks for your attention
Christian Lexer
Marylaure de la Harpe
POPCORN group