Assessing microsatellite linkage disequilibrium in wild, cultivated, and mapping populations of Theobroma cacao L. and its impact on association mapping.

Tree Genetics and Genomes

J. Conrad Stack1, Stefan Royaert2, Osman Gutiérrez3, Chifumi Nagai4, Ioná Santos Araújo Holanda5, Raymond Schnell1, Juan-Carlos Motamayor1*

1Mars, Incorporated, McLean, VA, USA
2Mars Center for Cocoa Science, CP55 Itajuípe, Bahia, Brazil, 45625-000
3USDA-ARS, Subtropical Horticulture Research Station, Miami, FL, USA
4Hawaii Agriculture Research Center, Kunia, HI, USA
5Universidade Federal Rural do Semi-Arido, Departamento de Ciências Vegetais, BR 110 - Km 47, Bairro Pres. Costa e Silva, Mossoró-RN, Brazil CEP 59.625-900

*corresponding author ([email protected])

Beagle parameters, comparison to PHASE, and sample partitioning

Usage

Beagle (v3.2.2) was run using the parameterization below:

java -Xmx1024m -jar beagle.jar unphased=infile.bgl markers=infile.markers out=outfile.out missing=? nsamples=100 niterations=500 seed=123456789

Again, a different seed value (seed=…) was randomly chosen and used in each analysis. The marker files contained physical rather than genetic distances. With our datasets, only the most likely haplotype phases for each individual were output. To help control for uncertainty in the data set, every Beagle analysis was repeated 100 times. Unlike PHASE, Beagle ran in minutes rather than days.
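The repeated runs with a fresh seed each time can be sketched as a small driver script. This is a minimal illustration, not the script used for the analyses: the function names, the output naming pattern (run001.out, ...), and the use of a master seed to draw the per-run seeds are assumptions. Only the Beagle arguments themselves are copied from the usage line above.

```python
import random
import subprocess


def beagle_command(infile, markers, out_prefix, seed, jar="beagle.jar"):
    """Build one Beagle command line; the arguments mirror the usage shown above."""
    return [
        "java", "-Xmx1024m", "-jar", jar,
        f"unphased={infile}",
        f"markers={markers}",
        f"out={out_prefix}",
        "missing=?",
        "nsamples=100",
        "niterations=500",
        f"seed={seed}",
    ]


def replicate_commands(infile, markers, n_runs=100, master_seed=0):
    """Return n_runs command lines, each with its own distinct random seed.

    The master seed and the run001.out naming scheme are hypothetical
    conveniences for reproducibility, not part of the original protocol.
    """
    rng = random.Random(master_seed)
    seeds = rng.sample(range(1, 2**31 - 1), n_runs)  # sampled without replacement
    return [
        beagle_command(infile, markers, f"run{idx:03d}.out", seed)
        for idx, seed in enumerate(seeds, start=1)
    ]


def run_all(commands):
    """Execute each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling run_all(replicate_commands("infile.bgl", "infile.markers")) would launch the 100 replicate runs one after another, provided java and beagle.jar are available in the working directory.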

Comparisons to PHASE

In addition to the PHASE analyses, a large number of analyses were carried out using Beagle (v3.3.2). The Beagle haplotyping algorithm does not make any explicit assumptions about the haploblock structure or linkage disequilibrium present in a particular data set. Instead, it allows the data to speak for themselves, expanding potential haploblocks according to an internal heuristic (cite Beagle). The authors of Beagle state that the program might not perform well on small datasets of SNPs (cite 2011 review paper), but its performance with highly polymorphic microsatellite markers was not discussed. Below we explore the performance of Beagle relative to PHASE (Figure 1).

Each data set described in this section was phased, and LD values were then calculated from the resulting haplotypes. The markers on each linkage group were processed independently, and the physical location of each marker was provided to each haplotyping program. For PHASE, the full data set comprising all structural (population) groups was broken into two subsets, with samples assigned to either subset at random. These results were compared with similar “unstructured” Beagle haplotype-based LD analyses. PHASE- and Beagle-originated LD values were also compared using the mapping population and Hawaii population data sets. For Beagle, all data sets were input into the program using the unphased argument, 100 haplotypes were sampled for each individual, and 500 iterations were run. Linkage group 5 is not represented in the comparison plots below because PHASE analyses of that linkage group consistently failed.
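To make the step from phased haplotypes to a pairwise LD value concrete, the sketch below computes a multiallelic D' for one pair of loci. This section does not state which LD statistic was used, so the choice of the frequency-weighted multiallelic D' is an assumption made purely for illustration, as are the function name and the input layout (one tuple of two alleles per phased chromosome).

```python
from collections import Counter


def multiallelic_d_prime(haplotypes):
    """Frequency-weighted multiallelic D' for one pair of loci.

    haplotypes: list of (allele_at_locus_A, allele_at_locus_B) tuples,
    one per phased chromosome. The statistic is an assumed stand-in for
    whatever LD measure the analyses actually used.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")

    freq_a = Counter(a for a, _ in haplotypes)   # allele counts at locus A
    freq_b = Counter(b for _, b in haplotypes)   # allele counts at locus B
    freq_ab = Counter(haplotypes)                # two-locus haplotype counts

    d_prime = 0.0
    for a, count_a in freq_a.items():
        p = count_a / n
        for b, count_b in freq_b.items():
            q = count_b / n
            d = freq_ab[(a, b)] / n - p * q      # disequilibrium for this allele pair
            if d < 0:
                d_max = min(p * q, (1 - p) * (1 - q))
            else:
                d_max = min(p * (1 - q), (1 - p) * q)
            if d_max > 0:                        # skip pairs at a monomorphic locus
                d_prime += p * q * abs(d) / d_max
    return d_prime
```

With this definition, a value near 0 indicates that alleles at the two loci are associated about as often as expected by chance, and a value near 1 indicates that the observed haplotypes are as strongly associated as the allele frequencies allow.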

For all three datasets, the pairwise LD values calculated from PHASE and Beagle haplotype inferences were remarkably similar. The LD decay plots computed from Beagle haplotypes were qualitatively consistent with those produced by PHASE (Figure 2).

Figure 1: PHASE and Beagle comparison plots for the full population data (A), the mapping population (B), and the Hawaii population (C). All LD values are from within-linkage group pairwise comparisons. All linkage groups are represented in each subplot. Grey lines represent rough confidence intervals generated by resampling haplotypes from PHASE output (N=1000) and from all separate Beagle runs (N=100).

[Figure 2 plot: twelve panels, each pooling all linkage groups: Amelonado, Iquitos, Curaray, Criollo, Contamana, Purus, Nanay, Maranon, Nacional, Guiana, Hybrid, MP01. x-axis: pairwise distance between loci (basepairs), tick labels 0 to 40; y-axis: linkage disequilibrium, 0.0 to 1.0.]

Figure 2. LD decay with pairwise physical distance between loci using Beagle haplotypes. All linkage groups are shown together in each subplot. This plot is directly comparable to a similar plot in the main text where PHASE haplotypes were used.

Sample partitioning

Pairwise LD values are compared for the full structured population data, which was phased under two different partitioning schemes. Again, all linkage groups were analyzed separately. In the first partitioning scheme (“unstructured”), the 778 samples were analyzed as a whole. In the second partitioning scheme (“structured”), samples were divided into groups based on their structural group assignment, and each group was analyzed on its own.
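The two partitioning schemes can be illustrated with a short helper that decides which samples are phased together. This is only a sketch: the function name, the dictionary layout, and the sample identifiers in the usage note are hypothetical, and the group labels are taken from the panel titles of Figure 2.

```python
from collections import defaultdict


def partition_samples(assignments, scheme):
    """Split samples into the sets that are phased together.

    assignments: dict mapping sample ID -> structural group label.
    scheme: "unstructured" (one set with every sample) or
            "structured" (one set per structural group).
    Returns a dict mapping partition name -> sorted list of sample IDs.
    """
    if scheme == "unstructured":
        return {"all": sorted(assignments)}
    if scheme == "structured":
        groups = defaultdict(list)
        for sample, group in assignments.items():
            groups[group].append(sample)
        return {group: sorted(samples) for group, samples in groups.items()}
    raise ValueError(f"unknown partitioning scheme: {scheme!r}")
```

For example, partition_samples({"s1": "Amelonado", "s2": "Criollo", "s3": "Amelonado"}, "structured") yields one partition per group, each of which would be written to its own Beagle input file and phased separately.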

There were substantial differences between the structured and unstructured Beagle LD values, as well as a substantial degree of variation (Figure 3). Again, this is likely due to genotype imputation as well as Beagle performing less well on smaller data sets (cite Browning 2008 review).

[Figure 3 plot: x-axis: LD values, BEAGLE (unstructured), 0.0 to 1.0; y-axis: LD values, BEAGLE (structured), 0.0 to 1.0.]

Figure 3. Comparison of LD values for unstructured and structured Beagle haplotypes.