Domestication and dispersal of African rice (Oryza glaberrima Steud.): from West Africa to the Americas

MSc Biology Thesis

Margret Veltman

Front cover: Jola women transplant the rice seedlings. Adapted from: Linares, O. (2002). African rice (Oryza glaberrima): History and future potential. Proceedings of the National Academy of Sciences of the United States of America, 99(25), 16360–16365.

Domestication and dispersal of African rice (Oryza glaberrima Steud.): from West Africa to the Americas

Margret Veltman 910917-865130 MSc Biology Thesis BIS-80436 October 2017

Supervisors:
New York University
Center for Genomics and Systems Biology
Dr. Jae Young Choi
Prof. dr. Michael Purugganan

Examiners:
Wageningen University & Research
Biosystematics group
Prof. dr. Eric Schranz
Prof. dr. Tinde van Andel

“So this is the way rice came to [America], through the Africans, who smuggled the seeds in their hair.”

– ‘With Grains in Her Hair’ (Carney, 2004)

ACKNOWLEDGEMENTS

This research started out as a story of the transatlantic journey of rice and people. Several hundred years after the introduction of African crops to the American continent, I embarked on a transatlantic journey myself, to tell a part of that story. Now that part is written and it is my turn to acknowledge the people who have helped me complete it. First of all, my gratitude goes to the Purugganan lab in New York for hosting me as a visiting scholar and the Biosystematics group for allowing me to start and finish my research in Wageningen. A special word of thanks is due to Tinde and Rachel: thank you for taking me on board in the early stage of the project, encouraging me to develop my own thoughts and for introducing me to the fascinating world of rice. I would specifically like to thank Michael and Eric for facilitating this research and providing me with intellectual guidance along the way. I owe you a debt for solving many bureaucratic nightmares and academic struggles; thanks for being helpful in every way possible! Much gratitude goes to Jae and Jonathan, for supporting me and helping me to deal with daily bioinformatics questions. I would not have learned as much or come as far as I did without your supervision and stimulating discussions. Katie, Niels, Zoé, Zoë, Zoe and everyone else in the lab: thanks for being amazing colleagues. I enjoyed the many lunch breaks, coffees, dinners, drinks and more drinks. Without any of you, my time in New York would definitely have been much less memorable! I would also like to thank my friends from the International House and elsewhere in the city, for providing a home away from home and the necessary distractions after work – you made me feel happy to be there and sad to leave. 
Thank you Annette, Anja, Anneke, Jill, Andreas, Maria and Hamilton for coming to visit and sharing a bit of my life; thank you to many more for having the intention to visit; and thank you, Peter, for loving me so much that you came to visit twice. On the home front, I am grateful to my family and friends for keeping in touch and being there from a distance. You guys gave me comfort and company when I needed it most. I would also like to thank my old and new housemates in California and the Netherlands: it is always great coming home, whether that is on the West Coast of America or in Europe. Lastly, a word of thanks goes to my good fellow biologists: you keep me sharp and motivated! Thanks specifically to the Genetics group in Wageningen, for adopting me and making the student island a great place to be. Thanks also to the Illuminacie, a.k.a. ‘Ioioijojiinluniuinuijijju’: I hope we can keep organising many more reunions to catch up and talk about biology. And finally, thanks to my favourite biologist of all: dysfunctional mail delivery, failing technology and an ocean could not keep us apart. Thank you for supporting me when I left, being patient when I was away and giving me a soft landing when I came back home. I love you.


ABSTRACT

Asian rice (Oryza sativa) and African rice (Oryza glaberrima) are the only two species of a large genus known to be cultivated as crops. Together, they serve as a staple food for the majority of our world’s growing population. Whereas Asian rice was domesticated from O. rufipogon around 9000 years ago, African rice was domesticated independently in West Africa from a different wild progenitor and thus poses an interesting example of parallel evolution. Its exact origins, however, are still contested. Recent genome-wide studies have supported either a centric or a non-centric origin of O. glaberrima. Here we review the evidence for both scenarios through a critical reassessment of 206 publicly available whole genome sequences of domesticated and wild accessions of African rice. While genetic diversity analyses support a bottleneck caused by domestication, signatures of recent and strong positive selection do not unequivocally point to candidate domestication genes, suggesting that domestication of African rice may have proceeded differently from that of Asian rice – either through selection on different alleles, or through different modes of selection. The possibility that population subdivision could account for this pattern was assessed by conducting a structure analysis, which revealed five genetic clusters localising to different geographic regions. Phylogenetic relationships of these clusters with four wild populations support a centre of origin along the Niger River, followed by diversification along the Atlantic coast. Analysis of multiple domestication genes furthermore demonstrates the presence of ancestral haplotypes confined to the southwest coastal population, suggesting that at least one of several key domestication genes might have originated there.
These findings shed new light on an old controversy concerning the process of domestication in Africa, in which two models have traditionally been competing, namely the rapid transition model proposed by Portères (1962) and the protracted transition model proposed by Harlan et al. (1976). Our data provide evidence for both: they support a centre of origin in the proposed primary domestication centre of the rapid transition model, and they highlight the possibility of multiple origins consistent with the protracted transition model, including a separate centre of domestication activity in the Guinea Highlands. In addition, this study demonstrates the genetic similarity of a natural landrace from Suriname to a previously collected Surinamese sample and confirms their relatedness to accessions from the tropical forest belt along the coast of West Africa. More sampling of O. glaberrima is needed on both continents to elucidate the origins of rice in West Africa and in the Americas. We anticipate that genome-wide analyses of additional collections from across the species range will help to answer the questions of where African rice came from, how it adapted to its different environments and how it can help to sustain a food secure future.


CONTENTS

Introduction
  Background
  Problem statement
  Objectives
  Approach
Results
  Call sets
  Genetic diversity
  Evidence of positive selection
  Population structure
  Phylogenetic relationships
  African origins of rice in Suriname
Discussion
  Data preparation
  Demography and adaptation
  Population structure
  Taxonomic implications
  Domestication origin
  Transatlantic migration
Conclusion
  Future directions
Methods
  Data preparation
  Summary statistics
  Signatures of selection
  Population structure
  Phylogenetic analyses
  Biogeographic analyses
References
Supplementary material


INTRODUCTION

Background

History and relevance

Rice is the world’s most important cereal crop. A staple food for more than half of the world’s seven billion people, it is of crucial importance in providing food security for an exponentially growing population. Like other cereals, rice has been domesticated by humans independently several times. Yet, unlike in other species, these domestication events have different origins – one in Asia, and one in Africa. Although the exact history of rice in Asia is still disputed, it is clear that Asian rice (Oryza sativa) was domesticated from a single wild species (Oryza rufipogon) approximately 9000 years ago (Molina et al., 2011). In contrast, African farmers domesticated rice from another progenitor, Oryza barthii, approximately 3000 years ago (see Figure 1). This event resulted in a species that is now recognised as Oryza glaberrima (Portères, 1962).

Figure 1: Domestication of African rice. A. Evolutionary history of cultivated rice species. Two separate domestication events resulted in the speciation of O. sativa from O. rufipogon in East Asia, and O. glaberrima from O. barthii in West Africa, respectively. Adapted from (Purugganan, 2014). B. Traditional cultivation range of African rice in West Africa. Adapted from (Vaughan, 1994).

With colonial expansion in the age of exploration, rice and knowledge of its cultivation were transported by European tradesmen to different continents. As a result, O. sativa was brought to the rice growing communities of West Africa probably as early as the 17th century (Linares, 2002). Both Asian and African rice were also introduced in the Americas, where the higher yielding Asian rice was grown on plantations and the hardier African rice on smaller fields that were kept by their enslaved owners – otherwise known as ‘botanical gardens of the dispossessed’ (Carney, 2005).


After the onset of industrialisation, Asian rice steadily overtook African rice as a subsistence and commercial crop. Yet, both species are still grown today and offer a stunning variety of natural landraces that thrive under different environmental conditions. Asian and African rice have distinct phenotypic characteristics: their grains differ in colour, size, shape and taste. Whereas Asian rice can be milled mechanically, facilitating large scale production, African rice grains break easily and have to be milled manually with mortar and pestle. These characteristics have favoured the cultivation of Asian rice over African rice in large parts of the world. In Africa, O. glaberrima has largely been replaced by Asian rice, even though African rice is more resistant to abiotic stresses and is often preferred for its taste and its diversity in maturation time (Linares, 2002). In the Americas, African rice continues to survive in a ritual context and is still cultivated by Maroons (descendants of slaves) living in secluded communities in the interior forests of Suriname and French Guiana. Maintaining much of their traditional ways, they grow rice to honour their ancestors and use it for ritual offerings, rather than for their own consumption (Van Andel, 2010).

Globalisation places these local cultural traditions and the neglected species associated with them under threat (Andel, Van der Velden, & Reijers, 2016). In addition, food demand is rising in many African countries as a result of the growing population, a trend which is reflected in annual rice consumption (see Figure 2). Yet, food security is increasingly under pressure from ongoing land use and climate change, limiting the availability of suitable crop land (Schmidhuber & Tubiello, 2007). Both processes have accelerated the shift in cultivation from local African varieties to the more productive Asian varieties. As a result, many traditional landraces of O. glaberrima are disappearing, or have already disappeared (Linares, 2002).

Figure 2: Sub-Saharan Africa per capita rice consumption. Data source: US Department of Agriculture, FAOSTAT population database (FAO). Adapted from (Mohanty, 2013).


Even though Asian rice has higher yields, the diminishing genetic diversity of African rice may lead to the loss of other important agronomic traits (such as salt tolerance or blight resistance) that are not represented in O. sativa. Loss of these traits from the gene pool is irreversible and limits the capacity of this species to resist a changing climate – and that of breeders to produce more resilient varieties. An understanding of the evolution of O. glaberrima and its adaptation to different natural environments is therefore an important step in characterising the agronomic potential of this species, the protection of which will be indispensable for sustaining genetic crop diversity and a food secure future.

Evolutionary origins

Two main competing hypotheses have been proposed concerning the domestication of rice in Africa. One proposes that plant domestication in Africa occurred in a non-centric manner, over a protracted period of time (Harlan, De Wet, & Stemler, 1976), and has been called the ‘protracted transition model’. According to this hypothesis, rice was domesticated in multiple areas of West Africa, without a defined moment and centre of origin. The other proposes a single centre of domestication along the Niger River, followed by two secondary diversification events: one along the coast of what are now the countries of Senegal and Gambia, and one in the Highlands of Guinea (Portères, 1962). This has been called the ‘rapid transition model’. According to a particular theory supporting the latter hypothesis, domestication was triggered at an acute time point when climate change started transforming forests into savannah around 4000 years ago (Clark, 1967). The sudden drying meant that the increasing population could no longer rely on traditional forest products.

However, the nature of hunter-gatherer and pastoral societies in West Africa calls into question whether a definite centre of origin is likely ever to be found (Shaw, 1976). Human migration may have assisted the exchange of particular rice varieties between ethnic groups, diminishing the differences between them. In addition, ongoing hybridisation with O. barthii and later cultivation alongside O. sativa may have caused interspecific gene flow, which further complicates inferences about domestication origin.

Besides the centric and non-centric models, a third theory about crop domestication stipulates multiple (usually two) defined centres, as has been observed in both Asian rice (Gross & Zhao, 2014) and barley (Allaby, 2015). Such a polycentric origin can explain the existence of two distinct, geographically separated sub-populations or even sub-species, like O. sativa ssp. indica (which originated in India) and O. sativa ssp. japonica (which originated in China). These subspecies of rice have separate origins, although later domestication stages saw extensive gene flow between the two, which has been associated with the transfer of domestication alleles (Choi et al., 2017). An overview of the various domestication hypotheses can be seen in Figure 3.


Figure 3: Overview of crop domestication hypotheses (panels: non-centric, polycentric, centric). The debate about African plant domestication has mainly revolved around the non-centric (protracted transition) model and centric (rapid transition) model. An alternative model is derived from the idea of domestication centres, but holds that there can be more than one. Scholars have argued in favour of two separate domestications in both Asian rice and barley.

Several studies have tried to illuminate the question of how and where African rice originated, either implicitly or explicitly addressing the domestication hypotheses described above. A study of fourteen unlinked nuclear genes compared the relative levels of genetic diversity and population subdivision among 20 O. glaberrima and 20 O. barthii individuals, all sampled from the proposed domestication centre (Li, Zheng, & Ge, 2011). This study found O. glaberrima to have 70% lower genetic diversity and hardly any population structure compared to its wild relative, supporting a single origin around the Niger River. In contrast, Semon et al. (2005) found five cryptic sub-populations based on a study of 93 microsatellite markers in 198 O. glaberrima and 9 O. sativa accessions, and proposed that this internal structure was partly driven by introgression from O. sativa and partly by local adaptation. Two of the sub-populations showed interspecific admixture with O. sativa. In a later study, these natural hybrids were shown to be true O. sativa varieties (Orjuela et al., 2014). The three other groups were specific to O. glaberrima and associated with different phenotypic traits (Semon et al., 2005). These ‘ecotypes’ correspond to the widespread ‘floating’ rice, which is hypothesised to be the ancestral variety, and the derived ‘non-floating’ and ‘upland’ varieties, which are more adapted to brackish and dry conditions, respectively (Semon et al., 2005). Despite this population structure, the authors found no isolation by distance and therefore argued that the maintenance of sub-populations was mainly caused by artificial selection and human-mediated gene flow. A study of genetic diversity based on 235 single nucleotide polymorphisms (SNPs) also found two distinct populations in 266 O. glaberrima accessions, which did not show any correlation with geography – in contrast to their wild relative, of which 101 accessions clearly clustered into three geographically distinct sub-populations (Orjuela et al., 2014). Contrary to Semon et al. (2005), however, this study failed to link population structure to phenotypic traits.


A potential shortcoming of these earlier studies is the type and quantity of genetic markers analysed, which leads to a low resolution of genetic diversity. This limitation is overcome with the introduction of genome-wide analyses and tools. Using RNA-seq technology, Nabholz et al. (2014) analysed more than 12,000 RNA transcripts from 10 wild and 9 domesticated African rice individuals. They thus confirmed that O. glaberrima, with a synonymous nucleotide diversity of 0.0006 per site, underwent an extreme genetic bottleneck and is probably the least genetically diverse crop ever documented – suggesting that the strong reduction in effective population size carried a significant genetic load or ‘cost of domestication’, amplified by the high selfing rate. The relatively low level of diversity was confirmed by subsequent genomic studies. In 2014, Wang et al. published an assembly of the only currently available reference genome of O. glaberrima (AGI1.1). Whole genome resequencing of 94 O. barthii and 20 O. glaberrima accessions revealed that domesticated rice consistently clustered with one of five sub-populations of wild rice, suggesting a single origin in the area of Senegal, Gambia, Guinea and Sierra Leone. A genome-wide association study of 93 additional O. glaberrima accessions fine-tuned this finding by providing evidence for geographically localised diversification within this region (Meyer et al., 2016), specifically suggesting that reduced salt tolerance may have evolved in the tropical south of the western Atlantic coast in response to higher rainfall and reduced salinity. In addition, this study supports a period of low-intensity cultivation that may have started as early as 10,000 years before the effective population size reached a low point around 3000 years ago, when African rice was reputedly domesticated.

Evidence from across the Atlantic

Whilst genetic and, more recently, genomic data from West Africa have been steadily accumulating over the past years, genetic evidence of O. glaberrima from other locations is still scarce. Apart from a single accession that was sampled from a breeding station in Guyana and was included in previous studies (Meyer et al., 2016; Wang et al., 2014), few, if any, field collections from Latin America have been sequenced to date (see Appendix A). However, research by Van Andel et al. (2010) has confirmed the ongoing cultivation and use of O. glaberrima by local Maroon communities in Suriname, proving that African rice migrated across the Atlantic Ocean and successfully survived on the American continent. This most likely happened as a consequence of the colonial slave trade, which was responsible for the transportation of 12.5 million slaves over a period of 350 years (Van Andel et al., 2016). The migration of rice and its cultivation in the Americas – first on coastal plantations and later in the interior forests after slaves escaped and established their own settlements deep in the woods – probably resulted in a genetic bottleneck and the selection of landraces from West Africa that were well suited to their new environment. Indeed, resequencing data of an accession from Suriname confirmed its similarity to West African landraces from the Guinea Highlands and thereby its likely identity as a rain-fed, upland variety (Van Andel et al., 2016). This is consistent with historical data (see Figure 4), which tell us rice was picked up by slave ships mostly along the ‘rice coast’ (Guinea, Sierra Leone, Liberia and Ivory Coast), and with the tropical climate observed in these countries as well as in Suriname.

Figure 4: Logbook entry of the Dutch slave ship D'Eenigheid. This ship travelled from Flushing (The Netherlands), past West Africa, across the Atlantic Ocean and back again between 1761 and 1763. On the coast of Liberia, the crew bought four slaves (“een man, twee vrouwen en een jongen”: a man, two women and a boy) and twenty chests of rice. Adapted from (Van Andel et al., 2016).

Recently a new specimen was collected from Suriname, in addition to the one previously analysed by Van Andel et al. (2016). Comparing their genomes can demonstrate to what degree they are similar and share the same ancestry. Since the two Surinamese accessions were sampled from different locations in Suriname, they might stem from populations of different descent. TVA6749 was bought from a market in Paramaribo (Van Andel et al., 2016), which is close to Saramaccan (Saramaka) territory; TVA6745 was collected from a field along the Tapanahony river in the rainforest of Sipaliwini, which is home to the Aucan (Ndyuka) Maroons (see Figure 5). The Saramaka and the Ndyuka are the two largest Maroon communities in Suriname and each has its own unique language, history and cultural traditions. The cultural and geographic distance between these communities raises the question whether the landraces of O. glaberrima that they cultivate are also genetically distinct, or whether they grow similar varieties.

Figure 5: Sampling locations of O. glaberrima in Suriname. Collection sites are indicated in red. Adapted from (Van Andel, 2010).

Problem statement

Thus far, no conclusive evidence has been provided in favour of either the non-centric or the centric domestication hypothesis. Although Wang et al. (2014) support a single origin of domestication, this place of origin is not located along the Inner Niger Delta as suggested by Portères, but rather in what Portères proposes to be the secondary diversification centre(s). In contrast, Meyer et al. (2016) support a scenario of diversification in this region, although they make no claims as to the original centre of domestication, and even provide evidence for the protracted model. None of the other genetic studies mentioned in the previous section were able to pinpoint a clear centre of domestication. In addition, evidence regarding population structure is inconclusive and varies from no observed structure at all (Li et al., 2011; Nabholz et al., 2014), to clearly differentiated (Orjuela et al., 2014; Semon et al., 2005) and geographically localised (Meyer et al., 2016) sub-populations. Furthermore, the use of widely divergent types and quantities of data, including RNA transcripts, microsatellites, gene markers and genome-wide SNPs, precludes a systematic comparison of the results of these studies. Whereas evidence regarding the evolutionary history of O. glaberrima on the Atlantic coast of Africa is varied and contradictory, evidence regarding the evolutionary history of O. glaberrima on the Atlantic coast of the Americas is scant. Specifically, it is unknown how many independent migration events took place across the Atlantic, and whether or not these resulted in the introduction of different varieties of African rice. In addition, there is no knowledge of subsequent adaptation of rice in this new environment, which may have accelerated differentiation from its African relatives.
Although it is appealing to think that African rice in Suriname was pre-adapted to the local climate, it is currently unknown whether the similarity to the Guinean forest landraces is restricted to this accession or representative of the wider population of African rice on the coast of Latin America. The available genomic data enable a reinterpretation of previous results and might clarify some of the present ambiguities regarding the origin and diversification of African rice. In addition, the new specimen from Suriname offers the opportunity to test its relation to the other Surinamese and West African accessions. Thus, while many questions concerning the domestication and migration of African rice are still outstanding, the complete genome sequences of more than a hundred O. glaberrima accessions – and an almost equal number of O. barthii accessions – provide a wealth of genetic data that can be used to reconstruct the evolutionary history of African rice, and to compare the genetic diversity of O. glaberrima in its different localities.


Objectives

The primary objective of the present study was to elucidate the origin and diversification of O. glaberrima in West Africa through a critical reassessment of publicly available whole genome resequencing data. This was done by compiling the resequencing data of accessions used in the studies by Wang et al. (2014) and Meyer et al. (2016). These data were analysed through a combination of population genetic and phylogenetic approaches. The secondary objective was to confirm the species identity and establish the origin of a newly collected specimen of O. glaberrima from the interior of Suriname. This was done by next-generation sequencing and phylogeographic analysis with other O. glaberrima genomes. Since only one other accession from Suriname has been sequenced to date, detailed population genetic analyses for this region were not possible. However, the comparison of the two samples presented here can serve as a basis for further sampling and research into the evolution of African rice in the Americas.

Research questions

In order to meet the objectives, the following research questions were formulated around the main question: “How have domestication and human-assisted migration shaped the geographic pattern of variation in O. glaberrima?”

1. What is the relative genetic diversity of O. glaberrima and O. barthii?
2. Which (domestication) traits show evidence of positive selection?
3. What is the population structure in O. glaberrima and O. barthii?
   a. Which populations of O. glaberrima and O. barthii share the same ancestries?
   b. What is the degree of genetic differentiation among these populations?
   c. To what extent is population structure caused by isolation by distance?
4. What are the phylogenetic relationships among O. barthii and O. glaberrima?
   a. Do all O. glaberrima accessions cluster within the same O. barthii population?
   b. Do domestication genes share the same evolutionary histories?
5. What is the geographical origin of O. glaberrima in Suriname?
   a. What are the closest relatives of the Surinamese samples, and where are they from?

Hypotheses

We assumed the centric, rapid transition model as a null hypothesis, and the protracted transition model with multiple domestication centres as an alternative hypothesis. Based on the null hypothesis, we expected that the study would point to a single origin of African rice domestication in West Africa, unless the results clearly provided evidence to the contrary. Related to the research questions, we specifically expected that:

1. African rice underwent an extreme bottleneck due to domestication, resulting in levels of genetic diversity that are significantly lower than those of its wild relative.
2. Positive selection during domestication resulted in ‘hard’ selective sweeps that subsequently drove variants under selection (and linked variants) to fixation in the entire population, including known domestication genes. As a consequence, African rice shows an excess of derived alleles with high frequencies.
3. African rice is genetically largely homogeneous compared to its wild relative, although there is some genetic differentiation visible within West Africa between the coastal and inland populations and between the arid and tropical populations, respectively. There is no isolation by distance. Wild African rice, on the other hand, displays a clear population structure that is linked to geography.
4. All cultivated rice clusters together with the same population of wild rice, pointing to a single origin. Individual genes each deviate somewhat from the genome-wide phylogenetic pattern, but otherwise show similar evolutionary relationships.
5. The mass migration of rice and people to the Americas implies that the introduction of African rice was not a singular event. Given the climate in Suriname, both samples collected there may be genetically similar to the Guinean forest landraces, but they are likely to descend from different lineages and hence cluster with different relatives.
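The second expectation, an excess of high-frequency derived alleles, can be made concrete with a small sketch. The Python code below is illustrative only and is not the pipeline used in this study: the function names and the toy allele counts are hypothetical, and the neutral expectation assumes the standard neutral model, under which the number of sites carrying i derived copies is proportional to 1/i.

```python
import numpy as np


def derived_sfs(derived_counts, n):
    """Unfolded site frequency spectrum for n sampled chromosomes.

    Returns the number of sites carrying i derived copies for i = 1..n-1.
    Sites with 0 or n derived copies are not segregating and are dropped.
    """
    counts = np.asarray(derived_counts)
    return np.bincount(counts, minlength=n + 1)[1:n]


def neutral_expectation(n, num_sites):
    """Expected spectrum under the standard neutral model (proportional to 1/i)."""
    weights = 1.0 / np.arange(1, n)
    return num_sites * weights / weights.sum()


# Hypothetical toy data: derived-allele counts at ten sites in n = 10 chromosomes.
n = 10
toy_counts = [1, 1, 1, 2, 2, 3, 9, 9, 9, 9]
observed = derived_sfs(toy_counts, n)
expected = neutral_expectation(n, observed.sum())

# A positive value indicates more near-fixed derived alleles than expected.
excess_high = observed[-1] - expected[-1]
print(observed, np.round(expected, 2), round(float(excess_high), 2))
```

In practice the derived state has to be polarised with an outgroup, and demographic events such as a bottleneck also distort the spectrum, so an excess in the highest class is suggestive rather than conclusive.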

Approach

To test the hypotheses, the study was divided into five parts. The first four aim at a deeper understanding of the evolution of O. glaberrima in West Africa, using available resequencing data. The last aims to shed more light on the history of O. glaberrima in Suriname, including new resequencing data that have not been used before. Details of the approach are given in the Methods section. A summary is given below.

Firstly, raw sequencing data of all available accessions from Wang et al. (2014) and Meyer et al. (2016) were aligned to the O. glaberrima reference genome and used as a basis for SNP calling. After quality filtering, the study proceeded with two call sets: a full set of SNPs (call set 1a) and a reduced set of SNPs (call set 1b). The relative genetic diversity of O. glaberrima and O. barthii was estimated by measuring SNP density, nucleotide diversity (π) and Tajima’s D and by comparing the Minor Allele Frequency (MAF) spectrum. This was done to confirm the genetic bottleneck in O. glaberrima as a result of domestication.

Secondly, evidence of positive selection was assessed by looking at the derived allele frequency spectrum and comparing this to the frequency spectrum under neutral expectations. Deviations from neutral expectations in the higher frequency classes were interpreted as evidence of selection.
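To illustrate two of the summary statistics named above, the following minimal Python sketch computes nucleotide diversity (π, here as a sum over sites) and Tajima’s D from per-site allele counts. It is a sketch under stated assumptions rather than the software used in this study: the function names and toy counts are hypothetical, all sites are assumed to be biallelic and called in every chromosome, and D follows the standard textbook formula.

```python
import numpy as np


def pairwise_diversity(alt_counts, n):
    """Expected number of pairwise differences, summed over sites.

    alt_counts: alternate-allele count per site; n: sampled chromosomes.
    Divide by the number of callable sites to obtain per-site pi.
    """
    k = np.asarray(alt_counts, dtype=float)
    return float(np.sum(2.0 * k * (n - k) / (n * (n - 1))))


def tajimas_d(alt_counts, n):
    """Tajima's D from per-site alternate-allele counts."""
    k = np.asarray(alt_counts)
    seg = k[(k > 0) & (k < n)]           # keep segregating sites only
    s = seg.size
    if s == 0:
        return float("nan")
    i = np.arange(1, n)
    a1, a2 = np.sum(1.0 / i), np.sum(1.0 / i**2)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1, e2 = c1 / a1, c2 / (a1**2 + a2)
    theta_w = s / a1                     # Watterson's estimator
    pi = pairwise_diversity(seg, n)
    return float((pi - theta_w) / np.sqrt(e1 * s + e2 * s * (s - 1)))


# Hypothetical toy data for n = 10 chromosomes and eight segregating sites.
n = 10
rare = [1] * 8       # only singletons: excess of rare variants
common = [5] * 8     # only intermediate-frequency variants
d_rare, d_common = tajimas_d(rare, n), tajimas_d(common, n)
print(round(d_rare, 2), round(d_common, 2))
```

An excess of rare variants gives a negative D and an excess of intermediate-frequency variants a positive D, which is why the statistic is read alongside π when interpreting a bottleneck.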

Adaptation of O. glaberrima was further explored by scanning for regions of the genome that show signs of a selective sweep¹. Selective sweeps can be caused by complete or incomplete selection on a single new mutation, leading to a ‘hard’ sweep (consisting of a single haplotype), or on multiple new mutations or pre-existing variants, leading to a ‘soft’ sweep (consisting of multiple haplotypes) (Messer & Petrov, 2013). In this study, we aimed to identify recent hard sweeps by localising deviations in the site frequency spectrum as compared to the genome-wide background. These results were used to identify candidate regions harbouring alleles that have been driven to (near) fixation as a result of artificial selection.

Thirdly, population structure of O. glaberrima and O. barthii was determined by estimating the number of ancestral populations and the fractions of these ancestries represented in each individual. This formed the basis for the identification of extant sub-populations that were used for subsequent analyses. Genetic differentiation between the sub-populations was quantified using the fixation index (FST). To determine whether geographic separation is able to account for (part of) the genetic differentiation between sub-populations, isolation by distance was evaluated as the correlation between individual relatedness and distance. These results demonstrate how much substructure the domesticated rice population possesses compared to its wild relative, and which part of this substructure can be explained by geography.

Fourthly, the genetic relatedness of O. glaberrima to O. barthii was assessed by calculating pairwise genomic distances between individuals and using these to reconstruct evolutionary relationships. To determine whether there might have been multiple centres of domestication, gene haplotypes were used to uncover the evolutionary histories of candidate domestication genes. The resulting phylogenies were analysed in light of the whole genome tree. These results indicate whether any of the domestication genes possess unique relationships that deviate from the overall phylogeny, which would point to the possibility of multiple domestication events.

Fifthly, the genetic similarity of the two Surinamese accessions was analysed by sequencing the nuclear DNA of the newly collected sample and including this sequence in SNP calling together with the other available O. glaberrima genomes. This led to a new set of variants (call set 2). Pairwise distances were recalculated for phylogenetic analysis and mapped geographically in order to identify the closest relatives of this sample and its likely point of origin. This was done to assess whether the two Surinamese accessions might share a single origin.

¹ Persistent selection on polymorphisms can leave patterns in the genome that are characterised by a local reduction in variation: a so-called signature of selection (Nielsen, 2005). The extent of this pattern is determined by the intensity of selection, the time scale during which selection has operated and the recombination rate in this genomic area. The result of these processes combined is known as a ‘selective sweep’.

RESULTS

Call sets

This study used publicly available whole genome data of 206 accessions of O. glaberrima and O. barthii sampled from across both species' ranges (see Appendix B). SNPs were called according to criteria summarised in Appendix C. An overview of the resulting variation found in both species can be found in Table 1. From these results, it is evident that the vast majority of polymorphic sites in O. glaberrima are shared with O. barthii. In contrast, almost half of the variation in O. barthii is unique to the wild species alone. The level of polymorphism in the domesticated accessions is also much lower, and consists of a higher number of singletons than in the wild species. The ratio of synonymous to non-synonymous substitutions appears to be roughly the same (between 0.80-0.85), as does the relative portion of protein coding variation (less than 10% of the total SNP count), although both are a bit lower in O. glaberrima. Both O. glaberrima and O. barthii are predominantly selfing plants. This is reflected in their low levels of heterozygosity, which range from 0-5% of all sites per individual. Although O. glaberrima is expected to be more inbred than O. barthii, an accurate comparison of both species was hampered by discrepancies in their sequencing coverage. After controlling for this by removing sites or individuals with excessively low coverage, however, we see a clear reduction in the fraction of heterozygous sites in O. glaberrima (see Appendix D), which is several times lower than in O. barthii.

Table 1: Distribution of SNPs after quality filtering. Columns show the number of variants over both species and each species separately, under different filtering conditions. Varying the filtering conditions resulted in two versions of the same call set: one full set of SNPs (call set 1a) and one reduced set of SNPs (call set 1b), with fewer variants of higher quality.

Call set        | 1a        | 1a         | 1a            | 1b        | 1b         | 1b
Population      | Total     | O. barthii | O. glaberrima | Total     | O. barthii | O. glaberrima
Accessions      | 206       | 94         | 112           | 206       | 94         | 112
SNP count       | 3,923,601 | 3,797,182  | 2,322,659     | 2,644,126 | 2,580,362  | 1,419,601
Unique          | 1,727,361 | 1,600,942  | 126,419       | 1,288,289 | 1,224,525  | 63,764
Shared          | 2,196,240 | 58%        | 95%           | 1,355,837 | 53%        | 96%
Singletons      | 239,322   | 345,025    | 597,486       | 116,390   | 151,804    | 456,772
Transitions     | 2,823,188 | 2,733,322  | 1,686,367     | 1,919,367 | 1,872,667  | 1,043,405
Transversions   | 1,100,413 | 1,063,860  | 636,292       | 724,759   | 707,695    | 376,196
Ratio (Ts:Tv)   | 2.57      | 2.57       | 2.65          | 2.65      | 2.65       | 2.77
Protein coding  |           | 288,061    | 164,001       |           | 188,196    | 97,532
Coding fraction |           | 0.076      | 0.071         |           | 0.073      | 0.069
Synonymous      |           | 130,054    | 73,962        |           | 84,662     | 42,890
Non-synonymous  |           | 153,612    | 87,468        |           | 100,894    | 53,136
Ratio (dS/dN)   |           | 0.847      | 0.846         |           | 0.839      | 0.807
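The derived rows of Table 1 (Ts:Tv ratio, coding fraction and the ratio of synonymous to non-synonymous counts) follow directly from the raw counts. The short sketch below recomputes them for call set 1a as a consistency check. The dictionary layout and the function name are illustrative assumptions and not part of the original pipeline.

```python
# Raw counts copied from Table 1 (call set 1a); the layout is an assumption.
TABLE_1 = {
    "O. barthii": dict(snps=3_797_182, ts=2_733_322, tv=1_063_860,
                       coding=288_061, syn=130_054, nonsyn=153_612),
    "O. glaberrima": dict(snps=2_322_659, ts=1_686_367, tv=636_292,
                          coding=164_001, syn=73_962, nonsyn=87_468),
}


def derived_rows(counts):
    """Recompute the ratio rows of Table 1 from raw SNP counts."""
    return {
        # transitions divided by transversions
        "ts_tv": round(counts["ts"] / counts["tv"], 2),
        # share of all SNPs that fall in protein coding sequence
        "coding_fraction": round(counts["coding"] / counts["snps"], 3),
        # synonymous count divided by non-synonymous count
        "ds_dn": round(counts["syn"] / counts["nonsyn"], 3),
    }


for species, counts in TABLE_1.items():
    print(species, derived_rows(counts))
```

The same function can be applied to the counts of call set 1b to check the remaining columns.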


Genetic diversity

A better estimate of the relative genetic diversity in O. barthii and O. glaberrima was obtained by comparing the SNP density, π and Tajima’s D across the genome. Genome-wide sliding windows of these statistics can be found in Appendix E. Average SNP density was consistent with total SNP counts and found to be almost twice as high in O. barthii as in O. glaberrima, amounting to 9.04 SNPs per kb in O. barthii and 5.00 SNPs per kb in O. glaberrima. This is similarly reflected in the nucleotide diversity of the two species, which was significantly lower in the cultivated species (πcultivated = πc = 0.0007) than in the wild species (πwild = πw = 0.0013) at p < 1.0E-05. The relative nucleotide diversity (πw/πc) was found to be 1.87 across the genome, but was markedly higher in some regions (see Figure 6A). Tajima’s D was found to be significantly different between the two species (p < 1.0E-05), being predominantly negative in O. glaberrima (-0.6761) and positive in O. barthii (0.5172). A negative value of Tajima’s D indicates that the average number of pairwise differences between two sequences falls below what is expected based on the total number of segregating sites in a population, meaning that a relatively large number of segregating sites occur in only a few samples. Hence, the relative levels of Tajima’s D suggest that large parts of the O. glaberrima genome exhibit an excess of rare variants as compared to O. barthii. This is compatible with the general trend in π ratio (πw/πc), which is usually well above one (see Figure 6A) and takes on particularly high values when Tajima’s D is extremely negative. The excess of rare variants in O. glaberrima is confirmed by the Minor Allele Frequency (MAF) spectrum (see Figure 6B), where O. glaberrima shows a larger spike in low frequency alleles (MAF < 0.01) than O. barthii, in which the majority of alleles are of intermediate frequency (MAF 0.01-0.05). Intermediate allele frequencies are generally observed during population contraction, when a genetic bottleneck filters rare variants out of the population. In contrast, low allele frequencies usually occur following a genetic bottleneck (such as those caused by domestication), when population size increases again. The exact impact size is dependent on the severity of the bottleneck, the time of recovery and the recombination rate in the genome (Thornton, 2005). The low levels of nucleotide diversity and the large number of rare variants found in O. glaberrima are thus consistent with a scenario of population expansion following a sudden drop in effective population size.
These results are in congruence with previous findings (Nabholz et al., 2014; Wang et al., 2014; Meyer et al., 2016) and indicate that a strong reduction in diversity in O. glaberrima occurred as a result of domestication.
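To make the relation between π, the number of segregating sites and the sign of Tajima’s D concrete, the sketch below computes both statistics from a list of minor allele counts using the standard definitions of Tajima (1989). It is a minimal illustration and not the software used in this study. The function names and the toy inputs are assumptions, and n denotes the number of sampled sequences.

```python
from math import sqrt


def nucleotide_diversity(allele_counts, n):
    """Mean number of pairwise differences, summed over all sites."""
    return sum(2 * k * (n - k) / (n * (n - 1)) for k in allele_counts)


def tajimas_d(allele_counts, n):
    """Tajima's D for n sequences, given one allele count per site."""
    s = sum(1 for k in allele_counts if 0 < k < n)  # segregating sites
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = nucleotide_diversity(allele_counts, n)
    watterson = s / a1  # diversity expected from the number of segregating sites
    return (pi - watterson) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: ten sites in ten sequences.
rare = [1] * 10          # every variant is a singleton
intermediate = [5] * 10  # every variant sits at 50% frequency
print(tajimas_d(rare, 10), tajimas_d(intermediate, 10))
```

In the toy example the singleton-only sample yields a negative D and the intermediate-frequency sample a positive D, mirroring the contrast described for O. glaberrima and O. barthii.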


Figure 6: Relative genetic diversity and allele frequencies in domesticated and wild rice. A. Nucleotide diversity (π) ratio and Tajima’s D in O. glaberrima and O. barthii along chromosome 1, calculated in window sizes of 100 kb. Strong differences in Tajima’s D correspond to outliers in π. B. Minor allele frequency spectra of O. glaberrima and O. barthii. O. glaberrima has an excess of low frequency variants (MAF < 0.01), whereas O. barthii has more intermediate frequency variants (MAF 0.01-0.05).


Evidence of positive selection

While the previous results show that the assumptions of neutrality do not hold, this deviation from neutrality could be caused by changes in the effective population size as well as by artificial and natural selection. It is therefore not excluded that the strongly negative value of Tajima’s D and the excess of rare minor alleles can be explained to an extent by demographic history, rather than by selection pressure alone. To test whether positive selection has impacted the site frequency spectrum, we looked at the relative distribution of derived alleles between O. glaberrima and O. barthii. Previous studies on the domestication of Asian rice have found an excess of high frequency derived alleles among non-coding and synonymous substitutions in two subspecies of O. sativa (Caicedo et al., 2007). In extreme cases, the right-hand side of the derived frequency spectrum will mirror the left-hand side, and create a U-shaped distribution. Whereas demographic factors can usually account for the occurrence of derived alleles at extremely low frequencies, it is extremely unlikely that population size and genetic drift alone can cause the shift of derived alleles to extremely high frequencies. Such a U-shaped derived allele frequency spectrum is therefore used as evidence of positive selection, but has not been demonstrated in African rice to date. To determine which alleles are derived and ancestral, SNPs were polarised with respect to an outgroup. For this purpose, we chose the Oryza meridionalis (v1.3) genome sequence (Jacquemin, Bhatia, Singh, & Wing, 2013). A justification for the outgroup is given in Appendix F. A total of 3,923,601 variants were screened, of which 2,332,467 either did not align to the outgroup species, or did not pass the alignment quality filter. The derived allele frequency spectrum was calculated for all synonymous and noncoding variants of the remaining 1,591,134 SNPs.
In contrast to the expected site frequency spectrum under neutral conditions, a large number of high frequency derived alleles is observed in both populations (see Figure 7A and Figure 7B). Although both species show an excess of high frequency derived alleles, O. glaberrima shows a greater excess (35% of total SNPs above expectation) than O. barthii (27% of total SNPs above expectation) in the far-end of the spectrum (0.7 – 1.0). This excess is also caused by higher frequency classes in O. glaberrima (21% of total SNPs > 0.99) than in O. barthii (18% of total SNPs > 0.95) (see Figure 7C). Over the whole spectrum, the difference between observed and expected frequencies was shown to be greater in O. glaberrima than in O. barthii in a two-sample Kolmogorov-Smirnov test (p < 1.5E-04) (see Figure 7D). This disproportional skew in favour of high frequency derived alleles in O. glaberrima suggests that large parts of the genome bear signs of recent positive selection.
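The comparison between observed and expected derived allele frequencies can be sketched as follows. The example assumes that the expectation is the standard neutral one for a population of constant size, in which the number of sites with i derived copies is proportional to 1/i. The expectation used for Figure 7 may have been derived differently, so this is an illustration only. All function names and the toy counts are hypothetical.

```python
def unfolded_sfs(derived_counts, n):
    """Count sites per derived allele class 1..n-1 (fixed sites are ignored)."""
    sfs = [0] * (n - 1)
    for k in derived_counts:
        if 0 < k < n:
            sfs[k - 1] += 1
    return sfs


def neutral_expectation(n, s):
    """Expected counts per class under neutrality: proportional to 1/i."""
    a1 = sum(1 / i for i in range(1, n))
    return [s / (i * a1) for i in range(1, n)]


def excess(observed, expected, n, low, high):
    """Observed minus expected sites with derived frequency in [low, high],
    expressed as a fraction of all segregating sites."""
    diff = sum(o - e
               for i, (o, e) in enumerate(zip(observed, expected), start=1)
               if low <= i / n <= high)
    return diff / sum(observed)


# Toy data: derived allele counts at seven sites in ten sequences.
counts = [1, 1, 1, 2, 9, 9, 9]
obs = unfolded_sfs(counts, 10)
exp = neutral_expectation(10, sum(obs))
print(obs, excess(obs, exp, 10, 0.7, 1.0))
```

A positive value for the high frequency range corresponds to the right-hand arm of a U-shaped spectrum.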


Figure 7: Derived allele frequency spectrum of non-coding and synonymous substitutions in O. barthii and O. glaberrima. A. Observed and expected marginal derived allele frequency spectrum of O. glaberrima. B. Observed and expected marginal derived allele frequency spectra of O. barthii. C. Derived allele frequency spectra of O. glaberrima and O. barthii together. D. Empirical cumulative distribution function of the difference between the expected and observed frequency spectra.

To further investigate the regions of the genome that might have been under recent and strong selection, a composite likelihood ratio (CLR) test was conducted. Overall, the genome-wide CLR was higher in O. glaberrima (1.12 on average) than in O. barthii (0.93 on average), consistent with the larger excess of high frequency derived alleles demonstrated in the previous section. A comparison of the outliers of both species reveals a single shared outlier on chromosome 4 (see Figure 8); this position falls inside a gene (ORGLA04G0148000) that is predicted to encode a magnesium ion binding protein with serine/threonine phosphatase activity (The UniProt Consortium, 2017). The 278 genes in the vicinity of (i.e., less than 25 kb away from) or overlapping with other outlier positions in O. glaberrima are mostly uncharacterised, but include several genes with high impact mutations that are involved in diverse biological processes such as auxin signalling, ADP binding and cell growth regulation (see Appendix G). The lack of overlap of these outliers with O. barthii suggests that the sweeps found in these regions are unique to the domesticated species.


To investigate whether genes that have been associated with domestication in O. sativa also show evidence of selection in O. glaberrima, we compared the location of these genes to the location of candidate sweeps. None of the twenty genes under inspection directly mapped to outlier regions (see Figure 8). When the threshold was lowered to include the top 5% highest values, only one gene mapped to an outlier region: GN1a, a gene on chromosome 1 associated with grain productivity. To increase confidence in the top 5% outliers, π-ratio and Tajima’s D were computed as independent tests for selection, in corresponding windows of 25 kb. Only one gene (IPA1) fell within a window with outlier values for both π-ratio and Tajima’s D (see Figure 9A); however, this window did not exhibit an elevated CLR.
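The mapping of genes to outlier positions described above can be sketched in a few lines. The example takes the positions with the highest test statistic for a given top fraction and reports the genes that overlap with, or lie within 25 kb of, such a position. It assumes a single chromosome. The function names, gene names and coordinates are hypothetical and serve only as an illustration.

```python
def top_outliers(scores, fraction):
    """Return the positions whose score lies in the top `fraction` of values.

    scores: dict mapping a genomic position (bp) to a test statistic.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


def genes_near(genes, outliers, flank=25_000):
    """Map each gene to the outlier positions within `flank` bp of it.

    genes: dict mapping a gene name to its (start, end) coordinates.
    """
    hits = {}
    for name, (start, end) in genes.items():
        near = [p for p in outliers if start - flank <= p <= end + flank]
        if near:
            hits[name] = near
    return hits


# Toy data: 200 positions whose score increases along the chromosome.
scores = {i * 1000: float(i) for i in range(1, 201)}
genes = {"geneA": (150_000, 170_000), "geneB": (10_000, 12_000)}

strict = top_outliers(scores, 0.005)   # top 0.5%
relaxed = top_outliers(scores, 0.05)   # top 5%
print(genes_near(genes, strict), genes_near(genes, relaxed))
```

In the toy data, geneA is missed at the strict threshold and picked up at the relaxed one, which is the same kind of shift as reported for GN1a.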

Figure 8: Log-transformed CLR test statistic (ω), used to scan for genomic regions under selection in O. glaberrima. Positions less than 25kb away from a known domestication gene are highlighted in green. Positions with the top 0.5% highest values for ω are considered outlier positions and are separated by a red line. Shared outliers are indicated with a black arrow.

Figure 9: Correspondence between ω and other neutrality tests. A. The ω-statistic (log(CLR)) versus relative nucleotide diversity (π ratio) in windows of 25 kb. B. The ω-statistic (log(CLR)) versus Tajima’s D in windows of 25 kb. Regions harbouring domestication genes are highlighted in green. Dashed grey lines depict 5% cut-off points.


Overall, the selection scan results show a low correspondence to other neutrality tests (see Figure 9). This is not unexpected, since these neutrality tests reflect different underlying metrics. Both π and Tajima’s D possess serious shortcomings when it comes to detecting selective sweeps, due to the confounding effect of demography on the site frequency spectrum. It is therefore not surprising that the outliers for one statistic do not match the outliers for another statistic – although a high value for π ratio, or a low value for Tajima’s D, would provide additional support. More surprising is the fact that not a single domestication gene shows clear-cut evidence of a recent, strong selective sweep. Either this is because the chosen selection scan still has problems separating the effects of demography from those of selection, or because the model of a single, hard sweep fails to explain the history of these genes. If the latter is the case, this calls into question the ‘single origin’ hypothesis, and the domestication of O. glaberrima might have resulted from more complex processes than simple selection scans are able to detect. Since these scans presume that a variant under selection swept through an entire population, the possibility that part of the population escaped the sweep, either due to different selection pressures or due to population substructure, remains unexplored.

Population structure

ADMIXTURE

To investigate the population structure between O. glaberrima and its wild relative, we re-examined the variation in both species in a joint ADMIXTURE analysis. Previous studies using whole genome data have only examined the population structure in the wild and domesticated species separately. Pooling the data allows us to infer which ancestor fractions are shared between the species and which are unique. Since genetic diversity is much higher in O. barthii than in O. glaberrima, it is expected that the genetic structure in the wild species is stronger and might crowd out some of the substructure within the domesticated species, making it appear genetically homogeneous compared to its wild relatives. To observe the effect of including or excluding O. barthii in the structure analysis, ADMIXTURE was therefore repeated with only the O. glaberrima accessions. The cross-validation standard error estimates showed that model fit was optimised at K = 8 when both wild and domesticated accessions were used (see Appendix H). Thus, a model was chosen where eight ancestral populations explain the observed genetic structure in the entire data set. Six of these are represented in both species; two are only represented in O. barthii and two others only in O. glaberrima (see Figure 10A). In contrast, model fit was optimised at K = 5 when only O. glaberrima accessions were used (see Appendix H), meaning that model fit improved for this species when the number of underlying ancestral populations was lowered. A model was therefore chosen where only five ancestral populations are used to explain the genetic structure in O. glaberrima alone.


[Figure 10, panels A, B and C. Panel A is split by species (O. barthii, O. glaberrima). Legend: OB-A | OB-B | OB-C | OB-D | OG-I | OG-II | OG-III | OG-IV | OG-V]

Figure 10: Population structure in O. barthii and O. glaberrima. A. Admixture analysis with K = 8 ancestral populations of wild and domesticated rice accessions combined. B. Geographic clustering of genetic O. glaberrima populations. Accessions were labelled according to their genetic background, with the colour of each dot representing the ancestral population (K = 5) that accounted for the majority (>50%) of pruned SNPs in that accession. C. Admixture analysis with K = 5 ancestral populations of domesticated rice accessions only.


This resulted in a population structure with roughly similar proportions of estimated ancestry, with the exception that two ancestral populations are now collapsed into one (see Figure 10C). Based on these results, the two species were subdivided in roughly evenly sized sub-populations: O. barthii in sub-populations OB-A through OB-D, and O. glaberrima in OG-I through OG-V, respectively. Previous genome-wide studies have also categorised O. glaberrima and O. barthii in distinct sub-populations, based on population structure analyses. In the case of O. glaberrima, this population structure has been linked to geographic origin (Meyer et al., 2016), with an 11°N and a 6°W cline dividing north from south and west from east, respectively. For the purpose of comparison, all West African accessions were therefore assigned to one out of four geographic populations (see Figure 11b): northeast (NE), northwest (NW), southeast (SE), and southwest (SW). Similarly, all O. barthii accessions were assigned to one out of five genetic clusters identified by Wang et al. (2014): OB-I, OB-II, OB-III, OB-IV and OB-V. It has to be noted that, whereas Meyer et al. (2016) identified geographic populations of domesticated accessions based on Principal Component Analysis (PCA), Wang et al. (2014) identified genetic populations of wild accessions. Hence, a one-to-one correspondence between the geographic and genetic populations of O. glaberrima is not expected; neither do we expect a one-to-one correspondence between the genetic populations of O. barthii, due to the addition of the large number of O. glaberrima individuals presented here. Nonetheless, we find that all but three OB-II accessions collectively form the OB-A subpopulation and all OB-I and OB-III accessions together are part of the OB-B subpopulation. OB-C is composed of equal numbers of OB-IV and OB-V accessions, and OB-D consists almost exclusively of OB-V accessions.
Interestingly, OB-C and OB-D represent the individuals that were previously found to form a clade with O. glaberrima (Wang et al., 2014). This is consistent with the observation that these individuals contain ancestor fractions that are also found in O. glaberrima, in contrast to the individuals from populations OB-A and OB-B. The only domesticated populations that do not appear to share ancestry with any of the wild populations are OG-II and OG-III. A large number of individuals within these populations contain substantial fractions of both ancestries, making these populations less readily distinguishable. These individuals were therefore categorised as either OG-II or OG-III depending on which out of K = 5 ancestor populations constituted the majority fraction. As is visible from the collection sites of the O. glaberrima accessions (see Figure 10B), the majority of accessions were sampled west of the 6°W cline and belong to one of three coastal populations: OG-I, OG-II and OG-III. While most of the OG-I accessions were collected in Senegal, Gambia (what was previously called Senegambia) and Guinea-Bissau, the OG-II and OG-III accessions occupy a wider range and are found predominantly in the Guinea Highlands, ranging from Guinea, Sierra Leone and Liberia to Ivory Coast. East of the 6°W cline, most accessions belong to populations OG-IV and OG-V. Whereas the OG-V population is concentrated around the Upper Niger River, covering both and Burkina Faso, the OG-IV population extends further inland from the Gulf of Guinea, including Ghana, Nigeria and Cameroon, all the way to Niger and Chad. These results suggest that population structure has a strong geographic component, which partially overlaps with the pattern observed by Meyer et al. (2016). The distribution of the five genetic populations over the NW, SW, NE and SE regions of West Africa is shown in Figure 11.

Figure 11: Distribution of O. glaberrima populations over the NW, SW, NE and SE regions of West Africa. Adapted from (Meyer et al., 2016). While populations OG-I through OG-III are localised along the Atlantic coast and therefore mostly correspond to the NW and SW clusters, populations OG-IV and OG-V are geographically more spread out and found primarily inland, representing the bulk of the NE and SE clusters.
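The two labelling rules used in this section, assignment to a geographic quadrant by the 11°N and 6°W clines and assignment to a genetic population by majority ancestry, can be sketched as follows. The treatment of accessions that lie exactly on a cline, the function names and the example coordinates are assumptions made for illustration.

```python
def geographic_cluster(lat, lon, lat_split=11.0, lon_split=-6.0):
    """Assign a collection site to NW, NE, SW or SE.

    Latitude is in degrees north, longitude in degrees east (west is negative).
    Sites exactly on a cline are counted as north or east (an assumption).
    """
    north_south = "N" if lat >= lat_split else "S"
    west_east = "W" if lon < lon_split else "E"
    return north_south + west_east


def majority_population(fractions, threshold=0.5):
    """Return the ancestral population holding more than `threshold` of an
    accession's ancestry, or None if no population reaches a majority."""
    label, share = max(fractions.items(), key=lambda item: item[1])
    return label if share > threshold else None


# Hypothetical accession with mixed OG-II and OG-III ancestry.
print(geographic_cluster(7.0, -10.0),
      majority_population({"OG-II": 0.55, "OG-III": 0.45}))
```

Because the geographic label depends only on coordinates and the genetic label only on ancestry fractions, the two classifications need not coincide, as the distribution in Figure 11 illustrates.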

Allelic differentiation

To determine whether there is genetic differentiation between the geographic sub-populations, the fixation index was calculated between individuals separated by the 11°N and the 6°W cline, respectively. The fixation index between O. glaberrima and O. barthii was calculated as a reference point for within O. glaberrima population comparisons (see Table 2). While O. glaberrima and O. barthii are differentiated greatly (FST > 0.15), the western and eastern populations were differentiated moderately (FST = 0.10) and the northern and southern populations were differentiated only by a small degree (FST < 0.05). This is in line with previous results, proposing a deep split between the eastern and western cultivation range followed by a later split between the northern and southern diversification centres (Meyer et al., 2016).


Table 2: Fixation index (FST) between species and between geographic clusters. The degree of differentiation is determined based on the mean weighted FST (Hartl & Clark, 1997).

Population 1                | Population 2                | Weighted FST | Degree of differentiation
O. barthii (all)            | O. glaberrima (all)         | 0.181        | Great (0.15 – 0.25)
east O. glaberrima (NE+SE)  | west O. glaberrima (NW+SW)  | 0.100        | Moderate (0.05 – 0.15)
north O. glaberrima (NE+NW) | south O. glaberrima (SE+SW) | 0.042        | Little (< 0.05)
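As an illustration of how a genome-wide fixation index and its qualitative label can be obtained, the sketch below combines per-site allele frequencies into a single FST as a ratio of sums and classifies the result using the thresholds listed in Table 2. Note that it uses Hudson's estimator (as formulated by Bhatia et al., 2013) because it is compact. This is a swapped-in technique and not necessarily the estimator behind the weighted FST values reported here. The function names are hypothetical.

```python
def hudson_fst(freqs1, freqs2, n1, n2):
    """Hudson's FST over many sites, combined as a ratio of sums.

    freqs1, freqs2: per-site allele frequencies in population 1 and 2.
    n1, n2: number of sampled sequences in each population.
    """
    numerator = denominator = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        numerator += ((p1 - p2) ** 2
                      - p1 * (1 - p1) / (n1 - 1)
                      - p2 * (1 - p2) / (n2 - 1))
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator


def degree_of_differentiation(fst):
    """Qualitative label following the ranges given in Tables 2 and 3."""
    if fst < 0.05:
        return "Little"
    if fst < 0.15:
        return "Moderate"
    if fst <= 0.25:
        return "Great"
    return "Very great"


for value in (0.181, 0.100, 0.042):
    print(value, degree_of_differentiation(value))
```

Summing numerators and denominators before dividing prevents sites with very low diversity from dominating the average, which is the motivation behind weighted genome-wide estimates.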

Because of the uneven sampling on both sides of the 6°W cline and because the geographic clusters are composed of unequal numbers of accessions with different ancestries, these numbers do not accurately reflect the allelic differences between the genetic sub-populations. In addition, they give no insight as to the relative degree to which these sub-populations are differentiated from their wild relatives. The fixation index was therefore recalculated for equal numbers (n = 15) of genetically homogeneous individuals from all five genetic populations and an equal number of O. barthii accessions. These 15 O. barthii individuals were selected from the 20 accessions that were sequenced at higher depth, to ensure sufficient coverage of data (see Appendix B). Even with this restricted sample size, it is clear that genetic differentiation from O. barthii is large (> 0.15) for all five populations, but relatively larger for the three coastal populations (> 0.25) than for the two inland populations (0.15 – 0.25). Genetic differentiation is the smallest for OG-IV and the largest for OG-II. This pattern is mirrored by the number of segregating sites remaining in each population after removing monomorphic SNPs, which is again the largest for OG-II and the smallest for OG-IV. The opposite trend can be seen for average nucleotide diversity per kb, which is smaller in the coastal populations (π < 1.0) than in the inland populations (π > 1.0). We thus observe that, even though the total number of polymorphic sites is larger in OG-I through OG-III, the average number of pairwise differences between these individuals is lower. This suggests that a smaller number of individuals carries a larger fraction of the polymorphic sites, which is consistent with a population expansion scenario, as explained before.

Table 3: Genetic attributes of five genetic O. glaberrima populations. Columns show relative sample size, degree of polymorphism and genetic differentiation from O. barthii. Fixation index (FST) reflects the differentiation between a subset of individuals (n = 15) from each population and an equal number of O. barthii. The degree of differentiation is determined by the mean weighted FST, with interpretations based on (Hartl & Clark, 1997).

Population | Accessions | Segregating sites | π/kb   | Weighted FST | Degree of differentiation
OG-I       | 22         | 1,443,290         | 0.9905 | 0.25641      | Very great (> 0.25)
OG-II      | 30         | 1,550,660         | 0.7504 | 0.30894      | Very great (> 0.25)
OG-III     | 19         | 1,646,372         | 0.9166 | 0.27588      | Very great (> 0.25)
OG-IV      | 21         | 1,236,010         | 1.179  | 0.20514      | Great (0.15 – 0.25)
OG-V       | 20         | 1,324,300         | 1.076  | 0.24287      | Great (0.15 – 0.25)


Isolation by distance

The combined evidence of the previous sections suggests that the increase in genetic differentiation from the inland to the coastal populations may be linked to geographic range expansion. To test whether the observed population structure could be (partly) explained by geography, isolation by distance (IBD) was assessed among all West African accessions. There are two theories of IBD: ecological IBD, which is caused by the reduced probability of two random individuals mating with increasing distance, and genetic IBD, which is explained by the accumulation of genetic differences through dispersal (Ishida, 2009). We evaluated genetic IBD as the correlation between pairwise relatedness, measured by the kinship coefficient (φ), and pairwise geographic distance, measured by the shortest distance between the collection sites of two accessions in kilometres. As expected, large discrepancies were observed in the geographic distances separating individuals between populations. The average pairwise distance was the smallest among OG-I, in which individuals cluster relatively close together, and by far the largest among OG-IV accessions (see Figure 12A). Kinship was on average higher among the coastal accessions than among the inland accessions, but still relatively constant (see Figure 12B). The kinship coefficient was almost exclusively negative, although by varying degrees, indicating that all individuals were essentially unrelated. Considering the uneven geographic range of the populations, we divided the O. glaberrima accessions into two segments: the three coastal populations – OG-I, OG-II and OG-III – pooled together, with a relatively small geographic range (less than 500 km on average), and the two inland populations – OG-IV and OG-V – pooled together, with a much larger geographic range (averaging more than 1000 km).
Pairs of individuals that were separated by more than 1.5 times the interquartile range of the pairwise geographic distance within each segment were omitted from the analysis, in order to reflect the most common geographic range of the group. Genetic IBD was discernible among the first group (the coastal populations), with a correlation coefficient (r) of -0.35 (see Figure 12C). This correlation was stronger than the observed correlation between distance and relatedness within any single population, or in all populations pooled together (see Appendix H). In contrast, there was hardly any IBD among the second group (the inland populations), with a correlation coefficient (r) of 0.02 (see Figure 12D). This suggests that, whereas in the inland regions geographic distance seems to be a very poor indicator of relatedness (perhaps partly because these individuals are less related, and partly because the geographic distances between them are so large), some of the population structure observed along the coast can indeed be explained by geographic distance. This would correspond to the accrual of mutations as O. glaberrima dispersed throughout the coastal range.
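The isolation by distance test described above amounts to correlating two quantities over all unique pairs of accessions. The sketch below computes the great-circle distance between collection sites with the haversine formula and the Pearson correlation between distance and kinship. It is a minimal illustration under stated assumptions: the Earth is treated as a sphere with a radius of 6371 km, the outlier filter is omitted, and the accession names, coordinates and kinship values are made up.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Shortest distance over the Earth's surface between two sites."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def isolation_by_distance(sites, kinship):
    """Correlate distance and kinship over all unique pairs of accessions.

    sites: dict mapping accession -> (latitude, longitude).
    kinship: dict mapping frozenset({a, b}) -> kinship coefficient.
    Returns the correlation and the number of pairs, N(N - 1)/2.
    """
    distances, coefficients = [], []
    for a, b in combinations(sorted(sites), 2):
        distances.append(haversine_km(*sites[a], *sites[b]))
        coefficients.append(kinship[frozenset((a, b))])
    return pearson_r(distances, coefficients), len(distances)


# Toy data: kinship declines linearly with distance, so r is close to -1.
sites = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (0.0, 3.0), "d": (1.0, 0.0)}
kinship = {frozenset(pair): -haversine_km(*sites[pair[0]], *sites[pair[1]]) / 1000
           for pair in combinations(sorted(sites), 2)}
r, n_pairs = isolation_by_distance(sites, kinship)
print(round(r, 3), n_pairs)
```

A negative correlation, as in the toy example, indicates that relatedness declines with distance, which is the pattern reported for the coastal populations.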


[Figure 12, panels A to D. A: pairwise geographic distance (km). B: pairwise relatedness (φ). C: IBD among OG-I, OG-II, OG-III. D: IBD among OG-IV, OG-V.]

Figure 12: Isolation by distance in O. glaberrima. A. Distribution of the pairwise geographic distances between individuals in kilometres, grouped per population. B. Distribution of the kinship coefficients between individuals, grouped per population. C. Isolation by distance among the coastal populations (OG-I, OG-II and OG-III). Outliers (separated by more than 1500 km) were omitted. D. Isolation by distance among the inland populations (OG-IV and OG-V). Outliers (separated by more than 3500 km) were omitted. Each dot symbolises a unique pair of individuals within each group. Whereas N denotes the number of accessions included in each analysis, the number of pairwise comparisons equals N(N − 1)/2 and is therefore markedly higher.


Phylogenetic relationships

Whole genome

Previous research has found that most O. glaberrima accessions form a clade together with only a subset of O. barthii (Wang et al., 2014). In the present analysis, these O. barthii individuals represent multiple ancestral populations and are classified as OB-C and OB-D. With this new population structure information and with 92 additional O. glaberrima sequences, it is possible to test whether this monophyletic relationship still holds. To this end, a Neighbour Joining (NJ) tree was constructed based on the pairwise genomic distance between all 206 available accessions and annotated with the K = 8 ancestral population structure information from the previous section. The resulting tree demonstrates the phylogenetic clustering of O. glaberrima (OG) with OB-C and OB-D individuals. This means that a large part of the OB-C and OB-D individuals are closely related to O. glaberrima, as previously identified by Wang et al. (2014). Some OB-C accessions branch off towards the base of the tree; these individuals, although they share ancestry with O. glaberrima, are further removed from the domesticated accessions and have all previously been identified as admixed individuals (Wang et al., 2014). OB-A and OB-B form a distinct clade with longer branches, confirming the genetic distance of these populations to O. glaberrima. These leaves were removed from the pruned tree, to zoom in on the relatedness among the accessions with shorter branch lengths (see Figure 13B). The identity of individual leaves can be found on the tree in Appendix I, which has been annotated with accession numbers. To investigate the further separation of O. glaberrima into its genetic sub-populations, the tree was pruned to include only domesticated accessions constituting a monophyletic clade with their closest wild relatives, omitting populations OB-A and OB-B (see Figure 13B).
From this tree, it is evident that almost all the coastal accessions (yellow, dark-blue and red) appear to form a clade that does not include any O. barthii or any of the inland accessions (green, light-blue and pink). Although the tree is unrooted, the clustering of inland samples with O. barthii implies that they share a common pool of genetic variation from which O. glaberrima was domesticated, and that the coastal lineages branched off at a later point in time. The smaller degree of polymorphism in the coastal populations, despite their larger sample size, seems to support this scenario. Assuming that the sampling locations of the present accessions reflect their historical origins, this suggests that the origin of domestication lies east of the 6°W cline, and that O. glaberrima subsequently migrated westward. This is consistent with the TreeMix analysis performed by Meyer et al. (2016), and with the domestication hypothesis proposed by Portères (1962).
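The input to a Neighbour Joining tree is a matrix of pairwise distances between accessions. The sketch below shows one simple way to obtain such a matrix from genotype calls. It assumes that genotypes are coded as the number of alternate alleles (0, 1 or 2), that missing calls are coded as None, and that distance is the mean allelic difference over sites called in both samples. The distance measure actually used for Figure 13 may differ, and the function names and sample names are hypothetical.

```python
def pairwise_distance(g1, g2):
    """Mean allelic difference between two samples, scaled to the range 0..1.

    g1, g2: lists with the number of alternate alleles per site (0, 1 or 2),
    or None where the genotype is missing.
    """
    diffs = [abs(a - b) for a, b in zip(g1, g2)
             if a is not None and b is not None]
    if not diffs:
        raise ValueError("no sites are called in both samples")
    return sum(diffs) / (2 * len(diffs))


def distance_matrix(genotypes):
    """Distances for every ordered pair of samples, keyed by (name, name)."""
    names = sorted(genotypes)
    return {(a, b): pairwise_distance(genotypes[a], genotypes[b])
            for a in names for b in names}


# Toy data: three samples genotyped at four sites.
genotypes = {
    "s1": [0, 0, 2, 2],
    "s2": [0, 0, 2, None],
    "s3": [2, 2, 0, 0],
}
matrix = distance_matrix(genotypes)
print(matrix[("s1", "s2")], matrix[("s1", "s3")])
```

Restricting each comparison to sites called in both samples keeps differences in sequencing coverage from inflating the distances.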



Figure 13: Neighbour Joining (NJ) tree of O. barthii and O. glaberrima whole genome sequences, based on 3,923,601 genome-wide SNPs. A. NJ tree with all O. barthii (OB) and O. glaberrima (OG) accessions. Accessions are labelled by species and coloured according to their genetic cluster (K = 8). The grey bar labelled ‘OB-V and OG’ indicates the smallest monophyletic clade containing all O. glaberrima and its nearest wild relatives. B. Pruned NJ tree with only OB-V and OG accessions, representing the smallest monophyletic clade that contains all O. glaberrima and its nearest wild relatives. Branches are coloured according to genetic cluster (K = 8) and are labelled by country. OB-V accession labels have a grey background. The dashed blue line surrounds the largest clade that contains only OG and no wild relatives.


Domestication genes To assess whether domestication of O. glaberrima could have had multiple origins, NJ trees were constructed for several domestication genes known from recent rice genetics literature (Wang et al., 2017; Li et al., 2017), a list of which can be found in Appendix J. Of the 25 genes, 5 (Prog1, An1, GS3, OsSh1 and Hd1) were not found in a BLAST+ search. Of these genes, it has been noted that OsSh1 and Hd1 are deleted in O. glaberrima (Wang et al., 2014). Only in one case (Badh2) did the BLAST+ search yield a different genomic feature than previously identified by Wang et al. (2014). In this case, the analysis was continued with the BLAST+ result. Genotypes were phased to be able to resolve individual haplotypes. Each gene tree therefore consists of 206 × 2 = 412 terminal nodes. Gene tree statistics and haplotypes are summarised in Appendix K. Because genes may have been subject to different selection pressures during various phases of domestication, they are not all expected to reflect the same evolutionary history. However, a gene may also have experienced different selection pressures in different sub-populations, causing multiple haplotypes to be visible within one gene tree. Individuals with the same genetic history for that gene will share the same haplotype. We labelled the five most common haplotypes per gene to assess the haplotype count and prevalence among O. glaberrima individuals. We then annotated the tree based on population structure, to see which of these O. glaberrima haplotypes predominantly belong to a single subpopulation and whether they cluster with the expected OB-C and OB-D accessions. In most genes, the majority of O. glaberrima accessions are represented by one or two large haplotypes. These haplotypes are usually shared with a subset of OB-C and OB-D accessions. Barring some small deviations, there are generally no major inconsistencies with the genome-wide phylogenetic signal.
However, there are four genes that clearly stand out from this pattern: Sd1, qSh1, OsLG1 and Sh4 (see Figure 14). In these genes, a subset of O. glaberrima individuals from a single subpopulation cluster together in smaller clades that are far removed from the larger O. glaberrima clade. Invariably, one of the segregating haplotypes in these genes is composed of O. glaberrima individuals that are exclusively from the OG-II population (see Table 4). A closer inspection of the neighbouring O. barthii accessions reveals that their closest relatives all belong to the OB-B subpopulation, rather than the expected OB-C and OB-D populations. Two of the genes with this deviating pattern, Sh4 and qSh1, are involved in loss of seed shattering (Konishi et al., 2006; C. Li, Zhou, & Sang, 2006). Sd1 causes semi-dwarfing (Asano et al., 2011) and OsLG1 regulates a closed panicle (Ishii et al., 2013). A re-examination of the other trees subsequently shows that some of these OG-II individuals also cluster with OB-B accessions in other genes. These genes are involved in diverse functions such as grain size (GW2), cold tolerance (COLD1) and grain and pericarp colouring (Phr and Rc). The smaller number of accessions in these clades, however, did not warrant their identification as one of the five largest haplotypes in these genes.
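The haplotype labelling described in this section (counting the phased sequences of 206 accessions, giving 412 haplotypes per gene, and labelling the five most common ones) can be sketched as follows. The function name `top_haplotypes`, the accession names and the toy sequences are hypothetical; the tie rule, which keeps every haplotype whose count equals that of the fifth, is an assumption based on the note in the Figure 14 caption that six haplotypes are labelled when counts are equal.

```python
from collections import Counter


def top_haplotypes(phased, top=5):
    """Return the `top` most common haplotypes as (haplotype, count) pairs.

    phased maps (accession, copy) -> haplotype string. Haplotypes tied with
    the last retained rank are kept as well (assumed tie rule).
    """
    counts = Counter(phased.values()).most_common()
    if len(counts) <= top:
        return counts
    cutoff = counts[top - 1][1]
    return [(hap, n) for hap, n in counts if n >= cutoff]


# Made-up data: 206 hypothetical accessions, two phased copies each
accessions = [f"ACC{i:03d}" for i in range(206)]
toy_haps = ["AAGT", "AAGC", "GAGT", "GTGT", "ATGC", "ATGT", "CCGT"]
phased = {}
for i, acc in enumerate(accessions):
    for copy in (0, 1):
        # Arbitrary assignment, only meant to produce unequal haplotype counts
        phased[(acc, copy)] = toy_haps[min(i // 40 + copy, len(toy_haps) - 1)]

labelled = top_haplotypes(phased)
```

Counting identical sequences in this way only groups haplotypes; whether a group sits inside or outside the main O. glaberrima clade still has to be read from the gene tree.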


Figure 14 panels (gene, chromosome: genomic interval): A. qSh1, 1: 26,833,660 - 26,843,661. B. Sd1, 1: 28,538,241 - 28,548,242. C. OsLG1, 4: 24,451,797 - 24,461,798. D. Sh4, 4: 25,146,703 - 25,156,704.

Figure 14: Separation of an OG-II haplotype from the most prevalent O. glaberrima haplotype in multiple domestication genes. The five (or in case of equal haplotype counts, six) largest haplotypes are labelled. The most common haplotypes, containing accessions from multiple sub-populations of O. glaberrima, are collapsed into orange nodes. Haplotypes that consist exclusively of O. barthii are collapsed into blue nodes. Remaining haplotypes, consisting of a mix of O. barthii and O. glaberrima accessions from a single subpopulation, are expanded with branch colours reflecting the population of origin of the O. glaberrima accessions.


The presence of a separate haplotype consisting of OG-II accessions, which are found exclusively in Guinea, Liberia, Sierra Leone and Ivory Coast, suggests that these haplotypes may be restricted to the proposed secondary domestication centre in the Guinean forest region. To determine which polymorphisms might set these OG-II individuals apart from the majority of O. glaberrima accessions, we compared the OG-II haplotypes with the most prevalent haplotypes observed in the four genes. We then selected the segregating sites and used variant effect prediction software to evaluate the impact of these substitutions. In qSh1, Sd1 and Sh4, there were three segregating sites between the largest haplotype and the OG-II haplotype; in OsLG1 there were only two segregating sites. Screening the effects of these substitutions revealed that all but two were classified as upstream or downstream gene variants. Of the two remaining coding variants, one resulted in a synonymous substitution in qSh1; the other resulted in a gained stop codon in Sh4. A previous study has demonstrated the presence of a C→T substitution that leads to a premature stop codon in the Sh4 gene (Wu et al., 2017). Analysis of the same resequencing data as used in the current study revealed that the derived allele, which causes both smaller grain size and loss of shattering, is present in most domesticated accessions, but absent in O. barthii. Notably, the few O. glaberrima carrying the ancestral allele were all localised to a small geographic area covering the countries Guinea, Liberia and Sierra Leone. This is consistent with our observation that the smaller OG-II haplotype is composed of 11 individuals from precisely the same region (see Table 4). A closer examination of the segregating sites in the two haplotypes of Sh4 reveals a G→A substitution at the reported site of the C→T substitution.
A cross-check of the genetic variants of Sh4 on Ensembl Plants confirms that both substitutions are equivalent, with the first reflecting a substitution in the reference genome (4:g.25152034G>A) and the second reflecting a substitution in the coding sequence (ORGLA04G0254300.1:c.589C>T), which is on the other strand. Polarisation of this SNP using O. meridionalis as an outgroup confirms that the Guanine encoding the intact protein is ancestral, and the Adenine responsible for truncation is derived. The recurrent pattern of gene haplotypes that are restricted to OG-II accessions from the Guinea Highlands suggests that domestication may have followed a different path in this area, leading to the observed genetic differences in various parts of the genome. Although the functional relevance of these genetic differences has been unequivocally demonstrated in one gene (Sh4), both in silico and in vitro, the phenotypic consequences of differentiation in the other genes remain to be determined experimentally. Considering the extensive LD in O. glaberrima and the fact that strong candidates for high impact substitutions could not be identified, it cannot be excluded that associated functional mutations lie outside the intervals included in our phylogenetic analyses and that the gene haplotypes identified here may have hitchhiked on the selection of a different genomic feature altogether.
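The equivalence of the two notations for the Sh4 variant follows from base complementarity: a G>A change on the reference strand reads as C>T for a gene annotated on the opposite strand. The short sketch below illustrates this, together with the arithmetic that places coding position 589 in a codon. The helper names `on_opposite_strand` and `codon_position` are hypothetical, and the sketch is an illustration rather than the Ensembl procedure itself.

```python
# Watson-Crick complements for the four DNA bases
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def on_opposite_strand(ref, alt):
    """Express a single-base substitution as read from the opposite strand."""
    return COMPLEMENT[ref], COMPLEMENT[alt]


def codon_position(cds_pos):
    """Return (codon number, position within codon) for a 1-based CDS coordinate."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


# Genomic G>A (4:g.25152034G>A) as seen from the strand carrying the Sh4 coding sequence
cds_ref, cds_alt = on_opposite_strand("G", "A")   # ("C", "T"), matching c.589C>T

# Coding position 589 is the first base of codon 197
codon, offset = codon_position(589)               # (197, 1)
```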


Interestingly, screening of homologs of additional agronomically important O. sativa genes reveals that the NAC transcription factor OsNAC6 is located less than 15 kb from Sd1. OsNAC6 has been identified as a key regulator of rice stress responses, and has been shown to enhance drought and salinity tolerance (Ohnishi et al., 2005). Phylogenetic analysis shows that the exact same OG-II accessions form a separate haplotype in this gene, as in Sd1 (see Table 4). A recent genome-wide association study of O. glaberrima has provided evidence for geographical divergence of salt tolerance traits in the SW coastal population and suggested two other candidate genes, OsHAK5 and OsHAK6, that were linked to a significant SNP on the far end (position 30698514) of chromosome 1 (Meyer et al., 2016). This gives further credibility to the idea that the segregating haplotypes in these genomic regions may underlie functional differentiation of the coastal OG-II accessions. An overview of the segregating OG-II accessions and in which genes they cluster together with OB-B, is given in Table 4.

Table 4: OG-II accessions possessing a different haplotype than the main O. glaberrima clade in multiple domestication genes. All these accessions have in common that they cluster with OB-B rather than OB-C and OB-D and share a single haplotype with other O. glaberrima individuals for the genes that are marked with an ‘x’.

Accession qSh1 Sd1 OsNAC6 OsLG1 Sh4 GW2 COLD1 Phr1 Rc Country

IRGC103937 x x x x x Liberia

IRGC103946 x x Liberia

IRGC103949 x x x x Liberia

IRGC103953 x x x x x Sierra Leone

IRGC103988 x x x x Sierra Leone

IRGC104035 x x x x x Cote d'Ivoire

IRGC104036 x x x x x Cote d'Ivoire

IRGC104165 x x x x Guinea

IRGC105048 x x x x x Liberia

IRGC105049 x x x x Liberia

TOG6203 x x x x x Guinea

It can be seen from Figure 14 that some individuals from other sub-populations, namely OG-III, OG-IV and OG-V, are also represented by different haplotypes in some genes. Most notably, accessions from the OG-IV subpopulation not only cluster separately in qSh1 and OsLG1, but are also part of smaller haplotypes in Phr1, MOC1, Rc and Ipa1 (see Table 5). Annotated trees of these four additional genes can be found in Appendix K. Although these accessions are exclusively surrounded by O. barthii accessions sharing large fractions of the same ancestral populations, the fact that they frequently form a clade together provides further support for the separate roots of domestication in the coastal and inland regions of West Africa.


Table 5: OG-IV accessions possessing a different haplotype than the main O. glaberrima clade in multiple domestication genes. All these accessions have in common that they cluster with O. barthii accessions of similar ancestry and share a single haplotype with other O. glaberrima individuals for the genes that are marked with an ‘x’.

Accession qSh1 OsLG1 Phr1 MOC1 Rc Ipa1 Country

IRGC103592 x x Cameroon

IRGC103599 x x Cameroon

IRGC103922 x x x x x x Nigeria

IRGC103963 x x x x x x Senegal

IRGC103982 x x Nigeria

IRGC104024 x x x x x x Guinea-Bissau

IRGC104044 x x Chad

IRGC104047 x x Cameroon

IRGC104533 x Nigeria

IRGC104904 x x Nigeria

TOG5467 x Nigeria

African origins of rice in Suriname Knowing the genetic structure and phylogenetic relationships among domesticated and wild rice in Africa, it is possible to draw inferences about the origin of O. glaberrima accessions sampled overseas. A recent report demonstrated the relatedness of TVA6749 to the Guinean forest landraces (Van Andel et al., 2016). The nearest neighbour of this sample was an accession from Ivory Coast. To test the similarity of other African rice grown in Suriname, another accession was sequenced and SNP calling was repeated with the same data set. An NJ tree was constructed based on the pairwise differences between all 112 O. glaberrima accessions of the previous analyses, plus the new sample, resulting in a tree with 113 terminal nodes (see Figure 15). The tree shows that the Surinamese samples (TVA6745 and TVA6749) are sisters to each other, and cluster most closely with two samples from Ivory Coast (IRGC104573 and IRGC104034), one of which was the nearest neighbour of TVA6749 in the study by Van Andel et al. (2016). The Ivory Coast and Suriname samples are surrounded primarily by accessions belonging to the OG-II subpopulation. As previously discussed, these accessions were collected predominantly from the SW corner of West Africa, from the countries of Guinea, Sierra Leone and Liberia. These results demonstrate the likeness of the Surinamese samples to each other and to other landraces from the hypothesised secondary domestication centre of O. glaberrima in West Africa.
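The notion of a nearest neighbour based on pairwise differences can be made concrete with a small sketch. It assumes genotypes coded as alternative-allele counts (0, 1 or 2) with `None` for missing calls, and defines distance as the mean allele difference over sites called in both samples. This coding, the distance definition, the function names and the sample names are all assumptions for illustration; the exact metric used for the trees in this thesis is not restated here.

```python
def pairwise_difference(g1, g2):
    """Mean per-site allele difference (0 to 1) between two diploid genotype vectors.

    Genotypes are alternative-allele counts (0, 1, 2); None marks a missing call.
    Sites missing in either sample are skipped.
    """
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        raise ValueError("no sites called in both samples")
    return sum(abs(a - b) for a, b in shared) / (2 * len(shared))


def nearest_neighbour(query, panel):
    """Return (name, distance) of the panel sample closest to the query genotypes."""
    distances = {name: pairwise_difference(query, g) for name, g in panel.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


# Made-up genotypes for three hypothetical reference samples and one query
toy_panel = {
    "ref_A": [0, 0, 2, 2, 0, None],
    "ref_B": [0, 2, 2, 0, 0, 2],
    "ref_C": [2, 2, 0, 0, 2, 2],
}
toy_query = [0, 0, 2, 2, 0, 2]
closest = nearest_neighbour(toy_query, toy_panel)
```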


Figure 15: NJ tree of 113 accessions of O. glaberrima. Accessions are coloured according to their genetic cluster as identified by ADMIXTURE with K = 8 ancestral populations. The Surinamese accessions are indicated with a golden star.

The African roots of both accessions were subsequently compared by conducting a Thin Plate Spline regression analysis on the genomic divergence between both Surinamese sequences and the West African sequences. The Surinamese samples are so close genetically that the heat maps of their interpolated genetic distances in West Africa were almost identical. A combined map was therefore made reflecting the average pairwise distance of the West African accessions to both Surinamese accessions (see Figure 16A). The average pairwise distances were grouped and averaged per country. In order of increasing distance, the genetic similarity was the closest to samples from Sierra Leone, Guinea and Liberia, respectively (see Figure 16). Combined, these findings confirm the genetic relatedness of the Surinamese O. glaberrima samples to the Guinean forest landraces, and suggest that these geographically widely separated individuals may share a single origin.
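The rescaling of distances used for the heat map and the per-country averaging can be sketched as below. The minimum (0.1424) and maximum (0.2604) distances are the values reported in the Figure 16 caption; the linear form of the rescaling is an assumption about what "normalised on a scale of 0 to 100 and inverted" means, and the function names and the per-accession distances in the toy records are made up for illustration.

```python
from statistics import mean

D_MIN = 0.1424  # minimum distance reported in the Figure 16 caption
D_MAX = 0.2604  # maximum distance reported in the Figure 16 caption


def similarity_score(distance, d_min=D_MIN, d_max=D_MAX):
    """Linearly map a distance to 0-100, inverted so that d_min -> 100 and d_max -> 0."""
    return 100.0 * (d_max - distance) / (d_max - d_min)


def mean_distance_per_country(records):
    """Average the two query distances per accession, then per country.

    records: iterable of (country_code, distance_to_sample_1, distance_to_sample_2).
    Returns (country_code, mean_distance) pairs ordered by increasing distance.
    """
    by_country = {}
    for country, d1, d2 in records:
        by_country.setdefault(country, []).append((d1 + d2) / 2)
    return sorted(((c, mean(v)) for c, v in by_country.items()), key=lambda x: x[1])


# Made-up distances for a few hypothetical accessions (not thesis data)
toy_records = [
    ("SLE", 0.150, 0.152),
    ("SLE", 0.160, 0.158),
    ("GIN", 0.170, 0.172),
    ("MLI", 0.240, 0.238),
]
ranking = mean_distance_per_country(toy_records)
```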


Figure 16: Genomic distance of 103 West African accessions to the Surinamese sister samples. A. Thin plate spline heat map of pairwise genomic distances to the Surinamese samples. The map shows the average pairwise genomic distance of the two Surinamese samples to 106 West African O. glaberrima samples, interpolated across the West African map. Distances were normalised on a scale of 0 to 100 and inverted, with minimum distance (0.1424) reflecting the highest similarity (100) in red and maximum distance (0.2604) reflecting the lowest similarity (0) in blue. B. Box and whisker diagrams of pairwise genomic distances to the Surinamese samples. Box and whisker diagrams are grouped according to country and ordered from lowest to highest median genomic distance. Box widths are proportional to the square root of the number of accessions in each country. C. Legend of West African countries and their corresponding codes.

Code Country
BFA Burkina Faso
CMR Cameroon
TCD Chad
CIV Cote d'Ivoire
GMB Gambia
GHA Ghana
GIN Guinea
GNB Guinea-Bissau
LIB Liberia
MLI Mali
NGA Nigeria
SEN Senegal
SLE Sierra Leone


DISCUSSION

Data preparation This is the first time that the data sets of two large scale genomic studies on African rice (Meyer et al., 2016; Wang et al., 2014) have been combined, allowing for several important new insights into the domestication history of O. glaberrima. The sample sizes of both O. glaberrima and O. barthii populations, in combination with the genomic nature of the data, confer an advantage to the analyses presented here that previous studies did not have. A disadvantage, however, is the lower sequencing coverage of the O. barthii data. The disparity of the O. glaberrima and O. barthii data sets could have called for two parallel variant discovery and filtering processes, with the goal to fine tune the quality of both data sets independently. However, a joint variant discovery protocol was favoured, considering the close similarity of both species and the likelihood that they share variation. The high coverage data of O. glaberrima thus confer more confidence to SNPs found in both species, which might go undetected if the O. barthii data were to be analysed alone and later merged with O. glaberrima. Nonetheless, the low coverage of the O. barthii reads has implications for downstream analyses. Firstly, it is more difficult to detect SNPs that have been sequenced at a lower depth. Hence, variant discovery is less sensitive to variation that is unique to O. barthii, either because it was not sequenced at all, or because it was overlooked by the SNP calling algorithm, or because it was rejected by our hard filters. All this implies that the genetic diversity in O. barthii is most probably underestimated by our results and that the relative diversity as compared to O. glaberrima could be even higher than reported here. Potential artefacts caused by missing data were circumvented as much as possible by adding an additional quality check based on no-call rate, and by applying stringent SNP and individual filters where necessary. 
It was expected that stringent filtering of the whole data set would disproportionately filter out SNPs that are unique to O. barthii, due to lower variant quality. However, the inverse can be seen: applying stringent criteria leads to a stronger reduction in SNPs that are shared with O. glaberrima, than those that are unique to O. barthii, while the proportion of unique and shared SNPs in O. glaberrima stays approximately the same (see Table 1). This could be explained by the fact that the high read depth in O. glaberrima allows for effective detection of extremely rare alleles, which are likely to be caused by sequencing errors and subsequently filtered out. The trade-off between quality and quantity of data warrants careful consideration of the kind of input data required for each analysis. A complete and a reduced call set were therefore consciously applied to overcome some of the shortcomings discussed here, and to provide a set of SNPs that is both of sufficient quality and number to provide a complete, yet accurate view of the relationships between both species.


Demography and adaptation The diversity measures reported in this study are consistent with previous reports (Meyer et al., 2016; Wang et al., 2014) and provide strong evidence for a large reduction in diversity in the genome of O. glaberrima as a result of domestication. This reduction in diversity is most likely caused by a combination of selection, favouring a small number of preferred alleles, and demographic history, causing a large drop in effective population size (Ne). Both the negative values of Tajima’s D and the skewed MAF distribution in favour of rare alleles support the notion that O. glaberrima underwent population expansion following a severe bottleneck. While this demographic scenario can account for the large number of rare derived alleles, it is unlikely to account for the large number of rare ancestral alleles. When derived alleles jump to extremely high frequencies, this is commonly assumed to be the result of genetic hitchhiking linked to an adaptive mutation, rather than genetic drift. U-shaped derived allele frequency spectra therefore provide strong evidence for the effects of positive selection in the genome. An absence of this U-shape in the wild progenitor of a crop would suggest that these genomic regions evolved more or less neutrally in the wild species, but not in the domesticated species. In contrast to the U-shaped spectra published for O. sativa and its wild progenitor, O. rufipogon (Caicedo et al., 2007), the frequency spectra presented here do not only show an excess of high frequency derived alleles in the domesticated species, but also in the wild species. One explanation is that this is a result of misassignment of the ancestral state. Even though our choice of outgroup was designed to minimise polarisation errors due to incomplete lineage sorting and multiple substitutions (see Appendix F), this does not guarantee that all SNPs were polarised correctly. However, the fact that O. glaberrima shows a larger excess of high frequency derived alleles than O. barthii, as evidenced by their empirical cumulative distribution functions, is an indication that it at least underwent stronger positive selection than its wild relative. Despite this evidence, the identification of exact regions in the genome that have been under positive selection is notoriously difficult due to the confounding effects of demographic history, which are known to produce local reductions in genetic diversity that can look remarkably like selective sweeps. Artificial selection and a reduction in Ne go hand in hand, making it difficult to determine whether deviations in the site frequency spectrum have been caused by selective pressure, driving a favoured genotype to fixation, or by genetic drift, removing alleles from the population by chance. However, whereas selection is assumed to produce very localised patterns, demography is assumed to exert effects roughly evenly across the whole genome. Knowledge of the ‘background’ site frequency spectrum therefore enables the detection of local deviation from the genome-wide average. In principle, this would render the selection scan robust to different demographic scenarios.
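To make the link between the site frequency spectrum and Tajima's D explicit, the sketch below computes D from an unfolded spectrum using the standard formula of Tajima (1989). The function name `tajimas_d` and the toy spectra are assumptions for illustration; they are not the thesis data, nor necessarily the software used to obtain the reported values.

```python
from math import sqrt


def tajimas_d(sfs):
    """Tajima's D from an unfolded site frequency spectrum.

    sfs[i - 1] is the number of segregating sites at which the derived allele
    occurs in i of the n sampled sequences (i = 1 .. n - 1).
    """
    n = len(sfs) + 1
    s = sum(sfs)
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    # Mean number of pairwise differences (pi)
    pi = sum(x * 2 * i * (n - i) for i, x in enumerate(sfs, start=1)) / (n * (n - 1))
    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between pi and Watterson's estimator, scaled by its standard deviation
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Made-up spectra for n = 6 sequences
neutral_like = [60, 30, 20, 15, 12]   # counts proportional to 1/i
rare_excess = [100, 0, 0, 0, 0]       # only singletons
d_neutral = tajimas_d(neutral_like)   # close to zero
d_rare = tajimas_d(rare_excess)       # negative
```

An excess of rare alleles lowers the mean pairwise difference relative to the number of segregating sites, which is why a post-bottleneck expansion pushes D below zero.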


This robustness of the selection scan employed here has been demonstrated in simulations that modelled respective increases and decreases in effective population size (Kim & Nielsen, 2004). None of these scenarios, however, simulated a domestication event, which causes a very extreme and rapid decline in Ne. Without additional modelling, therefore, we cannot be sure that the demographic history of African rice did not interfere with the CLR test presented here. The difficulty of pinpointing a causative mutation on a fine scale is further exacerbated by the relatively short time span and recent onset of domestication, which reduces the possibility of recombination after selection, leading to large blocks of LD. The fact that none of the domestication genes map to an outlier in the SFS should therefore be interpreted with caution. Previous studies have reported genes falling within the window of LD decay (300 kb) of an outlier SNP; however, within this window, many other genomic features can be found as well, whose functions are still unknown. Indeed, because we know that O. glaberrima was domesticated independently from O. sativa, there might very well have been other genes at play in the domestication of rice in Africa that we do not yet know of. If we are to take the results of the selection scan seriously, this would imply there is not much parallelism in the genes that are associated with domestication in African and Asian rice. The lack of functional annotation in the O. glaberrima genome means that the homologues of O. sativa domestication genes may provide a good starting point, but until they (or any other gene that shows evidence of selection) have been experimentally associated with a functional trait – like Sh4 – we cannot be sure that they were indeed selected for in O. glaberrima as well. Another shortcoming of the selection scan presented here is that it does not consider alternative forms of selection, such as balancing selection or diversifying selection.
Although we explicitly focused on hard selective sweeps, caused by persistent positive selection on a single de novo mutation, these assumptions are frequently not met in nature. In recent years, more studies have focused on the prevalence and detection of ‘soft’ selective sweeps, which are caused by positive selection on either pre-existing variants or multiple de novo mutations. Both cases lead to a sweep that does not appear to be complete, in the first case because the standing variation under selection is already linked with multiple alleles due to recombination, in the second case because both mutations have the same effect and therefore neither of them becomes fixed (for a visualisation, see Appendix L). Despite these realistic possibilities, we did not explore alternative models of selection in this study for two reasons. Firstly, the existence of several haplotypes due to multiple novel mutations or standing variation renders the detection of soft sweeps exceedingly complex. Consequently, there are fewer tools available that can reliably pick up on these subtle signatures of selection with the same sensitivity as hard sweeps. Secondly, the performance of such tools and their respective Type I and Type II errors have not been quantified to the extent of methods for the detection of hard sweeps (Jensen, 2014).


Population structure Perhaps the most compelling argument that can be made against the hard sweep model in O. glaberrima is the population subdivision. In contrast to previous studies, we found the O. glaberrima populations to be clearly divisible into five sub-populations based on their ancestry fractions. Population structure appears to be particularly strong in the coastal regions of West Africa, and shows a weak correspondence with two geographic clines previously identified by Meyer et al. (2016). It is noteworthy that a recent study of the AfricaRice gene bank collection also revealed exactly five genetic clusters based on a study of 27,560 SNPs across 2179 accessions. These clusters are linked strongly to country of origin, but less so to ecotype (Ndjiondjop et al., 2017). The authors differentiated three groups that were primarily from Liberia, Nigeria and Mali – consistent with the locations from which the OG-II, OG-IV and OG-V populations presented here were predominantly sampled. The fourth group was from Nigeria and Togo and the fifth group (the smallest) was geographically more widespread. The geographic component of population structure can prevent instances of positive selection from sweeping through the entire population in multiple ways. Firstly, isolation by distance delays the migration of a beneficial allele, thus diminishing the effect of genetic hitchhiking that is observed in a hard sweep (Pfaffelhuber, Lehnert, Stephan, & Parsch, 2008). In addition, population structure can cause parallel adaptation to a global selection pressure in geographically separated demes (Ralph & Coop, 2010), resulting in a soft sweep rather than a hard sweep (see Appendix L). Geographically separated populations may also undergo local adaptation due to geographically localised selection. This has been observed in the case of salt tolerance in the coastal populations of O. glaberrima (Meyer et al., 2016).
Hence, it is not unlikely that population substructure further complicates the detection of sites that are universally under positive selection throughout the entire species. Interestingly, all the accessions that originate from the proposed primary domestication centre of African rice, around the Inner Niger Delta, belong to a single genetic cluster (OG-V). This cluster is also found in Guinea, where it coincides with one of the coastal populations (OG-III) and where the species splits into two other populations: OG-I to the north and OG-II to the south. The geographic ranges of these populations roughly correspond with the proposed secondary domestication centres in what used to be Senegambia and in the Guinea Highlands, respectively. In contrast, OG-IV is restricted to the inland areas and located primarily south and east of OG-V. The fact that OG-IV and OG-V appear to be the most genetically diverse and least genetically differentiated from O. barthii, while the reverse is seen in the coastal populations, seems to suggest that the population bottleneck occurred in an east to west direction, followed by differentiation between the north and south in the coastal region. We therefore see strong evidence of Portères’ domestication theory in our population structure analysis.


Taxonomic implications The genome-wide phylogeny is consistent with the observed population structure results and with the tree previously published by Wang et al. (2014). In both cases, O. glaberrima clusters predominantly within a single O. barthii population, albeit in a paraphyletic manner. Yet, a single origin of O. glaberrima would presuppose that all domesticated individuals would share a common ancestor without any O. barthii descendants. The fact that O. glaberrima does not form a monophyletic clade therefore calls into question the assumption that it speciated through a discrete domestication event. A possible explanation could be the rewilding of ancient O. glaberrima landraces, which have ‘gone feral’ and are now classified as the wild species. Alternatively, one might be tempted to conclude that the paraphyletic nature of this group disproves its taxonomic status as a separate species, and that O. glaberrima and O. barthii are genetically indistinguishable. Indeed, O. glaberrima and O. barthii diverged so recently that hybridisation is still considered possible – if difficult – and admixture between O. barthii and O. glaberrima should not be excluded even now. In fact, ‘weedy’ rice, which is a genetic mix between the wild and cultivated species, can result from interspecific crosses and has been observed in the case of O. barthii along the Niger River Delta (Orjuela et al., 2014). However, since both O. glaberrima and O. barthii are primarily inbreeding, ongoing genetic exchanges between the two genomes seem unlikely. A more plausible explanation for the observed phylogenetic patterns might be that the time that has passed since domestication (roughly 3000 years ago) has not been sufficient to establish complete lineage sorting. This could mean that O. glaberrima still contains a part of the ancestral variation that is observed in O. barthii. For this reason, gene trees may not correspond to the overall genomic tree.
Indeed, it is a widely observed phenomenon that incomplete lineage sorting causes mixed phylogenetic signals (Degnan & Rosenberg, 2009). The presence of a distinct haplotype that is restricted to OG-II and OB-B individuals in several hypothesised domestication genes therefore raises the question whether this clade is a remnant of neutral standing variation that has been lost from other O. glaberrima accessions, or whether it arose through novel substitutions. The ancestral character of a functionally important SNP in one of these haplotypes has recently been confirmed for Sh4 and is proposed to “support the deep and separate roots of domestication practices in the west versus the eastern cultivation range” (Wu et al., 2017). This may point either to early domestication in the southwest and persistence of ancestral haplotypes, or to the existence of other loci that may be involved in loss of shattering in these accessions. The first explanation would cast doubt on the assumption that African rice was domesticated around the Inner Niger Delta and suggest the Guinean forest as the potential primary domestication centre. The second explanation is more consistent with parallel adaptation, pointing to the independent selection of multiple variants affecting the same trait in different sub-populations of O. glaberrima.


Considering the population structure and phylogenetic relationships obtained from genome-wide data, the second scenario seems more plausible. Further molecular and phenotypic analyses are needed to elucidate the functional implications of the segregating haplotypes observed in other affected genes and give more insight into the patterns of local adaptation that are restricted to the Guinean forest and other regions.

Domestication origin The results discussed so far thus shed new light on an old controversy concerning the process of plant domestication in Africa, in which two models have traditionally been competing: the rapid transition model proposed by Portères (1962) and the non-centric protracted transition model proposed by Harlan et al. (1976). Whereas the work by Portères (1962) promotes the idea of a primary domestication centre followed by secondary centres where later improvement occurred, Harlan’s model implies diffuse domestication over a long period of time, with multiple centres or no centres at all (Harlan et al., 1976). The studies that have employed genomic data have supported both sides of this controversy: Wang et al. (2014) found evidence for centric domestication, thereby supporting Portères’ hypothesis, whereas Meyer et al. (2016) found evidence for the protracted transition model advocated by Harlan et al. (1976). The evidence by Wang et al. (2014) is primarily based on the observation that all O. glaberrima seems to cluster with a single O. barthii population. However, rather than claiming the Inner Niger Delta to be the primary site of domestication, they found that the centre of origin was most likely located in Senegal, Gambia, Guinea and Sierra Leone – countries that Portères has designated as secondary domestication centres. In contrast, Meyer et al. (2016) seem to support the Inner Niger Delta as a plausible site of origin, but maintain that migration to the west and a subsequent split between the arid (northern) and tropical (southern) regions separated the Senegambian and the Guinean domestication centres. While they claim to support the protracted transition model, based on proof that the effective population size of O. glaberrima was steadily declining for a period of about 10,000 years before it is commonly assumed to have been domesticated, they thus clearly provide evidence for the primary and secondary domestication centres proposed by Portères.
This study has aimed to resolve some of the confusion introduced in this debate. First of all, Wang et al. (2014) and Meyer et al. (2016) address different aspects of the controversy: Wang et al. (2014) merely propose that domestication was centric, but make no claims regarding the rapidity, and Meyer et al. (2016) do the opposite; they propose that domestication was protracted, but not necessarily non-centric. Although the present study does not deal with the time scale of domestication, a few remarks about the potential centre(s) of origin can be made.


Based on the sampling origins of O. glaberrima, the results are strongly suggestive of an early domestication event in the eastern cultivation range, with subsequent genetic differentiation towards the west. The geographic origin of the closest wild relatives of O. glaberrima, however, corresponds mostly with the Senegambian and Guinean forest regions. If we assume that these relatives are genuinely O. barthii and not ‘rewilded’ or intermediate forms, there are two plausible scenarios. In the first, African rice was first domesticated along the coast and subsequently migrated east. In that case, the coastal populations must have undergone substantial differentiation in order to explain their larger genetic distance from O. barthii. Alternatively, the geographic separation of the inland populations and their wild relatives can be explained by domestication in the eastern cultivation range and a subsequent range shift of the wild progenitor from the east to the west. To ascertain which of the two scenarios (eastern centre of origin and westward migration, or western centre of origin and eastward migration) eventually holds up, more knowledge is needed about the extent to which O. glaberrima and O. barthii migrated in the past and whether the present sampling locations truly reflect historical populations. In addition, while the sampling locations of O. glaberrima accessions were quite precise, for most O. barthii only the country was given and detailed coordinates were not available. This not only prevents accurate knowledge of the collection sites of these accessions, but also precludes the detection of isolation by distance. More importantly, however, even sampling across both species is needed to enable an accurate assessment of spatial genetic structure. Currently, most of the O. glaberrima accessions were sampled along the coast. Samples from the eastern cultivation range are therefore relatively scarce. In contrast, most O. barthii accessions were sampled from Mali, Chad, Cameroon and Nigeria (see Appendix B) – countries where only a few O. glaberrima were sampled. Thus, in countries where many O. glaberrima were sampled (including the countries where Wang et al. (2014) located the domestication centre), only a minority of O. barthii were collected. This calls into question the reliability of their results. Since drawing inferences about origin and migration in any given location (east or west, north or south) based on only a handful of accessions of either species is likely to result in suboptimal estimates, additional sampling with accurate geo-referencing is expected to drastically improve future estimates of domestication origin. Despite these limitations, the marked population structure observed in O. glaberrima points to the possibility of parallel domestication or local adaptation following domestication. The presence of strongly differentiated OG-II haplotypes in multiple domestication genes might be an artefact caused by incomplete lineage sorting, but it might also mean that the segregating landraces acquired domestication traits independently of the majority of O. glaberrima. Regardless of whether this happened at the onset of domestication or during a secondary wave, the phylogenetic clustering of these haplotypes provides evidence for a separate genetic origin that may have been caused by introgression or direct domestication from this otherwise seemingly unrelated O. barthii population.


Transatlantic migration

In addition to West Africa, further sampling is needed in Latin America in order to resolve questions about the origin of O. glaberrima in the New World. Aside from a handful of germplasm collections (see Appendix A), no systematic sampling efforts in the countries of Central and South America have been reported. The most extensive evidence of African rice across the Atlantic comes from Portères himself. He described the morphologies of O. glaberrima in several Latin American countries, and observed a self-sowing variety in El Salvador, in addition to cultivated varieties in French Guiana, El Salvador and Panama (Agnoun, Biaou, Sié, Vodouhè, & Ahanchédé, 2012). Despite these reported observations, few botanical collections survive and only three accessions from the Latin American continent have been sequenced to date. Two of these are the Surinamese samples described in this study; the other was collected from Guyana. The Guyanese sample, however, is one of two accessions that were collected from breeding stations (in Guyana and Zimbabwe, respectively). Hence, their true geographic origin cannot be stated with certainty. Although it seems plausible that the transatlantic migration during the slave trade resulted in an additional genetic bottleneck in Latin American O. glaberrima, this is difficult to verify without additional sequencing data. However, the close genetic similarity between the two Surinamese accessions and the Guinean forest population is consistent with the notion of a small founder population that was pre-adapted to the tropical climate of the Surinamese rainforests (Brown, 2016). This is further supported by morphological comparison with the only other known collection of cultivated O. glaberrima from a Maroon field in the past: a botanical specimen which was collected from French Guiana and identified by Vaillant in 1938 (Vaillant, 1948).
Portères examined this specimen in 1955 and found it to be most similar to varieties found in the Guinea Highlands (Portères, 1955) – precisely within the geographical range presently demonstrated to be genetically most similar to the Surinamese collections. Despite these similarities, the selection pressures in Suriname probably also differed to some extent from those in West Africa, both due to local differences in climate and human uses. Escaped slaves were constantly on the move, so it is likely that they selected landraces with a propensity for early flowering. In addition, the present-day uses of African rice by the Maroon communities are mostly ritual, suggesting that nutritional value and taste may have become relatively less important traits. Analysis of these differential selective pressures will require more sampling in Suriname, as the current collection is insufficient to draw inferences about the population at large. Resequencing data of additional specimens collected in Suriname, French Guiana and other Latin American countries will provide an opportunity to explore how African rice has evolved after its introduction in the New World and diverged from its African relatives.


CONCLUSION

In light of the findings presented, we can now with some confidence assert that the centric hypothesis of African rice domestication is incorrect – or at least has some serious shortcomings. The diversity analyses unequivocally demonstrate that O. glaberrima underwent an extreme bottleneck. To the best of our knowledge, this bottleneck was most probably associated with domestication, thus supporting the rapid transition model. Although there are some indications of positive selection in terms of an excess of high frequency derived alleles, conclusive evidence of hard selective sweeps – especially in relation to known domestication traits – has been elusive. Whether this stems from methodological issues or from the population structure observed in O. glaberrima can only be demonstrated with improved knowledge of the demographic history of O. glaberrima and additional simulations. Although it has been shown that cultivated rice is much less genetically diverse than wild rice, the ADMIXTURE analysis of O. glaberrima shows that it is anything but homogeneous, even compared to O. barthii. The subdivision of the species into coastal and inland populations is suggestive of geographic structure, as is the differentiation along a north-south gradient on the coast. Contrary to expectation, isolation by distance was observed in three out of five genetic sub-populations. Due to data limitations, isolation by distance could not be assessed in O. barthii. Phylogenetic results confirm the clustering observed in previous studies, but shed new light on the relationships between sub-populations of the wild and domesticated species. O. glaberrima indeed forms a cluster with a subset of O. barthii individuals; however, the majority of the coastal accessions form a monophyletic clade that does not contain any wild relatives. This pattern breaks down when considering phylogenies at the level of individual genes; there we see that some landraces are far removed from the majority of O. glaberrima and cluster with a different O. barthii sub-population instead. These separate haplotypes demonstrate the divergent evolutionary trajectories among some sub-populations, most notably the accessions from the Guinean forest region. The special status of the Guinean forest landraces is reflected in their migration and survival across the Atlantic Ocean, which was first documented by Vaillant (1948) and later by Portères (1962). Considering the constant supply of people and food crops from Africa to the Americas during the slave trade, the two Surinamese accessions were not necessarily expected to be closely related. Yet, phylogeographic analyses demonstrated the genetic proximity of the two Surinamese samples to each other and to their closest relatives from Ivory Coast, Sierra Leone, Guinea and Liberia. Although not enough accessions from Suriname have been collected to unequivocally demonstrate a genetic bottleneck as a result of migration, the fact that the two accessions are sister taxa, but have been collected from different Maroon communities, points to the possibility that O. glaberrima migrated from the ‘rice coast’ in West Africa to the coastal plantations of Latin America through a single route.


Whereas this study provides compelling evidence for the origin of African rice in the eastern cultivation range and its diversification along the Atlantic coast of West Africa, the overarching hypothesis that O. glaberrima was domesticated in a single and discrete event has to be rejected. The observed population structure is partially consistent with Portères’ proposed primary and secondary domestication centres. However, evidence of persisting ancestral variation and multiple gene haplotypes among different sub-populations of O. glaberrima suggests that important functional traits may have arisen out of parallel evolution or local adaptation, rather than a single selective sweep. This is partly corroborated by the effect of geographic distance on genetic relatedness and partly by experimental evidence confirming the phenotypic consequences of spatially restricted genetic variation (Meyer et al., 2016; Wu et al., 2017). Hence, it can be concluded that the centric, rapid transition model of domestication does not tell the whole story of the evolution of O. glaberrima. In contrast, the protracted transition model with multiple domestication centres or a polycentric view might offer valuable alternative perspectives on the observed geographic distribution of genetic variation found in African rice.

FUTURE DIRECTIONS

There are several ways to improve and add to our understanding of the results presented here. Firstly, a systematic comparison of the genetic variation in O. barthii and O. glaberrima will benefit from sequencing data of equal quality, which will minimise the species bias in artefacts caused by missing data. In addition, future selection scans, whether looking for hard or soft sweeps, will have to explicitly account for the extreme genetic bottleneck observed in O. glaberrima by modelling various demographic scenarios; this will increase our confidence in candidate regions. Furthermore, an assessment of alternative modes of selection – including differential selection in subpopulations, selection on multiple different alleles, balancing selection and diversifying selection – may yield insights that are currently overlooked by hard sweep methods, and is necessary to investigate the possibility of parallel or local adaptation. Future research into the origins of African rice should also investigate the possibility that the closest wild relatives of O. glaberrima are in fact hybrids or rewilded ancient landraces. The presence of such weedy varieties might be able to account for the shorter branch lengths of these wild relatives, their paraphyletic relationship with O. glaberrima and the discrepancy between their collection sites and those of their immediate cultivated siblings. A closer examination of the genetic diversity and signatures of selection of the hypothetical ancestor population in comparison to other O. barthii and O. glaberrima accessions might elucidate whether they are more similar to the domesticated or to the wild species. These genetic analyses will have to be balanced with suitable morphological evidence.


Uncertainty in species delimitation could be further examined through phylogenetic networks and introgression analyses. This will aid our understanding of the precise evolutionary relationships between O. glaberrima and O. barthii. In addition to reassessing the taxonomic status of some accessions, improved sampling in combination with accurate geo-referencing of all collection sites will benefit the fine-scale mapping of spatial genetic structure, particularly of O. barthii. Future analyses will primarily require more sampling of O. glaberrima east of the Inner Niger Delta and more O. barthii west of the Inner Niger Delta, where collections of these species are presently scarce. More even sampling will allow a more thorough analysis of the population structure observed across the geographical ranges of O. glaberrima and O. barthii; this will aid the identification of plausible centre(s) of origin and flows of migration. Larger sample sizes will also increase the sensitivity of genome-wide association studies, which enable the identification of SNPs that are associated with traits of interest. While genome-wide data have already been used to explore the mutations associated with drought tolerance (Meyer et al., 2016), these data have not yet been mined for other signs of ecological adaptation. Genome-wide association studies thus provide an opportunity to identify additional causative mutations underlying adaptive evolution. Loci that are suspected to be under selection will subsequently have to be validated experimentally through functional assays in vitro and in vivo. Functional analyses will help to improve the annotation of the O. glaberrima genome, which is still lacking in many ways in comparison to the Asian rice genome. Improved functional annotation will assist in predicting the phenotypic consequences of gene haplotypes and in linking phylogenetic patterns to the evolution of functionally significant traits.
The complementation of genetic and genome-wide studies with experimental data will be indispensable in the future – not just for our understanding of the broad patterns of evolution and domestication of African rice, but also to create insight into the emergence of local adaptive traits connected with its diversification in different geographic contexts.


METHODS

Data preparation

Collection and next-generation sequencing

This study used publicly available whole genome data of 111 O. glaberrima and 94 O. barthii accessions (Meyer et al., 2016; Wang et al., 2014). All the O. glaberrima accessions were sequenced at an average read depth of ~15X. Most of the O. barthii accessions were sequenced at an average read depth of ~5X; a subset of 20 O. barthii accessions, sampled from various populations, was sequenced at a higher coverage of ~20X. Publicly available whole genome resequencing reads were downloaded from the Sequence Read Archive (SRA). Two additional O. glaberrima accessions were sampled in Suriname. Seeds of these accessions were collected at the Vreedzaam market at Waterside street, Paramaribo in 2009 (Van Andel et al., 2016) and from a gardener’s field in Tjon Tjon, Sipaliwini in 2016 (unpublished), respectively. DNA was extracted from the leaf material of successfully germinated seeds and amplified following the protocol in Appendix M. DNA libraries were prepared using the Nextera DNA Library Preparation kit. The resulting inserts measured 300-600 bp and 700 bp, and were sequenced on the Illumina HiSeq 2500 platform at a median depth of 8.3X and ~15X, respectively. The average read length of the accessions was ~100 bp with a per base quality ranging from 28 to 40. A total of 206 accessions were used for population genetic analyses. A list of all used accessions and the metadata provided by Wang et al. (2014) and Meyer et al. (2016) can be found in Appendix B.

Whole genome alignment and variant discovery

All variants were called relative to the O. glaberrima (AGI1.1) reference genome and subsequently filtered to remove false positives. The O. glaberrima reference genome (Wang et al., 2014) was retrieved from Ensembl Genomes (release 33). Variant discovery was performed following the Genome Analysis Toolkit (GATK) Best Practices (DePristo et al., 2011). Untrimmed reads were mapped to the reference genome using the BWA-MEM algorithm (Li & Durbin, 2010) of the Burrows-Wheeler Aligner (v0.7.13). Duplicate reads were flagged with Picard (v1.129) MarkDuplicates (Broad Institute, n.d.). Local realignment was performed around indels with GATK (v3.6.0) RealignerTargetCreator and IndelRealigner (McKenna et al., 2010). The resulting BAM files were indexed and validated with Picard (v1.129). Individual genotypes were called using GATK (v3.6.0) HaplotypeCaller on reads with a minimum mapping quality score of 30. GVCFs were combined into a single VCF with GATK (v3.6.0) GenotypeGVCFs. Only biallelic SNPs were retained for analysis.

Quality control

Since filtering at the level of individuals would cause a significant reduction in the sample size of O. barthii, we only filtered at the level of variants. The recommended approach to variant filtering is quality score recalibration, which uses a machine learning approach to identify true positives by comparing their combined annotation profiles to a reference set of known variants. Because O. glaberrima is a non-model organism, such a reference set is not (yet) available. Hard filters were therefore applied to the raw SNPs by removing SNPs falling outside the quality thresholds of several common annotations: DP, QD, MQ, MQRankSum, ReadPosRankSum and FS. In addition, we used the no-call rate divided by the number of samples as a measure of missing data. We did not filter for heterozygous sites, because both species are primarily inbreeding and therefore exhibit low levels of heterozygosity. Filter thresholds were determined based on their effect on the Transition:Transversion ratio (Ts:Tv). Although the true Ts:Tv ratio is unknown and varies along the genome, it is known that functional constraints generally favour transitions over transversions (Li, Wu, & Luo, 1985), leading to ‘transition bias’. SNPs with a higher Ts:Tv are thus likely to be enriched for true SNPs, while SNPs with a lower Ts:Tv will contain more false positives. SNPs were binned along the range of a given annotation. For each interval, SNP count and Ts:Tv were calculated and plotted in R (v3.3.2) (R Core Team, 2013). Intervals were removed in order to maximise Ts:Tv while retaining a reasonable number of SNPs. Based on previous studies, call sets between 2 and 4 million SNPs were deemed reasonable. Filtering criteria with different levels of strictness were applied (see Appendix C). This resulted in two versions of the same call set. Call set 1a contains a full set of SNPs considered to adhere to a minimum standard of quality. Call set 1b contains fewer SNPs that adhere to a higher standard of quality. Both call sets were used; a choice between the two was made depending on the amount and quality of SNPs needed for each analysis.
Relative diversity estimates, selection scans and detection of population structure require fewer SNPs; for those analyses, the reduced call set was used. In order to obtain pairwise genomic distances and differentiate between gene haplotypes, a higher density of SNPs was desired; for these analyses, the complete call set was used, so as to maximise the number of segregating sites. The effect of filtering on Ts:Tv ratio was quantified with VCFtools (v0.1.14) (Danecek et al., 2011). Filter classes and their thresholds can be found in Appendix C.
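The binning step described above can be sketched as follows. This is a minimal illustration only, not the R code used in this study: the input layout (tuples of reference allele, alternative allele and annotation value), the function names and the example values are all assumptions made for the sake of the example.

```python
from collections import defaultdict

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def is_transition(ref, alt):
    """Return True when the ref -> alt substitution is a transition."""
    return (ref.upper(), alt.upper()) in TRANSITIONS


def tstv_per_bin(snps, bin_width):
    """Bin SNPs along one annotation and return {bin_start: (snp_count, ts_tv)}.

    snps: iterable of (ref, alt, annotation_value) tuples (hypothetical layout).
    ts_tv is None for bins that contain no transversions.
    """
    counts = defaultdict(lambda: [0, 0])  # bin_start -> [transitions, transversions]
    for ref, alt, value in snps:
        bin_start = (value // bin_width) * bin_width
        counts[bin_start][0 if is_transition(ref, alt) else 1] += 1
    result = {}
    for bin_start, (ts, tv) in sorted(counts.items()):
        result[bin_start] = (ts + tv, ts / tv if tv else None)
    return result


# Made-up example: four SNPs annotated with a mapping-quality-like value.
example = [("A", "G", 31.0), ("C", "T", 35.5), ("A", "C", 38.0), ("G", "T", 12.0)]
print(tstv_per_bin(example, bin_width=10))
```

Bins with a low Ts:Tv would then be candidates for removal, subject to the number of SNPs that remain.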

Summary statistics

Call set comparison

Mean depth of coverage, fraction of missing data and mean variant quality per SNP were calculated in 100 kb sliding windows along the entire genome in order to assess the distribution and quality of SNPs. These statistics were computed using VCFtools (v0.1.14) for both call sets. Large problematic regions were not detected. In order to make an informed decision as to which version of the call set to use and whether or not to adjust the filtering parameters, additional statistics were calculated. Following the method of Li, Li, Jia, Caicedo, and Olsen (2017), call sets 1a and 1b were compared with respect to both their patterns of nucleotide diversity (π) and genetic differentiation (FST). The fixation index (FST) between O. glaberrima and O. barthii was calculated as per Equation (1), based on the implementation of Weir and Cockerham (1984). Relative nucleotide diversity was calculated according to Equation (2), as the ratio of π in O. glaberrima to π in O. barthii.

Equation (1): $F_{ST} = \frac{\sigma_S^2}{\sigma_T^2} = \frac{\sigma_S^2}{p(1-p)}$, where $p$ is the allele frequency in the total population, $\sigma_T^2$ is the variance in allele frequency in the total population, and $\sigma_S^2$ is the variance in allele frequency between the two sub-populations.

Equation (2): $\pi = \sum_{ij} x_i x_j \pi_{ij} = 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} x_i x_j \pi_{ij}$, where $\pi_{ij}$ is the number of differences per site between sequences $i$ and $j$, $x_i$ is the frequency of sequence $i$, $x_j$ is the frequency of sequence $j$ and $n$ is the total number of sequences in the data set.

The results were deemed sufficiently comparable to proceed with both call sets (see Appendix D). In addition, site depth, call rate and mean heterozygosity per individual were calculated for all accessions using VCFtools (v0.1.14). The results were visualised in bar plots with R (v3.3.2) and can also be found in Appendix D.
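As a worked illustration of Equations (1) and (2), the sketch below evaluates both expressions directly. It is not the Weir and Cockerham (1984) estimator as implemented in VCFtools, which additionally corrects for sample size; it assumes two equally weighted sub-populations, and the function names and inputs are hypothetical.

```python
def nucleotide_diversity(freqs, diffs):
    """pi = 2 * sum over i > j of x_i * x_j * pi_ij (Equation 2).

    freqs: frequency x_i of each sequence (expected to sum to 1).
    diffs: square matrix; diffs[i][j] is the number of differences per site
           between sequences i and j.
    """
    n = len(freqs)
    total = 0.0
    for i in range(1, n):
        for j in range(i):
            total += freqs[i] * freqs[j] * diffs[i][j]
    return 2.0 * total


def fst_variance_ratio(p1, p2):
    """F_ST = sigma_S^2 / (p * (1 - p)) (Equation 1) for a single biallelic site.

    Assumes two equally weighted sub-populations with allele frequencies p1, p2.
    Returns NaN when the site is monomorphic in the total population.
    """
    p = (p1 + p2) / 2.0
    var_between = ((p1 - p) ** 2 + (p2 - p) ** 2) / 2.0
    var_total = p * (1.0 - p)
    return var_between / var_total if var_total > 0 else float("nan")


# Two sequences at equal frequency differing at 1% of sites.
print(nucleotide_diversity([0.5, 0.5], [[0.0, 0.01], [0.01, 0.0]]))
# Allele fixed in one sub-population and absent from the other.
print(fst_variance_ratio(1.0, 0.0))
```

Relative nucleotide diversity then follows as the ratio of the two per-species values of π.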

Genetic diversity

In order to estimate the genetic diversity in O. barthii and O. glaberrima separately, variants were split into two populations based on species identification. Monomorphic sites were removed. A total of 2,580,362 and 1,419,601 SNPs were used to calculate SNP density, π, and Tajima’s D in O. barthii and O. glaberrima, respectively, where SNP density and Tajima’s D are defined according to Equation (3) and Equation (4).

Equation (3): $\text{SNP density} = \frac{S}{\text{window size}}$, where $S$ is the number of segregating sites.

Equation (4): $\text{Tajima's } D = \frac{d}{\sqrt{V(d)}}$, where $d$ is the difference between two estimators of $\theta$ (the scaled mutation rate), namely the average number of differences between two sequences ($\pi$) as per Equation (2) and the expected number of segregating sites between two sequences under neutral theory according to Watterson’s estimator $\theta_{\omega}$ ($M = S / a_1$). Here, $S$ is the total number of segregating sites in the population, $a_1 = \sum_{i=1}^{n-1} \frac{1}{i}$, and $i$ is the $i^{th}$ sequence in a total of $n$ sequences (Tajima, 1989).

When population size is constant and there is no selection on the genome (so-called neutral conditions), the two estimators should equal each other and Tajima’s $D$ equals 0. These statistics were computed in 100 kb regions with VCFtools (v0.1.14). Genome-wide sliding windows were plotted in R (v3.3.2) and can be found in Appendix E. Genome-wide average statistics were compared using the Kruskal-Wallis test. Minor allele frequencies were calculated using VCFtools (v0.1.14) in combination with a custom R script and plotted in R (v3.3.2).
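To make Equation (4) concrete, the sketch below computes Tajima’s D for a single window using the variance constants given by Tajima (1989). It is an illustration rather than the VCFtools implementation used here; the function name and arguments are assumptions, and π is taken as the mean number of pairwise differences in the window (not per site).

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D for n sequences, S segregating sites and mean pairwise difference pi.

    Returns NaN when D is undefined (no segregating sites or zero variance).
    """
    if n < 2 or seg_sites == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # d is the difference between pi and Watterson's estimator S / a1.
    d = mean_pairwise_diff - seg_sites / a1
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return float("nan")
    return d / math.sqrt(variance)


# Hypothetical window: 10 sequences, 16 segregating sites, pi = 3.89.
print(tajimas_d(10, 16, 3.89))
```

A negative value indicates an excess of rare variants relative to the neutral expectation, a positive value an excess of intermediate-frequency variants.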


Signatures of selection

Derived allele frequency spectrum

Ancestral and derived alleles were distinguished using a suitable Oryza species from the AA clade as an outgroup (see Appendix F). SNPs were polarised with reference to the O. meridionalis (v1.3) genome (Jacquemin et al., 2013) with a custom R script. The O. meridionalis x O. glaberrima multiple alignment was retrieved from Ensembl Genomes (release 33) and parsed with mafTools (Earl, Paten, & Diekhans, 2014). For each biallelic SNP, the corresponding position and five flanking bases were extracted from the alignment using a custom Perl script. Positions that did not map to the outgroup, positions with gaps within 5 bp of the SNP, and SNPs that mapped to multiple regions of the O. meridionalis genome were discarded. 1,591,134 out of 3,923,601 SNPs were retained, constituting a data loss of about 59%. Synonymous and non-coding SNPs were extracted using SnpSift (v4.0) (Cingolani et al., 2012). For these SNPs, the derived allele frequency spectra and cumulative densities were plotted using R (v3.3.2). Because of the low genomic divergence (<5%), homoplasy was considered unlikely and correction for multiple substitutions was not applied. The expected site frequency spectrum under a neutral model of evolution was calculated using the estimation of the population scaled mutation rate in Equation (4) (Tajima, 1989). Deviation from neutrality of the observed site frequency spectra of the two populations was compared using a two-sample Kolmogorov-Smirnov test.
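The neutral expectation referred to above can be illustrated with a short sketch. It assumes the standard neutral result that the expected number of sites carrying a derived allele in i out of n sequences equals θ/i, with θ estimated by Watterson’s estimator from Equation (4); whether the original R script followed exactly this form is an assumption, and the function name is hypothetical.

```python
def expected_neutral_sfs(n, seg_sites):
    """Expected unfolded site frequency spectrum under neutrality.

    Returns a list whose entry i - 1 is the expected number of sites with the
    derived allele present in i of n sequences, for i = 1 .. n - 1.
    """
    a1 = sum(1.0 / i for i in range(1, n))
    theta_w = seg_sites / a1  # Watterson's estimator, S / a1
    return [theta_w / i for i in range(1, n)]


# Hypothetical example: 10 sequences and 100 segregating sites.
spectrum = expected_neutral_sfs(10, 100)
print([round(x, 2) for x in spectrum])
```

By construction the expected counts sum to the observed number of segregating sites, so the observed and expected spectra can be compared on the same scale.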

Selection scans

Several composite likelihood ratio (CLR) tests that are widely used for detecting ‘hard’ sweeps are available as open source software, including OmegaPlus (Kim & Nielsen, 2004), SweeD (Pavlidis, Zivkovic, Stamatakis, & Alachiotis, 2013) and SweepFinder (Nielsen et al., 2005). These CLR methods are superior to more common neutrality tests such as Tajima’s D, because they measure deviations of the site frequency spectrum (SFS) against the genomic ‘background’ SFS. Since the background has been partly shaped by demographic history, each of them controls to some extent for the confounding effect of past fluctuations in population size. In comparative analyses of a number of these CLR methods using simulated data, SweeD and OmegaPlus were shown to outperform other tests (Crisci et al., 2013; Pavlidis & Alachiotis, 2017). While SweeD is capable of taking into account the polarisation of alleles in the so-called ‘unfolded’ SFS, this has the disadvantage that limiting the analyses to only unfolded SNPs causes a significant loss of data. OmegaPlus has no feature to distinguish between ancestral and derived alleles, but has the added advantage of explicitly taking into account patterns of Linkage Disequilibrium (LD). Extensive LD is observed in the O. glaberrima genome, with r² reaching half its maximum value at a distance of 175 kb and approaching baseline at 300 kb (Meyer et al., 2016). For these reasons, OmegaPlus was chosen as the preferred method.


Selective sweeps were thus detected with the composite likelihood ratio (CLR) method developed by Kim and Nielsen (2004), using the ω-statistic as implemented in OmegaPlus (v2.0.0). The pooled VCF was split into chromosomes using GATK (v3.7.0) SelectVariants and converted to FASTA with AlternateReferenceMaker. FASTA sequence headers were modified using a custom Perl script and combined to create the multiple alignments required as input for the sweep analysis. The OmegaPlus CLR test statistic (ω) measures the likelihood that a site is under selection by detecting whether there is an excess of LD within specified distances from that position. Due to the strong linkage among variants, the minimum and maximum windows for LD detection were set to 25 kb and 175 kb, respectively. The CLR was subsequently calculated for 11,382 evenly spaced genomic positions in O. glaberrima and O. barthii, amounting to roughly one position every 25 kb, according to Equation (5).

Equation (5): $\omega = \frac{\left(\binom{l}{2} + \binom{W-l}{2}\right)^{-1} \left(\sum_{i,j \in L} r_{ij}^2 + \sum_{i,j \in R} r_{ij}^2\right)}{\left(l(W-l)\right)^{-1} \sum_{i \in L,\, j \in R} r_{ij}^2}$, where $W$ is the number of segregating sites, divided into two groups: one from the first to the $l$th polymorphic site on the left and the other from the $(l+1)$th to the last polymorphic site on the right. $L$ and $R$ represent the left and right set of polymorphic sites, respectively, and $r_{ij}^2$ is $r^2$,² a common measure of LD (Hill & Robertson, 1968), between the $i$th and the $j$th sites. The value of $l$ that maximises $\omega$ defines the test statistic.

Values of ω were log-transformed prior to creating Manhattan plots in R (v3.3.2) using the ‘qqman’ package (Turner, n.d.). Windows containing domestication genes with previous evidence for positive selection were highlighted. Positions with the top 0.5% values were considered candidate regions. To verify whether these candidate regions show other characteristic signatures of selection, the ω-statistic was plotted against overlapping 25 kb windows of π and Tajima’s D, respectively. Since common outliers were rare and reporting them would require lowering the threshold for OmegaPlus, we did not report any common outliers as candidate regions, but rather chose OmegaPlus as the leading test.
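The ω-statistic of Equation (5) can be sketched for a single window as follows. This is a didactic illustration that takes a precomputed matrix of pairwise r² values as input; it does not reproduce OmegaPlus itself (for example its handling of minimum and maximum window sizes), and the function names and example matrix are hypothetical.

```python
from math import comb


def omega_at_split(r2, l):
    """Omega for W sites split into a left group of l sites and a right group of W - l.

    r2: symmetric W x W matrix (list of lists) of pairwise r^2 values.
    """
    w = len(r2)
    left = range(0, l)
    right = range(l, w)
    within = sum(r2[i][j] for i in left for j in left if i < j)
    within += sum(r2[i][j] for i in right for j in right if i < j)
    between = sum(r2[i][j] for i in left for j in right)
    numerator = within / (comb(l, 2) + comb(w - l, 2))
    denominator = between / (l * (w - l))
    return numerator / denominator if denominator > 0 else float("inf")


def omega_max(r2):
    """Maximise omega over all splits that leave at least two sites on each side."""
    w = len(r2)
    if w < 4:
        raise ValueError("at least four segregating sites are required")
    return max(omega_at_split(r2, l) for l in range(2, w - 1))


# Made-up window of four sites: strong LD within each flank, weak LD across them.
r2_example = [
    [1.0, 1.0, 0.1, 0.1],
    [1.0, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 1.0],
    [0.1, 0.1, 1.0, 1.0],
]
print(omega_max(r2_example))
```

High values arise when LD is strong within each flank but weak across the putative sweep position, which is the pattern expected around a recently fixed beneficial mutation.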

Variant effect prediction

Candidate regions were screened for potential causative mutations by examining related SNP content and genomic features. Variants were annotated using SnpEff (v4.0) (Cingolani et al., 2012). Genomic features were retrieved from the general feature format (GFF) file of the O. glaberrima reference genome on Ensembl (release 33). Genomic features containing putative moderate to high impact mutations within close proximity (< 25 kb) of candidate regions were extracted for closer inspection and can be found in Appendix G.

² LD is defined by Hill and Robertson (1968) as $r^2 = D^2 / (p_1 p_2 q_1 q_2)$, where $p_1$ and $p_2$ are the allele frequencies of SNP1, $q_1$ and $q_2$ are the allele frequencies of SNP2, and $D$ measures the absolute difference between the observed and the expected haplotype frequencies ($p_1 q_1$, $p_1 q_2$, $p_2 q_1$, and $p_2 q_2$, respectively).

Population structure

ADMIXTURE

Population structure was determined with ADMIXTURE (v1.3.0) (Alexander, Novembre, & Lange, 2009). In order to minimise the confounding effect of linkage disequilibrium, SNPs with a correlation coefficient of r² > 0.25 were pruned with PLINK (v2.0) in sliding windows of 500 SNPs, with a step size of 50 SNPs (Purcell et al., 2007). A total of 70,873 SNPs were retained for analysis. ADMIXTURE was subsequently run with varying levels of K and cross-validation to improve model fit. An optimal number of ancestral populations was selected by choosing the level of K with the lowest cross-validation error. This analysis was repeated for O. glaberrima and O. barthii separately. Cross-validation error estimates of all three analyses can be found in Appendix H. The resulting ancestry fractions were plotted as stacked bar charts in R (v3.3.2). The O. glaberrima population was subsequently divided into five populations based on the ancestral population contributing the highest fraction of genetic variation. The geographic distribution of these populations in West Africa was visualised by plotting the coordinates of all West African accessions within each population, using the R packages ‘rworldmap’ and ‘raster’ (Hijmans & Van Etten, 2012; South, 2011).
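The two selection rules described above (lowest cross-validation error, and assignment to the ancestral population with the highest ancestry fraction) can be sketched as follows. The log line format `CV error (K=5): 0.398` is assumed from typical ADMIXTURE output, and the function names and all numeric values are made up for illustration.

```python
import re

# Assumed format of the cross-validation line in an ADMIXTURE log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text):
    """Collect {K: cross-validation error} from concatenated log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


def assign_population(q_row):
    """Return the index of the ancestral population with the highest ancestry fraction."""
    return max(range(len(q_row)), key=lambda k: q_row[k])


# Made-up log excerpt and ancestry fractions for one accession.
log_text = "CV error (K=4): 0.412\nCV error (K=5): 0.398\nCV error (K=6): 0.405\n"
print(best_k(parse_cv_errors(log_text)))
print(assign_population([0.05, 0.10, 0.70, 0.10, 0.05]))
```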

Allelic differentiation

Based on the study by Meyer et al. (2016), four geographic sub-populations were identified, separating the arid and tropical populations by the 11° N cline and coastal and inland populations by the 6° W cline, respectively. Allelic differentiation between these populations was measured using Weir and Cockerham’s (1984) definition of FST, as implemented in VCFtools (v0.1.14). FST between O. glaberrima and O. barthii was included as a baseline. To account for uneven sample sizes of these populations, an equal number of individuals (n=15) was selected for each of the five sub-populations as identified with ADMIXTURE. To minimise the effect of missing data, an equal number of individuals (n=15) that were sequenced at high coverage were selected from the O. barthii population. Pairwise FST and π were calculated between all six populations to obtain a more balanced estimate of allelic differentiation, taking uneven sampling and sequencing depth into account.
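The balanced sampling step can be illustrated with a short Python sketch. The text does not state how the 15 individuals per ADMIXTURE population were chosen, so the seeded random draw below is an assumption, as are the function names and the toy sample identifiers. Only the ranking by coverage for O. barthii follows the description directly.

```python
import random


def top_coverage(coverage_by_sample, n=15):
    """Return the n sample IDs with the highest sequencing coverage."""
    if len(coverage_by_sample) < n:
        raise ValueError("fewer samples than requested")
    ranked = sorted(coverage_by_sample, key=coverage_by_sample.get, reverse=True)
    return ranked[:n]


def balanced_subsample(samples_by_population, n=15, seed=0):
    """Draw n samples per population (random draw is an assumption)."""
    rng = random.Random(seed)          # fixed seed keeps the draw repeatable
    chosen = {}
    for population, samples in samples_by_population.items():
        if len(samples) < n:
            raise ValueError(f"{population} has fewer than {n} samples")
        chosen[population] = sorted(rng.sample(sorted(samples), n))
    return chosen


# Toy data: two populations of unequal size and 20 samples with coverage.
toy_populations = {
    "P1": [f"a{i}" for i in range(40)],
    "P2": [f"b{i}" for i in range(15)],
}
toy_coverage = {f"s{i}": float(i) for i in range(20)}
picked = balanced_subsample(toy_populations)
high_coverage = top_coverage(toy_coverage, n=15)
```

The resulting lists of sample IDs could then be written to one file per population and passed to a tool that computes pairwise FST.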

Isolation by distance

Isolation by distance (IBD) was assessed by comparing the relatedness of individuals with the geographic distance separating their sites of collection. Kinship coefficients were estimated based on 1,419,601 SNPs, using the KING robust relationship inference method (Manichaikul et al., 2010) as implemented in PLINK (v2.0) (Purcell et al., 2007). Pairwise geographic distances were calculated in R (v3.3.2) using the package ‘geosphere’ (Hijmans, 2016), as the shortest distance between two points according to the Haversine function, assuming a spherical Earth with a radius of 6,378 km.
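For readers unfamiliar with the Haversine function, a minimal Python sketch is given here. It assumes a spherical Earth with the 6,378 km radius stated above. The function name is hypothetical and this is not the ‘geosphere’ implementation used in the thesis.

```python
import math

EARTH_RADIUS_KM = 6378.0  # spherical Earth radius stated in the text


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


# Two accessions from Appendix B that share the same collection coordinates.
same_site = haversine_km(14.33333, -4.9, 14.33333, -4.9)
# Half the circumference along the equator, for comparison with pi * radius.
half_equator = haversine_km(0.0, 0.0, 0.0, 180.0)
```

Applying such a function to every pair of collection sites yields the pairwise distance matrix that is compared with the kinship coefficients.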


Box and whisker plots of the resulting distances and kinships were grouped per genetic cluster and visualised in R (v3.3.2). The relation between kinship and geographic distance was quantified by fitting a linear model to the data points. To reflect the geographic range of the majority of each genetic population, outliers were omitted. A pair was considered an outlier when the distance separating its members lay more than 1.5 × IQR outside the interquartile range (IQR), that is, below the first quartile or above the third quartile by more than that amount. On average, outliers comprised less than 4% of the data. Linear models and correlation coefficients were estimated in R (v3.3.2). Kinship by distance graphs for each population separately and all populations combined can be found in Appendix H.
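The outlier rule and the linear fit can be sketched in Python as follows. This is an illustration under stated assumptions rather than the R code used in the thesis: the function names and toy data are hypothetical, and the quartiles are computed with the "inclusive" method of the standard library, which may differ slightly from the quantile definition used in the original analysis.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_distance_outliers(pairs, k=1.5):
    """Keep (distance, kinship) pairs whose distance lies within the fences."""
    low, high = iqr_bounds([distance for distance, _ in pairs], k)
    return [(d, kin) for d, kin in pairs if low <= d <= high]


def fit_line(pairs):
    """Ordinary least squares fit of kinship on distance: (slope, intercept)."""
    n = len(pairs)
    mean_x = sum(d for d, _ in pairs) / n
    mean_y = sum(kin for _, kin in pairs) / n
    sxx = sum((d - mean_x) ** 2 for d, _ in pairs)
    sxy = sum((d - mean_x) * (kin - mean_y) for d, kin in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data: kinship declining with distance, plus one very distant pair.
toy_pairs = [(float(d), 0.5 - 0.001 * d) for d in range(0, 101, 10)]
toy_pairs.append((5000.0, 0.4))
kept = drop_distance_outliers(toy_pairs)
slope, intercept = fit_line(kept)
```

In the toy data the single pair at 5,000 km lies far beyond the upper fence and is removed before the line is fitted.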

Phylogenetic analyses

Whole genome clustering

To confirm the clustering of O. glaberrima within O. barthii, a whole genome phylogenetic tree was constructed based on 3,923,601 genome-wide SNPs. To avoid distortion of branch lengths, no outgroup was used. Pairwise genomic distances between all accessions were calculated using a custom perl script, implementing the method described in S3.2 of Gronau, Hubisz, Gulko, Danko, & Siepel (2011). The divergence between two genomes, $X$ and $Y$, was accordingly calculated as follows:

Equation (6): $d(X, Y) = \frac{1}{L} \sum_{i=1}^{L} \left[ 1 - \frac{1}{2} \max\left( \delta_{a_i, c_i} + \delta_{b_i, d_i},\ \delta_{a_i, d_i} + \delta_{b_i, c_i} \right) \right]$,

where $a_i b_i$ is the genotype at position $i$ in $X$, and $c_i d_i$ is the genotype at position $i$ in $Y$. The resulting distance matrix was used to construct an unrooted neighbour joining (NJ) tree, using the BioNJ algorithm of FastME (v2.0) with subtree pruning and regrafting (SPR) (Lefort et al., 2015). Trees were pruned and annotated in Interactive Tree Of Life (iTOL v3) (Letunic & Bork, 2016).
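Equation (6) can be expressed compactly in code. The Python sketch below is not the custom perl script used in the thesis. It assumes that δ is the Kronecker delta (1 when two alleles are identical, 0 otherwise), that L is the number of positions compared, and that both genomes are given as equally long lists of unphased diploid genotypes. The function names and the handling of empty input are hypothetical, and missing genotypes are not treated.

```python
def site_identity(genotype_x, genotype_y):
    """Half the maximum number of matching alleles over both pairings."""
    a, b = genotype_x
    c, d = genotype_y
    direct = (a == c) + (b == d)    # delta(a,c) + delta(b,d)
    crossed = (a == d) + (b == c)   # delta(a,d) + delta(b,c)
    return max(direct, crossed) / 2


def genomic_distance(genome_x, genome_y):
    """Mean per-site divergence between two genomes, as in Equation (6)."""
    if len(genome_x) != len(genome_y):
        raise ValueError("genomes must cover the same positions")
    if not genome_x:
        raise ValueError("no positions to compare")
    total = sum(1 - site_identity(gx, gy) for gx, gy in zip(genome_x, genome_y))
    return total / len(genome_x)


# Toy genomes with three positions: identical, identical up to allele
# order, and fixed for different alleles.
toy_x = [("A", "A"), ("A", "G"), ("C", "C")]
toy_y = [("A", "A"), ("G", "A"), ("T", "T")]
toy_distance = genomic_distance(toy_x, toy_y)
```

Taking the maximum over the two allele pairings makes the measure independent of the arbitrary order in which the two alleles of an unphased genotype are written.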

Identification of domestication genes

Gene trees were constructed for salient domestication genes proposed by Li et al. (2017) and Wang et al. (2014). These genes were identified according to the criteria published in Meyer & Purugganan (2013). The selected domestication genes and their proposed functions are listed in Appendix J. Because the African rice genome is poorly annotated, their putative location in the O. glaberrima genome was determined using the O. sativa reference genome as a guide. O. sativa protein sequences were retrieved from the Rice Annotation Project Database (RAP-DB) and compared against the Ensembl Genomes O. glaberrima reference protein FASTA using BLAST+ (v2.6) (Camacho et al., 2009; Sakai et al., 2013). Genes were considered homologous when protein sequence similarity was higher than or equal to 95%. Based on this criterion, 19 out of 25 genes could be used for analysis. Gene coordinates were retrieved from the GFF file of the O. glaberrima reference genome on Ensembl (release 33). The structures of these genes were retrieved from Ensembl Plants (release 37).


Haplotype clustering

To ensure sufficient phylogenetic signal, gene intervals were extended with 5 kb flanking regions on either side using bedtools slop (v2.26.0) (Quinlan & Hall, 2010). Larger intervals were not used, to minimise the influence of LD decay. The SNPs within the selected genomic coordinates were phased with PHASE (v2.1.1) (Stephens & Donnelly, 2003; Stephens, Smith, & Donnelly, 2001), using the general model for recombination rate variation described by Li & Stephens (2003). Output was converted to FASTA format with a custom perl script. The resulting multiple alignments each contained 412 nucleotide sequences. Evolutionary histories were inferred using the NJ algorithm (Saitou & Nei, 1987) as implemented in MEGA7 (Kumar, Stecher, & Tamura, 2016). Both coding and noncoding positions were used. Evolutionary distances were calculated using the p-distance method of Nei & Kumar (2000). Ambiguous positions were removed for each sequence pair. All trees were annotated in Interactive Tree Of Life (iTOL v3) (Letunic & Bork, 2016). The five most common haplotypes for each gene were identified using the PHASE output. Putative effects of segregating SNPs were predicted using SnpEff (v4.0). Tree and haplotype statistics of all genes are summarised in Appendix K.
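The p-distance with removal of ambiguous positions per sequence pair can be illustrated with a short Python sketch. It is not the MEGA7 implementation. The function name is hypothetical, and the set of characters treated as ambiguous (N, ? and the gap symbol) is an assumption made for the example.

```python
AMBIGUOUS = set("N?-")  # characters treated as ambiguous (an assumption)


def p_distance(seq1, seq2):
    """Proportion of differing sites among sites unambiguous in both sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x in AMBIGUOUS or y in AMBIGUOUS:
            continue              # pairwise deletion: skip this site only
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no unambiguous sites shared by the two sequences")
    return differing / compared


# Toy haplotypes: the last site is ambiguous in the first sequence.
toy_distance = p_distance("ACGTN", "ACGAA")
```

Because sites are dropped per pair rather than from the whole alignment, different sequence pairs may be compared over different numbers of sites.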

Biogeographic analyses

The original data set included only one accession from Suriname (Van Andel et al., 2016). Variant discovery was therefore repeated with resequencing data of a second, unpublished sample from Suriname (TVA6745), resulting in an additional call set (call set 2). Quality thresholds for filtering this new call set can be found in Appendix C. Pairwise genomic distances were recalculated following the method described in S3.2 of Gronau et al. (2011), and were used to construct a whole genome NJ tree in MEGA (v7) using default settings. Pairwise genomic distances of the two Surinamese accessions to other O. glaberrima accessions were mapped onto geographical coordinates as previously done by Van Andel et al. (2016), by conducting a Thin Plate Spline (TPS) regression analysis on all West African accessions in R (v3.2.3) using the packages ‘fields’, ‘raster’ and ‘rworldmap’ (Hijmans & Van Etten, 2012; Nychka, Furrer, Paige, & Sain, 2015; South, 2011). Distances were interpolated across a geographical region from 0° to 20° latitude and −20° to 20° longitude with the smoothing parameter (λ) set to 0.001. Box and whisker diagrams of genomic distances were grouped per country and created using R (v3.2.3). Collection sites of the two Surinamese samples were retrieved from the Global Biodiversity Information Facility (Creuwels, 2017) and the U.S. Board on Geographic Names’ database (National Geospatial-Intelligence Agency, 2017). Additional evidence of O. glaberrima in Latin America was assessed by screening existing germplasm collections through the Genesys portal (Genesys, 2017) and reviewing published botanical records online. An overview of the collections found and their countries of origin is given in Appendix A.


REFERENCES

Agnoun, Y., Biaou, S. S. H., Sié, M., Vodouhè, R. S., & Ahanchédé, A. (2012). The African Rice Oryza glaberrima Steud: Knowledge Distribution and Prospects. International Journal of Biology, 4(3). https://doi.org/10.5539/ijb.v4n3p158

Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19, 1655–1664.

Allaby, R. G. (2015). Barley domestication: the end of a central dogma? Genome Biology, 16(1), 176. https://doi.org/10.1186/s13059-015-0743-9

Asano, K., Yamasaki, M., Takuno, S., Miura, K., Katagiri, S., Ito, T., … Matsuoka, M. (2011). Artificial selection for a green revolution gene during domestication. Proceedings of the National Academy of Sciences of the United States of America, 108(27), 11034–9. https://doi.org/10.1073/pnas.1019490108

Broad Institute. (n.d.). Picard Tools. Retrieved August 7, 2017, from http://broadinstitute.github.io/picard/

Brown, T. A. (2016). Plant genomics: African origins of “.” https://doi.org/10.1038/NPLANTS.2016.148

Caicedo, A. L., Williamson, S. H., Hernandez, R. D., Boyko, A., Fledel-Alon, A., York, T. L., … Purugganan, M. D. (2007). Genome-Wide Patterns of Nucleotide Polymorphism in Domesticated Rice. PLoS Genetics, 3(9), e163. https://doi.org/10.1371/journal.pgen.0030163

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10, 421. https://doi.org/10.1186/1471-2105-10-421

Carney, J. (2005). Rice and memory in the age of enslavement: Atlantic passages to Suriname. Slavery and Abolition, 26(3), 325–348. Retrieved from http://www.tandfonline.com.ezproxy.library.wur.nl/doi/abs/10.1080/01440390500319562

Choi, J. Y., Platts, A. E., Fuller, D. Q., Hsing, Y.-I., Wing, R. A., & Purugganan, M. D. (2017). The rice paradox: Multiple origins but single domestication in Asian rice. Molecular Biology and Evolution, 34(4), msx049. https://doi.org/10.1093/molbev/msx049


Cingolani, P., Patel, V. M., Coon, M., Nguyen, T., Land, S. J., Ruden, D. M., … Sturzenbaum, S. (2012). Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. https://doi.org/10.3389/fgene.2012.00035

Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., … Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6(2), 80–92.

Clark, J. D. (1967). The Problem of Neolithic culture in sub-Saharan Africa. In W. W. Bishop & J. D. Clark (Eds.), Background to Evolution in Africa (pp. 601–627). Chicago, IL: Chicago University Press.

Creuwels, J. (2017). Naturalis Biodiversity Center (NL) - Botany. Naturalis Biodiversity Center. Occurrence Dataset https://doi.org/10.15468/ib5ypt accessed via GBIF.org on 2017-09-24.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., … Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. https://doi.org/10.1093/bioinformatics/btr330

Degnan, J. H., & Rosenberg, N. A. (2009). Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology & Evolution, 24(6), 332–340. https://doi.org/10.1016/j.tree.2009.01.009

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V, Maguire, J. R., Hartl, C., … Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–8. https://doi.org/10.1038/ng.806

Earl, D., Paten, B., & Diekhans, M. (2014). Alignathon: a competitive assessment of whole-genome alignment methods. Genome Research, 24(12), 2077–2089. https://doi.org/10.1101/gr.174920.114

Genesys. (2017). About Genesys. Retrieved August 10, 2017, from https://www.genesys-pgr.org/content/about/about

Gronau, I., Hubisz, M. J., Gulko, B., Danko, C. G., & Siepel, A. (2011). Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10), 1031–1034. https://doi.org/10.1038/ng.937

Gross, B. L., & Zhao, Z. (2014). Archaeological and genetic insights into the origins of domesticated rice. Proceedings of the National Academy of Sciences of the United States of America, 111(17), 6190–7. https://doi.org/10.1073/pnas.1308942110


Harlan, J. R., De Wet, J. M. J., & Stemler, A. (1976). Plant Domestication and Indigenous African Agriculture. In Origins of African Plant Domestication (pp. 3–19). De Gruyter Mouton.

Hijmans, R. J. (2016). geosphere: Spherical Trigonometry. R package version 1.5-5. Retrieved from https://cran.r-project.org/package=geosphere

Hijmans, R. J., & Van Etten, J. (2012). raster: Geographic analysis and modeling with raster data. Retrieved from http://cran.r-project.org/package=raster

Hill, W. G., & Robertson, A. (1968). Linkage Disequilibrium in Finite Populations. Theoretical and Applied Genetics, 38, 226–231. Retrieved from http://svn.donarmstrong.com/don/trunk/projects/research/linkage/papers/ld_in_finite_populations_hill_robertson_theor_appl_gen_38_6_226.pdf

Ishida, Y. (2009). Sewall Wright and Gustave Malécot on Isolation by Distance. Philosophy of Science, 76(5), 784–796. https://doi.org/10.1086/605802

Ishii, T., Numaguchi, K., Miura, K., Yoshida, K., Thien Thanh, P., Myint Htun, T., … Ashikari, M. (2013). OsLG1 regulates a closed panicle trait in domesticated rice. Nature Genetics, 45(4). https://doi.org/10.1038/ng.2567

Jacquemin, J., Bhatia, D., Singh, K., & Wing, R. A. (2013). The International Oryza Map Alignment Project: development of a genus-wide comparative genomics platform to help solve the 9 billion-people question. Current Opinion in Plant Biology, 16(2), 147–156. https://doi.org/10.1016/J.PBI.2013.02.014

Jensen, J. D. (2014). On the unfounded enthusiasm for soft selective sweeps. Nature Communications, 5, 5281. https://doi.org/10.1038/ncomms6281

Kim, Y., & Nielsen, R. (2004). Linkage Disequilibrium as a Signature of Selective Sweeps. Genetics, 167(3).

Konishi, S., Izawa, T., Lin, S. Y., Ebana, K., Fukuta, Y., Sasaki, T., & Yano, M. (2006). An SNP Caused Loss of Seed Shattering During Rice Domestication. Science, 312(5778). Retrieved from http://science.sciencemag.org/content/312/5778/1392.full

Kumar, S., Stecher, G., & Tamura, K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Molecular Biology and Evolution, 33, 1870–1874.


Lefort, V., Desper, R., & Gascuel, O. (2015). FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program. Molecular Biology and Evolution, 32(10), 2798–2800. https://doi.org/10.1093/molbev/msv150

Letunic, I., & Bork, P. (2016). Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Research. https://doi.org/10.1093/nar/gkw290

Li, C., Zhou, A., & Sang, T. (2006). Rice Domestication by Reducing Shattering. Science, 311(5769). Retrieved from http://science.sciencemag.org/content/311/5769/1936.full

Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England), 26(5), 589–95. https://doi.org/10.1093/bioinformatics/btp698

Li, L.-F., Li, Y.-L., Jia, Y., Caicedo, A. L., & Olsen, K. M. (2017). Signatures of adaptation in the genome. Nature Genetics, 49(5), 811–814. https://doi.org/10.1038/ng.3825

Li, N., & Stephens, M. (2003). Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data. Genetics, 165, 2213–2233.

Li, W. H., Wu, C. I., & Luo, C. C. (1985). A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Molecular Biology and Evolution, 2(2), 150–74. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/3916709

Li, Z.-M., Zheng, X.-M., & Ge, S. (2011). Genetic diversity and domestication history of African rice (Oryza glaberrima) as inferred from multiple gene sequences. Theoretical and Applied Genetics, 123(1), 21–31. https://doi.org/10.1007/s00122-011-1563-2

Linares, O. F. (2002). African rice (Oryza glaberrima): history and future potential. Proceedings of the National Academy of Sciences of the United States of America, 99(25), 16360–5. https://doi.org/10.1073/pnas.252604599

Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W.-M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867–2873. https://doi.org/10.1093/bioinformatics/btq559

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–303. https://doi.org/10.1101/gr.107524.110


Meyer, R. S., Choi, J. Y., Sanches, M., Plessis, A., Flowers, J. M., Amas, J., … Purugganan, M. D. (2016). Domestication history and geographical adaptation inferred from a SNP map of African rice. Nature Genetics, 48(9), 1083–1088. https://doi.org/10.1038/ng.3633

Meyer, R. S., & Purugganan, M. D. (2013). Evolution of crop species: genetics of domestication and diversification. Nature Reviews Genetics, 14(12), 840–852. https://doi.org/10.1038/nrg3605

Mohanty, S. (2013). IRRI - Trends in global rice consumption. Retrieved May 30, 2017, from http://irri.org/rice-today/trends-in-global-rice-consumption

Molina, J., Sikora, M., Garud, N., Flowers, J. M., Rubinstein, S., Reynolds, A., … Purugganan, M. D. (2011). Molecular evidence for a single evolutionary origin of domesticated rice. Proceedings of the National Academy of Sciences of the United States of America, 108(20), 8351–6. https://doi.org/10.1073/pnas.1104686108

Nabholz, B., Sarah, G., Sabot, F., Ruiz, M., Adam, H., Nidelet, S., … Glémin, S. (2014). Transcriptome population genomics reveals severe bottleneck and domestication cost in the African rice (Oryza glaberrima). Molecular Ecology, 23(9), 2210–2227. https://doi.org/10.1111/mec.12738

National Geospatial-Intelligence Agency. (2017). Geographical Names. Retrieved from http://www.geographic.org/geographic_names/name.php?uni=-1355040&fid=4448&c=suriname

Ndjiondjop, M.-N., Semagn, K., Gouda, A. C., Kpeki, S. B., Dro Tia, D., Sow, M., … Warburton, M. L. (2017). Genetic Variation and Population Structure of Oryza glaberrima and Development of a Mini-Core Collection Using DArTseq. Frontiers in Plant Science, 8, 1748. https://doi.org/10.3389/fpls.2017.01748

Nei, M., & Kumar, S. (2000). Molecular Evolution and Phylogenetics. New York: Oxford University Press.

Nielsen, R. (2005). Molecular signatures of natural selection. Annual Review of Genetics, 39, 197–218. https://doi.org/10.1146/annurev.genet.39.073003.112420

Nychka, D., Furrer, R., Paige, J., & Sain, S. (2015). fields: Tools for spatial data. https://doi.org/10.5065/D6W957CT

Ohnishi, T., Sugahara, S., Yamada, T., Kikuchi, K., Yoshiba, Y., Hirano, H.-Y., & Tsutsumi, N. (2005). OsNAC6, a member of the NAC gene family, is induced by various stresses in rice. Genes & Genetic Systems, 80(2), 135–9. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/16172526


Orjuela, J., Sabot, F., Chéron, S., Vigouroux, Y., Adam, H., Chrestin, H., … Ghesquière, A. (2014). An extensive analysis of the African rice genetic diversity through a global genotyping. Theoretical and Applied Genetics, 127(10), 2211–23. https://doi.org/10.1007/s00122-014-2374-z

Pfaffelhuber, P., Lehnert, A., Stephan, W., & Parsch, J. (2008). Linkage Disequilibrium Under Genetic Hitchhiking in Finite Populations. Genetics, 179(1), 527–537. https://doi.org/10.1534/genetics.107.081497

Portères, R. (1962). Berceaux Agricoles Primaires Sur le Continent Africain. The Journal of African History, 3(2), 195–210. https://doi.org/10.1017/S0021853700003030

Portères, R. (1955). Présence ancienne d’une variété cultivée d’Oryza glaberrima St. en Guyane Française. Journal D’agriculture Tropicale et de Botanique Appliquée, 2(12), 680. https://doi.org/10.3406/jatba.1955.2270

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., … Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3), 559–75. https://doi.org/10.1086/519795

Purugganan, M. D. (2014). An evolutionary genomic tale of two rice species. Nature Genetics, 46(9), 931–2. https://doi.org/10.1038/ng.3071

Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841–842. https://doi.org/10.1093/bioinformatics/btq033

R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.r-project.org/

Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406–25. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/3447015

Sakai, H., Lee, S. S., Tanaka, T., Numa, H., Kim, J., Kawahara, Y., … Itoh, T. (2013). Rice Annotation Project Database (RAP-DB): An Integrative and Interactive Database for Rice Genomics. Plant and Cell Physiology, 54(2), e6–e6. https://doi.org/10.1093/pcp/pcs183

Schmidhuber, J., & Tubiello, F. N. (2007). Global food security under climate change. Proceedings of the National Academy of Sciences of the United States of America, 104(50), 19703–8. https://doi.org/10.1073/pnas.0701976104


Semon, M., Nielsen, R., Jones, M. P., & McCouch, S. R. (2005). The population structure of African cultivated rice Oryza glaberrima (Steud.): evidence for elevated levels of linkage disequilibrium caused by admixture with O. sativa and ecological adaptation. Genetics, 169(3), 1639–47. https://doi.org/10.1534/genetics.104.033175

Shaw, T. (1976). Early crops in Africa: a review of the evidence. In J. R. Harlan, J. M. J. De Wet, & A. B. L. Stemler (Eds.), Origins of African Plant Domestication (pp. 107–153). The Hague & Paris: Mouton.

South, A. (2011). rworldmap: A New R package for Mapping Global Data. The R Journal, 3(1), 35–43.

Stephens, M., & Donnelly, P. (2003). A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data. American Journal of Human Genetics, 73, 1162–1169. Retrieved from http://stephenslab.uchicago.edu/assets/papers/Stephens2003a.pdf

Stephens, M., Smith, N. J., & Donnelly, P. (2001). A New Statistical Method for Haplotype Reconstruction from Population Data. American Journal of Human Genetics, 68, 978–989. Retrieved from http://stephenslab.uchicago.edu/assets/papers/Stephens2001.pdf

Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3), 585–95. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/2513255

The UniProt Consortium. (2017). UniProt: the Universal Protein knowledgebase. Nucleic Acids Research, 45(D1), D158–D169. https://doi.org/10.1093/nar/gkh131

Thornton, K. (2005). Recombination and the properties of Tajima’s D in the context of approximate-likelihood calculation. Genetics, 171(4), 2143–8. https://doi.org/10.1534/genetics.105.043786

Turner, S. D. (n.d.). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. bioRxiv. https://doi.org/10.1101/005165

Vaillant, M. (1948). Milieu cultural et classification des variétés de Riz des Guyanes française et hollandaise. Revue Internationale de Botanique Appliquée et D’agriculture Tropicale, 28(313), 520–529. https://doi.org/10.3406/jatba.1948.6700

Van Andel, T. (2010). African Rice (Oryza glaberrima Steud.): Lost Crop of the Enslaved Africans Discovered in Suriname. Economic Botany, 64(1), 1–10. https://doi.org/10.1007/s12231-010-9111-6


Van Andel, T. R., Meyer, R. S., Aflitos, S. A., Carney, J. A., Veltman, M. A., Copetti, D., … Freedman, A. H. (2016). Tracing ancestor rice of Suriname Maroons back to its African origin. Nature Plants, 2(10), 16149. https://doi.org/10.1038/nplants.2016.149

Van Andel, T. R., van der Velden, A., & Reijers, M. (2016). The “Botanical Gardens of the Dispossessed” revisited: richness and significance of Old World crops grown by Suriname Maroons. Genetic Resources and Crop Evolution, 63(4), 695–710. https://doi.org/10.1007/s10722-015-0277-8

Vaughan, D. A. (1994). The Wild Relatives of Rice: A Genetic Resources Handbook. IRRI.

Wang, M., Yu, Y., Haberer, G., Marri, P. R., Fan, C., Goicoechea, J. L., … Wing, R. A. (2014). The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature Genetics, 46(9), 982–8. https://doi.org/10.1038/ng.3044

Weir, B. S., & Cockerham, C. C. (1984). Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38(6), 1358–1370. Retrieved from http://www.jstor.org

Wu, W., Liu, X., Wang, M., Meyer, R. S., Luo, X., Ndjiondjop, M.-N., … Zhu, Z. (2017). A single-nucleotide polymorphism causes smaller grain size and loss of seed shattering during African rice domestication. Nature Plants, 3(6), 17064. https://doi.org/10.1038/nplants.2017.64


SUPPLEMENTARY MATERIAL

Appendix A: Evidence of African rice in Latin America
    Field collections in Suriname
Appendix B: List of used accessions
    Oryza glaberrima
    Oryza barthii
    Geographic origin of used accessions
Appendix C: Filtering parameters
    Call set 1: O. glaberrima and O. barthii
    Call set 2: O. glaberrima
Appendix D: Call set comparison
    Fixation index and nucleotide diversity
    Individual statistics (call set 1a)
    Individual statistics (call set 1b)
Appendix E: Diversity statistics
Appendix F: Choice of outgroup
Appendix G: Candidate regions under selection
    OmegaPlus outliers
    Associated genes
Appendix H: Population structure analysis
Appendix I: Whole genome species tree
Appendix J: Genes selected for phylogenetic analysis
Appendix K: Haplotypes of domestication genes
Appendix L: Selective sweep models
Appendix M: DNA extraction protocol
References


APPENDIX A: EVIDENCE OF AFRICAN RICE IN LATIN AMERICA

Only three O. glaberrima accessions from the Americas are known to have been sequenced to date: IRGC68976, TVA6745 and TVA6749. Two of those were collected from the field (TVA6745 and TVA6749); the other (IRGC68976) was collected from a breeding station. Historical records of O. glaberrima in the Americas were documented by Portères (1955, 1960), but the present location of these vouchers is unknown. Gene banks hold several O. glaberrima accessions in their seed collections (Genesys, 2017). It is not clear, however, whether these came from donor institutions in Africa or from farmers in the country where they are currently held. The Surinamese accessions held by Naturalis remain the best documented and preserved botanical collections that we know of at this time. An additional sample was recently collected from French Guiana, and will likely be included in future studies.

Table S1: Botanical records and germplasm of O. glaberrima from Latin America. This list was compiled from online sources.

Country | Accession | Biological status | Collection | Year | Genetic data¹ | Source² | Holding institute
Brazil | WAB0011146 | Traditional cultivar/Landrace | Germplasm | | NA | Genesys | Africa Rice Center
Guyana | WAB0029115 | Traditional cultivar/Landrace | Germplasm | | NA | Genesys | Africa Rice Center
Suriname | WAB0034173 | Traditional cultivar/Landrace | Germplasm | | NA | Genesys | Africa Rice Center
Colombia | WAB0036913 | Unknown | Germplasm | | NA | Genesys | Africa Rice Center
Colombia | WAB0036914 | Unknown | Germplasm | | NA | Genesys | Africa Rice Center
Guyana | IRGC68976 | Unknown | Germplasm | 1984 | WGS | Genesys | International Rice Research Institute
French Guiana | TVA???? | Traditional cultivar/landrace | Botanical record | 2017 | NA | Unpublished | Naturalis Biodiversity Center
Suriname | TVA6745 | Traditional cultivar/landrace | Botanical record | 2016 | WGS | Unpublished | Naturalis Biodiversity Center
Suriname | TVA6749 | Traditional cultivar/landrace | Botanical record | 2009 | WGS | (Van Andel et al., 2016) | Naturalis Biodiversity Center
French Guiana | Unknown | Traditional cultivar/landrace | Botanical record | 1938 | NA | (Portères, 1955) | Unknown
El Salvador | Unknown | Self-seeding | Botanical record | 1960 | NA | (Portères, 1960) | Unknown
El Salvador | GSOR311688 | Traditional cultivar/Landrace | Germplasm | 1960 | NA | Genesys | United States Department of Agriculture
El Salvador | PI269630 | Other³ | Germplasm | 1960 | NA | Genesys | United States Department of Agriculture

1 NA = Not Applicable. WGS = Whole Genome Sequence.
2 “GENESYS is a global portal to information about Plant Genetic Resources for Food and Agriculture (PGRFA). It is a gateway from which germplasm accessions from genebanks around the world can be easily found and ordered. GENESYS is the result of collaboration between Bioversity International on behalf of System-wide Genetic Resources Programme of the CGIAR (Consultative Group on International Agricultural Research), the Global Crop Diversity Trust and the Secretariat of the International Treaty on the Plant Genetic Resources for Food and Agriculture. By facilitating access to and use of PGRFA, GENESYS helps to secure its long-term conservation.” (Genesys, 2017).
3 The classification of biological status by Genesys as ‘other’ may indicate that this is the unwanted self-seeding variety of “Arroz rojo” observed in El Salvador by Portères (1960).


Field collections in Suriname


Figure S1: Map of Suriname, showing the sampling locations of accessions TVA6745 and TVA6749 used in this study. A. The distance between the two sampling locations, as measured by Google Maps. B. Location where TVA6749 was collected in 2009, from Vreedzaam market at Waterside street, Paramaribo. Coordinates were obtained from the Global Biodiversity Information Facility (Creuwels, 2017)⁴. C. Location where TVA6745 was collected in 2016, from a rice field in Tjon Tjon along the Tapanahony River, Sipaliwini. Coordinates were obtained from the U.S. Board on Geographic Names’ database (National Geospatial-Intelligence Agency, 2017).

4 http://www.gbif.org/occurrence/1139387448, http://www.gbif.org/occurrence/1140304295 or http://www.gbif.org/occurrence/1140360483


APPENDIX B: LIST OF USED ACCESSIONS

Oryza glaberrima

Species | Accession number | Collection | Country | Latitude | Longitude | Source publication | SRA study | SRA run
Oryza glaberrima | IRGC101049 | International Rice Research Institute | South Africa | -33.6106 | 26.83333 | Wang et al. (2014) | SRP038750 | SRR1206507
Oryza glaberrima | IRGC103442 | International Rice Research Institute | Senegal | 12.65 | -15.4667 | Meyer et al. (2016) | SRP071857 | SRR3231659
Oryza glaberrima | IRGC103450 | International Rice Research Institute | Gambia | 13.25 | -15.8333 | Meyer et al. (2016) | SRP071857 | SRR3231660
Oryza glaberrima | IRGC103452 | International Rice Research Institute | Senegal | 13.76639 | -16.4833 | Meyer et al. (2016) | SRP071857 | SRR3231661
Oryza glaberrima | IRGC103456 | International Rice Research Institute | Senegal | 13.01667 | -15.7167 | Meyer et al. (2016) | SRP071857 | SRR3231662
Oryza glaberrima | IRGC103461 | International Rice Research Institute | Senegal | 13.53333 | -16.4167 | Meyer et al. (2016) | SRP071857 | SRR3231663
Oryza glaberrima | IRGC103463 | International Rice Research Institute | Senegal | 12.51667 | -16.5833 | Meyer et al. (2016) | SRP071857 | SRR3231664
Oryza glaberrima | IRGC103469 | International Rice Research Institute | Burkina Faso | 11.33333 | -4.41667 | Wang et al. (2014) | SRP038750 | SRR1206500
Oryza glaberrima | IRGC103472 | International Rice Research Institute | Burkina Faso | 10.53333 | -4.76667 | Wang et al. (2014) | SRP038750 | SRR1206508
Oryza glaberrima | IRGC103517 | International Rice Research Institute | Mali | 14.33333 | -4.9 | Meyer et al. (2016) | SRP071857 | SRR3231665
Oryza glaberrima | IRGC103520 | International Rice Research Institute | Mali | 14.33333 | -4.9 | Wang et al. (2014) | SRP038750 | SRR1206509
Oryza glaberrima | IRGC103530 | International Rice Research Institute | Mali | 14.23333 | -3.61667 | Meyer et al. (2016) | SRP071857 | SRR3231666
Oryza glaberrima | IRGC103592 | International Rice Research Institute | Cameroon | 10.16667 | 14.33333 | Meyer et al. (2016) | SRP071857 | SRR3231667
Oryza glaberrima | IRGC103599 | International Rice Research Institute | Cameroon | 8.7 | 14.18333 | Meyer et al. (2016) | SRP071857 | SRR3231668
Oryza glaberrima | IRGC103632 | International Rice Research Institute | Mali | 14.88333 | -3.99139 | Wang et al. (2014) | SRP038750 | SRR1206510
Oryza glaberrima | IRGC103922 | International Rice Research Institute | Nigeria | 11.16667 | 4.66667 | Meyer et al. (2016) | SRP071857 | SRR3231669
Oryza glaberrima | IRGC103937 | International Rice Research Institute | Liberia | 6.18333 | -9.76667 | Meyer et al. (2016) | SRP071857 | SRR3231670
Oryza glaberrima | IRGC103946 | International Rice Research Institute | Liberia | 8.4 | -10.1833 | Meyer et al. (2016) | SRP071857 | SRR3231671
Oryza glaberrima | IRGC103948 | International Rice Research Institute | Liberia | 8.18333 | -10.2333 | Meyer et al. (2016) | SRP071857 | SRR3231672
Oryza glaberrima | IRGC103949 | International Rice Research Institute | Liberia | 8.28333 | -10.1 | Meyer et al. (2016) | SRP071857 | SRR3231673
Oryza glaberrima | IRGC103953 | International Rice Research Institute | Sierra Leone | 9 | -13 | Meyer et al. (2016) | SRP071857 | SRR3231674
Oryza glaberrima | IRGC103955 | International Rice Research Institute | Senegal | 12.62972 | -16.0167 | Meyer et al. (2016) | SRP071857 | SRR3231675
Oryza glaberrima | IRGC103956 | International Rice Research Institute | Senegal | 12.58583 | -15.5433 | Meyer et al. (2016) | SRP071857 | SRR3231676
Oryza glaberrima | IRGC103957 | International Rice Research Institute | Senegal | 12.60083 | -16.0694 | Meyer et al. (2016) | SRP071857 | SRR3231677
Oryza glaberrima | IRGC103958 | International Rice Research Institute | Senegal | 12.63639 | -15.4725 | Meyer et al. (2016) | SRP071857 | SRR3231678
Oryza glaberrima | IRGC103959 | International Rice Research Institute | Senegal | 12.55556 | -15.8275 | Meyer et al. (2016) | SRP071857 | SRR3231679
Oryza glaberrima | IRGC103960 | International Rice Research Institute | Senegal | 12.57028 | -15.9042 | Meyer et al. (2016) | SRP071857 | SRR3231680
Oryza glaberrima | IRGC103963 | International Rice Research Institute | Senegal | 12.52944 | -15.7081 | Meyer et al. (2016) | SRP071857 | SRR3231681
Oryza glaberrima | IRGC103967 | International Rice Research Institute | Senegal | 12.50083 | -15.5217 | Meyer et al. (2016) | SRP071857 | SRR3231682
Oryza glaberrima | IRGC103981 | International Rice Research Institute | Nigeria | 8.91667 | 4.63333 | Meyer et al. (2016) | SRP071857 | SRR3231683
Oryza glaberrima | IRGC103982 | International Rice Research Institute | Nigeria | 12.75 | 8.83333 | Meyer et al. (2016) | SRP071857 | SRR3231684
Oryza glaberrima | IRGC103988 | International Rice Research Institute | Sierra Leone | 8.28333 | -10.5833 | Meyer et al. (2016) | SRP071857 | SRR3231685
Oryza glaberrima | IRGC103989 | International Rice Research Institute | Sierra Leone | 8.23333 | -10.3667 | Meyer et al. (2016) | SRP071857 | SRR3231686
Oryza glaberrima | IRGC103991 | International Rice Research Institute | Sierra Leone | 8.13333 | -10.8833 | Meyer et al. (2016) | SRP071857 | SRR3231687
Oryza glaberrima | IRGC103992 | International Rice Research Institute | Sierra Leone | 7.28333 | -11.3333 | Meyer et al. (2016) | SRP071857 | SRR3231688
Oryza glaberrima | IRGC103993 | International Rice Research Institute | Sierra Leone | 9.75 | -12.5 | Meyer et al. (2016) | SRP071857 | SRR3231689
Oryza glaberrima | IRGC103994 | International Rice Research Institute | Sierra Leone | 9.05 | -11.7333 | Meyer et al. (2016) | SRP071857 | SRR3231690
Oryza glaberrima | IRGC103995 | International Rice Research Institute | Sierra Leone | 9.11667 | -11.9167 | Meyer et al. (2016) | SRP071857 | SRR3231691
Oryza glaberrima | IRGC104011 | International Rice Research Institute | Nigeria | 8.88333 | 11.38333 | Meyer et al. (2016) | SRP071857 | SRR3231692
Oryza glaberrima | IRGC104022 | International Rice Research Institute | Guinea-Bissau | 11.785 | -15.9175 | Meyer et al. (2016) | SRP071857 | SRR3231693
Oryza glaberrima | IRGC104023 | International Rice Research Institute | Guinea-Bissau | 11.88333 | -15.8497 | Meyer et al. (2016) | SRP071857 | SRR3231694
Oryza glaberrima | IRGC104024 | International Rice Research Institute | Guinea-Bissau | 11.86917 | -15.6086 | Meyer et al. (2016) | SRP071857 | SRR3231695
Oryza glaberrima | IRGC104025 | International Rice Research Institute | Guinea-Bissau | 12.07917 | -15.3214 | Meyer et al. (2016) | SRP071857 | SRR3231696
Oryza glaberrima | IRGC104028 | International Rice Research Institute | Guinea-Bissau | 12.34861 | -14.3528 | Meyer et al. (2016) | SRP071857 | SRR3231697
Oryza glaberrima | IRGC104029 | International Rice Research Institute | Guinea-Bissau | 12.16667 | -14.6664 | Meyer et al. (2016) | SRP071857 | SRR3231698
Oryza glaberrima | IRGC104030 | International Rice Research Institute | Guinea-Bissau | 11.68333 | -14.7658 | Meyer et al. (2016) | SRP071857 | SRR3231699
Oryza glaberrima | IRGC104032 | International Rice Research Institute | Guinea-Bissau | 11.35583 | -15.1167 | Meyer et al. (2016) | SRP071857 | SRR3231700
Oryza glaberrima | IRGC104034 | International Rice Research Institute | Cote d'Ivoire | 7.316944 | -7.5125 | Meyer et al. (2016) | SRP071857 | SRR3231701
Oryza glaberrima | IRGC104035 | International Rice Research Institute | Cote d'Ivoire | 6.685556 | -8.31139 | Meyer et al. (2016) | SRP071857 | SRR3231702
Oryza glaberrima | IRGC104036 | International Rice Research Institute | Cote d'Ivoire | 6.589722 | -8.24583 | Meyer et al. (2016) | SRP071857 | SRR3231703
Oryza glaberrima | IRGC104044 | International Rice Research Institute | Chad | 9.41667 | 16.33333 | Meyer et al. (2016) | SRP071857 | SRR3231704
Oryza glaberrima | IRGC104047 | International Rice Research Institute | Cameroon | 10.75 | 13.83333 | Meyer et al. (2016) | SRP071857 | SRR3231705
Oryza glaberrima | IRGC104165 | International Rice Research Institute | Guinea | 8.381944 | -9.29944 | Meyer et al.
(2016) SRP071857 SRR3231706 Oryza glaberrima IRGC104173 International Rice Research Institute Guinea 10.01639 -10.8333 Meyer et al. (2016) SRP071857 SRR3231707

69

Species Accession number Collection Country Latitude Longitude Source publication SRA study SRA run Oryza glaberrima IRGC104177 International Rice Research Institute Guinea 11.07139 -14.4219 Meyer et al. (2016) SRP071857 SRR3231708 Oryza glaberrima IRGC104178 International Rice Research Institute Guinea 12.35472 -13.14 Meyer et al. (2016) SRP071857 SRR3231709 Oryza glaberrima IRGC104180 International Rice Research Institute Guinea 11.18306 -12.2042 Meyer et al. (2016) SRP071857 SRR3231710 Oryza glaberrima IRGC104181 International Rice Research Institute Guinea 10.89861 -12.5586 Meyer et al. (2016) SRP071857 SRR3231711 Oryza glaberrima IRGC104182 International Rice Research Institute Guinea 10.81667 -12.6997 Meyer et al. (2016) SRP071857 SRR3231712 Oryza glaberrima IRGC104187 International Rice Research Institute Guinea 10.91639 -12.1833 Meyer et al. (2016) SRP071857 SRR3231713 Oryza glaberrima IRGC104190 International Rice Research Institute Guinea 10.58417 -12.5608 Meyer et al. (2016) SRP071857 SRR3231714 Oryza glaberrima IRGC104194 International Rice Research Institute Guinea 9.273056 -13.0242 Meyer et al. (2016) SRP071857 SRR3231715 Oryza glaberrima IRGC104195 International Rice Research Institute Ghana 6.15 0.26667 Meyer et al. (2016) SRP071857 SRR3231716 Oryza glaberrima IRGC104206 International Rice Research Institute Ghana 7.63333 0.83333 Wang et al. (2014) SRP038750 SRR1206512 Oryza glaberrima IRGC104231 International Rice Research Institute Sierra Leone 7.83333 -10.7667 Meyer et al. (2016) SRP071857 SRR3231717 Oryza glaberrima IRGC104260 International Rice Research Institute Ghana 7.23333 0.56667 Meyer et al. (2016) SRP071857 SRR3231718 Oryza glaberrima IRGC104294 International Rice Research Institute Chad 9.66667 15 Meyer et al. (2016) SRP071857 SRR3231719 Oryza glaberrima IRGC104533 International Rice Research Institute Nigeria 12.83333 4.7 Meyer et al. 
(2016) SRP071857 SRR3231720 Oryza glaberrima IRGC104545 International Rice Research Institute Nigeria 15.83333 5.83333 Meyer et al. (2016) SRP071857 SRR3231721 Oryza glaberrima IRGC104561 International Rice Research Institute Sierra Leone 8.26667 -10.4833 Meyer et al. (2016) SRP071857 SRR3231722 Oryza glaberrima IRGC104562 International Rice Research Institute Sierra Leone 9.66667 -11.5833 Meyer et al. (2016) SRP071857 SRR3231723 Oryza glaberrima IRGC104566 International Rice Research Institute Senegal 12.71667 -15.6 Meyer et al. (2016) SRP071857 SRR3231724 Oryza glaberrima IRGC104571 International Rice Research Institute Senegal 14.49944 -14.4456 Meyer et al. (2016) SRP071857 SRR3231725 Oryza glaberrima IRGC104573 International Rice Research Institute Cote d'Ivoire 9.2 -3.03333 Meyer et al. (2016) SRP071857 SRR3231726 Oryza glaberrima IRGC104574 International Rice Research Institute Mali 12.4 -5.4 Wang et al. (2014) SRP038750 SRR1206513 Oryza glaberrima IRGC104595 International Rice Research Institute Mali 12.3 -7.93333 Meyer et al. (2016) SRP071857 SRR3231727 Oryza glaberrima IRGC104904 International Rice Research Institute Nigeria 10.5 4.66667 Meyer et al. (2016) SRP071857 SRR3231728 Oryza glaberrima IRGC104934 International Rice Research Institute Burkina Faso 10.83333 -4.58333 Meyer et al. (2016) SRP071857 SRR3231729 Oryza glaberrima IRGC104955 International Rice Research Institute Sierra Leone 9.5 -12.2333 Wang et al. (2014) SRP038750 SRR1206514 Oryza glaberrima IRGC105005 International Rice Research Institute Guinea 11.67667 -9.42556 Meyer et al. (2016) SRP071857 SRR3231730 Oryza glaberrima IRGC105011 International Rice Research Institute Guinea 11.43333 -9.03306 Meyer et al. (2016) SRP071857 SRR3231731 Oryza glaberrima IRGC105021 International Rice Research Institute Guinea 10.84972 -10.9333 Meyer et al. (2016) SRP071857 SRR3231732 Oryza glaberrima IRGC105026 International Rice Research Institute Guinea 9.430556 -13.0878 Meyer et al. 
(2016) SRP071857 SRR3231733

70

Species Accession number Collection Country Latitude Longitude Source publication SRA study SRA run Oryza glaberrima IRGC105034 International Rice Research Institute Guinea 10.34972 -14.3664 Meyer et al. (2016) SRP071857 SRR3231734 Oryza glaberrima IRGC105036 International Rice Research Institute Guinea 10.66611 -14.5997 Meyer et al. (2016) SRP071857 SRR3231735 Oryza glaberrima IRGC105038 International Rice Research Institute Guinea 9.949722 -12.9333 Meyer et al. (2016) SRP071857 SRR3231736 Oryza glaberrima IRGC105043 International Rice Research Institute Guinea 11.21667 -11.8164 Meyer et al. (2016) SRP071857 SRR3231737 Oryza glaberrima IRGC105044 International Rice Research Institute Guinea 11.31667 -12.2831 Meyer et al. (2016) SRP071857 SRR3231738 Oryza glaberrima IRGC105048 International Rice Research Institute Liberia 6.98333 -9.6 Meyer et al. (2016) SRP071857 SRR3231739 Oryza glaberrima IRGC105049 International Rice Research Institute Liberia 6.98333 -9.58333 Meyer et al. (2016) SRP071857 SRR3231740 Oryza glaberrima IRGC105050 International Rice Research Institute Liberia 6.23333 -9.93333 Meyer et al. (2016) SRP071857 SRR3231741 Oryza glaberrima IRGC105052 International Rice Research Institute Guinea 10.18333 -14.0664 Meyer et al. (2016) SRP071857 SRR3231742 Oryza glaberrima IRGC58622 International Rice Research Institute Sierra Leone 8.464444 -11.7958 Meyer et al. (2016) SRP071857 SRR3231743 Oryza glaberrima IRGC61457 International Rice Research Institute Liberia 6.452222 -9.42833 Meyer et al. (2016) SRP071857 SRR3231744 Oryza glaberrima IRGC67563 International Rice Research Institute Ghana 6.719167 0.526111 Meyer et al. (2016) SRP071857 SRR3231745 Oryza glaberrima IRGC68939 International Rice Research Institute Madagascar -18.7772 46.83111 Wang et al. (2014) SRP038750 SRR1206516 Oryza glaberrima IRGC68976 International Rice Research Institute Guyana 4.866111 -58.9381 Wang et al. 
(2014) SRP038750 SRR1206517 Oryza glaberrima IRGC75500 International Rice Research Institute Burkina Faso 12.9 -2.44972 Wang et al. (2014) SRP038750 SRR1206518 Oryza glaberrima IRGC75546 International Rice Research Institute Burkina Faso 13.65278 -0.55083 Meyer et al. (2016) SRP071857 SRR3231746 Oryza glaberrima IRGC75618 International Rice Research Institute Burkina Faso 12.01639 -2.31639 Meyer et al. (2016) SRP071857 SRR3231747 Oryza glaberrima IRGC75729 International Rice Research Institute Burkina Faso 12.74528 -3.76306 Meyer et al. (2016) SRP071857 SRR3231748 Oryza glaberrima IRGC96841 International Rice Research Institute Zimbabwe -19.0131 29.14639 Wang et al. (2014) SRP038750 SRR1206519 Oryza glaberrima TOG5457 Africa Rice Center Nigeria 11.93333 4.18333 Wang et al. (2014) SRP038750 SRR1206501 Oryza glaberrima TOG5467 Africa Rice Center Nigeria N.A. N.A. Wang et al. (2014) SRP038750 SRR1206502 Oryza glaberrima TOG5923 Africa Rice Center Liberia 6.43333 -10.7833 Wang et al. (2014) SRP038750 SRR1206503 Oryza glaberrima TOG5949 Africa Rice Center Nigeria 7.5 9.06667 Wang et al. (2014) SRP038750 SRR1206504 Oryza glaberrima TOG7025 Africa Rice Center Sierra Leone 9.85 -11.3167 Wang et al. (2014) SRP038750 SRR1206505 Oryza glaberrima TOG7102 Africa Rice Center Mali 13.98333 -5.61639 Wang et al. (2014) SRP038750 SRR1206506 Oryza glaberrima TOG6203 Africa Rice Center Guinea 10.38333 -9.3 Meyer et al. (2016) SRP071857 SRR3231749 Oryza glaberrima TOG7135 Africa Rice Center Senegal 12.65 -15.4667 Meyer et al. (2016) SRP071857 SRR3231750 Oryza glaberrima TOG7197 Africa Rice Center Cote d'Ivoire 7.4 -7.55 Meyer et al. (2016) SRP071857 SRR3231751 Oryza glaberrima TVA6749 Naturalis Biodiversity Center Suriname 5.923333 -55.54 Van Andel et al. (2016) NA NA Oryza glaberrima TVA6745 Naturalis Biodiversity Center Suriname 4.421389 -55.3703 Unpublished NA NA


Oryza barthii

Species | Accession number | Collection | Country | Cluster | Coverage | Source publication | SRA study | SRA run
Oryza barthii | IRGC100122 | International Rice Research Institute | Gambia | OB-V | high | Wang et al. (2014) | SRP037996 | SRR1206362
Oryza barthii | IRGC100921 | International Rice Research Institute | Unknown | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206381
Oryza barthii | IRGC100922 | International Rice Research Institute | Unknown | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206382
Oryza barthii | IRGC100927 | International Rice Research Institute | Sierra Leone | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206383
Oryza barthii | IRGC100931 | International Rice Research Institute | Mali | OB-I | high | Wang et al. (2014) | SRP037996 | SRR1206363
Oryza barthii | IRGC100934 | International Rice Research Institute | Mali | OB-V | high | Wang et al. (2014) | SRP037996 | SRR1206364
Oryza barthii | IRGC100939 | International Rice Research Institute | Unknown | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206384
Oryza barthii | IRGC101240 | International Rice Research Institute | Mali | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206385
Oryza barthii | IRGC101252 | International Rice Research Institute | Burkina Faso | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206386
Oryza barthii | IRGC101381 | International Rice Research Institute | Niger | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206387
Oryza barthii | IRGC101959 | International Rice Research Institute | Senegal | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206388
Oryza barthii | IRGC103534 | International Rice Research Institute | Mali | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206389
Oryza barthii | IRGC103895 | International Rice Research Institute | Senegal | OB-V | high | Wang et al. (2014) | SRP037996 | SRR1206365
Oryza barthii | IRGC103912 | International Rice Research Institute | Tanzania | OB-II | high | Wang et al. (2014) | SRP037996 | SRR1206370
Oryza barthii | IRGC104084 | International Rice Research Institute | Nigeria | OB-V | high | Wang et al. (2014) | SRP037996 | SRR1206366
Oryza barthii | IRGC104119 | International Rice Research Institute | Chad | OB-II | high | Wang et al. (2014) | SRP037996 | SRR1206367
Oryza barthii | IRGC105608 | International Rice Research Institute | Cameroon | OB-II | high | Wang et al. (2014) | SRP037996 | SRR1206368
Oryza barthii | IRGC106234 | International Rice Research Institute | Sierra Leone | OB-IV | high | Wang et al. (2014) | SRP037996 | SRR1206369
Oryza barthii | WAB0009239 | Africa Rice Center | Nigeria | OB-III | low | Wang et al. (2014) | SRP037996 | SRR1206390
Oryza barthii | WAB0009240 | Africa Rice Center | Cameroon | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206391
Oryza barthii | WAB0012712 | Africa Rice Center | Mali | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206392
Oryza barthii | WAB0024904 | Africa Rice Center | Nigeria | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206393
Oryza barthii | WAB0026768 | Africa Rice Center | Mali | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206394
Oryza barthii | WAB0026769 | Africa Rice Center | Chad | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206395
Oryza barthii | WAB0026770 | Africa Rice Center | Nigeria | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206396
Oryza barthii | WAB0028874 | Africa Rice Center | Gambia | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206397
Oryza barthii | WAB0028875 | Africa Rice Center | Mali | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206398
Oryza barthii | WAB0028876 | Africa Rice Center | Guinea | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206409
Oryza barthii | WAB0028877 | Africa Rice Center | Niger | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206400
Oryza barthii | WAB0028882 | Africa Rice Center | Cameroon | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206401
Oryza barthii | WAB0028884 | Africa Rice Center | Cameroon | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206402
Oryza barthii | WAB0028885 | Africa Rice Center | Mali | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206403
Oryza barthii | WAB0028887 | Africa Rice Center | Tanzania | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206404
Oryza barthii | WAB0028889 | Africa Rice Center | Guinea | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206405
Oryza barthii | WAB0028893 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206406
Oryza barthii | WAB0028894 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206407
Oryza barthii | WAB0028896 | Africa Rice Center | Mali | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206408
Oryza barthii | WAB0028897 | Africa Rice Center | Mali | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206409
Oryza barthii | WAB0028900 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206410
Oryza barthii | WAB0028903 | Africa Rice Center | Zambia | OB-II | high | Wang et al. (2014) | SRP037996 | SRR1206371
Oryza barthii | WAB0028905 | Africa Rice Center | Senegal | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206411
Oryza barthii | WAB0028907 | Africa Rice Center | Senegal | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206412
Oryza barthii | WAB0028910 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206413
Oryza barthii | WAB0028911 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206414
Oryza barthii | WAB0028912 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206415
Oryza barthii | WAB0028913 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206416
Oryza barthii | WAB0028915 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206417
Oryza barthii | WAB0028916 | Africa Rice Center | Mali | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206418
Oryza barthii | WAB0028917 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206419
Oryza barthii | WAB0028919 | Africa Rice Center | Chad | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206420
Oryza barthii | WAB0028925 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206421
Oryza barthii | WAB0028926 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206422
Oryza barthii | WAB0028927 | Africa Rice Center | Chad | OB-III | low | Wang et al. (2014) | SRP037996 | SRR1206423
Oryza barthii | WAB0028929 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206424
Oryza barthii | WAB0028930 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206425
Oryza barthii | WAB0028931 | Africa Rice Center | Chad | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206426
Oryza barthii | WAB0028934 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206427
Oryza barthii | WAB0028937 | Africa Rice Center | Nigeria | OB-III | low | Wang et al. (2014) | SRP037996 | SRR1206428
Oryza barthii | WAB0028938 | Africa Rice Center | Nigeria | OB-III | high | Wang et al. (2014) | SRP037996 | SRR1206429
Oryza barthii | WAB0028940 | Africa Rice Center | Nigeria | OB-III | low | Wang et al. (2014) | SRP037996 | SRR1206429
Oryza barthii | WAB0028942 | Africa Rice Center | Cameroon | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206430
Oryza barthii | WAB0028944 | Africa Rice Center | Cameroon | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206431
Oryza barthii | WAB0028946 | Africa Rice Center | Cameroon | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206432
Oryza barthii | WAB0028947 | Africa Rice Center | Cameroon | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206433
Oryza barthii | WAB0028948 | Africa Rice Center | Cameroon | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206434
Oryza barthii | WAB0028952 | Africa Rice Center | Zambia | OB-III | high | Wang et al. (2014) | SRP037996 | SRR1206373
Oryza barthii | WAB0028956 | Africa Rice Center | Guinea | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206435
Oryza barthii | WAB0028957 | Africa Rice Center | Guinea | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206436
Oryza barthii | WAB0028958 | Africa Rice Center | Mali | OB-II | high | Wang et al. (2014) | SRP037996 | SRR1206374
Oryza barthii | WAB0028959 | Africa Rice Center | Mali | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206437
Oryza barthii | WAB0028961 | Africa Rice Center | Mali | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206438
Oryza barthii | WAB0028967 | Africa Rice Center | Gambia | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206439
Oryza barthii | WAB0028972 | Africa Rice Center | Gambia | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206440
Oryza barthii | WAB0028975 | Africa Rice Center | Mali | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206441
Oryza barthii | WAB0028976 | Africa Rice Center | Mali | OB-I | high | Wang et al. (2014) | SRP037996 | SRR1206375
Oryza barthii | WAB0028977 | Africa Rice Center | Mali | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206442
Oryza barthii | WAB0028979 | Africa Rice Center | Mali | OB-I | high | Wang et al. (2014) | SRP037996 | SRR1206376
Oryza barthii | WAB0028980 | Africa Rice Center | Mali | OB-I | high | Wang et al. (2014) | SRP037996 | SRR1206377
Oryza barthii | WAB0028981 | Africa Rice Center | Mali | OB-I | low | Wang et al. (2014) | SRP037996 | SRR1206443
Oryza barthii | WAB0028983 | Africa Rice Center | Cameroon | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206444
Oryza barthii | WAB0028985 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206445
Oryza barthii | WAB0028987 | Africa Rice Center | Nigeria | OB-IV | high | Wang et al. (2014) | SRP037996 | SRR1206378
Oryza barthii | WAB0028989 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206446
Oryza barthii | WAB0028991 | Africa Rice Center | Chad | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206447
Oryza barthii | WAB0028992 | Africa Rice Center | Chad | OB-III | high | Wang et al. (2014) | SRP037996 | SRR1206379
Oryza barthii | WAB0028993 | Africa Rice Center | Cameroon | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206448
Oryza barthii | WAB0028994 | Africa Rice Center | Nigeria | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206449
Oryza barthii | WAB0028996 | Africa Rice Center | Nigeria | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206450
Oryza barthii | WAB0028997 | Africa Rice Center | Nigeria | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206451
Oryza barthii | WAB0028998 | Africa Rice Center | Chad | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206452
Oryza barthii | WAB0029000 | Africa Rice Center | Botswana | OB-II | low | Wang et al. (2014) | SRP037996 | SRR1206453
Oryza barthii | WAB0030151 | Africa Rice Center | Chad | OB-IV | high | Wang et al. (2014) | SRP037996 | SRR1206380
Oryza barthii | WAB0030173 | Africa Rice Center | Unknown | OB-V | low | Wang et al. (2014) | SRP037996 | SRR1206454
Oryza barthii | WAB0030186 | Africa Rice Center | Mali | OB-IV | low | Wang et al. (2014) | SRP037996 | SRR1206455

Geographic origin of used accessions

Figure S2: Geographic origin of used accessions. A. Number of O. glaberrima accessions collected per country. B. Number of O. barthii accessions collected per country.


APPENDIX C: FILTERING PARAMETERS

Table S2: Description of filters used for quality control. All filter recommendations were taken from GATK Best Practices (Broad Institute, n.d.), with the exception of ‘Missing data’.

Filter | Meaning | Rationale | Remark
DP | Depth of Coverage | Removes sites with excessive coverage caused by alignment artefacts. | Sequencing depth of more than 5 or 6 sigma from the mean depth is extremely unlikely given a Gaussian distribution of read depth.
QD | Quality by Depth | Removes sites with low confidence (variant quality divided by depth). | The Phred quality score is normalised to avoid inflation caused by deep sequencing.
MQ | RMS Mapping Quality | Removes sites with a low Root Mean Square (RMS) mapping quality. | The root of the mean square is taken to account for variability in the dataset.
MQRankSum | Mapping Quality Rank Sum Test | Removes sites where the mapping qualities of the reference and alternate alleles are not comparable. | The u-based z-approximation compares the mapping qualities of the reads supporting the reference allele and the alternate allele.
ReadPosRankSum | Read Position Rank Sum Test | Removes sites where one allele occurs more frequently at the end of reads than the other, because this is where sequencing errors occur. | The u-based z-approximation tests whether the positions of the reference and alternate alleles are different within the reads.
FS | Strand Bias (Fisher's exact test) | Removes sites where the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. | The probability of strand bias is Phred-scaled.
Missing data | Percentage of missing genotypes | Removes sites where the call rate is low. | The number of genotypes that are not called is divided by the total number of genotypes.

Table S3: Thresholds of filters used for quality control. Parameters were set to improve the Ts:Tv ratio while retaining a sufficient number of SNPs. Both lenient and stringent thresholds were applied when multiple cut-offs seemed justified, resulting in different versions of the same call set. Stringent thresholds used for call set 1b are highlighted in blue.

Call set 1

Call set | O. barthii accessions | O. glaberrima accessions | DP | QD | MQ | MQRankSum | ReadPosRankSum | FS | Missing data | SNP count (before filtering) | Ts:Tv ratio (before filtering) | SNP count (after filtering) | Ts:Tv ratio (after filtering)
1a: Lenient | 94 | 112 | >4800 | <2 | <38 | <-5 | <-2.5, >2.5 | >60 | >25% | 10759580 | 2.44 | 3923601 | 2.57
1b: Stringent | 94 | 112 | >4800 | <21 | <38 | <-5 | <-2.5, >2.5 | >30 | >25% | 10759580 | 2.44 | 2644126 | 2.65

Call set 2

Call set | O. barthii accessions | O. glaberrima accessions | DP | QD | MQ | MQRankSum | ReadPosRankSum | FS | Missing data | SNP count (before filtering) | Ts:Tv ratio (before filtering) | SNP count (after filtering) | Ts:Tv ratio (after filtering)
2a: Lenient | NA | 113 | >4000 | <2 | <40 | <-5 | <-2.5, >2.5 | >60 | >15% | 4163814 | 2.43 | 1485239 | 2.53
2b: Stringent | NA | 113 | >3000 | <21 | <40 | <-5 | <-2.5, >2.5 | >60 | >10% | 4163814 | 2.43 | 791674 | 2.72
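The removal criteria listed for call set 1b in Table S3 can be read as a simple per-site predicate. The following is a minimal Python sketch under stated assumptions: each site is represented as a dictionary of annotation values, and the function name `fails_filters`, the dictionary keys and the rule that absent annotations are not tested are all illustrative choices, not the pipeline that was actually run.

```python
# Removal thresholds for call set 1b (stringent), copied from Table S3.
# A site is removed when any of these conditions holds.
THRESHOLDS_1B = {
    "DP_max": 4800,             # remove if DP > 4800
    "QD_min": 21,               # remove if QD < 21
    "MQ_min": 38,               # remove if MQ < 38
    "MQRankSum_min": -5.0,      # remove if MQRankSum < -5
    "ReadPosRankSum_abs": 2.5,  # remove if ReadPosRankSum < -2.5 or > 2.5
    "FS_max": 30,               # remove if FS > 30
    "missing_max": 0.25,        # remove if more than 25% of genotypes are missing
}


def fails_filters(site, t=THRESHOLDS_1B):
    """Return the names of the filters a site fails (empty list: site is kept).

    Assumption: annotations that are absent from `site` (or None) are skipped.
    """
    checks = [
        ("DP", lambda v: v > t["DP_max"]),
        ("QD", lambda v: v < t["QD_min"]),
        ("MQ", lambda v: v < t["MQ_min"]),
        ("MQRankSum", lambda v: v < t["MQRankSum_min"]),
        ("ReadPosRankSum", lambda v: abs(v) > t["ReadPosRankSum_abs"]),
        ("FS", lambda v: v > t["FS_max"]),
        ("missing", lambda v: v > t["missing_max"]),
    ]
    return [name for name, is_bad in checks
            if site.get(name) is not None and is_bad(site[name])]


# Hypothetical example sites (values invented for illustration only).
good_site = {"DP": 1500, "QD": 25.0, "MQ": 55.0, "MQRankSum": 0.1,
             "ReadPosRankSum": 0.3, "FS": 2.0, "missing": 0.05}
bad_site = dict(good_site, QD=10.0, FS=45.0)
print(fails_filters(good_site), fails_filters(bad_site))
```

Returning the list of failed filters, rather than a single boolean, makes it easy to count how many sites each filter removes, which is the kind of breakdown shown per filter class in the figures of this appendix.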


Call set 1: O. glaberrima and O. barthii

Figure S3: SNP count and Ts:Tv ratio grouped by filter class. Applied filter thresholds are indicated in red. Mean Ts:Tv ratio before filtering is indicated in green. Graphs corresponding to stringent filter thresholds used for call set 1b are highlighted in blue.
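The Ts:Tv ratio used as a quality indicator here is the number of transitions (A<->G, C<->T) divided by the number of transversions. A minimal Python sketch, assuming biallelic SNPs given as (reference, alternate) base pairs; the function names are illustrative and not taken from the tools used in this thesis.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Ts:Tv ratio of an iterable of (ref, alt) pairs.

    Pairs with identical or non-ACGT alleles are ignored.
    Returns NaN when there are no transversions.
    """
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in BASES or alt not in BASES:
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


# Three transitions and one transversion.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ts_tv_ratio(example))
```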


Call set 2: O. glaberrima

Figure S4: SNP count and Ts:Tv ratio grouped by filter class. Applied filter thresholds are indicated in red. Average Ts:Tv ratio before filtering is indicated in green. Graphs corresponding to stringent filter thresholds used for call set 1b are highlighted in blue.


APPENDIX D: CALL SET COMPARISON

Fixation index and nucleotide diversity

Figure S5: Effect of filtering thresholds on nucleotide diversity and fixation index. For illustration purposes, only chromosome 1 is shown. A. Nucleotide diversity in both call sets. B. Fixation index in both call sets. As expected, nucleotide diversity is reduced in the stringent call set. FST is not markedly different. Because both statistics follow the same trend in both call sets, it was concluded that filtering affects only the absolute magnitude of the genetic diversity estimates, and not the relative diversity.

Table S4: Mean depth of coverage, fraction of missing data and heterozygosity per individual of call set 1a and 1b. The relatively equal heterozygosity levels in call set 1a are predominantly caused by the comparatively low coverage O. barthii accessions. When excessively low coverage accessions (<4X) or lower quality SNPs (QD<21) are removed, O. barthii is shown to be more heterozygous than O. glaberrima.

Statistic | 1a: Total | 1a: O. barthii | 1a: O. glaberrima | 1b: Total | 1b: O. barthii | 1b: O. glaberrima
Depth of coverage | 11.26 | 5.58 | 16.03 | 10.5 | 5.28 | 14.87
Fraction missing data | 0.08 | 0.15 | 0.02 | 0.08 | 0.16 | 0.01
Mean heterozygosity (all accessions) | 4.95% | 4.67% | 5.19% | 1.59% | 2.49% | 0.84%
Mean heterozygosity (accessions >4X) | 6.15% | 9.29% | 5.33% | 1.91% | 6.08% | 0.87%
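Per-individual statistics of the kind summarised in Table S4 can be derived directly from genotype calls. The sketch below is a hedged illustration in Python: it assumes VCF-style genotype strings such as "0/1" and "./.", and it assumes heterozygosity is defined as the fraction of called genotypes that are heterozygous, which may differ from the exact definition used to produce the table. The function name `individual_stats` is hypothetical.

```python
def individual_stats(genotypes):
    """Return (fraction_missing, heterozygosity) for one individual.

    genotypes: list of VCF-style diploid genotype strings, e.g. "0/0", "0|1", "./.".
    Heterozygosity is computed over called genotypes only (assumption).
    """
    if not genotypes:
        return float("nan"), float("nan")
    # A genotype counts as missing when any allele is ".".
    called = [g for g in genotypes if "." not in g]
    n_missing = len(genotypes) - len(called)
    # A call is heterozygous when its two alleles differ.
    n_het = sum(1 for g in called
                if len(set(g.replace("|", "/").split("/"))) > 1)
    frac_missing = n_missing / len(genotypes)
    heterozygosity = n_het / len(called) if called else float("nan")
    return frac_missing, heterozygosity


# Hypothetical individual with five sites: one missing, two heterozygous.
calls = ["0/0", "0/1", "1/1", "./.", "0|1"]
print(individual_stats(calls))
```

Because low sequencing depth tends to reduce the chance of sampling both alleles at a heterozygous site, such estimates are best compared between individuals of similar coverage, which is the motivation for the ">4X" rows in the table.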


Individual statistics (call set 1a)

Figure S6: Individual site statistics based on a complete SNP set (call set 1a). Mean sequencing depth (upper panel), call rate (middle panel) and heterozygosity (lower panel) per individual, shown separately for O. barthii and O. glaberrima.

Individual statistics (call set 1b)

Figure S7: Individual site statistics based on a reduced SNP set (call set 1b). Mean sequencing depth (upper panel), call rate (middle panel) and heterozygosity (lower panel) per individual, shown separately for O. barthii and O. glaberrima.

APPENDIX E: DIVERSITY STATISTICS

Figure S8: SNP density in O. barthii and O. glaberrima (variants per kb). 100kb sliding windows along chromosomes 1-6.

Figure S9: SNP density in O. barthii and O. glaberrima (variants per kb). 100kb sliding windows along chromosomes 7-12.

Figure S10: Nucleotide diversity (π) in O. barthii and O. glaberrima. 100kb sliding windows along chromosomes 1-6.


Figure S11: Nucleotide diversity (π) in O. barthii and O. glaberrima. 100kb sliding windows along chromosomes 7-12.
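Windowed nucleotide diversity of the kind plotted in Figures S10 and S11 can be sketched as follows. This Python example is an illustration under assumptions, not the software used for the figures: it takes biallelic SNPs as (position, alternate allele count, number of sampled chromosomes), uses the unbiased per-site estimator 2k(n-k)/(n(n-1)), and divides by the full window length, which implicitly treats every base in the window as callable. The step size defaults to the window size because the source does not state it.

```python
def site_pi(alt_count, n_chrom):
    """Unbiased per-site nucleotide diversity for a biallelic SNP."""
    if n_chrom < 2:
        return 0.0
    return 2.0 * alt_count * (n_chrom - alt_count) / (n_chrom * (n_chrom - 1))


def windowed_pi(sites, chrom_length, window=100_000, step=100_000):
    """Nucleotide diversity per bp in windows along one chromosome.

    sites: iterable of (position, alt_count, n_chrom), positions 1-based.
    Returns a list of (start, end, pi_per_bp).
    Assumption: all bases in a window are callable (denominator = window length).
    """
    sites = sorted(sites)
    result = []
    start = 1
    while start <= chrom_length:
        end = min(start + window - 1, chrom_length)
        total = sum(site_pi(k, n) for pos, k, n in sites if start <= pos <= end)
        result.append((start, end, total / (end - start + 1)))
        start += step
    return result


# Hypothetical toy data: two SNPs on a 200 kb chromosome, 4 sampled chromosomes.
toy_sites = [(10, 1, 4), (150_000, 2, 4)]
toy_windows = windowed_pi(toy_sites, chrom_length=200_000)
print(toy_windows)
```

Setting `step` smaller than `window` gives overlapping sliding windows; the linear scan over all sites per window is kept for clarity and would be replaced by an indexed lookup for genome-scale data.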


Figure S12: Tajima’s D in O. barthii and O. glaberrima. 100kb sliding windows along chromosomes 1-6.


Figure S13: Tajima’s D in O. barthii and O. glaberrima. 100kb sliding windows along chromosomes 7-12.
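Tajima's D, as plotted per window in Figures S12 and S13, compares the mean number of pairwise differences with the number of segregating sites scaled by the sample size. The Python sketch below implements the standard formula of Tajima (1989) for a single window; the function name and the choice to return NaN for degenerate inputs are assumptions made for illustration, and this is not the program that produced the figures.

```python
import math


def tajimas_d(pi, num_segregating, n):
    """Tajima's D for one window.

    pi: mean number of pairwise differences (summed over sites in the window).
    num_segregating: number of segregating sites S in the window.
    n: number of sampled sequences (chromosomes).
    Returns NaN when S == 0 or n < 4 (the variance term is zero or degenerate).
    """
    S = num_segregating
    if n < 4 or S == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    if variance <= 0:
        return float("nan")
    # Numerator: difference between pi and Watterson's estimator S / a1.
    return (pi - S / a1) / math.sqrt(variance)


# Worked example: 10 sequences, 16 segregating sites, pi = 3.888889.
d_example = tajimas_d(3.888889, 16, 10)
print(round(d_example, 3))
```

A negative value, as in this worked example, indicates an excess of low-frequency variants relative to the neutral expectation; a positive value indicates an excess of intermediate-frequency variants.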


APPENDIX F: CHOICE OF OUTGROUP

Oryza is a large genus and reference genomes are available for many related species. Three potential outgroups were identified to polarise SNPs, in order of decreasing genetic distance to O. glaberrima: Oryza punctata, Oryza meridionalis, and Oryza longistaminata. These candidates were selected based on known divergence times and genomic distances to O. japonica, which, as a sister taxon, is expected to be as distant from the outgroup as O. glaberrima is (see Figure S14). A closely related outgroup offers a better alignment, but carries the risk of incomplete lineage sorting and therefore of incorrectly assigned ancestral alleles. A distantly related outgroup circumvents this problem but is more difficult to align, and hence causes a larger loss of data. O. longistaminata was rejected because of its low genomic divergence from O. glaberrima (~2%). O. punctata was rejected because of its high genomic divergence (>5%) and associated data loss (see Table S5). For the genetic distances reported for the candidate species, saturation analyses show a good fit between the corrected pairwise divergence and the uncorrected P-distance (Zhu et al., 2014). For this reason, homoplasy was not considered and no correction for multiple substitutions was applied.
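Polarising a SNP with an outgroup means taking the outgroup allele as ancestral whenever it matches one of the two alleles observed in the ingroup. The Python sketch below illustrates that rule and one possible way to quantify data loss; both function names are hypothetical, and treating every unalignable or non-matching outgroup base as "lost" is an assumption about how data loss could be measured, not a description of how the percentages in this appendix were obtained.

```python
def polarise(ref, alt, outgroup):
    """Return (ancestral, derived) or None if the SNP cannot be polarised.

    outgroup: base observed in the outgroup at the aligned position,
    or None when the position does not align.
    """
    if outgroup is None:
        return None
    ref, alt, out = ref.upper(), alt.upper(), outgroup.upper()
    if out == ref:
        return (ref, alt)
    if out == alt:
        return (alt, ref)
    # Outgroup carries a third allele or an ambiguous base such as "N".
    return None


def data_loss(snps):
    """Fraction of (ref, alt, outgroup) SNPs that cannot be polarised."""
    snps = list(snps)
    if not snps:
        return float("nan")
    lost = sum(1 for ref, alt, out in snps if polarise(ref, alt, out) is None)
    return lost / len(snps)


# Hypothetical SNPs: two polarisable, one unaligned, one ambiguous.
toy_snps = [("A", "G", "A"), ("C", "T", "T"), ("G", "A", None), ("T", "C", "N")]
print([polarise(*s) for s in toy_snps], data_loss(toy_snps))
```

This simple parsimony rule assumes that the outgroup allele has not itself changed since divergence, which is why incomplete lineage sorting in a close outgroup and multiple substitutions in a distant one both matter for the choice discussed above.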

Table S5: Three potential outgroups under consideration. All are wild species belonging to the Oryza genus (Zhu et al., 2014). Genetic distance is based on 53 nuclear genes and 16 intergenic regions. Data loss was estimated from ~1000 random SNPs.

Species | Origin | Clade | Genetic distance | Divergence time | Data loss
O. longistaminata | Africa | AA | 0.0216 | 2.42 mya | 38%
O. meridionalis | Australia | AA | 0.0301 | 2.93 mya | 66%
O. punctata | Africa | BB | 0.0637 | 9.11 mya | 79%

Figure S14: Divergence times between Oryza species of the AA clade. O. punctata is the only member of the BB clade and is included as an outgroup (Zhu et al., 2014).


APPENDIX G: CANDIDATE REGIONS UNDER SELECTION

OmegaPlus outliers

Table S6: Candidate selective sweeps unique to O. glaberrima. Candidate selective sweeps are defined as outlier positions with the top 0.5% highest CLR scores across the genome, based on the likelihood method developed by Kim & Nielsen (2004).

Chromosome | Position | Log(CLR) | Chromosome | Position | Log(CLR)
1 | 1426959 | 3.237867771 | 6 | 22426628 | 3.116000663
1 | 6808194 | 3.136167033 | 6 | 22451658 | 3.814361598
1 | 6833223 | 3.620796766 | 6 | 24053578 | 3.122649421
1 | 11889081 | 3.689188881 | 8 | 1078187 | 3.717595158
1 | 12239487 | 4.560417696 | 8 | 1103237 | 4.717364514
1 | 13340763 | 3.408787603 | 9 | 7189035 | 3.276112502
1 | 15443199 | 3.331975127 | 9 | 7639764 | 3.184903195
2 | 6658489 | 3.415667253 | 10 | 1351828 | 3.263180095
3 | 20764430 | 3.299420917 | 10 | 1376848 | 3.508927986
3 | 24278996 | 3.778492756 | 10 | 1451911 | 3.248242104
3 | 24303922 | 3.130393009 | 10 | 9808814 | 5.146302872
4 | 3179772 | 3.216416884 | 10 | 9833835 | 5.01910125
4 | 9013383 | 3.253914691 | 10 | 9883877 | 4.164606657
4 | 12067897 | 3.802813514 | 10 | 13111586 | 3.699459339
4 | 12318267 | 3.255005304 | 10 | 13161628 | 3.333576247
4 | 17350682 | 3.149078873 | 10 | 16814692 | 4.179178298
5 | 6211752 | 3.409361167 | 10 | 16839712 | 3.812071059
5 | 22015778 | 3.410409784 | 10 | 16864732 | 3.630679045
6 | 801346 | 5.080582611 | 11 | 12547320 | 8.685175038
6 | 826375 | 4.94629044 | 11 | 16779756 | 3.3205646
6 | 951520 | 3.580609711 | 11 | 16804800 | 3.238301087
6 | 9311207 | 3.423120602 | 11 | 16829844 | 3.473514074
6 | 9361265 | 3.500493033 | 11 | 18257352 | 3.661359608
6 | 11814107 | 4.601525322 | 12 | 12223437 | 3.485106635
6 | 11839136 | 4.601525322 | 12 | 12248446 | 3.357935155
6 | 16669733 | 3.377263266 | 12 | 12273455 | 3.317696418
6 | 16694762 | 3.516827286 | 12 | 12298464 | 3.789117065
6 | 21250218 | 3.254181152 | 12 | 12648590 | 3.268978613
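Selecting the genome-wide top 0.5% of scores, as in the outlier definition of Table S6, amounts to ranking all scored positions and keeping the highest-scoring fraction. A minimal Python sketch follows; the function name `top_fraction_outliers`, the tuple layout and the rounding rule (round the number of kept positions up, keep at least one) are illustrative assumptions rather than the procedure documented for this analysis.

```python
import math


def top_fraction_outliers(scores, fraction=0.005):
    """Return the highest-scoring `fraction` of positions.

    scores: list of (chromosome, position, score) tuples for the whole genome.
    The result is sorted by chromosome and position.
    Assumption: the number of kept positions is rounded up, minimum one.
    """
    if not scores:
        return []
    n_keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=lambda s: s[2], reverse=True)
    return sorted(ranked[:n_keep], key=lambda s: (s[0], s[1]))


# Hypothetical scan: 400 positions on one chromosome with increasing scores.
toy_scores = [(1, i * 1000, float(i)) for i in range(400)]
toy_outliers = top_fraction_outliers(toy_scores)
print(toy_outliers)
```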

Associated genes

We identified 1120 putative moderate to high impact mutations in 278 genes located less than 25 kb away from candidate sweeps. Of these mutations, 34 had a severe impact and were located in 17 different genes. These genes, their predicted function and the nature of their associated high impact mutations can be found in Table S7 and Table S8, respectively.
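The association between genes and candidate sweeps rests on a distance criterion of less than 25 kb. The Python sketch below shows one way to express it, assuming the distance is measured from the sweep position to the nearest gene boundary (zero if the sweep lies inside the gene); that convention and the function names are assumptions. The first two genes and the sweep position are taken from Tables S6 and S7, while `GENE_FAR` is a hypothetical entry added only to show a non-match.

```python
def distance_to_gene(pos, start, end):
    """Distance in bp from a position to a gene interval (0 if inside)."""
    if start <= pos <= end:
        return 0
    return min(abs(pos - start), abs(pos - end))


def genes_near_sweeps(sweeps, genes, max_dist=25_000):
    """Return IDs of genes lying less than `max_dist` bp from any sweep.

    sweeps: list of (chromosome, position).
    genes: list of (gene_id, chromosome, start, end).
    """
    hits = []
    for gene_id, chrom, start, end in genes:
        for s_chrom, s_pos in sweeps:
            if s_chrom == chrom and distance_to_gene(s_pos, start, end) < max_dist:
                hits.append(gene_id)
                break  # one nearby sweep is enough to report the gene
    return hits


sweeps = [(1, 1426959)]  # first candidate sweep on chromosome 1 (Table S6)
genes = [
    ("ORGLA01G0020300", 1, 1421693, 1424058),  # from Table S7
    ("ORGLA01G0020500", 1, 1430691, 1431574),  # from Table S7
    ("GENE_FAR", 1, 2000000, 2001000),         # hypothetical distant gene
]
print(genes_near_sweeps(sweeps, genes))
```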


Table S7: Functions of genes associated with candidate sweeps. Gene functions are categorised based on Gene Ontology (GO) terms and are available through UniProtKB (The UniProt Consortium, 2017).

ID | Chromosome | Start | End | Strand | Description | GO category | Function | Activity
ORGLA01G0020300 | 1 | 1421693 | 1424058 | - | Uncharacterised | Molecular function | ATP binding | protein serine/threonine kinase activity
ORGLA01G0020500 | 1 | 1430691 | 1431574 | + | Uncharacterised | NA | NA | NA
ORGLA01G0020500 | 1 | 1430691 | 1431574 | + | Uncharacterised | NA | NA | NA
ORGLA01G0020500 | 1 | 1430691 | 1431574 | + | Uncharacterised | NA | NA | NA
ORGLA02G0085500 | 2 | 6652855 | 6655870 | - | Auxin responsive protein | Biological process | auxin-activated signaling pathway | regulation of transcription, DNA-templated
ORGLA02G0085500 | 2 | 6652855 | 6655870 | - | Auxin responsive protein | Biological process | auxin-activated signaling pathway | regulation of transcription, DNA-templated
ORGLA05G0069500 | 5 | 6221541 | 6224335 | - | Uncharacterised | NA | NA | NA
ORGLA05G0069600 | 5 | 6230822 | 6232728 | - | Uncharacterised | NA | NA | NA
ORGLA05G0069600 | 5 | 6230822 | 6232728 | - | Uncharacterised | NA | NA | NA
ORGLA05G0069600 | 5 | 6230822 | 6232728 | - | Uncharacterised | NA | NA | NA
ORGLA05G0069600 | 5 | 6230822 | 6232728 | - | Uncharacterised | NA | NA | NA
ORGLA06G0208300 | 6 | 21265153 | 21266448 | + | Uncharacterised | NA | NA | NA
ORGLA06G0225700 | 6 | 22433573 | 22434991 | + | Uncharacterised | NA | NA | NA
ORGLA06G0225500 | 6 | 22409271 | 22413960 | - | Uncharacterised | Molecular function | ADP binding | NA
ORGLA06G0225500 | 6 | 22409271 | 22413960 | - | Uncharacterised | Molecular function | ADP binding | NA
ORGLA06G0225500 | 6 | 22409271 | 22413960 | - | Uncharacterised | Molecular function | ADP binding | NA
ORGLA06G0225500 | 6 | 22409271 | 22413960 | - | Uncharacterised | Molecular function | ADP binding | NA
ORGLA06G0225500 | 6 | 22409271 | 22413960 | - | Uncharacterised | Molecular function | ADP binding | NA
ORGLA06G0225700 | 6 | 22433573 | 22434991 | + | Uncharacterised | NA | NA | NA
ORGLA09G0045900 | 9 | 7196691 | 7197119 | + | Uncharacterised | NA | NA | NA
ORGLA09G0045900 | 9 | 7196691 | 7197119 | + | Uncharacterised | NA | NA | NA
ORGLA09G0045900 | 9 | 7196691 | 7197119 | + | Uncharacterised | NA | NA | NA
ORGLA09G0046000 | 9 | 7204473 | 7205627 | - | Uncharacterised | NA | NA | NA
ORGLA09G0046000 | 9 | 7204473 | 7205627 | - | Uncharacterised | NA | NA | NA
ORGLA10G0070100 | 10 | 9785288 | 9786138 | + | Uncharacterised | NA | NA | NA
ORGLA10G0070100 | 10 | 9785288 | 9786138 | + | Uncharacterised | NA | NA | NA
ORGLA10G0150600 | 10 | 16798552 | 16799153 | - | Uncharacterised | NA | NA | NA
ORGLA11G0153900 | 11 | 16778633 | 16780936 | - | Uncharacterised | Biological process | regulation of monopolar cell growth | NA
ORGLA11G0154200 | 11 | 16829453 | 16829740 | + | Uncharacterised | Molecular function | ADP binding | NA
ORGLA11G0154200 | 11 | 16829453 | 16829740 | + | Uncharacterised | Molecular function | ADP binding | NA
ORGLA11G0154300 | 11 | 16834322 | 16835425 | + | Uncharacterised | Molecular function | ADP binding | NA
ORGLA11G0154400 | 11 | 16835730 | 16836384 | + | Uncharacterised | NA | NA | NA
ORGLA11G0154700 | 11 | 16847944 | 16853200 | + | Uncharacterised | Molecular function | ADP binding | NA
ORGLA11G0154700 | 11 | 16847944 | 16853200 | + | Uncharacterised | Molecular function | ADP binding | NA

Table S8: High impact mutations found in genes associated with candidate sweeps. Variant effects were predicted with SnpEff and filtered using SnpSift (Cingolani, Patel, et al., 2012; Cingolani, Platts, et al., 2012).

ID | Chromosome | Start | End | Position | Reference | Alternate | Variant | Impact | Effect | Codon change | Amino acid change
ORGLA01G0020300 | 1 | 1421693 | 1424058 | 1421693 | T | G | G | HIGH | stop_lost&splice_region_variant | c.1272A>C | p.Ter424Cysext*?
ORGLA01G0020500 | 1 | 1430691 | 1431574 | 1430757 | C | T | T | HIGH | stop_gained | c.67C>T | p.Arg23*
ORGLA01G0020500 | 1 | 1430691 | 1431574 | 1430760 | G | T | T | HIGH | stop_gained | c.70G>T | p.Gly24*
ORGLA01G0020500 | 1 | 1430691 | 1431574 | 1430889 | T | C | C | HIGH | stop_lost | c.199T>C | p.Ter67Glnext*?
ORGLA02G0085500 | 2 | 6652855 | 6655870 | 6655658 | G | T | T | HIGH | stop_gained | c.213C>A | p.Cys71*
ORGLA02G0085500 | 2 | 6652855 | 6655870 | 6655661 | G | T | T | HIGH | stop_gained | c.210C>A | p.Cys70*
ORGLA05G0069500 | 5 | 6221541 | 6224335 | 6222036 | C | G | G | HIGH | stop_lost | c.669G>C | p.Ter223Tyrext*?
ORGLA05G0069600 | 5 | 6230822 | 6232728 | 6230839 | G | A | A | HIGH | stop_gained | c.544C>T | p.Arg182*
ORGLA05G0069600 | 5 | 6230822 | 6232728 | 6230854 | G | A | A | HIGH | stop_gained | c.529C>T | p.Gln177*
ORGLA05G0069600 | 5 | 6230822 | 6232728 | 6231255 | C | T | T | HIGH | splice_acceptor_variant&intron_variant | c.130-2G>A |
ORGLA05G0069600 | 5 | 6230822 | 6232728 | 6231366 | G | A | A | HIGH | stop_gained | c.124C>T | p.Arg42*
ORGLA06G0208300 | 6 | 21265153 | 21266448 | 21265336 | G | T | T | HIGH | stop_gained | c.184G>T | p.Glu62*
ORGLA06G0225700 | 6 | 22433573 | 22434991 | 22434989 | T | C | C | HIGH | stop_lost&splice_region_variant | c.1282T>C | p.Ter428Argext*?
ORGLA06G0225500 | 6 | 22409271 | 22413960 | 22409393 | G | A | A | HIGH | stop_gained | c.4315C>T | p.Gln1439*
ORGLA06G0225500 | 6 | 22409271 | 22413960 | 22410158 | G | A | A | HIGH | stop_gained | c.3550C>T | p.Gln1184*
ORGLA06G0225500 | 6 | 22409271 | 22413960 | 22410310 | T | C | C | HIGH | stop_lost | c.3398A>G | p.Ter1133Trpext*?
ORGLA06G0225500 | 6 | 22409271 | 22413960 | 22410991 | T | C | C | HIGH | stop_lost | c.2717A>G | p.Ter906Trpext*?
ORGLA06G0225500 | 6 | 22409271 | 22413960 | 22411742 | G | A | A | HIGH | stop_gained | c.1966C>T | p.Gln656*
ORGLA06G0225700 | 6 | 22433573 | 22434991 | 22434989 | T | C | C | HIGH | stop_lost&splice_region_variant | c.1282T>C | p.Ter428Argext*?
ORGLA09G0045900 | 9 | 7196691 | 7197119 | 7196692 | T | C | C | HIGH | start_lost | c.2T>C | p.Met1?
ORGLA09G0045900 | 9 | 7196691 | 7197119 | 7197051 | C | T | T | HIGH | stop_gained | c.361C>T | p.Gln121*
ORGLA09G0045900 | 9 | 7196691 | 7197119 | 7197119 | A | G | G | HIGH | stop_lost&splice_region_variant | c.429A>G | p.Ter143Trpext*?
ORGLA09G0046000 | 9 | 7204473 | 7205627 | 7204742 | C | A | A | HIGH | stop_gained | c.886G>T | p.Glu296*
ORGLA09G0046000 | 9 | 7204473 | 7205627 | 7205263 | C | T | T | HIGH | stop_gained | c.365G>A | p.Trp122*
ORGLA10G0070100 | 10 | 9785288 | 9786138 | 9785597 | T | C | C | HIGH | stop_lost | c.310T>C | p.Ter104Argext*?
ORGLA10G0070100 | 10 | 9785288 | 9786138 | 9785822 | A | G | G | HIGH | splice_acceptor_variant&intron_variant | c.354-1A>G |
ORGLA10G0150600 | 10 | 16798552 | 16799153 | 16798654 | A | C | C | HIGH | splice_acceptor_variant&intron_variant | c.253-1T>G |
ORGLA11G0153900 | 11 | 16778633 | 16780936 | 16779436 | G | A | A | HIGH | stop_gained | c.1411C>T | p.Gln471*
ORGLA11G0154200 | 11 | 16829453 | 16829740 | 16829528 | C | T | T | HIGH | stop_gained | c.76C>T | p.Gln26*
ORGLA11G0154200 | 11 | 16829453 | 16829740 | 16829528 | C | T | T | HIGH | stop_gained | c.76C>T | p.Gln26*
ORGLA11G0154300 | 11 | 16834322 | 16835425 | 16835368 | T | A | A | HIGH | stop_gained | c.1047T>A | p.Tyr349*
ORGLA11G0154400 | 11 | 16835730 | 16836384 | 16836131 | G | A | A | HIGH | stop_gained | c.227G>A | p.Trp76*
ORGLA11G0154700 | 11 | 16847944 | 16853200 | 16848137 | G | A | A | HIGH | stop_gained | c.194G>A | p.Trp65*
ORGLA11G0154700 | 11 | 16847944 | 16853200 | 16851754 | T | G | G | HIGH | stop_gained | c.1305T>G | p.Tyr435*

APPENDIX H: POPULATION STRUCTURE ANALYSIS


Figure S15: Cross-validation (CV) error estimates of ADMIXTURE, with varying levels of K. A. CV error of the entire population (206 accessions), reaching a minimum at K=8. B. CV error of O. glaberrima (112 accessions), reaching a minimum at K=5. C. CV error of O. barthii (94 accessions), reaching a minimum at K=5. Bar plots of the corresponding structure analyses of A and B can be found in the main text.


Figure S16: Isolation by distance in the five genetic clusters of O. glaberrima. N = number of accessions in each population. r = Pearson correlation coefficient between the pairwise distances and kinship coefficients. Each dot represents the distance and relatedness between a unique pair of accessions within the population. Outliers are included. Samples falling outside the geographic range of West Africa and samples without known coordinates are excluded. A, B and C represent the three coastal populations. D and E represent the inland populations. F represents the entire West African population (N = 106) combined.
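The statistic reported in Figure S16 (Pearson's r between pairwise geographic distance and kinship) can be sketched in Python as below. The haversine great-circle distance, the accession names, the coordinates and the kinship values are all assumptions chosen for illustration; the distance measure and software actually used are described in the main text.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def isolation_by_distance(coords, kinship):
    """Correlate distance and kinship over every unique pair of accessions."""
    dists, kins = [], []
    for a, b in combinations(sorted(coords), 2):
        dists.append(haversine_km(*coords[a], *coords[b]))
        kins.append(kinship[(a, b)])
    return pearson_r(dists, kins)


# Hypothetical accessions (latitude, longitude) and kinship coefficients.
coords = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 2.0)}
kinship = {("A", "B"): 0.3, ("A", "C"): 0.1, ("B", "C"): 0.3}
r = isolation_by_distance(coords, kinship)
print(round(r, 3))
```

A negative r, as in this toy case where relatedness drops with distance, is the pattern expected under isolation by distance.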


APPENDIX I: WHOLE GENOME SPECIES TREE

Figure S17: Neighbour Joining (NJ) tree of 206 O. barthii and O. glaberrima sequences, based on 3,923,601 genome-wide SNPs. Leaves are labelled by accession number and coloured by species determination. The grey bar labelled ‘OB and OG’ indicates the smallest monophyletic clade containing all O. glaberrima and its nearest wild relatives.


APPENDIX J: GENES SELECTED FOR PHYLOGENETIC ANALYSIS

Table S9: Overview of genes selected for phylogenetic analysis. Columns show proposed domestication stage, function, source (bibliographic reference), the respective locations in the O. sativa and O. glaberrima genome and the identity between the O. glaberrima and O. sativa gene sequences. Domestication stage is only given for the genes taken from Li et al. (2017), because this was not reported by Wang et al. (2014). The number of segregating sites is based on phased haplotypes of 206 accessions (n=412 sequences), produced by concatenating the SNPs found in an interval spanning the gene plus 5 kb flanking regions.

Homologues of known domestication genes in Oryza sativa

Gene | Domestication stage | Function | Source | RAP-DB ID (Oryza sativa)⁵ | RAP-DB genomic coordinates (Oryza sativa) | BLAST+ ID (Oryza glaberrima) | BLAST+ genomic coordinates (Oryza glaberrima) | Sequence identity
OsLG1 | Domestication | Closed panicle | Li et al. (2017) & Wang et al. (2014) | Os04g0656500 | chr04:33488722:33492700 | ORGLA04G0242500 | chr04:24455085:24458509 | 99%
Sh4 | Domestication | Seed shattering | Li et al. (2017) & Wang et al. (2014) | Os04g0670900 | chr04:34231186:34233221 | ORGLA04G0254300 | chr04:25150788:25152622 | 98%
qSh1 | Early improvement | Seed shattering | Li et al. (2017) & Wang et al. (2014) | Os01g0848400 | chr01:36445456:36449951 | ORGLA01G0311100 | chr01:26836716:26840603 | 99%
Bh4 | Early improvement | Hull colour | Li et al. (2017) | Os04g0460200 | chr04:22969845:22971859 | ORGLA04G0119300 | chr04:15433128:15433925 | 99%
LABA1 | Early improvement | Awn barb | Li et al. (2017) | Os04g0518800 | chr04:25959399:25963504 | ORGLA04G0160000 | chr04:18237532:18241208 | 100%
OsC1 | Early improvement | Hull colour | Li et al. (2017) | Os06g0205100 | chr06:5315163:5316640 | ORGLA06G0059500 | chr06:4466255:4467544 | 98%
Phr1 | Later improvement | Grain discoloration | Li et al. (2017) & Wang et al. (2014) | Os04g0624500 | chr04:31749141:31751604 | ORGLA04G0229000 | chr04:23491785:23494110 | 93%
COLD1 | Later improvement | Chilling tolerance | Li et al. (2017) | Os04g0600800 | chr04:30311574:30316221 | ORGLA04G0206300 | chr04:21690765:21694785 | 99%
Waxy | Later improvement | Amylose content | Li et al. (2017) | Os06g0133000 | chr06:1765622:1770653 | ORGLA06G0020500 | chr06:1582446:1585771 | 99%
Rc | Later improvement | Red pericarp | Li et al. (2017) & Wang et al. (2014) | Os07g0211500 | chr07:6062889:6069304 | ORGLA07G0060000 | chr07:5268358:5270384 | 98%
Sd1 | | Semi-dwarfing | Wang et al. (2014) | Os01g0883800 | chr01:38382385:38385469 | ORGLA01G0334700 | chr01:28541899:28544581 | 98%
Gn1a | | Grain productivity | Wang et al. (2014) | Os01g0197700 | chr01:5270449:5275585 | ORGLA01G0059300 | chr01:4236446:4240917 | 98%
GW2 | | Grain width | Wang et al. (2014) | Os02g0244100 | chr02:8115223:8121651 | ORGLA02G0093000 | chr02:7561612:7567493 | 99%
GIF1 | | Grain incomplete filling | Wang et al. (2014) | Os04g0413500 | chr04:20422171:20426921 | ORGLA04G0089800 | chr04:13126506:13130922 | 99%
MOC1 | | Tillering | Wang et al. (2014) | Os06g0610350 | chr06:24313523:24316697 | ORGLA06G0172100 | chr06:17930149:17931657 | 99%
Sdr4 | | Seed dormancy | Wang et al. (2014) | Os07g0585700 | chr07:23796611:23797642 | ORGLA07G0155600 | chr07:16905004:16906038 | 97%
Ep2 | | Erect panicle | Wang et al. (2014) | Os07g0616000 | chr07:25381698:25389532 | ORGLA07G0170500 | chr07:18073478:18080578 | 99%
Badh2 | | Fragrance | Wang et al. (2014) | Os08g0424500 | chr08:20379823:20385975 | ORGLA08G0132000 | chr08:15046659:15052508 | 100%
IPA1 | | Ideal plant architecture | Wang et al. (2014) | Os08g0509600 | chr08:25274541:25278696 | ORGLA08G0176500 | chr08:19078252:19082033 | 99%
Dep1 | | Dense and erect panicle | Wang et al. (2014) | Os09g0441900 | chr09:16411151:16415851 | ORGLA09G0089300 | chr09:12041569:12045716 | 99%
Prog1 | Domestication | Plant architecture | Li et al. (2017) & Wang et al. (2014) | Os07g0153600 | chr07:2839194:2840089 | Not found | ? |
An-1 | Early improvement | Awn length | Li et al. (2017) | Os04g0350700 | chr04:16731738:16735336 | Not found | ? |
GS3 | | Grain size and weight | Wang et al. (2014) | Os03g0407400 | chr03:16729501:16735109 | Not found | ? |
OsSh1 | | Seed shattering | Wang et al. (2014) | Os03g0650000 | chr03:25197057:25206948 | Not found | Deleted |
Hd1 | | Heading date | Wang et al. (2014) | Os06g0275000 | chr06:9336376:9338569 | Not found | Deleted |

⁵ The Rice Annotation Project Database (RAP-DB) utilises the Oryza sativa ssp. japonica cv. Nipponbare genome and is accessible on http://rapdb.dna.affrc.go.jp/ (Sakai et al., 2013).

APPENDIX K: HAPLOTYPES OF DOMESTICATION GENES

Table S10: Phasing and phylogenetic output of genes selected for phylogenetic analysis. Data are based on genomic intervals spanning the gene, plus 5 kb on either side. Columns show the number of loci and the number of haplotypes per gene; the frequency of the most prevalent haplotype; and the number of leaves and sum of branch lengths (SBL) of the corresponding NJ tree. Bordered rows indicate genes in which one or more of the five largest haplotypes are composed of O. glaberrima individuals belonging to a single population, namely OG-II (red), or OG-IV (pink). NJ trees of these genes can be found in Figure 14 of the main text and in Figure S18 of this Appendix.

Gene Chromosome Start End Loci Haplotypes Highest frequency Leaves SBL

qSh1 1 26831714 26845603 91 63 0.55 412 2.63252201

Sd1 1 28536897 28549581 136 101 0.47 412 3.46575573

Gn1a 1 4231445 4245917 2 4 0.82 412 1.25

Gw2 2 7556611 7572493 207 92 0.4 412 2.59930774

Gif1 4 13121505 13135922 190 264 0.12 412 7.3951465

Bh4 4 15428126 15438925 119 72 0.65 412 2.18847926

LABA1 4 18232530 18246208 245 144 0.2 412 2.49139267

COLD1 4 21685763 21699785 63 37 0.42 412 1.84506179

Phr1 4 23486784 23499110 113 53 0.5 412 1.57358009

OsLG1 4 24450083 24463509 119 75 0.51 412 2.41602114

Sh4 4 25145786 25157622 131 102 0.29 412 2.6869362

Waxy 6 1577444 1590771 266 110 0.3 412 2.74459838

MOC1 6 17925148 17936657 96 67 0.58 412 2.64005463

OsC1 6 4461253 4472544 200 97 0.21 412 2.22526556

Sdr4 7 16900003 16911038 153 86 0.57 412 3.5857042

Rc 7 5263356 5275384 84 65 0.52 412 2.11471613

EP2 7 18068476 18085578 151 200 0.27 412 5.17523044

Badh2 8 15041658 15057508 114 96 0.5 412 2.4141847

IPA1 8 19073251 19087033 119 47 0.63 412 1.59925443

Dep1 9 12036567 12050716 241 119 0.42 412 2.73050577

Average 142 95 0.45 412 2.79
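The "Haplotypes" and "Highest frequency" columns of Table S10 summarise the phased sequences per gene. A minimal Python sketch of how such summaries can be derived is given below; the function name and the six toy haplotype strings are hypothetical and much smaller than the 412 phased sequences used in the analysis.

```python
from collections import Counter


def haplotype_summary(haplotypes):
    """Return (number of distinct haplotypes, frequency of the most common one)."""
    counts = Counter(haplotypes)
    # most_common(1) yields a single (haplotype, count) pair for the largest class.
    top_count = counts.most_common(1)[0][1]
    return len(counts), top_count / len(haplotypes)


# Hypothetical phased sequences, each written as the concatenated SNP alleles.
toy_haplotypes = ["ACGT", "ACGT", "ACGA", "ACGT", "TCGA", "ACGA"]
n_haplotypes, top_freq = haplotype_summary(toy_haplotypes)
print(n_haplotypes, top_freq)
```

In this toy set three distinct haplotypes occur and the most prevalent one accounts for half of the sequences.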


A. Phr1 (chromosome 4: 23,491,785–23,494,110)
B. MOC1 (chromosome 6: 17,930,149–17,931,657)
C. Rc (chromosome 7: 5,268,358–5,270,384)
D. IPA1 (chromosome 8: 19,078,252–19,082,033)

Figure S18: Separation of an OG-IV haplotype from the most prevalent O. glaberrima haplotype in multiple domestication genes. The five (or, in the case of equal haplotype counts, six) largest haplotypes are labelled. The most common haplotypes, containing accessions from multiple subpopulations of O. glaberrima, are collapsed into orange nodes. Haplotypes that consist exclusively of O. barthii are collapsed into blue nodes. Remaining haplotypes, consisting of a mix of O. barthii and O. glaberrima accessions from a single subpopulation, are expanded with branch colours reflecting the population of origin of the O. glaberrima accessions.


APPENDIX L: SELECTIVE SWEEP MODELS

Figure 19: Selective sweeps can produce different patterns of variation in the genome. A. The type of selection scan employed in this study was designed to detect hard sweeps, produced by persistent positive selection on a single de novo mutation. B. A soft sweep is produced when selection acts on a pre-existing variant that is already linked to multiple alleles. C. Alternatively, two new mutations with the same fitness effect arise at the same time; both are selected, but neither of them is swept to fixation. Adapted from Jensen (2014).

Figure 20: Parallel adaptation in geographically separated demes. Even under a global selection pressure, isolation by distance will delay the spread of a beneficial mutation, creating a time window for multiple beneficial mutations to arise. All three mutations in this figure respond to selection by increasing in frequency, although none of them is quick enough to sweep through the population and outcompete the others. This kind of parallel adaptation produces a pattern of variation as in Figure 19C, and can therefore be seen as a soft sweep. Adapted from Ralph & Coop (2010).


APPENDIX M: DNA EXTRACTION PROTOCOL

DNA extraction with CTAB

1. Place leaf tissue into 15 ml conical tubes; close carefully.
2. Put samples into liquid nitrogen.
3. Grind frozen tissue with mortar and pestle to a fine dust.
4. Scoop into 1.5 ml micro tubes, close lid and vortex on shaker to remove any clumps.
5. Centrifuge briefly to bring down tissue dust.
6. Add 300 μl CTAB and place in 65°C water bath for 30 minutes.
7. Allow to cool to room temperature.
8. Centrifuge briefly.
9. Add 300 μl chloroform under fume hood (use tip box for chloroform).
10. Seal tightly and vortex vigorously for 10-20 seconds.
11. Centrifuge at 3250 rpm for 15 minutes.
12. For each sample, add 200 μl of very cold (-20°C) isopropyl alcohol to a new micro tube.
13. Transfer 200 μl of the chloroform-extracted supernatant to the new micro tube and centrifuge at 3250 rpm for 15 minutes.
14. Discard the supernatant; the pellet of DNA should stay behind.
15. Wash with 200 μl of 70% ethanol and centrifuge at 3250 rpm for 7-10 minutes. Discard the supernatant.
16. Repeat the previous step.
17. Air dry the samples for 3 hours, or until the ethanol (and its smell) has disappeared.
18. Resuspend the pellets in 100 μl of TE or SDW and store at 4°C overnight.
19. Transfer DNA to skirted 96-well plates for storage.
20. Use 0.5 μl of DNA in a 10 μl PCR reaction.

Preparation of 1L 2X CTAB

For 2% CTAB, 1.4 M NaCl, 100 mM Tris, 20 mM EDTA, use:

- 20 g CTAB (cetyltrimethyl or hexadecyltrimethyl ammonium bromide)
- 81.82 g NaCl
- 100 ml 1 M Tris, pH 8
- 40 ml 0.5 M EDTA

Note: CTAB goes into solution slowly and can release toxic fumes if heated.
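The quantities in the recipe follow from the target concentrations and a final volume of 1 L. The short Python sketch below reproduces the arithmetic; the helper names are hypothetical, the molar mass of NaCl (58.44 g/mol) is a standard value not stated in the protocol, and the 2% CTAB is assumed to be weight per volume.

```python
NACL_MOLAR_MASS = 58.44  # g/mol, standard value (not given in the protocol)


def grams_for_molarity(target_molar, molar_mass, final_volume_l):
    """Mass of solid needed to reach a molar concentration in the final volume."""
    return target_molar * molar_mass * final_volume_l


def stock_volume_ml(target_molar, stock_molar, final_volume_l):
    """Volume of stock solution to dilute to the target concentration (C1*V1 = C2*V2)."""
    return target_molar / stock_molar * final_volume_l * 1000


def grams_for_percent_wv(percent, final_volume_l):
    """Mass for a weight/volume percentage (1% w/v = 1 g per 100 ml)."""
    return percent / 100 * final_volume_l * 1000


final_volume_l = 1.0
ctab_g = grams_for_percent_wv(2, final_volume_l)
nacl_g = grams_for_molarity(1.4, NACL_MOLAR_MASS, final_volume_l)
tris_ml = stock_volume_ml(0.100, 1.0, final_volume_l)
edta_ml = stock_volume_ml(0.020, 0.5, final_volume_l)
print(round(ctab_g, 2), round(nacl_g, 2), round(tris_ml, 2), round(edta_ml, 2))
```

Changing `final_volume_l` scales the recipe to other batch sizes under the same assumptions.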


REFERENCES

Broad Institute. (n.d.). GATK | Best Practices. Retrieved August 7, 2017, from https://software.broadinstitute.org/gatk/best-practices/

Cingolani, P., Patel, V. M., Coon, M., Nguyen, T., Land, S. J., Ruden, D. M., … Sturzenbaum, S. (2012). Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. https://doi.org/10.3389/fgene.2012.00035

Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., … Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6(2), 80–92.

Creuwels, J. (2017). Naturalis Biodiversity Center (NL) - Botany. Naturalis Biodiversity Center. Occurrence Dataset https://doi.org/10.15468/ib5ypt accessed via GBIF.org on 2017-09-24.

Genesys. (2017). About Genesys. Retrieved August 10, 2017, from https://www.genesys-pgr.org/content/about/about

Jensen, J. D. (2014). On the unfounded enthusiasm for soft selective sweeps. Nature Communications, 5, 5281. https://doi.org/10.1038/ncomms6281

Kim, Y., & Nielsen, R. (2004). Linkage Disequilibrium as a Signature of Selective Sweeps. Genetics, 167(3).

Li, L.-F., Li, Y.-L., Jia, Y., Caicedo, A. L., & Olsen, K. M. (2017). Signatures of adaptation in the weedy rice genome. Nature Genetics, 49(5), 811–814. https://doi.org/10.1038/ng.3825

National Geospatial-Intelligence Agency. (2017). Geographical Names. Retrieved from http://www.geographic.org/geographic_names/name.php?uni=-1355040&fid=4448&c=suriname

Portères, R. (1955). Présence ancienne d’une variété cultivée d’Oryza glaberrima St. en Guyane Française. Journal D’agriculture Tropicale et de Botanique Appliquée, 2(12), 680. https://doi.org/10.3406/jatba.1955.2270

Portères, R. (1960). Riz subspontanés et Riz sauvages en El Salvador (Amérique Centrale). Journal D’agriculture Tropicale et de Botanique Appliquée, 7(9), 441–446. https://doi.org/10.3406/jatba.1960.2626


Ralph, P., & Coop, G. (2010). Parallel adaptation: one or many waves of advance of an advantageous allele? Genetics, 186(2), 647–68. https://doi.org/10.1534/genetics.110.119594

Sakai, H., Lee, S. S., Tanaka, T., Numa, H., Kim, J., Kawahara, Y., … Itoh, T. (2013). Rice Annotation Project Database (RAP-DB): An Integrative and Interactive Database for Rice Genomics. Plant and Cell Physiology, 54(2), e6–e6. https://doi.org/10.1093/pcp/pcs183

The UniProt Consortium. (2017). UniProt: the Universal Protein knowledgebase. Nucleic Acids Research, 45(D1), D158–D169. https://doi.org/10.1093/nar/gkh131

Van Andel, T. R., Meyer, R. S., Aflitos, S. A., Carney, J. A., Veltman, M. A., Copetti, D., … Freedman, A. H. (2016). Tracing ancestor rice of Suriname Maroons back to its African origin. Nature Plants, 2(10), 16149. https://doi.org/10.1038/nplants.2016.149

Wang, M., Yu, Y., Haberer, G., Marri, P. R., Fan, C., Goicoechea, J. L., … Wing, R. A. (2014). The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature Genetics, 46(9), 982–8. https://doi.org/10.1038/ng.3044

Zhu, T., Xu, P.-Z., Liu, J.-P., Peng, S., Mo, X.-C., & Gao, L.-Z. (2014). Phylogenetic relationships and genome divergence among the AA-genome species of the genus Oryza as revealed by 53 nuclear genes and 16 intergenic regions. Molecular Phylogenetics and Evolution, 70, 348–361. https://doi.org/10.1016/j.ympev.2013.10.008


Back cover: Many of the Surinamese varieties of rice that are cultivated by Maroons have genetic roots in West Africa. Adapted from Gewin, V. (2017, January). Rice Reveals African Slaves’ Agricultural Heritage. SAPIENS. Retrieved from https://www.sapiens.org. Photo credit: Tinde van Andel.