Short Communications in Biotechnology Vol. 4 / 2017

Genetic and Historical Migration Relationships of Three Northern Sabah Native Ethnic Groups with Their Southern China and Southeast Asia Neighbours

C. W. Yew1, M. Z. Hoque2, J. Pugh-Kitingan3, C. L. Y. Voo1, J. Rangsangan4, S. T. Y. Lau1, X. Wang5, W. Y. Saw5, T. H. Ong5, Y. Y. Teo5, S. H. Xu6, B. P. Hoh7, M. E. Phipps8 and S. V. Kumar1*

1Biotechnology Research Institute, 2School of Medicine, 3School of Social Science, 4Borneo Marine Research Institute, Universiti Malaysia Sabah, Jalan UMS, 88400 Kota Kinabalu, Sabah, Malaysia. 5Department of Statistics and Applied Probability, National University of Singapore, Singapore. 6Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, China. 7Institute for Molecular Medical Biotechnology, Universiti Teknologi MARA, Malaysia. 8Jeffrey Cheah School of Medicine and Health Sciences, Monash University, Malaysia. *Email: [email protected]

Abstract

The native ethnic groups of Sabah are divided into Dusunic-, Paitanic- and Murutic-speaking groups under the North Borneo stock of the Austronesian linguistic family. As this region is a putative entry and transition point in the ‘Out-of-Taiwan’ human migration history, founder effects may have created multiple new ethnic groups. Nevertheless, there is no evidence to support this hypothesis, as the population structure of the indigenous ethnic groups in Sabah and their genetic relationships with groups in Southeast Asia and Southern China are unknown. This study therefore aims to characterise the population structure and genetic relationships of the Northern Borneo populations and to compare them against the regional populations, for subsequent inference of migration history. Ethical clearance was obtained and blood samples were collected from healthy individuals of aboriginal ethnic groups. A total of 63 individuals from three ethnic groups, namely Rungus, Sonsogon and Sungai-Lingkabau, were genotyped with ~2.4 million genome-wide SNP markers. The genotyping data were then merged with the Pan-Asian SNP Consortium (PASNP) data set to form a comprehensive SC-SEA-NB data set comprising 58 populations. The genetic relationships were then inferred via complementary analyses of principal component clustering and admixture ancestry proportions. Principal component analysis (PCA) revealed that the three Sabah ethnic groups form a distinct cluster that is close to the Filipino Austronesian and Taiwanese native groups. Interestingly, the Sabah ethnic groups carry a distinct genetic ancestry, denoted ‘North-Borneo’, which shows a decreasing cline of admixture predominantly towards populations in Taiwan, the Philippines and Indonesia. The Sonsogon group, with an average of 95% ‘North-Borneo’ ancestry, suggests that the Northern Sabah population underwent a population isolation event.
As such, this work suggests that the Northern Borneo population followed the ‘Out-of-Taiwan’ migration, underwent population isolation and subsequently admixed with the regional populations. The findings indicate that Sabah’s indigenous population, as a whole, consists of a distinct pool of genetic variants, which is important for future anthropological and medical genetic studies.

Keywords: Genetic diversity, Ethnic populations, Sabah, Admixture

INTRODUCTION

Borneo is home to multiple ethnic groups with diverse languages and cultures, and plausibly with diverse genetic backgrounds. Previous genome-wide SNP genotyping analysis by the PASNP consortium had sparse sampling coverage of Borneo. Apart from the Bidayuh from Sarawak and the Dayak from Kalimantan, no ethnic groups from Sabah (Northern Borneo) were included in that study. As Northern Borneo is geographically the nearest region to the Southern Philippine islands, the lack of data from Sabah leaves a gap in the picture of migration history inferred from genetic studies (The HUGO Pan-Asian SNP Consortium, 2009; Xing et al., 2009).

Generally, the Austronesians in Island Southeast Asia (ISEA) are believed to have originated from the ‘Out-of-Taiwan’ exodus, which happened approximately 5000 years ago (Jinam et al., 2012). It is believed that once the ancestors of the Borneo populations entered the island, the North-East region of Sabah became the transition point to other parts of ISEA via the land route. Previous analyses of migration history have drawn only a simplistic model of migration; in particular, there is no clear route of migration once the putative ancestors from Taiwan entered Borneo (Macaulay et al., 2005; Jinam et al., 2012). In addition, there has been no quantitative analysis measuring the proportion of admixture among the SEA populations. As such, the current inhabitants of Sabah plausibly descend from the most recent ancestors of other ISEA Austronesians. Comparative analysis of their population structure against the Southern China and Southeast Asia populations is thus expected to provide insight into the migration history of the region and an overview of the population substructure of the ISEA populations.

Sabah itself has more than 30 officially recognized aboriginal ethnic groups. Linguistic study has shown that the North Borneo language stock extends from the Southern Philippines across the vast majority of Sabah and into most of the interior lands of Sarawak and Kalimantan (Lewis, 2009). This strongly implies genetic relatedness among these aboriginal ethnic groups in Borneo, as linguistic affiliations generally reflect genetic relatedness (Gray et al., 2009). The North Borneo language stock in Sabah can be divided into three families, namely Dusunic, Paitanic and Murutic. The distribution of the ethnic groups speaking these languages is regional. The Dusunic family, which forms the major population in the state, ranges from the Northeast through the Central region to the West Coast; the Paitanic family is concentrated in the interior lands of the East; and the Murutic family ranges from the interior lands in the South-West into the heartland of Borneo (Lewis, 2009).

As such, this study aims to determine the population structure and the proportions of genetic ancestry of the ethnic groups in the North-East region of Sabah, which borders the putative migration entrance of the Southern Philippines route of ‘Out-of-Taiwan’. The findings will be used to infer a better description of the migration history in this region.

METHODS

Ethical clearance and sample collection

Ethical clearance was obtained from the Committee of Research Ethics of the university (code: JKEtika 4/10(3)). Subsequently, approvals were also granted by the District Officers of Pitas and Kota Marudu to collect blood samples from the Rungus, Sonsogon and Sungai-Lingkabau ethnic groups in the districts. Prior to sample collection, volunteers were briefed on the project objectives and future applications. Their protected rights, such as confidentiality of identity and handling of samples, were also explained.

A brief interview was also conducted to obtain the background of each volunteer, particularly the ethnicity, origin and health history of the donor and family. After that, a consent form was signed by each volunteer. All data were kept private and confidential to avoid exposure of volunteers’ data. Next, 10 mL of peripheral blood was collected from each healthy individual.

SNP genotyping

A total of 63 samples, comprising 21 samples from each ethnic group, were selected. Genomic DNA was isolated from whole blood or buffy coat with the DNeasy Blood and Tissue kit (Qiagen) in accordance with the manufacturer’s protocol. Next, 200 ng of DNA was prepared for SNP genotyping with Illumina’s Omni2.5 bead array, which contains 2,379,855 SNPs, as described in the manufacturer’s protocol.

Quality assessment and merging of the SC-SEA-NB data set

The SNP genotyping data were visualised with Genome Studio (Illumina) and converted to PLINK format. Quality assessment was then performed to remove samples that i) had a call rate below 99%, ii) deviated from Hardy-Weinberg equilibrium, iii) were discrepant with the reported gender or iv) were first-degree relatives of other samples. Next, all SNPs that were monomorphic among the three ethnic groups were removed. Principal component analysis (PCA) was conducted to identify putatively admixed individuals, who were excluded from further analysis.
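Two of the filters described above, the 99% sample call rate and the removal of monomorphic SNPs, can be illustrated with a minimal sketch. This is not the authors' pipeline (the study used Genome Studio and PLINK); the genotype coding (0, 1, 2 copies of one allele, with -1 as a hypothetical missing-call code), the function names and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

MISSING = -1  # hypothetical code for a no-call genotype


def sample_call_rates(genotypes: np.ndarray) -> np.ndarray:
    """Fraction of SNPs with a non-missing call, per sample (rows = samples)."""
    return (genotypes != MISSING).mean(axis=1)


def polymorphic_snps(genotypes: np.ndarray) -> np.ndarray:
    """Boolean mask of SNPs where both alleles are observed among called genotypes."""
    called = genotypes != MISSING
    alt_count = np.where(called, genotypes, 0).sum(axis=0)  # copies of the counted allele
    allele_total = 2 * called.sum(axis=0)                   # two alleles per called genotype
    return (alt_count > 0) & (alt_count < allele_total)


# Toy matrix: 3 samples x 4 SNPs (illustrative values only, not study data).
geno = np.array([
    [0, 1, 2, 0],
    [0, 1, MISSING, 0],
    [0, 2, MISSING, MISSING],
])

rates = sample_call_rates(geno)
keep = rates >= 0.99           # samples passing the 99% call-rate threshold
poly = polymorphic_snps(geno)  # SNPs that are not monomorphic
```

In this toy matrix only the first sample passes the call-rate threshold and only the second SNP is polymorphic; on real array data the same masks would be applied to the full sample-by-SNP matrix.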

For comparative analysis, a total of 50 populations comprising 978 unrelated individuals originating from Southern China and Southeast Asia were extracted from the Pan-Asian SNP Consortium (PASNP) data set. In addition, 30 random unrelated individuals each from the Yoruba in Nigeria (YRI), Caucasians from North-West Europe (CEU), Han Chinese from Beijing (CHB) and Japanese from Tokyo (JPT) were extracted from the HapMap3 data set and served as reference populations representing continental Africa (YRI), Europe (CEU) and East Asia (CHB and JPT).

These data sets were then merged with the North Borneo samples on the SNP markers common to all data sets to form the SC-SEA-NB merged data set. The web tool LiftOver was then used to update the chromosome position of each SNP marker, and obsolete markers were removed. Moreover, PASNP samples that were in first-degree relationships with others or discrepant with the reported gender were removed. This curated merged data set was used for the analysis of population structure.
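The idea of merging on common SNP markers can be sketched as a set intersection over marker identifiers. The sketch below is only an illustration under stated assumptions: the function name and the rsID lists are hypothetical, and a real merge would additionally have to harmonise alleles, strand and genome build, which is not shown here.

```python
def common_markers(*marker_lists):
    """Return the sorted marker IDs that are present in every data set."""
    shared = set(marker_lists[0])
    for markers in marker_lists[1:]:
        shared &= set(markers)  # keep only IDs also found in this data set
    return sorted(shared)


# Hypothetical marker IDs for three data sets (not the study's markers).
nb_markers = ["rs1", "rs2", "rs3", "rs5"]
pasnp_markers = ["rs2", "rs3", "rs4"]
hapmap_markers = ["rs3", "rs2", "rs9"]

shared = common_markers(nb_markers, pasnp_markers, hapmap_markers)
```

Because the merged set can be no larger than the smallest input, the marker count of the merged data is bounded by the least dense array among the data sets being combined.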

Comparative analysis of population structure

Population structure was inferred with the smartPCA (EIGENSOFT package version 4.2) and ADMIXTURE (version Linux-1.23) programmes. The smartPCA programme was used for principal component analysis (PCA). A dot-plot of the first and second principal components was drawn to infer the genetic affiliation of the North Borneo ethnic groups with the other populations. For the ADMIXTURE analysis, the optimum number of ancestral genetic clusters (K value) among the SC-SEA populations was determined with the cross-validation test (--cv). The average proportion of admixture (%) of each population from the surrounding populations was then calculated from the proportions of admixed ancestries.
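To make the PCA step concrete, the following is a minimal sketch of a generic genotype PCA computed by singular value decomposition. It is not smartPCA's implementation; the allele-frequency scaling, the function name and the toy genotypes are assumptions chosen for illustration, and missing data are not handled.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project samples onto the leading principal components of a genotype matrix.

    genotypes: samples x SNPs array of 0/1/2 allele counts with no missing values.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                       # estimated allele frequency per SNP
    informative = (p > 0) & (p < 1)                # drop monomorphic SNPs (zero variance)
    g, p = g[:, informative], p[informative]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))     # centre and scale each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # sample scores on PC1, PC2, ...


# Toy data: two samples from each of two hypothetical groups (illustrative only).
toy = np.array([
    [0, 0, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
])
pcs = genotype_pca(toy)
```

In this toy case the first component separates the two groups, which is the kind of clustering read off a PC1 versus PC2 dot-plot.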

RESULTS AND DISCUSSION

Quality assessment

In the quality assessment, three individuals who did not cluster with their reported ethnicity were considered admixed and were removed. In addition, 11 individuals who were in first-degree relationships with others were removed. The final curated NB data set consists of 49 individuals (19 Rungus, 17 Sonsogon and 13 Sungai-Lingkabau). For comparative analysis, the NB samples were merged with the PASNP and HapMap data on a set of 10,386 common SNP markers. This formed the SC-SEA-NB merged data set, which comprises 58 populations and 1177 individuals.

North Borneo populations form a unique genetic cluster

PCA of this merged data set showed that the NB samples cluster closely with the East Asians. Next, the YRI and CEU populations were removed and the analysis was performed again. The second PCA revealed that the SC-SEA populations resolve into two clusters, which explicitly represent two general population structures in the SC-SEA region, i.e. the Negritos and the non-Negritos (results not shown).

To further resolve the genetic affiliation of the NB samples with the surrounding non-Negritos, PCA was conducted again without the Negritos from Malaysia, the Philippines and Indonesia. The non-Negrito PCA clearly revealed a finer population substructure among these non-Negritos, generally in accordance with geographical location (Figure 1). Notably, the populations in Northern Thailand and Southern China form two distinct substructures, whereas the , Javanese and form a wide cluster.

Interestingly, the NB populations formed their own genetic cluster, bordering neighbouring populations such as the Tagalog and Iraya from the Philippines and the Toraja from Sulawesi, Indonesia (Figure 1). The Taiwanese aboriginal groups (Ami and Atayal) clustered with the NB populations on PC1, but were resolved apart on PC2. The relatively close neighbour, the Bidayuh from Sarawak, is genetically distinct from the NB populations. More importantly, there is a cline among the three ethnic groups (Figure 1). The Sungai-Lingkabau (a Paitanic-speaking group) was inferred to be genetically closer to the Sonsogon than the Rungus is, even though the latter, like the Sonsogon, is a Dusunic-speaking group. This suggests that the three groups are genetically distinct in general.

Figure 1 Principal component analysis of the non-Negrito populations in Southern China and Southeast Asia (SC-SEA).

Admixture of ‘North-Borneo’ ancestry in the region

ADMIXTURE analysis was conducted at the optimal K=12 (Figure 2). Excluding the reference populations from Africa and Europe, the Negritos and the known population isolate of Mlabri (Xu et al., 2010), there are six distinctive genetic ancestries, namely ‘Ami-Atayal’, ‘Temuan’, ‘North-Borneo’, ‘Bidayuh’, ‘Jinuo-Paluang’ and ‘Hmong’. ‘Ami-Atayal’ is the most widely distributed, as it was found not only in all Negritos but also in the Tai-Kadai-speaking groups, the Han and the Japanese. Despite that, it is present in only minute proportions in the Austro-asiatic-speaking groups in Thailand and Southern China, which are predominantly an equal admixture of ‘Temuan’ and ‘Jinuo-Paluang’. In addition, ‘Hmong’ was predominant in the Hmongs and showed a cline of admixture with the Tai-Kadai-speaking groups, Han Chinese and Japanese, with ancestry proportions of approximately 20-40%.
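The choice of an optimal K by cross-validation amounts to picking the K with the lowest reported cross-validation error. The sketch below assumes that each ADMIXTURE run's log contains a line of the form "CV error (K=3): 0.52"; that line format, the function name and the error values are assumptions for illustration and are not results from this study.

```python
import re

# Assumed log line format: "CV error (K=<number>): <error>"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def best_k(log_text):
    """Return (K with the lowest CV error, dict of all parsed K -> error)."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    return min(errors, key=errors.get), errors


# Made-up log excerpts concatenated from three hypothetical runs.
example_log = """\
CV error (K=2): 0.561
CV error (K=3): 0.532
CV error (K=4): 0.547
"""

k_opt, cv_errors = best_k(example_log)
```

Here the minimum falls at K=3 in the made-up values; in practice the error curve is usually inspected across the whole range of K rather than relying on the minimum alone.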

The analysis also strengthens the finding that the NB populations form their own unique genetic cluster. The proportion of ‘North-Borneo’ admixed ancestry in the surrounding populations was calculated and is given in Table 1. The Sonsogon have the highest proportion, at up to 95%, followed by the Sungai-Lingkabau (74.8%) and the Rungus (55.1%). The drastic reduction of ‘North-Borneo’ ancestry in the Rungus is offset by up to 40% of ‘Ami-Atayal’ ancestry. In contrast, the proportion of ‘North-Borneo’ in the surrounding populations is below 13% in every case (Ilocano 12.8%, Ami 12.4%, Toraja 12.4%, Dayak 10.4% and Bidayuh 3.1%). This suggests a local genetic cluster within the region of Northern Borneo.
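The per-population averages of this kind can be obtained by averaging one ancestry component of the individual ancestry fractions within each population. The sketch below is a minimal illustration under stated assumptions: the function name, the population labels and the fractions are hypothetical and are not the study's data.

```python
from collections import defaultdict


def mean_ancestry(q_rows, populations, component):
    """Average one ancestry component (as a percentage) within each population.

    q_rows: per-individual ancestry fractions, each row summing to 1.
    populations: population label of each individual, in the same order.
    component: column index of the ancestry of interest.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row, pop in zip(q_rows, populations):
        totals[pop] += row[component]
        counts[pop] += 1
    return {pop: 100.0 * totals[pop] / counts[pop] for pop in totals}


# Toy ancestry fractions for two hypothetical populations and K=2.
q = [[0.8, 0.2], [1.0, 0.0], [0.5, 0.5], [0.7, 0.3]]
labels = ["PopA", "PopA", "PopB", "PopB"]

averages = mean_ancestry(q, labels, component=0)
```

With these toy values the first component averages about 90% in PopA and 60% in PopB, mirroring how a single row of a table of average ancestry proportions would be derived.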

Table 1 The average proportion of ‘North-Borneo’ ancestry among some selected populations in Southern China and Southeast Asia (SC-SEA).

As shown in Table 1, there is a diffusion of ‘North-Borneo’ ancestry, predominantly towards the Philippines and Indonesia, although the highest proportion is less than 13%. In general, the proportion is highest in Taiwan-Philippines (~9-13%), followed by Indonesia-Singapore (~5-9%) and Thailand-Southern China (~2-4%). Despite that, only the Filipino Austronesians (Ilocano, Tagalog and Visaya) and the admixed Negritos (Iraya and Minanubu) are genetically affiliated with the NB populations, as supported by the PCA. This is attributed to the minute amount of admixture from ‘Temuan’, which has a proportion of approximately 20-50% in the Malays and the . Surprisingly, there is no ‘North-Borneo’ admixture in the Negritos of Peninsular Malaysia (Jehai and Kensiu).

In contrast, the close neighbours, the NB populations and the Bidayuh, formed their own genetic ancestries (namely ‘North-Borneo’ and ‘Bidayuh’, respectively) at this optimum K. Generally, the majority of the SC-SEA populations have a low proportion (<10%) of both ancestries. However, the pattern of admixture reveals that populations in the Philippines, Sulawesi and the Alor Islands contain a higher proportion of ‘North-Borneo’, whereas populations from Northern China and Southern China contain a higher proportion of ‘Bidayuh’. The Malays, Javanese and Dayak have approximately equal admixture of both. These populations also have multiple admixtures, not only from non-Negritos but from Negritos as well. This reflects the general population structure in terms of ancestry admixture.

This work used a combined analysis of clustering, ancestry proportion, population differentiation and phylogenetic analysis to infer the genetic relationships within the NB populations and in comparison with populations in the Southern China and Southeast Asia region. The findings indicate that the three ethnic groups form a unique genetic ancestry, denoted ‘North-Borneo’. This study fills a gap in the migration history of the Austronesians under the ‘Out-of-Taiwan’ theory, as there were no aboriginal representatives from Sabah in previous studies (The HUGO Pan-Asian SNP Consortium, 2009; Xing et al., 2009; Jinam et al., 2012). Moreover, there was a diffusion of ‘North-Borneo’ genetic ancestry towards all SC-SEA populations, although it is predominantly found in the range from Taiwan to the Philippines, the Alor Islands and Sumatera. This further strengthens the case for an important contribution of the NB populations to the formation of ethnic groups in Southeast Asia.

The drastic reduction of ‘North-Borneo’ ancestry in the Rungus, together with their high proportion of ‘Ami-Atayal’ (~40%), supports the view that the founders of ‘North-Borneo’ originated from the ‘Out-of-Taiwan’ expansion. This is corroborated by the high admixed proportion of ‘Ami-Atayal’ across this region. The almost complete domination of the ‘North-Borneo’ ancestry in the Sonsogon, in comparison with their Rungus and Sungai neighbours a short geographical distance away, raises the possibility of population isolation of the founders in the past.

As such, this work proposes a migration history, based on the genetic data, in which the three Sabah ethnic groups are the putative founders of the North Borneo populations as a whole. The relatively high proportion of ‘North-Borneo’ in the Ilocano, Dayak, Toraja, Manggarai-Rampasasa and Malay-Palembang suggests that the founders of ‘North-Borneo’ migrated towards the Southern Philippines, Kalimantan (Borneo), Sulawesi (Indonesia), Alor (Indonesia) and Sumatera (Indonesia), corresponding to the current domiciles of the ethnic groups mentioned above. The emigrated ‘North-Borneo’ founders might then have intermarried with other ethnic groups in the new domiciles, giving rise to the admixture proportions of different ancestries.

CONCLUSION

This is the first genome-wide SNP genotyping analysis of the Northern Sabah ethnic groups in comparison with the Southern China and Southeast Asia populations. It is concluded that the ancestors of these ethnic groups originated from Taiwan and underwent population isolation before migrating out towards the Philippines and Indonesia. The findings fill a gap in previous work and contribute to a higher-resolution picture of migration history in the region.

ACKNOWLEDGEMENTS

This project was funded by the National Biotechnology Division, Ministry of Science, Technology and Innovation Malaysia (project code: 100-RMI/BIOTEK 16/6/2 B (1/2011)). The authors would like to express their gratitude and appreciation to all volunteers from the native ethnic groups of Sabah, Malaysia, who contributed to this study.

REFERENCES

Gray, R. D., Drummond, A. J. & Greenhill, S. J. 2009. Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science, 323: 479–483.
Jinam, T. A., Hong, L., Phipps, M. E., Stoneking, M., Ameen, M., Edo, J., HUGO Pan-Asian SNP Consortium & Saitou, N. 2012. Evolutionary history of continental Southeast Asians: “Early Train” hypothesis based on genetic analysis of mitochondrial and autosomal DNA data. Molecular Biology and Evolution, 29: 3513–3527.
Lewis, P. M. (ed.). 2009. Ethnologue: languages of the world, 16th edition. SIL International, Dallas.
Macaulay, V., Hill, C., Achilli, A., Rengo, C., Clarke, D., Meehan, W., Blackburn, J., Semino, O., Scozzari, R., Cruciani, F., Taha, A., Shaari, N. K., Raja, J. M., Ismail, P., Zainuddin, Z., Goodwin, W., Bulbeck, D., Bandelt, H., Oppenheimer, S., Torroni, A. & Richards, M. 2005. Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes. Science, 308: 1034–1036.
Nei, M. & Feldman, M. W. 1972. Identity of genes by descent within and between populations under mutation and migration pressures. Theoretical Population Biology, 3: 460–465.
Rosenberg, N. A., Li, L. M., Ward, R. & Pritchard, J. K. 2003. Informativeness of genetic markers for inference of ancestry. American Journal of Human Genetics, 73: 1402–1422.
The HUGO Pan-Asian SNP Consortium. 2009. Mapping human genetic diversity in Asia. Science, 326: 1541–1545.
Xing, J., Watkins, W. S., Witherspoon, D. J., Zhuang, Y., Guthery, S. L., Thara, R., Mowry, B. J., Bulayeva, K., Weiss, R. B. & Jorde, L. B. 2009. Fine-scaled human genetic structure revealed by SNP microarrays. Genome Research, 19: 815–825.
Xu, S., Kangwanpong, D., Seielstad, M., Srikummool, M., Kampuansai, J., Jin, L. & The HUGO Pan-Asian SNP Consortium. 2010. Genetic evidence supports linguistic affinity of Mlabri – a hunter-gatherer group in Thailand. BMC Genetics, 11: 18–31.
Yang, N., Xu, S. & The HUGO Pan-Asian SNP Consortium. 2011. Identification of close relatives in the HUGO Pan-Asian SNP database. PLoS One, 6: e29502.

Figure 2 Admixture analysis of all SC-SEA populations, with the HapMap populations as reference populations.