
Anthropological Review • Vol. 81(3), 252–268 (2018)

Anthropological Review Available online at: https://content.sciendo.com/anre Journal homepage: www.ptantropologiczne.pl

A glance of genetic relations in the Balkan populations utilizing network analysis based on in silico assigned Y-DNA Emir Šehović1*, Martin Zieger3, Lemana Spahić1, Damir Marjanović1,2, Serkan Dogan1

1International Burch University, Department of and Bioengineering, Sarajevo, 2Institute for Anthropological Research, , Croatia 3Institute of Forensic Medicine, Forensic Molecular Biology Dpt., University of Bern, Sulgenauweg 40, 3007 Bern, Switzerland

Abstract: The aim of this study is to provide an insight into Balkan populations’ genetic relations utilizing in silico analysis of Y-STR haplotypes and performing haplogroup predictions together with network analysis of the same haplotypes for visualization of the relations between chosen haplotypes and Balkan populations in general. The population dataset used in this study was obtained using 23, 17, 12, 9 and 7 Y-STR loci for 13 populations. The 13 populations include: Bosnia and Herzegovina (B&H), Croatia, Macedonia, Slovenia, Greece, Romany (Hungary), Hungary, Serbia, Montenegro, Albania, Kosovo, Romania and Bulgaria. The overall dataset contains a total of 2179 samples with 1878 different haplotypes. I2a was detected as the major haplogroup in four out of thirteen analysed Balkan populations. The four populations (B&H, Croatia, Montenegro and Serbia) which had I2a as the most prevalent haplogroup were all from the former Yugoslavia. The remaining two major populations from former Yugoslavia, Macedonia and Slovenia, had E1b1b and R1a haplogroups as the most prevalent, respectively. The populations with E1b1b haplogroup as the most prevalent one are the Macedonian and Romanian populations, as well as the Albanian populations from Kosovo and Albania. The I2a haplogroup cluster is more compact than the E1b1b and R1b haplogroup clusters, indicating a larger degree of homogeneity within the haplotypes that belong to the I2a haplogroup. Our study demonstrates that a combination of haplogroup prediction and network analysis represents an effective approach to utilize publicly available Y-STR datasets for population genetics.

Key words: Balkan populations, Y-STR analysis, Haplogroup prediction, Median-joining tree, Y chromosomal haplogroups.

Original Research Article Received April 10, 2018 Revised August 16, 2018 Accepted August 17, 2018 DOI: 10.2478/anre-2018-0021 © 2018 Polish Anthropological Society

Emir Šehović, et al. Network analysis of Balkan Y haplogroups.

Introduction

The paternally inherited Y chromosome is used in forensics for identification, and in anthropology and population genetics for understanding the origin and migrations of humans (Kayser et al. 2005 and Shi et al. 2005). The Y chromosome, despite being the smallest chromosome in the organism, still possesses two highly useful types of genetic markers, single nucleotide polymorphisms (SNPs) and short tandem repeats (STRs) (Gusmao et al. 2005; Ballantyne et al. 2010; Wang et al. 2014). World population linkages and phylogenetic trees are created based upon SNP markers, mainly because of their low mutation rate (Wang et al. 2010). The lineages determined by SNP patterns are referred to as haplogroups. Haplogroups can also be inferred from readily available Y-STR genotyping data (Athey 2006). In the forensic context there is plenty of Y-STR data available (Willuweit and Roewer 2015) that can also be explored for population genetics.

Haplogroup I, referred to as the “Palaeolithic” European-specific haplogroup, is biological evidence that the Balkan population underwent an additional expansion after the last Ice Age (Marjanović et al. 2005). Anatolian agriculture began to spread 8,000 to 9,500 years ago and went across the Balkan (Lemmen et al. 2011). Roostalu et al. (2006) and Pala et al. (2012) discussed that ancient DNA samples, especially those of mtDNA, link the European population to their neighbours from the Near East. Scientific efforts focused on genetics confirmed the hypothesis that there were several waves in which farmers from the Near East came to Europe due to climate changes and new farming inventions. Therefore, the overall gene pool of Europe was turbulently mixed many times (Özdoğan, 2011; Davidović et al. 2015; Šarac et al. 2016; Veldhuis and Underdown 2017).

The Balto-Slavic speakers make up approximately one third of all Europeans, and the analysis of their mitochondrial DNA and the non-recombining region of the Y chromosome suggests that their genetic structure does not differ significantly from the neighbouring populations. The Balto-Slavic population can be divided into East Slavs, West Slavs, and South Slavs, the latter including most of the Balkan populations (Kushniarevich et al. 2015).

Earlier research done by Šarac et al. (2018) shows that the Balkans have been a turbulent region for hundreds of years, and this region consequently contains several major haplogroups. The four major Y chromosomal haplogroups in the Balkans are I2a, E1b1b, R1a and R1b. The I2a and E1b1b haplogroups are the most abundant, but R1a and R1b still have a considerable percentage (Semino et al. 2000). Haplogroup I is thought to have appeared on the Balkan Peninsula about 45,000 years ago. However, this haplogroup most probably originated from the Middle coming to through (Battaglia et al. 2009; Primorac et al. 2011). Haplogroup E1b1b is believed to have originated from the African continent and provides evidence of the last direct migration from Africa to Europe (Battaglia et al. 2009). Furthermore, the inclusion of a significant percentage of R1a and R1b haplogroups within the Balkan populations confirms their historical intertwining with the other populations of Europe (Semino et al. 2000).

The aim of this study is to provide an insight into Balkan populations’ genetic relations utilizing in silico analysis of Y-STR haplotypes by performing haplogroup predictions and network analysis of the same haplotypes for visualizing the relations between the chosen haplotypes and Balkan populations in general. The in silico analysis, while also being affordable, is very useful in analysing the haplogroup distributions within the Balkan countries. In addition, the analysed haplotypes were also studied using median-joining trees according to various sets of loci.

Material and methods

The population dataset used in this study was obtained using 23, 17, 12, 9 and 7 Y-STR loci for 13 populations. A set of 23 Y-STRs was analysed in seven populations: Bosnia and Herzegovina (Purps et al. 2014), Croatia (Purps et al. 2014), Macedonia (Purps et al. 2014), Slovenia (Purps et al. 2014), Greece (Purps et al. 2014), Romany (Hungary) (Purps et al. 2014) and Hungary (Purps et al. 2014). A set of 17 Y-STRs was analysed in two populations, namely Serbia and Montenegro (Mirabal et al. 2010). The 12 Y-STR analysis includes only the Albanian population (Ferri et al. 2010). The Kosovo (Peričić et al. 2004) and Romania (Barbarii et al. 2003) population data were obtained using 9 Y-STR loci. Finally, the 7 Y-STR analysis includes only the Bulgarian population (Zaharova et al. 2001). Some population datasets with fewer Y-STR loci were used because there are no publicly available data with 23 Y-STRs for those populations.

This dataset contains a total of 2179 samples with 1878 different haplotypes. The number of samples per population is not consistent, varying from 53 samples from Romany (Hungary) to 404 Montenegrin samples. The list of all analysed populations, as well as the number of samples and different haplotypes, is shown in Table 1. Full data used in this article are available from the corresponding author on request.

Table 1. Population datasets used in the study.
Population                   Number of samples   Number of different haplotypes
Albanian                     339                 233
Bosnian and Herzegovinian    100                 100
Bulgarian                    126                 88
Croatian                     239                 239
Greek                        214                 214
Hungarian                    100                 100
Albanian (Kosovo)            117                 60
Macedonian                   101                 101
Montenegrin                  404                 318
Romanian                     104                 97
Romany (Hungary)             53                  53
Serbian                      179                 171
Slovenian                    104                 104
Total                        2179                1878

Y chromosomal haplogroups can be assigned from Y-STR haplotypes using haplogroup predictors, which are very useful for analysing previously published Y-STR datasets (Dogan et al. 2016; Dogan et al. 2017; Heraclides et al. 2017; Gurkan et al. 2017) as well as for validation of haplogroup assignments based on Y-SNP data (Petrejčíková et al. 2014 and Emmerova et al. 2017). The four haplogroup predictors in focus in this study are: the Whit Athey haplogroup predictor (Athey 2006), the Nevgen haplogroup predictor (Ćetković - Gentula and Nevski 2015), Vadim Urasin’s haplogroup predictor (Urasin 2013) and the Jim Cullen haplogroup predictor (Cullen 2008).

Overall concordance between the four haplogroup predictors, as well as the concordance between each of the haplogroup predictors, was analysed in order to obtain the most reliable results and to understand how the haplogroup predictors compare to each other. Relative concordance between each of the haplogroup predictors was calculated by assigning each haplotype a value of either 1 or 0, depending on whether the predicted haplogroup is the same. Dividing the sum of these values by the number of haplotypes analysed then gives the relative concordance between the respective haplogroup predictors.

The haplogroup distribution percentages of all populations were obtained by calculating the average of the haplogroup percentages provided by each of the four haplogroup predictors. The haplotypes which were problematic for the haplogroup predictors to resolve were removed from the analysis.

The RST pairwise matrix, as described by Slatkin (1995), was calculated on the 13 populations analysed in this study. Y-STR haplotypes are assumed to mainly follow a stepwise mutation model. Therefore, RST analysis is the most suitable in this case (Slatkin 1995). The YHRD software (yhrd.org/amova) was used to calculate the RST pairwise matrix as well as to create the multidimensional scaling (MDS) plot based on the obtained RST values (Willuweit and Roewer 2015). A minimal set of loci (DYS389I, DYS389II, DYS19, DYS391, DYS390, DYS392, DYS385, DYS393) was used to create the RST matrix.

The phylogenetic median-joining network algorithm is used with large sets of genetic data. It is based upon creating minimum spanning trees, which are further combined into one reticulate network. In order to achieve parsimony, consensus sequences in the form of median vectors, or Steiner points, are added (Bandelt et al. 1999). Posada and Crandall (2001) introduced network analysis and haplogroup predictors that together make a great tool for genetic analysis of populations. Furthermore, as they complement each other’s weaknesses, the results obtained from their combined overall picture can be considered more accurate than when using each method individually (Posada and Crandall 2001).

For the purpose of network analysis, sets of populations of interest were made. Each population set consisted of 15 haplotypes. The set of 15 haplotypes represented the haplogroup prediction values for each population. Hence, the percentage of haplogroups in the set of 15 haplotypes and in the overall population should be nearly the same. Within specific haplogroups the haplotypes were taken arbitrarily. However, haplotypes with microvariants were not selected, as they cannot be resolved by the network analysis. Furthermore, the overall concordance percentage between all the predictors had to be 100% for the chosen haplotypes, meaning that all the haplogroup predictors were concordant and predicted the same haplogroup.

Due to the small number of loci on which the Bulgarian population was analysed, it was not included in the network analysis datasets. The sets of populations included a set of all populations given in Table 1 except Bulgaria, the former Yugoslavian populations, and haplotype sets for individual haplogroup analyses. The network analyses including haplotypes predicted from all major haplogroups were done using 15 haplotypes per population, while 10 haplotypes per population were used when the individual haplogroups were analysed. For the individual haplogroup network analysis, new sets of haplotypes were made.

For construction of the dataset which included all populations except the Bulgarian, nine loci were used: DYS393, DYS390, DYS394, DYS385a/b, DYS439, DYS392, DYS389II, and DYS438. The locus DYS385a/b was separated into DYS385a and DYS385b. Finally, the former Yugoslavian set of populations was created using 12 loci: DYS393, DYS390, DYS394, DYS385a/b, DYS439, DYS392, DYS389II, DYS458, DYS448, DYS456, and DYS635. More loci were used for the former Yugoslavian populations because of the smaller population size and the greater likelihood of having similar or identical haplotypes within the haplotype set, which would yield relatively poor results. By using 12 loci, the haplotypes can be differentiated by a larger margin while retaining the ability to visualize some of the similarities between them, which in the end yields proper clustering. The 12 loci were selected based on the degree of variance of allele lengths they display between all the analysed populations. Loci variance ($\frac{\sum (X-\mu)^{2}}{N-1}$) was calculated using the VAR.S function in Excel.

Individual haplogroups (I2a, E1b1b, R1a and R1b) of the Balkan region were analysed by network analysis using the same 9 loci mentioned above. The median-joining trees were generated using the Fluxus Network 4.6 program with an equal weight on all loci and ϵ = 0 (http://www.fluxus-engineering.com/sharenet.htm) (Bandelt, Forster and Röhl 1999).

The post-processing option was included in order to calculate all the possible variations of the vectors that can be generated in the tree and to select the optimal one. In addition, all of the other versions the program has calculated can be seen. That makes it possible to analyse the median-joining trees that the program has not classified as optimal.

A modal haplotype (http://www.mymcgee.com/tools/yutility.html) was inserted in the present haplotype sets. A modal haplotype consists of the allele values with the largest number of occurrences in the haplotypes analysed. In case of a tie, the larger allele value is used.

A heatmap for each major haplogroup was created based on its respective frequency in each country (Babicki et al. 2016). Moreover, for the purpose of comparison to the median-joining trees, a principal component analysis (PCA) was performed using the PAST program developed by Hammer (2001) on all the populations, excluding the Bulgarian population, based on the same 9 Y-STR loci used for the median-joining trees.

Results

On average, the four haplogroup predictors have a 91% relative concordance with each other (haplogroup predictor concordance data not shown), as there were no Y-SNPs for reference comparison. Haplogroup assignments for each haplotype from all 4 haplogroup predictors were compared against each other and together made up the relative concordance value. The relative concordance value was calculated based only on algorithm predictions. Hence, the haplogroup percentages and results in general are considered reliable.

For the analysed populations, the loci DYS385a/b, DYS481 and DYS389II have a very high variance in the Balkan populations. The values correspond roughly to what would have been expected (Purps et al. 2014). It is important to emphasize that some of the loci are not covered by all populations, meaning that their statistical distribution may be skewed to a certain degree. However, the abovementioned loci cover a substantial number of haplotypes, the only exception being DYS481 with 911 haplotypes. The loci with a low level of variance, indicating a similarity between the populations, are DYS391, DYS389I and GATA-H4. The loci variance analysis was useful in choosing which loci to use in the network analysis. A complete list of loci and their respective variance is shown in Table 2.

Among all the analysed populations, significant differences in the variance values between populations are found within the DYS393 locus (data for individual population variance scores for each locus not shown). The former Yugoslavian populations have a relatively low variance score, with an average of 0.2602. On the other hand, the other populations average a variance score of around 0.5, with Romany (Hungary) having the highest variance score of 1.17.

Within the DYS438 locus, the B&H, Serbian and Montenegrin populations show
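As a rough illustration of the loci variance calculation, the sketch below computes the sample variance of the allele values at each locus, using the N−1 denominator that Excel's VAR.S also uses. The allele values and the helper name `locus_variances` are hypothetical; they are not taken from the study's dataset, and the authors' actual spreadsheet layout is not known.

```python
from statistics import variance  # sample variance, divides by N - 1 like Excel's VAR.S


def locus_variances(haplotypes):
    """Return {locus: sample variance of its allele values} for a list of haplotype dicts."""
    loci = haplotypes[0].keys()
    return {locus: variance([h[locus] for h in haplotypes]) for locus in loci}


# Hypothetical allele values for illustration only (not data from the study).
toy_haplotypes = [
    {"DYS393": 13, "DYS391": 10},
    {"DYS393": 13, "DYS391": 10},
    {"DYS393": 14, "DYS391": 11},
    {"DYS393": 12, "DYS391": 10},
    {"DYS393": 13, "DYS391": 10},
    {"DYS393": 15, "DYS391": 11},
]

toy_variances = locus_variances(toy_haplotypes)
```

Ranking loci by these values would give one simple way to pick the more variable loci for a network analysis, under the assumption that every haplotype in the list is typed at every locus.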

Table 2. Variance values of all analysed loci among the Balkan populations.
Locus      Number of populations   Variance    Number of haplotypes
DYS393     13                      0.610145    1878
DYS390     13                      0.862400    1878
DYS19      13                      1.530143    1878
DYS391     13                      0.328473    1878
DYS385a    12                      3.707194    1790
DYS385b    12                      3.497353    1790
DYS439     10                      1.034329    1633
DYS389I    13                      0.378007    1878
DYS392     13                      0.842962    1878
DYS389II   13                      9.029229    1878
DYS458     9                       1.662084    1400
DYS437     10                      0.516272    1633
DYS448     9                       1.12932     1400
GATA-H4    9                       0.550598    1400
DYS456     9                       1.138877    1400
DYS576     7                       1.529612    911
DYS570     7                       1.971725    911
DYS438     10                      0.750886    1633
DYS635     9                       1.209496    1400
DYS481     7                       11.02054    911
DYS533     7                       0.656163    911
DYS549     7                       0.727928    911
DYS643     7                       1.419869    911

the least variance when compared to the other populations. Among the analysed populations, the Greek and Albanian populations have the highest variance score within this locus.

Within the populations which were analysed on the DYS481 locus (seven out of the thirteen), all show a high degree of variance, with the B&H and Croatian populations having the highest variance scores of 14.5 and 14.3, respectively.

When analysing the loci among the former Yugoslavian populations (B&H, Croatian, Macedonian, Montenegrin, Serbian and Slovenian), DYS393, DYS391, DYS389I, and DYS437 have a relatively low variance. On the other hand, DYS385a/b, DYS19 and DYS570 have a relatively high variance, with DYS385a/b being significantly more polymorphic than the other loci.

The RST pairwise matrix calculated on 13 populations, and visualized by the MDS plot, has shown a population grouping pattern which roughly represents the geographical position of the analysed populations. As expected, the Albanian and Kosovan populations are positioned close to each other within the MDS plot. Furthermore, the B&H and Serbian populations have shown a very high degree of similarity within the MDS plot. The Romany population has appeared on the MDS plot far away from all the other populations, indicating a large degree of dissimilarity, which is to be expected.

Another point which shows the rough geographical representation of the populations within the MDS plot is the very close grouping of the Croatian, Slovenian and Hungarian populations. The main unexpected result within the MDS plot is the Greek population grouping together with the Romanian population and being far away from the Bulgarian and Macedonian populations despite the geographical proximity.

Figure 1: Two-dimensional plot of MDS analysis of RST values for Y-STR haplotypes within the 13 Balkan populations.

Haplogroup distribution of Balkan populations

I2a is detected as the major haplogroup in four out of thirteen analysed populations. Out of the six former Yugoslavian populations, four of them (B&H, Croatia, Montenegro and Serbia) have I2a as the most prevalent haplogroup, as shown in Figure 2. The other two major populations from the former Yugoslavia, Macedonia and Slovenia, have E1b1b and R1a haplogroups as the most prevalent, respectively. The second most common haplogroup for both Macedonia and Slovenia is the I2a haplogroup, confirming a degree of similarity to the other former Yugoslavian major populations.

The populations with E1b1b haplogroup as the most prevalent one are the Macedonian and Romanian populations, as well as the Albanian populations from Kosovo and Albania. The Kosovo and Albania populations have a high degree of similarity in the haplogroup distribution.

Figure 2: Distribution of haplogroups within the analysed populations.

I2a and E1b1b haplogroups each have a hotspot presented on the geographical heatmap shown in Figure 3. Within the R1a haplogroup a descending northwest to southeast gradient is visualized, while for the R1b haplogroup, relative to the countries in the middle Balkans, increased percentages can be seen in the north and south of the Balkans.

Figure 3: Geographical heatmap of the four major haplogroups within the Balkan region.

Network analysis of Balkan populations

Four major and three minor clusters are formed, with the Romany (Hungarian) haplotypes forming minor independent clusters. The four major clusters correspond to the haplogroups I2a, E1b1b, R1a and R1b, while the three minor clusters are each formed by haplotypes assigned to the haplogroups H, J2b and I1.

As can be seen in Figure 4, the I2a haplogroup cluster shows the highest degree of relative compactness out of the analysed haplogroups, indicating a high degree of intrahaplogroup similarity between the haplotypes. On the other hand, the E1b1b haplogroup cluster has the lowest degree of compactness, as it forms two minor clusters within it. This indicates that there is a low level of intrahaplogroup similarity between the haplotypes.

Analysis of individual major haplogroups of the Balkan region

A large cluster of B&H, Slovenia, Montenegro and Albania can be seen within the I2a haplogroup network analysis tree. Overall, the Balkan haplotypes predicted to be I2a haplogroup are the most similar to each other when compared to the other individual haplogroup trees, which is the main reason for obtaining a very compact median-joining tree.

The overall E1b1b haplogroup clustering is not as compact as the I2a median-joining tree, and the Balkan population haplotypes make several minor clusters (analysed minor clusters shown within the red circles) separated from the major cluster. The Romanian haplotypes are clustered with former Yugoslavian haplotypes, with E1b1b predicted haplotypes from the Romanian population tending to be the most similar to the former Yugoslavian haplotypes.

The R1a haplogroup median-joining tree has several major clusters found mainly around the large circles of the network and is not as compact as the I2a haplogroup median-joining tree, as several minor clusters can be observed. Among those, two are population specific: one for the Romany and the other for the Romanian population. All of the major clusters are compactly clustered, while the minor clusters are not clustered together and mainly involve single population clusters.

Out of the four major individual haplogroups analysed in this study, the R1b Balkan haplogroup median-joining tree is the least compact one. It contains two major clusters, with the smaller major cluster consisting of mainly Kosovan, Montenegrin and Serbian along with a few Greek, B&H and Macedonian haplotypes, indicating a large degree of similarity between these haplotypes, and the larger major cluster not showing a clustering pattern of the Balkan populations but rather gradual clustering, indicating that the similarities and differences between R1b predicted
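The modal haplotype that Material and methods describes adding to each haplotype set before network construction can be sketched as follows: for every locus, take the most frequent allele, and on a tie take the larger allele value. The helper name `modal_haplotype` and the allele values are hypothetical and are used only to illustrate the rule; this is not the Y-Utility implementation.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most frequent allele per locus; ties are resolved in favour of the larger allele."""
    modal = {}
    for locus in haplotypes[0]:
        counts = Counter(h[locus] for h in haplotypes)
        highest = max(counts.values())
        # Among the alleles sharing the highest count, keep the largest value.
        modal[locus] = max(allele for allele, n in counts.items() if n == highest)
    return modal


# Hypothetical haplotype set for illustration only (not data from the study).
toy_set = [
    {"DYS19": 16, "DYS390": 24},
    {"DYS19": 16, "DYS390": 25},
    {"DYS19": 15, "DYS390": 24},
    {"DYS19": 15, "DYS390": 25},
    {"DYS19": 16, "DYS390": 23},
]

toy_modal = modal_haplotype(toy_set)
```

In this toy set DYS390 has a tie between alleles 24 and 25, so the tie rule selects 25, while DYS19 has a clear mode of 16.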

Figure 4: Median-joining network of all study populations except for Bulgaria. (A) The median-joining network showing population differentiation. (B) The median-joining network based on the assigned haplogroups.

Figure 5: Median-joining network of individual major haplogroups within the Balkan region. Minor clusters marked within red circles.

haplotypes from the Balkan populations are of a similar extent.

PCA

Similarly to the network analysis, the PCA largely shows haplogroup-based clustering. The median-joining tree created a clearer overall picture, whereas the PCA is very useful in identifying and analysing the population distribution, minor intra-population similarities and outliers. A cluster of the Romany haplotypes, within the R1a haplogroup cluster, can be seen on the left side of Figure 6. The same cluster was also visible in the median-joining tree, albeit with less clarity. Other minor population clusters that are better visually represented in the PCA are the Macedonian cluster in the bottom right (E1b1b cluster), the Greek cluster in the bottom left (R1b cluster), as well as the Serbian cluster within the I2a cluster at the top of Figure 6.

The loci which contributed the most to differentiation among populations and haplogroup-based clustering within the PCA are DYS385a/b and DYS19. The respective values of loci loadings can be seen in Figure 7. Furthermore, the allele which contributed the most to the first component of the PCA was allele variant 11 on locus DYS392, while allele variant 14 on locus DYS385a contributed the most to the second component.

Discussion

The haplogroup prediction percentages of the population dataset from the current study are in expected concordance with the work of Battaglia et al. (2009). The order of major haplogroups is the same in all populations, with small differences in the exact percentage values. The exact haplogroup distribution values can be seen in Table 3. This shows that the approach of utilizing four haplogroup predictors to yield accurate and reliable prediction results will enable a proper comparison and analysis of the dataset of interest.

Emmerova et al. (2017) used 5 different predictors and, based on 12 STRs, 3 of the predictors assigned the haplogroups with 98% accuracy compared to SNP

Figure 6: Principal component analysis of all the populations excluding Bulgaria based on the same 9 loci used in the network analysis.

Figure 7: (A) Loci loadings of the first principal component. (B) Loci loadings of the second principal component.
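To make the notion of loci loadings in Figures 6 and 7 more concrete, the sketch below approximates the first principal component of a small haplotype matrix by power iteration on the covariance matrix between loci. This is only an illustrative stand-in under stated assumptions: it is not the PAST implementation used in the study, the function name `first_component_loadings` is hypothetical, and the allele values are invented toy data.

```python
import math


def first_component_loadings(rows, iterations=200):
    """Approximate the first principal component of `rows` (samples x loci) by power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix between loci.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical allele values at two loci for five haplotypes (illustration only).
toy_rows = [[10, 13], [12, 13], [14, 14], [16, 13], [18, 14]]
toy_loadings = first_component_loadings(toy_rows)
```

The locus with the larger absolute loading contributes most to the component, which is the sense in which a highly variable locus can dominate a PCA of allele values.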

typing. Petrejčíková et al. (2014) used 3 different predictors based on 12 STRs and obtained 98.80%, 97.59% and 98.19% accuracy for Whit Athey’s haplogroup predictor, Jim Cullen’s haplogroup predictor and Vadim Urasin’s haplogroup predictor, respectively. Furthermore, Dogan et al. (2016) used 4 haplogroup predictors for analysing the B&H population and obtained a 99% average accuracy, while the current study obtained a 91% average accuracy for all populations. This discrepancy can be attributed to the fact that many of the populations were analysed on fewer than 12 loci.

An interesting detail of our study is that, compared to the 22% of haplogroup H we found in the Romany living in Hungary, Romany living in Macedonia have been shown to have 60% of the haplogroup H (Peričić et al. 2005) and only 2% of R1a and R1b haplogroups respectively.

Discrepancies observed within the I haplogroup distributions between the data used in the present study compared to Battaglia et al. (2009) for the Bosnian and Croatian populations and Regueiro et al. (2012) for the Serbian population can mainly be attributed to inaccuracies of haplogroup predictors in differentiating between the I haplogroup subclades (I1, I2a and I2b in this case).

The only notable difference to the published haplogroup distributions from Battaglia et al. (2009) was in the Hungarian R1a haplogroup percentage, since the prediction algorithms approximate this haplogroup frequency to be 21% of all Hungarian Y chromosomes for the data from Purps et al. (2014), while Battaglia et al. (2009) state that it is 56.6%. This large discrepancy could potentially be attributed to a sampling error, where one or the other population sample might not be geographically representative or might include a significant number of immigrants. The other populations that were also covered in the same article demonstrate only minor differences in few percentages which
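The relative concordance reported for the four predictors follows the scoring described in Material and methods: 1 when two predictors return the same haplogroup for a haplotype, 0 otherwise, summed and divided by the number of haplotypes. A minimal sketch is given below. Averaging over all predictor pairs is an assumption about how the overall figure could be obtained, and the predictor labels and predictions are hypothetical.

```python
from itertools import combinations


def relative_concordance(pred_a, pred_b):
    """Share of haplotypes for which two predictors return the same haplogroup."""
    scores = [1 if a == b else 0 for a, b in zip(pred_a, pred_b)]
    return sum(scores) / len(scores)


def mean_pairwise_concordance(predictions):
    """Average the relative concordance over every pair of predictors (assumed aggregation)."""
    pairs = list(combinations(predictions.values(), 2))
    return sum(relative_concordance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs of four predictors for the same five haplotypes (illustration only).
toy_predictions = {
    "predictor_1": ["I2a", "E1b1b", "R1a", "R1b", "I2a"],
    "predictor_2": ["I2a", "E1b1b", "R1a", "R1b", "I1"],
    "predictor_3": ["I2a", "E1b1b", "R1a", "R1b", "I2a"],
    "predictor_4": ["I2a", "E1b1b", "R1a", "J2b", "I2a"],
}

toy_mean = mean_pairwise_concordance(toy_predictions)
```

Haplotypes on which all four lists agree are the ones that would satisfy the 100% concordance requirement used when choosing haplotypes for the network analysis.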

Table 3. Haplogroup distribution within the analyzed populations. Values shown in percentages.
Population I2a E1b1b R1a R1b I1 J1 J2a J2b H
Alb (Alb) 5.7 30.3 4.6 11.7 6 1.6 4.2 13.6 11.7
Alb (Kos) 1.9 45.2 3.2 24.1 2.9 0.2 2.9 13
B&H 43.5 17 16 5 3.7 1 2.2 1.5
Bulgarian 16.8 18.4 5.9 9.1 2.9 1.3 6.9 2.7 1.5
Croatian 36.1 6.4 23.8 13.9 3.9 0.9 1.9 2.1
Greek 10.8 16.8 7.3 22.9 2.5 10.1 6.5 5.9
Hungarian 9.5 7.2 21.2 21.5 13 3.7 4.2 4.5
Macedonian 18.3 37.6 13.8 6.6 3.2 3.9 2.9 4.7
Montenegrin 31.1 26.7 7.2 9.5 4.6 0.9 3.5 4.4
Romanian 14.4 18.7 12.7 12.2 5 2.6 8.4 6
Romany (Hu) 3.3 8.9 31.6 22.1 7.5 1.8 5.6 7.5 3.7
Serbian 39.8 17.3 14.6 4.8 5.8 0.5 2.6 1.6
Slovenian 21.6 9.8 24.5 15.8 5.5 3.3 4 3.1

are justifiable and can be attributed to the small sample size of this population, possibly including a number of rare haplotypes which are not easily resolved by the predictors.

The results generated from the current study show that the similarities between the former Yugoslavian countries are, as expected, present to a larger extent than between the other countries. The only population which showed a higher level of dissimilarity to the other former Yugoslavian populations is Slovenia. The major Slovenian haplogroup is, unlike in the other Balkan countries, R1a.

The Kosovan and Albanian populations have shown a high degree of similarity, which was expected considering their common history, language and shared demographics (Belledi et al. 2000; Peričić et al. 2004). Furthermore, when the Kosovan population was removed, Albanian haplotypes were found to cluster with the Greek population (data not shown), which does have a certain logic behind it since Albania and Greece share borders.

Ballantyne et al. (2014) performed a similar network analysis using 10 and 15 loci with 1000 haplotypes randomly selected from the total dataset, whereas the median-joining trees used in this study included only 10-15 haplotypes per population. This shows that the mentioned network analysis is very detailed. However, it is challenging to visualize such a tree. Hence, we have optimized our method and used fewer haplotypes per median-joining tree.

Conclusion

Various historical and immigration events have shaped the demographic picture of the Balkan region, making it very diverse from the genetic perspective. Our study confirms that relative geographical haplogroup distribution is the most informative way of interpreting Y-chromosomal population data and can be reliably done in silico by utilizing large publicly available STR datasets stored in repositories such as YHRD (Willuweit and Roewer 2015). Here we demonstrate, once again, that geographical distance between the populations, as would be expected, is one of the key factors in shaping the genetic similarities or differences between them. The data we obtain with this in silico approach are in concordance with the existing literature.

The usage of four haplogroup predictors, providing very reliable and accurate major haplogroup predictions, enabled us to create highly accurate median-joining trees. In silico haplogroup assignments and network analysis can be combined to detect emerging subclusters for selective SNP analysis, for the investigation of possible subclades, and to specifically target uncertain haplotypes that should ultimately be confirmed by SNP analysis. This combined approach provides a very cost-effective Y haplogroup analysis, since SNP typing of the whole dataset can be avoided without a substantial loss of information. However, Y-SNP genotyping should still remain the main and standard method of choice for haplogroup determination.

Authors’ contributions

EŠ was involved in conceptualization, data analysis and curation, methodology, visualization, writing original draft; MZ was involved in methodology, writing original draft; LS was involved in writing original draft; DM was involved in supervision, reviewing and editing the manuscript; SD was involved in conceptualization, methodology, data curation, visualization, and writing original draft.

Conflict of interest

The authors declare no conflict of interest.

Corresponding author

Emir Šehović. Address: Izeta Karšića 54, Sarajevo, Bosnia and Herzegovina
Email address: [email protected].

References

Athey TW. 2006. Haplogroup prediction from Y-STR values using a Bayesian-allele-frequency approach. J Genet Geneal 2:34-39.
Babicki S, Arndt D, Marcu A, Liang Y, Grant JR, Maciejewski A, Wishart DS. 2016. Heatmapper: web-enabled heat mapping for all. Nucleic Acids Res. (epub ahead of print). doi:10.1093/nar/gkw419
Ballantyne KN, Goedbloed M, Fang R, Schaap O, Lao O, Wollstein A, et al. 2010. Mutability of Y-chromosomal microsatellites: rates, characteristics, molecular bases, and forensic implications. The American Journal of Human Genetics 87(3):341-353.
Ballantyne KN, Ralf A, Aboukhalid R, Achakzai NM, Anjos MJ, Ayub Q, et al. 2014. Toward Male Individualization with Rapidly Mutating Y-Chromosomal Short Tandem Repeats. Human mutation 35(8):1021-1032.
Bandelt HJ, Forster P, Röhl A. 1999. Median-joining networks for inferring intraspecific phylogenies. Molecular biology and evolution 16(1):37-48.
Barbarii LE, Burkhard R, Dan Dermengiu D. 2003. Y-chromosomal STR haplotypes in a Romanian population sample. International journal of legal medicine 117(5):312-315.
Bar-Yosef Ofer. 2002. The Upper Palaeolithic Revolution. Annual Reviews Anthropology 31:1, 363-393.
Battaglia V, Fornarino S, Al-Zahery N, Olivieri A, Pala M, Myres NM, et al. 2009. Y-chromosomal evidence of the cultural diffusion of agriculture in Southeast Europe. European Journal of Human Genetics 17(6):820-830.
Belledi M, Poloni ES, Casalotti R, Conterio F, Mikerezi I, Tagliavini J, et al. 2000. Maternal and paternal lineages in Albania and the genetic structure of Indo-European populations. European journal of human genetics: EJHG 8(7):480.
Bouckaert R, Lemey P, Dunn M, Greenhill SJ, Alekseyenko AV, Drummond AJ, et al. 2012. Mapping the origins and expansion of the Indo-European language family. Science 337(6097):957-960.
Cullen J. 2008. World Haplogroup and Haplo-I Subclade Predictor. www.members.bex.net/jtcullen515/haplotest.htm. Accessed 27 May 2016.
Ćetković Gentula M, Nevski A. 2015. Y-DNA haplogroup predictor – NevGen. www.nevgen.org/. Accessed 27 May 2016.
Davidović S, Malyarchuk B, Aleksic JM, Derenko M, Topalovic V, et al. 2015. Mitochondrial DNA perspective of Serbian genetic diversity. American journal of physical anthropology 156(3):449-465.
Doğan S, Ašić A, Doğan G, Besic L, Marjanović D. 2016. Y-Chromosome Haplogroups in the Bosnian-Herzegovinian Population Based on 23 Y-STR Loci. Human biology 88(3):201-9.
Doğan S, Babic N, Gurkan C, Goksu A, Marjanović D, Hadziavdic V. 2016. Y-chromosomal haplogroup distribution in the Tuzla Canton of Bosnia and Herzegovina: A concordance study using four different in silico assignment algorithms based on Y-STR data. HOMO-Journal of Comparative Human Biology 1;67(6):471-83.
Doğan S, Gurkan C, Dogan M, Balkaya HE, Tunc R, Demirdov DK, Ameen NA, Marjanović D. 2017. A glimpse at the intricate mosaic of ethnicities from : Paternal lineages of the Northern Iraqi , Kurds, Syriacs, and Yazidis. PloS one 3;12(11):e0187408.
Emmerova B, Ehler E, Comas D, Votrubova J, Vanek D. 2017. Comparison of

Y-chromosomal haplogroup predictors. Fo- somal data. PLoS One 10(9):e0135820. rensic Science International 6:145-147. Lemmen C, Gronenborn, D, & Wirtz, K. W. Ferri G, Tofanelli S, Alu M, Taglioli L, Radhe- 2011. A simulation of the tran- shi E, Corradini B, et al. 2010. Y-STR varia- sition in Western Eurasia. Journal of Ar- tion in Albanian populations: implications chaeological Science 38(12), 3459-3470. on the match probabilities and the genetic Marjanović D, Fornarino S, Montagna S, legacy of the minority claiming an Egyptian Primorac D, Hadziselimovic R, Vidovic descent. International journal of legal me- S, et al. 2005. The peopling of modern dicine 124(5):363-370. Bosnia -Herzegovina: Y -chromoso- Grugni V, Battaglia V, Kashani BH, Parolo S, me haplogroups in the three main eth- Al-Zahery N, Achilli et al. 2012. Ancient nic groups. Annals of Human Genetics migratory events in the : new 69(6): 757-763. clues from the Y-chromosome variation of Mirabal S, Varljen T, Gayden T, Regueiro M, modern Iranians. PloS one 7(7): e41252. Vujović S, Popović D, et al. 2010. Human Y Gurkan C, Sevay H, Demirdov DK, Hossoz -chromosome short tandem repeats: A tale S, Ceker D, Teralı K, Erol AS. 2017. Tur- of acculturation and migrations as mecha- kish Cypriot paternal lineages bear an au- nisms for the diffusion of agriculture in the tochthonous character and closest resem- Balkan Peninsula. American journal of phy- blance to those from neighbouring Near sical anthropology 142(3):380-390. Eastern populations. Annals of human bio- Özdoğan M. 2011. Archaeological eviden- logy 17;44(2):164-74. ce on the westward expansion of farming Gusmao L, Sanchez-Diz P, Calafell F, Martin communities from eastern Anatolia to the P, Alonso CA, Alvarez-Fernandez F, et al. Aegean and the Balkans. Current Anthro- 2005. Mutation rates at Y chromosome pology 52(S4), S415-S430. specific microsatellites. Human Mutation Pala M, Olivieri A, Achilli A, Accetturo M, 26(6):520-528. Metspalu E, et al. 2012. 
Mitochondrial Hammer, Ø, Harper D.A.T, Ryan P.D. 2001. DNA signals of late glacial recolonization PAST: Paleontological statistics software of Europe from near eastern refugia. The package for education and data analysis. American journal of human genetics 90(5), Palaeontologia Electronica 4(1):9. 915-924. Heraclides A, Bashiardes E, Fernández- Peričić M, Barać Lauc L, Martinović Klarić I, -Domínguez E, Bertoncini S, Chimonas Janićijević B, Behluli I, Rudan P. 2004. Y M, Christofi V, King J, Budowle B, Ma- chromosome haplotypes in Albanian popu- noli P, Cariolou MA. 2017. Y-chromoso- lation from Kosovo. Forensic science inter- mal analysis of Greek Cypriots reveals a national 146(1):61-64. primarily common pre-Ottoman paternal Peričić M, Barać Lauc L, Martinović Klarić I, ancestry with Turkish Cypriots. PloS one Rootsi S, Janićević B, et al. 2005. High-Re- 16;12(6):e0179474. solution Phylogenetic Analysis of Southe- Kayser M, Lao O, Anslinger K, Augustin C, astern Europe Traces Major Episodes of Bargel G. Edelmann J, et al. 2005. Signi- Paternal Gene Flow Among Slavic Popu- ficant genetic differentiation between Po- lations. Molecular Biology and Evolution land and Germany follows present-day 22(10):1964-1975 political borders, as revealed by Y-chromo- Petrejčíková E, Čarnogurská J, Hronská D, some analysis. Human Genetics 117(5): Bernasovská J, Boroňová I, Gabriková D, et 428-443. al. 2014. Y-SNP analysis versus Y-haplogro- Kushniarevich A, Utevska O, Chuhryaeva M, up predictor in the Slovak population. An- Agdzhoyan A, Dibirova K, Uktveryte I, et thropologischer Anzeiger 71(3):275-285. al. 2015. Genetic heritage of the Balto-Sla- Posada D, Crandall KA. 2001. Intraspecific vic speaking populations: a synthesis of gene genealogies: trees grafting into ne- autosomal, mitochondrial and Y-chromo- tworks. Trends in ecology & evolution Network analysis of Balkan Y haplogroups. 268

16(1):37-45. frequencies. Genetics 139(1), 457-462. Primorac D, Marjanović D, Rudan P, Villems Stevanović M, Dobricić V, Keckarević D, Pe- R, Underhill P. A. 2011. Croatian genetic rović A, Savić-Pavićević D, Keckarević- heritage: Y-chromosome story. Croatian -Marković M, et al. Human Y-specific STR medical journal 52(3): 225-234. haplotypes in population of Serbia and Purps J, Siegert S, Willuweit S, Nagy M, Alves Montenegro. 2007. Forensic science inter- C, Salazar R, et al. 2014. A global analysis national 171(2):216-221. of Y-chromosomal haplotype diversity for Šarac J, Auguštin D. H, Metspalu E, Novok- 23 STR loci. Forensic Science Internatio- met N, Missoni S, Rudan P. 2018. Maternal nal. Genetics 12: 12-23. Genetic Profile of Serbian And Montene- Regueiro M, Rivera L, Damnjanovic T, Lukovic grin Populations from Southeastern Euro- L, Milasin J, Herrera RJ. 2012. High levels pe. Genetics & Applications 1(2), 14-22. of Paleolithic Y-chromosome lineages cha- Šarac J, Šarić T, Auguštin D. H, Novokmet N, racterize Serbia. Gene 498(1):59-67. Vekarić N, Mustać M, et al. 2016. Genetic Roostalu U, Kutuev I, Loogväli E. L, Metspa- heritage of Croatians in the Southeastern lu E, Tambets K, et al. 2006. Origin and European gene pool—Y chromosome expansion of haplogroup H, the dominant analysis of the Croatian continental and Is- human mitochondrial DNA lineage in land population. American Journal of Hu- West Eurasia: the Near Eastern and Ca- man Biology 28(6): 837-845. ucasian perspective. Molecular biology and Urasin V. 2013.Y Predictor by VadimUrasin evolution 24(2), 436-448. v1.5.0. http://predictor.ydna.ru/. [Acces- Rootsi S, Kivisild T, Benuzzi G, Help H, Ber- sed 27 May 2016]. misheva M, Kutuev I, et al. 2004. Phylogeo- Veldhuis D, Underdown S.J. (2017) Human graphy of Y-chromosome haplogroup I re- biology of migration, Annals of Human veals distinct domains of prehistoric gene Biology 44:5, 393-396 flow in Europe. The American Journal of Wang CC, Jin L, Li H. 
2014. Natural selection Human Genetics 75(1):128-137 on human Y chromosomes. J. Genet. Geno- Semino O, Passarino G, Oefner P J., Lin Alice mics 41:47-52. A., Arbuzova S, et al. 2000. The Genetic Le- Wang CC, Yan S, Li H. 2010. Surnames and gacy of Paleolithic Homo sapiens sapiens in the Y chromosomes. Commun. Contemp. Extant Europeans: A Y Chromosome Per- Anthropol 4:26-33. spective. Science 290 (5494):1155-1159. Willuweit S, Roewer L. 2015. The new Y Semino O, Passarino G, Brega A, Fellous M, chromosome haplotype reference database. Santachiara-Benerecetti AS. 1996. A view Forensic Science International: Genetics of the neolithicdemic diffusion in Europe 15:43-48. through two Y chromosome-specific mar- Zaharova B, Andonova S, Gilissen A, Cassi- kers. American journal of human genetics man JJ, Decorte R, Kremensky I. 2001. 59(4):964. Y-chromosomal STR haplotypes in three Shi H, Dong YL, Wen B, Xiao CJ, Underhill major population groups in Bulgaria. Fo- PA, Shen PD, et al. 2005.Y-chromosome rensic science international 124(2):182- evidence of southern origin of the East 186. Asian–specific haplogroup O3-M122. The Zerjal T, Xue Y, Bertorelle G, Wells RS, Bao W, American Journal of Human Genetics Zhu S, et al. 2003. The genetic legacy of the 77(3):408-419. Mongols. The American Journal of Human Slatkin M. 1995. A measure of population Genetics 72(3):717-721. subdivision based on microsatellite allele