
ANALYSIS OF Y-CHROMOSOME POLYMORPHISMS IN

PAKISTANI POPULATIONS

Thesis submitted to the Institute of Medical Sciences

for the degree of Doctor of Philosophy.

BY

Sadaf Firasat

Centre of Human Genetics and Molecular Medicine

Sindh Institute of Medical Sciences

Sindh Institute of Urology and Transplantation (SIUT)

Karachi,

2010

TABLE OF CONTENTS

Title page

 Acknowledgements ii

 List of Tables iii

 List of Figures iv

 Summary vi

Introduction 1

Literature Review 19

Materials and Methods 34

Results

 Phylogeography of Pakistani ethnic groups. 51

 Comparison between the Pakistani and Greek populations 73

Discussion 86

 Comparison within Pakistan 88

 Comparison between the Pakistani and Greek population 94

 Comparison with world populations 98

 Insight into population origins 111

Conclusions 121

 References 122

 Appendix a


ACKNOWLEDGEMENT

I thank Prof. Dr. Syed Qasim Mehdi H.I. S.I., for his support and encouragement and for providing all the facilities for scientific work in his laboratory. The work presented in this thesis was done under the supervision of Dr. Qasim Ayub T.I. It is a great pleasure for me to acknowledge the keen interest, advice, patient guidance and kindness that I have received from him during the course of this work. I would like to thank Dr. Shagufta Khaliq (PoP) for teaching me the molecular genetics laboratory techniques, and Dr. Aiysha Abid for her comments on this manuscript and suggestions for its improvement. I am also grateful to Mrs. Ambreen Ayub for her help in making the contour map. I thank my colleague Ms. Sadia Ajaz for her help and cooperation in proofreading the thesis. It has been an honor for me to work at SIUT and I thank Prof. Dr. Adeeb Rizvi H.I. S.I., Director, SIUT, for his constant support and guidance. Finally, I would like to thank my parents; without their love and support the completion of this work would not have been possible.


LIST OF TABLES

Table Title Page

I. The possible origins and language affinities of Pakistani populations. 21

II. A list of Y markers, type of polymorphism and genotyping methods used in this study. Y haplogroups were determined in a hierarchical manner, screening initially with markers that identified deep lineages (bold) and subsequently genotyping markers that further delineated the tree in the target population. The typing methods were amplified fragment length polymorphism (AFLP), denaturing high performance liquid chromatography (DHPLC), amplification refractory mutation system polymerase chain reaction (ARMS-PCR) or dideoxy DNA sequencing (Seq). 41

III. List of SNPs typed by AFLP method 42

IV. YSTR Primer sequences. 46

V. Frequency of haplogroups B*, C*, E* and F* in ethnic groups from Pakistan. 53

VI. Number and frequencies of populations fall in B-T. 60

VII. Y lineages found in the three Punjabi castes examined in this study. 63

VIII. Percentage of variation obtained by AMOVA at three levels of population hierarchy in ethnic groups from Pakistan. 68

IX. Population pairwise FSTs between Pakistani ethnic groups computed from Y haplogroup frequencies. FST p values (based upon 110 permutations) are given above the diagonal with * indicating significant pairwise differences. 69

X. Matrix of significant FST p values (significance level = 0.0500) based upon 110 permutations among the ethnic groups of Pakistan. 70

XI. Weighted population pairwise ρ genetic distances (below diagonal) and FST values (above diagonal) based on STR variation within haplogroups. 80

XII. Description of world populations. 103

XIII. Y-STR data of clade B lineages in Pakistani and African populations. 108

LIST OF FIGURES

Figure Title Page

I. Map of Pakistan showing its neighbors, administrative regions and the geographical distribution of the populations that are included in this study. 20

II. Phylogenetic tree. 26

III. Distribution of haplogroups B*, C*, E* and F* in populations from northern and southern Pakistan. 54

IV. Y haplogroup frequency distribution in ethnic groups of Pakistan. 55

V. Distribution of major Y lineages (PK2, M52, M67, M27) frequencies in Pakistan. 64

VI. Distribution of major Y lineages (M357, M173, M17 and M124) frequencies in Pakistan 65

VII. Principal component analysis based on Y haplogroup frequencies in Pakistani populations. 67

VIII. Median-joining network of Lineage L individuals based on Y-STR haplotypes. 72

IX. A rooted maximum-parsimony tree of Y lineages found in the Greek, Burusho, Kalash, Pathan and Pakistani populations. 75

X. A plot of the first two principal coordinates based upon the analysis of Y haplogroup frequencies in Pakistani and Greek populations. 77

XI. A plot of the first two principal coordinates based upon the analysis of Y haplogroup frequencies in Pakistani and Greek samples (1=this study; 2 = Francalacci et al., 2003) using comparable biallelic markers. 78

XII. Neighbor-joining tree showing the relationship between the Greek and three Pakistani ethnic groups. The tree is based on ρ genetic distances. 81

XIII. Median-joining network of clade E lineages in Pakistan (open circles) and Greece (hatched circles). Circles represent haplotypes and have an area proportional to frequency. The Pathan individuals are shown in black. 83

XIV. Contour map showing the 9 Y-STR haplotype frequency distribution in Eurasia and northern Africa. This haplotype was shared between three Greek individuals and a Pathan individual belonging to clade E1b1b1a. 85

XV. The frequencies of major haplogroups in Asian populations. 105

XVI. Median-joining network of C lineage. 106

XVII. Distribution of L haplogroup in the Indo-Pak subcontinent. 107

XVIII. Median-joining network of clade B lineages in Pakistani and African populations. Circles represent haplotypes and have an area proportional to frequency. The Pakistani individuals are shown in orange and light blue colour. 109

XIX. Geographic distribution of O haplogroup. 110

XX. Median-joining network of the H1-M52 lineage found in the Burusho, Kalash and Pathan, based on their Y-STR haplotypes. 115

XXI. Possible origins a) Hazara b) Kalash c) Parsi d) Makrani – . 120


SUMMARY


The data presented in this thesis provide a comprehensive report on Y chromosomal diversity among different ethnic groups from Pakistan. They provide insights into the genetic variation in Pakistan in a global context and also shed light on the patrilineal origins of these populations. The major conclusions are summarized as follows:

1. Genetic relationships in Pakistan are dictated primarily by geographic proximity rather than linguistics:

The results suggest that within Pakistan male genetic relationships are dictated primarily by geographic proximity. Ethnic groups speaking Dravidian (Brahui), Sino-Tibetan (Balti) or the language isolate Burushaski (Burusho) share genetic affinity with their Indo-European speaking geographic neighbors. Although the isolation of the Hunza Burusho in the mountains of northern Pakistan has led to the preservation of their language it has not made them genetically distinct in comparison with their neighbors in Pakistan.

Based on Y haplogroup frequencies, the majority of the ethnic groups from Pakistan show evidence of admixture mostly with Central/South Asian and European populations. This is illustrated by the fact that the major haplogroups such as E*, J* and R*, that are frequent in west Asians and Europeans, together constitute 65% of the total. Haplogroups L1 and R2 are shared with populations from and constitute 11% of the Pakistani population.

2. The Karakoram Mountains form a formidable barrier to gene flow from China:

Haplogroups, such as haplogroup C3 and O*, that are commonly observed in East Asians, are rare, or absent in the Pakistani populations and constitute < 1.5 % of the total. Populations living in these mountain valleys such as the Hunza Burusho, Balti and Kashmiri are all genetically closer to other ethnic groups in Pakistan. This low prevalence, or absence, of East Asian haplotypes in Pakistan indicates that the Karakoram Mountains, which separate Pakistan and China, form a formidable barrier to gene flow from the north. The Hazara are the only population with significant East Asian ancestry but historical records indicate that they did not cross this geographical boundary and arrived in the sub-continent from the West.

3. Genetic signatures of invasions:

The Indo-European contribution to the Y gene pool in Pakistan is substantial and is probably a reflection of the colonization of the subcontinent by invaders from West and . These probably replaced the indigenous Y haplogroups which are now mostly found in South Indians and isolated populations in the Andaman Islands.

Three populations (Burusho, Kalash and Pathan) also claim Greek ancestry following Alexander’s invasion of the subcontinent. However, the results shown here only provided strong support for a minor Greek genetic contribution to the Pathan gene pool.

The presence of a unique star cluster based on Y-STR haplotypes in haplogroup C3 Y chromosomes in the Hazara population has been linked to the male descendants of Genghis Khan (1162-1227). These Y chromosomes are prevalent in Mongolia and are observed at a frequency of 60% in a much larger sample of Hazara males from northern and southern Pakistan that were analyzed in this study. Although this haplogroup was also observed in the Burusho (8.2%), these samples did not share the star haplotype, pointing towards separate origins for these populations. Historical records also support the genetic relatedness between East Asians and the Hazara.


4. The Kalash as genetic outliers:

This study also demonstrates that the Kalash have a distinct genetic identity within Pakistan. Located in the remote valleys of the Hindu Kush Mountains they show significant Caucasian ancestry but also have a high proportion of the population-specific haplogroup L3a that is not found elsewhere in Pakistan. Their genetic uniqueness is a reflection of genetic drift in an isolated population struggling to maintain its distinct cultural and religious identity.

Future Prospects:

This endeavour expands our knowledge about Pakistani populations and complements data obtained from analyzing autosomal and mitochondrial markers. It improves our understanding of geographic, linguistic and religious factors on population diversity and structure in this region and provides a basis for future work in this field.


INTRODUCTION


“Where do we come from? What are we? Where are we going?” These provocative questions as framed in the title of the French artist Paul Gauguin’s painting have always aroused human curiosity. Using evidence from archaeology, fossils and lately genetics, scientists have gained insights into humanity’s past.

Human evolutionary history begins with the appearance of our genus about 2.5-1.5 million years ago (MYA), the earliest evidence of which has been found in Africa (Klein, 1989). With the passage of time, various species of the genus Homo have been identified including H. ergaster, H. erectus, Neanderthals and H. floresiensis (Brown et al., 2004; Gabunia and Vekua, 1995; Swisher et al., 1994), all of whom are now extinct with the exception of modern H. sapiens, the last fully developed species that appeared about 100,000 years ago in East Africa (Klein, 1989; Rightmire, 1989). The demise of our early ancestors has been attributed to harsh weather conditions or the difficulty in finding food and other life necessities.

There is consensus among the modern scientific community that modern humans arose in Africa, and several waves of migration help explain their passage out of Africa. Evidence from fossils and archaeological remains suggests that expansion of modern humans became possible when weather conditions were favorable. The discovery of 125,000-year-old artifacts on Eritrea's Red Sea coast (Walter et al., 2000) suggests that people from the Horn of Africa moved across the Arabian peninsula to the southern part of the Red Sea. They reached southern Asia, traveling further east to Australia (Stringer, 2000) around 50-60 thousand years ago (KYA). The evidence found from Skhul and Qafzeh, in modern day Israel, dating 100 KYA suggests that in another wave of migration humans crossed the Red Sea and entered the Levantine region 47 KYA. From Arabia, people moved towards west and east and reached Western and Siberia about 40 KYA and East Asia about 39 KYA. These waves of migration resulted in the development of several populations and races of modern humans that are characterized by differences in their physical appearance, culture and language.


Fossil and archaeological evidence in favour of an African origin for modern humans is also supported by molecular genetic evidence (Batzer et al., 1996; Bowcock et al., 1991, 1994; Cann et al., 1987; Cavalli-Sforza et al., 1994; Horai et al., 1995; Jorde et al., 1995; Knight et al., 1996; Lahr and Foley, 1994; Leakey, 1994; Mountain et al., 1994; Perez-Lezaun et al., 1997; Ruvolo et al., 1993; Scozzari et al., 1988; Shriver et al., 1997; Stringer and Andrews, 1988; Tattersall, 1997; Tishkoff et al., 1996). This biological evidence has provided valuable insights and, in association with paleontology and archaeology, allowed the reconstruction of human history.

Blood groups were the first markers to be analyzed in human populations, soon after the discovery of the ABO blood groups (Landsteiner, 1901). Variations in these blood groups were analyzed among First World War soldiers and the slaves from different nations (Hirszfeld and Hirszfeld, 1919). This was followed by the discovery and analysis of variation of several classical serological markers such as the immunoglobulin allotypes, red cell enzymes, human leukocyte antigens (HLA) (Dausset, 1954; Grubb and Laurell, 1956; Payne et al., 1964) and serum proteins (Harris, 1966). All these markers collectively contributed to our understanding of human variation and helped chart human origins and dispersals.

WHAT IS DNA?

In 1953 the celebrated Nobel Prize winners Watson and Crick described the double helical chemical structure of DNA (Watson and Crick, 1953) and laid the foundations for the development of DNA based genetic markers that have now become the hallmark of research into our past history. The simple but elegant structure of DNA that they described has two anti-parallel polynucleotide chains with a sugar-phosphate backbone. The nucleotide bases in DNA are of only four kinds: adenine (A), guanine (G), cytosine (C) and thymine (T), which pair strictly through hydrogen bonds, A with T and G with C. The sequences of these bases in the polynucleotide chain dictate the structure and function of proteins and every morphological and functional characteristic of each cell in the human body.
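The pairing rule (A with T, G with C) and the anti-parallel orientation of the two chains can be illustrated with a minimal Python sketch. The function name `complement_strand` and the example sequence are illustrative assumptions and are not part of the methods used in this thesis.

```python
# Watson-Crick pairing: A pairs with T, and G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand: str) -> str:
    """Return the partner strand of `strand`, written 5' to 3'.

    Each base is replaced by its pairing partner and the order is
    reversed, because the two chains run in opposite directions.
    """
    return "".join(PAIRING[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    print(complement_strand("ATGC"))  # prints GCAT
```

Applying the function twice returns the original sequence, which reflects the fact that each strand fully determines the other.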

In humans DNA is present inside the cellular nucleus and the mitochondria, an extranuclear organelle. In the mitochondria the DNA is small, circular and double stranded with a length of 16,569 base pairs (bp) (Anderson et al., 1981; Ruiz-Pesini et al., 2007). It contains only 37 genes but has been extremely useful in tracing back the maternal origins of human populations because it has three important characteristics:

1.) A maternal mode of inheritance (Giles et al., 1980).

2.) A high mutation rate (Olivio et al., 1983).

3.) A lack of recombination (Brown, 1979).

The human nuclear genome consists of double stranded DNA molecules that are packed into 23 pairs of chromosomes. Of these, twenty-two pairs, the autosomes, are the same in both males and females. One pair, the sex chromosome pair, differs between the sexes. Females have two X chromosomes whereas males have one X chromosome, which they inherit from the mother, and one Y chromosome, which is paternally inherited. The Y chromosome is passed from a father to his son and does not undergo inter-chromosomal recombination for most of its length. This feature has been of great value in the study of variation in modern human males.

The completion of the Human Genome Project (International Human Genome Sequencing Consortium, 2004) has revealed that enormous variation exists in our genome. Only 2-3% of our genome codes for functional molecules such as proteins and RNA. The intergenic regions, which constitute 97-98% of the sequence, consist of repetitive sequences, regulatory sequences, pseudogenes, and intermediate to large scale DNA copy number and sequence variants. All are remnants of our evolutionary past and provide valuable insights about what makes us human.

The human genome contains three billion pairs of nucleotides. The sequence of the nucleotides that constitutes the DNA strand carries all the genetic information required for the survival of an organism. A gene, which codes for a protein product, is located at a relatively fixed position on a chromosome and performs specific biological functions during the development of an individual from a fertilized egg and throughout life. Recent estimates show that the human nuclear genome contains about 20,000-25,000 genes (The ENCODE Project Consortium, 2007).

Any change that occurs in the DNA sequence is referred to as a mutation or polymorphism. It can be categorized on the basis of its size as either a large or small scale mutation. Large scale mutations include abnormalities such as alterations in chromosome number, as occur in Down’s syndrome (trisomy 21), Klinefelter’s syndrome (XXY) and Turner syndrome (XO), or chromosomal translocations, as observed in the Philadelphia chromosome t(9;22)(q34;q11). These chromosomal abnormalities can be easily detected by cytogenetic analysis. Small scale mutations are alterations in the sequence of nucleotides. These include the replacement of one nucleotide with another, or the deletion, or insertion, of any of the four nucleotides, resulting in a new allele for a particular gene. In some instances these new alleles may result in disease or improve the fitness of the organism. In most cases they are neutral changes and do not play any beneficial or detrimental role.

Any mutation in the germ line DNA sequence is inherited in a stable form and can pass from one generation to the next. Mutations can occur either at the time of recombination during meiosis, when the parental DNA is transmitted to the progeny, or during the mitotic cell division that occurs throughout the lifetime of an individual. They arise from errors in DNA replication during cell division. Copying DNA requires great accuracy in the insertion of the correct nucleotide into the growing polynucleotide strand. The DNA replication enzymes, the DNA polymerases, have proofreading activity that reduces the error rate. The 3'-5' exonuclease activity of these enzymes removes one incorrect nucleotide at a time from the 3' hydroxyl terminus until the correct nucleotide appears. Despite these effective DNA proofreading and repair mechanisms, replication errors occur at a rate of about 10⁻⁹ to 10⁻¹¹ per incorporated nucleotide (Cooper et al., 1995; Cooper et al., 2000).

HUMAN GENETIC POLYMORPHISMS

In humans 99.9% of the genome is identical and only 0.1-2.0% of the DNA sequence shows variation. These variations result in genotypic differences between individuals as well as phenotypic differences commonly observed in traits such as height, facial morphology, skin, eye and hair colour. These variations occur due to polymorphisms, which are non-pathogenic changes that exist at significant frequencies (usually > 1%) in any given population. To date many types of polymorphism have been discovered in the coding regions as well as in the non-coding regions of the human genome and they form the basis of all current “genetic markers”. They are used not only to unravel our evolutionary past but also to genetically predict our biological future and as diagnostic markers.

The non-coding DNA sequences that constitute the bulk of the human genome are dispersed throughout the genome. The exact function of these non-coding regions remains unknown, and this non-genic DNA is also known as selfish or “junk” DNA.

Several recent findings have shown the dynamic nature of these regions, which play a major role in gene regulation. Junk DNA does not encode any product used by the cell. It has a tendency to repeat sequences many times. In some instances this interferes with the function of other genes or increases their copy number. A large amount of non-coding DNA consists of short tandem repeats of nucleotides, in the form of an array or a block of bases, scattered throughout the genome.


According to their size, the human polymorphisms can be classified as single nucleotide polymorphisms, and repeat polymorphisms that include satellite DNA, mini-satellite DNA, micro-satellite DNA and copy number variants.

SINGLE NUCLEOTIDE POLYMORPHISMS:

The most common polymorphism in the human genome is the single nucleotide polymorphism (SNP). SNPs include single base substitutions, deletions or insertions. The base substitutions can be classified into two groups, namely transitions and transversions. In a transition a purine is replaced by a purine (A ↔ G) or a pyrimidine by a pyrimidine (C ↔ T). A transversion is the substitution of a purine by a pyrimidine (A/G → C/T) or vice versa (C/T → A/G). According to Collins and Jukes (1994), transitions occur more frequently than transversions in the mammalian genome.
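The distinction between the two classes of base substitution can be expressed as a short Python sketch. The function name `classify_substitution` is an illustrative assumption; it simply encodes the purine and pyrimidine definitions given above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid_bases = PURINES | PYRIMIDINES
    if ref not in valid_bases or alt not in valid_bases:
        raise ValueError("bases must be one of A, C, G or T")
    if ref == alt:
        raise ValueError("reference and alternative base are identical")
    # A transition stays within one chemical class (purine or pyrimidine);
    # a transversion crosses from one class to the other.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    print(classify_substitution("A", "G"))  # prints transition
    print(classify_substitution("A", "C"))  # prints transversion
```

Of the twelve possible substitutions, four are transitions and eight are transversions, so the higher observed frequency of transitions is not a consequence of their number.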

SNPs are dispersed throughout the genome such as in the promoter region, coding sequences, intronic sequences and non-coding regions. According to the single nucleotide polymorphism database the human genome contains more than 55 million SNPs. More than 6 million SNPs lie within genes (Serre and Hodson, 2006).

SNPs were the first generation of polymorphic genetic markers. Their use was realized in the late 1970s with the development of restriction fragment length polymorphism (RFLP) analysis (Roberts and Murray, 1976). An RFLP occurs when a mutation causes a loss or gain of the recognition site for a restriction enzyme. Restriction enzymes were discovered in 1968 (Meselson and Yuan, 1968) and they are of three types, designated Type I, II and III. Among them, Type II restriction enzymes are the most useful for genotyping. These restriction endonucleases recognize specific DNA sequences and cut the DNA within, or near, the recognition sequence. The first polymorphism in a restriction enzyme site was observed for the human β-globin structural gene with the restriction enzyme HpaI (Kan and Dozy, 1978).
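How a single base change creates an RFLP can be sketched in Python using the HpaI recognition sequence GTTAAC. The two short alleles below are invented toy sequences and not real β-globin sequence, and the helper name `fragment_lengths` is an assumption made for illustration. The sketch scans one strand only, which is sufficient here because the HpaI site is palindromic.

```python
HPAI_SITE = "GTTAAC"  # HpaI recognition sequence
CUT_OFFSET = 3        # HpaI cuts between GTT and AAC, leaving blunt ends


def fragment_lengths(sequence: str, site: str = HPAI_SITE,
                     cut_offset: int = CUT_OFFSET) -> list:
    """Return the fragment lengths (in bp) after complete digestion."""
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Hypothetical alleles that differ by one base inside the recognition site.
ALLELE_WITH_SITE = "ACGTACGTAC" + "GTTAAC" + "TTGCATGCAA"
ALLELE_WITHOUT_SITE = "ACGTACGTAC" + "GTTAAT" + "TTGCATGCAA"

if __name__ == "__main__":
    print(fragment_lengths(ALLELE_WITH_SITE))     # prints [13, 13]
    print(fragment_lengths(ALLELE_WITHOUT_SITE))  # prints [26]
```

The allele that retains the site yields two fragments, whereas the allele that has lost it remains uncut, and this difference in fragment length is what is scored on a gel.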


Since then many SNP genotyping methods such as heteroduplex analysis (Lichten and Fox, 1983), single-strand conformational polymorphism (Orita et al., 1989), enzymatic mutation detection (Youil et al., 1995), microarray or variant detector arrays (Dong et al., 2001; Hacia et al., 1999; Hacia and Collins, 1999; Marshall and Hodgson, 1998; Qi et al., 2001; Ramsay, 1998; Wang et al., 1998; Yoshino et al., 2001), high-throughput SNP genotyping (Jenkins and Gibson, 2002; McClay et al., 2002), and molecular beacon methods (Mhlanga and Malmberg, 2001) have been developed to construct high-density SNP maps. More recently massively parallel resequencing has revolutionized the pace of discovery of SNPs in individual genomes and the 1000 Genomes Project aims to catalogue SNPs occurring at frequencies of <1% in several diverse human populations (Wheeler et al., 2008).

In the present century SNPs have become the markers of choice for many applications in the forensic sciences and medical and evolutionary genetics. The recent discovery of large numbers of SNPs and the determination of their allelic frequencies in various populations provide a new approach to disease detection, anthropological studies and pharmacogenetic analyses which will benefit the biomedical sciences. Studies have identified genetic variation due to SNPs as one of the factors associated with susceptibility to many common diseases such as heart disorders, blood pressure (Koschinsky et al., 2001), Type II diabetes (Tsunoda et al., 2001), and asthma (Immervoll et al., 2001).

The discovery of millions of SNPs has greatly aided the fields of pharmacogenetics and pharmacogenomics, which aim to tailor drugs to a person’s genotype. The relationships between SNPs, disease and medicine are not the same among various populations or even among the individuals within a population. Due to the presence of variations in the target genes or drug metabolizing enzymes, some patients suffering from the same disease exhibit a life-threatening adverse reaction to a particular medicine while others fail to show any adverse reaction. Some show intermediate responses to the same drug. The genotype of an individual based upon SNP markers will soon allow the design of new and more efficacious drugs for individual patients.

SNPs have also helped in understanding how modern humans and their genome have evolved. In particular, SNPs found on the Y chromosome and mitochondrial DNA have been used to describe the origins and migrations of our male and female ancestors, respectively.

COPY NUMBER VARIANTS:

Copy number variations (CNVs) are structural variations in DNA sequence that occur due to differences in the number of copies of a particular genomic region. They evolve due to the duplication or deletion of DNA segments ranging from several kilobases (kb) to megabases in size (Feuk et al., 2006).

CNVs were first uncovered among normal, healthy human individuals soon after the completion of the Human Genome Project, and many studies have shown them to be as prevalent as SNPs and an important source of genetic variation, contributing to our uniqueness (Feuk et al., 2005; Hinds et al., 2006; Iafrate et al., 2004; Sebat et al., 2004; Sharp et al., 2005; Stefansson et al., 2005; Tuzun et al., 2005). It is estimated that about 12% of the human genome and thousands of genes differ with respect to copy number variation (Carter, 2007).

CNVs often encompass genes, and lead to dosage imbalances (Buckland, 2003; McCarroll et al., 2006; Repping et al., 2006). They have been shown to influence phenotypic variation, gene expression and gene dosage and are associated with several human diseases through these mechanisms. An increase in the copy number of the EGFR gene increases the risk of non-small cell lung cancer (Cappuzzo et al., 2005). Another study has demonstrated that a high copy number of CCL3L1 is associated with lower susceptibility to HIV infection (Gonzalez et al., 2005). A low copy number of FCGR3B (the CD16 cell surface immunoglobulin receptor) can increase susceptibility to systemic lupus erythematosus and similar inflammatory immune system disorders (Aitman et al., 2006).

The most widely used method to study CNVs is DNA microarray technology based on comparative genomic hybridization (CGH) using synthesized oligonucleotides. This technology has been useful in the detection of new CNVs and their association with normal and disease phenotypes (Carter, 2007). In the most complete worldwide analysis (Redon et al., 2006) the first-generation CNV map was constructed using two different microarray platforms: single-nucleotide polymorphism (SNP) genotyping arrays, and clone-based comparative genomic hybridization. In this survey a total of 1,447 copy number variable regions (CNVRs), covering 360 megabases (12% of the genome), were identified in 270 individuals that had been previously surveyed for SNPs (The International HapMap Consortium, 2005).

SATELLITE DNA:

Satellite DNA is located mainly in the darkly stained region of chromosomes referred to as heterochromatin. Its exact function is unclear (Csink and Henikoff, 1998; Henikoff et al., 2001) but transcription is limited in this region and it is thought to play a role in the structure and function of centromeres (Grimes and Cooke, 1998). It consists of large blocks of short tandem repeats. Although genotyping these repeats is not easy, they have been used in human evolutionary studies (Oakey and Tyler-Smith, 1990).

MINI-SATELLITE DNA:

Mini-satellite DNA, or the variable number of tandem repeats (VNTR) (Nakamura et al., 1987), was first identified in the human myoglobin gene (Jeffreys et al., 1985). It consists of intermediate size arrays of short tandem repeats, and thousands of arrays ranging from 0.1-20 kilobases (kb) are found in the euchromatic regions of eukaryotic chromosomes (Jeffreys, 1987).


Most mini-satellites are rich in GC content and clustered towards the ends of the chromosomes (i.e. telomeres) (Royle et al., 1988). The majority of mini-satellite DNA is transcriptionally inactive, but in some cases it is expressed, for example at the MUC1 locus (Swallow et al., 1987).

Mini-satellites are highly polymorphic (Wong et al., 1987) with heterozygosity values between 70 and 90% (Jeffreys et al., 1985) and their mutation rate is also higher in comparison to the classical genetic markers (Jeffreys et al., 1988). It is estimated that mutations occur at a frequency of 1-2% per gamete per generation, resulting in a new variant with a different repeat copy number in individuals and populations. Baird et al. (1986) were among the first to analyze two VNTR loci, HRAS-I and D14S1, in various populations.

MICROSATELLITE DNA:

Microsatellites, also referred to as short tandem repeat (STR) polymorphisms or simple sequence repeats (SSRs), are a special class of tandem repeats first recognized by Birnboim and Straus (1975) as “polypyrimidinic stretches”. The term microsatellite was coined by Litt and Luty (1989), and Edwards et al. (1991) coined the term STR.

STRs are composed of 1-6 base pair repeat units that follow each other in tandem (Tautz, 1989). Depending upon the number of bases in the repeat unit they are classified as mono-, di-, tri-, tetra-, penta-, or hexa-nucleotide repeats. The tetra-nucleotide repeat (GATA) and the array of TG di-nucleotide repeats were the first STRs identified in the human delta and beta globin genes (Miesfeld et al., 1981). Subsequently CA repeats were identified in the actin gene of cardiac muscle (Hamada and Kakunaga, 1982) and several other di-nucleotide repeats (GT or CA) were described by these groups (Epplen et al., 1982; Hamada et al., 1982) respectively. These repeats are found in the euchromatic regions of the chromosomes and do not generally cluster near the telomeric regions.
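The classification of STRs by the length of the repeat unit can be illustrated with a small Python sketch that searches a sequence for tandem arrays of 1-6 bp units. The function name `find_strs`, the threshold of four repeats and the example sequence are all illustrative assumptions; this is not the genotyping procedure used in this study.

```python
import re

REPEAT_CLASS = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def find_strs(sequence: str, min_repeats: int = 4) -> list:
    """Return (start, unit, copies, class) for each tandem array found.

    `min_repeats` is an arbitrary illustrative threshold.
    """
    sequence = sequence.upper()
    hits = []
    for unit_len in range(1, 7):
        # A unit of unit_len bases followed by at least min_repeats - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(sequence):
            unit = match.group(1)
            # Skip units that are themselves repeats of a shorter motif,
            # for example CACA, which is already reported as CA.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), unit, copies,
                         REPEAT_CLASS[unit_len] + "-nucleotide"))
    return sorted(hits)


if __name__ == "__main__":
    # Invented example: five GATA units and six CA units in unique flanking sequence.
    example = "TTC" + "GATA" * 5 + "CC" + "CA" * 6 + "G"
    for hit in find_strs(example):
        print(hit)
```

On the invented example the sketch reports one tetra-nucleotide array (GATA, five copies) and one di-nucleotide array (CA, six copies). An allele at an STR locus is then named by its number of copies.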

STRs constitute about 2% of the human genome and are more frequent than the mini-satellites. Estimates place the number of STR loci at approximately 100,000 in the human genome. Both mini-satellites and STRs can be produced by unequal crossing over and by DNA slippage during replication (Kruglyak et al., 1998; Toth et al., 2000). New STR alleles are thought to arise mostly by DNA slippage during replication (Di Rienzo et al., 1994; Jeffrey et al., 1993; Kimmel and Chakraborty, 1996; Shriver et al., 1993; Valdes et al., 1993).

In humans the di-, tri- and tetra-nucleotide repeats are more frequent in comparison with the large polymorphic repeats. Among all classes of STRs the most frequent are the di-nucleotide repeats that comprise 0.5% of the genome. They are highly polymorphic and tend to mutate more rapidly than the tri- and tetra-nucleotides (Chakraborty et al., 1997; Webster et al., 2002). The motifs of CA/TG repeats are present at a frequency of 1 per 36 kb whereas the AT/TA motifs are present at 1 per 50 kb. The less common AG/CT arrays are present at a frequency of 1 every 125 kb. The rarest di-nucleotide repeats are CG/GC repeats that are present at 1 per 10 Mb. Among the tri-nucleotides the most frequently found arrays are the ACC repeats followed by AGC, ACT and the less common ACG.

Genetic variation at STR loci makes them very useful genetic markers that have been extensively applied to human identification, especially in forensic cases (Budowle et al., 1998; 2001; Gill et al., 1994), linkage analysis of disease (Dietrich et al., 1992; Hearn et al., 1992; Jeffreys et al., 1985; Jeffreys and Pena, 1993; Queller et al., 1993; Todd et al., 1991) and as a powerful tool for the investigation of the human past and diversity (Bowcock et al., 1994). The multi-allelic variation at STR loci has been exploited by population geneticists to create a powerful, accurate and informative tool that has aided in reconstructing the evolutionary history of man and exposed the relationships between various world populations and languages (Ayub et al., 2003; Rosenberg et al., 2002).

A striking feature of STRs is their high mutation rate in comparison with SNPs. The average mutation rate for tri- and tetra-nucleotide repeats at autosomal loci is estimated to be between 7.0 × 10⁻⁴ and 9.3 × 10⁻⁴ (Zhivotovsky et al., 2000), and for Y-chromosomal STRs estimates range between 2.4 × 10⁻³ and 6.9 × 10⁻⁴ per locus per generation, depending upon whether the mutation rate is observed (Kayser et al., 2000) or inferred (Zhivotovsky et al., 2004).
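To give a sense of how much the choice between the observed and the inferred Y-STR rate matters, the sketch below compares the two figures quoted above. It is only an illustrative calculation: the function names, the example of a 10-locus haplotype and the assumption that loci mutate independently at a constant rate are introduced here and are not part of the cited studies.

```python
# Y-STR mutation rates quoted in the text (per locus, per generation).
Y_STR_OBSERVED = 2.4e-3   # observed rate (Kayser et al., 2000)
Y_STR_INFERRED = 6.9e-4   # inferred rate (Zhivotovsky et al., 2004)


def expected_mutations(rate, n_loci, generations):
    """Expected number of mutations on a haplotype of `n_loci` loci.

    ASSUMPTION: loci mutate independently at the same constant rate.
    """
    return rate * n_loci * generations


def generations_per_mutation(rate, n_loci=1):
    """Average number of generations between mutations on the haplotype."""
    return 1.0 / (rate * n_loci)


if __name__ == "__main__":
    ratio = Y_STR_OBSERVED / Y_STR_INFERRED
    print(f"observed / inferred rate: {ratio:.2f}")
    for label, rate in (("observed", Y_STR_OBSERVED), ("inferred", Y_STR_INFERRED)):
        # Hypothetical example: a 10-locus haplotype followed for 100 generations.
        print(label, round(expected_mutations(rate, 10, 100), 2))
```

Because the two rates differ by a factor of about 3.5, any date derived from accumulated STR variation scales by the same factor depending on which rate is adopted.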

Although there is some evidence that STR loci are neutral and not involved in any biological function, many studies show that some STRs, such as CA repeats, are involved in the enhancement of gene expression (Hamada et al., 1984). Many of them have binding sites for specific nuclear proteins (Richards et al., 1993), most of which promote homologous recombination (Treco and Arnheim, 1986). The tri-nucleotide STR loci are associated with several genetic diseases.

The first such association of the tri-nucleotide motif “CCG” was reported with fragile X syndrome (Fu et al., 1991; Kremer et al., 1991; Verker et al., 1991). In normal individuals 6-54 CCG repeats are located in the 5’ untranslated region of the fragile X mental retardation 1 gene (FMR1). In affected individuals these number between 52 and 1000 repeats. The meiotic instability of tri-nucleotide repeats is associated with over a dozen human diseases, such as X-linked spinal and bulbar muscular atrophy (SBMA) (La Spada et al., 1991) and myotonic dystrophy (Brook et al., 1992; Fu et al., 1992).

TRANSPOSABLE ELEMENTS:

The other class of repetitive DNA includes the interspersed repetitive non-coding DNA that occupies 45% of the human genome (International Human Genome Sequencing Consortium, 2001; Li et al., 2001). Polymorphisms of this class have also been linked with certain diseases. These repeats are derived from mobile DNA sequences, also called “transposable elements” (Prak and Haig, 2000; Smith, 1999). These elements have the ability to migrate from one region of the human genome and integrate into another region (Prak and Haig, 2000; Smith, 1996). To date, no mechanism for the removal of these elements is known.

The transposable elements can be classified into four groups:

A) Long interspersed nuclear elements (LINEs)

B) Short interspersed nuclear elements (SINEs)

C) Long terminal repeat (LTR) transposons (retrovirus-like elements)

D) DNA transposons.

Depending upon the transposition mechanism these four groups are broadly organized into two classes: 1) retrotransposons or retroposons and 2) DNA transposons.

Retrotransposons are transposable elements that make copies of themselves through reverse transcription and include the LINEs, SINEs and LTR elements. Cellular reverse transcriptases transcribe mRNA into cDNA, which is then integrated into any region of chromosomal DNA.

In DNA transposons the DNA sequences are excised and directly integrated into another place in the genome by a cut and paste mechanism. DNA transposons account for 3% of the human genome and virtually all human DNA transposons are non-functional (Strachan and Read, 2004).

The most successful and ancient transposable elements are the LINEs. These elements first appeared in eukaryotic genomes about 600 million years ago (Malik et al., 1999) and collectively comprise about 21% of the human genome. They are subdivided into three distantly related families: LINE 1, LINE 2 and LINE 3. Of these, LINE 1 is the only family that is still actively transposing (International Human Genome Sequencing Consortium, 2001).

LINE 1 is an important transposable element about 6.0 kilobases (kb) long. Recent estimates based on computational methods suggest that about 500,000 L1 fragments reside in the human genome and make up 17% of the genome (Lander et al., 2001; Smith, 1996). These elements are mostly found in AT-rich regions (Kongberg and Rykowski, 1988). The LINE 1 element contains two open reading frames, ORF1 and ORF2. ORF1 encodes a 40 kilodalton (kDa) RNA-binding protein while ORF2 encodes a 150 kDa protein, which has both endonuclease and reverse transcriptase activity (Feng et al., 1996; Mathias et al., 1991). The LINE 1 transcript moves from the nucleus to the cytoplasm, where it is translated to yield the ORF proteins. The LINE 1 RNA assembles with its own encoded proteins and re-enters the nucleus, where the L1 endonuclease cleaves one strand of DNA, preferably at the 5′-TTTT↓A-3′ consensus site (Cost and Boeke, 1998; Feng et al., 1996; Jurka, 1997; Morrish et al., 2002), and the reverse transcriptase uses the same site to prime reverse transcription from the 3′ end of the LINE RNA. At the time of integration, in most instances, reverse transcription fails to proceed to the 5′ end, resulting in a truncated, non-functional copy of the LINE 1 element.

In the human genome about 99.8% of the copies of LINE 1 elements are defective (Gilbert et al., 2002; Kazazian and Moran, 1998; Myers et al., 2002; Ostertag and Kazazian, 2001; Sassaman et al., 1997), with an average size of 900 bp (Lander et al., 2001). It is estimated that approximately 40 elements of the L1 family are still functional and produce new copies (Sassaman et al., 1997). At least 1 in every 50 humans has a new genomic L1 insertion. These insertions occur in the parental germ cells or during early embryonic development (Goodier et al., 2001; Luningprak et al., 2003; Ostertag et al., 2002). The functional significance of this occurrence is unknown, but these new copies can be used as genetic markers, such as the L1 insertion in the centromeric alphoid array of the human Y chromosome designated as LY1 (Santos et al., 2000). Sometimes these insertions can lead to disease, as in the case of hemophilia B (Brooks et al., 2003; Kazazian et al., 1988).

SINEs comprise 13% of the human genome. These sequences are 100-400 bp long and include the Alu repeats, which are dispersed throughout the human genome. Unlike LINE elements they do not encode any protein and use the LINE machinery for their transposition (Kajikawa and Okada, 2002). All, except one, of the families of SINE elements originated from tRNA. The only exception is the Alu family, which originated from the signal recognition particle component (SRP 7SL) RNA (Ullu and Tschudi, 1984).

The Alu elements are about 300 bp long and they constitute 10.7% of the human genome. The Alu insertions are postulated to have occurred early in primate evolution, about 30-65 million years ago (mya) (Batzer et al., 2002; Deininger et al., 1992; Deininger and Daniels, 1986; Deininger and Slagel, 1988; Kapitoov, 1996; Labuda et al., 1991; Shen et al., 1991). A subfamily of these Alu repeats, termed human specific (HS) repeats (Batzer et al., 1990), appeared in the human genome within the last 6 million years (Batzer et al., 1991; Batzer and Deininger, 1991). Approximately 75% of these HS repeats are present in all human populations, indicating that they were inserted early in human evolution and were completely fixed before the migration of humans from Africa (Deininger et al., 1999).

Alu repeats have also proven to be extremely useful genetic markers (Myers et al., 2002; Watkins et al., 2001). About 25% (400 sites) of these recent Alu insertions are variable among world populations and highly informative in ascertaining the relationships between human populations. Several Alu insertions are associated with human diseases such as hypertension (Barley et al., 1996; Duru et al., 1994; Jeng et al., 1997), myocardial infarction (Ludwing et al., 1995), ventricular hypertrophy (Schunkert et al., 1994) and cardiomyopathy (Raynolds et al., 1993).


In the human genome 8.5% of repetitive DNA belongs to the LTR class, which comprises autonomous and non-autonomous elements. About 4.7% of the human genome is occupied by the autonomous endogenous retroviral sequences (ERV). These human ERVs (HERVs) comprise many sub-families and show a small number of polymorphisms (Turner et al., 2001). Many of the LTR elements are defective and transposition has been rare. The non-autonomous LTR elements consist of the MaLR family, which accounts for about 3.8% of the human genome. This family lacks the pol gene and at times the gag gene.

Over the past decade the genetic variation of these DNA based markers has been exploited to unravel the paternal and maternal lineages of modern humans and the relationships among them (Cavalli-Sforza, 1994; Hammer et al., 1997; Quintana-Murci et al., 1999 a, b and c). The current study was designed to use polymorphic markers to uncover the genetic history of ethnic groups residing in present day Pakistan and to provide a basis for further analyses of these populations in genetic association and disease susceptibility studies.

THE GENETIC

The modern state of Pakistan was established on August 14, 1947, but the region where it is located, the Indo-Pak subcontinent, has been of importance throughout human history. The country lies on the postulated southern coastal route that modern humans took from Africa to Australia.

The earliest evidence indicates that humans were present in this region around 100,000-150,000 years ago, but the fossil record is non-existent. sites have been found in the Peshawar Valley in the north-west and at , in the south-east in the province of Baluchistan (Jarrige, 1991). The evidence found at Mehrgarh indicates a modern human settlement dating to around 7,000 B.C. This predates the region's other early civilization, the Indus Valley civilization, found throughout the sub-continent with major centres at Harappa and Mohenjo-Daro in Pakistan. This civilization flourished in the 3rd and 2nd millennia B.C. (2,500-1,500 B.C.).

Due to its geostrategic importance as the gateway to India, this region was invaded many times. Around 1,500 B.C. the Indo-European speaking nomadic pastoral tribes, the so-called Aryans, entered this region through the Hindu Kush Mountains and established their supremacy, replacing the Dravidian language speakers who are thought to have been there initially. Their rule lasted from about 1,500 B.C. to 500 B.C., when this region was occupied by the Persian Empire. In 326 B.C. this region was conquered by Alexander the Great. Subsequently it was conquered by the Mauryas (305 B.C.), (97 B.C.), (711 A.D.), Turks (1001), Mughals (16th cen.) and lastly by the British Empire.

India and Pakistan house many different races and languages and are often referred to as "a museum of races." Present day Pakistan has a population of over 170 million (Pakistan Economic Survey, 2006-2007) and consists of more than 12 ethnic and linguistic groups, the majority being descendants of the invader stocks. Ethnic groups from the southern part of Pakistan include the Baloch, Brahui, Makrani Baloch, Makrani Negroid, Parsi and Sindhi. Major populations represented by the northern groups include the Balti, Burusho, Kalash, Kashmiri, Pathan and Punjabi. The latter form the majority population of this country and include several castes. Linguistic groups found in Pakistan include speakers of a language isolate and of Dravidian, Sino-Tibetan and Indo-European languages. Indo-European languages are spoken by a majority of the population.

STUDY OBJECTIVE:

The main objective of the study is to shed light on the population histories of numerous ethnic groups living in modern day Pakistan. Earlier studies used only a limited number of polymorphic Y chromosomal markers (Qamar et al., 1999, 2002) and since then many more informative Y-SNPs have been discovered (Karafet et al., 2008) which have not been typed in this population. Another limitation of the earlier work was the lack of samples from the Punjab, whose people constitute the majority population of Pakistan, and this has been addressed in this study.

The study aims to screen Y chromosomal variation in a large number of Pakistani males from various ethnic and linguistic backgrounds in order to understand population origins and substructure and unravel the influence of Central Asia, China, Greece and Persia on this population. Statistical analyses and simulation modeling are used to identify geographic origins of population groups, episodes of genetic bottlenecks, demographic expansions and genetic admixture. It is my hope that these analyses will improve our knowledge of group membership within Pakistan, which will have practical applications in DNA based human forensic analyses and the design of disease association studies, and have implications for rationalizing the use of medicines tailored to an individual’s genetic make-up.


LITERATURE REVIEW

3

PAKISTAN AND ITS POPULATIONS

Pakistan lies in a region that has seen the passage of many invaders and all have contributed to the racial and linguistic diversity found in this country. It is bordered by China in the north, India in the east, and on the west and the Indian Ocean straddle the southern coast line. The Pakistani population according to the Ministry of Finance is estimated to be 156,770,000 (Pakistan Economic Survey, 2006-2007) but the World Health Organization estimates the number to be much higher.

Pakistan consists of four provinces, the northern areas and the Federally Administered Tribal Areas (FATA), which are located on the Afghan frontier. More than 18 ethnic and 60 linguistic groups (Grimes, 1992) reside in this country. Major ethnic groups include the Baloch, Brahui, Pathans, Punjabis and . The majority Punjabi speaking populations show a great and complex admixture of many ethnic castes and groups (Ibbetson, 1883) such as the Gujar, Jats, Meos, Rajput and Arians. Other ethnic groups that are of anthropological interest include the Makrani-Negroid, Mohanna and Parsi in the south and the Balti, Burusho, Kalash and Kashmiri in the north. Of particular interest is the Hazara population, which resides in Baluchistan and the North West Frontier Province (N.W.F.P.). The geographic locations of the above mentioned Pakistani populations are shown in Figure I and their possible origins and linguistic affiliations are listed in Table I.


Figure I. Map of Pakistan showing its neighbours, administrative regions and the geographical distribution of the populations that are included in this study.


Table I: The possible origins and language affinities of Pakistani populations. The numbers in brackets refer to the population size.

Location  Population            Language       Suggested Origins
North     Balti (300,000)       Sino-Tibetan   Tibet
          Burusho (60,000)      Isolate        Greek; Central Asian
          Hazara                Indo-European  Genghis Khan’s soldiers
          Kalash (5,000)        Indo-European  Greece; ?
          Kashmiri              Indo-European  Jewish; Indo-Aryans
          Pathan (17,000,000)   Indo-European  Jewish; Greek; Admixture
          Punjabi (63,000,000)  Indo-European  Admixture
South     Baloch (4,000,000)    Indo-European  Aleppo, Syria
          Brahui (1,500,000)    Dravidian      West and Central Asia
          Makrani Baloch        Indo-European  West Asia
          Makrani Negroid       Indo-European  Africa
          Mohanna               Indo-European  Indigenous fishermen
          Parsi (~2000)         Indo-European  Persia/Iran
          Sindhi (15,300,000)   Indo-European  Admixture


Three major Pakistani populations, the Baloch, Brahui and Makrani, reside in the province of Baluchistan and constitute the southern group. Historians believe that the Baloch migrated from West Asia to . They claim that they are of Semitic stock and that between the 1st and 2nd millennium B.C. their homeland was the ancient region of Nineveh and Babylon in modern day Iraq. From there they migrated to Iran, Afghanistan and Pakistan. Many Baloch tribesmen reside in south-east Iran as well. Some historians also claim that they came from Aleppo in Syria in 682 A.D. (Quddus, 1990), when at least 44 tribes migrated to Iran. Their movement into Pakistan is considered to be recent. At the beginning of the 10th century they moved from Iran and occupied Sistan, and as a result of the Seljuq invasion they settled on land of . In the fifteenth century they migrated eastward and settled in Kachi. Now they occupy the area of Sibi and the Loralai District of Division in Pakistan (Marri, 1985).

The Brahuis are considered to be the descendants of Turko-Iranian tribes that migrated from west and central Asia and settled in the Sarawan and Jhalawan regions of Kalat State in Baluchistan (Hughes-Buller, 1991; Quddus, 1990). They are the only group in Pakistan that speaks a Dravidian language.

The dry and arid southwestern Makran coast of Pakistan is home to two distinct populations of Makranis: the Makrani-Baloch and the Makrani-Negroid. The Makrani-Baloch express linguistic and ethnic affiliation with the neighboring Baloch tribes (Grimes, 1992). However, many Makrani have Negroid features and are referred to as Makrani-Negroid. It has been hypothesized that they originated in Africa and migrated to Pakistan along the coastal route.

Another population that resides in Baluchistan, mainly in and around the provincial capital, Quetta, is the Hazara. The name Hazara is derived from the Persian word meaning thousand. This population is also found in the town of Parachinar in the NWFP and is widespread in Afghanistan. They have typical Mongol features and claim descent from a detachment of a thousand soldiers left behind by Genghis Khan during his invasion of India. Historical records show that they settled in Pakistan to escape persecution in neighboring Afghanistan.

The other populations from southern Pakistan include the Sindhi, Mohanna and Parsi, all of whom reside in the south-eastern province of Sindh. The Sindh province is referred to in several ancient texts as Sindomana by the Greeks and Sindhudesha by ancient . This region was conquered by the Greeks, Parthians, Brahmans, Arabs and finally by the British. Mohenjo-Daro, the jewel of the Indus Valley Civilization, is located here. As a result of multiple invasions and migrations the Sindhis are considered to be an ethnically mixed population of Indo-European speakers. The Mohanna are another Indo-European population of fishermen who have been residing on the banks of the River Indus for centuries. Little is known about their origins.

The suggested origin of the Parsi is in Persia (Nanavutty, 1997). They are the followers of the Iranian prophet Zoroaster who migrated from Iran to the state of Gujarat in northwest India in the 7th century A.D. after the collapse of the Sassanian Empire. Many Parsis eventually settled in in India and in Pakistan, although very few remain in Pakistan.

Several populations reside in the northern part of Pakistan. The Pathans reside in the North West Frontier Province (N.W.F.P.) and its adjoining tribal areas. They also inhabit the southern and eastern parts of Afghanistan and the Baluchistan province of Pakistan. They are also known as Pushtuns, Pakhtuns or Afghans and are an Eastern Iranian ethno-linguistic group formed by the amalgamation of several tribes practicing a traditional code of conduct and honor. They claim to be descendants of soldiers who came with Alexander the Great, and several historical sources suggest that they are of Semitic stock (Caroe, 1958).

Northern Pakistan is also home to some unique ethno-linguistic populations. Among them are the Balti, Burusho and Kalash. Baltis speak a Sino-Tibetan language and their suggested origin is in Tibet (Dani, 1991). They reside in Baltistan, the north-eastern Himalayan region of Pakistan.

The Burusho, one of the isolated northern populations, also believe that they are the descendants of Greek generals who came to the subcontinent with Alexander the Great in 327-323 B.C. (Biddulph, 1977). They reside in the Hunza, Nagar and Yasin Valleys in the Karakorum Mountains and are the only population in Pakistan who speak a language isolate.

The Kalash also claim descent from Greek Macedonia, citing Alexander’s invasion of the subcontinent. They reside in the valleys of Bumburet, Rambur and Birir near in the Hindu Kush Mountain ranges in the NWFP. They have been extensively studied by anthropologists for their unique culture and traditions (Lines, 1999).

DEMOGRAPHIC HUMAN HISTORY

Human diversity occurs as a result of multiple events during human evolution, migration, and colonization (Lahr and Foley, 1994). Studies reveal that human history can be deciphered from the analyses of the human genome. The genomic variation in human individuals and populations contains enough information to allow the reconstruction of human population history, migration patterns and population structure.

At the beginning of the 20th century data obtained from protein markers led to insights into human origins, divergence and demographic history (Cavalli-Sforza, 2005). However, in recent years DNA based markers have proved to be more efficient tools for addressing questions of human evolution and migration. An informative DNA marker should be both highly polymorphic and selectively neutral. DNA markers on the non-recombining portion of the human Y chromosome and on the mitochondrial DNA are polymorphic markers that have been successfully applied to shed light on human evolutionary history from the male and female perspective, respectively.


Y CHROMOSOMAL VARIATIONS

Y-chromosomal DNA polymorphisms were first reported in 1985 (Casanova et al., 1985; Lucotte and Ngo, 1985). Since then more than 600 binary polymorphisms, the majority of them being SNPs, and numerous multi-allelic STR markers have been identified on the human Y-chromosome (Karafet et al., 2008).

Since most of the Y chromosome does not undergo recombination these biallelic polymorphisms define unique mutational events and, therefore, unique Y chromosomal “haplogroups.” The presence of numerous biallelic polymorphisms allows their organization in the form of a phylogenetic tree that shows relationships among the various Y haplogroups. Efforts by the Y Chromosome Consortium (YCC) have led to the development of a standardized nomenclature system for such a tree. The initial tree, based upon approximately 200 markers (Jobling and Tyler-Smith, 2003; Y Chromosome Consortium, 2002), was recently revised to identify 311 distinct Y haplogroups (Karafet et al., 2008). The phylogenetic tree is rooted with respect to the ancestral state inferred from non-human primate sequences.
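The way binary markers place a Y chromosome on such a tree can be pictured as a walk from the root towards the tips, following at each step the branch whose defining marker is in the derived state. The sketch below is a deliberately partial toy tree, not the YCC tree itself: it uses only a handful of the clade-defining markers described later in this chapter, and the internal node labels (`BT`, `CT`, `DE`, `NO`), the placement of clade C under M168 and of the M214 branch under M9, and the function name `assign_clade` are assumptions made here for illustration.

```python
from collections import defaultdict

# (clade, parent, defining marker). Partial toy tree for illustration only.
# ASSUMPTION: the internal labels BT, CT, DE and NO are hypothetical node names.
TREE = [
    ("A", "root", "M91"),
    ("BT", "root", "M42"),
    ("B", "BT", "M60"),
    ("CT", "BT", "M168"),
    ("C", "CT", "M130"),
    ("DE", "CT", "YAP"),
    ("D", "DE", "M174"),
    ("E", "DE", "M96"),
    ("F", "CT", "M89"),
    ("G", "F", "M201"),
    ("H", "F", "M69"),
    ("I", "F", "M170"),
    ("J", "F", "M304"),
    ("K", "F", "M9"),
    ("NO", "K", "M214"),
    ("N", "NO", "M231"),
    ("O", "NO", "M175"),
    ("P", "K", "M45"),
    ("Q", "P", "M242"),
    ("R", "P", "M207"),
]

# Index the children of every node once.
CHILDREN = defaultdict(list)
for clade, parent, marker in TREE:
    CHILDREN[parent].append((clade, marker))


def assign_clade(derived_markers):
    """Return the deepest clade supported by the set of derived markers."""
    derived = set(derived_markers)
    node = "root"
    while True:
        hits = [clade for clade, marker in CHILDREN[node] if marker in derived]
        if not hits:
            return node  # no deeper branch is supported
        if len(hits) > 1:
            # Sister branches cannot both be derived on a non-recombining tree.
            raise ValueError(f"conflicting marker calls below {node}: {hits}")
        node = hits[0]


if __name__ == "__main__":
    print(assign_clade({"M42", "M168", "M89", "M9", "M45", "M207"}))
```

A chromosome that is derived for M89 but for none of the markers further down the toy tree is reported as `F`, which mirrors the use of paragroup labels such as F* for chromosomes that cannot be resolved further.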

The Y lineages on the phylogenetic tree comprise 20 major haplogroup clades designated A–T (Figure II). Karafet et al. (2008) refer to these as in order to differentiate them from the 311 haplogroups that are identified by terminal mutations, but earlier studies use these terms interchangeably. Y chromosomes identified by STRs are designated as “haplotypes,” and those that are defined by the combination of biallelic markers and STRs are called “lineages”, as proposed by de Knijff (2000). A brief description of the salient features of the major Y haplogroup clades follows:

HAPLOGROUP A:

Haplogroup or clade A* contains 12 additional haplogroup branches, all restricted to Africa (Hammer et al., 2001; Underhill et al., 2001). All individuals that fall in this group carry the ancestral state for M42, M94, M139 and SRY10831.1 and the derived state for M91 and P97 (Karafet et al., 2008). The M91 lineage is subdivided into three main haplogroups characterized by derived alleles for the markers P108, M6 and M32. These haplogroups have been mainly observed in the and Bantu speakers from South Africa, Pygmies from and in the Sudanese, Ethiopian and Mali populations of East Africa (Hammer et al., 2001; Semino et al., 2002; Underhill et al., 2001; Wood et al., 2005).

HAPLOGROUP B:

Clade B* haplogroups are characterized by derived alleles for the M60 SNP. They are also derived for the markers M42, M94 and M139. All 17 branches of clade B* are frequently found in sub-Saharan Africa. The major sub-clades are B1*, defined by M236, and B2*, defined by M182. Sub-clade B1a, defined by the M146 marker, is mainly found in Mali. The B2* cluster has several haplogroups, one of which, B2a*-M150, is frequently observed in East Africa ( and ). The B2b* haplogroups (M112 or M192 derived Y chromosomes) are found in Pygmies from central and southern Africa (Cruciani et al., 2002; Hammer et al., 2001; Jobling and Tyler-Smith, 2003; Semino et al., 2002; Underhill et al., 2001; Wood et al., 2005).

The distribution and expansion of clades A* and B* suggest that these Y chromosomes spread very early within the African continent. This is supported by the palaeo-anthropological record of human population expansions throughout Africa, north and south of the Sahara Desert, eventually reaching the Levant about 130,000-90,000 years ago (Lahr and Foley, 1998).

HAPLOGROUP C:

A total of 30 mutations and 19 haplogroups are currently reported for this clade. It is defined by five mutations, the hallmark being the synonymous RPS4Y711 C to T transition (also referred to as M130) in the exon of the RPS4Y gene, which was among the earliest Y chromosomal polymorphisms to be identified (Fisher et al., 1990). This clade has not been found in sub-Saharan Africa and the mutations defining this haplogroup probably occurred in Asia after the migration of modern humans out of Africa. Walter et al. (2000) have suggested that this mutation originated in south Asia about forty to fifty thousand years ago with the dispersal of modern humans from the Horn of Africa via a coastal or interior route towards south Asia. The haplogroup is frequent in populations from Central and East Asia. It is also found in many indigenous Australasian and Polynesian populations and in Native American Indian tribes (Capelli et al., 2001; Hammer et al., 2001; Hudjashov et al., 2007; Karafet et al., 2001; Kayser et al., 2006; Ke et al., 2001; Kivisild et al., 2003; Scheinfeldt et al., 2006; Underhill et al., 2001; Zegura et al., 2004).

HAPLOGROUPS D and E:

A Y Alu polymorphism (YAP) defines these haplogroups. All Y chromosomes belonging to these branches have an Alu insertion. Clade D* is restricted to Asian populations, mainly in Japan and Tibet (Su et al., 2000; Karafet et al., 2001). The 15 haplogroups that are part of this clade are all characterized by the presence of the M174 T to C transition (Underhill et al., 2000). These are scattered throughout South-East Asia and among Andaman Islanders (Hammer et al., 2006; Thangaraj et al., 2003).

Clade E* is more mutationally diverse and widespread, with 56 distinct haplogroups (Karafet et al., 2008). Y chromosomes belonging to clade E* have been found in Africa, the Levant, Europe, Central and South Asia (Hammer et al., 1998; Underhill et al., 2001). Clade E* haplogroups are derived for several markers including M96 and SRY4064. The major sub-clades are E1* and E2*, which are characterized by derived alleles for P147 and M75. The topology and nomenclature of this branch have been recently revised with the discovery of several novel mutations. Important sub-clades of E* include E1b1*, which is derived for the P2 polymorphism and accounts for 80% of clade E haplogroups. M2 or sY81 derived haplogroups (E1b1a*) are present at high frequencies in sub-Saharan Africa, whereas the E1b1b* haplogroups defined by the M215 mutation are frequently observed in north and east Africa, the Mediterranean basin and Europe (Hammer et al., 1997). It has been suggested that clade E* haplogroups were spread by the Bantu farmers during the Neolithic period (Passarino et al., 1998; Scozzari et al., 1999). The representatives of these haplogroups traveled from the Middle East to southern Europe and northern India and Pakistan (Cruciani et al., 2002; Hammer et al., 1998; Semino et al., 2004; Sims et al., 2007; Underhill et al., 2001).

HAPLOGROUP F:

M168-derived haplogroups that carry the derived allele for the M89 C to T transition are frequent in non-African populations. Besides M89 and M213 (Underhill et al., 2000) this clade is now also identified by several markers discovered by Hammer et al. (2001). The haplogroup probably arose in East Africa about 45,000 years ago and dispersed to Eurasia through the Levantine corridor. Underhill et al. (2001) have suggested that the African ancestors first migrated to the Middle East around 40,000 years ago and eventually expanded towards the west, east and north, giving rise to several major clades (G–T) of the Y phylogenetic tree. F* is found mainly on the and in Sri-Lanka (Kivisild et al., 2003; Sengupta et al., 2006).

HAPLOGROUP G:

Characterized by the M201 and P257 mutations, this haplogroup is present in South East Europe, the Mediterranean region, Anatolia, West and Central Asia (Behar et al., 2004; Cinnioglu et al., 2004; Jobling and Tyler-Smith, 2003; Regueiro et al., 2006; Sengupta et al., 2006) and North (Nasidze et al., 2003).


HAPLOGROUP H:

Found almost exclusively in the Indo-Pak subcontinent these haplogroups are characterized by M69, a T to C mutation. The 10 currently identified haplogroups within this clade are separated into two major clusters: H1* and H2*. H1* clade is defined by the M52 A-C transversion whereas the H2* haplogroup is characterized by the Apt G to A transition. Both have been observed in India but only H1* has been reported in populations from Pakistan (Jobling and Tyler-Smith, 2003; Karafet et al., 2005; Sengupta et al., 2006).

HAPLOGROUP I:

It is one of the major clades found in European populations and was defined initially by the M170 A-C transversion. It is thought that this mutation was acquired during the early expansion of Levantine populations towards the west. Clade I comprises 16 haplogroups. It is found at high frequency in North Europeans (Hammer et al., 2001; Jobling and Tyler-Smith, 2003; Rootsi et al., 2004).

HAPLOGROUP J:

This is one of the major clades and is defined by the 12f2a and, more recently, the M304 deletion and P209 marker (Karafet et al. 2008). It has two main branches: J1*, which is M267 derived, and J2*, which is derived for M172 (Cinnioglu et al., 2004; Underhill et al., 2000). The J* clade and its branches probably arose in the Middle East and Anatolia () from where they spread to west Asia and Eurasia (Hammer et al., 2000; Semino et al., 2004). It is frequent in both India and Pakistan (Mohyuddin et al., 2006).


HAPLOGROUP K:

This haplogroup is a mixed bag characterized by derived alleles for the M9 (C-G transversion) marker (Underhill et al., 1997). Its low incidence in Africa suggests that the mutation occurred after the migration out of Africa. A recent survey by Karafet et al. (2008) demonstrated derived states for an additional three markers (P128, P131 and P132) for this haplogroup. The K1 branch, derived for M147, has been observed in populations from the Indo-Pak subcontinent (Underhill et al., 2001). The K2 branch has been re-designated as haplogroup T* (Karafet et al., 2008).

HAPLOGROUP L:

The L* lineage probably arose in West Asia in a pre-Holocene era and was initially identified in samples from the Indus Basin in Pakistan (Underhill et al., 2000). One branch, L1 (derived for M27 and M76), probably arose in the Indo-Pak subcontinent. It is absent in North-East India and found at a low frequency in Central India and the northern regions of India and Pakistan. Its highest frequency, in South India and South-West Pakistan, suggests that this lineage originated in the Indian Peninsula (Sengupta et al., 2006). Other branches of haplogroup L* are present in the Middle East, Central Asia, Northern Africa and Europe and along the Mediterranean coast (Cinnioglu et al., 2004; Cruciani et al., 2002; Jobling and Tyler-Smith 2003; Sengupta et al., 2006).

HAPLOGROUP M:

Characterized by the P256 SNP this clade is predominantly found in

Indonesia, Melanesia, Papua and New Guinea (Capelli et al., 2001; Hurles et al.,

2002; Kayser et al., 2006; Scheinfeldt et al., 2006; Su et al., 2000). Currently 20 mutations characterize the 12 haplogroups found within this branch (Karafet et al.,

2008).


HAPLOGROUPS N and O:

The A to G transition of M214 identifies the ancestor of two major clades, N* and O*. M231 and LLY22g characterize clades N* and N1*, and the M175 deletion characterizes clade O* (Cinnioglu et al., 2004). Haplogroup N* probably originated in Asia but is now predominantly found in European populations (Karafet et al., 2001; Rootsi et al., 2007).

Clade O* is found at high frequency in East Asians. A major branch of this clade, characterized by the Y-SNP M122 (O3*-M122), predominates in East Asia and is found in a majority of the Chinese population. The microsatellite diversity in this sub-haplogroup is highest in South Chinese populations, indicating that it appeared there before expanding northwards approximately 30,000-25,000 years ago (Shi et al., 2005).

HAPLOGROUPS P, Q and R:

Clade P* is defined by the presence of 92R7, M45, M74 and several other

SNPs that are derived for the M9 mutation as well. This clade includes several major groups that are prevalent in various world populations.

Haplogroup Q* (derived for the C to T M242 mutation) probably arose in

Central Asia from where these chromosomes spread throughout the world (Seielstad et al., 2003). These Y chromosomes are found at high frequency in North Eurasia and Siberia (Karafet et al., 2002) and at lower frequencies in Europe, East Asia and the Middle East. One major branch of this haplogroup (Q1a3a*-M3) is almost exclusively restricted to the Native Americans (Zegura et al., 2004).

Eight mutations, including the M207 A-G SNP, represent clade R. This clade is further divided into two sub-clusters, R1*-M173 and R2-M124. It is assumed that around 30,000 years ago chromosomes carrying the R*-M207 mutation expanded westwards to reach

Europe, Caucasus, Middle East, Central Asia, northern India and Pakistan. The R1* haplogroup is one of the most common in Europe and west Asia and probably


originated in central Asia. The R1a1*-M17 clade that is characterized by deletion of the G nucleotide (Underhill et al., 1997) is frequently found in south-west Pakistan and (Jobling and Tyler-Smith, 2003).

HAPLOGROUPS S and T:

A reexamination of the Y phylogenetic tree led to the addition of haplogroups

S* and T* characterized by markers M230 and M184, respectively (Karafet et al.,

2008). Haplogroup S* chromosomes were previously characterized as K-M230 while those now belonging to clade T* were previously identified as haplogroup K-M70 (Y

Chromosome Consortium, 2002).

Clade S* lineages are also identified by P202 and P204 markers and are found in

Oceania and Indonesia (Kayser et al., 2006; Scheinfeldt et al., 2006). Clade T* that is also characterized by M70, M193 and M272 is further delineated by M320 and P77 and has been observed in the Middle East, Africa, and Europe (Underhill et al., 2001;

King et al., 2007).


MATERIALS AND METHODS

- 4 -

COLLECTION OF SAMPLES:

For this study, blood samples were collected from 1,213 unrelated male subjects belonging to sixteen ethnic groups of the Pakistani population. Informed consent was obtained from all participants included in this study. The ethnicity of the sampled individuals was confirmed prior to collection.

From each individual, 10 ml of blood was collected in Vacutainer tubes (Becton Dickinson, Mountain View, CA). The 66 Baloch and 117 Brahui samples were collected from the Quetta and Kalat Divisions in Baluchistan. The 97 Burusho samples were collected from Hunza and Nagar in the Northern Areas. The 224 Hazara samples were collected from the areas of Parachinar and Quetta. The 44 Kalash samples were collected from Chitral Division. The 90 Parsi and 14 Balti blood samples were collected from Karachi. The 96 Pathan samples were collected from the North-West Frontier Province. The 138 Sindhi samples were collected from Sukkur in Sindh. The 16 Meo, 10 Rajput and 159 Gujar samples were collected from the rural areas of Punjab Province. The 27 Makrani-Baloch, 33 Makrani-Negroid and 70 Mohanna samples were collected from the interior of Sindh Province. The 12 Kashmiri samples were collected from Muzaffarabad (Kashmir).

The 77 Greek DNA samples were provided by Dr. Myrto Papaioannou (Unit of

Prenatal Diagnosis, Center for Thalassemia, Laiko General Hospital, Athens,

Greece).
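
As a cross-check on the sample numbers given above, the per-group counts can be tallied programmatically. The following is a minimal sketch in Python; the names SAMPLE_COUNTS and total_samples are illustrative only and were not part of the study workflow.

```python
# Sample counts per Pakistani ethnic group, transcribed from the
# collection description above. All names here are illustrative.
SAMPLE_COUNTS = {
    "Baloch": 66, "Brahui": 117, "Burusho": 97, "Hazara": 224,
    "Kalash": 44, "Parsi": 90, "Balti": 14, "Pathan": 96,
    "Sindhi": 138, "Meo": 16, "Rajput": 10, "Gujar": 159,
    "Makrani-Baloch": 27, "Makrani-Negroid": 33, "Mohanna": 70,
    "Kashmiri": 12,
}


def total_samples(counts):
    """Return the total number of samples across all groups."""
    return sum(counts.values())


if __name__ == "__main__":
    # Number of groups and overall total, for comparison with the text.
    print(len(SAMPLE_COUNTS), "groups,", total_samples(SAMPLE_COUNTS), "samples")
```

The tally covers the Pakistani samples only; the 77 Greek samples are counted separately.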

PREPARATION OF EPSTEIN-BARR VIRUS FROM B95-8 CELLS:

The Epstein-Barr Virus (EBV) producing B95-8 marmoset cell line (American

Type Culture Collection, Manassas, VA) was suspended (5 x 10⁶ cells) in 10 ml of wash medium, which consisted of RPMI-1640 (Sigma-Aldrich, St. Louis, MO) supplemented with 1% fetal calf serum (FCS; Biochrom AG, Berlin, Germany) and 1X GPPS (2 mM L-glutamine, 100 U/ml penicillin, 1 mM sodium pyruvate and 50 µg/ml streptomycin), and centrifuged in an IEC-HN-SII bench top centrifuge (International Equipment Company, Needham, MA) at 1000 rpm (300g) for 10 minutes. The supernatant was decanted and the pellet was washed twice in 5 ml of wash medium, each wash followed by centrifugation at 1000 rpm for 10 minutes. The cells were transferred into a 25 cm² culture flask (Corning, Corning, NY) containing RPMI-1640 medium supplemented with 1X GPPS and 10% FCS. The flask was incubated at 37 °C in a humidified atmosphere of 93% air and 7% CO₂. The culture was gradually expanded and split first into 75 cm² and finally into 125 cm² flasks. When the medium in the culture flasks became yellow, the flasks were incubated at 34 °C without any additional medium supplementation for 7 days to enhance EBV production. On the 8th day the cells were removed from the suspension by centrifugation at 1000 rpm for 10 minutes. The supernatant containing EBV was filtered through a 0.45 µm Millipore membrane filter (Nilsson, 1976). The EBV supernatant was aliquoted into cryovials (Corning, Corning, NY) and stored at -70 °C until use. A 1 ml aliquot of this preparation was able to transform human B lymphocytes.

PREPARATION OF LYMPHOCYTES:

For the isolation of peripheral blood mononuclear cells (PBMC), approximately 5 ml of venous blood was collected in acid citrate dextrose (ACD) Vacutainer tubes (Becton Dickinson, San Jose, CA). The blood was layered over 3 ml of Histopaque-1077 (Sigma-Aldrich) in a sterile 15 ml polypropylene conical tube (Corning, Corning, NY). Each sample was centrifuged at 2000 rpm (400g) for 20 minutes. The upper plasma layer was aspirated, and the PBMC were collected from the interface between the plasma and Histopaque, transferred into another sterile 15 ml tube containing 10 ml of wash medium and centrifuged at 1000 rpm for 10 minutes.

The supernatant was decanted and the cell pellet was washed twice with 5 ml of wash medium and resuspended in 1 ml of wash medium (Boyum, 1968). Cell viability was checked by the trypan blue exclusion test.


CELL COUNTING BY TRYPAN BLUE EXCLUSION TEST:

Cell viability was calculated by the trypan blue exclusion test as described by

Kruse (1973). An equal volume (10 µl) of cell suspension was mixed with 0.16% (w/v) trypan blue solution in physiological saline. Cells were counted using a haemocytometer. Unstained live cells and blue-stained dead cells were counted in the central 1 mm square of the counting chamber. The cell viability was calculated by the following formula:

Cell viability (%) = (number of live cells ÷ total number of cells) x 100.

The total number of live cells per ml was calculated as follows:

Live cells per ml = number of live cells x 2 (dilution factor) x 10⁴.
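
The two calculations above can be expressed as a short sketch in Python. The function names are illustrative, and the example counts are hypothetical values chosen only to show the arithmetic; the chamber factor of 10⁴ is the one given in the formula for the central 1 mm square.

```python
def viability_percent(live, dead):
    """Percentage of live (unstained) cells among all counted cells."""
    total = live + dead
    if total == 0:
        raise ValueError("no cells were counted")
    return 100.0 * live / total


def live_cells_per_ml(live, dilution_factor=2, chamber_factor=10_000):
    """Live cells per ml from the count in the central 1 mm square.

    dilution_factor is 2 because the suspension was mixed with an equal
    volume of trypan blue; chamber_factor is the 10^4 term in the formula.
    """
    return live * dilution_factor * chamber_factor


if __name__ == "__main__":
    # Hypothetical count: 90 unstained and 10 blue-stained cells.
    print(viability_percent(90, 10))   # percentage of live cells
    print(live_cells_per_ml(90))       # live cells per ml
```

With these hypothetical counts the culture would sit exactly at 90% viability, with 1.8 x 10⁶ live cells per ml.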

ESTABLISHMENT OF EBV TRANSFORMED LYMPHOBLASTOID CELL LINES:

In order to preserve and obtain an inexhaustible supply of an individual’s DNA, human lymphoblastoid cell lines were established. Approximately 4-5 x 10⁶ PBMCs were transferred to a 25 cm² culture flask containing 3 ml of transformation medium (RPMI-1640, 10% FCS, 1X GPPS, 0.05 mM beta- g/ml cyclosporin A) and 1 ml of the EBV supernatant prepared earlier. The flask was incubated at 37 °C in a humidified atmosphere of 93% air and 7% CO₂, with the cap of the flask kept slightly loose (Walls and Crawford, 1987). The culture was examined periodically under an inverted microscope. After 5-6 days, when colonies had formed and the culture medium had become acidic, the culture was fed with feeding medium (RPMI-1640, 10-15% FCS and 1X GPPS). When the transformed cell density in a culture flask had suitably increased, half of the culture was transferred into a 75 cm² culture flask and expanded for cryogenic preservation and DNA preparation.


CRYOPRESERVATION OF CELL LINES:

For cryogenic preservation, cell viability was checked by the trypan blue exclusion test as described earlier. Only cultures with cell viability > 90% were frozen. The volume of cell suspension containing 5 x 10⁶ cells was centrifuged at 1000 rpm for 10 minutes. The supernatant was decanted and the cell pellet was resuspended in 1 ml of freezing mix (45% RPMI-1640, 45% FCS and 10% dimethylsulphoxide [DMSO; BDH, Poole, UK]) and transferred to a 1.2 ml cryogenic vial. The vial was kept in a polystyrene box at -70 °C overnight so that the temperature decreased gradually. The following day the vial was transferred to the vapour phase of the liquid nitrogen cryo-storage system (Jencons, Leighton Buzzard, UK) for long-term storage.

EXTRACTION OF CELLULAR DNA:

For the isolation of total genomic DNA a modified organic method was used

(Maniatis et al., 1982). Approximately 5 x 10⁷ lymphoblastoid cells established from each individual were pelleted in a sterile 50 ml polypropylene centrifuge tube. To the cell pellet, 19 ml of STE buffer (100 mM sodium chloride, 50 mM Tris and 1 mM EDTA; pH 8.0) was added. Next, 1 ml of 10% sodium dodecyl sulphate (SDS) was added dropwise with gentle vortexing, followed by 20 µl of proteinase K (20 mg/ml).

The samples were incubated overnight in a shaking water bath at 55 °C and extracted the following day with an equal volume of Tris base equilibrated phenol (pH 8.0). The samples were mixed for 10 minutes, placed on ice for 10 minutes and then centrifuged in an MSE 3000i centrifuge (Mistral, UK) at 4 °C for 40 minutes at 3200 rpm. The aqueous layer containing the nucleic acid was collected in a fresh, labeled 50 ml centrifuge tube. The next extraction was done by adding an equal volume of chilled 24:1 (v/v) chloroform:isoamyl alcohol. The samples were mixed and the aqueous layer was collected in a fresh 50 ml tube. For precipitation of nucleic acids, 1/10 volume of 10 M ammonium acetate and an equal volume of chilled isopropanol were added and mixed until white precipitates formed. These samples were stored overnight at -20 °C or at -70 °C for 15-20 minutes. Samples were then centrifuged at 3200 rpm for 90 minutes to pellet the nucleic acid, and the pellet was washed with 5 ml of chilled 70% ethanol. The pellets were vacuum dried for 10 minutes. To the pellets, 1 ml of Tris-EDTA (TE; 10 mM Tris, 1 mM EDTA; pH 8) was added and the samples were incubated at 37 °C for 1 hour to resuspend the pellets. To digest the RNA, 10 µl of RNase A (10 mg/ml) was added to the samples and they were incubated at 37 °C for 2 hours in a shaking water bath. The RNase was subsequently removed by adding 50 µl of 10% SDS and 5 µl of proteinase K and incubating at 55 °C for 1 hour in a shaking water bath. At this point the samples could be stored at 4 °C until further extraction. Subsequent extraction was carried out by adding 6 ml of TE to each sample before extracting successively with equal volumes of phenol and chloroform:isoamyl alcohol. For precipitation of DNA, 1/10 volume of 10 M ammonium acetate and an equal volume of chilled isopropanol were added. The samples were mixed until the DNA was seen and stored at -20 °C overnight or at -70 °C for 15-20 minutes.

DNA was pelleted and washed with 5 ml of chilled 70% ethanol. The pellet was vacuum dried for 10 minutes and the DNA was resuspended in 1 ml of 10 mM Tris-HCl (pH 8).

The optical density (OD) of the samples was measured at 260 nm and 280 nm (ideally, 260/280 ratio = 1.8) using a Hitachi U3210 spectrophotometer (Hitachi, Tokyo, Japan). The quantity of DNA was calculated by the following formula:

DNA concentration (µg/ml) = Abs 260 x dilution factor x correction factor.

A dilution factor of 50 was usually employed and the correction factor for double-stranded DNA is 50. If the OD260/OD280 ratio was 1.7-2.0, the DNA was considered pure and free of contaminating phenol or proteins and suitable for further analysis. Each sample was kept in an appropriately labeled microcentrifuge tube and stored at 4 °C until use.
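
The quantification and purity check described above can be sketched as follows in Python. The function names and the example absorbance readings are hypothetical; the default dilution factor, correction factor and acceptance range are those stated in the text.

```python
def dna_concentration_ug_per_ml(abs260, dilution_factor=50, correction_factor=50):
    """DNA concentration (ug/ml) = Abs260 x dilution factor x correction factor."""
    return abs260 * dilution_factor * correction_factor


def is_pure(abs260, abs280, low=1.7, high=2.0):
    """True if the OD260/OD280 ratio lies within the accepted range."""
    if abs280 <= 0:
        raise ValueError("Abs280 must be positive")
    ratio = abs260 / abs280
    return low <= ratio <= high


if __name__ == "__main__":
    # Hypothetical readings for a 1:50 diluted sample.
    print(dna_concentration_ug_per_ml(0.5))   # concentration in ug/ml
    print(is_pure(0.45, 0.25))                # ratio of about 1.8
```

A sample with a ratio outside 1.7-2.0 would, under this sketch, simply be flagged as not pure; what was done with such samples is not stated in the text.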

Some DNA samples were also prepared directly from the blood sample. The procedure for the extraction of DNA from blood was the same as above, with some minor modifications. Initially the blood was mixed with the cell lysis buffer (0.15 M ammonium chloride, 0.01 M potassium bicarbonate and 0.1 mM of 0.5 M EDTA; pH 8.0) and kept on ice for 30 minutes. The samples were centrifuged for 10 minutes at 1200 rpm. The pellets were washed again with 10 ml of lysis buffer and centrifuged for 10 minutes at 1200 rpm. To this pellet, 4.75 ml of STE buffer was added along with 250 µl of 10% SDS (dropwise with gentle vortexing), followed by 10 µl of proteinase K. The tube was incubated overnight in a rotary water bath at 55 °C.

The next day, the samples were extracted using phenol and chloroform:isoamyl alcohol as described earlier. After this first extraction, 10 µl of RNase A (10 mg/ml) was added and the samples were incubated at 37 °C for 2 hours. The samples were then treated again with 250 µl of 10% SDS and proteinase K and incubated at 55 °C for 1 hour. Subsequent extraction and precipitation were the same as described for lymphoblastoid cell lines.

PHENOL EQUILIBRATION:

Analytical grade phenol (BDH) was redistilled at 160 °C to remove contaminants that cause breakdown or cross-linking of nucleic acids. Aliquots of 200-500 ml of distilled phenol were stored at -20 °C. Before use, the phenol was melted at 55-70 °C and 8-hydroxyquinoline was added as an antioxidant and RNase inhibitor at a final concentration of 0.1% (w/v). The melted phenol was extracted once with an equal volume of 1.0 M Tris buffer (pH 8.0) and 3 to 4 times with 0.1 M Tris (pH 8.0).

This equilibrated phenol was stored at 4 °C in equilibration buffer (0.1 M Tris) to which 0.2% (v/v) β-mercaptoethanol was added. Under these conditions it was stable for approximately one month (Maniatis et al., 1982).


GENOTYPING OF Y MARKERS BY POLYMERASE CHAIN REACTION

(PCR):

The polymerase chain reaction was first described in 1985 (Saiki et al., 1985) and the method was extensively employed in this study to amplify the desired fragments of the Y chromosome from genomic DNA. The 93 Y markers that were genotyped in this study are shown in Table II and a brief overview of the various methods used to detect them follows:

AMPLIFICATION REFRACTORY MUTATION SYSTEM (ARMS) PCR:

The ARMS PCR technique is a simple method for the detection of single base mutations. In this allele-specific PCR, the genomic DNA is amplified only when a specific allele is present. Two reactions are run in parallel using three primers, one of which is common to both reactions. One reaction contains the common primer and a primer that is specific for the normal sequence. The other contains the common primer and a primer that is specific for the mutant sequence.

The principle is that the extension of a primer by DNA polymerase is dependent upon correct base pairing at its 3' end.
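
The interpretation of the two parallel ARMS reactions can be illustrated with a minimal sketch in Python. The function name and the returned labels are illustrative. The handling of the two uninformative outcomes is an assumption of this sketch (based on a Y-linked marker being present in a single copy per male) and is not a rule stated in the text.

```python
def arms_call(normal_band, mutant_band):
    """Interpret one sample from its two ARMS reactions.

    normal_band: True if the reaction with the normal-specific primer
                 gave a product.
    mutant_band: True if the reaction with the mutant-specific primer
                 gave a product.
    """
    if normal_band and not mutant_band:
        return "normal allele"
    if mutant_band and not normal_band:
        return "mutant allele"
    if normal_band and mutant_band:
        # Assumption: a single-copy Y marker should not amplify in both
        # reactions, so this is treated as a result to be repeated.
        return "ambiguous - repeat"
    # Assumption: no product in either reaction indicates a failed PCR.
    return "failed - repeat"


if __name__ == "__main__":
    for bands in [(True, False), (False, True), (True, True), (False, False)]:
        print(bands, "->", arms_call(*bands))
```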

AMPLIFICATION FRAGMENT LENGTH POLYMORPHISM (AFLP) PCR:

The AFLP PCR is based on the principle that a base change results in the creation or abolition of a restriction site. PCR primers are designed from sequences flanking the restriction site to produce a 100-500 base pair product. The amplified product is subsequently digested with the appropriate restriction enzyme and the fragments are analyzed by agarose gel electrophoresis. The SNPs typed by the AFLP method are listed in Table III.
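
The effect of a base change on the digestion pattern can be illustrated with a small sketch in Python. The amplicon sequences below are invented for illustration and do not correspond to any marker in this study; the HindIII recognition sequence (AAGCTT, cut after the first A) is used as the example enzyme, and the function name digest is hypothetical.

```python
def digest(seq, site, cut_offset):
    """Return fragment lengths after cutting a linear sequence.

    seq:        amplicon sequence (5' to 3')
    site:       recognition sequence of the restriction enzyme
    cut_offset: number of bases into the site at which the strand is cut
    """
    seq = seq.upper()
    cuts = []
    start = 0
    while True:
        i = seq.find(site, start)
        if i == -1:
            break
        cuts.append(i + cut_offset)
        start = i + 1
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical 106 bp amplicons differing by one base inside the site.
WITH_SITE = "ACGT" * 10 + "AAGCTT" + "TGCA" * 15
WITHOUT_SITE = "ACGT" * 10 + "AAGCTC" + "TGCA" * 15

if __name__ == "__main__":
    print(digest(WITH_SITE, "AAGCTT", 1))     # two fragments
    print(digest(WITHOUT_SITE, "AAGCTT", 1))  # uncut product
```

On a gel, the allele that retains the site would appear as two smaller bands and the allele that has lost it as a single band of the full amplicon size.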


Table II: A list of Y haplogroups, markers, types of polymorphism and genotyping methods used in this study. Y haplogroups were determined in a hierarchical manner, screening initially with markers that identified deep lineages (bold) and subsequently genotyping markers that further delineated the tree in the target population. The typing methods were amplified fragment length polymorphism

(AFLP), denaturing high performance liquid chromatography (DHPLC), amplification refractory mutation system polymerase chain reaction

(ARMS-PCR) or dideoxy DNA sequencing (Seq).

Haplogroup  Marker    Polymorphism  Genotyping method
A           M91       del T         DHPLC
A1          M31       G→C           DHPLC
A2          M6        T→C           DHPLC
A2          PK1       C→A           AFLP
A3a         M32       T→C           DHPLC
B           M60       ins T         DHPLC
B2a         M150      C→T           DHPLC
B2a1        M109      C→T           DHPLC
B2a1        M152      C→T           DHPLC
B2a1        M218      C→T           DHPLC
C           RPS4Y     C→T           AFLP
C1          M8        G→T           Seq
C2          M38       T→G           Seq
C3          M217      A→C           Seq
C3          PK2       T→C           ARMS
C3C         M48       A→G           ARMS
DE          YAP       Alu ins       PCR
E           SRY-8299  G→A           AFLP
E3a         sY81      A→G           AFLP
E3b1        M35       G→C           ARMS
E3b1a       M78       C→T           ARMS
E3b1a1      M148      A→G           DHPLC
E3b1c       M123      G→A           ARMS
E3b1c2      M136      C→T           DHPLC
F           M89       C→T           ARMS
G           M201      G→T           ARMS
G2a         P15       C→T           DHPLC
H           M69       T→C           DHPLC
H1          M52       A→C           ARMS
H1          M82       del AT        DHPLC
H1a         M36       T→G           DHPLC
H1b         M97       T→G           DHPLC
H2          Apt       G→A           AFLP
I           M170      A→C           ARMS
J           12f2      del           PCR
J1          M267      T→G           ARMS
J1a         M62       T→C           ARMS
J2          M172      T→G           ARMS
J2a1b       M67       A→T           ARMS
J2a1b1      M92       T→C           ARMS
J2b         M12       G→T           ARMS
K           M9        C→G           AFLP
K1          M147      ins T         Seq
K4          M177      C→T           Seq
L           M20       A→G           AFLP
L           M11       A→G           AFLP
L           M185      C→T           DHPLC
L1          M27       C→G           ARMS
L1          M76       T→G           DHPLC
L2          M317      del GA        DHPLC
L2          M349      G→T           DHPLC
L3          M357      C→A           DHPLC
L3a         PK3       T→C           ARMS
NO          M214      A→G           ARMS
N           LLY22g    C→A           AFLP
N           M231      G→A           DHPLC
N3          TAT       T→C           AFLP
O           M175      del TTCTC     Seq
O1          M119      A→C           DHPLC
O1a         M101      C→T           DHPLC
O1b         M50       T→C           DHPLC
O1b         M103      C→T           DHPLC
O1b         M110      T→C           Seq
O2          P31       T→C           Seq
O2a1        M88       A→G           Seq
O2a1        M111      del TT        Seq
O2a1a       PK4       A→T           DHPLC
O2b         SRY+465   C→T           AFLP
O3          M122      T→C           ARMS
O3a3        L1Y       LINE1 ins     PCR
O3a5        M134      del G         DHPLC
O3a5a       M117      del ATCT      DHPLC
O3a5a       M133      del T         DHPLC
P           92R7      C→T           AFLP
P           M45       G→A           DHPLC
P           M74       G→A           DHPLC
Q           M242      C→T           ARMS
Q2          M25       G→C           DHPLC
Q2          M143      G→T           DHPLC
R           M207      A→G           ARMS
R1          M173      A→C           ARMS
R1a1        M17       del G         ARMS
R1a1        SRY-1532  A→G→A         AFLP
R1a1a       M56       A→T           ARMS
R1a1b       M157      A→C           DHPLC
R1a1c       M87       T→C           DHPLC
R1a1d       PK5       C→T           AFLP
R1b2        M73       del GT        DHPLC
R1b3F       SRY-2627  C→T           AFLP
R1c         M343      C→A           ARMS
R2          M124      C→T           ARMS
T           M70       A→C           ARMS
T           M193      ins CAAA      DHPLC


Table III: List of SNPs typed by AFLP method.

S. No.  Marker    Restriction enzyme
1       Apt       HaeIII
2       LLY22g    HindIII
3       M9        HinfI
4       M11       MspI
5       M17       AflIII
6       M20       SspI
7       PK1       Psp1406I
8       PK5       MnlI
9       RPS4Y     BslI
10      SRY+465   FnuHI
11      SRY-1532  DraIII
12      SRY-2627  BanI
13      SRY-8299  BsrBI
14      sY81      NlaIII
15      TAT       MaeII
16      92R7      HindIII


PREPARATION OF AGAROSE GEL:

To make a 2% (w/v) agarose gel, 6 g of agarose (molecular biology grade; Sigma Chem. Co) was mixed in 300 ml of TAE electrophoresis buffer (0.04 M Tris-acetate and 0.01 M EDTA / liter). The agarose was melted in a microwave oven with the cap of the bottle kept loose. When the agarose had dissolved completely, 5 µl (0.5 µg/ml) of ethidium bromide (Sigma-Aldrich, St. Louis, USA) was added and mixed thoroughly. The gel was placed in a shaking water bath at 55 °C for 20-25 minutes. A gel tray was sealed with rubber clamps and placed on a level horizontal surface. The required combs were placed at appropriate positions (0.5-1.0 mm above the base of the gel). The gel was poured into the gel tray. After the gel had solidified, the combs and clamps were removed from the gel tray. The gel was placed in an electrophoresis tank containing 1X TAE electrophoresis buffer.

Orange G loading dye (0.125% orange G, 20% Ficoll, 100 mM EDTA) was added to each sample and the samples were loaded on the gel. A 100 bp ladder (Promega) was loaded in the first well. Electrophoresis was carried out for approximately 40 minutes at 150 volts using a power pack (3000 Bio Rad laboratories). Photographs were taken under a UV transilluminator using the Syngene system (Bio imaging system, Cambridge, UK).

MULTIPLEX PCR:

Each sample was PCR amplified in a multiplex reaction consisting of 4 to 5 primer pairs, which were labeled with TET, HEX or FAM (Table IV). The multiplex PCR assay was performed in a 10 µl final volume. The reaction mixture was prepared in two steps. In the first step, the Super Taq polymerase / Taq Start™ Antibody premix was prepared: 0.13 U of Super Taq enzyme (HT Biotechnology Ltd) was incubated with 2.3 µM Taq Start™ Antibody (Clontech) in the presence of 0.874 µl/reaction of Taq Start™ Dilution buffer for 5-7 minutes at room temperature. In the second step, the PCR master mix was prepared. It consisted of the following: 1X Super Taq PCR Buffer 1 (10 mM Tris-HCl pH 9, 1.5 mM MgCl₂, 50 mM KCl, 0.01% gelatin and 0.01% Triton X-100), 0.7 mM MgCl₂, 200 µM dNTPs, primers (at the concentrations given in Table IV) and 1.225 µl/reaction of the Super Taq polymerase / Taq Start™ Antibody premix.

The above mixture was added to tubes containing 20 ng (1 µl) of genomic DNA. PCR was performed by the Touch Down protocol as described in Ayub et al. (2000). PCR was carried out using the following conditions: 1 cycle of 1 minute at 94 °C; 8 cycles of 1 minute at 94 °C, 1 minute at 60 °C and 1 minute at 72 °C (the annealing temperature was decreased by 0.5 °C in each cycle); 30 cycles of 1 minute at 94 °C, 1 minute at 56 °C and 1 minute at 72 °C; 1 cycle of 5 minutes at 72 °C.
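
The annealing temperatures implied by this touchdown profile can be written out with a short sketch in Python. It assumes that the first of the 8 touchdown cycles anneals at 60 °C and that each subsequent cycle is 0.5 °C lower, which is one reading of the conditions above; the function names are illustrative, and ramping times between steps are ignored.

```python
def touchdown_annealing(start=60.0, step=0.5, td_cycles=8,
                        final=56.0, final_cycles=30):
    """Annealing temperature (deg C) for every cycle of the profile.

    Assumes the first touchdown cycle anneals at `start` and each later
    touchdown cycle is `step` lower, followed by fixed-temperature cycles.
    """
    temps = [start - step * i for i in range(td_cycles)]
    temps += [final] * final_cycles
    return temps


def programmed_minutes(n_cycles, initial_min=1, per_cycle_min=3, final_min=5):
    """Total programmed hold time: initial denaturation, cycles of three
    1-minute steps, and the final extension (ramping excluded)."""
    return initial_min + n_cycles * per_cycle_min + final_min


if __name__ == "__main__":
    temps = touchdown_annealing()
    print(temps[:8])                       # touchdown phase
    print(len(temps), "cycles in total")
    print(programmed_minutes(len(temps)), "minutes of programmed hold time")
```

Under this reading the last touchdown cycle anneals at 56.5 °C, so the profile steps down smoothly into the 30 cycles at 56 °C.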

SAMPLE PREPARATION:

0.3 µl of amplified product was mixed with 2.7 µl of dye (0.342 µl Dextran blue, 1.5 µl formamide, 0.478 µl autoclaved deionized water and 0.38 µl TAMRA 300 or 500 internal lane size standard per reaction). Samples were denatured at 90 °C for 2 minutes and placed on ice until loading. Samples were run on an ABI 377 DNA sequencer for one and a half hours. The data were collected using ABI collection software. The fragment sizes were estimated using Gene Scan software (v2.1). The alleles were called using Genotyper software (v2.0).

4% POLYACRYLAMIDE GEL PREPARATION:

5.4 g of urea was dissolved in 5 ml of autoclaved deionized water by continuous stirring and heating. 1.5 ml of 40% acrylamide solution (19:1 acrylamide:bis-acrylamide) and 2-3 g of mixed-bed ion-exchange resin were added to the urea and mixed for 2-3 minutes. The solution was filtered through a Whatman No. 1 filter paper into a 50 ml graduated cylinder already containing 1.5 ml of 10X TBE (Trizma base [Tris(hydroxymethyl)aminomethane] 70 g, boric acid 55 g and ethylenediaminetetraacetic acid [EDTA] 9.0 g; pH 8-8.2). The volume was made up to 15 ml and the solution was filtered through a 0.2 µm Millipore filter paper using a Millipore vacuum filtration assembly. To the filtered solution, 5 µl of 10% ammonium persulphate (APS) and 10.5 µl of TEMED were added just before pouring the gel.

The rear and the front plates (12 cm) were washed with 1% Alconox detergent and rinsed first with demineralized water and then with deionized water. When the plates were dry, the rear plate was placed on the gel casting apparatus (Sequencing Gel Caster: model SGC-1) with the inside of the plate facing up. Wet 0.2 mm spacers were placed on the rear plate. The front plate was placed halfway down on top of the rear plate. The 4% acrylamide solution was drawn into a 50 ml syringe and poured slowly between the two plates. The flat edge of a 0.2 mm comb was inserted between the plates and the plates were sealed with clamps. The plate assembly was left for 30-45 minutes for the gel to polymerize. The comb and clamps were then removed. The plate assembly was washed with demineralized water and then deionized water and left for 15-20 minutes. The shark-tooth side of the comb was inserted so that the teeth of the comb just touched the gel.

The plates were fixed on the gel cassette and then on to the sequencer. The upper and lower buffer reservoirs were attached. A plate check was carried out to ensure that the gel plate was clean. The upper and lower buffer reservoirs were filled with 1X TBE buffer.

Before loading the samples, the gel was pre-electrophoresed for 10 minutes.


Table IV: YSTR primer sequences.

Multiplex  Primer name  Primer sequence                     Dye label  Final conc. (µM)
YSTR1      DYS19-L      CTA CTG AGT TTC TGT TAT AGT         TET        0.236
           DYS19-R      ATG GCA TGT AGT GAG GAC A                      0.236
           DYS388-L     GTG AGT TAG CCG TTT AGC GA          TET        0.318
           DYS388-R     CAG ATC GCA ACC ACT GCG                        0.318
           DYS390-L     TAT ATT TTA CAC ATT TTT GGG CC                 0.127
           DYS390-R     TGA CAG TAA AAT GAA CAC ATT GC      FAM        0.127
           DYS391-L-N   CTA TTC ATT CAA TCA TAC ACC CAT AT  FAM        0.384
           DYS391-R-N   ACA TAG CCA AAT ATC TCC TGG G                  0.384
           DYS392-L-N   AAA AGC CAA GAA GGA AAA CAA A                  0.155
           DYS392-R-N   CAG TCA AAG TGG AAA GTA GTC TGG     HEX        0.155
           DYS393-L     GTG GTC TTC TAC TTG TGT CAA TAC                0.18
           DYS393-R     AAC TCA AGT CCA AAA AAT GAG G       HEX        0.088
YSTR2      DYS389I-L    CCA ACT CTC ATC TGT ATT ATC TAT     TET        0.032
           DYS389I-R    TCT TAT CTC CAC CCA CCA GA                     0.032
           DYS389II-L   CCA ACT CTC ATC TGT ATT ATC TAT     TET        0.032
           DYS389II-R   TTA TCC CTG AGT AGT AGA AGA AT                 0.032
           DYS425-L     TGG AGA GAA GAA GAG AGA AAT                    0.861
           DYS425-R     AGC TCT ACA AGC CAT TGT GAT CT      FAM        0.861
           DYS426-L     GGT GAC AAG ACG AGA CTT TGT G       HEX        0.30
           DYS426-R     CTC AAA GTA TGA AAG CAT GAC CA                 0.25
YSTR3      DYS434-L     CAC TCC CTG AGT GCT GGA TT          TET        0.2
           DYS434-R     GGA GAT GAA TGA ATG GAT GGA                    0.2
           DYS437-L     GAC TAT GGG CGT GAG TGC AT          HEX        0.1
           DYS437-R     AGA CCC TGT CAT TCA CAG ATG A                  0.1
           DYS435-L     AGC ATC TCC ACA CAG CAC AC          TET        0.05
           DYS435-R     TTC TCT CTC CCC CTC CTC TC                     0.05
           DYS438-L     TGG GGA ATA GTT GAA CGG TAA         HEX        0.2
           DYS438-R     GTG GCA GAC GCC TAT AAT CC                     0.2
           DYS436-L     CCA GGA GAG CAC ACA CAA AA          FAM        0.025
           DYS436-R     GCA ATC CAA CTT CAG CCA AT                     0.025
           DYS439-L     TCC TGA ATG GTA CTT CCT AGG TTT     TET        0.2
           DYS439-R     GCC TGG CTT GGA ATT CTT TT                     0.2


AUTOMATED FLUORESCENT DNA SEQUENCING:

Automated sequencing (dideoxy terminator cycle sequencing) was carried out using an ABI 377 DNA Sequencer and the dye terminator cycle sequencing ready reaction kit (version 3.1; Applied Biosystems).

DNA was amplified by polymerase chain reaction in a 50 µl reaction volume. The reaction mixture contained 1X PCR buffer II, 1.5 mM MgCl₂, 100 µM dNTPs, 1 U Taq DNA polymerase, 1.0 µM of each primer (forward and reverse) and 40 ng of DNA template. The following PCR cycling conditions were used for the amplification: 1 cycle of 4 minutes at 95 °C; 35 cycles of 1 minute at 95 °C, 1 minute at the annealing temperature (which depends on the primer and is given in Appendix I) and 1 minute at 72 °C; 1 cycle of 10 minutes at 72 °C.

Amplified PCR products were first checked on a 2% agarose gel. The amplified product was precipitated with 50 µl of 95% ethanol. The sample was then washed with 200 µl of 70% ethanol and the pellets were resuspended in autoclaved deionised water. If required, PCR products were also purified with the QIAquick PCR product extraction kit (Qiagen) according to the manufacturer’s instructions. The sequencing reaction was carried out in a 10.0 µl total reaction volume consisting of the following: 2.0 µl sterile deionised H₂O, 4.0 µl Terminator ready reaction mix (which includes labelled dye terminators, buffer and dNTPs), 1.0 µl forward or reverse sequence-specific primer and 3.0 µl purified DNA (0.5 µg).

PCR was performed using a Thermo Hybaid multi-block system (MBS 0.2S) or a Thermo Hybaid PxE 0.2 thermal cycler for 25 cycles as follows: 10 seconds at 96 °C, 5 seconds at 50 °C and 4 minutes at 60 °C.

After amplification, the products were precipitated with 50 µl of 95% ethanol, washed with 200 µl of 70% ethanol and vacuum dried. The pellets were resuspended in 5 µl of ABI loading buffer diluted with formamide (1:5). Samples were denatured at 95 °C for 2 minutes and placed on ice until loading. Samples were run on an ABI 377 DNA sequencer for seven hours. The data were collected using ABI collection software.

PREPARATION OF SEQUENCING GEL:

To prepare the sequencing gel, 9 g of urea (6 M) was dissolved in approximately 10 ml of deionised water on a hot plate with constant stirring. After the urea had dissolved, 2.5 ml of a 19:1 acrylamide gel solution (Sequa gel) and 2.5 ml of 10X TBE were added and the volume was made up to 25 ml with sterile deionised water. The solution was filtered through a 0.2 µm Millipore membrane filter and degassed using a Millipore vacuum filtration assembly. To the filtered solution, 200 µl of 10% APS and 5 µl of TEMED were added and the solution was immediately poured into the gel plates. The remaining procedure was the same as described for the 4% polyacrylamide gel preparation.

DENATURING HIGH PERFORMANCE LIQUID CHROMATOGRAPHY

(DHPLC):

The denaturing high performance liquid chromatography (DHPLC) technique was initially developed by Oefner and Underhill (1995). This is a powerful technique in which SNPs are identified by the presence of heteroduplexes in a mixture of amplified products from a wild-type DNA (control sample) and the test sample. The DNA fragments are separated on a specialized DNA Sep column based upon the principle of ion-pair reversed-phase HPLC carried out under denaturing conditions. The Transgenomic WAVE™ DNA fragment analysis system was used for DHPLC work.

PCR was carried out in a 15 µl total reaction volume. The reagent concentrations for the PCR reaction were: 1X PCR buffer, 1.5 mM MgCl₂, 200 µM dNTPs, 1 U BioTaq DNA polymerase, 1.0 µM of each primer (forward and reverse) and 40 ng of DNA template (20 ng/µl).

PCR cycling parameters are described in Appendix I.

The quality of the amplified product was first checked on a 2% agarose gel using 5 µl of each PCR product. Equal volumes of the PCR products of a wild-type sample and each test sample were separately mixed and denatured at 95 °C for 5 minutes. They were then allowed to reanneal by decreasing the temperature at a rate of 1.5 °C/min from 95 °C to 25 °C.
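
The duration of this heteroduplex-formation step follows directly from the stated temperatures and cooling rate. A minimal sketch in Python, with an illustrative function name, is given below; it assumes a linear ramp and ignores any instrument overhead.

```python
def ramp_minutes(start_c, end_c, rate_c_per_min):
    """Time (minutes) needed to move linearly between two temperatures."""
    if rate_c_per_min <= 0:
        raise ValueError("rate must be positive")
    return abs(start_c - end_c) / rate_c_per_min


if __name__ == "__main__":
    cooling = ramp_minutes(95, 25, 1.5)   # 70 deg C at 1.5 deg C per minute
    total = 5 + cooling                   # plus the 5 minute denaturation
    print(round(cooling, 1), "min cooling;", round(total, 1), "min in total")
```

Under these assumptions the reannealing ramp takes a little under 47 minutes, so each mixed sample needs roughly 52 minutes before injection.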

Before setting up the experiment, the instrument was initially purged with 33% of buffer A (0.1 M triethylammonium acetate (TEAA) solution, pH 7.0), 33% of buffer B (0.1 M TEAA solution containing 25% acetonitrile, pH 7.0) and 34% of buffer C (75% acetonitrile solution) for 2-5 minutes. After purging, the column was equilibrated for 30 minutes with 50% buffer A and 50% buffer B at a flow rate of 0.9 ml/min. Five needle and injection port washes were carried out using buffer D (8% acetonitrile).

The DNA sequence to be screened for polymorphisms was copied to the Wave

Maker (version 4.1) software and the appropriate temperature and gradient method for that particular sequence was determined. A sample sheet specifying the tube numbers, injection volumes, sample IDs and gradient was prepared. The system was initialized and run according to the manufacturer’s instructions.

The optimal melting temperature for any DNA fragment can be determined by electronic submission of sequence to the web site

(http://insertion.stanford.edu/melt.html).


RESULTS

- 5 -

SECTION 1

PHYLOGEOGRAPHY OF PAKISTANI ETHNIC GROUPS:

The Y chromosomal biallelic markers (base substitutions, insertions and deletions) identify stable Y haplogroups and lineages. More than 600 such markers on the male-specific region of the human Y chromosome delineate >300 Y haplogroups with a worldwide distribution (Figure II). In this study, 93 of these Y chromosomal biallelic markers were examined in 1,213 unrelated male individuals representing 16 ethnic groups from Pakistan. The ethnic groups were categorized broadly into two groups (Table I). The northern group was represented by unrelated males from the Balti, Burusho, Hazara, Kalash, Kashmiri, Pathan and Punjabi ethnic groups. Punjabis constitute the majority of Pakistan’s population and most reside in the Punjab province adjoining India. The Punjabi samples analyzed were from 185 unrelated males of the Gujar, Meo and Rajput castes. The southern group comprised unrelated males from the Baloch, Brahui, Makrani-Baloch, Makrani-Negroid, Mohanna, Parsi and Sindhi populations.

Y biallelic polymorphisms were typed using a hierarchical approach. All samples were initially analyzed for four markers representing clades close to the root of the Y phylogenetic tree. These included SRY10831.1 (clade B*), RPS4Y711 (clade C*), YAP (clade E*) and M89 (clade F*). The frequencies of these B*, C*, E* and F* haplogroups in Pakistan are shown in Table V.
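The hierarchical typing strategy can be illustrated with a short Python sketch. The marker and clade names are those given in the text; the function names, the data layout and the order in which the four first-pass markers are checked (most derived first, on the assumption that SRY10831.1 lies upstream of the other three) are illustrative assumptions and not the laboratory protocol itself.

```python
# Illustrative sketch only: marker names come from the text, while the
# functions and the data layout are hypothetical.

# First-pass markers, checked from the most derived clade to the deepest
# (assumed ordering, so that the most specific clade is reported).
FIRST_PASS = [
    ("M89", "F*"),
    ("YAP", "E*"),
    ("RPS4Y711", "C*"),
    ("SRY10831.1", "B*"),
]

# Second-pass panel for clade C*, as listed in the text.
SECOND_PASS = {
    "C*": ["M8", "M38", "M217", "PK2", "M48"],
}


def first_pass_clade(derived_markers):
    """Return the first-pass clade for a set of derived marker names."""
    for marker, clade in FIRST_PASS:
        if marker in derived_markers:
            return clade
    return "unassigned"


def next_markers(clade):
    """Return the markers to type next (empty if no panel is listed here)."""
    return SECOND_PASS.get(clade, [])


# A sample derived for RPS4Y711 is placed in C* and sent for C sub-typing.
sample = {"SRY10831.1", "RPS4Y711"}
clade = first_pass_clade(sample)
print(clade, next_markers(clade))
```

Typing only the markers downstream of the clade already established keeps the number of assays per sample small compared with typing all 93 markers in every individual.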

Further subtyping of markers within each haplogroup revealed thirty-three haplogroups in different ethnic groups of Pakistan. Among the four root haplogroups (B*, C*, E* and F*), F* was the most frequent in both northern and southern populations (Figure III). As expected, the majority (85%) of Y chromosomes from Pakistan were derived for M89. The M89 derived alleles are frequently found in most world populations residing outside Africa, and represent YCC clades F through T (Figure II). Twenty-five different haplogroups of F*-M89 chromosomes were found at varying frequencies in the different ethnic groups of Pakistan (Table VI). The thirty-three haplogroups are summarized in Figure IV.

Clade A* is restricted to sub-Saharan African populations and was not observed in any individual from Pakistan. However, the B*-M60 haplogroup was observed at low frequency, in 0.9% of the Brahui and 3% of the Makrani-Negroid samples from southern Pakistan.

Haplogroup C* was the predominant haplogroup in the Hazara population (60%). It was also present in the Brahui, Mohanna, Burusho, Meo and Gujar with a frequency that ranged from 1.6 to 8.2% (Tables VI and VII). Individuals carrying the derived allele for the RPS4Y711 marker were further sub-typed for five additional markers that identify clades C1, C2 and C3. These included the markers M8 (C1*), M38 (C2*), M217, PK2 (C3*), and M48 (C3a). Of these, only PK2 was detected. The PK2 marker is one of several novel SNPs and is phylogenetically equivalent to haplogroup C3* (Mohyuddin et al., 2006). All of the Hazara (60%) and Burusho (8.2%) RPS4Y711 derived Y chromosomes also had the derived allele for the PK2 marker.

YAP derived chromosomes, which belong to clade DE*, constitute 3% of the Pakistani population and were observed mainly in the southern populations. Except for the Mohanna, this haplogroup was observed in all southern populations with frequencies between 1.5% and 10.6%. The Pathans were the only northern population in which these chromosomes were observed (2.1%). Several offshoots of the DE* clade were analyzed and all Pakistani YAP positive (YAP+) Y chromosomes belonged to haplogroup E* and carried the derived allele for SRY-8299. Further sub-typing of clade E* defined three informative haplogroups: E1b1a*, E1b1b1a* and E1b1b1c*. The highest frequency of E1b1a* (marker sY81=M2) was observed in the Makrani-Negroid (9.1%). These chromosomes were also found in the Makrani-Baloch (3.7%), Brahui (3.4%) and Baloch (1.5%). The remaining YAP+ chromosomes carried the


Table V: Frequency of haplogroups B*, C*, E* and F* in ethnic groups from Pakistan.

Population n B* C* E* F*

Northern

Balti 14 - - - 1.000

Burusho 97 - 0.082 - 0.918

Hazara 224 - 0.600 - 0.402

Kalash 44 - - - 1.000

Kashmiri 12 - - - 1.000

Pathan 96 - - 0.021 0.976

Punjabi 185 - 0.016 - 0.984

Southern

Baloch 66 - - 0.106 0.894

Brahui 117 0.009 0.017 0.034 0.940

Makrani-Baloch 27 - - 0.074 0.926

Makrani-Negroid 33 0.030 - 0.121 0.848

Mohanna 70 - 0.043 - 0.957

Parsi 90 - - 0.056 0.944

Sindhi 138 - - 0.022 0.978

Total 1213 0.002 0.124 0.022 0.852


Figure III: Distribution of haplogroups B*, C*, E* and F* in populations from northern and southern Pakistan.


Figure IV: Y haplogroups frequency distribution in ethnic groups of Pakistan.


derived allele for the E1b1b1*-M35 haplogroup. This clade comprises six main branches which have a wide distribution in Africa, Asia and Europe. Of these, the E1b1b1a*-M78 and E1b1b1c*-M123 derived chromosomes were observed in Pakistan. Interestingly, only two YAP+ populations, the Baloch from the southern group and the Pathan from the northern group, share the E1b1b1a*-M78 haplogroup, at frequencies of 6.1% and 2.1%, respectively. The majority of the southern populations carry the derived allele for the M123 marker. The frequency of the E1b1b1c*-M123 haplogroup was 5.6% in the Parsi, 3.7% in the Makrani-Baloch, 2.2% in the Sindhi and 1.5% in the Baloch.

The derived allele for M89 was observed at very high frequency in representatives from all population groups of Pakistan except for the Hazara. The following branches of this haplogroup were observed in Pakistan:

Haplogroup G*-M201, which is distributed mainly in Eurasian populations, comprises 1.1% of Pakistani Y chromosomes. The frequency of this haplogroup was highest in the Kalash from northern Pakistan. Haplogroup G* was also observed in all southern populations except for the Baloch and Makrani-Baloch tribes. A low frequency of G* was observed in the Mohanna, Burusho and Gujar Y chromosomes.

One major sub-clade of this haplogroup, G2a*, which is derived for the P15 polymorphism, accounts for a major proportion of the variation observed in this haplogroup in Pakistan. Haplogroup G2a* is widely distributed among the southern populations. Among the northern group only Kalash and Pathan Y chromosomes carry this haplogroup, at frequencies of 18.1% and 1%, respectively.

The H1*-M52 haplogroup, which is a sub-clade of H*-M69 Y chromosomes, exhibits a frequency of 4% in Pakistan. The highest frequencies were found in the Balti (7.1%), Kalash (20.4%), Punjabi (7.6%), Makrani-Negroid (6.1%) and Sindhi (5.8%) samples (Table VI and Figure V). Individuals carrying the derived allele for the H1* clade were further sub-typed for two markers that identify clades H1a1-M36 and H1a2-M97. Neither the H1a1 nor the H1a2 haplogroup was present in Pakistan.


Haplogroup I*-M170, defined by an A to C mutation on the Y chromosome, is thought to have arisen in Europe. The European Y-chromosome gene pool contains a high frequency of this haplogroup. In Pakistan, the frequency of the M170 polymorphism was <0.1%, as it was observed in only one individual, belonging to the Hazara population.

Clade J*, characterized by the 12f2a deletion, was widely distributed across Pakistan. The majority of these Y chromosomes were represented by the J2a2* (M67 derived) haplogroup, which is a major branch of the J2*-M172 haplogroup. The J2a2* haplogroup was found in all ethnic groups examined and constituted 10% of the population (Figure V). One offshoot of the J2a2* haplogroup, the J2a2a* haplogroup characterized by the derived allele for the biallelic marker M92, was observed in one southern population, the Brahui (8.5%). The other main branch of the J lineage, J1*-M267, was also observed in this population, in addition to the Baloch, Makrani-Baloch and Sindhi from southern Pakistan. The Pathan was the only northern group that carried the J1* haplogroup, albeit at very low frequency (1.0%).

A majority of non-African Y chromosomal haplogroups are derived for the M9 marker and fall in clades K*-T*. The derived allele for M9 is widespread in Pakistan and accounts for 61% of all Y chromosomes, all of which were resolved into sub-clades L*, NO*, Q*, R* and T*. Lineages K1-K4, which are a component of the Asian Y-chromosomal gene pool, were not observed in Pakistan.

Sub-clade L*, defined by the A to G M20 SNP, constitutes 11% of the Pakistani population with frequencies ranging from 1.1% to 24.2%. Of the three well-characterized branches in this haplogroup, the most dominant offshoot present in Pakistan is L1, which has the derived allele for M27. L1 occurs at an average frequency of 5.0% and is present in all southern populations, with frequencies ranging from 24.2% in the Baloch to 1.4% in the Parsi. Among the northern populations this haplogroup is observed only in the Pathan and Punjabi (Tables VI, VII and Figure V).

The L2*-M317 haplogroup, another offshoot of L*, was observed in only two southern populations, the Parsi and Makrani-Baloch, at frequencies of 13.3% and 3.7%, respectively. The remaining branch, L3*, had a more widespread distribution and the highest frequency was observed in the northern Burusho and Balti populations (Table VI and Figure VI). L3a, a branch of L3* characterized by the marker PK3, appears only in the Kalash population, at a relatively high frequency (23%).

An extremely low frequency of the NO* clade was observed in Pakistan. The 12 individuals belonging to various branches of this clade were observed in two northern (Burusho and Pathan) and two southern (Brahui and Mohanna) populations only. The N1* (LLY22g derived) Y chromosomes were present in one Brahui and one Mohanna individual. The newly discovered haplogroup O2a1a-PK4 was found only in the Pathan (4.2%), whereas the East Asian O3* M122 derived haplogroup was observed in the Brahui (<1%), Burusho (3.1%) and Pathan (1%) samples. The LY1 derived haplogroup O3a3a* was present at low frequency in the Brahui only.

Two major Y haplogroups, Q*-M242 and R*-M207, branch off clade P*, which is delineated by numerous SNPs including 92R7, M45 and M74. All P* chromosomes were resolved into the Q* and R* haplogroups. Haplogroup Q* occurs at an average frequency of 1.8% in Pakistan and is observed in four northern (Burusho, Hazara, Pathan and Punjabi) and four southern (Baloch, Brahui, Makrani-Baloch and Sindhi) populations.

Haplogroup R*, characterized by the M207 SNP, has a widespread distribution in Pakistan. It has two major branches, R1* (M173 derived) and R2 (M124 derived), which have distinct worldwide geographic distributions. R1*, which is common in Europe, West and Central Asia, occurs at an average frequency of 4.8% and is observed in all the Pakistani populations (Table VI and Figure VI). One derivative of M173, R1a1-M17, which occurs at an average frequency of 35.1% in the population, is the most common Y haplogroup in Pakistan. This haplogroup was present in all populations included in this study (Table VI and Figure VI). The highest frequency of R1a1* was observed in the Mohanna (71.4%) and the lowest in the Parsi (7.8%). Other populations with an appreciable (>50%) frequency of R1a1* included the Kashmiri (58.3%), Punjabi castes (56.7%) and Sindhi (51.4%). On the background of the R1a1* haplogroup, one of the newly discovered haplogroups, R1a1e-PK5, was observed; however, it was restricted to the Burusho population (2.1%).

Haplogroup R2 that has the M124 derived allele occurs in many Pakistani populations and has an average frequency of 5.8%. Except for the Mohanna it is observed in all southern populations. Its distribution is patchy in the north of Pakistan and it is found only in the Burusho, Kashmiri and Punjabi populations (Figure VI).

Haplogroup K2 (Y Chromosome Consortium, 2002) was recently reassigned to the new haplogroup T* (Karafet et al., 2008). This haplogroup is characterized by the derived allele for M70 and was found in only a single Pathan individual.


Table VI: Number and frequency (%) of Y chromosomes in each population belonging to haplogroups B-I.

Population n     No. of       B          C          C3         E          E1b1a      E1b1b1a    E1b1b1c    F          G          G2a        H1         I
                 haplogroups  (M60)      (RPS4Y711) (PK2)      (SRY-8299) (sY81=M2)  (M78)      (M123)     (M89)      (M201)     (P15)      (M52)      (M170)
North
Balti      14    6            0          0          0          0          0          0          0          0          0          0          1(7.1)     0
Burusho    97    15           0          0          8(8.2)     0          0          0          0          1(1.0)     1(1.0)     0          4(4.1)     0
Hazara     224   9            0          0          134(60)    0          0          0          0          13(5.8)    0          0          0          1(0.5)
Kalash     44    8            0          0          0          0          0          0          0          0          0          8(18.1)    9(20.4)    0
Kashmiri   12    5            0          0          0          0          0          0          0          0          0          0          0          0
Pathan     96    16           0          0          0          0          0          2(2.1)     0          2(2.1)     10(10.4)   1(1.0)     4(4.2)     0
Punjabi    185   14           0          3(1.6)     0          0          0          0          0          7(4.0)     1(0.54)    0          14(7.6)    0
South
Baloch     66    13           0          0          0          1(1.5)     1(1.5)     4(6.1)     1(1.5)     1(1.51)    0          0          0          0
Brahui     117   18           1(0.9)     2(2.0)     0          0          4(3.4)     0          0          0          0          9(8.0)     1(1.0)     0
Makrani-B  27    11           0          0          0          0          1(3.7)     0          1(3.7)     0          0          0          0          0
Makrani-N  33    11           1(3.0)     0          0          1(3.0)     3(9.1)     0          0          0          0          1(3.0)     2(6.1)     0
Mohanna    70    9            0          3(4.3)     0          0          0          0          0          0          1(1.4)     3(4.3)     2(2.9)     0
Parsi      90    11           0          0          0          0          0          0          5(5.6)     0          0          1(1.1)     2(2.2)     0
Sindhi     138   13           0          0          0          0          0          0          3(2.2)     2(1.5)     0          2(1.5)     8(5.8)     0
Total      1213  33           2          8          142        2          9          6          10         26         13         25         47         1
%                             (0.2)      (0.7)      (11.7)     (0.2)      (0.7)      (0.5)      (0.8)      (2.1)      (1.1)      (2.1)      (4.0)      (0.08)

Cont.


Table VI: Number and frequency (%) of Y chromosomes in each population belonging to haplogroups J-L.

Population n     J          J1         J2         J2a2       J2a2a      L          L1         L2         L3         L3a
                 (12f2a)    (M267)     (M172)     (M67)      (M92)      (M20)      (M27)      (M317)     (M357)     (PK3)
North
Balti      14    0          0          0          2(14.3)    0          0          0          0          2(14.3)    0
Burusho    97    0          0          1(1.0)     7(7.2)     0          3(3.1)     0          0          14(14.4)   0
Hazara     224   21(9.4)    0          3(1.4)     1(0.5)     0          0          0          0          0          0
Kalash     44    0          0          0          4(9.1)     0          1(2.3)     0          0          0          10(23.0)
Kashmiri   12    1(8.3)     0          0          1(8.3)     0          0          0          0          0          0
Pathan     96    0          1(1.0)     0          5(5.2)     0          0          5(5.2)     0          7(7.3)     0
Punjabi    185   1(0.54)    0          0          18(9.7)    0          2(1.1)     15(8.2)    0          4(2.2)     0
South
Baloch     66    0          2(3.0)     0          6(9.1)     0          0          16(24.2)   0          3(4.5)     0
Brahui     117   5(4.3)     6(5.1)     0          10(8.5)    10(8.5)    0          7(6.0)     0          2(1.7)     0
Makrani-B  27    0          1(3.7)     0          5(18.5)    0          1(3.7)     2(7.4)     1(3.7)     0          0
Makrani-N  33    0          0          0          6(18.1)    0          0          2(6.1)     0          1(3.0)     0
Mohanna    70    0          0          0          3(4.3)     0          1(1.4)     6(8.6)     0          0          0
Parsi      90    0          0          0          35(38.9)   0          3(3.3)     1(1.4)     12(13.3)   0          0
Sindhi     138   2(1.45)    4(3.0)     0          19(14.0)   0          0          6(4.4)     0          4(3.0)     0
Total      1213  30         14         4          122        10         11         60         13         37         10
%                (2.5)      (1.2)      (0.3)      (10.1)     (0.8)      (0.9)      (5.0)      (1.1)      (3.0)      (0.8)

Cont.


Table VI: Number and frequency (%) of Y chromosomes in each population belonging to haplogroups N-T.

Population n     N1         O2a1a      O3         O3a3a      Q          R          R1         R1a1       R1a1e      R2         T
                 (LLY22g)   (PK4)      (M122)     (L1Y)      (M242)     (M207)     (M173)     (M17)      (PK5)      (M124)     (M70)
North
Balti      14    0          0          0          0          0          2(14.3)    1(7.1)     6(43.0)    0          0          0
Burusho    97    0          0          3(3.1)     0          2(2.1)     11(11.3)   1(1.0)     25(25.8)   2(2.1)     14(14.3)   0
Hazara     224   0          0          0          0          4(2.0)     0          26(11.6)   21(9.4)    0          0          0
Kalash     44    0          0          0          0          0          3(7.0)     1(2.3)     8(18.1)    0          0          0
Kashmiri   12    0          0          0          0          0          0          2(16.6)    7(58.3)    0          1(8.3)     0
Pathan     96    0          4(4.2)     1(1.0)     0          5(5.2)     1(1.0)     4(4.2)     43(44.8)   0          0          1(1.0)
Punjabi    185   0          0          0          0          1(0.55)    2(1.1)     4(2.1)     105(56.7)  0          8(4.3)     0
South
Baloch     66    0          0          0          0          2(3.1)     0          4(6.1)     19(28.8)   0          6(9.1)     0
Brahui     117   1(0.8)     0          1(0.8)     1(1.0)     1(1.0)     0          3(2.6)     45(38.4)   0          8(7.0)     0
Makrani-B  27    0          0          0          0          1(3.7)     0          1(3.7)     9(33.3)    0          4(15)      0
Makrani-N  33    0          0          0          0          0          0          4(12.1)    10(30.3)   0          2(6.1)     0
Mohanna    70    1(1.43)    0          0          0          0          0          0          50(71.4)   0          0          0
Parsi      90    0          0          0          0          0          1(1.1)     4(4.4)     7(7.8)     0          19(21.1)   0
Sindhi     138   0          0          0          0          6(4.3)     0          3(2.2)     71(51.4)   0          8(6.0)     0
Total      1213  2          4          5          1          22         20         58         426        2          70         1
%                (0.2)      (0.3)      (0.4)      (0.1)      (1.8)      (1.6)      (4.8)      (35.1)     (0.2)      (5.8)      (0.1)


Table VII: Y lineages found in the three Punjabi castes examined in this study.

Population n     No. of       C          F          G          H1         J          J2a2       L          L1         L3         Q          R          R1         R1a1       R2
                 haplogroups  (RPS4Y711) (M89)      (M201)     (M52)      (12f2a)    (M67)      (M20)      (M27)      (M357)     (M242)     (M207)     (M173)     (M17)      (M124)
Gujar      159   13           2(1.3)     6(3.8)     1(0.6)     14(8.8)    1(0.6)     17(10.6)   2(1.3)     15(9.4)    4(2.5)     0          1(0.6)     3(1.3)     86(55)     7(4.4)
Meo        16    4            1(6.2)     0          0          0          0          1(6.3)     0          0          0          0          0          1(6.25)    13(81)     0
Rajput     10    5            0          1(10)      0          0          0          0          0          0          0          1(10)      1(10)      0          6(60)      1(10)
Total      185   14           3(1.6)     7(3.8)     1(0.5)     14(7.6)    1(0.5)     18(9.7)    2(1.1)     15(8.1)    4(2.2)     1(0.5)     2(1.1)     4(2.2)     105(57)    8(4.3)


Figure V: Distribution of major Y lineages (PK2, M52, M67 and M27) frequencies in Pakistan (frequencies are shown in table VI).


Figure VI: Distribution of major Y lineages (M357, M173, M17 and M124) frequencies in Pakistan (frequencies are shown in table VI).


PHYLOGENETIC ANALYSES

PRINCIPAL COMPONENT ANALYSIS:

Principal component analysis was carried out in order to examine population relationships. This analysis is based upon the frequencies of thirty-three Y haplogroups in Pakistani ethnic groups. The first two principal components, PC1 and PC2, account for 72% of the variation in the population (Figure VII). The analysis shows that all Pakistani populations group together, with the exception of the Hazara, who are relatively distinct from the other Pakistani ethnic groups and cluster in the lower right quadrant of the graph. Interestingly, other populations such as the Brahui and Balti, which are linguistically different from the others, and the Kalash, who are isolated, did not stand out and grouped with the other ethnic groups from Pakistan.
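As a rough illustration of how a principal component is extracted from haplogroup frequencies, the sketch below runs a power iteration on the covariance matrix of the four root-haplogroup frequencies (B*, C*, E*, F*) of four populations from Table V. It is a minimal, standard-library example under stated assumptions: the thesis analysis used thirty-three haplogroups and its software is not named here, and the helper functions and the choice of four populations are hypothetical.

```python
import math

# B*, C*, E*, F* frequencies for four populations, copied from Table V.
FREQS = {
    "Balti":  [0.0, 0.0, 0.0, 1.0],
    "Hazara": [0.0, 0.600, 0.0, 0.402],
    "Baloch": [0.0, 0.0, 0.106, 0.894],
    "Brahui": [0.009, 0.017, 0.034, 0.940],
}


def center(rows):
    """Subtract the column mean from every entry."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_component(cov, iterations=200):
    """Power iteration for the leading eigenvector and its eigenvalue."""
    p = len(cov)
    # Non-uniform start: frequency rows sum to about 1, so a uniform vector
    # lies (almost) in the null space of the covariance matrix.
    v = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, cv)) / sum(a * a for a in v)
    return v, eigenvalue


names = list(FREQS)
centered = center([FREQS[name] for name in names])
cov = covariance(centered)
pc1, eigenvalue = leading_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = {name: sum(x * w for x, w in zip(row, pc1))
          for name, row in zip(names, centered)}
print(round(explained, 3), scores)
```

In this toy subset the first component is driven by the contrast between C* and F*, which is why the Hazara receive the most extreme score; the sign of the component is arbitrary.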

PHYLOGENETIC ANALYSIS:

Analysis of Molecular Variance (AMOVA) was carried out using the Arlequin software. The populations were grouped on the basis of ethnicity, geographic origin and linguistic affiliation. This analysis indicated that the populations differed significantly when grouped by ethnicity (p value, Va vs FCT: 0.0205±0.0050). As expected, the majority of the variation was explained by variation within Pakistani populations (Table VIII).

The pairwise FST values between Pakistani ethnic groups, based on haplogroup frequencies, also corroborate this result. The matrix of FST p values, based upon 110 permutations among the Pakistani populations with a significance level of 0.05, also demonstrated that significant variation occurs among the populations (Tables IX and X).
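To make the idea of a pairwise FST from haplogroup frequencies concrete, the following sketch uses a simplified diversity-based form, FST = (HT - HS) / HT, with H = 1 - Σp². This is an assumption made for illustration: it is not the AMOVA-based estimator implemented in Arlequin, it applies no sample-size correction, and it will not reproduce the values in Table IX. The counts are tallied from Tables V and VI for the four root haplogroups, and the function names are hypothetical.

```python
def to_freqs(counts):
    """Convert haplogroup counts to frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def gene_diversity(freqs):
    """Uncorrected gene diversity, H = 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pairwise_fst(counts_a, counts_b):
    """Simplified FST = (HT - HS) / HT for two populations."""
    pa, pb = to_freqs(counts_a), to_freqs(counts_b)
    h_s = (gene_diversity(pa) + gene_diversity(pb)) / 2.0
    pooled = [(a + b) / 2.0 for a, b in zip(pa, pb)]
    h_t = gene_diversity(pooled)
    return 0.0 if h_t == 0.0 else (h_t - h_s) / h_t


# Counts of B*, C*, E*, F* chromosomes (tallied from Tables V and VI).
BALTI = [0, 0, 0, 14]
HAZARA = [0, 134, 0, 90]
BALOCH = [0, 0, 7, 59]

print(round(pairwise_fst(HAZARA, BALTI), 3))
print(round(pairwise_fst(BALOCH, BALTI), 3))
```

Even with this crude estimator and only four haplogroups, the Hazara stand further from the Balti than the Baloch do, in line with the pattern in Table IX.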


Figure VII: Principal component analysis based on Y haplogroup frequencies in Pakistani populations.

Balti: Blt, Burusho: Bsk, Hazara: Hzr, Kalash: Kal, Kashmiri: Ksr, Pathan: Pkh, Gujar: Gjr, Meo: Meo, Rajput: Rpt, Baloch: Ball, Brahui: Bru, Makrani-Baloch: Mak-B, Makrani-Negroid: Mak-N, Mohanna: Mhn, Parsi: Prs, Sindhi: Sdh.


Table VIII: Percentage of variation obtained by AMOVA at three levels of population hierarchy in ethnic groups from Pakistan.

Basis for   Number of  Percentage of variation                      Variance components  Fixation indices             p value, Va vs FCT
grouping    groups     Among    Among populations  Within           Va        Vb         FCT       FSC      FST       (1023 permutations)
                       groups   within groups      populations
None        1          -        15.22              84.78            0.0649    0.3617     -         -        0.1522    -
Ethnicity   13         14.45    0.90               84.65            0.0618    0.0038     0.0105    0.1445   0.1535    0.0205 ± 0.0050
Geographic  2          1.12     14.52              84.36            0.0048    0.0623     0.0112    0.1469   0.1564    0.4076 ± 0.0167
Linguistic  4          -8.99    19.34              89.65            -0.0363   0.0780     -0.0899   0.1774   0.1035    0.9746 ± 0.0047


Table IX: Population pair wise FSTs between Pakistani ethnic groups computed from Y haplogroup frequencies.

FST values are given below the diagonal and FST p values (based upon 110 permutations) above the diagonal, with * indicating significant pairwise differences.

Population              BAL       BRU       MAKB      MAKN      MHN       PRS       SDH       BLT       BSK       HZR       KAL       KSR       PKH       MEO       GJR       RPT
Baloch (BAL)            -         0.0000*   0.3153    0.0630    0.0000*   0.0000*   0.0000*   0.1081    0.0000*   0.0000*   0.0000*   0.0360*   0.0000*   0.0000*   0.0000*   0.0360*
Brahui (BRU)            0.0275    -         0.3063    0.1982    0.0000*   0.0000*   0.0180*   0.3243    0.0000*   0.0000*   0.0000*   0.2882    0.0090*   0.0000*   0.0000*   0.1801    0.1982
Makrani Baloch (MAKB)   0.0053    0.0016    -         0.8018    0.0000*   0.0090    0.0720    0.3423    0.0991    0.0000*   0.0000*   0.3063    0.05405*  0.0000*   0.0180*   0.0810
Makrani Negroid (MAKN)  0.0146    0.0088    -0.0146   -         0.0000*   0.0000*   0.0270*   0.5495    0.0090*   0.0000*   0.0000*   0.3063    0.0180*   0.0090    0.0000*   0.2973
Mohanna (MHN)           0.1405    0.0774    0.1280    0.1392    -         0.0000*   0.0000*   0.0180*   0.0000*   0.0000*   0.0000*   0.1711    0.0000*   0.5225    0.0000*   0.0000*
Parsi (PRS)             0.1148    0.1268    0.0539    0.0728    0.3099    -         0.0000*   0.0000*   0.0000*   0.0000*   0.0000*   0.0000*   0.0000*   0.0000*   0.0000*   0.5855
Sindhi (SDH)            0.0549    0.0172    0.0143    0.0284    0.0376    0.1647    -         0.4234    0.0000*   0.0000*   0.0000*   0.6486    0.0270*   0.0720    0.3783    0.5585
Balti (BLT)             0.0339    0.0058    0.0019    -0.0087   0.0899    0.1261    -0.0026   -         0.4324    0.0000*   0.0000*   0.5225    0.4774    0.0810    0.1891    0.0720
Burusho (BSK)           0.0458    0.0389    0.0188    0.0273    0.1585    0.0991    0.0629    -0.0000   -         0.0000*   0.0000*   0.0270*   0.0000*   0.0000*   0.0000*   0.0000*
Hazara (HZR)            0.2653    0.2603    0.2721    0.2580    0.3997    0.3058    0.3072    0.2882    0.2109    -         0.0000*   0.0000*   0.0000*   0.0000*   0.0000*   0.0090
Kalash (KAL)            0.1002    0.0797    0.0799    0.0586    0.2338    0.1374    0.1224    0.0650    0.0759    0.2818    -         0.0000*   0.0000*   0.0000*   0.0000*   0.8918
Kashmiri (KSR)          0.0535    0.0052    0.0117    0.0149    0.0224    0.1798    -0.0144   -0.0124   0.0591    0.3150    0.1299    -         0.3513    0.3243    0.4864    0.3693
Pathan (PKH)            0.0418    0.0193    0.0264    0.0272    0.0580    0.1721    0.0129    -0.0075   0.0467    0.2812    0.1024    0.0023    -         0.0000*   0.0090    0.3693
Meo (MEO)               0.1653    0.0943    0.1408    0.1459    -0.0113   0.3160    0.0470    0.1031    0.1675    0.4194    0.2485    0.0112    0.0720    -         0.0630    0.4864
Gujjar (GJR)            0.0582    0.0279    0.0329    0.0416    0.0255    0.1941    0.0002    0.0062    0.0772    0.3193    0.1354    -0.0074   0.0164    0.0415    -         -
Rajput (RPT)            0.0590    0.0115    0.0216    0.0464    0.0096    0.2071    -0.0135   -0.0106   0.0429    0.3293    0.1292    -0.0389   -0.0047   0.0216    -0.0099   -


Table X: Matrix of significant FST p values (significance level = 0.0500) based upon 110 permutations among the ethnic groups of Pakistan.

Population BAL BRU MAKB MAKN MHN PRS SDH BLT BSK HZR KAL KSR PKH MEO GJR RPT Baloch (BAL) ------+ - - + + + - + + + + + + + + - Brahui (BRU) + ------+ + + - + + + - + + + Makrani Baloch (MAKB) ------+ + - - - + + - - + + - Makrani Negroid (MAKN) ------+ + + - + + + - + + + - Mohanna (MHN) + + + + ------+ + + + + + - + - + - Parsi (PRS) + + + + + ------+ + + + + + + + + + Sindhi (SDH) + + - + + + ------+ + + - + - - - Balti (BLT) - - - - + + ------+ + - - - - - Burusho (BSK) + + - + + + + ------+ + + + + + - Hazara (HZR) + + + + + + + + + ------+ + + + + + Kalash (KAL) + + + + + + + + + + ------+ + + + + Kashmiri (KSR) + - - - - + - - + + + ------Pathan (PKH) + + - + + + + - + + + ------+ + - Meo (MEO) + + + + - + - - + + + - + ------Gujjar (GJR) + + + + + + - - + + + - + ------Rajput (RPT) + - - - - + - - - + + - - - -


MEDIAN-JOINING NETWORK:

Genetic variation among the Pakistani populations was further investigated by constructing a median-joining network (Bandelt et al., 1995). Here we present the L*-M20 lineage network (Figure VIII). The L lineage is considered to have arisen in the Indus valley region during the Indus valley civilization. The network revealed four clusters, representing four haplogroups. Samples encircled in red represent the L1-M27 haplogroup, samples carrying the L2*-M317 haplogroup are encircled in green and L3a-PK3 samples are encircled in yellow. The remaining samples carry the L3*-M357 haplogroup. The network of the L lineage reveals considerable variation among the Pakistani populations; at the same time, it shows a high degree of population-specific sub-structure. The network shows isolated Parsi-specific clusters at the upper right end containing 15 of 16 Parsis. The Kalash fall into two clusters and the Burusho form a cluster in the middle of the network. Haplotype sharing is the other striking feature of this network. Haplotypes are shared within specific populations, for example within the Burusho, Kalash and Parsi. However, four Baloch individuals shared their haplotype with Sindhi and Makrani-Baloch individuals from nearby southern populations. Similarly, one haplotype was shared between a Brahui and a Makrani-Negroid individual.


Figure VIII: Median-joining network of Lineage L individuals based on Y–STR haplotypes.


SECTION 2

COMPARISONS BETWEEN THE PAKISTANI AND GREEK POPULATIONS:

The current study also included three ethnic groups from northern Pakistan, the Burusho, Kalash and Pathan, that claim Greek ancestry. These populations were compared with extant Greek samples from Europe that were genotyped for the same Y markers. The Y-chromosomal haplogroups and their frequencies in the Greeks, Burusho, Kalash, Pathan and the rest of the Pakistani populations are shown in Figure IX.

HAPLOGROUP FREQUENCIES IN PAKISTANI AND GREEK POPULATIONS:

The combination of biallelic markers identified 13 Y-chromosomal haplogroups in the Greeks, 16 in the Pathan and 15 in the Burusho populations. Only eight Y haplogroups were found in the Kalash population. More than 75% of these samples were represented by haplogroups which are frequent in West Asia, Europe and the Mediterranean region.

A comparison of the three Pakistani ethnic groups with the Greek population shows that certain haplogroups are shared between these populations. These include clades E*, F*, I*, J*, R1* and T*. The majority of the Pakistani and Greek Y chromosomes have the derived allele for the M207 marker that encompasses branches R1* and R1a1* of the Y chromosome phylogenetic tree (Figure IX). R1a1* was the most common haplogroup found in Pakistan (35.9%) and Greece (15.6%). Compared to the Greeks, the frequency of haplogroup R1a1* was relatively higher in the Pathan (44.8%), Burusho (25.8%) and Kalash (18.2%) samples. Clade R1*, represented by the derived allele for SNP M173, was observed in 11.7% of the Greek and 5.32% of the Pakistani samples. The Greek population exhibited a higher frequency of this clade in comparison with the Burusho (1.03%), Kalash (2.27%) and Pathan (4.2%).

Haplogroup J* was the other haplogroup that was found at a high frequency in the Greek (17%) and Pakistani (14.8%) samples. The overwhelming majority of Greek J* chromosomes belonged to haplogroup J2*, which was present at a comparable frequency in Pakistan. Haplogroup J2* (including all its derivatives) was present at a frequency of 15.6% in the Greek, 8.2% in the Burusho, 9.09% in the Kalash and 5.2% in the Pathan. The majority of J2* Y chromosomes in Pakistan belonged to haplogroup J2a2*, being derived for the marker M67. The Greek samples could not be typed for this SNP due to lack of DNA. The J1* haplogroup, characterized by the derived allele for M267, was absent in the Burusho and Kalash populations and was found at low (1%) frequency in the Greek and Pathan.

Clade E* haplogroups were more frequent in the Greek (21%) than in the Pakistani samples (2.2%). The majority of haplogroup E* chromosomes belonged to clade E1b1b1* (M35 derived) and all Greek and Pakistani samples were resolved into the branches E1b1b1a* (M78 derived) and E1b1b1c* (M123 derived). Among the three Pakistani populations claiming Greek descent, the M78 derived Y chromosomes were observed only in the Pathan (2%). This branch constituted 16.9% of the Greek samples. Clade E1b1b1c* was present at a frequency of only 2.6% in the Greek and was absent in the Burusho, Kalash and Pathan populations. Its frequency in the remaining Pakistani populations was 1%.

All G*-M201 derived Greek Y chromosomes (9% of total) belonged to the G2a* haplogroup characterized by the T allele for SNP P15 (Hammer et al., 2000). This haplogroup was observed in 18.18% of Kalash and 1% of the Pathan samples and was absent in the Burusho.

Two branches that frequently characterize Y chromosomes found outside Africa are H* and I*, which distinguish eastern and western populations, respectively (Rootsi et al., 2004; Underhill et al., 2001). One Greek sample belonged to haplogroup H2*, which is characterized by the Apt G to A transition (Pandya et al., 1998). These Y chromosomes are not found in Pakistan but have been observed in neighboring India, and this is the first time they have been observed in Greece.

Figure IX: A rooted maximum-parsimony tree of Y lineages found in the Greek, Burusho, Kalash, Pathan and Pakistani populations. The lineages were defined by binary markers whose designations and population frequencies (percentages) are given below each branch. Branch lengths are arbitrary and the YCC lineage names (Karafet et al., 2008) are shown below the frequencies. Haplogroup and haplotype diversity are shown for each population.

Haplogroup I*, characterized by the derived allele for M170, is mainly restricted to Europe and was observed in 19.5% of the Greek sample. This haplogroup was not observed in the Burusho, Kalash or Pathan and its frequency in Pakistan was <0.2%.

Only a small proportion of Y chromosomes remain unresolved in clade F* and were represented by 2% of the Pathan and 1% of the Greek and Burusho samples. It is possible that in this case distinct haplogroups, as yet unknown, are being classified into the same paraphyletic haplogroup.

STATISTICAL AND PHYLOGENETIC ANALYSES

PRINCIPAL COMPONENT ANALYSES:

In order to examine population relationships, principal component analysis based upon Y haplogroup frequencies in the Greek and Pakistani ethnic groups was carried out (Figure X). The first two principal components, PC1 and PC2, account for 79% of the variation in the haplogroup frequency data and separate the populations according to their geographic locations. The plot shows the Pathan and Burusho populations clustering with the remaining Pakistani populations in the upper right quadrant of the graph. The Kalash and Greek form two separate and distinct clusters. To ensure that the Greek individuals included in this study were representative of the Greek populations studied earlier, results of comparable biallelic data (Francalacci et al., 2003) were incorporated in the principal component analysis (Figure XI). The Greek population included in this study clustered with the Greek populations studied earlier, but the distinct Kalash population cluster was not apparent.


Figure X: A plot of the first two principal coordinates based upon the analysis of Y haplogroup frequencies in Pakistani and Greek populations.


Figure XI: A plot of the first two principal coordinates based upon the analysis of Y haplogroup frequencies in Pakistani and Greek samples (1 = this study; 2 = Francalacci et al., 2003) using comparable biallelic markers.


GENETIC DISTANCES AND PHYLOGENETIC ANALYSIS:

The genetic distances between the populations were calculated using measures that are more sensitive to recent events (Table XI). The Pakistani–Greek population pairwise FST values based on the variation of STRs within haplogroups (Qamar et al., 2002) ranged from 0.131 to 0.213, with the lowest value between the Pathan and the Greeks. Pairwise ρ genetic distances (the number of steps between a haplotype in one population and the closest haplotype in the second population, averaged over all comparisons) (Bandelt et al., 1999) ranged from 4.3 to 8.1, with the lowest value again between the Pathan and the Greeks.
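The verbal definition of the ρ distance can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: haplotypes are tuples of Y-STR repeat counts, a "step" is one repeat unit at one locus, and the distance is computed in one direction only. The weighting scheme behind Table XI is not reproduced, and the two toy populations are invented for the example.

```python
def step_distance(hap_a, hap_b):
    """Number of single-repeat steps separating two Y-STR haplotypes."""
    return sum(abs(a - b) for a, b in zip(hap_a, hap_b))


def rho(pop_a, pop_b):
    """Mean distance from each haplotype in pop_a to its closest match in pop_b."""
    nearest = [min(step_distance(h, g) for g in pop_b) for h in pop_a]
    return sum(nearest) / len(nearest)


# Hypothetical two-locus haplotypes, one tuple per individual.
POP_A = [(15, 13), (15, 14)]
POP_B = [(15, 13), (16, 16)]

print(rho(POP_A, POP_B))  # 0.5
print(rho(POP_B, POP_A))  # 1.5
```

Because each haplotype is compared only with its closest counterpart, a shared or near-identical haplotype pulls ρ down sharply, which is why the measure is sensitive to recent admixture.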

Phylogenetic analysis using the matrix of genetic distances between populations with tree validation carried out by bootstrap re-sampling (10,000 replicates) also demonstrated that of the three Pakistani populations, the Pathans were closest to the Greek (Figure XII).

Taken together, these results suggest that there might have been a low degree of recent Pathan–Greek admixture. Examination of individual lineages by the NETWORK software using Y-STR frequencies was carried out to investigate this possibility further.


Table XI: Weighted population pair wise ρ genetic distances (below diagonal) and FST values (above diagonal) based on STR variation within haplogroups.

Greek Burusho Kalash Pathan

Greek 0.000 0.188 0.213 0.131

Burusho 5.659 0.000 0.214 0.196

Kalash 8.066 3.882 0.000 0.219

Pathan 4.277 2.451 3.254 0.000


Figure XII: Neighbor-joining tree showing the relationship between the Greek and three Pakistani ethnic groups. The tree is based on ρ genetic distances. Bootstrap values from 10,000 replicates are shown.


MEDIAN-JOINING NETWORK:

A median-joining network of clade E1b1b1a* Y chromosomes was constructed in order to examine the genetic relationship between the Greek and Pathan samples. A duplication of 10 and 13 repeat units was observed in the clade E derived Y chromosomes for the tri-nucleotide repeat DYS425, and this locus was subsequently excluded from the network. The most striking feature of this network was the sharing of haplotypes between the Pathan and Greek samples (Figure XIII). One Pathan individual shared the same Y-STR haplotype with three Greek individuals, and the other Pathan sample was separated from this cluster by a single mutation at the DYS436 locus. This demonstrates a very close relationship between the Pathan and Greek E lineages.


Figure XIII: Median-joining network of clade E* lineages in Pakistan (open circles) and Greece (hatched circles). Circles represent haplotypes and have an area proportional to frequency. The Pathan individuals are shown in black.


CONTOUR MAPPING:

The worldwide distribution and frequency of the haplotype shared between the Greek and Pathan clade E1b1b1a* individuals were checked in the Y-STR Haplotype Reference Database (YHRD; Roewer et al., 2001). Worldwide data for the set of 16 Y-STRs (DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS425, DYS426, DYS434, DYS437, DYS435, DYS438, DYS436 and DYS439) were not available in this database. However, part of this haplotype, based upon a subset of nine Y-STRs (DYS19=15; 389I=13; 389II=29; 390=24; 391=10; 392=11; 393=12; 438=9; 439=12), was found in 53 individuals in a worldwide population sample of 7,897 haplotypes. This haplotype was highly specific for the Balkans. The contour map of this haplotype (Figure XIV) shows a major concentration in the Balkans, around Macedonia and Greece, with a low scattering in other European countries and a comparable frequency in Tunisia and the Pathan. This gives a strong indication of a European, possibly Greek, origin of these Pathan Y chromosomes.
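
The database frequency quoted above follows from simple arithmetic, and the nine-locus comparison amounts to a partial-haplotype match. The Python sketch below is illustrative only: the dictionary layout and the function names are assumptions and do not represent the YHRD query interface; only the allele values and the counts are taken from the text.

```python
# Nine-locus core haplotype reported in the text (repeat counts per locus).
CORE_HAPLOTYPE = {
    "DYS19": 15, "DYS389I": 13, "DYS389II": 29, "DYS390": 24, "DYS391": 10,
    "DYS392": 11, "DYS393": 12, "DYS438": 9, "DYS439": 12,
}


def matches_core(sample, core=CORE_HAPLOTYPE):
    """Return True if the sample carries the core allele at every core locus.

    Loci typed in the sample but absent from the core are ignored, which is
    how a 16-locus profile can be compared with a 9-locus database entry.
    """
    return all(sample.get(locus) == allele for locus, allele in core.items())


def frequency(count, total):
    """Relative frequency of a haplotype in a database sample."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


if __name__ == "__main__":
    # 53 matching individuals among 7,897 haplotypes, as stated in the text.
    print(f"{100 * frequency(53, 7897):.2f}% of database haplotypes")
```

Matching on the shared subset of loci is a design choice that allows profiles typed at different numbers of Y-STRs to be compared, at the cost of lower resolution than the full 16-locus haplotype.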


Figure XIV: Contour map showing the 9 Y-STR haplotype frequency distribution in Eurasia and northern Africa. This haplotype was shared between three Greeks and a Pathan individual belonging to clade E1b1b1a*.


DISCUSSION

- 6 -

Our DNA is inherited from our ancestors, so genetic analysis can be used to provide information regarding our history. The Y chromosome is particularly useful in this respect because most of it is passed down from father to son without change, except for the gradual accumulation of mutations which appear as DNA polymorphisms. The present study provides an example of the power of a genealogical approach to Y-chromosome analysis based on a hierarchical use of specific markers in the Pakistani population.

Pakistan lies on the postulated southern coastal route “out of Africa”. The earliest evidence suggests that this region was colonized about 60,000-70,000 years ago. Pakistan was the site of several ancient cultures such as Mehrgarh, one of the world's earliest known towns, located in the southern Pakistani province of Baluchistan (Jarrige, 1991), and evidence from this site indicates that modern humans were settled in the region during the Neolithic period. Other early civilizations of the region include the Indus Valley civilization at Harappa and Mohenjo-Daro. Moreover, the Indo-Pak subcontinent has become home to a great variety of racial groups owing to repeated invasions of the region throughout the millennia. Thus, it is one of the most genetically diverse areas in the world today.

Present-day Pakistan is bordered by Iran and Afghanistan to the west, India to the east and China to the north. The Indian Ocean lies along its entire southern coastline. The Himalayan and Hindu Kush mountains form a formidable presence in the north and north west.

The diversity of the Y chromosome has been used extensively to study genetic variation in humans. Human Y chromosomes are delineated into distinct haplogroups and lineages, defined by a combination of unique event or biallelic polymorphisms and Y-STRs. Each haplogroup represents a unique chromosomal lineage that originated from a single male ancestor somewhere in the world in the past. The discovery of new paragroups, together with the previously described lineages, has made it possible to carry out detailed population genetic analyses based on haplogroup and haplotype frequencies. The spread of each haplogroup is assumed to be unaffected by both selection and male migration. However, the haplogroup frequencies in an area may be influenced by demographic factors and genetic effects such as founder events, gene flow and genetic drift.

In the current study we examined 93 biallelic markers in 1,213 male subjects from 16 ethnic groups of Pakistan and a Greek population by a variety of PCR techniques. The extensive analyses of Y diversity allowed us to investigate:

1. The genetic diversity within Pakistani ethnic groups from the male perspective.

2. A comparison of three Pakistani populations (the Burusho, the Kalash and the Pathan) with the Greek population. These Pakistani populations claim descent from the Greek soldiers who were left behind in this region by Alexander the Great.

3. Genetic differences between male individuals from Pakistan and world populations.

4. The origins of Pakistani ethnic groups.


PART 1

COMPARISON WITHIN PAKISTAN:

According to their geographic distribution, Pakistani populations were divided into two categories: a northern group, which incorporated the Punjabi populations, and a southern group. The northern populations that were screened included the Balti, Burusho, Hazara, Kalash, Kashmiri, Pathan and the Punjabi castes (Gujar, Meo and Rajput). The populations from the south of Pakistan included the Baloch, Brahui, Makrani-Baloch, Makrani-Negroid, Mohanna, Parsi and Sindhi. The combination of 93 biallelic markers identified 33 stable Y-chromosomal haplogroups in the Pakistani populations (Table VI).

Haplogroups H1*-M52, J2a2*-M67, L1-M27, R1a1*-M17 and R2-M124, which are frequent in South Asia, Europe and the Mediterranean region, together make up 60% of the Pakistani populations. It was also observed that the southern population group is genetically more diverse than the northern group. Forty-five percent (45%) of southern populations carry these 33 Y haplogroups, whereas they are found in 39% and 15% of northern and Punjabi populations, respectively. In this study, we also screened 1,213 Pakistani individuals for five novel Y-SNPs, PK1-PK5 (Mohyuddin et al., 2006). Three of these SNPs identify population-specific haplogroups within Pakistan: L3a-PK3 was found solely in the Kalash, O2a1a-PK4 was restricted to the Pathan, and R1a1e*-PK5 was confined to the Burusho.

Principal component (PC) analysis based upon Y haplogroup frequencies shows that all the ethnic groups from Pakistan cluster together except the Hazara (Figure VII). Although the Pakistani populations include geographically, culturally and linguistically isolated ethnic groups such as the Kalash, the Burusho and the Dravidian-speaking Brahui, these groups do not stand out in the overall comparison.


Haplogroup C* chromosomes and their offshoot separate the northern and southern regions within Pakistan. Haplogroup C*-RPS4Y was found in only two southern populations, the Mohanna (4.3%) and the Brahui (2%). Interestingly, the Punjabis from the northern part carry this haplogroup (1.6%) as well (Table VI). However, C3-PK2, one of the newly identified offshoots of the C*-RPS4Y haplogroup, was found only in two northern ethnic groups (Table VI). Its frequency was highest among the Hazara (60%), followed by the Burusho (8.2%). The C*-RPS4Y haplogroup is fairly common in Central Asia, and it points towards the Mongol origins of the Hazara population, which are supported historically (Bellew, 1979) and genetically (Qamar et al., 2002; Zerjal et al., 2003). However, the origin of the Burusho is not well documented. Some claim that they are the descendants of Greek soldiers, while others claim that they are descendants of Dards from Central Asia (Biddulph, 1977). The analyses of Francalacci and Rootsi show that haplogroup C* chromosomes are not present in Greece (Francalacci et al., 2003; Rootsi et al., 2004). On the other hand, an earlier study showed that populations from Tajikistan clustered with the Hunza Burusho (Wells et al., 2001). Furthermore, studies with autosomal genetic markers (Ayub et al., 2003; Mansoor et al., 2004) and Y-chromosome markers (Firasat et al., 2007) suggest that the Burusho are genetically close to their geographic neighbors. The high frequencies of haplogroup C* chromosomes in the Hazara, the Burusho and Central Asia suggest that C* chromosomes arose in Central Asia before the separation of these two Pakistani populations (Mohyuddin et al., 2006).

The major haplogroups of clade E*, E1b1a*-sY81 and the offshoots of E1b1a*-sY81 were also detected at higher frequency in the southern group of Pakistan than in the northern and Punjabi groups. Haplogroup E*-SRY-8299 has been reported to have a North African origin and is not found in the northern Pakistani ethnic groups or the Punjabi group (Qamar et al., 1999). However, this haplogroup is found at a low frequency in the southern group of Pakistan (0.2%). Haplogroup E1b1a*-sY81 (M2) is sub-Saharan in origin and is found in the Baloch, Brahui, Makrani-Baloch and Makrani-Negroid populations of the south (1.5%, 3.4%, 3.7% and 9.1%, respectively) (Table VI). The highest frequency of haplogroup E1b1a*-sY81 is found in the Makrani-Negroid population (9.1%), who are reported to have a recent African origin. This could represent the genetic legacy of the African slaves who were brought to the Indo-Pakistan subcontinent by Arab and European invaders.

The other sub-clade of haplogroup E is E1b1b1*-M35, which originated in East Africa (Semino et al., 2004). The remaining E1b1b1* Pakistani Y chromosomes were resolved into two branches, E1b1b1a*-M78 and E1b1b1c*-M123. The E1b1b1a*-M78 haplogroup was present only in the Pathan (2.1%) from the north and the Baloch (6.1%) from the south of Pakistan (Table VI). All the E1b1b1*-M35 chromosomes from southern Pakistan were further resolved into the E1b1b1c*-M123 haplogroup. Y chromosomes of the E1b1b1a*-M78 and E1b1b1c*-M123 haplogroups are also found in Iran (Regueiro et al., 2006), Turkey (Cinnioglu et al., 2004) and Greece (Firasat et al., 2007). It is also possible that the clade E haplogroups expanded with the spread of agriculture (Hammer et al., 1998; Semino et al., 2000).

The G*-M201 haplogroup is present at low frequency in Pakistani ethnic groups. Its highest frequency is observed in the Pathan (10.4%). Towards the south the frequency of G*-M201 decreases dramatically, and only 1.4% of the Mohanna carry this haplogroup (Table VI). Haplogroup G*-M201 occurs at ~30% in Georgia (Semino et al., 2000) and the north Caucasus (Nasidze et al., 2003), 10.9% in Turkey (Cinnioglu et al., 2004), 2.2% in Iraq (Al-Zahery et al., 2003) and 1.33% in Iran (Regueiro et al., 2006). This haplogroup is also found in southeast Europe and in the Mediterranean region (Semino et al., 2000). In contrast to haplogroup G*-M201, the G2a*-P15 haplogroup is found most frequently in the southern group of Pakistan. With the exception of the Baloch and the Makrani-Baloch, this haplogroup is found in all ethnic groups belonging to southern Pakistan. However, from northern Pakistan only the Kalash and the Pathan carry this haplogroup. The G2a*-P15 haplogroup occurs at 9% in Turkey (Cinnioglu et al., 2004), 5% in and Greece (DiGiacomo et al., 2003) and 7.33% in Iran, and throughout the Middle East with a maximum of 19% in the Druze (Hammer et al., 2000).

Haplogroup H1*-M52 was observed in almost all ethnic groups of Pakistan. The highest frequency of H1*-M52 Y chromosomes was found in the Kalash (20.4%), followed by the Gujar (7.6%), Balti (7.1%), Makrani-Negroid (6.1%) and Sindhi (5.8%), among others (Tables VI and VII). Many studies have shown that clade H originated within the Indo-Pak subcontinent (Gayden et al., 2007; Kivisild et al., 2003; Pandya et al., 1998; Sengupta et al., 2006). The frequency of this indigenous haplogroup was found to be higher in southern India (Ramana et al., 2001; Wells et al., 2001) than in northwest Punjab (Kivisild et al., 2003). Outside India and Pakistan, this haplogroup was found in Newar (6.1%), (11.7%) (Gayden et al., 2007) and in Turkey (0.38%) (Cinnioglu et al., 2004). The other branch of clade H*, H2*-APT, is also found at higher frequency in India, but none of the Pakistani Y chromosomes carry this haplogroup. Interestingly, Greek Y chromosomes carry the H2*-APT haplogroup at low frequency (Firasat et al., 2007).

Haplogroup J* is identified by the 12f2 human endogenous retroviral polymorphism (Sun et al., 2000; Rosser et al., 2000). Haplogroup J* Y chromosomes are widely distributed in Eurasia, the Middle East and North Africa (Hammer et al., 2001; Quintana-Murci et al., 2001). Branches of haplogroup J* were distributed across all Pakistani populations. A low frequency of J1*-M267 was detected in Pakistani populations. This haplogroup characterizes African and Arabian populations, and the frequency of J1*-M267 chromosomes decreases towards the north and east. The high frequencies of this haplogroup were found in (38%) (Luis et al., 2004); Iraq (33%) (Al-Zahery et al., 2003); (20%) (Luis et al., 2004); (13%) (Semino et al., 2000); Turkey (9%) (Cinnioglu et al., 2004); Iran (10.5%) (Regueiro et al., 2006); India (0.27%) and East Asia (0%) (Sengupta et al., 2006); and in Pakistan (1.2%). The frequencies of this haplogroup indicate the differential influence of East Africa and the Middle East in southwestern Asia. By contrast, the other branch of haplogroup J*, haplogroup J2*, is distributed mainly in West Asian and Eurasian populations. The demographic expansion of J2* chromosomes occurred during the dispersal of Neolithic farmers (King and Underhill, 2002). Haplogroup J2* and its derivatives were found at a frequency of 23% in Iran (Regueiro et al., 2006), 22.2% in Turkey (Cinnioglu et al., 2004), 9% in India (Sengupta et al., 2006) and 11.2% in Pakistan. There appears to be a decrease in the frequency of this haplogroup as one moves from the south west to the north east of Pakistan. A decrease in the frequency of J2* derivatives can be seen east of the Iranian Plateau in southern Pakistan (7.7%), with a dramatic decline in northern Pakistan (2.0%) and in the Punjabi castes (1.5%) (Table VI). Sengupta et al. (2006) show that the J2* clade is nearly absent in East Asia (1.14%). The presence of J2* and its derivative chromosomes in the Pakistani populations indicates Persian and Mediterranean gene flow and is supported by the high frequency of this haplogroup in the Parsis, a population that arrived in India from Iran (Quintana-Murci et al., 2001).

Haplogroup L* is delineated by the presence of the M20 mutation (Underhill et al., 1997). The L* haplogroup could be a recent event that arose in the Indus Valley region during the Indus Valley civilization. A high frequency of the L* haplogroup is found in the Indo-Pak subcontinent. L* chromosomes are largely restricted to south Caucasus populations (Weale et al., 2001), the Middle East (Nebel et al., 2001b), Pakistan (Qamar et al., 2002) and India (Kivisild et al., 2003; Sengupta et al., 2006). However, one of its sub-branches, L1-M27, was found at high frequency in Pakistan (5%), India (6.32%) (Sengupta et al., 2006) and Iran (2.6%) (Regueiro et al., 2006), while no L1-M27 chromosome was observed in East Asia (Sengupta et al., 2006) or in Turkey (Cinnioglu et al., 2004). Comparison among the three Pakistani groups (northern, southern and Punjabi) displays a significant difference in haplogroup distribution. Considerable diversity was noticed in populations belonging to southern Pakistan.

The most frequent haplogroup in Pakistan was haplogroup R* (48%) (Table VI). This haplogroup is widespread in Europe, the Caucasus, West Asia, Central Asia and South Asia (Sengupta et al., 2006); however, it is absent in African and New World chromosomes. The most frequently found sub-clade of haplogroup R* is R1a1*-M17. Haplogroup R1a1* chromosomes originated in Southern Russia/ in the region between the Black and Caspian Seas. These R1a1* chromosomes spread with the expansion of culture (Passarino et al., 2001; Quintana-Murci et al., 2001; Wells et al., 2001; Sengupta et al., 2006). Recent studies showed that these chromosomes cover the area ranging from India to Norway (Kivisild et al., 2003; Passarino et al., 2002; Quintana-Murci et al., 2001) but are almost absent in East Asia (Sengupta et al., 2006; Su et al., 1999).

In the Indo-Pak subcontinent, it has been postulated that the arrival of this haplogroup coincided with the arrival of Indo-European nomadic pastoral tribes from West and Central Asia (Quintana-Murci et al., 2001). However, the study by Sengupta et al. (2006) revealed a Holocene expansion of R1a1*-M17 chromosomes before the arrival of Indo-European tribes from the north-western side of India.


PART 2

COMPARISON BETWEEN PAKISTANI AND GREEK POPULATIONS:

In the present study, three Pakistani populations who claim descent from Greek soldiers, the Burusho, Kalash and Pathan, were compared with the extant Greek population. For this purpose a combination of ninety-three (93) biallelic Y-chromosome SNPs (Table II) and a set of 16 Y-STRs (Table IV) were used. This extensive analysis allowed us to compare Y diversity within the Greeks and the three Pakistani populations and to re-evaluate their suggested Greek origins.

The genetic relationship between the three Pakistani populations and the Greeks can now be judged in the light of the phylogenetic analyses and corresponding statistical results. The phylogenetic results (Figure IX) showed that clades H, I and L are the major haplogroups that separate the Pakistani populations from the Greeks.

The H* haplogroup is an Asia-specific haplogroup (Underhill et al., 2001). A sub-branch of haplogroup H*, H1*-M52, was observed in Pakistani populations but not in any of the Greek samples (Figure IX). However, the India-specific branch H2*-APT was not present in any Pakistani ethnic group, whereas a low frequency (1.3%) was observed in the Greek population (Firasat et al., 2007). This is the first time that the India-specific sub-clade H2*-APT has been observed in any western European population, and its presence in the Greeks could indicate ancient contacts.

On the other hand, haplogroup I*-M170 appears to be a Europe-specific haplogroup (Rootsi et al., 2004). Our analyses were consistent with this: 19.5% of Greeks carry I-M170 Y chromosomes (Figure IX). This haplogroup was absent in the Burusho, Kalash and Pathan, and made only a low contribution to the rest of the Pakistani ethnic groups.


Similarly, clade L* was observed only in the Pakistani populations and was absent in the Greeks (Figure IX). Like haplogroup H*, L*-M20 and R2-M124 are indigenous to the Indus Valley and south west Asia. Clade L* has been suggested to be associated with the spread of agriculture in the Indus Valley between 7000 and 2000 B.C. (Qamar et al., 2002). All L*-M20-derived Y chromosomes in the Kalash population were distinguished by the presence of the novel PK3 polymorphism, which placed them in sub-clade L3a (Figure IX). In the same way, R2-M124 was absent in the Greeks and was found at 14.4% in the Burusho and 5.74% in the rest of the Pakistani populations (Figure IX).

Clade E* Y chromosomes most probably originated in East Africa and spread to North Africa, the Middle East and Europe (Semino et al., 2004). In the Pakistani populations, haplogroup E* was present at a low frequency compared with the Greeks (2.5% and 21%, respectively). A sub-clade of haplogroup E*, E1b1b1a*-M78, also arose in Africa (Cruciani et al., 2004). E1b1b1a*-M78 is the only branch of haplogroup E* that is present at low frequency in Pakistani populations (0.41%) and at high frequency in the Greek population (17%). Among the three Pakistani populations that claim Greek ancestry, the Pathan were the only population in which clade E1b1b1a*-M78 was present, at a low frequency (2.1%) (Figure IX). Even more compelling evidence in support of the genetic relationship between the Pathan and Greek E1b1b1a*-M78 Y chromosomes was provided by the median-joining network (Figure XIII). One Pathan shared the same Y-STR haplotype, which included a duplication of 10 and 13 repeats at the DYS425 locus, with three Greek individuals, and the other was separated from this cluster by a single mutation. This enabled us to estimate the time to the most recent common ancestor (TMRCA; mean ± SD) using the Network software as between 2000 ± 400 and 5000 ± 1200 years before present (YBP), depending upon whether observed (Kayser et al., 2000) or inferred (Zhivotovsky et al., 2004) mutation rates were used. This coincides with the period of Alexander’s invasion during 327-323 B.C. In addition, this haplotype was not found in any other E1b1b1a*-derived Pakistani Y chromosome. However, this haplotype was observed in 53 individuals in the Y-STR Haplotype Reference Database (YHRD; Kayser et al., 2000) and was highly specific for the Balkans, the highest frequency being in Macedonia.

It is worth emphasizing here that the chance of picking up rare events largely amplified by drift affecting a limited portion of the population cannot be discounted, and Cruciani et al. (2006) also recommend caution when using microsatellite alleles as surrogates for unique event polymorphisms. The genetic data alone do not tell us when the Balkan chromosomes arrived in Pakistan; therefore, it is necessary to turn to the historical record. There has been no known Greek admixture within the last few generations. In addition to Alexander’s armies, however, the possibility of admixture between the local population and the Greek slaves who were brought to this region by Xerxes around one hundred and fifty years before Alexander’s arrival cannot be discounted (Firasat et al., 2007). At that time Afghanistan and present-day Pakistan were part of the Persian Empire (Wolpert, 2000). Nevertheless, Alexander’s army of 25,000–30,000 mercenary foot soldiers from Persia and West Asia and 5,000–7,000 Macedonians (Engels, 1981) perhaps provides a more likely explanation because of their elite status and substantial political impact on the region.

Several studies have shown that clade E* is present at a relatively high frequency in the Greek population (Firasat et al., 2007; Francalacci et al., 2003; Hammer et al., 2001). Our results have shown that the high frequencies of H1*-M52 and L3a-PK3 (20.45% and 22.7%, respectively) and the lack of clade E* in the Kalash gene pool make the Kalash distinct from the Greeks (Figure IX). The statistical analysis also showed that, among the three Pakistani populations, the Kalash have the highest pair-wise genetic distance values from the Greeks [ΦST (0.213) and ρ (8.066)] (Table XI). Moreover, the Kalash form a distinct cluster in the principal component analysis (Figure X). On the basis of these results, it is concluded that the true Greek contribution to the Kalash gene pool remains uncertain.


The presence of a unique, population-specific L3a-PK3 haplogroup in the Kalash sample enabled us to use the BATWING algorithm (Wilson et al., 1998) to estimate the median TMRCA for the Kalash L3a lineages as 970 YBP (200-3500 YBP). This coincides with the arrival of the Kalash from Afghanistan in the Chitral Valley in northern Pakistan during the tenth and eleventh centuries AD (Lines, 1999).

The pair-wise genetic distance values, ΦST (0.188) and ρ (5.659), reveal no Greek connection for the Burusho, a linguistically isolated population. Furthermore, principal component analysis placed the Burusho as distinct from the Greeks and closer to their neighbors in Pakistan (Figure X), suggesting that the linguistic differences arose after the common Y pattern was established. Alternatively, there may have been sufficient Y gene flow between populations to eliminate any initial differences that may have been present.

This study as a whole excludes a large Greek contribution to any Pakistani population, confirming previous observations (Mansoor et al., 2004). However, it provides evidence in support of Greek origins for a very small proportion of the Pathan, as demonstrated by the clade E* network (Figure XIII) and the low pair-wise genetic distances between these two populations (Table XI). The contribution to the Kalash is unclear and no contribution to the Burusho could be detected. This conclusion requires the assumption that extant Greeks are representative of Alexander’s armies. The failure to find a conclusive Y link with the extant Greek population could also be attributed to the fact that, besides the 5,000-7,000-strong Macedonian cavalry, Alexander’s army also consisted of 25,000-30,000 mercenary foot soldiers from Persia and West Asia (Engels, 1981), and populations from Pakistan have been shown to be closer to those from West Asia (Qamar et al., 2002; Quintana-Murci et al., 2001).


PART 3

COMPARISON WITH WORLD POPULATIONS:

In this part, Pakistani populations were compared with world populations using published haplogroup frequency data at similar molecular resolution. Table XII describes the Asian reference populations that were used in this analysis.

The Pakistani Y chromosomes contain four major haplogroups, i.e. haplogroups C*, J*, L* and R*, which together account for 85.5% of the Y chromosomes of the Pakistani population (Table VI). The most frequently observed haplogroup in Pakistan is haplogroup R*, which makes up 47.5% (including all the derivatives) of the total Pakistani population. Worldwide Y-chromosome data show that haplogroup R* is present at high frequency among populations belonging to western and southern countries. Among these populations, this haplogroup is found in a variety of language groups such as Dravidian, Indo-Iranian and Indo-European. However, haplogroup R* is rare (present at low frequency) or absent in populations of eastern countries. According to Figure XV, adapted from Gayden et al. (2007), more than 50% of Kyrgyz Y chromosomes in Central Asia belong to haplogroup R*. The frequency gradually decreases in the Karakalpak (34%) and Kazak (11%). In West Asia the highest frequency of haplogroup R* is observed in northern Iran (27.2%), followed by southern Iran (25.6%), Syria (25%), Iraq (17.3%) and Lebanon (6%). Haplogroup R* is found in southern Asian populations at a frequency of 62.1% in Newar, 59% in Punjab, 46.8% in Kathmandu and 31% in .

The second most abundant major clade is haplogroup J*, which occurs at an average frequency of 15% in Pakistan (Table VI). This haplogroup originated about 30,000 YBP in the Fertile Crescent (a region that today includes Israel, the West Bank, Jordan, Lebanon, Syria and Iraq; Semino et al., 2004). The high frequencies among populations of the Middle East, North Africa and East Africa provide evidence that haplogroup J* expanded more in a southern direction in these areas (Thomas et al., 1999). However, J2* originated in the northern part of the Fertile Crescent. The presence of this haplogroup in Europe and in India, Pakistan and in reveals that haplogroup J2* expanded in both east and west directions (Al-Zahery et al., 2003). Haplogroups J1*/J2* occur at frequencies of 40.6%/15.8% in Jordan (Flores et al., 2005), 37.2%/9.9% in Oman, 19.7%/12.2% in Egypt (Luis et al., 2004), 9.2%/24.3% in Turkey (Cinnioglu et al., 2004), 31%/26.6% in Iraq (Al-Zahery et al., 2003), 13.8%/18.9% in Iran (Nasidze et al., 2004; Underhill et al., 2000; Wells et al., 2001), 16.3%/29.8% in Lebanon (Hammer et al., 2000; Semino et al., 2004; Wells et al., 2001), 32.4%/22.5% in Syria (Cruciani et al., 2004; Di Giacomo et al., 2004; Hammer et al., 2000), 38.5%/16.8% in Palestine (Cruciani et al., 2004; Hammer et al., 2000; Nebel et al., 2001), 2.5%/0.5% in Somalia (Sanchez et al., 2005) and 1.3%/7.2% in Greece (Firasat et al., 2007).

Haplogroup C* is carried by 12.3% of Pakistani Y chromosomes. This haplogroup is found at high frequency in Australian aborigines, Polynesians, Kazaks, Mongolians, Manchurians, Tuva etc. Haplogroup C* has spread in all directions. For example, C* is found on the Indian subcontinent and in parts of Southeast Asia. The C1* haplogroup is found at low frequency in Japan, while C2* is found predominantly in New Guinea, Melanesia and Polynesia. The successful C3* haplogroup originated in southeast or central Asia. From central Asia this haplogroup expanded towards northern Asia and the Americas, and low concentrations are also found in eastern and central Europe, where it may represent evidence of the westward expansion of the Huns in the early Middle Ages. C4* is found among aboriginal Australians, and a significant occurrence of C5* is found in India.


The Hazara are an ethnic group in Pakistan who claim to be descendants of Genghis Khan. The highest frequency of the C3 haplogroup, found in Mongolia, suggests that C3 chromosomes spread widely during the time when Genghis Khan (a Mongol) conquered Asia. Haplogroup C3* is present in 60% of Hazara males and 8.2% of the Burusho (Table VI). In a study conducted by Zerjal et al. (2003), a median-joining network (Bandelt et al., 1999) linked the Hazara population to the male descendants of Genghis Khan (Figure XVI). This is due to the presence of the unique star-cluster Y-STR haplotypes in haplogroup C3 Y chromosomes. However, the star haplotype was not observed in the Burusho population, indicating separate origins of these two populations despite some sharing of haplogroup C3*.

Haplogroup L* is another main haplogroup in Pakistan. This haplogroup arose on the background of the M9 haplogroup. A segment of the M9 Eurasian clan migrated south and reached the rugged, mountainous Pamir Knot region. Haplogroup L* may have arisen about 30,000 years ago and represents the earliest significant settlement of humans in the Indian subcontinent. Haplogroup L* is therefore known as the Indian clan. Today, haplogroup L* is found primarily as sub-group L1 in India and Sri Lanka. Sub-group L3* is found mostly in Pakistan. Haplogroup L* can also be found at low frequencies in the Middle East and in Europe along the Mediterranean coast.

Haplogroup L* is mainly associated with South Asia. The analyses of Sengupta et al. (2006), Thamseem et al. (2006), Cordaux et al. (2004) and Basu et al. (2003) reveal that 7-15% of Indian males carry haplogroup L*, while 10.8% of Pakistani males carry this haplogroup (present study). As shown in Figure XVII and in the work of Wells et al. (2001), a higher frequency of haplogroup L* was present in and western Pakistan than in south Pakistan. However, a low frequency of haplogroup L* was observed in northern India and Pakistan, while haplogroup L* was absent in east India. A low frequency was found in Oman (0.8%: Luis et al., 2004), Iraq (1%: Al-Zahery et al., 2003), Lebanon (2%: Hammer et al., 2000; Semino et al., 2004; Wells et al., 2001) and Greece (1.1%: Di Giacomo et al., 2003; Semino et al., 2004).

Haplogroup B* is one of the oldest Y-chromosome haplogroup confined in

African population (Knight et al., 2003). This haplogroup appears at low frequency all around Africa, but is at its highest frequency in Pygmy populations. In current study, an interesting observation was the presence of this ancient haplogroup B* lineage in two Pakistani males i.e. one that belongs to Brahui the Dravidian speaking population and the second one that belongs to Makrani-Negroid from the southern population.

A median-joining network (Bandelt et al., 1995) of the M60-derived Y haplotypes for DYS19, 389I, 389b, 390 and 392 revealed that the Brahui sample (Y-STR haplotype 14_11_18_24_13) differed from three Sukuma individuals (Knight et al., 2003) at the DYS19 locus only (16_11_18_24_13) (Figure XVIII). However, the Makrani Negroid sample (Y-STR haplotype 15_10_18_21_11) differed from one individual belonging to the Hadzabe population at the 389b, 390 and 392 loci (15_10_17_20_13) (Table XIII). The time of separation between the populations, estimated with the software Network (Bandelt et al., 1995), was approximately 5000-10,000 years. These results exclude an ancient migration and suggest that a more recent migratory event is responsible for this separation. It is possible that these chromosomes have the same origin as the M2-derived chromosomes found in some populations of southern Pakistan, as described by Qamar et al., 2002. However, Quintana-Murci et al., 2004 described it as a genetic legacy of the slave trade that existed between the southern coast of Pakistan and East Africa.
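The locus-by-locus comparison of the B* haplotypes described above can be checked mechanically. The following is a minimal sketch, not the procedure implemented in the Network software: it parses the underscore-delimited haplotype strings used in Table XIII and lists the loci at which two haplotypes differ. The function names and the summed repeat-difference measure are illustrative assumptions introduced here.

```python
# Loci in the order used for the haplotype strings in Table XIII.
LOCI = ("DYS19", "DYS389I", "DYS389b", "DYS390", "DYS392")


def parse_haplotype(text):
    """Turn a string such as '14_11_18_24_13' into a locus -> repeat-count dict."""
    alleles = [int(a) for a in text.split("_")]
    if len(alleles) != len(LOCI):
        raise ValueError(f"expected {len(LOCI)} alleles, got {len(alleles)}")
    return dict(zip(LOCI, alleles))


def differing_loci(h1, h2):
    """Return the loci (in LOCI order) at which the two haplotypes differ."""
    a, b = parse_haplotype(h1), parse_haplotype(h2)
    return [locus for locus in LOCI if a[locus] != b[locus]]


def repeat_difference(h1, h2):
    """Sum of absolute repeat-count differences (an illustrative distance only)."""
    a, b = parse_haplotype(h1), parse_haplotype(h2)
    return sum(abs(a[locus] - b[locus]) for locus in LOCI)


# Haplotypes quoted in the text.
brahui, sukuma = "14_11_18_24_13", "16_11_18_24_13"
makrani, hadzabe = "15_10_18_21_11", "15_10_17_20_13"

print(differing_loci(brahui, sukuma))    # ['DYS19']
print(differing_loci(makrani, hadzabe))  # ['DYS389b', 'DYS390', 'DYS392']
```

The summed repeat difference is only a rough indication of how far apart two haplotypes are; it is not a substitute for the network-based time estimate reported in the text.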

Haplogroup O* is common in East and South Asia. Some 80-90% of all men in East and Southeast Asia carry this haplogroup; however, only a low frequency (0.82%) of this haplogroup was observed in Pakistan (Figure XIX).


In comparison with worldwide data, the gene pool of Pakistani ethnic groups appears much closer to that of western populations than to those of East and Southeast Asia. This is illustrated by the presence of frequently found haplogroups such as J* and R*, which also contribute to the European gene pool but are not found in China and Japan. Moreover, the low prevalence, or absence, of the East Asian haplogroups C3 and O* in Pakistan indicates that the Karakoram Mountains, which separate Pakistan and China, form a formidable barrier to gene flow from the north. The Hazara, 60% of whose Y chromosomes belong to haplogroup C3, are the only population that shows significant East Asian (Mongolian) ancestry, but historical records indicate that they did not cross this geographical boundary and arrived in the subcontinent from the west.


Table XII: Description of world populations.

Geographic Region     Abbreviation   Language Family   No. of Subjects   References
and Population

Middle East:
Northern Iran         NIR            Indo-European     33                Regueiro et al., 2006
Southern Iran         SIR            Indo-European     117               Regueiro et al., 2006
Iraq                  IRQ            Afro-Asiatic      139               Al-Zahery et al., 2003
Lebanon               LEB            Afro-Asiatic      50                Wells et al., 2001
Syria                 SYR            Afro-Asiatic      20                Semino et al., 2000

Central Asia:
Kazak                 KAZ            Altaic            54                Wells et al., 2001
Kyrgyz                KYR            Altaic            52                Wells et al., 2001
Karakalpak            KAR            Altaic            44                Wells et al., 2001
Shugnan               SHU            Indo-European     44                Wells et al., 2001
Mongolia              MON            Altaic            24                Wells et al., 2001
Tibet                 TIB            Sino-Tibetan      156               Gayden et al., 2007

South Asia:
Adi                   ADI            Sino-Tibetan      55                Cordaux et al., 2004
Gujarat               GUJ            Indo-European     29                Kivisild et al., 2003
Punjab                PUN            Indo-European     66                Kivisild et al., 2003
Pakistan              PAK            Indo-European     1213              Present study
Tamang                TAM            Sino-Tibetan      45                Gayden et al., 2007
Newar                 NEW            Sino-Tibetan      66                Gayden et al., 2007
Kathmandu             KAT            Indo-European     77                Gayden et al., 2007

Northeast Asia:
Korea                 KOR            Altaic            74                Karafet et al., 2001
Japan                 JAP            Altaic            259               Hammer et al., 2006
Tuva                  TUV            Altaic            42                Wells et al., 2001
Buryat                BUR            Altaic            81                Karafet et al., 2001
Manchu                MAN            Altaic            35                Xue et al., 2006

Southeast Asia:
Philippines           PHI            Austronesian      48                Karafet et al., 2005
Malaysia              MAL            Austronesian      32                Karafet et al., 2005
Vietnam               VIE            Austronesian      70                Karafet et al., 2005
Bali                  BAL            Austronesian      551               Karafet et al., 2005
Southern Han          SHA            Sino-Tibetan      166               Karafet et al., 2005

Figure XV: The frequencies of major haplogroups in Asian populations. The population abbreviations are listed in Table XII.

Figure XVI: Median-joining network of C* lineages. The central star-cluster profile is 10-16-25-10-11-13-14-12-11-11-11-12-8-10-10 for the loci DYS389I-DYS389b-DYS390-DYS391-DYS392-DYS393-DYS388-DYS425-DYS426-DYS434-DYS435-DYS436-DYS437-DYS438-DYS439. Circles represent lineages, area is proportional to frequency, and color indicates population of origin. Lines represent microsatellite mutational differences. Adapted from Zerjal et al., 2003.

Figure XVII: Distribution of the L* haplogroup in the Indo-Pak subcontinent. Adapted from Sengupta et al., 2006.

Table XIII: Y-STR data of clade B* lineages in Pakistani and African populations.

DYS19_389I_389b_390_392

TOTAL Biaka Brahui Hadzabe Lisongo Makrani Negroid Mbuti San Sukuma

H1 14_11_15_25_14 1 1

H2 14_11_18_24_13 1 1

H3 15_10_14_21_13 2 2

H4 15_10_15_20_13 1 1

H5 15_10_15_22_13 7 7

H6 15_10_17_20_13 1 1

H7 15_10_18_21_11 1 1

H8 15_11_16_23_13 1 1

H9 16_10_15_24_13 2 1 1

H10 16_11_13_24_13 2 2

H11 16_11_14_24_13 1 1

H12 16_11_15_25_13 1 1

H13 16_11_16_20_13 1 1

H14 16_11_16_23_13 1 1

H15 16_11_18_24_13 3 3

H16 16_7_14_24_13 1 1

H17 16_7_15_24_13 1 1

H18 16_7_16_24_13 1 1

H19 17_11_13_24_13 1 1

H20 17_11_14_24_13 1 1

H21 17_11_16_20_13 1 1

H22 17_7_16_24_13 1 1

H23 18_11_16_23_13 1 1

Figure XVIII: Median-joining network of clade B* lineages in Pakistani and African populations. Circles represent haplotypes and have an area proportional to frequency. The Pakistani individuals are shown in orange and light blue.

Figure XIX: Geographic distribution of haplogroup O3. Adapted from Shi et al., 2005.


PART 4

INSIGHTS INTO POPULATION ORIGINS:

Pakistan is geo-strategically placed and has witnessed many invasions and migrations from the west over the centuries. Present-day Pakistan is bordered by Iran and Afghanistan on the west, India to the east and China in the north. The Indian Ocean lies along its entire southern coastline.

The Y haplogroup frequencies, which were used to perform the statistical analyses, allow us to interpret the origins of the Pakistani populations.

BALTI:

The Balti reside in eastern Baltistan in northern Pakistan, and there are approximately 300,000 Balti speakers in Pakistan. Their language (Balti) is a Sino-Tibetan language and they are thought to have originated in Tibet. However, not all Balti speakers found in Pakistan are of Tibetan stock. With the passage of time many other populations that entered their territory, such as the Shins, Arabs, Persians and Turks, gradually mixed with the Balti people. Although this study analyzed only a few unrelated Balti samples, Y lineages commonly found in Tibet were not observed among them. Clade D*, which is present at high frequency in the Tibetan population, was not observed in the Balti (Table VI). These results are consistent with the earlier study carried out by Qamar et al. (2002).

HAZARA:

The Hazara population, which is ethnically related to their brethren in neighbouring Afghanistan, stands out on the basis of its Y haplogroup frequencies. Hazara individuals have typical Mongolian features and they claim to be descendants of Genghis Khan’s army. Their name is derived from the Persian word “hazar” meaning “thousand”, because troops were left behind in detachments of a thousand (Qamar et al., 2002). An earlier study of a limited number of samples (n = 33) showed them to be closer to populations in Mongolia (Qamar et al., 2002), and the star Y-STR haplotype (Figure XVI) observed in this population suggested that they were direct descendants of Genghis Khan (Zerjal et al., 2003).

The present study analyzed a much larger population sample (n = 224) from a wider geographical area in Pakistan. The earlier samples were collected from NWFP and the additional samples were from Quetta, Baluchistan. Two haplogroups predominated in this population: haplogroup R* (21%) and haplogroup C* (64%) (Table VI). Haplogroup R* is also present at high frequency in other ethnic groups of Pakistan (53.5%, when the Hazara are excluded). However, haplogroup C* is rare in other Pakistani populations. It is present at a frequency of only 1.3% when the Hazara are excluded. This haplogroup is fairly common in Central Asia and Mongolia and points towards the Mongol origins of the Hazara population (Figure XXI).
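The frequencies quoted with the Hazara excluded are pooled frequencies: haplogroup counts are summed over the remaining populations and divided by the remaining number of chromosomes. A minimal sketch of that calculation follows. The population names and counts are invented purely for illustration and are not the values in Table VI, and the function name is hypothetical.

```python
def pooled_frequency(counts, haplogroup, exclude=()):
    """Pooled frequency of `haplogroup` over all populations not in `exclude`.

    counts  : dict mapping population -> {haplogroup: number of chromosomes}
    exclude : iterable of population names to leave out of the pool
    """
    carriers = 0
    total = 0
    for population, hap_counts in counts.items():
        if population in exclude:
            continue
        carriers += hap_counts.get(haplogroup, 0)
        total += sum(hap_counts.values())
    if total == 0:
        raise ValueError("no chromosomes left after exclusion")
    return carriers / total


# Hypothetical counts, NOT data from this study.
example_counts = {
    "PopA": {"C": 60, "R": 20, "other": 20},
    "PopB": {"C": 1, "R": 50, "other": 49},
    "PopC": {"C": 2, "R": 60, "other": 38},
}

all_pops = pooled_frequency(example_counts, "C")                      # 63 / 300
without_a = pooled_frequency(example_counts, "C", exclude=("PopA",))  # 3 / 200
print(f"C overall: {all_pops:.3f}; C without PopA: {without_a:.3f}")
```

In this toy example a single population with a very high frequency of one haplogroup dominates the pooled value, which is why reporting the frequency with and without that population is informative.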

BURUSHO:

The Burusho, who speak Burushaski, are of particular genetic, linguistic and anthropological interest. Their language is one of the few remaining language isolates in the world (Dani, 1991; Grimes, 1992). Approximately 60,000 Burusho are estimated to reside in present-day Pakistan. The samples used here were collected from the valleys of the Karakoram Mountains in Hunza, Nagar and Yasin. The origin of the Burusho is not well documented. Some claim they are descendants of four generals in Alexander’s army (Dani, 1989). Others believe them to be Dardics from Central Asia, or nomads from the Pamir, who migrated to this area and displaced the original inhabitants (Biddulph, 1977).

Studies with autosomal (Ayub et al., 2003; Mansoor et al., 2004) and Y-chromosomal markers (Firasat et al., 2007) suggest that the Burusho have the same genetic makeup as their geographical neighbours in Pakistan. A preliminary study using a limited number of Y markers showed that the Hunza Burusho clustered with populations from Tajikistan (Wells et al., 2001), but no such evidence was found using a larger number of markers. The high frequencies of Central Asian haplogroup C* chromosomes in the Burusho and Hazara indicate that these arose in Central Asia before the separation of these two Pakistani populations (Mohyuddin et al., 2006). There is also no evidence of genetic relatedness with the Greeks. Haplogroup C* is absent in Greeks (Francalacci et al., 2003; Rootsi et al., 2004), and haplogroup E*, which is common in Greece, is absent in the Burusho (Figure IX). Although the two populations share haplogroup R1a1*, the branch derived from R1a1* that was observed in two Burusho individuals points towards a long separation, based on microsatellite variation.

KALASH:

The Kalash have been isolated for centuries in the Hindu Kush mountain ranges of northern Pakistan. Their language, Kalasha, belongs to the Dardic group of Indo-European languages. They number around 3000-6000 in present-day Pakistan.

Oral traditions ascribe their origins to a mythical place called “Tsiam”, which some claim refers to Syria (Decker, 1992). Various scholars have attributed their origins to the remnants of Alexander’s army (Robertson, 1896). The lack of clade E* chromosomes, which are present at a relatively high frequency in the Greek population (Francalacci et al., 2003; Hammer et al., 2001), and the presence of clades H* (20%) and L3a (23%) make the Kalash distinct from the Greeks (Firasat et al., 2007). However, the high frequency of haplogroup R* (27%) indicates that they have a predominantly European component, and their possible origin is described in Figure XXI. Studies of maternal (mitochondrial) (Schurr et al., 2000), paternal (Y chromosome SNP and STR) (Qamar et al., 2002) and autosomal STR (Mansoor et al., 2004) markers have also demonstrated their greater affinity with European populations. In the principal component analyses based on haplogroup frequencies, the Kalash are distinct from the other ethnic groups of Pakistan (Figure X). The presence of a Y haplogroup (L3a) observed only in this population suggests genetic drift. The timing of their isolation could be better studied by analyzing populations from Nuristan, Afghanistan, from where they are thought to have migrated to settle in northern Pakistan. The median-joining network for H1*-M52 (Figure XX), a lineage present at appreciable frequencies in the Burusho, Kalash and Pathan, is based on 16 Y-STRs and also shows a high degree of Kalash-specific substructure. Except for one individual, all the Kalash samples fall in one cluster. From the network it appears that H1*-M52 spread to neighboring northern populations. Taken together, these results suggest that the high frequency of population-specific SNPs and haplogroups in this group is probably due to genetic drift in a population that has been isolated for centuries in the Hindu Kush Mountains.

PATHAN:

The last of the northern populations with claims to Greek origins, the Pathans, occupy vast tracts of land in Pakistan and neighbouring Afghanistan. In Pakistan the vast majority of Pathans reside in the NWFP and Baluchistan provinces. The provincial metropolises of Peshawar (NWFP) and Quetta (Baluchistan) have large Pathan populations and are the important Pathan centers in Pakistan. According to the Population Census Organization, retrieved 7 June 2006 (http://www.newworldencyclopedia.org/entry/Pashtun_people), and the Census of Afghans in Pakistan (UNHCR Statistical Summary Report, http://www.unhcr.org/cgi-bin/texis/vtx/home/opendoc.pdf), the Pathan population constitutes 15% of the population of present-day Pakistan. Their language, Pashtu, is classified under the Indo-Iranian branch of the Indo-European languages and linguistically they are classified as an Iranian people (Nicholas and Asmatullah, 2007). Folklore claims that they are either of Jewish origin (Ahmad, 1952) or descendants of Alexander’s army (Bellew, 1998).

Figure XX: Median-joining network of H1*-M52 lineages in the Burusho, Kalash and Pathan, based on their Y-STR haplotypes.

In the present study, a small proportion of Pathan Y chromosomes belonged to haplogroup E1b1b1a*, which is present at high frequency in Greeks (Figure IX). This provides evidence of a small Greek contribution to the Pathan gene pool, which will require further investigation to ascertain its pervasiveness (Firasat et al., 2007). However, earlier studies by Quintana-Murci et al. (2004) and Mansoor et al. (2004) using mitochondrial DNA and STR markers demonstrated that the Pathans are mainly related to the Iranians and their geographic neighbors in northern Pakistan.

PARSI:

The origins of the Parsi are well documented and there are only a few thousand Parsi inhabitants in Pakistan now. These followers of the Persian Prophet Zoroaster (http://www.ozemzil.com.au/~Zarathus/Zor33.html) migrated to India after the collapse of the Sassanian Empire in the 7th century A.D. and settled in the northwest Indian province of Gujarat in 900 A.D., where they were called the “Parsi”, meaning “from Iran”. Eventually they moved to Mumbai in India and Karachi in Pakistan, from where the present population was sampled (Figure XXI). They speak an Indo-European language.

The earlier study of their Y chromosomes (Qamar et al., 2002) showed that the Parsis are genetically closer to Iranians than to their neighbors in Pakistan. In this study, 39% of the Parsis sampled belonged to haplogroup J* (Table VI). This is similar to the frequency of this haplogroup (40%) in the present-day Iranian population (Qamar et al., 2002). Surprisingly, based upon their mitochondrial DNA variation, the Parsis were genetically closer to the Gujarati population of India (Quintana-Murci et al., 2004) than to the Iranians, indicating a loss of mitochondrial DNA of Iranian origin, mainly due to admixture with the local population in India after their seventh-century migration.

BALOCH:

Balochis are affiliated with the Iranian Baloch tribes across the southwestern border with Iran and speak Balochi, an Indo-Iranian language (Grimes, 1992). Currently around 8 million Balochis live in Pakistan. Researchers are unsure of their origins. Some scholars believe that they came from the northern regions of the Elburz, a mountain range in northern Iran, whereas others claim they came from Aleppo in Syria or from Mesopotamia.

Y data analysis demonstrates that Syrian and Iranian people are characterized by a low frequency of haplogroup R* (9-26%) and a high frequency of haplogroup J* (35-57%) (Hammer et al., 2000), which is the converse of the frequency distribution of these haplogroups in the Baloch. Approximately 29% of the Baloch Y chromosomes carry haplogroup R* and only 9% carry haplogroup J* (Table VI). These results support the earlier observation (Qamar et al., 2002) that was based on a limited number of Y markers. HLA data support genetic relatedness among the Baloch tribes of Iran and Pakistan (Farjadian et al., 2004). In worldwide surveys of the HGDP-CEPH cell line panel, the Baloch are closely related to their geographic neighbours and share the same branch as populations from the Middle East and West Eurasia (Jakobsson et al., 2008; Li et al., 2008).

BRAHUI:

Brahui people are found in the central region of the Baluchistan province of Pakistan. About 1.5 million Brahuis reside in the Sarawan and Jhalawan regions of Kalat state, Baluchistan (Hughes-Buller, 1991). They speak the Brahui language, which belongs to the Dravidian language family (Grimes, 1992). Dravidians are found mostly in southern India, Sri Lanka, Pakistan, Afghanistan and Iran. Dravidians are supposedly Indian in origin (Fuller, 2003). However, according to the proto-Elamo-Dravidian hypothesis, they originated in the Iranian province of Elam and were once spread over a much larger area, including Iran, Pakistan, Afghanistan and all of India (McAlpin, 1974, 1981). According to some historical traditions, the Brahuis are descendants of western Asian people (McAlpin, 1974, 1981) such as Turko-Iranian tribes and Scythians (Hughes-Buller, 1991). Some historians also claim that they have the same origins as the Baloch (Hughes-Buller, 1991; Quddus, 1990). The Brahuis are widely suggested to be remnants of a formerly widespread Dravidian population that entered South Asia with the expansion of Dravidian-speaking farmers (Quintana-Murci et al., 2001).

To investigate their origin, a set of 117 Brahui Y chromosomes was analyzed and the results were compared with those of neighboring populations. The presence of two Y-chromosomal haplogroups, J* and L* (Table VI), reveals the movement of populations from West Asia to South Asia and from India to Pakistan, respectively.

The highest frequency of haplogroup J* is found in Iranian populations (30-60%: Quintana-Murci et al., 2001) and in the Fertile Crescent region, which includes Palestinians (51%), Lebanese (46%) and Syrians (57%) (Hammer et al., 2000). These results indicate that haplogroup J* originated in West Asia and spread from there to South Asia. The high frequency of haplogroup J* in the Brahui (26.5%) is consistent with these observations. The major movement of population from West Asia to South Asia is correlated with the expansion of the farming economy from Iran to the Indo-Pak subcontinent, which started between the 6th and 5th millennia B.C. The other major development after this was the expansion of domesticated animals by pastoral nomads. The expansion of haplogroup J* has probably been associated with the dispersal of farmers and pastoral nomads (Dravidians) in southern Asia (Cavalli-Sforza, 1988; Renfrew, 1987). However, Sengupta et al., 2006 suggest that the Dravidians originated in India. They deduced this from the presence of the indigenous haplogroup L1-M76 (M-27) in Dravidian speakers (7.5% in India). The fact that 6% of Brahui Y chromosomes carry haplogroup L1-M76 suggests that the Brahui could have migrated to Baluchistan from India. This is also supported by the mean microsatellite variance, which is higher in India (0.35) than in Pakistan (0.19) (Sengupta et al., 2006).
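Mean microsatellite variance is commonly obtained by computing the variance in repeat number at each Y-STR locus and averaging across loci; a higher value is usually read as greater accumulated diversity. The sketch below assumes the sample (n - 1) variance and uses invented two-locus haplotypes. The exact estimator and loci used by Sengupta et al. are not restated in this section, so this only illustrates the idea, and the function names are hypothetical.

```python
def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def mean_str_variance(haplotypes):
    """Average, over loci, of the variance in repeat count.

    haplotypes : list of equal-length tuples, one tuple of repeat counts
                 per chromosome, with loci in the same order throughout.
    """
    if not haplotypes:
        raise ValueError("no haplotypes supplied")
    per_locus = list(zip(*haplotypes))  # transpose: one tuple per locus
    return sum(sample_variance(locus) for locus in per_locus) / len(per_locus)


# Invented haplotypes, NOT data from this study.
more_diverse = [(14, 11), (16, 11), (15, 11)]
less_diverse = [(15, 11), (15, 11), (15, 12)]

print(mean_str_variance(more_diverse))
print(mean_str_variance(less_diverse))
```

Under these assumptions the first toy sample has the larger mean variance, which is the kind of contrast the text reports between India and Pakistan.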

MAKRANI NEGROID:

The Makrani Negroid have African physical traits, reside along the southern Makran coastal region of Pakistan and speak an Indo-European language. It has been speculated that they represent migrants from Africa (Figure XXI), but the timing of this migration is uncertain (Ansari, 1996). Although they have a low frequency of sub-Saharan African haplogroups such as E1b1a*, they also exhibit a sizeable proportion of L*, J* and R*. Haplogroup L* is mostly restricted to the Indo-Pak subcontinent and haplogroups J* and R* to Eurasia. The contribution of African Y chromosomes to this population was estimated to be approximately 12% (Qamar et al., 2002) and mitochondrial DNA data supported these results. These data, along with their history as remnants of the East African slave trade, indicated that they were probably recent settlers (Quintana-Murci et al., 2004).

Figure XXI: Possible origins of the a) Hazara, b) Kalash, c) Parsi and d) Makrani Negroid. Panel labels: a) Y lineage C from East Asia (Mongolia); b) West Eurasian Y lineages; c) Y lineages from West Asia (Iran), with Iran, Gujrat and Mumbai marked; d) Y lineages from sub-Saharan Africa. Adapted from Mehdi, S.Q. 2007.


CONCLUSIONS:

The molecular analysis of the human genome is providing a better understanding of human ancestry and diversity from both the maternal and paternal perspectives. The evolutionary antiquity of Pakistani populations and the subsequent migrations from West Asia, Europe and, to a lesser extent, East Asia have resulted in a rich tapestry of socio-cultural, linguistic and biological diversity. This study reports on the diversity of Pakistani populations on the basis of haplogroup frequencies. It provides insight into the genetic relationships of the Pakistani populations with each other as well as with other world populations. These studies will serve as a background for epidemiological work in different populations of the world. The genetic makeup of a population determines the differences in incidence and prognosis of various diseases across different populations. The study will provide major insights where a patient’s origin will be useful in determining the predisposition to various diseases. Knowledge of a population’s genetic composition will also be helpful in eliminating spurious risk factors for different diseases. Furthermore, apart from inherited diseases, the study will be of immense medical importance in understanding susceptibility and resistance to infectious diseases as well as the efficacy of drug treatment, heralding the era of genomic medicine.

REFERENCES

Ahmad AKN. (1952). Jesus in heaven on earth. The Civil and Military Gazette Ltd, Lahore, Pakistan.

Aitman TJ, Dong R, Vyse TJ, Norsworthy PJ, Johnson MD, Smith J, Mangion J, Roberton-Lowe C, Marshall AJ, Petretto E, Hodges MD, Bhangal G, Patel SG, Sheehan-Rooney K, Duda M, Cook PR, Evans DJ, Domin J, Flint J, Boyle JJ, Pusey CD and Cook HT.(2006). Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature. 439:851-855.

Al-Zahery N, Semino O, Benuzzi G, Magri C, Passarino G, Torroni A, Santachiara-Benerecetti AS. (2003). Y-chromosome and mtDNA polymorphisms in Iraq, a crossroad of the early human dispersal and of post-Neolithic migrations. Mol Phylogenet Evol. 28:458-472.

Anderson S, Bankier AT, Barrell BG, De Bruijn MHL, Coulson AR, Drouin J, Eperon IC, Nierlich DP, Roe B A, Sanger F, Schreier PH, Smith AJH, Staden R and Young IG.(1981). Sequence and organization of the human mitochondrial genome. Nature. 290: 457-465.

Ansari SSA. (1996). The Afghan or Pathans. In: The Musalman races found in Sindh, Baluchistan and Afghanistan. Indus Publications, Karachi. pp 9-16.

Ayub Q, Mansoor A, Ismail M, Khaliq S, Mohyuddin A, Hameed A, Mazhar K, Rehman S, Siddiqi S, Papaioannou M, Piazza A, Cavalli-Sforza LL and Mehdi SQ. (2003). Reconstruction of human evolutionary tree using polymorphic autosomal microsatellites. Am J Phys Anthropol.122:259-268.

Baird M, Balazs I, Giusti A, Miyazaki L, Nicholas L, Wexler K, Kanter E, Glassberg J, Allen F, Rubinstein P, and Sussman L.(1986). Allele frequency distribution of two highly polymorphic DNA sequences in three ethnic groups and its application to the determination of paternity. Am J Hum Genet. 39:489-501.

Bandelt HJ, Forster P, Sykes BC and Richards MB. (1995). Mitochondrial Portraits of Human Populations Using Median Networks. Genetics. 141:743-753.

Bandelt HJ, Forster P and Rohl A.(1999). Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol. 16: 37– 48.

Barley J, Blackwood A, Miller M, Markandu ND, Carter ND, Jeffery S, Cappuccio FP, MacGregor GA and Sagnella GA. (1996). Angiotensin converting enzyme gene I/D polymorphism, blood pressure and the renin-angiotensin system in Caucasians and Afro-Caribbean peoples. J Hum Hypertens. 10:31-35.

Basu A, Mukherjee N, Roy S, Sengupta S, Banerjee S, Chakraborty M, Dey B, Roy M, Roy B, Bhattacharyya NP, Roychoudhury S, Majumder PP. (2003). Ethnic India: a genomic view, with special reference to peopling and structure. Genome Res. 13:2277–2290.

Batzer MA, Kilroy GE and Richard PE.(1990). Structure and variability of recent inserted Alu family members. Nucleic acids Res. 18:6793-6798.

Batzer M A and Deininger PL.(1991). A human-specific subfamily of Alu sequences. Genomics 9:481-487.


Batzer MA, Gudi VA, Mena JC, Foltz DW, Herrera RJ and Deininger PL.(1991). Amplification dynamics of Human-specific (HS) Alu family members. Nucleic Acids Res.19:3619-3623.

Batzer MA, Arcot SS, Phinney JW, Alegria-Hartman M, Kass DH, Milligan SM, Kimpton C, Gill P, Hochmeister M, Ioannou PA, Herrera RJ, Boudreau DA, Scheer WD, Keats BJ, Deininger PL, Stoneking M. (1996). Genetic variation of recent Alu insertion in human populations. J Mol Evol. 42:22-29.

Batzer MA and Deininger PL. (2002). Alu repeats and human genomic diversity. Nat Rev Genet. 3:370-379.

Behar DM, Garrigan D, Kaplan ME, Mobasher Z, Rosengarten D, Karafet TM, Quintana-Murci L, Ostrer H, Skorecki K, and Hammer MF.(2004). Contrasting patterns of Y chromosome variation in Ashkenazi Jewish and host non-Jewish European populations. Hum. Genet. 114: 354–365.

Bellew HW.(1979). The races of Afghanistan. Sang-e-Meel Publications, Lahore, Pakistan.

Bellew HW. (1998). An enquiry into the ethnography of Afghanistan. Vanguard Books, Lahore, Pakistan.

Biddulph J. (1977). Tribes of the Hindoo Koosh. Karachi, Pakistan: Indus Publications.

Birnboim HC and Straus NA.(1975). DNA from Eukaryotic cells contain unusually long pyrimidine sequences. Can J Biochem. 53:640-643.

Bowcock AM, Kidd J, Mountain JL, Hebert JM, Carotenuto L, Kidd KK and Cavalli-Sforza LL. (1991). Drift, admixture, and selection in human evolution: a study with DNA polymorphisms. Proc Natl Acad Sci. USA 88:839-843.

Bowcock A M, Ruiz-Linares A, Tomfohrde J, Minch E, Kidd JR and Cavalli-Sforza LL.(1994). High resolution of human evolutionary trees with polymorphic microsatellites. Nature. 368:455-457.

Boyum A. (1968). Separation of lymphocytes and erythrocytes by centrifugation. Scand J Clin Lab Invest. 21 (Supplement 97): pp 91.

Brook JD, McCurrach ME, Harley HG, Buckler AJ, Church D, Aburatani H, Hunter K, Stanton VP, Thirion JP, Hudson T, Sohn R, Zemelman B, Snell RG, Rundle SA, Crow S, Davies J, Shelbourne P, Buxton J, Jones C, Juvonen V, Johnson K, Harper PS, Shaw DJ and Housman DE. (1992). Molecular basis of myotonic dystrophy: expansion of trinucleotide (CTG) repeat at the 3’ end of the transcript encoding a protein kinase family member. Cell. 68:799-808.

Brooks MB, Gu W, Barnas JL, Ray J and Ray KA. (2003). Line 1 insertion in the Factor IX gene segregates with mild hemophilia B in dogs. Mamm Genome. 14:788-795.

Brown P, Sutikna T, Morwood MJ, Soejono RP, Jatmiko, Saptomo EW, Due RA. (2004). A new small-bodied hominin from the Late Pleistocene of Flores, Indonesia. Nature. 431:1055-1061.

Brown WM, George M Jr and Wilson AC. (1979). Rapid evolution of animal mitochondrial DNA. Proc Natl Acad Sci. USA 76:1967-1971.

Budowle B, Moretti TR, Niezgoda SJ and Brown BL. (1998). CODIS and PCR-based short tandem repeat loci: Law enforcement tools. In: Second European Symposium on Human Identification 1998, Promega Corporation, Madison, Wisconsin pp 73-88.

Budowle B and Chakraborty R.( 2001). Population variation at the CODIS core short tandem repeat loci in Europeans. Leg Med (Tokyo) 3:29-33.

Cann RL, Stoneking M and Wilson AC.(1987). Mitochondrial DNA and human evolution. Nature. 325: 31-36.

Capelli C, Wilson JF, Richards M, Stumpf MP, Gratrix F, Oppenheimer S, Underhill P, Pascali VL Ko TM and Goldstein DB.(2001). A predominantly indigenous paternal heritage for the Austronesian-speaking peoples of insular Southeast Asia and Oceania. Am J Hum Genet. 68: 432-443.

Cappuzzo F, Toschi L, Domenichini I, Bartolini S, Ceresoli GL, Rossi E, Ludovini V, Cancellieri A, Magrini E, Bemis L, Franklin WA, Crino L, Bunn PA Jr, Hirsch FR, Varella-Garcia M.(2005). HER3 genomic gain and sensitivity to gefitinib in advanced non-small-cell lung cancer patients. Br J Cancer. 93:1334-40.

Caroe O.(1958). The Pathans. Karachi, Pakistan: Oxford University Press.

Carter NP.(2007). Methods and strategies for analyzing copy number variation using DNA microarrays. Nat. Genet. 39: Suppl: S16-S21.

Casanova M, Leroy P, Boucekkine C, Weissenbach J, Bishop C, Fellous M, Purrello M, Fiori G and Siniscalco M.(1985). A human Y-linked DNA polymorphism and its potential for estimating genetic and evolutionary distance. Science. 230:1403-1406.

Cavalli-Sforza LL.(1988). The Basque population and ancient migrations in Europe. Munibe. 6:129-137.

Cavalli-Sforza LL, Menozzi P and Piazza A. (1994). The History and Geography of Human Genes. Princeton University Press, Princeton.

Cavalli-Sforza LL.(2005). The Human Genome Diversity Project: past, present and future. Nat Rev Genet. 6:333-40.

Chakraborty R, Kimmel M, Stivers DN, Deka R and Davison LJ.(1997). Relative mutation rates at di-,tri-, and tetra- nucleotide microsatellite loci. Proc Natl Acad. Sci. USA 94:1041-1046.

Cinnioğlu C, King R, Kivisild T, Kalfoğlu E, Atasoy S, Cavalleri GL, Lillie AS, Roseman CC, Lin AA, Prince K, Oefner PJ, Shen P, Semino O, Cavalli-Sforza LL and Underhill PA.(2004). Excavating Y-chromosome haplotype strata in Anatolia. Hum Genet.114: 127–148.

Collins DW and Jukes TH.(1994). Rates of transition and transversion in coding sequences since the human- rodent divergence. Genomics. 20: 386-396.


Cooper DN and Krawczak M.(1995). An introduction to the structure, function and expression of human genes. In: Human gene mutation. Bios Scientific Publishers Limited. UK. pp 19-48.

Cooper DN, Krawczak M, Antonarakis SE. (2000). The nature and mechanisms of human gene mutation. In: The Metabolic and Molecular Bases of Inherited Disease, Vol. 1, 8th Edn (eds Scriver CR, Beaudet AL, Sly WS, Valle D). McGraw-Hill, New York.

Cordaux R, Weiss G, Saha N, Stoneking M. (2004). The northeast Indian passageway: a barrier or corridor for human migrations? Mol Biol Evol. 21:1525-1533.

Cost GJ and Boeke JD. (1998). Targeting of human retrotransposon integration is directed by the specificity of the L1 endonuclease for regions of unusual DNA structure. Biochemistry 37:18081-18093.

Cruciani F, Santolamazza P, Shen P, Macaulay V, Moral P, Olckers A, Modiano D, Holmes S, Destro-Bisol G, Coia V, Wallace DC, Oefner PJ, Torroni A, Cavalli-Sforza LL, Scozzari R and Underhill PA.(2002). A back migration from Asia to sub-Saharan Africa is supported by high-resolution analysis of human Y-chromosome haplotypes. Am J Hum Genet. 70: 1197–1214.

Cruciani F, La Fratta R, Santolamazza P, Sellitto D, Pascone R, Moral P, Watson E, Guida V, Colomb EB, Zaharova B, Lavinha J, Vona G, Aman R, Calì F, Akar N, Richards M, Torroni A, Novelletto A, Scozzari R. (2004). Phylogeographic analysis of haplogroup E3b (E-M215) Y chromosomes reveals multiple migratory events within and out of Africa. Am J Hum Genet. 74:1014–1022.

Cruciani F, La Fratta R, Torroni A, Underhill PA, Scozzari R. (2006). Molecular dissection of the Y chromosome haplogroup E-M78 (E3b1a): a posteriori evaluation of a microsatellite-network-based approach through six new biallelic markers. Hum Mutat. 27:831–832.

Csink AK and Henikoff S. (1998). Something from nothing: the evolution and utility of satellite repeats. Trends Genet. 14:200-204.

Dani AH. (1989). Early history the early inhabitants. In: History of Northern Areas of Pakistan. National Institute of Historical and Culture Research, Islamabad, Pakistan. pp 110-157.

Dani AH.(1991). “History of Northern Areas of Pakistan.” National Institute of Historical and Culture Research, Islamabad, Pakistan.

Dausset J.(1954). Leuko-agglutinins IV: Leuko agglutinins and blood transfusion. Vox Sanguinis 4: 190.

Decker KD.(1992). Sociolinguistic survey of Northern Pakistan. Vol 5, Languages of Chitral. National Institute of Pakistan Studies, Islamabad.

Deininger PL and Daniels GR. (1986). The recent evolution of mammalian repetitive elements. Trends Genet. 2:76-80.

Deininger PL and Slagel VK. (1988). Recently amplified Alu family members share a common parental Alu sequence. Mol Cell Biol. 8:4566-4569.

Deininger PL, Batzer MA, Hutchison CA III and Edgell MH. (1992). Master genes in mammalian repetitive DNA amplification. Trends Genet. 8:307-312.

Deininger PL, Sherry ST, Risch G, Donaldson C, Robichaux MB, Soodyall H, Jenkins T, Sheen F-M, Swergold G, Stoneking M, Batzer MA. (1999). Interspersed repeat insertion polymorphisms for studies of human molecular anthropology. In: Genomic Diversity, Application in Human . (eds Papiha SS, Deka R, Chakraborty R). Kluwer Academic / Plenum Publishers. New York, Boston, Dordrecht, London, Moscow.

de Knijff P. (2000). Messages through bottlenecks: on the combined use of slow and fast evolving polymorphic markers on the human Y chromosome. Am J Hum Genet. 67:1055-1061.

Dietrich W, Katz H, Lincoln SE, Shin H-S, Friedman J, Dracopoli NC and Lander ES.(1992). A genetic map of the mouse suitable for typing intraspecific crosses. Genetics 131:423-447.

Di Giacomo F, Luca F, Anagnou N, Ciavarella G, Corbo RM, Cresta M, Cucci F, Di Stasi L, Agostiano V, Giparaki M, Loutradis A, Mammi C, Michalodimitrakis EN, Papola F, Pedicini G, Plata E, Terrenato L, Tofanelli S, Malaspina P, Novelletto A.(2003). Clinal patterns of human Y chromosomal diversity in continental Italy and Greece are dominated by drift and founder effects. Mol Phylogenet Evol. 28:387–395.

Di Giacomo F, Luca F, Popa LO, Akar N, Anagnou N, Banyko J, Brdicka R, Barbujani G, Papola F, Ciavarella G, Cucci F, Di Stasi L, Gavrila L, Kerimova MG, Kovatchev D, Kozlov AI, Loutradis A, Mandarino V, Mammì C, Michalodimitrakis EN, Paoli G, Pappa KI, Pedicini G, Terrenato L, Tofanelli S, Malaspina P, Novelletto A.(2004). Y chromosomal haplogroup J as a signature of the post-neolithic colonization of Europe. Hum Genet. 115:357-371.

Di Rienzo A, Peterson AC, Garza JC, Valdes AM, Slatkin M and Freimer NB.(1994). Mutational processes of simple-sequence repeat loci in human populations. Proc Natl Acad Sci USA 91:3166-3170.

Dong SL, Wang E, Hsie L, Cao YX, Chen XG, Gingeras TR.(2001). Flexible use of high density oligonucleotide arrays for single nucleotide polymorphism discovery and validation. Genome Res. 11:1418-1424.

Duru K, Farrow S, Wang JM, Lockette W and Kurtz T.(1994). Frequency of a deletion polymorphism in the gene for angiotensin converting enzyme is increased in African-Americans with hypertension. Am J Hypertens. 7:759-762.

Edwards A, Civitello A, Hammond HA and Caskey CT.(1991). DNA typing and genetic mapping with trimeric and tetrameric tandem repeats. Am J Hum Genet. 49:746-756.

Engels DW.(1981). Alexander the Great and the logistics of the Macedonian Army. Berkeley, CA: University of California Press.

Epplen JT, McCarrey JR, Sutou S and Ohno S.(1982). Base sequence of a cloned snake W-chromosome DNA fragment and identification of a male-specific putative mRNA in the mouse. Proc Natl Acad Sci USA 79:3798-3802.

Farjadian S, Naruse T, Kawata H, Ghaderi A, Bahram S, Inoko H.(2004). Molecular analysis of HLA allele frequencies and haplotypes in Baloch of Iran compared with related populations of Pakistan. Tissue Antigens. 6:581-587.

Feng Q, Moran JV, Kazazian HH Jr and Boeke JD.(1996). Human L1 retrotransposon encodes a conserved endonuclease required for retrotransposition. Cell 87:905-916.

Feuk L, MacDonald JR, Tang T, Carson AR, Li M, Rao G, Khaja R, Scherer SW. (2005). Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet. 1: 489–498.

Feuk L, Carson AR and Scherer SW.(2006). Structural variation in the human genome. Nature Reviews Genetics 7:85-97.

Firasat S, Khaliq S, Mohyuddin A, Papaioannou M, Tyler-Smith C, Underhill PA, Ayub Q.(2007). Y-chromosomal evidence for a limited Greek contribution to the Pathan population of Pakistan. Eur J Hum Genet. 15:121-126.

Fisher EM, Beer-Romero P, Brown LG, Ridley A, McNeil JA, Lawrence JB, Willard HF, Bieber FR, Page DC.(1990). Homologous ribosomal protein genes on the human X and Y chromosomes: escape from X inactivation and possible implications for Turner syndrome. Cell 63:1205-1218.

Flores C, Maca-Meyer N, Larruga JM, Cabrera VM, Karadsheh N, Gonzalez AM. (2005). Isolates in a corridor of migrations: a high-resolution analysis of Y chromosome variation in Jordan. J Hum Genet. 50: 435-441.

Francalacci P, Morelli L, Underhill PA, Lillie AS, Passarino G, Useli A, Madeddu R, Paoli G, Tofanelli S, Calò CM, Ghiani ME, Varesi L, Memmi M, Vona G, Lin AA, Oefner P, Cavalli-Sforza LL.(2003). Peopling of three Mediterranean islands (Corsica, Sardinia, and Sicily) inferred by Y-chromosome biallelic variability. Am J Phys Anthropol. 121:270-279.

Fuller D.(2003). An agricultural perspective on Dravidian historical linguistics: archaeological crop packages, livestock and Dravidian crop vocabulary. In: Bellwood P, Renfrew C (eds). Examining the farming/language dispersal hypothesis. McDonald Institute for Archaeological Research, Cambridge, United Kingdom, pp191-213.

Fu Y-H, Kuhl DPA, Pizzuti A, Pieretti M, Sutcliffe JS, Richards S, Verkerk AJM, Holden JH, Fenwick RG, Warren ST, Oostra BA, Nelson DL and Caskey CT.(1991). Variation of the CGG repeat at the fragile X site results in genetic instability: resolution of the Sherman paradox. Cell 67:1047-1058.

Fu Y-H, Pizzuti A, Fenwick RG Jr, King J, Rajnarayan S, Dunne PW, Dubel J, Nasser GA, Ashizawa T, de Jong P, Wieringa B, Korneluk R, Perryman MB, Epstein HF and Caskey CT.(1992). An unstable triplet repeat in a gene related to myotonic muscular dystrophy. Science. 255:1256-1258.

Gabunia L and Vekua A.(1995). A Plio-pleistocene hominid from Dmanisi, East Georgia, Caucasus. Nature. 373: 509-512.

Gayden T, Cadenas AM, Regueiro M, Singh NB, Zhivotovsky LA, Underhill PA, Cavalli-Sforza LL and Herrera RJ.(2007). The Himalayas as a directional barrier to gene flow. Am J Hum Genet. 80:884-894.

Gilbert N, Lutz-Prigge S and Moran JV.(2002). Genomic deletions created upon LINE-1 retrotransposition. Cell 110:315-325.

Giles RE, Blanc H, Cann HM and Wallace DC.(1980). Maternal inheritance of human mitochondrial DNA. Proc Natl Acad Sci USA 77:6715-6719.

Gill P, Ivanov PL, Kimpton C, Piercy R, Benson N, Tully G, Evett I, Hagelberg E and Sullivan K.(1994). Identification of the remains of the Romanov family by DNA analysis. Nat Genet. 6:130-135.

Goodier JL, Ostertag EM, Du K and Kazazian HH Jr.(2001). A novel active L1 retrotransposon subfamily in the mouse. Genome Res.11:1677-1685. Erratum in: Genome Res 11:1968.

Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, Nibbs RJ, Freedman BI, Quinones MP, Bamshad MJ, Murthy KK, Rovin BH, Bradley W, Clark RA, Anderson SA, O'Connell RJ, Agan BK, Ahuja SS, Bologna B, Sen L, Dolan MJ and Ahuja SK.(2005). The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science. 307:1434-1440.

Grimes BF.(1992). “Ethnologue: Languages of the World,” 12th ed., Summer Institute of Linguistics, Inc., Dallas, Texas, USA.

Grimes B and Cooke H.(1998). Engineering mammalian chromosomes. Hum Mol Genet. 7:1635-1640.

Grubb R and Laurell AB.(1956). Hereditary serological human serum groups. Acta Pathol Microbiol Scand. 39:390-398.

Hacia JG, Fan J-B, Ryder O, Jin L, Edgemon K, Ghandour G, Mayer RA, Sun B, Hsie L, Robbins CM, Brody LC, Wang D, Lander ES, Lipshutz R, Fodor SPA and Collins FS.(1999). Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat Genet. 22:164-167.

Hacia JG and Collins FS.(1999). Mutational analysis using oligonucleotide microarrays. J Med Genet. 36:730-736.

Hamada H and Kakunaga T.(1982). Potential Z-DNA forming sequences are highly dispersed in the human genome. Nature. 298:396-398.

Hamada H, Petrino MG and Kakunaga T.(1982). A novel repeated element with Z-DNA-forming potential is widely found in evolutionarily diverse eukaryotic genomes. Proc Natl Acad Sci USA 79:6465-6469.

Hamada H, Seidman M, Howard BH and Gorman CM.(1984). Enhanced gene expression by the poly(dT-dG)·poly(dC-dA) sequence. Mol Cell Biol. 4:2622-2630.

Hammer MF, Spurdle AB, Karafet T, Bonner MR, Wood ET, Novelletto A, Malaspina P, Mitchell RJ, Horai S, Jenkins T and Zegura SL.(1997). The geographic distribution of human Y chromosome variation. Genetics. 145:787-805.

Hammer MF, Karafet TM, Rasanayagam A, Wood ET, Altheide TK, Jenkins T, Griffiths RC, Templeton AR and Zegura SL.(1998). Out of Africa and back again: Nested cladistic analysis of human Y chromosome variation. Mol Biol Evol. 15:427-441.

Hammer MF, Redd AJ, Wood ET, Bonner MR, Jarjanazi H, Karafet T, Santachiara-Benerecetti S, Oppenheim A, Jobling MA, Jenkins T, Ostrer H and Bonne-Tamir B.(2000). Jewish and Middle Eastern non-Jewish populations share a common pool of Y-chromosome biallelic haplotypes. Proc Natl Acad Sci USA 97:6769-6774.

Hammer MF, Karafet TM, Redd AJ, Jarjanazi H, Santachiara-Benerecetti S, Soodyall H and Zegura SL.(2001). Hierarchical patterns of global human Y-chromosome diversity. Mol Biol Evol. 18:1189-1203.

Hammer MF, Karafet TM, Park H, Omoto K, Harihara S, Stoneking M, and Horai S.(2006). Dual origins of the Japanese: Common ground for hunter-gatherer and farmer Y chromosomes. J Hum Genet. 51:47-58.

Harris H. (1966). Enzyme polymorphism in man. Proc R Soc Lond B Biol Sci. 22:298-310.

Hearne CM, Ghosh S and Todd JA.(1992). Microsatellites for linkage analysis of genetic traits. Trends Genet. 8:288-294.

Henikoff S, Ahmed K and Malik HS.(2001). The centromere paradox: Stable inheritance with rapidly evolving DNA. Science. 293: 1098-1102.

Hinds DA, Kloek AP, Jen M, Chen X and Frazer KA.(2006). Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet. 38: 82–85.

Hirszfeld L and Hirszfeld H.(1919). Serological differences between the blood of different races: The results of researches on the Macedonian front. Lancet ii:675-679.

Horai S, Hayasaka K, Kondo R, Tsugane K and Takahata N.(1995). Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc Natl Acad Sci USA 92:532-536.

Hudjashov G, Kivisild T, Underhill PA, Endicott P, Sanchez JJ, Lin AA, Shen P, Oefner P, Renfrew C, Villems R and Forster P.(2007). Revealing the prehistoric settlement of Australia by Y chromosome and mtDNA analysis. Proc Natl Acad. Sci. 104:8726–8730.

Hughes-Buller R.(1991). Imperial Gazetteer of India, Provincial Series: Balochistan. Sang-e-Meel Publications, Lahore, Pakistan. pp 89-91.

Hurles ME, Nicholson J, Bosch E, Renfrew C, Sykes BC and Jobling MA. (2002). Y chromosomal evidence for the origins of Oceanic-speaking peoples. Genetics. 160: 289–303.

Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW. and Lee C.(2004). Detection of large-scale variation in the human genome. Nat Genet. 36: 949–951.

Ibbetson D.(1883). Punjab Caste. Sang-e-Meel Publications, Lahore. pp 9-16.

Immervoll T, Loesgen S, Dütsch G, Gohlke H, Herbon N, Klugbauer S, Dempfle A, Bickeböller B, Becker-Follmann J, Rüschendorf F, Saar K, Reis A, Wichmann H-E and Wjst M.(2001). Fine mapping and single nucleotide polymorphism association results of candidate genes for asthma and related phenotypes. Hum Mutat. 18:327-336.

International HapMap Consortium.(2005). A haplotype map of the human genome. Nature. 437:1299-1320.

International Human Genome Sequencing Consortium.(2001). Initial sequencing and analysis of the human genome. Nature 409:860-921.

International Human Genome Sequencing Consortium.(2004). Finishing the euchromatic sequence of the human genome. Nature. 431:931-945.

Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, Bras JM, Schymick JC, Hernandez DG, Traynor BJ, Simon-Sanchez J, Matarin M, Britton A, van de Leemput J, Rafferty I, Bucan M, Cann HM, Hardy JA, Rosenberg NA, Singleton AB.(2008). Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451:998-1003.

Jarrige J.(1991). Mehrgarh: Its place in the development of ancient cultures in Pakistan. In: “Forgotten cities on the Indus: Early civilization in Pakistan from the 8th to the 2nd millennium BC”. (Eds. M Jansen, M Mulloy and G Urban). Verlag Philipp von Zabern, Mainz, Germany. pp 34-50.

Jeffreys AJ, Wilson V and Thein SL.(1985). Individual-specific “fingerprints” of human DNA. Nature. 316:76-79.

Jeffreys AJ.(1987). Highly variable minisatellites and DNA fingerprints. Biochemical Society Transactions 15:309-317.

Jeffreys AJ, Royle NJ, Wilson V and Wong Z.(1988). Spontaneous mutation rates to new length alleles at tandem-repetitive hypervariable loci in human DNA. Nature. 332:278-281.

Jeffreys AJ and Pena SD.(1993). Brief introduction to human DNA fingerprinting. EXS. 67:1-20.

Jeng JR, Harn HJ, Jeng CY, Yueh KC and Shieh SM.(1997). Angiotensin I converting enzyme gene polymorphism in Chinese patients with hypertension. Am J Hypertens. 10: 558-561.

Jenkins S and Gibson N.(2002). High-throughput SNP genotyping. Comp Funct Genomics. 3:57-66.

Jobling MA and Tyler-Smith C. (2003). The human Y chromosome: An evolutionary marker comes of age. Nat Rev Genet. 4: 598-612.

Jorde LB, Bamshad MJ, Watkins WS, Zenger R, Fraley AE, Krakowiak PA, Carpenter KD, Soodyall H, Jenkins T and Rogers AR.(1995). Origins and affinities of modern humans: a comparison of mitochondrial and nuclear genetic data. Am J Hum Genet. 57:523-538.

Jurka J.(1997). Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proc Natl Acad Sci. USA 94:1872-1877.

Kajikawa M and Okada N.(2002). LINEs mobilize SINEs in the eel through a shared 3′ sequence. Cell 111:433-444.

Kan YW and Dozy AM.(1978). Polymorphism of DNA sequence adjacent to human β-globin structural gene: relationship to sickle mutation. Proc Natl Acad Sci USA 75:5631-5635.

Kapitonov V and Jurka J.(1996). The age of Alu subfamilies. J Mol Evol. 42:59-65.

Karafet T, Xu L, Du R, Wang W, Feng S, Wells RS, Redd AJ, Zegura SL and Hammer MF.(2001). Paternal population history of East Asia: Sources, patterns, and microevolutionary processes. Am J Hum Genet. 69:615–628.

Karafet TM, Osipova LP, Gubina MA, Posukh OL, Zegura SL, and Hammer MF. (2002). High levels of Y-chromosome differentiation among native Siberian populations and the genetic signature of a boreal hunter-gatherer way of life. Hum Biol. 74: 761-789.

Karafet TM, Lansing JS, Redd AJ, Reznikova S, Watkins JC, Surata SP, Arthawiguna WA, Mayer L, Bamshad M, Jorde LB, Hammer MF.(2005). Balinese Y-chromosome perspective on the peopling of Indonesia: Genetic contributions from pre-Neolithic hunter-gatherers, Austronesian farmers, and Indian traders. Hum Biol. 77:93-114.

Karafet TM, Mendez FL, Meilerman MB, Underhill PA, Zegura SL and Hammer MF.(2008). New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree. Genome Res. 18:830-838.

Kayser M, Roewer L, Hedman M, Henke L, Henke J, Brauer S, Krüger K, Krawczak M, Nagy M, Dobosz T, Szibor R, de Knijff P and Sajantila A.(2000). Characteristics and frequency of germline mutations at microsatellite loci from the human Y chromosome, as revealed by direct observation in father/son pairs. Am J Hum Genet. 66:1580-1588.

Kayser M, Brauer S, Cordaux R, Casto A, Lao O, Zhivotovsky LA, Moyse-Faurie C, Rutledge RB, Schiefenhoevel W, Gil D, Lin AA, Underhill PA, Oefner PJ, Trent RJ and Stoneking M.(2006). Melanesian and Asian origins of Polynesians: mtDNA and Y chromosome gradients across the Pacific. Mol Biol Evol. 23:2234-2244.

Kazazian HH Jr and Moran JV.(1998). The impact of L1 retrotransposons on the human genome. Nat Genet. 19:19-24.

Kazazian HH Jr, Wong C, Youssoufian H, Scott AF, Phillips DG and Antonarakis S.(1988). Haemophilia A resulting from de novo insertion of L1 sequences represents a novel mechanism for mutation in man. Nature (London) 332:164-166.

Ke Y, Su B, Song X, Lu D, Chen L, Li H, Qi C, Marzuki S, Deka R, Underhill P, Xiao C, Shriver M, Lell J, Wallace D, Wells RS, Seielstad M, Oefner P, Zhu D, Jin J, Huang W, Chakraborty R, Chen Z and Jin L.(2001). African origin of modern humans in East Asia: A tale of 12,000 Y chromosomes. Science 292: 1151-1153.

Kimmel M and Chakraborty R.(1996). Measure of variation at DNA repeat loci under a generalized stepwise mutation model. Theor Pop Biol. 50:345-367.

King R and Underhill PA.(2002). Congruent distribution of Neolithic painted pottery and ceramic figurines with Y-chromosome lineages. Antiquity 76:707-714.

King TE, Bowden GR, Balaresque PL, Adams SM, Shanks ME and Jobling MA.(2007). Thomas Jefferson’s Y chromosome belongs to a rare European lineage. Am J Phys Anthropol. 132:583–589.

Kivisild T, Rootsi S, Metspalu M, Mastana S, Kaldma K, Parik J, Metspalu E, Adojaan M, Tolk HV, Stepanov V, Gölge M, Usanga E, Papiha SS, Cinnioğlu C, King R, Cavalli-Sforza L, Underhill PA, Villems R.(2003). The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations. Am J Hum Genet. 72: 313–332.

Klein RG.(1989). The Human Career: Human Biological and Cultural Origins. Chicago: University of Chicago Press.

Knight A, Batzer MA, Stoneking M, Tiwari HK, Scheer WD, Herrera RJ, Deininger PL.(1996). DNA sequences of Alu elements indicate a recent replacement of the human autosomal genetic complement. Proc Natl Acad Sci USA 93:4360-4364.

Knight A, Underhill PA, Mortensen HM, Zhivotovsky LA, Lin AA, Henn BM, Louis D, Ruhlen M, Mountain JL.(2003). African Y chromosome and mtDNA divergence provides insight into the history of click languages. Curr Biol. 13:464-473.

Korenberg JR and Rykowski MC.(1988). Human genome organization: Alu, Lines, and the molecular structure of metaphase chromosome bands. Cell 53:391-400.

Koschinsky ML, Boffa MB, Nesheim ME, Zinman B, Hanley AJG, Harris SB, Cao H and Hegele RA.(2001). Association of a single nucleotide polymorphism in CPB2 encoding the thrombin-activatable fibrinolysis inhibitor (TAFI) with blood pressure. Clin Genet. 60:345-349.

Kremer EJ, Pritchard M, Lynch M, Yu S, Holman K, Baker E, Warren ST, Schlessinger D, Sutherland GR, and Richards RI.(1991). Mapping of DNA instability at the fragile X to a trinucleotide repeat sequence p(CCG)n. Science 252:1711-1714.

Kruglyak S, Durrett RT, Schug MD, Aquadro CF.(1998). Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc Natl Acad Sci U S A. 95:10774-10778.

Kruse PE Jr and Patterson MK.(1973). Tissue Culture: Methods and Applications. Academic Press, New York. pp 16-17.

Labuda D, Sinnett D, Richer C, Deragon JM and Striker G.(1991). Evolution of mouse B1 repeats: 7SL RNA folding pattern conserved. J Mol Evol. 32:405-414.

Lahr MM and Foley RA.(1994). Multiple dispersals and modern human origins. Evolutionary Anthropology. 3: 48-60.

Lahr MM and Foley RA.(1998). Towards a theory of modern human origins: Geography, demography, and diversity in recent human evolution. Am J Phys Anthropol. 41:137-176.

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, Fitzhugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang HM, Yu J, Wang J, Huang GY, Gu J, Hood L, Rowen L, Madan A, Qin SZ, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan HQ, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JGR, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang WH, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz JR, Slater G, Smit AFA, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ and International Human Genome Sequencing Consortium.(2001). Initial sequencing and analysis of the human genome. Nature 409:860-921.

Landsteiner K.(1901). Über Agglutinationserscheinungen normalen menschlichen Blutes. Wien Klin Wochenschr. 14:1132-1134.

La Spada AR, Wilson AM, Lubahn DB, Harding AE and Fischbeck KH.(1991). Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature. 352:77-79.

Leakey R.(1994). The Origin of Humankind. Basic Books, A Division of HarperCollins, New York.

Lichten MJ and Fox MS.(1983). Detection of non-homology-containing heteroduplex molecules. Nucleic Acids Res. 11:3959-3971.

Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM.(2008). Worldwide human relationships inferred from genome-wide patterns of variation. Science. 319:1100-4.

Li W-H, Gu Z, Wang H and Nekrutenko A.(2001). Evolutionary analyses of the human genome. Nature. 409:847-849.

Lines M.(1999). The Kalasha people of North-western Pakistan. Peshawar, Pakistan: Emjay Books International.

Litt M and Luty JA.(1989). A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. Am J Hum Genet. 44:397-401.

Lucotte G and Ngo NY.(1985). p49f, A highly polymorphic probe, that detects Taq1 RFLPs on the human Y chromosome. Nucleic Acids Res.13:8285.

Ludwig E, Corneli PS, Anderson JL, Marshall HW, Lalouel JM and Ward RH.(1995). Angiotensin-converting enzyme gene polymorphism is associated with myocardial infarction but not with development of coronary stenosis. Circulation 91:2120-2124.

Luis JR, Rowold DJ, Regueiro M, Caeiro B, Cinnioğlu C, Roseman C, Underhill PA, Cavalli-Sforza LL, Herrera RJ.(2004). The Levant versus the Horn of Africa: evidence for bidirectional corridors of human migrations. Am J Hum Genet. 74:532-544.

Luning Prak ET, Dodson AW, Farkash EA, Kazazian HH Jr.(2003). Tracking an embryonic L1 retrotransposition event. Proc Natl Acad Sci USA 100:1832-1837.

Malik HS, Burke WD and Eickbush TH.(1999). The age and evolution of non-LTR retrotransposable elements. Mol Biol Evol. 16:793-805.

Maniatis T, Fritsch EF and Sambrook J.(1982). Molecular cloning: A laboratory manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York.

Mansoor A, Mazhar K, Khaliq S, Hameed A, Rehman S, Siddiqi S, Papaioannou M, Cavalli-Sforza LL, Mehdi SQ, Ayub Q.(2004). Investigation of the Greek ancestry of populations from northern Pakistan. Hum Genet. 114:484-490.

Marri MKBB.(1985). “Search lights on Baloch and Balochistan.” 3rd Edition. Nisa traders, Quetta, Pakistan.

Marshall A and Hodgson J.(1998). DNA chips: An array of possibilities. Nature Biotechnology 16: 27–31.

Mathias SL, Scott AF, Kazazian HH Jr, Boeke JD and Gabriel A.(1991). Reverse transcriptase encoded by a human transposable element. Science. 254:1808-1810.

McAlpin DW.(1974). Towards proto-Elamo-Dravidian. Language. 50:89-101.

McAlpin DW.(1981). Proto-Elamo-Dravidian: the evidence and its implications. Trans Am Phil Soc. 71:3-155.

McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S, Gabriel SB, Lee C, Daly MJ, Altshuler DM and The International HapMap Consortium.(2006). Common deletion polymorphisms in the human genome. Nat Genet. 38: 86–92.

McClay JL, Sugden K, Koch HG, Higuchi S and Craig IW.(2002). High-throughput single-nucleotide polymorphism genotyping by fluorescent competitive allele-specific polymerase chain reaction (SNiPTag). Anal Biochem. 301:200-206.

Mehdi SQ.(2007). Genetics of Pakistani populations in an Asian and global context. In: Cavalli-Sforza LL and Feldman M (eds). Human Population Genetics: Evolution and Variation. The Biomedical & Life Sciences Collection, Henry Stewart Talks Ltd, London. (online at http://hstalks.com/bio).

Meselson M and Yuan R.(1968). DNA restriction enzyme from E. coli. Nature 217:1110-1114.

Mhlanga MM and Malmberg L.(2001). Using molecular beacons to detect single-nucleotide polymorphisms with real-time PCR. Methods. 25:463-471.

Miesfeld R, Krystal M and Arnheim N.(1981). A member of a new repeated sequence family which is conserved throughout eucaryotic evolution is found between the human δ and β globin genes. Nucleic Acids Res. 9:5931-5948.

Mohyuddin A, Ayub Q, Underhill PA, Tyler-Smith C and Mehdi SQ.(2006). Detection of novel Y SNPs provides further insights into Y chromosomal variation in Pakistan. J Hum Genet. 51:375-378.

Morrish TA, Gilbert N, Myers JS, Vincent BJ, Stamato TD, Taccioli GE, Batzer MA and Moran JV.(2002). DNA repair mediated by endonuclease-independent LINE-1 retrotransposition. Nat Genet. 31:159-165.

Mountain JL and Cavalli-Sforza LL.(1994). Inference of human evolution through cladistic analysis of nuclear DNA restriction polymorphisms. Proc Natl Acad Sci USA 91:6515-6519.

Myers JS, Vincent BJ, Udall H, Watkins WS, Morrish TA, Kilroy GE, Swergold GD, Henke J, Henke L, Moran JV, Jorde LB and Batzer MA.(2002). A comprehensive analysis of recently integrated human Ta L1 elements. Am J Hum Genet. 71:312-326.

Nakamura Y, Leppert M, O'Connell P, Wolff R, Holm T, Culver M, Martin C, Fujimoto E, Hoff M, Kumlin E and White R.(1987). Variable number of tandem repeat (VNTR) markers for human gene mapping. Science. 235:1616-1622.

Nanavutty P.(1997). The Parsis. National Book Trust, New Delhi, India.

Nasidze I, Sarkisian T, Kerimov A and Stoneking M. (2003). Testing hypotheses of language replacement in the Caucasus: evidence from the Y-chromosome. Hum Genet. 112:255-261.

Nasidze I, Ling EYS, Quinque D, Dupanloup I, Cordaux R, Rychkov S, Naumova O, Zhukova O, Sarraf-Zadegan N, Naderi GA, Asgary S, Sardas S, Farhud DD, Sarkisian T, Asadov C, Kerimov A, Stoneking M.(2004). Mitochondrial DNA and Y- chromosome variation in the Caucasus. Ann Hum Genet. 68: 205–221.

Nebel A, Filon D, Brinkmann B, Majumder PP, Faerman M, Oppenheim A.(2001). The Y chromosome pool of Jews as part of the genetic landscape of the Middle East. Am J Hum Genet. 69:1095–1112.

Nicholas Awde and Asmatullah Sarwan. Pashto Dictionary & Phrasebook: Pashto-English, English-Pashto. (Hippocrene Books, 2003, ISBN 078180972X) retrieved 10 January 2007.

Oakey R, Tyler-Smith C.(1990). Y chromosome DNA haplotyping suggests that most European and Asian men are descended from one of two males. Genomics. 7:325-330.

Oefner PJ and Underhill PA.(1995). Comparative DNA sequencing by denaturing high-performance liquid chromatography (DHPLC). Am J Hum Genet. 57:A266.

Olivo PD, Van de Walle MJ, Laipis PJ and Hauswirth WW.(1983). Nucleotide sequence evidence for rapid genotypic shifts in the bovine mitochondrial DNA D-loop. Nature. 306:400-402.

Orita M, Iwahana H, Kanazawa H, Hayashi K and Sekiya T.(1989). Detection of polymorphisms of human DNA by gel electrophoresis as single-strand conformation polymorphisms. Proc Natl Acad Sci USA 86:2766-2770.

Ostertag EM and Kazazian HH Jr.(2001). Twin priming: a proposed mechanism for the creation of inversions in L1 retrotransposition. Genome Res. 11:2059-2065.

Ostertag EM, DeBerardinis RJ, Goodier JL, Zhang Y, Yang N, Gerton GL and Kazazian HH Jr.(2002). A mouse model of human L1 retrotransposition. Nat Genet. 32:655-660.

Pakistan Economic Survey.(2006-2007). An accountancy publication www.accountancy.com.pk.

Pandya A, King TE, Santos FR, Taylor PG, Thangaraj K, Singh L, Jobling MA, Tyler-Smith C.(1998). A polymorphic human Y-chromosomal G to A transition found in India. Ind J Hum Genet. 4:52–61.

Passarino G, Semino O, Quintana-Murci L, Excoffier L, Hammer M and Santachiara-Benerecetti AS.(1998). Different genetic components in the Ethiopian population, identified by mtDNA and Y-chromosome polymorphisms. Am J Hum Genet. 62:420-434.

Passarino G, Semino O, Magri C, Al-Zahery N, Benuzzi G, Quintana-Murci L, Andelinovic S, Bulic-Jakus F, Liu A, Arslan A, Santachiara-Benerecetti AS.(2001). The 49a,f haplotype 11 is a new marker of the EU19 lineage that traces migrations from northern regions of the Black Sea. Hum Immunol. 62:922-932. Erratum in: Hum Immunol 62:1313-1314.

Passarino G, Cavalleri GL, Lin AA, Cavalli-Sforza LL, Børresen-Dale AL, Underhill PA.(2002). Different genetic components in the Norwegian population revealed by the analysis of mtDNA and Y chromosome polymorphisms. Eur J Hum Genet. 10:521-529.

Payne R, Tripp M, Weigle J, Bodmer W and Bodmer J.(1964). A new leukocyte isoantigen system in man. Cold Spring Harb Symp Quant Biol. 29:285-295.

Perez-Lezaun A, Calafell F, Mateu E, Comas D, Ruiz-Pacheco R and Bertranpetit J.(1997). Microsatellite variation and the differentiation of modern humans. Hum Genet. 99:1-7.

Prak ET and Kazazian HH Jr.(2000). Mobile elements and the human genome. Nat Rev Genet. 1:134-144.

Qamar R, Ayub Q, Khaliq S, Mansoor A, Karafet T, Mehdi SQ and Hammer MF. (1999). African and Levantine origins of Pakistani YAP+ Y chromosomes. Hum Biol. 71:745-755.

Qamar R, Ayub Q, Mohyuddin A, Helgason A, Mazhar K, Mansoor A, Zerjal T, Tyler-Smith C and Mehdi SQ.(2002). Y-chromosomal DNA variation in Pakistan. Am J Hum Genet. 70:1107-1124.

Qi XQ, Bakht S, Devos KM, Gale MD and Osbourn A. (2001). L-RCA (Ligation rolling circle amplification): a general method for genotyping of single nucleotide polymorphism (SNPs). Nucleic Acids Res. 29: U68-U74.

Quddus SA.(1990). “A Tribal Balochistan”. Ferozsons (Pvt.) Ltd., Lahore, Pakistan.

Queller DC, Strassmann JE and Hughes CR.(1993). Microsatellites and kinship. Trends Ecol Evol. 8:285-288.

Quintana-Murci L, Semino O, Minch E, Passarino G, Brega A and Santachiara-Benerecetti AS.(1999a). Further characteristics of proto-European Y chromosomes. Eur J Hum Genet. 7:603-608.

Quintana-Murci L, Semino O, Poloni ES, Liu A, Van Gijn M, Passarino G, Brega A, Nasidze IS, Maccioni L, Cossu G, al-Zahery N, Kidd JR, Kidd KK and Santachiara-Benerecetti AS.(1999b). Y-chromosome specific YCAII, DYS19 and YAP polymorphisms in human populations: a comparative study. Ann Hum Genet. 63:153-166.

Quintana-Murci L, Semino O, Bandelt HJ, Passarino G, McElreavey K and Santachiara-Benerecetti AS. (1999c). Genetic evidence of an early exit of Homo sapiens sapiens from Africa through eastern Africa. Nat Genet. 23:437-441.

Quintana-Murci L, Krausz C, Zerjal T, Sayar SH, Hammer MF, Mehdi SQ, Ayub Q, Qamar R, Mohyuddin A, Radhakrishna U, Jobling MA, Tyler-Smith C and McElreavey K.(2001). Y-Chromosome Lineages Trace Diffusion of People and Languages in Southwestern Asia. Am J Hum Genet. 68:537-542.

Quintana-Murci L, Chaix R, Wells RS, Behar DM, Sayar H, Scozzari R, Rengo C, Al-Zahery N, Semino O, Santachiara-Benerecetti AS, Coppa A, Ayub Q, Mohyuddin A, Tyler-Smith C, Mehdi SQ, Torroni A, McElreavey K.(2004). Where west meets east: the complex mtDNA landscape of the southwest and Central Asian corridor. Am J Hum Genet. 74:827-845.

Ramana GV, Su B, Jin L, Singh L, Wang N, Underhill PA, Chakraborty R.(2001). Y chromosome SNP haplotypes suggest evidence of gene flow among caste, tribe, and the migrant populations of Andhra Pradesh, South India. Eur J Hum Genet. 9:695–700.

Ramsay G. (1998). DNA chips: state of the art. Nat Biotechnol. 16:40-44.

Raynolds MV, Bristow MR, Bush EW, Abraham WT, Lowes BD, Zisman LS, Taft CS and Perryman MB. (1993). Angiotensin-converting enzyme DD genotype in patients with ischaemic or idiopathic dilated cardiomyopathy. Lancet. 342:1073-1075.

Regueiro M, Cadenas AM, Gayden T, Underhill PA and Herrera RJ. (2006). Iran: Tricontinental nexus for Y-chromosome driven migration. Hum Hered. 61:132–143.

Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW and Hurles ME. (2006). Global variation in copy number in the human genome. Nature. 444:444-454.

Repping S, van Daalen SK, Brown LG, Korver CM, Lange J, Marszalek JD, Pyntikova T, van der Veen F, Skaletsky H, Page DC and Rozen S. (2006). High mutation rates have driven extensive structural polymorphism among human Y chromosomes. Nat Genet. 38:463-467.

Renfrew C.(1987). Archaeology and language: the puzzle of Indo-European origins. Jonathan Cape, London.

Richards RI, Holman K, Yu S and Sutherland GR. (1993). Fragile X syndrome unstable element, p(CCG)n, and other simple tandem repeat sequences are binding sites for specific nuclear proteins. Hum Mol Genet. 2:1429-1435.

Rightmire GP. (1989). Middle Stone Age humans from eastern and southern Africa. In: P Mellars and CB Stringer (eds): The Human Revolution. Edinburgh: Edinburgh University Press, pp 109-122.

Robertson GS. (1896). The Kafirs of the Hindu-Kush. Oxford University Press, Karachi, Pakistan.

Roberts RJ and Murray K. (1976). Restriction Endonucleases. CRC Crit Rev Biochem. 4:123–164.

Roewer L, Krawczak M, Willuweit S, Nagy M, Alves C, Amorim A, Anslinger K, Augustin C, Betz A, Bosch E, Cagliá A, Carracedo A, Corach D, Dekairelle AF, Dobosz T, Dupuy BM, Füredi S, Gehrig C, Gusmão L, Henke J, Henke L, Hidding M, Hohoff C, Hoste B, Jobling MA, Kärgel HJ, de Knijff P, Lessig R, Liebeherr E, Lorente M, Martínez-Jarreta B, Nievas P, Nowak M, Parson W, Pascali VL, Penacino G, Ploski R, Rolf B, Sala A, Schmidt U, Schmitt C, Schneider PM, Szibor R, Teifel-Greding J and Kayser M. (2001). Online reference database of European Y-chromosomal short tandem repeat (STR) haplotypes. Forensic Sci Int. 118:106-113.

Rootsi S, Magri C, Kivisild T, Benuzzi G, Help H, Bermisheva M, Kutuev I, Barać L, Pericić M, Balanovsky O, Pshenichnov A, Dion D, Grobei M, Zhivotovsky LA, Battaglia V, Achilli A, Al-Zahery N, Parik J, King R, Cinnioğlu C, Khusnutdinova E, Rudan P, Balanovska E, Scheffrahn W, Simonescu M, Brehm A, Goncalves R, Rosa A, Moisan JP, Chaventre A, Ferak V, Füredi S, Oefner PJ, Shen P, Beckman L, Mikerezi I, Terzić R, Primorac D, Cambon-Thomsen A, Krumina A, Torroni A, Underhill PA, Santachiara-Benerecetti AS, Villems R and Semino O. (2004). Phylogeography of Y-chromosome haplogroup I reveals distinct domains of prehistoric gene flow in Europe. Am J Hum Genet. 75:128-137.

Rootsi S, Zhivotovsky LA, Baldovic M, Kayser M, Kutuev IA, Khusainova R, Bermisheva MA, Gubina M, Fedorova SA, Ilumäe AM, Khusnutdinova EK, Voevoda MI, Osipova LP, Stoneking M, Lin AA, Ferak V, Parik J, Kivisild T, Underhill PA and Villems R.(2007). A counter-clockwise northern route of the Y-chromosome haplogroup N from Southeast Asia towards Europe. Eur J Hum Genet. 15: 204-211.

Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA and Feldman MW. (2002). Genetic structure of human populations. Science. 298:2381-2385.

Rosser ZH, Zerjal T, Hurles ME, Adojaan M, Alavantic D, Amorim A, Amos W, Armenteros M, Arroyo E, Barbujani G, Beckman G, Beckman L, Bertranpetit J, Bosch E, Bradley DG, Brede G, Cooper G, Côrte-Real HB, de Knijff P, Decorte R, Dubrova YE, Evgrafov O, Gilissen A, Glisic S, Gölge M, Hill EW, Jeziorowska A, Kalaydjieva L, Kayser M, Kivisild T, Kravchenko SA, Krumina A, Kucinskas V, Lavinha J, Livshits LA, Malaspina P, Maria S, McElreavey K, Meitinger TA, Mikelsaar AV, Mitchell RJ, Nafa K, Nicholson J, Nørby S, Pandya A, Parik J, Patsalis PC, Pereira L, Peterlin B, Pielberg G, Prata MJ, Previderé C, Roewer L, Rootsi S, Rubinsztein DC, Saillard J, Santos FR, Stefanescu G, Sykes BC, Tolun A, Villems R, Tyler-Smith C, Jobling MA.(2000). Y-chromosomal diversity in Europe is clinal and influenced primarily by geography, rather than by language. Am J Hum Genet. 67:1526-1543.

Royle NJ, Clarkson RE, Wong Z and Jeffreys AJ. (1988). Clustering of hypervariable minisatellites in the proterminal regions of human autosomes. Genomics. 3:352-360.

Ruiz-Pesini E, Lott MT, Procaccio V, Poole JC, Brandon MC, Mishmar D, Yi C, Kreuziger J, Baldi P and Wallace DC.(2007). An enhanced MITOMAP with a global mtDNA mutational phylogeny. Nucleic Acids Res. 35:D823–D828.

Ruvolo ME, Zehr S, von Dornum M, Pan D, Chang B and Lin J. (1993). Mitochondrial COII sequences and modern human origins. Mol Biol Evol. 10:1115-1135.

Saiki RK, Scharf S, Faloona F, Mullis KB, Horn GT, Erlich HA and Arnheim N. (1985). Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science 230:1350-1354.

Sanchez JJ, Hallenberg C, Borsting C, Hernandez A and Morling N. (2005). High frequencies of Y chromosome lineages characterized by E3b1, DYS19-11, DYS392-12 in Somali males. Eur J Hum Genet. 13:856-866.

Santos FR, Pandya A, Kayser M, Mitchell RJ, Liu A, Singh L, Destro-Bisol G, Novelletto A, Qamar R, Mehdi SQ, Adhikari R, de Knijff P and Tyler-Smith C. (2000). A polymorphic L1 retroposon insertion in the centromere of the human Y chromosome. Hum Mol Genet. 9:421-430.

Sassaman DM, Dombroski BA, Moran JV, Kimberland ML, Naas TP, De Berardinis RJ, Gabriel A, Swergold GD and Kazazian HH Jr. (1997). Many human L1 elements are capable of retrotransposition. Nat Genet. 16:37-43.

Scheinfeldt L, Friedlaender F, Friedlaender J, Latham K, Koki G, Karafet T, Hammer M, and Lorenz J.(2006). Unexpected NRY chromosome variation in Northern Island Melanesia. Mol Biol Evol. 23:1628-1641.

Schunkert H, Hense HW, Holmer SR, Stender M, Perz S, Keil U, Lorell BH and Riegger GA. (1994). Association between a deletion polymorphism of the angiotensin-converting enzyme gene and left ventricular hypertrophy. N Engl J Med. 330:1634-1638.

Schurr TG, Maggi WR, Fowler K and Wallace DC. (2000). The ethnic origins of an enigmatic south Asian population, the Kalasha of northern Pakistan, as revealed by mtDNA variation. Am J Hum Genet. 67:217.

Scozzari R, Torroni A, Semino O, Sirugo G, Brega A and Santachiara-Benerecetti AS. (1988). Genetic studies on the Senegal population. I. Mitochondrial DNA polymorphisms. Am J Hum Genet. 43:534-544.

Scozzari R, Cruciani F, Santolamazza P, Malaspina P, Torroni A, Sellitto D, Arredi B, Destro-Bisol G, De Stefano G, Rickards O, Martinez-Labarga C, Modiano D, Biondi G, Moral P, Olckers A, Wallace DC and Novelletto A.(1999). Combined use of biallelic and microsatellite Y-chromosome polymorphisms to infer affinities among African populations. Am J Hum Genet. 65:829-46.

Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A and Wigler M. (2004). Large-scale copy number polymorphism in the human genome. Science. 305:525-528.

Seielstad M, Yuldasheva N, Singh N, Underhill P, Oefner P, Shen P and Wells RS. (2003). A novel Y-chromosome variant puts an upper limit on the timing of first entry into the Americas. Am J Hum Genet. 73:700-705.

Sengupta S, Zhivotovsky LA, King R, Mehdi SQ, Edmonds CA, Chow CE, Lin AA, Mitra M, Sil SK, Ramesh A, Usha Rani MV, Thakur CM, Cavalli-Sforza LL, Majumder PP, Underhill PA.(2006). Polarity and temporality of high-resolution Y-chromosome distributions in India identify both indigenous and exogenous expansions and reveal minor genetic influence of Central Asian pastoralists. Am J Hum Genet. 78:202-221.

Semino O, Passarino G, Oefner PJ, Lin AA, Arbuzova S, Beckman LE, De Benedictis G, Francalacci P, Kouvatsi A, Limborska S, Marcikiae M, Mika A, Mika B, Primorac D, Santachiara-Benerecetti AS, Cavalli-Sforza LL and Underhill PA. (2000). The genetic legacy of Palaeolithic Homo sapiens sapiens in extant Europeans: a Y-chromosome perspective. Science. 290:1155-1159.

Semino O, Santachiara-Benerecetti AS, Falaschi F, Cavalli-Sforza LL and Underhill PA. (2002). Ethiopians and Khoisan share the deepest clades of the human Y-chromosome phylogeny. Am J Hum Genet. 70:265-268.

Semino O, Magri C, Benuzzi G, Lin AA, Al-Zahery N, Battaglia V, Maccioni L, Triantaphyllidis C, Shen P, Oefner PJ, Zhivotovsky LA, King R, Torroni A, Cavalli-Sforza LL, Underhill PA and Santachiara-Benerecetti AS. (2004). Origin, diffusion, and differentiation of Y-chromosome haplogroups E and J: Inferences on the neolithization of Europe and later migratory events in the Mediterranean area. Am J Hum Genet. 74:1023-1034.

Serre D and Hudson TJ. (2006). Resources for Genetic Variation Studies. Annu Rev Genomics Hum Genet. 7:443-457.

Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D and Eichler EE. (2005). Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 77:78-88.

Shen MR, Batzer MA and Deininger PL. (1991). Evolution of the master Alu gene(s). J Mol Evol. 33:311-320.

Shi H, Dong YL, Wen B, Xiao CJ, Underhill PA, Shen PD, Chakraborty R, Jin L, and Su B.(2005). Y-chromosome evidence of southern origin of the East Asian-specific haplogroup O3-M122. Am J Hum Genet 77: 408-419.

Shriver MD, Jin L, Chakraborty R and Boerwinkle E. (1993). VNTR allele-frequency distribution under the stepwise mutation model: a computer simulation approach. Genetics. 134:983-993.

Shriver MD, Jin L, Ferrell RE and Deka R. (1997). Microsatellite data support an early population expansion in Africa. Genome Res. 7:586-591.

Sims LM, Garvey D and Ballantyne J. (2007). Sub-populations within the major European and African derived haplogroups R1b3 and E3a are differentiated by previously phylogenetically undefined Y-SNPs. Hum Mutat. 28:97.

Smit AF. (1996). The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 6:743-748.

Smit AF. (1999). Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 9:657-663.

Stefansson H, Helgason A, Thorleifsson G, Steinthorsdottir V, Masson G, Barnard J, Baker A, Jonasdottir A, Ingason A, Gudnadottir VG, Desnica N, Hicks A, Gylfason A, Gudbjartsson DF, Jonsdottir GM, Sainz J, Agnarsson K, Birgisdottir B, Ghosh S, Olafsdottir A, Cazier JB, Kristjansson K, Frigge ML, Thorgeirsson TE, Gulcher JR, Kong A and Stefansson K.(2005). A common inversion under selection in Europeans. Nat Genet. 37:129-137.

Strachan T and Read AP.(2004). Human Molecular Genetics, 3rd ed. Garland Science, London and New York.

Stringer CB and Andrews P. (1988). Genetic and fossil evidence for the origin of modern humans. Science. 239:1263-1268.

Stringer C. (2000). Palaeoanthropology. Coasting out of Africa. Nature 405:24-27.

Swallow DM, Gendler S, Griffiths B, Corney G, Taylor-Papadimitriou J and Bramwell ME. (1987). The human tumour-associated epithelial mucins are coded by an expressed hypervariable gene locus PUM. Nature. 328:82-84.

Swisher CC 3rd, Curtis GH, Jacob T, Getty AG, Suprijo A and Widiasmoro. (1994). Age of the earliest known hominids in Java, Indonesia. Science. 263:1118-1121.

Su B, Xiao J, Underhill P, Deka R, Zhang W, Akey J, Huang W, Shen D, Lu D, Luo J, Chu J, Tan J, Shen P, Davis R, Cavalli-Sforza L, Chakraborty R, Xiong M, Du R, Oefner P, Chen Z, Jin L.(1999). Y-chromosome evidence for a northward migration of modern humans into eastern Asia during the last Ice Age. Am J Hum Genet. 65:1718–1724.

Su B, Jin L, Underhill P, Martinson J, Saha N, McGarvey ST, Shriver MD, Chu J, Oefner P, Chakraborty R and Deka R. (2000). Polynesian origins: Insights from the Y chromosome. Proc Natl Acad Sci. 97: 8225–8228.

Sun C, Skaletsky H, Rozen S, Gromoll J, Nieschlag E, Oates R and Page DC. (2000). Deletion of azoospermia factor a (AZFa) region of human Y chromosome caused by recombination between HERV15 proviruses. Hum Mol Genet. 9:2291-2296.

Tattersall I. (1997). Out of Africa again... and again? Sci Am. 276:60-67.

Tautz D.(1989). Hypervariability of simple sequences as a general source for polymorphic DNA markers. Nucleic Acids Res. 17: 6463-6471.

Thangaraj K, Singh L, Reddy AG, Rao VR, Sehgal SC, Underhill PA, Pierson M, Frame IG, and Hagelberg E. (2003). Genetic affinities of the Andaman Islanders, a vanishing human population. Curr Biol. 13:86-93.

Thanseem I, Thangaraj K, Chaubey G, Singh VK, Bhaskar LV, Reddy BM, Reddy AG, Singh L. (2006). Genetic affinities among the lower castes and tribal groups of India: inference from Y chromosome and mitochondrial DNA. BMC Genet. 7:42.

The ENCODE Project Consortium.(2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 447:799-816.

Thomas MG, Bradman N and Flinn HM. (1999). High throughput analysis of 10 microsatellite and 11 diallelic polymorphisms on the human Y-chromosome. Hum Genet. 105:577–581.

Tishkoff SA, Dietzsch E, Speed W, Pakstis AJ, Kidd JR, Cheung K, Bonné-Tamir B, Santachiara-Benerecetti AS, Moral P and Krings M. (1996). Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science. 271:1380-1387.

Todd JA, Aitman TJ, Cornall RJ, Ghosh S, Hall JRS, Hearne CM, Knight AM, Love JM, McAleer MA, Prins J-B, Rodrigues N, Lathrop M, Pressey A, Delarato NH, Peterson LB and Wicker LS. (1991). Genetic analysis of autoimmune type 1 diabetes mellitus in mice. Nature. 351:542-547.

Toth G, Gaspari Z and Jurka J. (2000). Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10:967-981.

Treco D and Arnheim N. (1986). The evolutionarily conserved repetitive sequence d(TG.AC)n promotes reciprocal exchange and generates unusual recombinant tetrads during yeast meiosis. Mol Cell Biol. 6:3934-3947.

Tsunoda K, Sanke T, Nakagawa T, Furuta H and Nanjo K. (2001). Single nucleotide polymorphism (D68D, T to C) in the syntaxin 1A gene correlates to age at onset and insulin requirement in Type II diabetic patients. Diabetologia. 44:2092-2097.

Turner G, Barbulescu M, Su M, Jensen-Seaman MI, Kidd KK and Lenz J. (2001). Insertional polymorphism of full-length endogenous retroviruses in humans. Curr Biol. 11:1531-1535.

Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, Olson MV and Eichler EE.(2005). Fine-scale structural variation of the human genome. Nat Genet. 37:727-732.

Ullu E and Tschudi C.(1984). Alu sequences are processed 7SL RNA genes. Nature 312:171-172.

Underhill PA, Jin L, Lin AA, Mehdi SQ, Jenkins T, Vollrath D, Davis RW, Cavalli-Sforza LL and Oefner PJ. (1997). Detection of numerous Y chromosome biallelic polymorphisms by denaturing high-performance liquid chromatography. Genome Res. 7:996-1005.

Underhill PA, Shen P, Lin AA, Jin L, Passarino G, Yang WH, Kauffman E, Bonné-Tamir B, Bertranpetit J, Francalacci P, Ibrahim M, Jenkins T, Kidd JR, Mehdi SQ, Seielstad MT, Wells RS, Piazza A, Davis RW, Feldman MW, Cavalli-Sforza LL and Oefner PJ. (2000). Y chromosome sequence variation and the history of human populations. Nat Genet. 26:358-361.

Underhill PA, Passarino G, Lin AA, Shen P, Mirazon Lahr M, Foley RA, Oefner PJ and Cavalli-Sforza LL. (2001). The phylogeography of Y chromosome binary haplotypes and the origins of modern human populations. Ann Hum Genet. 65:43–62.

Valdes AM, Slatkin M and Freimer NB. (1993). Allele frequencies at microsatellite loci: the stepwise mutation model revisited. Genetics. 133:737-749.

Verkerk AJMH, Pieretti M, Sutcliffe JS, Fu Y-H, Kuhl DPA, Pizzuti A, Reiner O, Richards S, Victoria MF, Zhang F, Eussen BE, van Ommen G-JB, Blonden LAJ, Riggins GJ, Chastain JL, Kunst CB, Galjaard H, Caskey CT, Nelson DL, Oostra BA and Warren S. (1991). Identification of the gene (FMR-1) containing CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell. 65:905-914.

Walls EV and Crawford DH. (1987). Generation of lymphoblastoid cell lines using Epstein-Barr virus. In: Lymphocytes: A Practical Approach. Ed. Klaus GGB. IRL Press, Oxford. pp 157.

Walter RC, Buffler RT, Bruggemann JH, Guillaume MM, Berhe SM, Negassi B, Libsekal Y, Cheng H, Edwards RL, von Cosel R, Néraudeau D and Gagnon M. (2000). Early human occupation of the Red Sea coast of Eritrea during the last interglacial. Nature. 405:65-69.

Wang DG, Fan J-B, Siao C-J, Berno A, Young P, Sapolsky R, Ghandour G, Perkins N, Winchester E, Spencer J, Kruglyak L, Stein L, Linda H, Topaloglou T, Hubbell E, Robinson E, Mittmann M, Morris MS, Shen N, Kilburn D, Rioux J, Nusbaum C, Rozen S, Hudson TJ, Lipshutz R, Chee M and Lander ES.(1998). Large-Scale Identification, Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome. Science. 280:1077-1082.

Watkins WS, Ricker CE, Bamshad MJ, Carroll ML, Nguyen SV, Batzer MA, Harpending HC, Rogers AR and Jorde LB. (2001). Patterns of ancestral human diversity: an analysis of Alu insertion and restriction-site polymorphisms. Am J Hum Genet. 68:738-752.

Watson JD and Crick FHC.(1953). A Structure for Deoxyribose Nucleic Acid. Nature. 171:737-738.

Weale ME, Yepiskoposyan L, Jager RF, Hovhannisyan N, Khudoyan A, Burbage-Hall O, Bradman N and Thomas MG. (2001). Armenian Y chromosome haplotypes reveal strong regional structure within a single ethno-national group. Hum Genet. 109:659-674.

Webster MT, Smith NG, Ellegren H. (2002). Microsatellite evolution inferred from human-chimpanzee genomic sequence alignments. Proc Natl Acad Sci USA 99:8748-8753.

Wells RS, Yuldasheva N, Ruzibakiev R, Underhill PA, Evseeva I, Blue-Smith J, Jin L, Su B, Pitchappan R, Shanmugalakshmi S, Balakrishnan K, Read M, Pearson NM, Zerjal T, Webster MT, Zholoshvili I, Jamarjashvili E, Gambarov S, Nikbin B, Dostiev A, Aknazarov O, Zalloua P, Tsoy I, Kitaev M, Mirrakhimov M, Chariev A and Bodmer WF. (2001). The Eurasian heartland: a continental perspective on Y-chromosome diversity. Proc Natl Acad Sci USA. 98:10244–10249.

Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA and Rothberg JM. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature. 452:872-876.

Wilson IJ, Balding DJ.(1998). Genealogical inference from microsatellite data. Genetics 150:499–510.

Wolpert S.(2000). A new history of India. Oxford University Press, New York.

Wood ET, Stover DA, Ehret C, Destro-Bisol G, Spedini G, McLeod H, Louie L, Bamshad M, Strassmann BI, Soodyall H and Hammer MF.(2005). Contrasting patterns of Y chromosome and mtDNA variation in Africa: Evidence for sex-biased demographic processes. Eur J Hum Genet 13: 867–876.

Wong Z, Wilson V, Patel I, Povey S, Jeffreys AJ.(1987). Characterization of a panel of highly variable minisatellites cloned from human DNA. Ann Hum Genet. 51(Pt 4):269-288.

Xue Y, Zerjal T, Bao W, Zhu S, Shu Q, Xu J, Du R, Fu S, Li P, Hurles ME, Yang H and Tyler-Smith C. (2006). Male demography in East Asia: a north-south contrast in human population expansion times. Genetics. 172:2431–2439.

Y Chromosome Consortium.(2002). A nomenclature system for the tree of human Y-chromosomal binary haplogroups. Genome Res. 12:339-348.

Yoshino T, Takeyama H and Matsunaga T. (2001). Single nucleotide polymorphism analysis using a bacterial magnetic particle microarray. Electrochemistry. 69:1008-1012.

Youil R, Kemper BW and Cotton RG. (1995). Screening for mutations by enzyme mismatch cleavage with T4 endonuclease VII. Proc Natl Acad Sci USA. 92:87-91.

Zegura SL, Karafet TM, Zhivotovsky LA and Hammer MF.(2004). High-resolution SNPs and microsatellite haplotypes point to a single, recent entry of Native American Y chromosomes into the Americas. Mol Biol Evol. 21:164-175.

Zerjal T, Xue Y, Bertorelle G, Wells RS, Bao W, Zhu S, Qamar R, Ayub Q, Mohyuddin A, Fu S, Li P, Yuldasheva N, Ruzibakiev R, Xu J, Shu Q, Du R, Yang H, Hurles ME, Robinson E, Gerelsaikhan T, Dashnyam B, Mehdi SQ and Tyler-Smith C. (2003). The genetic legacy of the Mongols. Am J Hum Genet. 72:717-721.

Zhang F, Su B, Zhang YP and Jin L. (2007). Genetic studies of human diversity in East Asia. Phil. Trans. R. Soc. B 362: 987–995.

Zhivotovsky LA, Bennett L, Bowcock AM and Feldman MW.(2000). Human population expansion and microsatellite variation. Mol Biol Evol. 17:757-767.

Zhivotovsky L, Underhill P, Cinnioğlu C, Kayser M, Morar B, Kivisild T, Scozzari R, Cruciani F, Destro-Bisol G and Spedini G. (2004). The Effective Mutation Rate at Y Chromosome Short Tandem Repeats, with Application to Human Population-Divergence Time. Am J Hum Genet. 74:50-61.

APPENDIX

Appendix I: List of Y-SNPs analyzed along with their primer sequences and PCR amplification conditions used in this study.

S.No. | Y-SNP | Genotyping method | Primer designation | Primer sequence | Annealing temperature (°C)
1 | Apt | AFLP | TEK E | TGG ATT GCA TTC AAC TTC ACT TAC | 65.5
 | | | TEK G | CTG AGT TCA AAT GCT CGG GTC TC |
2 | LLY22g | AFLP | LLY22gF | CCA CCC AGT TTT ATG CAT TTG | 55
 | | | LLY22gR | ATA GAT GGC GTC TTC ATG AGT |
3 | L1Y | PCR | L1YF | GCA CAA TGT GCA CAT GTA CCC TA |
 | | | L1YR | TGA TGT GTG CAT TCA TCT CAT ATA T |
4 | M6 | DHPLC | M6 F | CAC TAC CAC ATT TCT GGT TGG | 63, 56
 | | | M6 R | CGC TGA GTC CAT TCT TTG AG |
5 | M8 | Sequencing | M8 F | CCC ACC CAC TTC AGT ATG AA | 56
 | | | M8 R | AGG CTG ACA GAC AAG TCC AC |
6 | M9 | AFLP | M9F | GCA GCA TAT AAA ACT TTC AGG | 55
 | | | M9R | AAA ACC TAA CTT TGC TCA AGC |
7 | M11 | AFLP | M11R | TTC ATC ACA AGG AGC ATA AAC AA | 55
 | | | M11F | CCC TCC CTC TCT CCT TGT ATT CTA CC |
8 | M12 | ARMS-PCR | M12 F | ACT AAA ACA CCA TTA GAA ACA AAG G | 57
 | | | M12Nor R | AGC AAC ATA GTG ACC CCC AAC |
 | | | M12Mut R | GCA ACA TAG TGA CCC CCA AA |
9A | M17 | AFLP | M17F | GTG GTT GCT GGT TGT TAC GT | 60
 | | | M17R | AGC TGA CCA CAA ACT GAT GTA GA |
9B | M17 | ARMS | M17FN | TTG CTG GTT GTT ACG GGG | 60
 | | | M17FM | GTTG CTG GTT GTT ACG GGT |
 | | | M17R | GCT ATT CTT GTT TCT CCA GGC |
10 | M20 | AFLP | M20F | GAT TGG GTG TCT TCA GTG CT | 60
 | | | M20R | CAC ACA ACA AGG CAC CAT C | 58
11 | M25 | DHPLC | M25 F | AAA GCG AGA GAT TCA ATC CAG | 63, 56
 | | | M25R | TTT TAG CAA GTT AAG TCA CCA GC |
12 | M27 | ARMS-PCR | M27 F | CGG AAG TCA AAG TTA TAG TTA CTG G | 65
 | | | M27RNL | TAT AGG AAT CGA GGT TCA GGT CAG |
 | | | M27 RMT | TAT AGG AAT CGA GGT TCA GGT CAC |
13 | M31 | DHPLC | M31 F | GAA CC AGA CAA TAC GAA ATA GAA G | 63, 56
 | | | M31 R | TTT AGC GGC TTA TCT CAT TAC C |
14 | M32 | DHPLC | M32 F | TTG AAA AAA TAC AGT GGA AC | 63, 56
 | | | M32 R | CAA GTG TTT AAG GAT ACA GA |
15 | M35 | ARMS-PCR | M35 FN | ATT TTC CTT TGG GAC ACT AG | 58
 | | | M35 FM | ATT TTC CTT TGG GAC ACT AC |
 | | | M35 R | AGA GGG AGC AAT GAG GAC A |
16 | M36 | DHPLC | M36 F | AGA TCA TCC CAA AAC AAT CAT AA | 63, 56
 | | | M36 R | AAG GCT GAA ATC AAT CCA ATC TG |
17 | M38 | Sequencing | M38 F | CAG TTT TTA GAG AAT AAT GTC CT | 63, 56
 | | | M38 R | TTA AAG AAA AGA AAA GCA GAT G |
18 | M45 | DHPLC | M45F | GCT GGC AAG ACA CTT CTG AG | 63, 56
 | | | M45R | AAT ATG TTC CTG ACA CCT TCC |
19 | M48 | ARMS-PCR | M48 FN | TGA CAA TTA GGA TTA AGA ATA TTA TA |
 | | | M48 FM | TGA CAA TTA GGA TTA AGA ATA TTA TG |
 | | | M48R | AAA ATT CCA AGT TTC AGT GTC ACA TA |
20 | M50 | DHPLC | M50 F | CGG CAA CAG TGA GGA CAG T | 63, 56
 | | | M50 R | TGC TTC AGG AGA TAG AGG CTC |
21 | M52 | ARMS-PCR | M52FC | TAT CGG CCT CCT GAG TAC CTG | 60
 | | | M52RG | CAA GAA ACC TAT CAA ACA TCC G |
 | | | M52FM | CAA GAA ACC TAT CAA ACA TCC TC |
22 | M56 | ARMS-PCR | M56R | TCT CAT TGC TGC CTC TCT TTA | 55
 | | | M56FNL | GCA ATG GGA GGA TTA CGA CA |
 | | | M56FMT | GCA ATG GGA GGA TTA CGA CT |
23 | M60 | DHPLC | M60 F | GCA CTG GCG TTC ATC ATC T | 63, 56
 | | | M60 R | ATG TTC ATT ATG GTT CAG GAG G |
24 | M62 | ARMS-PCR | M62 FNL | GGA ATT AAT TAT TTC TCT TTC TCA T | 54
 | | | M62 FMT | GGA ATT AAT TAT TTC TCT TTC TCA C |
 | | | M62 R | TGG TGG CAT GTG CCT GTG TT |
25 | M67 | ARMS-PCR | M67 F | CCA TAT TCT TTA TAC TTT CTA CCT | 55
 | | | M67 RNL | TCG TGG ACC CCT CTA TAC A |
 | | | M67 RMT | TCG TGG ACC CCT CTA TAC T |
26 | M69 | DHPLC | M69 F | GGT TAT CAT AGC CCA CTA TAC TTT G | 63, 56
 | | | M69 R | ATC TTT ATT CCC TTT GTC TTG CT |
27 | M70 | ARMS-PCR | M70 FNL | GGA CTC ATG TCT CCA TGA GTA | 58
 | | | M70 FMT | GGA CTC ATG TCT CCA TGA GTC |
 | | | M70 R | ATC TTT ATT CCC TTT GTC TTG CT |
28 | M73 | DHPLC | M73 F | CAG AAT AAT AGG AGA ATT TTT GGT | 63, 56
 | | | M73 R | ATT TTC CTT ATT TTC TAA GCA GC |
29 | M74 | DHPLC | M174 F | ATG CTA TAA TAA CTA GGT GTT GAA G | 63, 56
 | | | M174 R | AAT TCA GCT TTT ACC ACT TCT GAA |
30 | M76 | DHPLC | M76 F | TAG AAG TAG CAG ATT GGG AGA GG | 63, 56
 | | | M76 R | CCT GAT AAA ATG AAA AAA ATG GTC |
31 | M78 | ARMS-PCR | M78 F | TGG TTC TCC ACT ACA GGA GA | 61
 | | | M78 RN | ATT TTG AAA TAT TTG GAA GGG TG |
 | | | M78RM | TAT TTT GAA ATA TTT GGA AGG GTA |
32 | M82 | DHPLC | M82 F | CTG TAC TCC TGG GTA GCC TGT | 63, 56
 | | | M82 R | AAG AAC GAT TGA ACA CAC TAA CTC |
33 | M87 | DHPLC | M87 F | TCC CAT TAT TTG CTA TAT TTG CT | 55
 | | | M87 RNL | AAC AAG CTG GCA TCA GAA TAT AA |
 | | | M87RMT | CAA GCT GGC ATC AGA ATA TAG |
34 | M88 | Sequencing | M88 F | ATT CTA GGG TCA GGC AAC TAG G | 63, 56
 | | | M88 R | TGT TTG TTC TAT TCT ATG GTC TTC C |
35 | M89 | ARMS-PCR | M89 F | AGA AGC AGA TTG ATG TCC CAC T | 62
 | | | M89 RNL | AAC TCA GGC AAA GTG AGA GAA G |
 | | | M89 RMT | AAC TCA GGC AAA GTG AGA GAA A |
36 | M91 | DHPLC | M91F | GAG CTT GGA CTT TAG GAC GG | 63, 56
 | | | M91R | AAA CTT TAA GGC ACT TCT GGC |
37 | M92 | ARMS-PCR | M92 F | GGC CTT ATA AGA TTG GCA TAC | 62
 | | | M92 RNL | CTA AAT ACT GTT GGA GCC TAT A |
 | | | M92 RMT | CTA AAT ACT GTT GGA GCC TAT G |
38 | M97 | DHPLC | M97 F | GTT GCC CTC TCA CAG AGC AC | 63, 56
 | | | M97R | AAG GTC ACT GGA AGG ATT GC |
39 | M101 | DHPLC | M101 F | TCA CAG CAG CTT CAG CAA A | 63, 56
 | | | M101 R | ATA AAA ATT AGA CTC TGT GTT ACT AGC |
40 | M103 | DHPLC | M103 F | CAG TAA GTG AAC TCA CAC ATA ATT CC | 63, 56
 | | | M103 R | CCA GTT TTA TTT CAG TTT CAC AGC |
41 | M109 | DHPLC | M109 F | GGG TAT CAA AAT GTC TTC AAC CT | 63, 56
 | | | M109 R | GGG AAT TTC CTG CTA CTT GC |
42 | M110 | Sequencing | M110F | CAG GGA AGG ACC GTA AAA GG | 63, 56
 | | | M110 R | ATG TTT ATC ATG TGC AGT AAA GGT T |
43 | M111 | Sequencing | M111 F | AAT CTT CTG CAA AGG GTT CC | 63, 56
 | | | M111 R | CAG CTA CAA AAC AAA ATA CTG GAC |
44 | M117 | DHPLC | M117 F | AAG TAT GAC TTA TGA AGT ACG AAG AAA | 63, 56
 | | | M117 R | ATT CAG TTA GAT TTT ACA ATG AGC A |
45 | M119 | DHPLC | M119 F | GAA TGC TTA TGA ATT TCC CAG A | 63, 56
 | | | M119 R | TTC ACA CAA TAT ACA AGA TGT ATT CTT |
46 | M122 | ARMS-PCR | M122FN | AAT TGA GAT ACT AAT TCA T | 50
 | | | M122FM | AAT TGA GAT ACT AAT TCA C |
 | | | M122R | AAA ACT TTA TCA TAT TGA G |
47 | M123 | ARMS-PCR | M123 F | CAG CGA ATT AGA TTT TCT TGC | 58
 | | | M123RN | GTA TCT GAA CTA GCA TAT CTG |
 | | | M123RM | AGT ATC TGA ACT AGC ATA TCT A |
48 | M124 | ARMS-PCR | M124 F | TGC CTT TTG GAA ATG AAT AAA TC | 60
 | | | M124 N | ACA AAC TCA GTA TTA TTA AAC CG |
 | | | M124 R | ACA AAC TCA GTA TTA TTA AAC CA |
49 | M133 | DHPLC | M133 F | TGA AAT GGA AAT CAA TAA ACT CAG T | 63, 56
 | | | M133 R | CCT TTT CTT TTT CTT TAA CCC TTC |
50 | M134 | DHPLC | M134 F | AGA ATC ATC AAA CCC AGA AGG | 63, 56
 | | | M134 R | TCT TTG GCT TCT CTT TGA ACA G |
51 | M136 | DHPLC | M136 F | ATG TGA AGA CAA CAC TGT GTG G | 63, 56
 | | | M136 R | TTG TGG TAG TCT TAG TTC TCA TGG |
52 | M143 | DHPLC | M143 F | ATG CTA TAA TAA CTA GGT GTT GAA G | 63, 56
 | | | M143 R | AAT TCA GCT TTT ACC ACT TCT GAA |
53 | M147 | Sequencing | M147 F | GTA TTC TGG GGC AAT TTT AGG | 94-63-56-72, 94-56-72
 | | | M147 R | TTG ATA CAA GAG GTT ATT TTA AGC A | 0.5 °C dec/cycle
54 | M148 | DHPLC | M148 F | AAC AGA ATT ATC AGG AAA AGG TTT | 63, 56
 | | | M148 R | TTT TAC TTG TTC GTG TAC TTT CAA |
55 | M150 | DHPLC | M150 F | GCA GTG GAG ATG AAG TGAG AC | 63, 56
 | | | M150 R | CCT ACT TTC CCC CTC TTC TG |
56 | M152 | DHPLC | M152 F | AAG CTA TTT TGG TTT CTT TCA | 63, 56
 | | | M152 R | GCC TTG TGT GGG TAT GAT TG |
57 | M157 | DHPLC | M157F | GCT GGC AAG ACA CTT CTG A | 55
 | | | M157RNL | ACC AAA GGT CAT TTG TGG AT |
 | | | M157RMT | CCA AAG GTC ATT TGT GGA G |
58 | M170 | ARMS-PCR | M170 N | TAT TTA CTT AAA AAT CAT TGT TCA | 56
 | | | M170FCmutant | TAT TTA CTT AAA AAT CAT TGT TCC |
 | | | M170 Rnormal | CTT TTT TCA GTT CTT CAT CAG TTA |
59 | M172 | ARMS-PCR | M172 FNL | CCC AAA CCC ATT TTG ATG CTA T | 61
 | | | M172 FMT | CCC AAA CCC ATT TTG ATG CTA G |
 | | | M172 R | TCA CAG TGG ATC CAT CTT CAC T |
60 | M173 | ARMS-PCR | M173 N | AAT TCA AGG GCA TTT AGA ACA |
 | | | M173 FC | AAT TCA AGG GCA TTT AGA ACC | 56
 | | | M173R | TAT CTG GCA TCC GTT AGA AAA G | 55
61 | M175 | Sequencing | M175 F | TTG AGC AAG AAA AAT AGT ACC CA | 94-63-56-72, 94-56-72
 | | | M175 R | CTC CAT TCT TAA CTA TCT CAG GGA | 0.5 °C dec/cycle
62 | M177 | Sequencing | M177 F | TTT AAC ATT GAC AGG ACC AG | 94-63-56-72, 94-56-72
 | | | M177 R | GTG TTG GTT CTC CTG TAA AG | 0.5 °C dec/cycle
63 | M185 | DHPLC | M185 F | GGA GTA CCT ATC ACT GAA TGT GC | 63, 56
 | | | M185 R | GTC ATT CAT TTC TGC TTG GAA C |
64 | M193 | DHPLC | M193 F | GCC TGG ATG AGG AAG TGA G | 63, 56
 | | | M193 R | GCC TTC TCC ATT TTT GAC CT |
65 | M201 | ARMS-PCR | M201 FN | AAT AAT CCA GTA TCA ACT GAG AG | 56
 | | | M201 FM | TAA TAA TCC AGT ATC AAC TGA GAT |
 | | | M201 R | GTT CTG AAT GAA AGT TCA AAC GT |
66 | M207 | ARMS-PCR | M207 FN | TAA GTC AAG CAA GAA ATT TTA | 56
 | | | M207 FD | TAA GTC AAG CAA GAA ATT TTG | 52
 | | | M207 R | CAA AAT TCA CCA AGA ATC CTT G |
67 | M214 | ARMS-PCR | M214 F | CAA GCG TAG AGG TAT TAC TAC AA | 66
 | | | M214RNL | TGA GAC ACT GTC TGA AAA CAA TA |
 | | | M214 RMT | TGA GAC ACT GTC TGA AAA CAA TG |
68 | M217 | Sequencing | M217 F | GCT TAT TTT TAG TCT CTC TTC CAT | 63, 56
 | | | M217 R | ACC TGT TGA ATG TTA CAT TTC TTT |
69 | M218 | DHPLC | M218 F | TTG TGA GTT TTT TTC CAT CAA TC | 63, 56
 | | | M218 R | TTT ATT GAC GAT GGT ATT AGA AGA G |
70 | M231 | DHPLC | M231F | CCT ATT ATC CTG GAA AAT GTG G | 63, 56
 | | | M231R | ATT CCG ATT CCT AGT CAC TTG G |
71 | M242 | ARMS-PCR | M242 F | AAC TCT TGA TAA ACC GTG CTG | 61
 | | | M242 RNL | CAC GTT AAG ACC AAT GCC ATG |
 | | | M242 RMT | CAC GTT AAG ACC AAT GCC ATA |
72 | M267 | ARMS-PCR | M267 F | TTA TCC TGA GCC GTT GTC C |
 | | | M267 RNL | CCA CAC AAA ATA CTG AAC GAT | 62
 | | | M267 RMT | CCA CAC AAA ATA CTG AAC GAC | 58
73 | M317 | DHPLC | M317 F | TGG TTC TAC AGT TGG GAT TTT G | 63, 56
 | | | M317 R | CCT TAA TAA CCG AGG CAC AA |
74 | M343 | ARMS | M343 F | TTT AAC CTC CTC CAG CTC TG |
 | | | M343RNL | CCA CAT ATC TCC AGG TCT AG |
 | | | M343RMT | CCA CAT ATC TCC AGG TCT AT |
75 | M349 | ARMS | M349 F | TGG GAT TAA AGG TGC TCA TG | 58
 | | | M349RN | CCT AAG GTC AGA AAG TTT TAA C |
 | | | M349 RM | CCT AAG GTC AGA AAG TTT TAA A |
76 | M357 | DHPLC | M357 F | CCC CGT TTT TTC CTC TCT GCC | 63, 56
 | | | M357 R | CAC GTA ACC TGG GAT GGT CAT A |
77 | P15 | DHPLC | P15F | AGA GAG TTT TCT AAC AGG GCG | 63, 56
 | | | P15R | TGG GAA TCA CTT TTG CAA CT |
78 | P31 | Sequencing | P31 F | TAA GGC TGC GTG TTC CCT AT | 63, 56
 | | | P31 R | GCA CTG TCA CTG TGG ATG TT |
79 | PK1 | AFLP | PK1 F | TCA ACT TTC TTA AAT GAT TGT ACG TT |
 | | | PK1 R | TCT GTT CAG GAG AAC CTC TAT GG |
80 | PK2 | ARMS-PCR | PK2 F | TGT GTC CTG GTG TCT TTT GG | 67
 | | | PK2 RN | GGT GTA CAA AAT AGT TTT TGT TTT TGA TCT AA |
 | | | PK2 RM | GGT GTA CAA AAT AGT TTT TGT TTTT GAT CTC G |
81 | PK3 | ARMS-PCR | PK3 F | TGT GTC CTG GTG TCT TTT GG | 68
 | | | PK3 N | AAA GCC ACC ATC TCA AGA TGG TGT ACT A |
 | | | PK3 M | AAA GCC ACC ATC TCA AGA TGG TGT ACT G |
82 | PK4 | DHPLC | PK4 F | CCA TCC TCC CAT GGC TAG T | 63, 56
 | | | PK4 R | GCT TCC AAG GTG CCC TTT AT |
83 | PK5 | AFLP | PK5 F | TTC CAA ACA CAT GCT TCT GC | 58.5
 | | | PK5 R | TAA AAA GGA GGA GGG ACT GC |
84 | RPS4Y | AFLP | RPS4Y L | CCA CAG AGA TGG TGT GGG TA | 61
 | | | RPS4Y R | GAG TGG GAG GGA CTG TGA GA |
85 | SRY+465 | AFLP | SRY13 | GCC GAA GAA TTG CAG TTT | 58
 | | | SRY14 | GTT GAT GGG CGG TAA GTG GC |
86 | SRY1532 | AFLP | SRY1 | TCC TTA GCA ACC ATT AAT CTG G | 60
 | | | SRY2 | AAA TAG CAA AAA ATG ACA CAA GGC |
87 | SRY2627 | AFLP | SRY-2627 F | CGC GGC TTT GAA TTT CAA GCT CTG | 63
 | | | SRY-2627 R | TAA GAG TCC CTC GGG GCC CTG G |
88 | SRY8299 | AFLP | SRY8299 R | ACA GCA CAT TAG CTG GTA TGA C |
 | | | SRY8299 F | TCT CTT TAT GGC AAG ACT TAC G |
89 | sY81 | AFLP | SY810.1 | AGG CAC TGG TCA GAA TGA AG | 56
 | | | SY810.2 | AAT GGA AAA TAC AGC TCC CC |
90 | TAT | AFLP | TAT 1 | GAC TCT GAG TGT AGA CTT GTG A | 60
 | | | TAT 3 | GAA GGT GCC GTA AAA GTG TGA A |
91 | YAP | PCR | YAP 1 | CAG GGG AAG ATA AAG AAA TA | 59
 | | | YAP 2 | ACT GCT AAA AGG GGA TGG AT |
92 | 12f2 | PCR | 12F2 F | TCT TCT AGA ATT TCT TCA CAG AAT TG | 59
 | | | 12F2 D | CTG ACT GAT CAA AAT GCT TAC AGA TC |
93 | 92R7 | AFLP | 92R7 L | GCC TAT CTA CTT CAG TGA TTT CT | 62
 | | | 92R7 L (R) | GAC CCG CTG TAG ACC TGA CT |
 | | | 92R7 A | TGC ATG AAC ACA AAA GAC GTA | 65
 | | | 92R7 B | GCA TTG TTA AAT ATG ACC AGC |


[Figure: Y-chromosome haplogroup tree showing the defining markers, the major clades A to R and their sub-clades. Geographic labels: Asia & America, Japan, India, New Guinea, Europe, Africa, Africa & Middle East, Eurasia, North Europe, Mediterranean & Levant, Asia, Indus Valley, Australasia, America, Eurasia. Group numerals: I, II, V, IV, III, VI, VIII, VII, X, IX.]