Gwozdz: Y-STR mountains in haplospace, Part 1: Methods
Peter Gwozdz
Address for correspondence: Peter Gwozdz, [email protected]
Received: June 5, 2009; accepted: September 13, 2009

Short Tandem Repeat (STR) data for Y chromosomes are available on the web, for example at Ysearch (www.ysearch.org), Yhrd (www.yhrd.org; Willuweit 2007), and Family Tree DNA (FTDNA) (www.familytreedna.com). The methods introduced by this article are generally applicable to Y-STR data; the tools in the […] follow the FTDNA format of 12, 25, 37, and 67 standard marker sets, plus a few additional rare compound markers, and minus rare missing markers.

The […] on page 156 provides a concise two-page statement of the method of the article, where terms in boldface are defined specifically for this method. Where these terms appear in the text, they are usually shown in italics.

"Haplotype" has many meanings. In this article a haplotype is a set of Y-STR values for a specified set of markers. Another meaning of "haplotype" is a "sample": the set of STR values from a particular man (plural: samples, data, or database). For clarity, I avoid that latter meaning, using only the former meaning of "haplotype." A haplotype may have no corresponding sample in a particular database, for example when it has been artificially constructed, as in a modal haplotype. Markers are also called loci (singular: locus).

"Clusters" are sets of Y-DNA samples classified by STR marker value. This definition is imprecise because the word "clusters" is used loosely in the literature and on the web. Some uses of "clusters" do not fit even this imprecise definition. Y-STR clusters are valuable for genetic genealogy because the Y chromosome does not recombine. Clusters are generally provided as hypothetical subdivisions of accepted clades, where accepted clades are based on Single Nucleotide Polymorphism (SNP) markers or other Unique Event Polymorphism (UEP) markers. STR clusters provide clues on where to look in the search for SNP subdivision markers. Clusters are published mostly via web sites; there are scores of web pages that provide cluster classifications. Ysearch has some "modal" STR haplotypes for clusters. FTDNA has a large number of links to "projects," many of which offer proposed cluster classifications.

There does not seem to be a single preferred method for the initial selection of clusters as candidate clades, and indeed I have no method to offer other than the obvious methods: sorting STR data in a search for correlated STR marker values; searching for very common haplotypes (many samples in a database) that are relatively isolated (few neighbors, that is, fewer samples at haplotypes one mutation step away in genetic distance); searching for unusual values of a slowly mutating STR marker; using a network-joining computer program in a search for long, isolated branches. Cluster discovery is as much an art as an objective method; many cluster classifications on the web seem to spring from the experienced intuition of the author.

We presume that proposed Y-STR clusters will someday be confirmed or refuted as clades, with the discovery of SNP markers, taken to be the gold standard for identification of clades, called haplogroups. Haplogroups can be treated as clusters insofar as an STR haplotype can be assigned to a haplogroup with high probability, albeit not with certainty; see for example Athey (2005, 2006). The haplogroup of a cluster is called the stem haplogroup, or parent haplogroup. STR data are rapidly accumulating on the web, so hypothetical cluster subdivisions will no doubt "stay ahead" of SNP haplogroup division for the immediate future.

Cruciani (2004) is a classic example of cluster analysis, with subsequent SNP validation of three of four clusters by Cruciani (2006). The latter article is titled as an evaluation of a network approach because a network-joining analysis of the former article is largely verified. However, it is not clear that the former clusters (defined by STR values) were initially chosen on the basis of the network, as opposed to being chosen by STR values and validated by the network analysis and subsequent SNPs. It is not completely clear how one may objectively select some cluster candidates based on network branches but reject others. The three 2004 clusters that were validated in 2006 were concentrated in three different geographic areas. The fourth 2004 cluster was not geographically concentrated and was shown in 2006 to be composed of a mix of SNP clades and therefore not a valid clade. Also, the four 2004 clusters are the four largest clusters in the networks; smaller clusters were not proposed as hypothetical clades, so statistical sampling confidence is implicit. Cruciani (2006) is really a validation of the consideration of multiple lines of evidence for cluster analysis, in this case network-joining analysis, geographic concentration, STR correlation, and cluster size.

The "modal haplotype" for a cluster is the set of most common STR values for a cluster, at any specified set of markers.

I use the word "mountain" to mean: (1) a hypothetical Y-DNA clade, proposed as a subdivision of an accepted SNP-defined haplogroup; and (2) a proposed modal haplotype for that clade at any specified set of STR markers; and (3) a set of haplotypes including the modal haplotype and all those STR haplotypes that differ slightly from the modal haplotype, as further defined below ([…] in Haplospace); and (4) a cluster of Y-DNA samples, from a specified database, matching any of the set of haplotypes (at the same specified markers).

All mountains correspond to clusters, but not all clusters (as the word is used in the literature and on the web) are mountains, because of the restriction (3) to […]s, as explained below.

Methods for identification of STR cluster candidates are mentioned only briefly here. This article concentrates on methods to validate the quality of particular "[…]" of cluster candidates with objective formal evidence. Validation in this article means evidence by statistical assessment. Statistical isolation of a mountain is considered evidence (not proof) that the mountain is a clade. Comparing mountains, those with relatively stronger isolation evidence are considered relatively more likely to represent clades.

There are other methods of cluster assessment, but there does not seem to be an accepted method of assessing clusters found by various methods. For example, a cluster based on a single rare STR mutation may correspond to a valid young clade, because for a very young clade a mutation in a slowly mutating marker is almost as good as an SNP, although such a cluster may score poorly in my SBP method and may be missed by a network-joining program. Network-joining programs offer assessment as branch length, indicating genetic distance, but such assessment cannot be applied to judge the merit of a cluster identified by a different program or by another means. Also, most network-joining programs connect all samples into branches (clusters and subclusters), so adding just one new sample may rearrange quite a few branches. This is not unique to network-joining programs; any cluster analysis is sensitive to the statistical uncertainty associated with sampling. Statistical analysis is required to judge how robust a cluster is to the vagaries of sample collection. This article presents the SBP method for statistical assessment, not to replace but to add to existing methods of cluster assessment, with the caveat that the SBP method applies to most but not all clusters.

In the published literature I could find no formal article mentioning isolation assessment by consideration of a […] ("[…]" defined below) in the distribution of genetic distance. Perhaps this idea is briefly mentioned in web discussions; for example Mayka (2007) has been brought to my attention.

It is understood that STR clusters (and mountains) are statistical, so even with confirmation by an SNP marker, at least a low percentage of men who closely match an STR cluster will turn out not to belong to the corresponding clade; these are called foreigners in this article, and a method is proposed to estimate the […]. The […] (SBP) is proposed as a high estimate of that […].

A relatively small SBP represents a relatively isolated mountain, which is considered to be relatively strong evidence that the mountain corresponds to a clade. This article discusses reasons other than simple statistics why it is not possible to calculate the exact probability that a mountain corresponds to a clade. So, for example, a 5% SBP does not mean 95% probability that a hypothetical mountain corresponds to a clade.

However, a relatively high SBP can eliminate a cluster from serious consideration: if a proposed STR cluster has SBP greater than 50% (not an isolated mountain, or not enough data for statistical significance), that means the […] (foreigners that do not belong to the hypothetical clade) is significant. The true […] is probably less than the SBP, because SBP is defined here as an objective high estimate of the […].

It is also understood that most young Y-DNA clades cannot be represented as valid STR clusters. The distribution of STR values tends to be continuous. Particularly with population growth, unique combinations of STR values are statistically unlikely. Very old clades may become STR clusters due to the statistics of a long time without population growth, but the oldest clades are already identified as haplogroups. Young STR clusters can appear due to founder effects, such as population bottlenecks or migration. A list of hypothetical clades based on STR mountains, as subdivisions of a current haplogroup, provides only those clades with particularly strong founder effects.

Averaged Squared Distance (ASD) in STR data is used to estimate the age of haplogroups or clusters.
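Three quantities defined in the text above (the modal haplotype, genetic distance counted in mutation steps, and ASD) can be illustrated with a minimal Python sketch. This is not the article's tool, and it rests on several assumptions: the four samples and their values are invented for illustration; the function names are hypothetical; genetic distance is taken as the sum of absolute differences in repeat count across markers; and ASD is taken relative to the modal haplotype as a stand-in for the ancestral haplotype.

```python
from collections import Counter


def modal_haplotype(samples):
    """Most common value at each marker (ties broken toward the smaller value)."""
    modal = {}
    for marker in samples[0]:
        counts = Counter(s[marker] for s in samples)
        top = max(counts.values())
        modal[marker] = min(v for v, c in counts.items() if c == top)
    return modal


def step_distance(a, b):
    """Genetic distance in mutation steps: sum of absolute repeat differences.

    Assumption: a simple stepwise count, one step per repeat unit.
    """
    return sum(abs(a[m] - b[m]) for m in a)


def asd(samples, reference):
    """Averaged squared distance of the samples from a reference haplotype,
    averaged over both samples and markers."""
    total = sum((s[m] - reference[m]) ** 2 for s in samples for m in reference)
    return total / (len(samples) * len(reference))


# Invented illustrative data: four samples at three markers.
samples = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    {"DYS393": 13, "DYS390": 24, "DYS19": 15},
    {"DYS393": 12, "DYS390": 24, "DYS19": 14},
]

modal = modal_haplotype(samples)
distances = [step_distance(s, modal) for s in samples]
print(modal)                # {'DYS393': 13, 'DYS390': 24, 'DYS19': 14}
print(distances)            # [0, 1, 1, 1]
print(asd(samples, modal))  # 0.25
```

In this toy data the first sample sits exactly at the modal haplotype and the other three are one step away. Converting ASD to an age would additionally require a mutation rate, which this sketch does not assume.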