Peter Gwozdz
Gwozdz: Y-STR mountains in haplospace, Part II: Application to common Polish clades

Address for correspondence: Peter Gwozdz, [email protected]
Received: June 5, 2009; accepted: September 13, 2009

The first article in this two-part series developed a set of tools for the analysis of Y-STR data from groups of related people (Gwozdz, 2009). The present article applies these tools to Polish Y-STR data. A number of specially defined terms are used in this article and they are defined in the companion article (Part I).

Pawlowski (2002) identified a Y-STR haplotype that is 15 times more common in Poland than in the rest of Europe. That haplotype, determined at nine STR markers, represents 4.6% of Pawlowski’s Polish data. This haplotype, called “P type” here, is evidence of a dominant Y-DNA clade in a relatively recent Polish population expansion. One purpose of this article is to add evidence, and to estimate the size and age of the hypothesized P type clade.

Roewer (2005) analyzed European Y-STR data, showing it can supplement single nucleotide polymorphism (SNP) markers in the study of population structure within historical times. Pawlowski’s data source is a subset of Roewer’s source, Yhrd, a European forensic compilation. Another purpose of this article is to identify common Polish STR haplotypes, as modal haplotype candidates for hypothetical clades, with estimated size and age. Other than brief comments in discussion, comparison to the historical record is beyond the scope of this article.

I use the conventional notation “Haplogroup R1a1” for generally accepted SNP haplogroup clades, the notation “P type” for types that are introduced here, and “Pc type” for subtypes. A “type” is as defined in Part I of this two-part series of articles. Most of the types are presented as hypothetical subdivisions of Haplogroup R1a1 (R-M17 or R-M198), so the stem haplogroup is R1a1 if not specified. In this article I retain the R1a1 designation for Haplogroup R-M17 in order to be compatible with the reference data. It is the same haplogroup currently designated as R1a1a (ISOGG, 2009). The conventional asterisk (*) as used in the construction “R1a*” indicates those downstream subsets of R1a that are not in any presently defined sub-haplogroups of R1a. “R1a” without an asterisk means all of R1a. For example, R1a includes R1a1; R1a* does not include R1a1.

About half of Polish Y-DNA is in Haplogroup R1a (Wiik, 2008; McDonald, 2005).

“Polish” is difficult to define exactly. Biskupski (2006) published a delightful essay regarding the ambiguity of the word “Polish.” In this article “Polish” simply refers to people either born in historical Poland or who have patrilineal lines that go back to historical Poland.

Page 156 of the companion article provides a concise two-page statement of the method of this article, with terms defined in boldface.

The Y-STR data for this article are available from three Internet sources: Ysearch, Yhrd (Willuweit, 2007), and Family Tree DNA (FTDNA). FTDNA has a number of “projects” that focus on surnames, haplogroups, and geographic areas. The primary source for this article is the Polish Project. See Web Resources at the end of this article for URLs. There is incidental use of data from other projects; those data are available by substituting the quoted project name for /polish/ in the public address for the Polish Project.

Polish Project data follow the FTDNA standards and marker order, with progressive STR marker sets available, accumulating 12, 25, 37, and 67 markers, plus a few additional rare compound markers, and minus rare missing markers. Data from all available STR markers are used here. The Polish Project has at least the standard 12 markers for all samples, so those 12 markers provide the best statistics, although at lowest resolution.

The Ysearch database consists primarily of Y-STR haplotype data from individual men (herein, “samples”), but also includes some modal haplotypes, fictitious entries for research purposes. The latter records were removed from the data before analysis.

FTDNA projects include family samples. These are edited, as explained in the companion article. Family sets were identified by noting identical family names with identical 67-marker haplotypes, or similar haplotypes with sequential ID codes (or nearly sequential). All but one of each family set were removed, leaving a representative with the most markers. For the Polish Project, where family sets are uncommon, only 13 of 913 (1.4%) data entries were removed. Only two of the R1a1 haplotypes were removed. Those 13 removals are from eight family pairs and one family set of six samples, of which five were removed. That set of six, which had identical haplotypes at 12 markers (four were identical at 25 markers), is an example of how large unedited family sets can distort haplotype frequency data. Additional sets of very similar haplotypes without identical names (zero genetic distance steps at 25 markers, or less than two steps at 37, or less than four steps at 67), 61 samples (6.8%), were not removed. Insofar as this editing may not remove some family sets without family names, the data are biased slightly toward larger clusters. Insofar as this editing may remove men who actually submitted data independently, the data are biased slightly toward smaller clusters.

Ysearch also has family sets, but Ysearch ID codes are not assigned in sequence, so editing was restricted to identical family names with 67-marker haplotypes at zero or one step separation. Also, Ysearch results were sorted by family name and visually scanned for suspicious sets of samples, for which the detailed data were checked by ID code for “contact person,” where the same contact person is assumed to be the collector of a family set; only one such set was identified.

See the companion article, “Y-STR Mountains in Haplospace, Part I: Methods,” for the “mountain” method. The word “type” means “mountain” in these articles. Briefly, a “type” is defined using an STR cluster for which a step frequency graph looks like a mountain. A “signature” is the modal haplotype using markers that best distinguish from other types in the parent haplogroup. A “definition” includes the modal haplotype and “cutoff” that produces the best mountain. The cutoff is the next step, just beyond the mountain. The “background” of samples that may not belong to the type for statistical reasons is estimated based on the samples in the “gap” of steps with low sample frequency just beyond the mountain. The statistical background percent (SBP), a recommended measure of type quality, is the percent of samples in a mountain that may not belong to the type for statistical reasons, at 70% confidence maximum, including objective adjustment for systematic statistical errors. A summary of the mountain method is at the end of the companion article.

A small SBP means an isolated type, taken as evidence, albeit not proof, that the type represents a clade.

Concentration of a type in Poland is taken as evidence, albeit not proof, that the type represents a clade.

The Polish Project is used as representative of Poland. Yhrd is used to verify and update Pawlowski’s most common haplotype, here called P type. The Ysearch P type fraction of entries with “Poland” as origin is used as a calibration of the Polish fraction on Ysearch; other types with similar Polish fraction in Ysearch are taken to be similarly Polish.

Age is estimated using Averaged Squared Distance (ASD) in STR data, as explained in the companion article. Chandler (2006) mutation rates are used to calculate age from ASD. As explained in the companion article, I distinguish two kinds of age: TMRCA vs. time of hypothetical population expansion.

Statistical methods are described in the companion article.

The FTDNA Polish Project data were downloaded 15 May 2009. An archive copy of the data is available as “PolishProject15May2009.xls.” The total of 913 samples was edited to 900 as explained above. Editing and matching notes are in “Evaluator15May2009.xls” and again in “Results15May2009.xls.” All Polish Project samples have values for at least the first 12 markers. The accompanying table gives the ten most common 12-marker haplotypes, which are all the haplotypes with more than six samples. Also in the table are four of the eleven haplotypes with six or five samples (discussed below). The file “Polish12.xls” has all 614 haplotypes at 12 markers in a similar format.

P type is the most common haplotype at 12 markers in Poland. The 12-marker haplotypes in the table are an introduction; P, K, N, Y, etc. types are defined using the best ranked of all 67 markers, with more discussion below. My “Code Type” labels in the table are assigned with hypothetical foresight, anticipating the 67-marker results presented later in this article. A common haplotype, even at only 12 markers, provides a candidate modal haplotype for a type, as explained in the companion article and as demonstrated below. (“Type” means: hypothetical clade / modal haplotype / set of possible haplotypes / associated cluster of data.) Most of the haplotypes in the table are such candidates, but the statistical evidence at 12 markers is insufficient by itself to confidently conclude that any of the table entries is a modal haplotype for a type. Even at 67 markers some of them are highly hypothetical, as discussed below. The fourth most common haplotype in the table, eleven samples, which I call PbKa, could also be a candidate type, but I have found no evidence for it.
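The two counting steps described above, ranking the most common 12-marker haplotypes and tallying samples by step distance from a candidate modal haplotype, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spreadsheet procedure used for this article: the function names are hypothetical, the toy data use four made-up marker values per sample, and the step metric (sum of absolute repeat-count differences) is an assumed simplification; the companion article defines how steps are actually counted.

```python
from collections import Counter


def most_common_haplotypes(samples, n_markers=12, top=10):
    """Rank haplotypes after truncating each sample to its first n_markers values."""
    counts = Counter(tuple(s[:n_markers]) for s in samples)
    return counts.most_common(top)


def step_distance(haplotype, modal):
    """Assumed step metric: sum of absolute repeat-count differences per marker."""
    return sum(abs(a - b) for a, b in zip(haplotype, modal))


def step_frequency(samples, modal):
    """Count samples at each step distance from the candidate modal haplotype."""
    n = len(modal)
    return Counter(step_distance(s[:n], modal) for s in samples)


# Made-up toy data (four markers per sample); not real Polish Project values.
toy = [
    (13, 25, 16, 10),
    (13, 25, 16, 10),
    (13, 25, 16, 11),
    (13, 24, 15, 10),
    (14, 23, 14, 12),
]

ranked = most_common_haplotypes(toy, n_markers=4, top=2)
modal = ranked[0][0]              # most common haplotype, used as candidate modal
freq = step_frequency(toy, modal)  # step -> number of samples at that step

for step in sorted(freq):
    print(step, freq[step])
```

In a plot of `freq`, a cluster of samples at low step counts followed by a run of sparsely populated steps would correspond to the “mountain” and “gap” described above; the toy data are far too small to show that shape meaningfully.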